Skip to main content
PLOS Computational Biology logoLink to PLOS Computational Biology
. 2014 Aug 14;10(8):e1003734. doi: 10.1371/journal.pcbi.1003734

Epigenetic Landscapes Explain Partially Reprogrammed Cells and Identify Key Reprogramming Genes

Alex H Lang 1,2,*, Hu Li 3,4,¤, James J Collins 3,4,5,6, Pankaj Mehta 1,2,*
Editor: Alexandre V Morozov7
PMCID: PMC4133049  PMID: 25122086

Abstract

A common metaphor for describing development is a rugged “epigenetic landscape” where cell fates are represented as attracting valleys resulting from a complex regulatory network. Here, we introduce a framework for explicitly constructing epigenetic landscapes that combines genomic data with techniques from spin-glass physics. Each cell fate is a dynamic attractor, yet cells can change fate in response to external signals. Our model suggests that partially reprogrammed cells are a natural consequence of high-dimensional landscapes, and predicts that partially reprogrammed cells should be hybrids that co-express genes from multiple cell fates. We verify this prediction by reanalyzing existing datasets. Our model reproduces known reprogramming protocols and identifies candidate transcription factors for reprogramming to novel cell fates, suggesting epigenetic landscapes are a powerful paradigm for understanding cellular identity.

Author Summary

Traditionally, standard development has been viewed as a one-way process; an organism starts as a single cell (embryonic stem cell, ESC) that divides into a multitude of mature cell types (skin cells, heart, liver, etc). But, in 2006 Takahashi and Yamanaka revolutionized this view by stochastically converting skin cells into cell types resembling ESC (called induced pluripotent stem cells, iPSC). Following this groundbreaking experiment, other reprogramming protocols have been found, so now scientists can switch between a variety of cell types such as ESC, skin, liver, neurons, and heart. This has already revolutionized the understanding of biology and could change the future of medicine. A common metaphor for development is Waddington's landscape, in which an ESC is like a ball rolling down a hill which eventually ends in a valley (mature cell type). In this paper, we make this analogy more precise by developing a mathematical model of cellular development. Using data on real cell types, we can provide insight into existing reprogramming protocols and potentially predict new reprogramming protocols.

Introduction

Understanding the molecular basis of cellular identity and differentiation is a major goal of modern biology. This is especially true in light of the work of Takahashi and Yamanaka demonstrating that the overexpression of just four transcription factors (TFs) is sufficient to convert somatic fibroblasts into cells resembling embryonic stem cells (ESCs), dubbed induced pluripotent stem cells (iPSCs) [1]. The idea of using a small set of TFs to reprogram cell fate has proven to be extremely versatile and reprogramming protocols now exist for generating neurons [2], cardiomyocytes [3], liver cells [4], [5], neural progenitor cells (NPC) [6], and thyroid [7] (see reviews [8], [9] for more details). Despite these revolutionary experimental advances, cell fate is still poorly understood mechanistically and theoretically. Recent experiments suggest cell fates can be viewed as high-dimensional attractor states of the gene regulatory networks underlying cellular identity [10]. In particular, cell fates are characterized by a robust gene expression and epigenetic state resulting from the complex interplay of transcriptional regulation, chromatin regulators, non-coding and microRNAs, and signal transduction pathways.

These experiments have renewed interest in the idea of an ‘epigenetic landscape’ that underlies cellular identity [11][15]. The landscape picture requires several key features to be consistent with experimental observations (see Figure 1). All cell fates must be robust attractors, yet allow cells to change fate through rare stochastic transitions [8], [16] as in cellular reprogramming experiments (Figure 1A). A common result of reprogramming is not the desired cell fate, but partially reprogrammed cells [17], [18]. These results suggest that the landscape is rugged and may contain additional spurious attractors corresponding to cell fates that do not naturally occur in vivo. In addition, environmental and external signals can control cell fates. Some environments stabilize particular cell fates (Figure 1B). A dramatic example of this is a protocol for reprogramming to neural progenitor cells (NPCs) that is identical to Yamanaka's protocol for reprogramming to ESC except for the culturing media [19]. Other external signals deterministically switch cell fates, as occurs in normal development (Figure 1C) [20]. Together, these imply the landscape is a dynamic entity that depends on environmental signals.

Figure 1. Phenotypic landscape.

Figure 1

These are illustrative cartoons of the cell fate attractor landscape. (A) The minimal cellular identity landscape. Each cell fate is a basin of attraction (black circles). Reprogramming between different cell fates (1 and 2) can occur probabilistically via different trajectories (black paths). Partially reprogrammed cells (PRC) exist as smaller, spurious, basins of attraction (red circle) that can be experimentally observed by reprogramming experiments (example trajectory in red). (B) Same cellular identity landscape in the presence of a stabilizing environment (ex. favorable culturing medium) for cell fate 2. The environment increases the radius and depth of the cell fate 2 basin of attraction. (C) Landscape in the presence of an external signal that gives rise to differentiation from cell fate 1 to cell fate 2 (ex. growth factors associated with differentiation). Notice the low energy path between the cell fates that drives switching from cell fate 1 to cell fate 2.

The recent experimental progress has inspired several different theoretical approaches to understand the epigenetic landscape and the underlying gene regulatory networks governing cell fates. One focus has been on explicit construction of landscapes for specific cell fate decisions such as the erythroid vs myeloid choice in hemopoietic development [21], pancreatic cell fates [22], or C. elegans vulva development [23]. Other network based approaches use experimental data to constrain the possible networks [24], [25]. A second area of work is based on understanding the underlying gene regulatory network [26], [27]. A recent paper [28] attempts to combine the network and landscape picture by using the network entropy to define a landscape. On a more abstract level, there has been a renewed interest in understanding Waddington's landscape mathematically using ideas from dynamical systems and nonequilibrium statistical mechanics [15], [29]. Most of these models focus on in vivo developmental decisions and hence consider the dynamics of a few genes or proteins.

Here, we present a new modeling framework to construct a global (i.e. all cell fates and all TFs) epigenetic landscape that combines techniques from spin glass physics with whole genome expression profiles. We were inspired by the successful application of spin glasses to model neural networks [30][33] and protein folding landscapes [34]. Here, we construct an epigenetic landscape model for cellular identity with 63 stable cell fates and 1337 TFs using cell-fate specific, mouse microarray gene expression data. Each cell fate is a robust attractor, yet cells can deterministically switch fates in response to external signals. Our model provides a unified framework to discuss differentiation and reprogramming. It also naturally explains the existence of partially reprogrammed cell fates as ‘spurious’ attractors resulting from the high dimensionality of the landscape. Our model predicts, and we verify, that partially reprogrammed cells are hybrids that co-express TFs of multiple naturally occurring cell fates. Finally, our model reproduces known reprogramming protocols to iPSCs, heart, liver, NPC, and thyroid, and has the potential for designing reprogramming protocols to novel cell fates. Taken together, these results suggest that epigenetic landscapes represent a powerful framework for understanding the molecular circuitry and dynamics that gives rise to cell fate.

The organization of the paper is as follows. First, we explain the motivation for using an attractor neural network to model the epigenetic landscape. Second, we define the state space for the model and the actual biological data used to construct the state space. Third, we give an overview of our landscape model (with details given in Table 1 and Materials and Methods: Landscape Model). Next, we show that our mathematical model captures the essential experimental features of cellular identity. We then show that our model naturally explains the existence of partially reprogrammed cells and makes predictions about their gene expression profiles. We verify this by reanalyzing experimental data. Finally, we show that our model can identify key reprogramming genes in existing reprogramming protocols, suggesting it can be used to identify candidate TF for reprogramming to novel cell fates. We conclude by discussing the implications of our mathematical model for understanding cellular identity and reprogramming.

Table 1. Mathematical model of cell identity landscape.

Landscape Term: Index Notation Landscape Term: Matrix Notation (dim.) Biological Interpretation
Inline graphic Total landscape.
Inline graphic Inline graphic Produces cell basins of attraction.
Inline graphic Inline graphic External control of individual genes, i.e. inducible expression.
Inline graphic Inline graphic External control of specific cell basins, i.e. culturing conditions.
Inline graphic Inline graphic Cell switching by signals, i.e. in vivo development.
Inline graphic Number of TFs, labeled by Inline graphic, Inline graphic. In this paper N = 1337.
Inline graphic Number of cell fates, labeled by Inline graphic, Inline graphic. In this paper Inline graphic.
Inline graphic Inline graphic (Inline graphic×Inline graphic) State (Inline graphic) of Inline graphic TF.
Inline graphic Inline graphic (Inline graphic×Inline graphic) State (Inline graphic) of Inline graphic TF in cell fate Inline graphic.
Inline graphic Inline graphic (Inline graphic×Inline graphic) Correlation between cell fate Inline graphic and Inline graphic.
Inline graphic Inline graphic (Inline graphic×Inline graphic) Interaction strength between Inline graphic and Inline graphic.
Inline graphic Inline graphic (Inline graphic×Inline graphic) External control of Inline graphic TF.
Inline graphic Inline graphic (Inline graphic×Inline graphic) External control of Inline graphic cell fate.
Inline graphic Inline graphic (Inline graphic×Inline graphic) Overlap of Inline graphic on cell fate Inline graphic.
Inline graphic Inline graphic (Inline graphic×Inline graphic) Projection of Inline graphic on cell fate Inline graphic.
Inline graphic Inline graphic (Inline graphic×Inline graphic) Predictivity of Inline graphic TF in cell fate Inline graphic.
Inline graphic Inline graphic (Inline graphic×Inline graphic) Signal dependent coupling that drives cell fate Inline graphic to cell fate Inline graphic

This table provides a summary of the landscape model and the biological interpretation of each term. The first column is written in index notation, while the second column is the same term in matrix notation with the dimension of the term given in parenthesis. If no dimension is listed, the term is a single number.

Results

Motivation from attractor neural networks

The Takahashi and Yamanaka reprogramming experiments [1] are reminiscent of content-addressable memory and attractor neural networks. First, let us introduce a content-addressable memory with a paraphrasing of the original Hopfield paper. A content-addressable memory allows one to retrieve a full memory based on sufficient partial information. For example, suppose the complete stored memory is “John J. Hopfield, Neural networks and physical systems with emergent collective computational abilities (1982).” A content-addressable memory is capable of retrieving the full memory based on partial, incomplete input. Therefore, the details “Hopfield,” “Neural networks,” and “1982” could be enough to recall the full memory.

In the Yamanaka reprogramming protocol, overexpressing only four TFs is enough for a fibroblast to “recall” the global TF expression of an ESC. A content-addressable memory is naturally represented as a basin of attraction in a dynamical system, with partial recall corresponding to entering the basin of attraction and full recall corresponding to reaching the minimum of the basin. Hopfield attractor neural networks [30], [31], [33] are a general method to take an input set of vectors (“memories”) and explicitly construct a unique, global, landscape such that each input vector is a global minimum and has a basin of attraction. In what follows, we will exploit the analogy between associative memory in attractor neural networks and cellular reprogramming to explicitly construct the epigenetic landscape underlying cellular identity.

The epigenetic landscape

Our goal is to model the global epigenetic landscape involving all cell fates by using genome wide data. Currently, microarrays are the only technology with genome wide data for a multitude of cell fates (although RNA-seq and other technologies will likely be useful in the future). Specifically, we compiled a dataset of 601 mouse whole genome microarrays (details in Materials and Methods: Data Analysis) resulting in the gene expression for N = 1337 transcription factors for Inline graphic cell fates. We restricted our considerations to TFs due to their importance in cellular reprogramming and differentiation. However, our model can be easily generalized to include other important genes. To robustly compare microarrays from multiple platforms, we converted the raw expression data into a rank ordered list. We assumed that gene expression is log-normal distributed (the minimal-assumption model for positive-definite random numbers such as gene expression) and assigned a z-score to each TF. The final output of this procedure is that it assigns each TF in every cell fate a z-score gene expression.

This continuous gene expression could be used to construct our epigenetic landscapes. However, for mathematical convenience, we discretize the continuous gene expression data into high expression (Inline graphic for z-score Inline graphic) and low expression (Inline graphic for z-score Inline graphic). See Text S1 for an extended discussion on continuous vs discrete TF expression in attractor neural networks.

This discretization process is biologically plausible. Cellular identity and differentiation are largely controlled by epigenetics, especially histone modifications (HMs) [35] (Figure 2A). Epigenetics primarily controls the accessibility of DNA and depending on the HM, the DNA can be stabilized in an open or closed configuration. Using global HM data [36], [37] and comparing it to microarray data, we created a conditional probability distribution of having a HM given a TF expression level (Figure 2B). We find that between a z-score of Inline graphic to Inline graphic there is a sharp threshold which distinguishes genes with the activating modification of histone 3 tri-methylation at lysine 4 (K4) from genes with the inactivating modification of histone 3 tri-methylation at lysine 27 (K27) and poised/bivalent genes (both K4 and K27). This provides a potential biological justification to our discretization. In summary, we take the continuous gene expression and binarize (Figure 2C). These binary (i.e. on/off) TF data are the only biological input into our model.

Figure 2. Overview of model.

Figure 2

(A) Histone 3 tri-methylation at lysine 4 (K4) is associated with active genes, while histone 3 tri-methylation at lysine 27 (K27) is associated with repressed genes. (B) Conditional probability distribution of histone modification (HM) given transcription factor (TF) expression levels derived by comparing microarray data with HM data from [36], [37]. Notice the sharp threshold (black line) between expression levels of active and inactive TFs. (C) For mathematical convenience, we take the continuous TF expression levels and convert it to binary states (z-score Inline graphic to Inline graphic and z-score Inline graphic to Inline graphic). This binarization is consistent with the result from (B). (D) An arbitrary state is represented by a vector Inline graphic of Inline graphic, with each dimension in the vector space representing the state of a TF. The natural cell fates form a subspace (gray plane). The landscape model is based on the orthogonal projection of the TF state onto this subspace. (E) The dynamics of the landscape model for different initial conditions for a fully connected interaction matrix Inline graphic and a diluted (non-equilibrium) interaction matrix where 20% of interactions have been randomly deleted. Plot shows the projection of Inline graphic on embryonic stem cells (ESC) as function of time. Notice the large basins of attraction (red bracket). Parameters used were Inline graphic and burst errors of Inline graphic every Inline graphic spin updates. (F) Simulations showing how a common myeloid progenitor (CMP) can differentiate into either granulo-monocytic progenitors (GMP) or megakaryocyte-erythroid progenitors (MEP) in response to two distinct external signals. All trajectories used Inline graphic. For signal 1, we set Inline graphic and all other Inline graphic. For signal 2, we set Inline graphic and all other Inline graphic.

In order to precisely describe the landscape results, we need to define the correct way to measure distances. One possible measure is the overlap (aka dot product or magnetization), defined for cell fate Inline graphic as:

graphic file with name pcbi.1003734.e102.jpg (1)

where Inline graphic is an arbitrary expression state and Inline graphic is the gene expression in the natural cell fate Inline graphic. The overlap between cell fate Inline graphic and state Inline graphic for exactly correlated, anti-correlated, or uncorrelated states is Inline graphic, Inline graphic, or Inline graphic respectively.

Cell fates from similar lineages (ex. blood) often have similar gene expression patterns. For example, B cells and T cells have a 77% overlap in their gene expression profiles. Such large correlations between cell fates makes the overlap, Inline graphic, a poor distance measure. In order to measure distances between highly correlated vectors, it is helpful to define the “projection” Inline graphic of a gene expression state Inline graphic on a cell fate Inline graphic by

graphic file with name pcbi.1003734.e115.jpg (2)

where Inline graphic is the inverse correlation matrix and Inline graphic is the overlap on cell fate Inline graphic and is given by

graphic file with name pcbi.1003734.e119.jpg (3)

The projection Inline graphic measures the orthogonal projection of a state Inline graphic onto the subspace spanned by naturally occurring cell fates, Inline graphic (see Figure 2D and Text S1), and a perfect projection onto state Inline graphic is given by Inline graphic. In contrast with the overlap, B cells have zero projection on T cells, and vice versa.

Our landscape assigns an “energy” to every global expression state. We emphasize that this energy does not correspond to physical energy consumption of ATP; instead it is an abstract energy that corresponds to stability and developmental potential of cell fates. The complete landscape Inline graphic can be thought of as arising from four terms with a simple interpretation (see Figure 1):

graphic file with name pcbi.1003734.e126.jpg (4)

The first term, Inline graphic, ensures that observed cell fates are valleys in our landscape (Figure 1A). The second term, Inline graphic, describes biasing of specific TFs by experimentalists (not shown in Figure 1). The third term, Inline graphic, increases the radius and depth of cell fates that are favored by the environment or culturing conditions (Figure 1B). Finally, in the presence of an external signal that gives rise to differentiation (ex. growth factors associated with differentiation), the fourth term, Inline graphic, opens a low energy path between the initial and final cell fates (Figure 1C). We give a complete mathematical description of the model in the Materials and Methods: Landscape Model and a summary in Table 1.

Cell fates are dynamic attractors that are responsive to signals

We performed self-consistency checks for our model using two in silico experiments (see details in Materials and Methods: Simulations). To verify that naturally occurring cell fates are dynamic attractors, we randomly perturbed the gene expression profile of cells from the ESC state and then tracked the gene expression over time. Real biology has many potential sources of noise, and the asynchronous dynamics introduced above will likely underestimate the noise. To show that our model is still robust to other large sources of noise, in our simulations we also add in periodic bursts of noise by flipping a fixed percentage of TF states (2%) to mimic the observation that cellular divisions produce HM errors [38]. Figure 2E shows the projection of the TF state on the ESC state as a function of time. For a large number of starting conditions, after an initial transient, the system relaxes back to the ESC state (red bracket), explicitly demonstrating the existence of a large basin of attraction [10]. This is true even when we break detailed balance by making the interaction matrix asymmetric by randomly deleting 20% of interactions (Figure 2E Diluted).

Our model can also deterministically switch between cell fates in response to differentiation signals. For example, the common myeloid progenitor (CMP) is a blood cell fate that in vivo can differentiate into either granulo-monocytic progenitors (GMP) or megakaryocyte-erythroid progenitors (MEP). In Figure 2F, we show in silico validation where we start the system in the CMP state and show the trajectories after applying either the GMP (signal 1, blue) or MEP (signal 2, red) differentiation signal, resulting in branching to two distinct cell fates.

Partially reprogrammed cells as “spurious” attractors

When performing a reprogramming experiment, besides the initial cell fate and the end goal cell fate, experimentalists often produce “novel cell fates”, dubbed partially reprogrammed cells [17], [18]. These partially reprogrammed cells have the characteristics of a stable cell fate (i.e. they can be passaged indefinitely in culture), but may express a mix of key markers for multiple cell fates and have a global gene expression that does not match any in vivo cell fate [18].

While the existence of partially reprogrammed cells was surprising to experimentalists, they have a natural interpretation in our model. One of the most generic properties of all attractor neural networks is that in addition to the desired attractors, Inline graphic, the non-linearity of the dynamical process and topology of high-dimensional (in our case N = 1337) vector spaces induces additional attractors, which are termed spurious attractors [33]. In our model, since the natural cell fates are the input vectors, these spurious attractors can be interpreted as potential cell fates that do not occur in vivo. These spurious attractors are predicted to be low-dimensional combinations, or hybrids (see Materials and Methods: Spurious Attractors and Text S1 for details) that should also be stable attractors but with smaller basins of attraction.

A priori, there are several valid hypotheses for the relationship between partially reprogrammed cells and natural cell fates. In the original experiments [17], [18], it was expected that partially reprogrammed cells should be a hybrid of the starting and goal cell fate only (i.e. have a significant projection only on the starting or ending cell fate). Another hypothesis was that in a high-dimensional landscape, randomly chosen vectors should be orthogonal (Figure S2) (i.e. have a projection of Inline graphic with all cell fates). However, our model predicts that partially reprogrammed cells should be low-dimensional hybrids of existing cell fates, but that they do not necessarily have to be a combination of the starting and goal cell fate. Mathematically, we predict that partially reprogrammed cells should only have a projection Inline graphic (2 std above 0, see Figure S2) for a small number of natural cell fates. Reanalyzing existing genome-wide datasets on partially reprogrammed cells (Table 2) validates the prediction of our model that partially reprogrammed cells are low-dimensional hybrids of existing cell fates. This qualitative agreement between the predicted spurious attractors and the partially reprogrammed states is independent of details of our landscape function. Importantly, such hybrid states are a generic property of all attractor-based landscape models and hence represents an important criteria for judging whether attractor-based models are suitable for describing epigenetic landscapes.

Table 2. Partially reprogrammed cells as spurious attractors.

Cell line Start Goal Highest projecting states (projection)
A2 [17] MEF ESC ESC (0.178), MSC (0.158), myoblast (0.142), MEP (0.129), blood vessel (0.113), keratinocyte (0.112), medullary thymic epithelial (−0.111), adipose - brown (−0.117), NK (−0.130), CMP (−0.138)
B3 [17] MEF ESC ESC (0.222), MSC (0.161), blood vessel (0.139), myoblast (0.138), GMP (0.127), kidney (0.111), MEP (0.107), cornea (0.107), NK (−0.129)
BIV1+ [18] B Cell ESC myoblast (0.181), prostate (0.164), MSC (0.154), MEP (0.138), keratinocyte (0.136), cornea (0.125), ESC (0.111), intestine - Paneth cell (−0.111), CMP (−0.122)
BIV1- [18] B Cell ESC ESC (0.382), EpiSC (0.184), MEP (0.160), myoblast (0.145), NSC (−0.108), T Cell (−0.115), skeletal muscle (−0.117), CMP (−0.154)
MCV6 [18] MEF ESC MEP (0.155), myoblast (0.150), ESC (0.149), keratinocyte (0.145), CLP (0.107), GMP (0.107), cornea (0.107), CMP (−0.130)
MCV8 [18] MEF ESC ESC (0.203), MEP (0.191), myoblast (0.160), cornea (0.119), prostate (0.113), skeletal muscle (−0.141), CMP (−0.142)

Partially reprogrammed cell lines (first column) and their significant projections (2 std above noise or Inline graphic) onto “natural” cell fates based on microarray data. Bold indicates 3 std above noise or Inline graphic. Abbreviations: CLP, Common Lymphoid Progenitor; CMP, Common Myeloid Progenitor; EpiSC, epiblast stem cell; ESC, embryonic stem cell; GMP, Granulocyte-Monocyte Progenitor; MEF, mouse embryonic fibroblast; MEP, Megakaryocyte-Erythroid Progenitor; MSC, Mesenchymal stem cells; NK, Natural Killer cells; NSC, neural stem cells.

Identifying transcription factors for cellular reprogramming

Our landscape model provides a quantitative method to identify “predictive” TFs for a given cell fate. These predictive TFs can be used as markers of a cell fate and are potential candidates for reprogramming protocols. We expect reprogramming TFs to be a subset of all predictive TFs but not all predictive TFs will lead to successful reprogramming. For example, cell-specific downstream targets of reprogramming TFs are likely to also be highly predictive for a cell type but may not lead to successful reprogramming.

Most reprogramming experiments follow an experimental protocol similar to the one outlined by Takahashi and Yamanaka in their seminal paper [1], [8]. Initially the starting cells (usually mouse embryonic fibroblasts, MEFs) are infected with viruses containing all the TFs of interest. The original Yamanaka experiment over-expressed 24 TFs [1], while more recent experiments usually start with about 10 TFs [2][6]. Several days after infection, the cells are switched to culturing conditions that support the desired final cell fate. If an experiment is successful, cells resembling the desired cell fate will appear after a few weeks. This original list is then pruned to identify a “minimal” (essential) set of TFs that still allows for successful reprogramming. In many cases, the viruses are excised [39] to confirm that the the reprogramming does not depend on viral expression. Furthermore, recent experiments indicate that the same TFs can be used to reprogram to a desired cell fate from multiple initial cell fates [16]. These experiments suggest that reprogramming TFs should be based on final, not initial, cell fate.

Intuitively, reprogramming candidates should be both highly expressed and highly “predictive” of the desired cell fate. The TF z-score naturally defines high and low TF expression levels. Within our landscape, the “predictivity” Inline graphic of the Inline graphic TF for a given cell fate Inline graphic, is measured by its contribution to the potential energy of that cell fate, and is mathematically defined as:

graphic file with name pcbi.1003734.e139.jpg (5)

where Inline graphic is the cell fate correlation matrix and Inline graphic is the expression of TF Inline graphic in cell fate Inline graphic. We note that the projection and predictivity are directly related as can be seen by

graphic file with name pcbi.1003734.e144.jpg (6)

where Inline graphic is the predictivity of TF Inline graphic in cell fate Inline graphic and Inline graphic is an arbitrary gene expression state.

For a desired target cell fate, TFs that are high (low) in both predictivity and expression in that cell fate are candidates for over expression (knock out) in reprogramming (see Figure 3A). For a simple, single measure of reprogramming efficacy of a TF, the predictivity and expression can be multiplied together to give a “reprogramming score”, where the top (bottom) rank order TFs are the best candidates for over expression (knock out). Figure 3 shows the expression and predictivity for TFs in a variety of cell fates. In Figure 3B, we have explicitly labeled the TFs used in the original Yamanaka protocol for reprogramming to ESC. Consistent with our model, these TFs are both predictive and highly expressed. Figure 3C shows TFs that have been successfully used in any reprogramming protocol to ESCs [8] as well as the pluripotency genes (involved in maintaining stem cell fate) Zfp42 (Rex1) [40] and Nr0b1 (Dax1) [41]. Once again these genes are highly predictive for ESCs. As a further check on the biological validity of our predictions, we analyzed the GO Annotation of our top 50 candidates for ESC reprogramming (Table S1). Within these top TFs, 12 have successfully been used in reprogramming, 7 are known pluripotency TFs, 16 are involving in cell differentiation, while 15 have no known function and are intriguing reprogramming candidates. Taken together this suggest that we are capturing the essential biology despite minimal biological data for input.

Figure 3. Identifying reprogramming candidates.

Figure 3

For a given cell fate, we plot every differentially expressed transcription factor's (TF) predictivity (aka energy projection-contribution, Inline graphic) vs TF expression level (z-score normalized). Unless otherwise stated all existing reprogramming protocols to a given cell fate are labeled. (A) Schematic illustrating predictivity vs expression level plots. The large positive (negative) predictivity and large positive (negative) gene expression TFs are candidates for over expression (knock out) in a reprogramming protocol. The TFs with z-score between Inline graphic and Inline graphic are highlighted in gray because Figure 2B suggests these TFs predictivity may be prone to extra noise induced by the data discretization. (B) Embryonic stem cell, ESC (induced pluripotent stem cells, iPSC). Original Takahashi and Yamanaka factors Pou5f1 (Oct 4), Sox2, Klf4, and Myc [1]. (C) Inset of ESC positive predictivity and gene expression. Zfp42 (Rex1) [40] and Nr0b1 (Dax1) [41] are pluripotency markers that are not necessary to overexpress for reprogramming, while combinations of the remaining labeled TFs have been successfully used in reprogramming protocols [8]. (D) Heart (induced cardiomyocytes, iCM) [3]. (E) Liver (induced hepatocytes, iHep). There are two published protocols. One protocol used Hnf4a plus any of Foxa1, Foxa2, or Foxa3 [4] while another used Gata4, Foxa3, Hnf1a, and deletion of p19Arf [5]. p19Arf was not differentially expressed in our microarrays and is not shown. (F) Thyroid [7]. (G) Neural Progenitor Cells, NPC (induced NPC, iNPC) used Pou3f2 (Brn2), Sox2, and Foxg1 [6]. With our microarrays we find that Foxg1 is not predictive for NPC but is predictive of neural stem cells (NSC) (see Figure S3). (H) Neurons (induced neuron, iN) [2]. The reprogramming protocol used a combination of factors that were known to be important to ether mature neurons (Myt1l) or NPCs (Pou3f2, Ascl1). (G) shows that Pou3f2 and Ascl1 are predictive of NPCs.

While ESC have been studied in the most detail, recent experiments have reprogrammed (aka direct conversion) to other cell fates such as cardiomyocytes [3] (Figure 3D), liver [4], [5] (Figure 3E), and thyroid [7] (Figure 3F). Once again we have explicitly labeled the TFs that have been successfully used for direct conversion. Notice that all of these TFs (except Mef2c) are highly predictive and highly expressed. Note that p19Arf [5] used in the direct conversion to liver was not differentially expressed in our microarrays and therefore was not included in our model.

We also examined TFs used in direct conversion to neural lineages. As discussed in [2], these TFs were chosen because they were known to be important in either neurons or neural progenitor cells (NPC). Figure 3F and 3G show the expression and predictivity of TFs for neural progenitor cells (NPC) [6] (Figure 3G), and neurons [2] respectively. Induced NPC were made using a four TF cocktail consisting of Pou3f2 (Brn2), Sox2, and Foxg1 [6]. Our analysis shows that the first two of these TFs are predictive for NPCs while Foxg1 is predictive for neural stem cells (NSC) (see Figure S3). Induced neurons (iN) can be made using the TFs Myt1l, Pou3f2, and Ascl1 [2]. Consistent with their experimental design, we find that Myt1l is highly predictive for mature neurons, while the remaining TFs (Pou3f2, Ascl1) are predictive for NPCs.

While it is not possible to perform statistical tests to test our examples due to the scarcity of reprogramming protocols, we performed a simple numerical exercise to gauge the predictive power of our model. The four Yamanaka factors are all in the top 50 when ranked by their reprogramming score for ESCs (where the reprogramming score of a TF is defined as the product of the expression and predictivity scores of a TF). We randomly permuted TF labels and asked how often all four Yamanaka factors remained in the top 50. For a million independent permutations, this occurred only once, confirming that our model is capturing many essential aspects of cellular reprogramming.

Our results suggest that epigenetic landscapes may be useful for rationally-designing reprogramming protocols to novel cell fates. To this end, we have used our model to identify candidate TFs for reprogramming, see File S5 for the top 50 candidates for overexpression for all cell fates and File S6 for top 50 candidates for knockouts for all cell fates.

Discussion

A common biological metaphor used to describe development and cellular reprogramming is a rugged “epigenetic landscape” which emerges from a complex gene regulatory network, with cell fates corresponding to attracting valleys in the landscape. Despite decades of biological innovation, the large number of genes and their complex interactions has prevented the quantitative modeling of a global epigenetic landscape. To meet this challenge, we have developed a new quantitative framework of cellular identity to directly model the global, high-dimensional epigenetic landscape. Using whole genome expression data, we constructed an epigenetic landscape based on techniques from spin glass physics and neural networks. Our landscape only depends on the experimentally determined gene expression of natural cell fates. Yet, it explains the existence of spurious cell fates (known as partially-reprogrammed cells) and can reproduce known reprogramming protocols to embryonic stem cells, heart, liver, thyroid, neural progenitor cells, and neurons. More importantly, our model can be used to identify candidate transcription factors for reprogramming to novel cell fates.

An interesting question is if spurious attractors are ubiquitous throughout the landscape, why does standard development not produce partially reprogrammed cells? The key is the difference in the dynamics. In cellular reprogramming, the starting cell fate is forced to express a small number of TF and this leads to a stochastic conversion to the desired cell fate (Figure 1A). During this stochastic exploration of the landscape, there is only a weak bias towards the final state, so it is easy for the cells to get trapped in a metastable state. However, during standard development, the external signals actively reshape the landscape and open up low energy valleys between cell fates (Figure 1C). This strong bias towards the final cell state results in a deterministic switch during which the spurious attractors are only a small road bump on the path to the final cell state. Therefore, it is not a surprise that partially reprogrammed cells are only found during cellular reprogramming and not during standard development.

Epigenetic landscapes can also be used to identify important, or predictive, TFs for cell fates. The predictivity of a TF for a cell fate generalizes the idea of specificity. A TF is specific to a cell fate if it is expressed only on in a small subset of cell fates. In contrast with specificity, predictivity weighs the global correlations amongst cell fates when assessing the importance of a TF for a cell fate. Thus, the predictivity not only picks out important specific TFs, but also TFs that are lineage markers. For example, Brachyury (T) [42] is a general marker of mesodermal lineages. Since it is highly expressed in large a number of cell fates, it is not specific to any given cell fate. However, it is predictive because its expression is a strong indicator that a given cell fate is a mesodermal lineage.

The concept of predictivity also yields new insights into the Yamanaka protocol. When the Yamanaka factors were first published, two of the four TFs, Pou5f1 (Oct4) and Sox2 were known to be important for ESCs. In contrast, the role of the other two TFs, Klf4 and Myc, was not well understood [43]. It was quickly shown that Myc was was not essential to reprogramming (Oct4, Sox2, and Klf4 can reprogram alone), but nonetheless enhanced the efficacy of reprogramming [44]. The importance of Klf4 was surprising given that it is neither highly expressed nor specific for ESC. However, Klf4 is highly predictive of ESC (Table S2). For this reason, our model actually explains why Klf4 is a prime candidate for reprogramming to ESCs.

We make several experimentally verifiable predictions. First, our model predicts the partially reprogrammed cells should be hybrids of existing natural cell fates. As more partially reprogrammed cells are studied, if they are found to either have high projection on only one cell fate (Inline graphic for one Inline graphic) or no projections on any cell fates (Inline graphic for all Inline graphic), this would call into question whether partially reprogrammed cells are truly the spurious attractors of an attractor neural network. Second, our model can be used to identify important, or predictive, TFs for cell fates. TFs with large positive (negative) predictivity should be positive (negative) markers for a cell fate. Additionally, for cellular reprogramming we predict that TFs with large positive (negative) predictivity and expression could be over expressed (knocked out) to reprogram to a desired cell fate. Therefore, our model has several predictions that can be tested against future experimental progress in the field.

Our model has several limitations. First, a generic limitation for any method relying on microarrays to define gene expression is that one cannot distinguish between direct, causal, interactions and indirect, correlative, interactions. Therefore, predictivity can establish the importance of a gene, but further experiments are needed to determine if the predictive gene is the controller of the cell type or just a passive indicator of a cell type. Second, it fails to accurately capture the dynamics of reprogramming. Simulations of reprogramming with known protocols, such as the Yamanaka protocol, lead to rates of reprogramming that are comparable to the rates from a reprogramming simulation with a randomly selected protocol. This is likely due to the fact that cell fates are extremely stable and hence reprogramming is extremely rare. Third, our model does not directly explain the importance of the non-specific transcription factor Myc. Many protocols use Myc [8], but it can be replaced (with no deleterious effect) by short hairpin RNAs (shRNAs) [45], or dropped completely from protocols at the expense of speed and less efficient reprogramming [44]. This suggests that Myc may have an alternative role and instead of being a biasing field, Inline graphic, it may instead raise the effective noise of the system (i.e. decrease Inline graphic). Another limitation is that based on the currently available experimental data, our landscape construction cannot definitively be distinguished from alternative constructions. For example, the interaction network could be constructed by such that it does not weigh each cell fate equally (as is currently done). This would have the effect of changing the relative stability of cell fates. Therefore, in the absence of more experimental data, our landscape and a weighted landscape cannot be distinguished.

A popular approach to inferring landscapes from biology data are “Maximum Entropy” models. This method has been used to model firing neurons [46], protein configurations [47], [48], and antibody diversity [49]. The Maximum Entropy approach takes as input large samples of biological data and a set of constraints and outputs a landscape that maximizes the entropy. While Maximum Entropy models can be used to infer landscapes with basins of attraction [50], it can quickly become a computationally challenging problem. Our approach differs from Maximum Entropy models in the following way. Since our goal is to model a landscape with basins of attractions, we make the ansatz that the landscape can be described by a Hopfield neural network. Then we insert real biological data, Inline graphic, to construct the landscape exactly. Our method requires no computational inference of parameters.

There are several natural extensions of the model discussed in this paper. The landscape could be constructed with additional biological input such as other genes, microRNAs, or histone modification data. This opens up possibilities of improving upon the high reprogramming rates achieved by overexpressing microRNAs [51] or synthetic mRNAs [52]. Another attractive element of the framework presented here is that it allows for a quantitative analysis of whole genome-wide expression states (see Table 2). This is likely to yield a more accurate classification of reprogrammed cells. Finally, directed differentiation protocols [53] attempt to mimic standard development in vitro and have proven to have high efficiency and fidelity. Future work will try to use our landscape to predict the necessary signaling factors for rationally designing more efficient directed differentiation protocols. Overall, epigenetic landscapes provide a unifying framework for cell identity, reprogramming, and directed differentiation, and our results suggest these landscapes can provide crucial insight into the molecular circuitry and dynamics that gives rise to cell fate.

Materials and Methods

Data analysis

Here we present the details of the dataset. All data used in this paper are available in the online Supplementary Information and is organized as follows:

  • File S1: Microarray Sources. List of all microarrays used in this paper.

  • File S2: TF Z-Score. The z-score gene expression for each TF of natural cell fates in this paper. This data is post RMA normalization and averaging over multiple replicates for each natural cell fate.

  • File S3: TF Predictivity. The predictivity for each TF and cell type in this paper.

  • File S4: Partially Reprogrammed Cells Z-Score. The z-score gene expression for each TF of partially reprogrammed cell fate. This data is post RMA normalization and averaging over multiple replicates for each partially reprogrammed cell fate.

  • File S5: Overexpression Candidates. Top overexpression candidates to reprogram to various cell fates.

  • File S6: Knock-Out Candidates. Top knock-out candidates to reprogram to various cell fates.

An older version of this manuscript, Arxiv v3 [54], has additional microarrays available that are unused in this version of the text. All microarrays used in this paper were taken from the public databases ArrayExpress (www.ebi.ac.uk/arrayexpress) or GEO (www.ncbi.nlm.nih.gov/geo). See File S1: Microarray Sources for details on where to obtain raw, pre-normalized and pre-averaged data.

There are two datasets, the natural cell fates and the partially reprogrammed cells. For the natural cell fates, we only used the Affymetrix GeneChip Mouse Gene 1.0 ST platform due to the large number of available microarrays on ArrayExpress (www.ebi.ac.uk/arrayexpress) and the better technical design of the platform (1.0 ST has probe matches throughout a gene in contrast to just the 3′ UTR in Affymetrix GeneChip Mouse Genome 430 2.0). There is limited data on partially reprogrammed cells so we used microarrays from Affymetrix GeneChip Mouse Genome 430 2.0.

The raw microarray data was converted to an expression level as follows. Microarray probe-to-gene map was created with Bioconductor 2.10. All raw microarray files were initially processed by robust mean averaging (RMA) in MATLAB, and genes with multiple microarray probes were averaged. We did additional processing of this output for two reasons. First, we need to compare microarrays from multiple platforms, but the standard RMA output can vary significantly from platform to platform. Second, since gene expression is a set of positive definite numbers, the minimal assumption model of gene expression is a log-normal distribution. Therefore, to make robust comparisons across platforms, we used order statistics [55]. The RMA output was converted to a rank order. Next, we want to convert this rank order to the z-score of a log-normal distribution. We converting the rank to a percentile (for Inline graphic genes, divide by Inline graphic), and then this percentile into a normal z-score. For later mathematical convenience, we used a biased estimator (normalize by Inline graphic not Inline graphic) since then the Euclidean norm of each microarray gene expression is Inline graphic.

At this point, the natural dataset consisted of 601 microarrays with 20719 genes. Since we were interested in cellular identity, only transcription factors, transcription factor co-factors, or chromatin remodeling genes were kept (for short hand, referred to as transcription factors (TF) throughout the text) [56], leaving 1715 TFs.

As explained in the main text, since continuous (sigmoidal input) attractor neural networks and discrete attractor neural networks are known to have the same stable fixed points [57], we used the binarized gene expression. We binarized the gene expression by setting a positive z-score to Inline graphic and a negative z-score to Inline graphic. While this was mainly done for mathematical convenience, this is potentially biologically justified. Histone modifications (HM) either leave chromatin in an open, accessible configuration or a closed, inaccessible state [35]. We found global HM data for embryonic stem cells (ESC), mouse embryonic fibroblasts (MEF), and neural progenitor cells (NPC) [36], [37]. Consequently, we used the global HM data for these three cell fates and compared them to microarray TF expression levels. This allowed us to create a conditional probability distribution of each HM for a given TF expression level (Figure 2B). We found a sharp cutoff (that coincides with a z-score of Inline graphic) which distinguished TFs with the activating modification of histone 3 tri-methylation at lysine 4 (K4) from TFs with the inactivating modification of histone 3 tri-methylation at lysine 27 (K27), poised/bivalent TFs (both K4 and K27), and no HM (most likely DNA methylation). This shows that our mathematical assumption is justified by the HM data.

After the binarization of TF expression, all TFs that were not differentially expressed across cell fates (i.e. TFs that are always on/always off in every cell fate) were dropped, leaving 1337 TFs. The binarized TF expression for the 63 cell fates was found by first binarizing all 601 microarrays and then taking the majority vote for each cell state (with ties broken by averaging the continuous data). The final result was the binary expression state for 63 cell fates.

Microarrays for partially reprogrammed cells were on the Affymetrix GeneChip Mouse Genome 430 2.0 Array. The same procedure was used to convert raw microarray data to z-score expression. However, since different microarrays do not have the same genome coverage, the analysis comparing partially reprogrammed cells and natural cell fates used the N = 1329 TFs common to both platforms.

Several self-consistency checks were performed on the data. First, the correlation matrix Inline graphic (explained in main text and below) was calculated for the original continuous data and for the binarized data (Figure S1). Both correlation matrices are consistent with each other showing binarization does not change the global correlations. Note that in the correlation matrix, cell fates have been grouped by tissue type, leading to a block diagonal form. Second, the expression state of all cell fates was constructed from multiple microarray experiments. These different experiments were compared with each other and were within 2 standard deviations (std equal to Inline graphic) for all cell fates. This demonstrates that microarrays from multiple laboratories can be directly compared.

Landscape model

Here we give an overview of our epigenetic landscape model. The model is summarized in Table 1, and Text S1 provides a supplementary overview of attractor neural networks.

State space

Each TF (labeled by Inline graphic, Inline graphic) can be in a state Inline graphic where Inline graphic indicates the TF is active while Inline graphic indicates it is inactive. A general cell state is given by Inline graphic, an N = 1337 dimensional vector. There are Inline graphic cell fates (labeled by Inline graphic, Inline graphic). In cell type Inline graphic, the state of TF Inline graphic is given by Inline graphic. The complete cell type data Inline graphic is a Inline graphic by Inline graphic matrix determined using our microarray data described above and these Inline graphic are the only biological input into the landscape.

Full landscape

The complete landscape Inline graphic can be written as the following terms:

graphic file with name pcbi.1003734.e188.jpg (7)

Our landscape assigns an “energy” to every global expression state. We emphasize that this energy does not correspond to physical energy consumption of ATP; instead it is an abstract energy that corresponds to stability and developmental potential of cell fates. Each of the four terms has a simple interpretation (see Figure 1). The first term, Inline graphic, ensures that observed cell fates are valleys in our landscape (Figure 1A). The second term, Inline graphic, describes biasing of specific TFs by experimentalists (not shown in Figure 1). The third term, Inline graphic, increases the radius and depth of cell fates that are favored by the environment or culturing conditions (Figure 1B). Finally, in the presence of an external signal that gives rise to differentiation (ex. growth factors associated with differentiation), the fourth term, Inline graphic, opens a low energy path between the initial and final cell fates (Figure 1C).

Landscape details: Inline graphic

The gene expression profiles of naturally occurring cell fates must be minima of our landscape. This is ensured by the landscape term

graphic file with name pcbi.1003734.e194.jpg (8)

In order to guarantee that cell fates are basins of attraction, we need to choose the “effective interaction” matrix, Inline graphic, which encodes how the Inline graphicth TF influences the Inline graphicth TF. Since we have highly correlated cell fates, we use the projection-method [32] (see Text S1 section “Discrete, Projection Method” for extended discussion on this choice), which defines the interaction matrix as:

graphic file with name pcbi.1003734.e198.jpg (9)

where Inline graphic are the natural cell fates and Inline graphic is the inverse of the correlation matrix between cell fates. Since our construction is based on correlations between gene expression profiles, Inline graphic includes the effect of “indirect” interactions between TFs Inline graphic and Inline graphic that are mediated through other TFs (see Text S1 for additional mathematical explanation of this construction). While the current definition implies Inline graphic is symmetric, this can easily be generalized to an asymmetric Inline graphic (see later section Landscape vs Pseudo-Landscape for details).

Landscape details: Inline graphic

The term Inline graphic ensures that all cell fates are global minima of the landscape. However, additional terms in the landscape are needed in order to incorporate key experimental features.

First, biologists can directly manipulate gene expression. For example, during the Yamanaka experiment, the TFs Inline graphic, Inline graphic, Inline graphic, and Inline graphic are overexpressed in fibroblasts. Mathematically, we represent the overexpression of TF Inline graphic by a local biasing field Inline graphic that ensures that Inline graphic. Therefore the Yamanaka reprogramming protocol enters the landscape as:

graphic file with name pcbi.1003734.e215.jpg (10)

where for the Yamanaka protocol, Inline graphic and for any other TF Inline graphic, the field Inline graphic.

Landscape details: Inline graphic

Currently, the basins of attraction Inline graphic are all set to the same minima value. However, environmental signals (such as cell culture conditions) can stabilize and destabilize specific cell fates (see Figure 1B). This can be incorporated into our landscape by terms such as:

graphic file with name pcbi.1003734.e221.jpg (11)
graphic file with name pcbi.1003734.e222.jpg (12)

where Inline graphic represents the culture biasing, and Inline graphic is the projection onto cell fate Inline graphic. This bias can be equivalently expressed at the level of TFs by defining a culture bias, Inline graphic, for the Inline graphicth TF given by:

graphic file with name pcbi.1003734.e228.jpg (13)

For example during the Yamanaka protocol, cells are cultured in conditions favorable to ESC, which is mathematically represented by Inline graphic, while for all other cell fates Inline graphic, Inline graphic.

Landscape details: Inline graphic

During standard development, cells switch fates deterministically in response to external signals. We mathematically represent this using the term:

graphic file with name pcbi.1003734.e233.jpg (14)
graphic file with name pcbi.1003734.e234.jpg (15)

where Inline graphic is the overlap on cell fate Inline graphic, Inline graphic is the projection onto cell fate Inline graphic, and the matrix Inline graphic is the developmental signal matrix that is a dynamic entity and a function of developmental time and external signals. We can equivalently write this in terms of transcription factors using the gene-interaction matrix, Inline graphic, defined as:

graphic file with name pcbi.1003734.e241.jpg (16)

where Inline graphic is the natural cell fate states, Inline graphic is the developmental signal matrix, and Inline graphic is the inverse correlation matrix. Since Inline graphic is asymmetric, Inline graphic is also asymmetric and explicitly breaks detailed balance (see later section Landscape vs Pseudo-Landscape for details).

We now explain the development signal matrix in more details. If Inline graphic, this opens up a low energy path between cell fate Inline graphic and cell fate Inline graphic. For example, during blood development, the common myeloid progenitor (CMP) can differentiate into either granulo-monocytic progenitors (GMP) or megakaryocyte-erythroid progenitors (MEP). The complicated external signals that induce switching from a CMP to GMP leads to Inline graphic and all other Inline graphic, while the signals that induce switching from a CMP to MEP leads to Inline graphic and all other Inline graphic. We emphasize that this term is purely phenomenological and further research will be needed to directly connect the developmental biology signals (such as Inline graphic, Inline graphic, etc) to the matrix elements Inline graphic.

Dynamics

We have uniquely defined the landscape Inline graphic. However, there are multiple ways to implement dynamics on this landscape. In this paper, we are primarily interested in the behavior of the stable fixed points and not dynamical trajectories. Therefore, we follow the standard convention in the attractor neural network literature and update the network by random, asynchronous updates (Glauber dynamics) [33]. Therefore, at each update, a random TF, Inline graphic, is selected and updated according to the probability

graphic file with name pcbi.1003734.e259.jpg (17)

where Inline graphic is the expression state of the Inline graphicth TF, Inline graphic is an effective noise parameter, Inline graphic is the local field, and Inline graphic is the time index. The local field Inline graphic is the gradient of the landscape (covariant derivative) defined for the full landscape Inline graphic as:

graphic file with name pcbi.1003734.e267.jpg (18)

where Inline graphic is the basin-inducing interaction matrix, Inline graphic is the experimentally induced bias on the Inline graphicth TF, Inline graphic is the culturing-condition specific bias on the Inline graphicth TF, and Inline graphic is the developmental interaction matrix.

We have introduced the effective noise parameter Inline graphic (i.e. inverse temperature) that controls the level of stochasticity resulting from biochemical noise. When Inline graphic, the update approaches a deterministic step function, while when Inline graphic each state is equally likely. Based on the currently available static genomic data, this update time cannot be directly related to biological time. Finally, we emphasize that since in this paper we are primarily concerned with the structure of the landscape, our results are independent of our choice of dynamics (see Text S1 for detailed discussion on dynamics).

Landscape vs pseudo-landscape

Currently, the interactions between TFs, Inline graphic, are symmetric. In real biology, this is unlikely to be true. We can introduce asymmetry into the interactions by randomly deleting interactions (for example Figure 2E Diluted). This asymmetry means that influence of TF Inline graphic on TF Inline graphic no longer equals the influence of TF Inline graphic on TF Inline graphic. This asymmetry breaks detailed balance and implies a non-Lyapunov pseudo-potential [29], [33], [58] and has been shown to be an additional source of noise on the basins of attraction [33].

We also note that the landscape term Inline graphic is explicitly non-equilibrium and breaks detailed balance. Under one set of environmental conditions, Inline graphic while Inline graphic driving switching from Inline graphic, while under a different set of environmental conditions, Inline graphic while Inline graphic driving switching from Inline graphic. Therefore, by including Inline graphic we are actually making our landscape a pseudo-landscape (i.e. non-Lypanouv).

Simulations

Here we include details of the simulations in this paper. For all simulations, we set Inline graphic and evolved the system for Inline graphic TF updates.

In Figure 2E, we demonstrate that we have basins of attraction. The initial conditions were created by taking the ESC expression vector and randomly flipping Inline graphic of the TFs. After every Inline graphic updates of asynchronous dynamics, burst errors were introduced by randomly flipping Inline graphic of TFs. For the asymmetric dilution, the standard interaction matrix Inline graphic was created. Then Inline graphic of matrix entries were randomly set to Inline graphic.

In Figure 2F, we demonstrate that the landscape can deterministically switch between basins. The initial conditions were always the CMP expression vector. For signal 1, we set Inline graphic and all other Inline graphic. For signal 2, we set Inline graphic and all other Inline graphic.

Spurious attractors

Here we provide more details on spurious attractors and hybrid cell fates. As explained in more detail in Text S1, for the traditional Hopfield model, these spurious attractors take the form of odd-majority vote mixtures [33] (i.e. majority vote at each TF of Inline graphic of the Inline graphic). The projection method also has the additional spurious attractors of any linear combination of Inline graphic that spans the discrete state space (see geometric interpretation given in Text S1) [32]. For convenience, we use the word hybrid as the collective term for either majority vote mixtures or linear combinations of existing cell fates.

As discussed in the main text, the prediction of spurious attractors in the projection method inspired us to reexamine data on existing partially reprogrammed cells. Surprisingly, we found that partially reprogrammed cells could be thought of as hybrids of existing cell fates. However, we are currently only able to obtain qualitative agreement between partially reprogrammed cells and the predicted nature of the spurious attractors. While it is known that the projection method retains these odd-majority vote mixtures spurious attractors, the correlations between states implies these spurious attractors may no longer be symmetric mixtures. However, the exact nature of these spurious attractors is not known and will be explored in future work.

Supporting Information

Figure S1

Cell fate correlation matrices. (A) Correlation matrix between cell fates for continuous data. (B) Correlation matrix for binarized data.

(PDF)

Figure S2

Projection of a random vector on a given cell fate. Ten thousand binarized random vectors were created in MATLAB and projected onto the cellular sub-space. The histogram shows the distribution of the projections. The red line is a Gaussian fit to the histogram. The mean is practically zero while the standard deviation is 0.053.

(PDF)

Figure S3

Predictivity vs expression for NSC. Same type of plot as Figure 3. Labeled TFs are part of reprogramming protocol to NPC [6]. This illustrates that Foxg1 is predictive for NSC, even though it is not for NPC.

(PDF)

File S1

Microarray sources. List of all microarrays used in this paper.

(TXT)

File S2

TF Z-score. The z-score gene expression for each TF of natural cell fates in this paper. This data is post RMA normalization and averaging over multiple replicates for each natural cell fate.

(TXT)

File S3

TF predictivity. The predictivity for each TF and cell fate in this paper.

(TXT)

File S4

Partially reprogrammed cells Z-score. Partially Reprogrammed Cells Z-Score. The z-score gene expression for each TF of partially reprogrammed cell fate. This data is post RMA normalization and averaging over multiple replicates for each partially reprogrammed cell fate.

(TXT)

File S5

Overexpression candidates. Top overexpression candidates to reprogram to various cell fates.

(TXT)

File S6

Knock-Out candidates. Top knock-out candidates to reprogram to various cell fates.

(TXT)

Table S1

Classifying top ESC reprogramming candidates. Table has top 50 embryonic stem cell (ESC) reprogramming candidates (as ranked by z-score times predictivity, Inline graphic). Classification of each TF is either justified by paper citation or GO Process term.

(PDF)

Table S2

Examining Yamanaka factors in detail. Here we reexamine the Yamanaka transcription factors (TFs) in light of our model.

(PDF)

Text S1

Attractor neural networks: Additional details. This supplementary text provides extended background details on Hopfield attractor neural networks but presents no new research findings. The sections are: (A) Discrete, Standard Hopfield. (B) Continuous, Standard Hopfield. (C) Continuous Gene Expression. (D) Discrete as Limit of Continuous. (E) Discrete, Projection Method.

(PDF)

Acknowledgments

We thank members of the Mehta Group, Collins lab, and Laertis Ikonomou, Katherine Benson, Darrell Kotton, and other members of Boston University Center for Regenerative Medicine (CReM) for stimulating discussions. In addition, we thank Laertis Ikonomou, Darrell Kotton, and Kristian Moss Bendtsen for a detailed reading of an earlier version of the manuscript.

Funding Statement

AHL was supported by a Boston University Dean's Fellowship and a National Science Foundation Graduate Research Fellowship (NSF GRFP) under Grant No. DGE-1247312. PM was supported by a Sloan Fellowship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Takahashi K, Yamanaka S (2006) Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell 126: 663–676. [DOI] [PubMed] [Google Scholar]
  • 2. Vierbuchen T, Ostermeier A, Pang ZP, Kokubu Y, Südhof TC, et al. (2010) Direct conversion of fibroblasts to functional neurons by defined factors. Nature 463: 1035–1041. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Ieda M, Fu JD, Delgado-Olguin P, Vedantham V, Hayashi Y, et al. (2010) Direct reprogramming of fibroblasts into functional cardiomyocytes by defined factors. Cell 142: 375–386. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Sekiya S, Suzuki A (2011) Direct conversion of mouse fibroblasts to hepatocyte-like cells by defined factors. Nature 475: 390–393. [DOI] [PubMed] [Google Scholar]
  • 5. Huang P, He Z, Ji S, Sun H, Xiang D, et al. (2011) Induction of functional hepatocyte-like cells from mouse fibroblasts by defined factors. Nature 475: 386–389. [DOI] [PubMed] [Google Scholar]
  • 6. Lujan E, Chanda S, Ahlenius H, Südhof TC, Wernig M (2012) Direct conversion of mouse fibroblasts to self-renewing, tripotent neural precursor cells. Proc. Natl. Acad. Sci. U.S.A. 109: 2527–2532. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Antonica F, Kasprzyk DF, Opitz R, Iacovino M, Liao XH, et al. (2012) Generation of functional thyroid from embryonic stem cells. Nature 491: 66–71. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. González F, Boué S, Belmonte JCI (2011) Methods for making induced pluripotent stem cells: reprogramming àla carte. Nat. Rev. Genet. 12: 231–242. [DOI] [PubMed] [Google Scholar]
  • 9. Vierbuchen T, Wernig M (2012) Molecular roadblocks for cellular reprogramming. Mol. Cell 47: 827–838. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Huang S, Eichler G, Bar-Yam Y, Ingber DE (2005) Cell fates as high-dimensional attractor states of a complex gene regulatory network. Phys. Rev. Lett. 94. [DOI] [PubMed]
  • 11.Waddington CH (1957) The Strategy of the Genes. A Discussion of Some Aspects of Theoretical Biology. London: Allen and Unwin.
  • 12.Kauffman SA (1993) The Origins of Order: Self-Organization and Selection in Evolution. New York City: Oxford University Press.
  • 13. Enver T, Pera M, Peterson C, Andrews PW (2009) Stem cell states, fates, and the rules of attraction. Cell Stem Cell 4: 387–397. [DOI] [PubMed] [Google Scholar]
  • 14. Zhou JX, Huang S (2011) Understanding gene circuits at cell-fate branch points for rational cell reprogramming. Trends in Genetics 27: 55–62. [DOI] [PubMed] [Google Scholar]
  • 15. Ferrell JE (2012) Bistability, bifurcations, and waddington's epigenetic landscape. Curr. Biol. 22: R458–R466. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Buganim Y, Faddah DA, Cheng AW, Itskovich E, Markoulaki S, et al. (2012) Single-cell expression analyses during cellular reprogramming reveal an early stochastic and a late hierarchic phase. Cell 150: 1209–1222. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Sridharan R, Tchieu J, Mason MJ, Yachechko R, Kuoy E, et al. (2009) Role of the murine reprogramming factors in the induction of pluripotency. Cell 136: 364–377. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Mikkelsen TS, Hanna J, Zhang X, Ku M, Wernig M, et al. (2008) Dissecting direct reprogramming through integrative genomic analysis. Nature 454: 49–55. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Kim J, Efe JA, Zhu S, Talantova M, Yuan X, et al. (2011) Direct reprogramming of mouse fibroblasts to neural progenitors. Proc. Natl. Acad. Sci. U.S.A. 108: 7838–7843. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Davidson E (2006) The Regulatory Genome: Gene Regulatory Networks in Development and Evolution. London: Academic Press.
  • 21. Huang S, Guo YP, May G, Enver T (2007) Bifurcation dynamics in lineage-commitment in bipotent progenitor cells. Dev. Biol. 305: 695–713. [DOI] [PubMed] [Google Scholar]
  • 22.Zhou JX, Brusch L, Huang S (2011) Predicting pancreas cell fate decisions and reprogramming with a hierarchical multi-attractor model. PLoS ONE 6: e14752–. [DOI] [PMC free article] [PubMed]
  • 23. Corson F, Siggia ED (2012) Geometry, epistasis, and developmental patterning. Proc. Natl. Acad. Sci. U.S.A. 109: 5568–5575. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Henry A, Monéger F, Samal A, Martin OC (2013) Network function shapes network structure: the case of the arabidopsis flower organ specification genetic network. Mol. BioSys. 9: 1726–1735. [DOI] [PubMed] [Google Scholar]
  • 25.Zagorski M, Krzywicki A, Martin OC (2013) Edge usage, motifs, and regulatory logic for cell cycling genetic networks. Phys. Rev. E 87: : 012727–. [DOI] [PubMed] [Google Scholar]
  • 26. MacArthur BD, Ma'ayan A, Lemischka IR (2009) Systems biology of stem cell fate and cellular reprogramming. Nat. Rev. Mol. Cell Biol. 10: 672–681. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Lu R, Markowetz F, Unwin RD, Leek JT, Airoldi EM, et al. (2009) Systems-level dynamic analyses of fate change in murine embryonic stem cells. Nature 462: 358–362. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Banerji CRS, Miranda-Saavedra D, Severini S, Widschwendter M, Enver T, et al.. (2013) Cellular network entropy as the energy potential in waddington's differentiation landscape. Sci. Rep. 3. [DOI] [PMC free article] [PubMed]
  • 29. Wang J, Zhang K, Xu L, Wang E (2011) Quantifying the waddington landscape and biological paths for development and differentiation. Proc. Natl. Acad. Sci. U.S.A. 108: 8257–8262. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79: 2554–2558. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Amit DJ, Gutfreund H, Sompolinsky H (1985) Spin-glass models of neural networks. Phys. Rev. A 32: 1007–1018. [DOI] [PubMed] [Google Scholar]
  • 32. Kanter I, Sompolinsky H (1987) Associative recall of memory without errors. Phys. Rev. A 35: 380–392. [DOI] [PubMed] [Google Scholar]
  • 33.Amit D (1992) Modeling Brain Function: The World of Attractor Neural Networks. Cambridge: Cambridge Univ. Press.
  • 34. Bryngelson JD, Onuchic JN, Socci ND, Wolynes PG (1995) Funnels, pathways, and the energy landscape of protein folding: A synthesis. Proteins: Structure, Function, and Bioinformatics 21: 167–195. [DOI] [PubMed] [Google Scholar]
  • 35. Jenuwein T, Allis CD (2001) Translating the histone code. Science 293: 1074–1080. [DOI] [PubMed] [Google Scholar]
  • 36. Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, et al. (2007) Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448: 553–560. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, et al. (2008) Genome-scale dna methylation maps of pluripotent and differentiated cells. Nature 454: 766–770. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Ben-David U, Mayshar Y, Benvenisty N (2011) Large-scale analysis reveals acquisition of lineage-specific chromosomal aberrations in human adult stem cells. Cell Stem Cell 9: 97–102. [DOI] [PubMed] [Google Scholar]
  • 39. Sommer CA, Sommer AG, Longmire TA, Christodoulou C, Thomas DD, et al. (2010) Excision of reprogramming transgenes improves the differentiation potential of ips cells generated with a single excisable vector. STEM CELLS 28: 64–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Masui S, Ohtsuka S, Yagi R, Takahashi K, Ko MSH, et al. (2008) Rex1/zfp42 is dispensable for pluripotency in mouse es cells. BMC Dev. Biol. 8: 45. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Khalfallah O, Rouleau M, Barbry P, Bardoni B, Lalli E (2009) Dax-1 knockdown in mouse embryonic stem cells induces loss of pluripotency and multilineage differentiation. STEM CELLS 27: 1529–1537. [DOI] [PubMed] [Google Scholar]
  • 42. Yamaguchi TP, Takada S, Yoshikawa Y, Wu N, McMahon AP (1999) T (brachyury) is a direct target of wnt3a during paraxial mesoderm specification. Genes Dev (13) 3185–3190. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Jaenisch R (2012) Nuclear cloning and direct reprogramming: The long and the short path to stockholm. Cell Stem Cell 11: 744–747. [DOI] [PubMed] [Google Scholar]
  • 44. Wernig M, Meissner A, Cassady JP, Jaenisch R (2008) c-Myc is dispensable for direct reprogramming of mouse fibroblasts. Cell Stem Cell 2: 10–12. [DOI] [PubMed] [Google Scholar]
  • 45. Onder TT, Kara N, Cherry A, Sinha AU, Zhu N, et al. (2012) Chromatin-modifying enzymes as modulators of reprogramming. Nature 483: 598–602. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Schneidman E, Berry MJ, Segev R, Bialek W (2006) Weak pairwise correlations imply strongly correlated network states in a neural population. Nature 440: 1007–1012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Bialek W, Ranganathan R (2007) Rediscovering the power of pairwise interactions. Arxiv 0712.4397.
  • 48. Cocco S, Monasson R, Weigt M (2013) From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction. PLoS Comp. Bio. 9: e1003176. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Mora T, Walczak AM, Bialek W, Callan CG (2010) Maximum entropy models for antibody diversity. Proc. Natl. Acad. Sci. U.S.A. 107: 5405–5410. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Tkačik G, Marre O, Amodei D, Schneidman E, Bialek W, et al. (2014) Searching for collective behavior in a large network of sensory neurons. PLoS Comp. Bio. 10 : e1003408 @ 1553–7358. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Yoo AS, Sun AX, Li L, Shcheglovitov A, Portmann T, et al. (2011) Microrna-mediated conversion of human fibroblasts to neurons. Nature 476: 228–231. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Warren L, Manos PD, Ahfeldt T, Loh YH, Li H, et al. (2010) Highly efficient reprogramming to pluripotency and directed differentiation of human cells with synthetic modified mrna. Cell Stem Cell 7: 618–630. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Longmire TA, Ikonomou L, Hawkins F, Christodoulou C, Cao Y, et al. (2012) Efficient derivation of purified lung and thyroid progenitors from embryonic stem cells. Cell Stem Cell 10: 398–411. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Lang AH, Li H, Collins JJ, Mehta P (2012) Epigenetic landscapes explain partially reprogrammed cells and identify key reprogramming genes. arXiv 1211.3133v3. [DOI] [PMC free article] [PubMed]
  • 55.David HA, Nagaraja HN (2003) Order Statistics, volume 7. Hoboken: John Wiley & Sons, 159–61 pp.
  • 56. Zhang HM, Chen H, Liu W, Liu H, Gong J, et al. (2012) Animaltfdb: a comprehensive animal transcription factor database. Nucleic Acids Res. 40: D144–D149. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Hopfield JJ (1984) Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81: 3088–3092. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Wang J, Xu L, Wang E, Huang S (2010) The potential landscape of genetic circuits imposes the arrow of time in stem cell differentiation. Biophys. J. 99: 29–39. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Figure S1

Cell fate correlation matrices. (A) Correlation matrix between cell fates for continuous data. (B) Correlation matrix for binarized data.

(PDF)

Figure S2

Projection of a random vector on a given cell fate. Ten thousand binarized random vectors were created in MATLAB and projected onto the cellular sub-space. The histogram shows the distribution of the projections. The red line is a Gaussian fit to the histogram. The mean is practically zero while the standard deviation is 0.053.

(PDF)

Figure S3

Predictivity vs expression for NSC. Same type of plot as Figure 3. Labeled TFs are part of reprogramming protocol to NPC [6]. This illustrates that Foxg1 is predictive for NSC, even though it is not for NPC.

(PDF)

File S1

Microarray sources. List of all microarrays used in this paper.

(TXT)

File S2

TF Z-score. The z-score gene expression for each TF of natural cell fates in this paper. This data is post RMA normalization and averaging over multiple replicates for each natural cell fate.

(TXT)

File S3

TF predictivity. The predictivity for each TF and cell fate in this paper.

(TXT)

File S4

Partially reprogrammed cells Z-score. Partially Reprogrammed Cells Z-Score. The z-score gene expression for each TF of partially reprogrammed cell fate. This data is post RMA normalization and averaging over multiple replicates for each partially reprogrammed cell fate.

(TXT)

File S5

Overexpression candidates. Top overexpression candidates to reprogram to various cell fates.

(TXT)

File S6

Knock-Out candidates. Top knock-out candidates to reprogram to various cell fates.

(TXT)

Table S1

Classifying top ESC reprogramming candidates. Table has top 50 embryonic stem cell (ESC) reprogramming candidates (as ranked by z-score times predictivity, Inline graphic). Classification of each TF is either justified by paper citation or GO Process term.

(PDF)

Table S2

Examining Yamanaka factors in detail. Here we reexamine the Yamanaka transcription factors (TFs) in light of our model.

(PDF)

Text S1

Attractor neural networks: Additional details. This supplementary text provides extended background details on Hopfield attractor neural networks but presents no new research findings. The sections are: (A) Discrete, Standard Hopfield. (B) Continuous, Standard Hopfield. (C) Continuous Gene Expression. (D) Discrete as Limit of Continuous. (E) Discrete, Projection Method.

(PDF)


Articles from PLoS Computational Biology are provided here courtesy of PLOS

RESOURCES