Dynamics of embryonic stem cell differentiation inferred from single-cell transcriptomics show a series of transitions through discrete cell states

Sumin Jang; Sandeep Choubey; Leon Furchtgott; Ling-Nan Zou; Adele Doyle; Vilas Menon; Ethan B Loew; Anne-Rachel Krostag; Refugio A Martinez; Linda Madisen; Boaz P Levi; Sharad Ramanathan

doi:10.7554/eLife.20487

eLife. 2017 Mar 15;6:e20487. doi: 10.7554/eLife.20487

Editor: Nir Yosef

PMCID: PMC5352225 PMID: 28296635

Abstract

The complexity of gene regulatory networks that lead multipotent cells to acquire different cell fates makes a quantitative understanding of differentiation challenging. Using a statistical framework to analyze single-cell transcriptomics data, we infer the gene expression dynamics of early mouse embryonic stem (mES) cell differentiation, uncovering discrete transitions across nine cell states. We validate the predicted transitions across discrete states using flow cytometry. Moreover, using live-cell microscopy, we show that individual cells undergo abrupt transitions from a naïve to a primed pluripotent state. Using the inferred discrete cell states to build a probabilistic model for the underlying gene regulatory network, we further predict and experimentally verify that these states have unique responses to perturbations, thus defining them functionally. Our study provides a framework to infer the dynamics of differentiation from single-cell transcriptomics data and to build predictive models of the gene regulatory networks that drive the sequence of cell fate decisions during development.

DOI: http://dx.doi.org/10.7554/eLife.20487.001

Research Organism: Mouse

Introduction

During differentiation, cells repeatedly choose between alternative fates in order to give rise to a multitude of distinct cell types. A major challenge in developmental biology is to uncover the dynamics of gene expression and the underlying gene regulatory networks that lead cells to their different fates. Given the complexity of gene regulatory networks, with their large number of components and even larger number of potential interactions between those components, building detailed predictive mathematical models is challenging. The lack of sufficient data requires a large number of assumptions to be made in order to constrain all the parameters in such models (Karr et al., 2012).

Our accompanying work on extracting cell states and the sequence of cell state transitions from gene expression data (Furchtgott et al., 2016) suggested that following the dynamics of a key set of genes was sufficient to trace these transitions, and in several instances the key genes that we discovered were also functionally important for lineage decisions. We asked whether we could similarly determine suitable parameters to quantitatively describe cell state transitions during early mammalian germ layer development and build predictive mathematical models of the underlying regulatory network.

Early differentiation of pluripotent mouse embryonic stem (mES) cells, which are derived from the inner cell mass of the peri-implantation stage embryo (see pictorial summary in Figure 1—figure supplement 1A), recapitulates various aspects of in vivo germ layer differentiation (Evans and Kaufman, 1981; Keller, 2005). During this stage, both mES cells and cells in vivo express key pluripotency factors, such as Nanog, Sox2, Oct4, Klf4, Jarid2, and Esrrb, which mutually activate one another to form a pluripotency circuit (Kim et al., 2008; Young, 2011; Zhou et al., 2007). Following implantation, naïve pluripotent ES cells of the inner cell mass downregulate Klf4 and upregulate Otx2, Dnmt3a, and Dnmt3b, as they transition into ‘primed’ pluripotent cells found in the epiblast (Buecker et al., 2014; Nichols and Smith, 2009). Over the next few days of differentiation, TGF-beta signaling factors, with the aid of WNT/beta-catenin signaling, promote and inhibit the differentiation of pluripotent cells into mesendodermal (characterized by genes such as Brachyury (T), FoxA2, Mixl1 and Gsc) and ectodermal (characterized by Eras, Sez6, Stmn3, and Stmn4) cell fates, respectively (Gadue et al., 2006; Hart et al., 2002; Li et al., 2015; Lindsley et al., 2006; Tada et al., 2005; Watabe and Miyazono, 2009). Mesendodermal progenitors further differentiate into mesoderm and definitive endoderm progenitors. Mesoderm cells are usually distinguished by the expression of Gata4 and Eomes, and endoderm cells by Sox17 and FoxA2, although in mouse these genes are shared between both lineages, with differences only in their timing and level of expression (Arnold and Robertson, 2009; Kanai-Azuma et al., 2002; Kim and Ong, 2012; Lumelsky et al., 2001; Rojas et al., 2005). Along the ectodermal lineage, BMP signaling pushes ectodermal cells toward epidermis, while in the absence of BMP signaling, ectodermal cells acquire a neural fate (Wilson and Hemmati-Brivanlou, 1995). Epidermal cells are characterized by Keratins, whereas neural cells express Sox1 and Pax6 (Koch and Roop, 2004; Pevny et al., 1998; Sansom et al., 2009; Streit and Stern, 1999). In response to WNT and BMP signaling, the cells at the physical border between epidermal and neural cells give rise to neural crest cells (expressing Sox10, Msx2, Snai1 and Slug), which are often described as a fourth germ layer because of the diverse range of tissues to which they give rise (Gans and Northcutt, 1983; Knecht and Bronner-Fraser, 2002; Nicole and Chaya, 1991). Despite the detailed understanding of early embryonic development revealed by decades of work in genetics and developmental biology, a quantitative understanding of how the underlying gene regulatory network leads cells through a series of cell fate decisions has remained elusive.

We use single-cell RNA-seq to determine how gene expression patterns change as mouse embryonic stem cells differentiate into different germ-layer progenitors. We employ a Bayesian framework (Furchtgott et al., 2016) to simultaneously infer cell states, the sequence of transitions between these states, and the key sets of genes whose expression patterns provide a parameter space in which the cell states and cell state transitions are inferred. Our computational analysis, together with experimental validation using flow cytometry and live-cell imaging of a new Otx2 reporter mES cell line, suggests that cells reside in discrete states and rapidly transition from one state to another.

Using the inferred gene expression dynamics and by requiring models to replicate the existence of the observed discrete cell states, we extract probability distributions of the parameters of a model gene regulatory network. Intriguingly, requiring the model to have discrete cell states leads to the prediction that each cell state has a distinct response to perturbations by signals and by changes in transcription factor expression levels. We experimentally verify three distinct categories of predictions, each testing whether cells exhibit such state-dependent behavior in response to a different type of perturbation. The experimental results show that whether (i) Sox2 overexpression represses Oct4, (ii) Snai1 overexpression represses Oct4, and (iii) LIF and BMP promote pluripotency or differentiation into neural crest all depend on cell state. Finally, we discuss the biological implications of our results.

Results

Acquiring single-cell transcriptomics data during early differentiation

We differentiated populations of mES cells by exposing them to one of four combinations of signaling factors and small molecules to perturb key paracrine signaling pathways involved in early mammalian patterning (Power and Tam, 1993; Tam et al., 2006): FGF, WNT, and/or TGF-beta signaling for up to five days (Figure 1A; see also Figure 1—figure supplement 1B, Materials and methods). Although cells in each population were differentiated in a monolayer culture and therefore exposed to nearly uniform conditions, we observed significant heterogeneity in the expression – as measured by immunofluorescence – of various known early germ layer marker genes (such as T, Pax6, Slug, FoxA2, and Gata4) in each population, suggesting a diversity of cell types under the same signaling conditions (Figure 1B). Further, undifferentiated pluripotent cells persisted in differentiating populations (Figure 1—source data 1, Figure 2—source data 1). Therefore, to capture the cell-to-cell variability within differentiating populations, we collected and transcriptionally profiled single cells every 24 hr over the course of five days of differentiation (Figure 1—source data 1) using a modified version of CEL-seq (Hashimshony et al., 2012). We obtained gene expression data from a total of 288 cells (Figure 1—figure supplement 1C–J; Materials and methods) with a median of 508,939 mapped reads, 48,475 transcripts and 7032 genes detected per cell. We then randomly subsampled 20,000 reads from each cell to eliminate any technical biases that may have resulted from differences in read numbers across cells (Figure 1—figure supplement 1K).
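
The subsampling step can be illustrated with a minimal sketch. It assumes each cell is represented as a vector of integer read counts per gene and that reads are drawn uniformly without replacement; the function name `subsample_reads`, the toy counts and the seed are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np

def subsample_reads(counts, depth, seed=0):
    """Draw `depth` reads without replacement from one cell's per-gene counts."""
    counts = np.asarray(counts, dtype=np.int64)
    if int(counts.sum()) < depth:
        raise ValueError("cell has fewer reads than the requested depth")
    rng = np.random.default_rng(seed)
    # Expand the counts into one gene label per read, then sample reads uniformly.
    read_labels = np.repeat(np.arange(counts.size), counts)
    kept = rng.choice(read_labels, size=depth, replace=False)
    # Count how many sampled reads fall on each gene.
    return np.bincount(kept, minlength=counts.size)

# Toy cell with four genes (hypothetical counts), downsampled to 20 reads.
cell = [50, 30, 0, 20]
sub = subsample_reads(cell, depth=20)
```

Applying the same `depth` to every cell equalizes sequencing depth, so that differences in detected expression are less likely to reflect differences in read numbers.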

Figure 1. — (A) Mouse embryonic stem cells (mESCs) were exposed to various differentiation conditions to perturb FGF, WNT, and TGF-beta signaling for up to five days of differentiation. Single cells, collected every 24 hr during differentiation, were transcriptionally profiled using CEL-Seq. (See also Figure 1—figure supplement 1B and Figure 1—source data 1). (B) Images of immunostained mESCs undergoing differentiation show cell-to-cell variability in their expression of known germ layer marker genes. (Scale bar = 100 μm).

**DOI:** http://dx.doi.org/10.7554/eLife.20487.002

Figure 1—source data 1. Differentiation conditions and duration of single cells sorted into seven 96-well plates.
**DOI:** http://dx.doi.org/10.7554/eLife.20487.003

Bayesian statistical approach discovers appropriate coordinate systems to infer cell states and state transitions

One of the challenges in analyzing single-cell gene expression data is the high dimensionality of the data set and the concomitant sparsity of the data (the number of data points divided by the dimensionality is small) (Advani and Ganguli, 2016). Conventional analysis of single-cell gene expression data relies on multi-gene or multi-cell correlation estimates, such as PCA (Seurat) (Satija et al., 2015), ICA (Monocle) (Trapnell et al., 2014) and WGCNA (Li et al., 2016; Saadatpour et al., 2014) to reduce the dimensionality of expression data. However, discovering cell types and their lineage relationships using these methods has been challenging (Furchtgott et al., 2016).

In the accompanying paper, Furchtgott et al. develop a Bayesian framework that simultaneously infers (i) the cluster identities of the cells, $\{C\} \equiv \{c_{1}, c_{2}, \dots, c_{N}\}$, (ii) the sets of transitions $\{T\}$ between these clusters, (iii) the key sets of marker genes $\{\alpha_{i}\}$ that define each cell cluster, and (iv) the sets of transition genes $\{\beta_{i}\}$ that define the transitions between clusters, from single-cell gene expression data $\{g_{i}\}$, by means of an iterative algorithm that determines the maximum likelihood estimates of these variables (Furchtgott et al., 2016).

Here, we employed this Bayesian framework to discover cell types and infer their lineage relationships for early mouse germ layer differentiation. We started by clustering the single-cell gene expression data for the 288 cells into 12 seed clusters $\{c_{1}^{0}, c_{2}^{0}, \dots, c_{12}^{0}\}$ using Seurat (Satija et al., 2015) as well as k-means (Figure 2—figure supplement 2A, B and C), restricting the analysis to transcription factors (2672 total) because of their functional role in orchestrating global gene expression (Spitz and Furlong, 2012). Seurat identifies cell clusters by performing density-based clustering on a two-dimensional t-distributed Stochastic Neighbor Embedding (t-SNE) map of the gene expression data (Van der Maaten and Hinton, 2008). These clusters $\{C\}^{0} = \{c_{1}^{0}, c_{2}^{0}, \dots, c_{12}^{0}\}$, ranging in size from 14 to 47 single cells, served as a seed for the iterative algorithm (described below).

We next considered every possible group of three clusters (e.g., $c_{1}^{0}$, $c_{2}^{0}$ and $c_{3}^{0}$) from a total of $\binom{12}{3} = 220$ such combinations. For each triplet of clusters, we first determined the probability that each gene $i$ was a marker gene ($\alpha_{i} = 1$), a transition gene ($\beta_{i} = 1$) or neither ($\alpha_{i} = \beta_{i} = 0$) based on the distribution of its expression levels in the cells of each cluster, where $\{g_{i}\}$ is the single-cell gene expression data of the $i$-th gene. Marker and transition genes are defined as follows (Figure 2—figure supplement 1A; Materials and methods; Furchtgott et al., 2016): (i) A marker gene $i$ ($\alpha_{i} = 1$) has a distribution of expression levels that is highest in one cluster, and well separated from the distribution of its expression levels in the other two clusters. Marker genes distinguish one of the clusters from the other two. (ii) A transition gene $j$ ($\beta_{j} = 1$) has a distribution of expression levels that is lowest in one cluster, and well separated from the distribution of its expression levels in the other two clusters. Each such transition gene establishes relative relationships between the three clusters (Furchtgott et al., 2016). (iii) Genes that are neither marker ($\alpha = 0$) nor transition genes ($\beta = 0$) do not follow constraints (i) and (ii) on expression level distributions. Computing the probability of each gene being a marker gene, a transition gene, or neither allowed us to determine the most likely set of transitions $T$ between each triplet of clusters. Each gene’s contribution to the posterior probability of $T$ is weighted by the odds ratio that the gene is a transition gene (Figure 2—figure supplement 1B). For example, for clusters $c_{1}^{0}$, $c_{2}^{0}$ and $c_{3}^{0}$, a gene whose expression is lowest in $c_{2}^{0}$ casts a vote against $c_{2}^{0}$ being the intermediate state (i.e., against the transition $T = c_{2}^{0}$, where $c_{2}^{0}$ is intermediate; Figure 2—figure supplement 1B, right), and this vote is weighted by the gene's odds of being a transition gene for those three clusters (Figure 2—figure supplement 1B, left). Within this Bayesian framework, the weighted votes are summed to determine the most likely set of transitions between each set of three clusters and, concomitantly, the most likely marker and transition genes corresponding to these clusters and transitions (Figure 2—figure supplement 1B, right).
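
The triplet enumeration and the weighted voting idea can be sketched as follows. This is a toy tally, not the likelihood computation of Furchtgott et al. (2016): the function names, the vote weights and the rule of picking the cluster with the fewest weighted votes against it are simplifying assumptions made for illustration.

```python
from itertools import combinations

def candidate_triplets(n_clusters):
    """All unordered groups of three cluster indices."""
    return list(combinations(range(n_clusters), 3))

def tally_votes(lowest_cluster, weights):
    """Sum weighted votes against each cluster of a triplet being intermediate.

    lowest_cluster[i] is the index (0, 1 or 2) of the cluster in which gene i
    is expressed lowest; weights[i] stands in for that gene's odds of being a
    transition gene (hypothetical values).
    """
    against = [0.0, 0.0, 0.0]
    for cluster, weight in zip(lowest_cluster, weights):
        against[cluster] += weight
    # In this toy tally, the cluster with the fewest weighted votes
    # against it is taken as the most plausible intermediate state.
    intermediate = min(range(3), key=lambda k: against[k])
    return intermediate, against

triplets = candidate_triplets(12)  # 12 seed clusters, as in the text
intermediate, against = tally_votes([1, 2, 1, 2, 0], [4.0, 3.0, 2.5, 2.0, 0.1])
```

In this example most of the weight falls on genes that are lowest in clusters 1 and 2, so cluster 0 is left as the candidate intermediate state.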

For the seed cluster set $\{C\}^{0}$, we determined 179 sets of transitions between clusters and identified 1035 transcription factors that were high-probability marker or transition genes for at least one of the identified transitions. For a gene to be defined as a marker or transition gene, we used a probability cutoff of 0.5. Moreover, we used a probability cutoff of 0.6 for a triplet of clusters to count as a transition event. We next re-clustered the single cells in the gene expression space defined by these 1035 marker or transition genes using Seurat, to obtain a new cluster set $\{C\}^{1} = \{c_{1}^{1}, c_{2}^{1}, \dots, c_{10}^{1}\}$ consisting of 10 clusters. In this process, cells changed cluster identities, and certain clusters merged (Figure 2A, Figure 2—figure supplement 1C).
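
A minimal sketch of how the two cutoffs select the re-clustering subspace is given below. The dictionary layout, the gene names `geneA` to `geneD`, the probabilities, and the use of "greater than or equal to" at the cutoff are assumptions for illustration; only the cutoff values 0.5 and 0.6 come from the text.

```python
GENE_CUTOFF = 0.5     # probability cutoff for marker/transition genes (from the text)
TRIPLET_CUTOFF = 0.6  # probability cutoff for a triplet to count as a transition (from the text)

def select_subspace(triplet_prob, gene_prob):
    """Return accepted triplets and the genes that define the new subspace.

    triplet_prob: {triplet: probability of its most likely transition}
    gene_prob:    {triplet: {gene: probability of being a marker or transition gene}}
    """
    accepted = [t for t, p in triplet_prob.items() if p >= TRIPLET_CUTOFF]
    genes = set()
    for t in accepted:
        # Keep genes that pass the cutoff for at least one accepted triplet.
        genes.update(g for g, p in gene_prob[t].items() if p >= GENE_CUTOFF)
    return accepted, sorted(genes)

# Hypothetical inputs for two triplets of clusters.
triplet_prob = {(0, 1, 2): 0.83, (0, 1, 3): 0.41}
gene_prob = {
    (0, 1, 2): {"geneA": 0.95, "geneB": 0.72, "geneC": 0.10},
    (0, 1, 3): {"geneD": 0.90},
}
accepted, genes = select_subspace(triplet_prob, gene_prob)
```

Here `geneD` is excluded even though its own probability is high, because its triplet does not pass the transition cutoff.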

Figure 2. — (A) Iterative determination of the most likely sets of transitions $\{T\}$ and re-clustering of cells in the resulting subspace of transition and marker genes, starting from a seed set of cluster identities $\{C\}^{0}$. With each iteration, the cluster identities as well as the total number of clusters change, as shown by the Seurat t-SNE maps (each dot represents a cell, colored based on its cluster identity). The inferred sets of transitions between clusters at each iteration are represented as a lineage tree (each circle represents a cell cluster). After five iterations, the algorithm converged upon a set of 9 clusters (shown in box). (See also Figure 2—figure supplement 2). (B) Left: Top ten genes (x-axis) with highest probability of being marker genes for clusters C₁ (yellow), C₂ (light red) and C₃ (light green) plotted against their probability of being marker genes (y-axis). Right: Cell-cell correlation matrix computed using these 30 marker genes for the 108 cells belonging to clusters C₁, C₂ and C₃ shows three clear blocks of high correlation along the diagonal. (C) Left: Top ten genes (x-axis) with highest probability of being transition genes for clusters C₁, C₂ and C₃, plotted against their probability of being transition genes (y-axis). The transition genes belong to one of two classes: those that show high expression in cells belonging to C₁ and C₂ but low expression in C₃ (red), and those expressed at high levels in cells in clusters C₁ and C₃ but at low levels in C₂ (green). Right: The cell-cell correlation matrix computed using these 20 transition genes shows that the 29 cells belonging to cluster C₁ have intermediate levels of correlation with cells in both C₂ and C₃, whereas the 46 cells in C₂ show low correlation levels with the 33 cells in C₃. (D) The global cell-cell correlation matrix computed for all 288 cells using the 889 genes used for the final iteration of clustering shows a barely detectable structure. (E) The inferred clusters and their lineage relationships can be represented in a three-dimensional coordinate system where the x- and y-axes are the normalized log expression levels of the two classes of transition genes (genes in Figure 2C, left) and the z-axis measures the normalized log expression level of the marker genes for cluster C₁ (Figure 2B, left, in yellow). Each dot represents a single cell, and cells are colored based on their cluster identity.

**DOI:** http://dx.doi.org/10.7554/eLife.20487.005

Figure 2—source data 1. Plate and well IDs of cells belonging to each cluster.
**DOI:** http://dx.doi.org/10.7554/eLife.20487.006

Figure 2—source data 2. Triplet probabilities of final tree.
**DOI:** http://dx.doi.org/10.7554/eLife.20487.007

By iteratively determining the most likely sets of transitions and the most likely marker and transition genes, and by re-clustering the cells within the subspace of these genes, the algorithm converged (i.e., the number of genes of the re-clustering subspace became less than 10% of the total number of transcription factors) upon the most likely set of cell clusters (Figure 2—source data 1), the sets of transitions between these cell clusters (Figure 2—source data 2), as well as a set of 889 genes categorized as marker or transition genes for at least one set of transitions after five iterations (Figure 2A; Figure 2—figure supplement 2D).

The final cluster set consists of 9 cell clusters ranging in size from 14 to 57 cells; every cell was mapped to a cluster, and we observed cells from different experimental conditions mixing within the same cluster as well as cells from the same experimental condition being assigned to different clusters (Figure 1—source data 1; Figure 2—source data 1; Figure 3—figure supplement 1A). We combined the local sets of transitions between different triplets of clusters (Figure 2—source data 2) in order to infer the most parsimonious lineage tree between the clusters (Figure 2A) (Furchtgott et al., 2016). Importantly, we obtained identical final clusters starting with different seed cluster sets using k-means clustering with the gap statistic, as well as with different threshold probability parameter values for defining transition and marker genes, showing that our results were robust to the choice of seed clusters, threshold probability value and clustering method (Figure 2—figure supplement 2A, B and C; Figure 2—figure supplement 3A and B; Materials and methods). The cluster identities as well as their lineage relationships were unchanged when the analysis was repeated with a subset of cells, in which either an entire cluster or a random half (144) of the cells was removed (Figure 2—figure supplement 3C and D). Further, we found that the clustering configuration does not change depending on whether the analysis is restricted to only transcription factors or includes all genes (Figure 2—figure supplement 3E). However, using all genes resulted in greater error rates along the topology of the inferred lineage tree compared to when only transcription factors were used (Furchtgott et al., 2016; Figure 2—figure supplement 3F).

The inferred lineage relationships between the final clusters could be visualized in the subspace of inferred marker and transition genes. We illustrate this first for the three clusters C₁, C₂, and C₃. We identified three classes of marker genes, each consisting of high-probability marker genes specific to one of the three clusters (Figure 2B). Each gene class is denoted by its highest probability member gene in curly brackets (e.g., {Otx2}). When the cell-cell Pearson correlation matrix between all 288 cells was determined using the 889 genes used for the final iteration of clustering and lineage determination, the matrix showed a barely detectable structure of nine blocks (with very low contrast) along the diagonal with marginally higher correlation levels, each corresponding to a cell cluster (Figure 2D). As expected, the low level of contrast observed in Figure 2D improves dramatically when the same correlation measures are taken across cells in a triplet, using marker or transition genes for this triplet, illustrating the locally defined nature of marker and transition genes (Figure 2B, right; Figure 2C, right; Figure 2—figure supplement 2E). Specifically, the correlation matrix computed using high-probability marker genes for clusters C₁, C₂, and C₃ (Figure 2B, left) showed three distinct blocks of high correlation along the diagonal, each corresponding to a different cluster (Figure 2B, right). Similarly, when the cell-cell correlations were measured using the two classes of inferred transition genes (Figure 2C, left), each consisting of high-probability transition genes present in C₁ and downregulated either in C₂ or in C₃, the correlation matrix showed intermediate correlation levels between C₁ and either C₂ or C₃, and low correlation levels between C₂ and C₃ (Figure 2C, right). The distribution functions of the expression levels of these transition genes in each of the three different clusters (C₁, C₂ and C₃) led to the inference that clusters C₂ and C₃ are connected via cluster C₁ with a probability of 0.83 (Figure 2E, Figure 2—figure supplement 1A and B).
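
The effect of restricting the correlation to a locally defined gene set can be illustrated with a small sketch. It assumes an expression matrix with cells as rows and genes as columns; the function name `cell_cell_correlation` and the toy matrix are hypothetical.

```python
import numpy as np

def cell_cell_correlation(expr, gene_idx):
    """Pearson correlation between cells, using only the selected genes.

    expr: (n_cells, n_genes) expression matrix; gene_idx: column indices.
    """
    sub = np.asarray(expr, dtype=float)[:, gene_idx]
    # np.corrcoef treats each row as a variable, so rows (cells) are correlated.
    return np.corrcoef(sub)

# Toy data: cells 0-2 express genes 0-1, cells 3-5 express genes 2-3;
# genes 4-5 vary independently of the two groups.
expr = np.array([
    [9, 10, 1, 0, 3, 7],
    [10, 9, 0, 1, 7, 3],
    [10, 10, 1, 1, 5, 5],
    [0, 1, 10, 9, 3, 7],
    [1, 0, 9, 10, 7, 3],
    [1, 1, 10, 10, 5, 5],
])
corr = cell_cell_correlation(expr, gene_idx=[0, 1, 2, 3])
```

With only the four group-specific genes, cells in the same group are strongly correlated and cells in different groups are anti-correlated, giving the block structure along the diagonal described above.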

We visualized the gene expression changes that characterize transitions from one cell cluster to another by plotting the cells in C₁, C₂ and C₃ in a three-dimensional gene expression subspace (Figure 2E), using as axes the mean normalized expression levels of the two transition gene classes down-regulated in C₂ or C₃ (in red and green in Figure 2C) and of the marker gene class specific to C₁ (Figure 2B in orange). These axes constitute a low-dimensional coordinate system for the inferred set of transitions between C₁, C₂ and C₃.

Similarly, the inferred transitions across all sets of three clusters (Figure 2—source data 2) together form a lineage tree (Figure 3A) that spans all nine identified cell clusters, which can be visualized in gene expression space through a series of local transition and marker gene classes (Figure 3C; Figure 3—source data 1). We next investigated the gene expression variability among cells within each cluster by performing principal component analysis (PCA) on the transcription factor gene expression for cells within each cluster. Importantly, we found that for all clusters, no principal component is statistically significant (compared to randomizations of the data; Figure 3B, Figure 3—figure supplement 1B), validating that within each inferred cluster, the cells have the same identity within the resolution of our data.
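
The comparison of principal component variances against randomized data can be sketched as follows. The text does not specify the randomization scheme, so this sketch assumes a null model in which each gene is shuffled independently across cells, and it takes the maximum top-component variance over the randomizations as the threshold; the function names and the toy data are hypothetical.

```python
import numpy as np

def pc_variances(x):
    """Variances along the principal components of a (cells x genes) matrix."""
    centered = x - x.mean(axis=0)
    singular = np.linalg.svd(centered, compute_uv=False)
    return singular ** 2 / (x.shape[0] - 1)

def randomization_test(x, n_random=1000, seed=0):
    """Count components whose variance exceeds the randomized maximum."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    observed = pc_variances(x)
    null_max = 0.0
    for _ in range(n_random):
        # Assumed null model: shuffle each gene independently across cells,
        # which preserves per-gene distributions but removes gene-gene structure.
        shuffled = np.column_stack([rng.permutation(col) for col in x.T])
        null_max = max(null_max, float(pc_variances(shuffled)[0]))
    n_significant = int(np.sum(observed > null_max))
    return observed, null_max, n_significant

# Toy data with two hidden subgroups of cells, which should yield a significant component.
rng = np.random.default_rng(1)
cells = rng.normal(size=(20, 30))
cells[:10, :10] += 8.0
observed, null_max, n_significant = randomization_test(cells, n_random=200)
```

A cluster whose cells share one identity would be expected to return no component above the threshold, whereas the toy data, which mix two subgroups, produce at least one.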

Figure 3. — (A) Computationally inferred cell clusters and sequence of transitions are shown in the appropriate subspace of gene expression. Each dot represents a single cell, and cells are colored based on their cluster identity. For a linear transition sequence of cell states (such as from C₀ to C₁), the transitions are represented in a two-dimensional plot with the axes defined by the normalized mean log of the unique reads of genes that are most differentially regulated in the two states, while for lineage bifurcations between alternative daughter cell states, the plots are shown in three dimensions, where the x and y axes are the normalized mean log unique reads of the associated set of transition genes, and the z axis is the normalized mean log unique reads of the marker genes associated with the inferred progenitor state. Labeled in parentheses next to each cluster are the abbreviated names of the putative corresponding cell types found in vivo (Epi: epiblast; bi_Ec: bi-potent ectoderm; ME: mesendoderm; NE: neural ectoderm; NC: neural crest; M: mesoderm; DE: definitive endoderm). (B) Top: Plot of the variances of the first ten principal components of the gene expression of cells in cluster C₀. The red line is the maximum principal component variance over 1000 randomizations of the data, showing that no principal component is statistically significant. Bottom: Variances of the first principal component of each cluster, normalized by the maximum principal component variance of the randomized gene expression data for the corresponding cluster. (C) A list of high probability genes that belong to the various marker and transition gene classes that define the axes of the plots in Figure 3A, each represented by one gene in curly brackets. The curly brackets contain the gene name with the highest probability for that class, and other high probability genes (as in Figure 2B and C) are listed in the table. While some of the genes are used only once, others such as *Otx2* and *Oct4* are repeatedly reused in different subspaces to describe the transition. (D) Flow cytometry analysis of cell populations sampled every 24 hr during differentiation and immunostained for nine genes (two shown at a time for each density contour plot): Klf4, Otx2, Oct4, Sox2, Slug, Pax6, FoxA2, Gata4 (each taken from a different gene class shown in Figure 3C), and T recapitulates the predicted structure and temporal ordering of transitions through discrete cell states. Axes represent the log of gene expression, normalized by the range between the minimum and maximum across each gene. Plots in pink and green represent C₂ and C₃ lineages following the split from C₁, respectively. (E) Live cell microscopy of *Otx2* reporter (mCitrine) cell line to infer the dynamics of cell state transition from C₀ to C₁. Sample images (shown) at t = 0, 6, 12, 18, and 24 hr of differentiation. Cells were terminated at approximately 25 hr into differentiation and immunostained for Nanog (ES marker gene, Figure 2—figure supplement 2A), which shows an anti-correlation between Otx2 and Nanog expression levels. (Scale bar = 100 μm) (F) Top: Time series (x-axis) traces of single-cell Otx2 (y-axis) expression dynamics taken every 15 min show that the duration of transition from Otx2-low (C₀) to Otx2-high (C₁) is approximately 4 hr, which is well within the time frame of one cell cycle (~10 hr). The end-point (t = 25 hr) Otx2 levels show a clear separation between high and low (histogram of ~200 cells shown to the right in gray), indicating that some cells have made the transition from C₀ to C₁ while others have not. Each trace is colored by its relative end-point Nanog immunofluorescence intensity level. Otx2 levels are normalized by the mean level at t = 0. Bottom: Histogram (y-axis = log (cell count)) of residence durations of ~400 cells in the Otx2-low C₀ state, showing that transition times vary across multiple cell cycle lengths (time lapse length = 48 hr). Inset bar shows mean as well as upper (white) and lower quartiles of the transition durations of cells.

**DOI:** http://dx.doi.org/10.7554/eLife.20487.011

Figure 3—source data 1. Probabilities of membership in marker and transition gene classes in final tree.
Listed are, for direct triplets along the lineage tree, the genes with the highest probabilities of belonging to transition gene and marker gene classes, and their associated probabilities. Genes belonging to the classes shown in curly brackets in Figure 3C (probability greater than 0.5%) are shown.

**DOI:** http://dx.doi.org/10.7554/eLife.20487.012


The inferred dynamics of differentiation can therefore be visualized in a low-dimensional subspace of gene expression, showing that differentiation occurs through a sequence of discrete cell state transitions.

Correspondence of cell states discovered ab initio from single-cell data to known in vivo cell types

Inspection of the genes that make up the local transition and marker gene classes (Figure 3C; Figure 3—source data 1) allowed us to match clusters to embryonic cell types found in vivo that show similar gene expression.

Cluster C₀ is characterized by the high expression of pluripotency genes Oct4, Sox2, Sall1, Etv5, Jarid2, Esrrb, Klf4 and Klf5, whereas cluster C₁ has lower Jarid2, Esrrb, Klf4 and Klf5, and higher Otx2, Bptf, Cbx1 and Dnmt3a/b expression compared to cluster C₀, suggesting that clusters C₀ and C₁ correspond to naïve ES and primed epiblast pluripotent cell types, respectively (Borgel et al., 2010; Goller et al., 2008; Kim et al., 2001; Nichols and Smith, 2009; Tesar et al., 2007; Zhou et al., 2007).

Clusters C₂ and C₃, which branch out from C₁, show differential expression of pluripotency genes relative to C₁; Bptf and Cbx1 are downregulated in both C₂ and C₃, Oct4, Etv5 and Dnmt3a are downregulated in cluster C₃ but maintained in C₂, and Sox2, Otx2 and Dnmt3b are downregulated in cluster C₂ but maintained in cluster C₃. Cluster C₂ is further characterized by a high expression level of primitive streak markers Mixl1 and T (Hart et al., 2002; Tada et al., 2005), whereas cluster C₃ is characterized by Sez6, Stmn3 and Stmn4, which have recently been shown to characterize the previously elusive mammalian bi-potent ectoderm progenitor population (Li et al., 2015). Together, these patterns strongly suggest that clusters C₂ and C₃ represent mesendoderm and bi-potent ectoderm progenitor cell types, respectively.

The bi-potent ectoderm progenitor-like cluster C₃ is then followed by a lineage split into clusters C₅ and C₆. While Stmn4 is downregulated in both C₅ and C₆ compared to C₃, Sez6 is downregulated in only C₅, and Stmn3 as well as neural progenitor marker Pax6 are downregulated in C₆ but maintained in C₅. Cluster C₅ is further characterized by Smarce1 and Zic2, and cluster C₆ by Slug and Msx2, suggesting that C₅ and C₆ may be related to neural progenitor and neural crest cells, respectively (Brown and Brown, 2009; Nicole and Chaya, 1991; Vogel-Ciernia and Wood, 2014).

Cluster C₄, although similar in its expression level of Mixl1 and T to cluster C₂, shows higher expression of other primitive streak genes such as FoxA2 and Tcf3 (Merrill et al., 2004) and lower expression of Etv5. Cluster C₄ is then followed by a bifurcation between clusters C₇ and C₈. Cluster C₇ shows high expression levels of Gata4 and Snai1, indicative of its relation to mesoderm, and cluster C₈ is characterized by high FoxA2 compared to clusters C₄ and C₇, suggestive of its relation to definitive endoderm (Kim and Ong, 2012; Rojas et al., 2005). We predict that cluster C₄ represents a primed bi-potent mesendoderm cell type relative to cluster C₂ (Nakanishi et al., 2009).

Together, these results suggest that the cell clusters and sets of transitions computationally inferred from single-cell transcriptomics data correspond to known in vivo cell types and their lineage relationships.

Differentiation occurs through a series of discrete cell state transitions

The fact that gene expression in each cell cluster does not vary significantly – as measured by the relative sizes of the largest eigenvalues of the principal components of the gene expression data (or percent variance explained thereby) versus those of the same data randomly shuffled (Figure 3B; Figure 3—figure supplement 1B) – allows for genes to be sorted into a few gene classes that show highly correlated expression patterns across clusters (Figure 3C). This suggests that one can validate the inferred sequence of cell state transitions and its gene expression dynamics by measuring the expression of one gene from each class in differentiating cells over time.
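The randomization test described above can be sketched as follows. This is a minimal illustration, assuming that "randomly shuffled" means permuting each gene's values independently across the cells of a cluster; the function names, the toy data and the use of 100 rather than 1000 shuffles are illustrative choices, not the authors' implementation.

```python
import numpy as np

def top_pc_variance(X):
    """Largest principal-component variance of a cells x genes matrix."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    return (s[0] ** 2) / (X.shape[0] - 1)

def shuffled_max_variance(X, n_shuffles=100, seed=0):
    """Maximum top-PC variance after permuting each gene independently across cells."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(n_shuffles):
        Xs = np.column_stack([rng.permutation(X[:, g]) for g in range(X.shape[1])])
        best = max(best, top_pc_variance(Xs))
    return best

def normalized_top_variance(X, n_shuffles=100, seed=0):
    """Top-PC variance divided by the maximum obtained from shuffled data."""
    return top_pc_variance(X) / shuffled_max_variance(X, n_shuffles, seed)

# Toy data (hypothetical): one homogeneous cluster, and one mixture of two
# sub-populations that differ in the first 10 genes.
rng = np.random.default_rng(1)
homogeneous = rng.normal(size=(60, 20))
mixed = homogeneous.copy()
mixed[:30, :10] += 3.0

ratio_homogeneous = normalized_top_variance(homogeneous)
ratio_mixed = normalized_top_variance(mixed)
```

A ratio well above one, as in the mixed toy cluster, would indicate residual structure within a cluster, whereas a ratio near or below one is consistent with a cluster that has no statistically significant principal component.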

In order to confirm the gene expression dynamics over the inferred sequence of cell state transitions, we assessed populations of cells for their expression levels of key transition and marker genes (each taken from a different gene class) via immunostaining and flow cytometry. We sampled mES cell populations every 24 hr during differentiation and immunostained each for Klf4, Otx2, Oct4, Sox2, Pax6, Slug, FoxA2, Gata4 and T. (Although T is not assigned to a specific gene class, it is highly expressed in the mesendoderm-like states C₂ and C₄, and it thus allows us to distinguish C₂ from the earlier epiblast-like state C₁.) The flow cytometry density contour plots shown (Figure 3D) are characterized by high-density peaks which are separated from one another by regions of low density, mirroring the discreteness of the cell states inferred from single-cell transcriptomics data. The relative locations of these high-density peaks and the time at which they appear and disappear recapitulate the inferred gene expression dynamics of the cell state transitions of the lineage tree.

During the first two days of differentiation, all cell populations downregulated Klf4 and upregulated Otx2, as shown in the first row of density contour plots in Figure 3D. This is consistent with the first observed state transition in our inferred lineage tree from the naïve ES C₀ state to the primed epiblast-like state C₁. On day three of differentiation (third column of plots in Figure 3D), Sox2 and Oct4 are asymmetrically downregulated relative to the preceding population, as is seen in mesendoderm-like state C₂ and bi-potent ectoderm-like state C₃ relative to the epiblast-like state C₁. Sox2-high, Oct4-low cells on day three are either high for Pax6 or for Slug, consistent with comparisons between the neural ectoderm-like state C₅ and neural crest-like C₆. On day four, the Pax6-high and Slug-high populations become proportionally larger as the Pax6/Slug-low population shrinks, supporting the inferred temporal ordering that C₅ and C₆ arise from the bi-potent ectoderm-like state C₃. Oct4-high, Sox2-low cells on day three of differentiation are high for T, but show two discrete levels of FoxA2, mirroring the difference between the two mesendoderm-like states C₂ (FoxA2-low) and C₄ (FoxA2-high). Further, we found that Etv5, a gene whose expression dynamics had hitherto not been implicated in early mesendodermal differentiation in mammals, was significantly downregulated from C₂ to C₄, as predicted from the single-cell gene expression data (Figure 3—figure supplement 1C). Finally, at days four and five, we observe FoxA2-high, Gata4-low and FoxA2-low, Gata4-high cell populations, which correspond to the primed mesendoderm and definitive endoderm-like states C₄ and C₈ and the mesoderm-like state C₇, respectively. We thus confirmed that differentiating cell populations recapitulate the gene expression dynamics of cell state transitions inferred from single-cell data (Figure 3A).

The observation that the majority of randomly sampled cells are found to belong to one of nine discrete cell states (both transcriptionally and at the protein level) suggests that cell state transitions occur within a relatively short timeframe compared to the amount of time cells spend within each state. We tested this hypothesis on the first cell state transition from the naïve ES C₀ state to the primed epiblast-like state C₁ (Figure 3A). To do so, we generated an Otx2-mCitrine fusion protein reporter mES cell line (Materials and methods) and observed the single-cell-resolution dynamics of Otx2 expression for up to two days (Figure 3E and F).

In agreement with our hypothesis, we observed that Otx2 levels, at the end of 24 hr of differentiation, show a bimodal distribution (Figure 3F, top), and cells tend to occupy either an Otx2-low state (corresponding to ES state C₀) or an Otx2-high state (corresponding to epiblast-like state C₁). We find that cells transition from an Otx2-low to an Otx2-high state well within the duration of a single cell cycle (mean transition duration of 4.52 hr compared to the cell-cycle length of approximately 10 hr). In contrast, cells tend to stay in either Otx2-low or -high states for up to multiple cell cycles, with a large amount of cell-to-cell variability in the residence duration (Figure 3F, bottom). Together with our results from the analysis of single-cell transcriptomics data, these observations show that cells reside in discrete states in gene expression space and correspondingly undergo abrupt state transitions.

A probabilistic model that replicates the observed discrete cell states predicts state-dependent interpretation of perturbations

Our analysis of single-cell gene expression data suggested a lineage tree composed of discrete cell states, and identified genes associated with individual cell states and transitions between them. While we predict the existence of discrete cell states based on their gene expression pattern, finding unique physiological properties that can define and distinguish their existence functionally would lend even greater support to this prediction. We therefore next sought to find properties of cell states that distinguished them functionally from one another. In order to do so, we built a predictive and testable quantitative model of the underlying gene regulatory network based on the expression patterns of the marker and transition genes.

From the 889 genes that were categorized as either marker or transition genes for all the high probability triplets, we first chose genes involved only in the triplets that fall directly along the inferred lineage tree. That is, we removed genes that were categorized as transition or marker genes for triplets consisting of ‘indirect’ lineage relationships, where at least one cell state is skipped between two cell states connected through the lineage tree. For instance, we did not consider the genes categorized as marker or transition genes only in the triplet C₀, C₁ and C₅, because C₃ is skipped between C₁ and C₅.
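The distinction between 'direct' and 'indirect' triplets can be illustrated with a short sketch. It assumes the lineage tree described in the text (C₀ to C₁; C₁ to C₂ and C₃; C₂ to C₄; C₃ to C₅ and C₆; C₄ to C₇ and C₈) and treats a triplet as direct when its three states are joined by exactly two tree edges. The function name `is_direct_triplet` and this edge-counting criterion are illustrative assumptions rather than the authors' implementation.

```python
from itertools import combinations

# Parent -> child edges of the inferred lineage tree, as described in the text.
TREE_EDGES = {(0, 1), (1, 2), (1, 3), (2, 4), (3, 5), (3, 6), (4, 7), (4, 8)}

def is_direct_triplet(triplet):
    """Return True if the three states are joined by two tree edges (no state skipped)."""
    n_edges = sum(
        (a, b) in TREE_EDGES or (b, a) in TREE_EDGES
        for a, b in combinations(triplet, 2)
    )
    return n_edges == 2

# Enumerate all direct triplets among the nine cell states C0..C8.
direct_triplets = [t for t in combinations(range(9), 3) if is_direct_triplet(t)]
```

Under this criterion the triplet (C₀, C₁, C₅) is rejected because C₁ and C₅ are not adjacent in the tree, whereas (C₀, C₁, C₃) and (C₃, C₅, C₆) are retained.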

Since some transition genes inferred from our Bayesian analysis are re-used to infer multiple local state transitions (Figure 3C; e.g., Oct4, Otx2), we classified transcription factors based on their distinct binarized patterns of expression across all nine cell states, with genes showing the same patterns belonging to the same gene module (Materials and methods, Figure 4—figure supplement 1A, Figure 4—source data 1). Hence, we categorized the 321 marker and transition genes involving ‘direct’ triplets along the tree into 26 gene modules, each of which showed distinct patterns of expression across the cell states. Further, because our goal was to test whether different cell states were functionally distinct (i.e., respond differently to the same signals and gene expression changes), we also noted the expression pattern of signaling factor genes belonging to FGF, WNT, LIF and BMP signaling pathways along the lineage tree (Figure 3A). The signaling factor genes constituting each of these modules were selected based on GO categories, leading to a total of 29 gene modules (Materials and methods). We denote each gene module by a representative gene in square brackets; for example, the gene module that uniquely characterizes the ES state C₀ is denoted as [Klf4] (Figure 4—source data 1 and 2).

Owing to the large number of gene modules, and the consequently even larger number of potential interactions between these modules, even the simplest mathematical model would consist of hundreds of parameters. However, for most of these parameters, direct experimental measurements are not available. In order to overcome this challenge, we exploited recent developments based on renormalization group approaches to determine which parameters are relevant for the observed data (Machta et al., 2013). We adapted the seminal model of artificial neural networks, known as the Hopfield model (Fard et al., 2016; Hopfield, 1984; Maetschke and Ragan, 2014), to construct an effective gene regulatory network between the 29 gene modules. By construction, we required that this mathematical model produces the nine cell states seen in Figure 3A. We considered a network that contains direct interactions, in which each module j exerts a drive on module i, which is equal to an interaction strength $J_{i j}$ (positive or negative) multiplied by the concentration of module j. The total drive on module i is the sum of the drives from the different modules. Given our observation of discrete cell states, we further considered that the total drive on module i affects expression in a highly non-linear manner, with high gene expression for drives that exceed a critical drive $ϕ_{0}$, and low gene expression otherwise (Figure 4—figure supplement 1B). For simplicity, we assumed that every gene module exhibits the same non-linear, step-function response when subjected to the same drive, thereby reducing the number of parameters of the model. Indeed, numerous genes show a sigmoidal response in expression in the presence of internal and external stimuli (Lebrecht et al., 2005; Segal and Widom, 2009). Thus the effective dynamics of the expression levels $m_{i}$ of each module $i$ are given by the non-linear equation:

$$\frac{d m_{i}}{d t} = H\left(\sum_{j} J_{i j} m_{j} - ϕ_{0}\right) - \frac{m_{i}}{τ_{i}}$$

where $H$ is the Heaviside step function and $τ_{i}$ is the effective lifetime of module $i$ (Materials and methods).
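To make the dynamics concrete, the equation can be integrated numerically for a toy network. The sketch below uses forward Euler steps on a hypothetical two-module network with self-activation and mutual inhibition; the interaction values, the threshold $ϕ_{0} = 0.5$, the lifetime $τ = 1$ and the convention $H(0) = 1$ are all illustrative assumptions and are not taken from the paper.

```python
def heaviside(x):
    """Step function with H(0) = 1, matching a 'drive >= threshold' convention."""
    return 1.0 if x >= 0 else 0.0

def simulate(J, m0, phi0=0.5, tau=1.0, dt=0.01, steps=2000):
    """Forward-Euler integration of dm_i/dt = H(sum_j J_ij m_j - phi0) - m_i / tau."""
    m = list(m0)
    n = len(m)
    for _ in range(steps):
        drive = [sum(J[i][j] * m[j] for j in range(n)) for i in range(n)]
        m = [m[i] + dt * (heaviside(drive[i] - phi0) - m[i] / tau) for i in range(n)]
    return m

# Hypothetical two-module network: each module activates itself and inhibits the other.
J_toy = [[1.0, -1.0],
         [-1.0, 1.0]]

end_a = simulate(J_toy, [0.9, 0.1])  # starts near (1, 0)
end_b = simulate(J_toy, [0.1, 0.9])  # starts near (0, 1)
```

With these toy values the two initial conditions relax to two different discrete states, approximately (1, 0) and (0, 1), illustrating how a step-function response yields distinct stable expression patterns.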

We determined the set of interactions $J_{i j}$ that are consistent with the observed cell states (C₀–C₈, Figure 3A) being stable fixed points of the network. If state ${\vec{m}}^{α} = \{m_{1}^{α}, \dots, m_{29}^{α}\}$ with expression level $m_{i}^{α}$ in module $i$ is a stable fixed point of the network, then the interactions $J_{i j}$ must be such that the total drive on each module that is expressed in ${\vec{m}}^{α}$ is greater than the critical drive, and the total drive on each module that is not expressed in ${\vec{m}}^{α}$ is less than the critical drive:

$$m_{i}^{α} = 1 \Rightarrow \sum_{j} J_{i j} m_{j}^{α} \geq ϕ_{0}$$

$$m_{i}^{α} = 0 \Rightarrow \sum_{j} J_{i j} m_{j}^{α} < ϕ_{0}$$

Thus, for each stable state, we have 29 constraints on the possible values of $J_{i j}$, one for each module. Given that we have nine cell states, there are 29 × 9 = 261 inequalities that constrain the values of the 29² = 841 different parameters $J_{i j}$. The problem is therefore underdetermined even for our simplified model of the underlying network, and there are an infinite number of solutions that would allow for the observed cell states to be stable.
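The fixed-point inequalities and the counting argument can be checked directly. The following sketch tests whether a binary state satisfies the stability constraints for a given interaction matrix and reproduces the numbers of constraints and parameters quoted above; the function name, the toy matrix and the threshold value of 0.5 are hypothetical.

```python
def is_stable_fixed_point(J, state, phi0=0.5):
    """Expressed modules (1) need drive >= phi0; silent modules (0) need drive < phi0."""
    n = len(state)
    for i in range(n):
        drive = sum(J[i][j] * state[j] for j in range(n))
        if (state[i] == 1) != (drive >= phi0):
            return False
    return True

# Counting constraints and parameters for the network described in the text.
n_modules, n_states = 29, 9
n_constraints = n_modules * n_states  # one inequality per module per state
n_parameters = n_modules ** 2         # one J_ij per ordered pair of modules

# Toy example (hypothetical values): self-activation with mutual inhibition.
J_toy = [[1.0, -1.0],
         [-1.0, 1.0]]
stable_10 = is_stable_fixed_point(J_toy, [1, 0])  # one module on, the other off
stable_11 = is_stable_fixed_point(J_toy, [1, 1])  # both on: drives cancel to zero
```

Because the parameters outnumber the inequalities by more than three to one, many different interaction matrices pass this check for the same set of states, which motivates treating the network as an ensemble.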

By using a linear programming method to obtain an ensemble of 10,000 sets of $J_{i j}$ interactions (Materials and methods), each satisfying the constraint that all nine cell states are stable fixed points, we estimated the probability distribution for the 841 parameters of the model (Figure 4A and Figure 4—figure supplement 2), giving us a probabilistic model of the underlying network. We further assumed that all the possible 10,000 sets of $J_{i j}$ interactions that reproduced the nine stable cell states were equally likely, since we did not have any experimental evidence to distinguish between them.
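The idea of building an ensemble of constraint-satisfying interaction matrices can be illustrated on a toy problem. Note that the sketch below deliberately swaps the linear programming method for plain rejection sampling, which is only practical for very small networks: random matrices are drawn with entries uniform in [-1, 1] and kept only if every target state is a stable fixed point. The three-module states, the sampling range and all names are hypothetical.

```python
import random

def is_stable(J, state, phi0):
    """True if every module's drive is on the correct side of the threshold."""
    n = len(state)
    return all(
        (state[i] == 1) == (sum(J[i][j] * state[j] for j in range(n)) >= phi0)
        for i in range(n)
    )

def sample_ensemble(states, n_samples, phi0=0.5, seed=0):
    """Rejection sampling (a stand-in for linear programming): keep random J
    matrices for which all given states are stable fixed points."""
    rng = random.Random(seed)
    n = len(states[0])
    ensemble = []
    while len(ensemble) < n_samples:
        J = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
        if all(is_stable(J, s, phi0) for s in states):
            ensemble.append(J)
    return ensemble

# Two hypothetical cell states over three modules.
toy_states = [[1, 1, 0], [0, 1, 1]]
ensemble = sample_ensemble(toy_states, n_samples=50)

# Ensemble statistics for one interaction, treating all accepted samples as equally likely.
mean_J02 = sum(J[0][2] for J in ensemble) / len(ensemble)
frac_negative_J02 = sum(J[0][2] < 0 for J in ensemble) / len(ensemble)
```

Summaries such as the mean of an interaction or the fraction of samples in which it is negative are the toy analogue of the per-parameter probability distributions described above.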

Figure 4. — (A) The inferred gene regulatory network from 10,000 sampled solutions that stabilize each of the nine cell states. Each circle represents a gene module. Mean positive and negative interactions between the modules are shown in red and green, respectively, and their thickness and transparency are proportional to the absolute magnitude of the mean and the coefficient of variation (c.v.), respectively. The colored circles represent the gene modules expressed uniquely in only one of the cell states (color code matched with Figure 3A for each state). (B, C) Subsets of the network consisting of gene modules that are expressed in (and stabilize) the naïve ES C₀ state (B) and epiblast-like C₁ (C) state. As cells transition from C₀ to C₁, expression of [*Klf4*], [*Apex1*], [*Ets2*], [*Atf2*] modules is downregulated (shown in gray) while [*Hes6*] and [*Otx2*] modules are upregulated, leading to changes in the effective interactions between gene modules that are common to both C₀ and C₁ states, such as [*Sox2*] and [*Oct4*]. (D) [*Sox2*] overexpression (x-axis) plotted against the probability of [*Oct4*] downregulation (y-axis) computed over 10,000 models (Materials and methods). In the C₁ state (solid line), [*Oct4*] is downregulated in an increasing fraction of models following [*Sox2*] overexpression, while in C₀, [*Oct4*] is stable in ~96% of the models (dotted line). In order to obtain the error bars for this and subsequent predictions, we randomly sampled three subsets of 3333 from the 10,000 models. For each set we computed the mean and standard error of the proportion of models that show downregulation of *Oct4* in response to *Sox2* overexpression. (**E, F**) Subsets of the model consisting of gene modules that are expressed in the epiblast-like C₁ (E) and mesendoderm-like C₂ (F) states, and their interactions with [*Snai1*], which is not normally expressed in C₁ or C₂. 
As cells transition from the C₁ to C₂ state, [*Hes6*], [*Sox2*], [*Otx2*], [*Churc1*] are downregulated (shown in gray), while [*Hmga1*], [T], [*Atf2*], [*Hes1*], [*Ets2*], [*Apex1*], [*Brd7*], [*Hmgn2*] and [*Smarce1*] are upregulated, leading to changes in the effective interactions between [*Snai1*] and modules that are common to both C₁ and C₂, such as [*Oct4*]. (G) The probability of [*Oct4*] being downregulated (y-axis) as a function of [*Snai1*] overexpression (x-axis). In the C₁ state (solid line), the overexpression of [*Snai1*] has no effect on [*Oct4*] levels in ~94.5% of the 10,000 models whereas in the C₂ state (dotted line), the overexpression of [*Snai1*] leads to [*Oct4*] downregulation in up to 19% of the models. (H) The C₃ state shows a downregulation of [*Oct4*] and [BMP], and upregulation of [*Tead1*], [*Apex1*], [*Pax6*], [*Smarce1*], [*Ets2*], [*Atf2*], [*Hes1*], [*Fhl1*], [*Hmgn2*] modules relative to C₁. (I) Cells in different states are predicted to respond differently to morphogens. Plot showing the percentage of models (y-axis) where states C₁ and C₃ (x-axis) transition to C₆ (characterized by unique marker gene module [*Msx2*]), in response to [LIF]+[BMP]. C₁ cells remain stable in response to [LIF]+[BMP] signaling in >98% of the models whereas C₃ cells are destabilized and move to the C₆ state in ~11% of the models.

**DOI:** http://dx.doi.org/10.7554/eLife.20487.014

Figure 4—source data 1. Gene modules used for modeling the network.
* The [BMP] and [Aes] modules have the same binary pattern. ** The [FGF] and [WNT] modules have the same binary pattern.

**DOI:** http://dx.doi.org/10.7554/eLife.20487.015


Figure 4—source data 2. Binary expression profiles of the gene modules used for modeling the network in the 9 cell clusters.
**DOI:** http://dx.doi.org/10.7554/eLife.20487.016




We used this probabilistic model to make testable predictions as to how different cell states respond to perturbations, that is, to test whether different cell states are not only defined by their distinct transcriptional profiles but are also functionally distinct in their phenotypic responses to the same perturbations. There are a vast number of testable predictions that one could extract from our gene regulatory network model. However, given the low-throughput nature of perturbation experiments, we selected three distinct probabilistic predictions, each probing different aspects of the model gene regulatory network.

First, we considered changes in the effective interaction between two gene modules as a function of cell state (i.e., how the effect of a change in the expression level of one gene module on the expression of another gene module differs across cell states, owing to the different sets of gene modules present in each state). To this end, we looked at two classes of gene module pairs: (i) gene modules that are co-expressed in two mother-daughter cell states and (ii) gene modules that are never co-expressed in any cell state.

Gene modules [Sox2] and [Oct4] are highly expressed in both the ES cluster C₀ and the epiblast-like C₁ cluster, after which they are asymmetrically downregulated in the mesendoderm-like C₂ and ectoderm-like C₃. We find that for 67.5% of the 10,000 sampled solutions, [Sox2] and [Oct4] have mutually inhibitory interactions (i.e., negative coupling constants). Although both [Sox2] and [Oct4] are present together in the C₀ and C₁ states, their effective interactions are altered in different ways in each cell state by the presence of other gene modules. As cells transition from state C₀ to C₁, they downregulate gene modules [Klf4], [Atf2], [Apex1] and [Ets2], and upregulate [Hes6] and [Otx2], among others (Figure 4B and C), leading to changes in the effective interaction strength between [Sox2] and [Oct4]. By incrementally increasing [Sox2] levels relative to its base value and assessing the fraction of models that show [Oct4] downregulation, we found that [Oct4] levels are predicted to be more stable to [Sox2] overexpression in state C₀ than in C₁ (Figure 4D), thus distinguishing C₀ and C₁ functionally (Geula et al., 2015).
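How the same perturbation can have a state-dependent outcome across an ensemble can be sketched with a simplified one-step check. The example assumes that overexpression raises the source module by an amount `delta` above its base level, and that the target counts as downregulated when its total drive falls below the threshold. The three-module ensemble, its numerical values and the function name are invented for illustration, and the full procedure in Materials and methods may differ.

```python
def fraction_downregulated(ensemble, state, source, target, delta, phi0=0.5):
    """Fraction of models in which the drive on `target` falls below phi0
    once module `source` is raised from its base level by `delta`."""
    perturbed = list(state)
    perturbed[source] = state[source] + delta
    n_down = 0
    for J in ensemble:
        drive = sum(J[target][j] * perturbed[j] for j in range(len(perturbed)))
        if drive < phi0:
            n_down += 1
    return n_down / len(ensemble)

# Module order: 0 = [Sox2], 1 = [Oct4], 2 = [Klf4]. All values are made up.
# Only the [Oct4] row matters for this one-step check, so other rows are zero.
oct4_rows = [[-0.2, 1.2, 0.5], [-0.6, 1.2, 0.8], [0.3, 1.2, 0.1]]
toy_ensemble = [[[0.0] * 3, row, [0.0] * 3] for row in oct4_rows]

state_c0 = [1, 1, 1]  # toy naive-like state: [Klf4] on
state_c1 = [1, 1, 0]  # toy primed-like state: [Klf4] off

frac_c0 = fraction_downregulated(toy_ensemble, state_c0, source=0, target=1, delta=1.0)
frac_c1 = fraction_downregulated(toy_ensemble, state_c1, source=0, target=1, delta=1.0)
```

In this toy ensemble the extra positive drive from the third module protects the target in the first state, so the target is downregulated in none of the models there but in one of three models in the second state, which mirrors the qualitative form of the prediction.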

On the other hand, [Snai1] and [Oct4] are not expressed together in any of the nine cell states. We investigated the predicted effects of [Snai1] overexpression on [Oct4] in the epiblast-like state C₁ and mesendoderm-like state C₂, both of which normally express [Oct4] but not [Snai1]. Although [Snai1] has a negative interaction with [Oct4] in 79.2% of the models, the modules expressed in C₁ exert a greater positive drive on [Oct4] (Figure 4E and F) than those expressed in C₂. This leads to the prediction that [Oct4] is less sensitive to [Snai1] overexpression in state C₁ compared to C₂ (Figure 4G).

We next considered the effect of morphogen signals in different states. Specifically, we considered the LIF, BMP, WNT and FGF signaling pathways, which are known to play a significant role in patterning the early embryo and are central to our in vitro differentiation process (Materials and methods). We grouped signaling genes by their respective pathways (defined by GO categories) and assigned each group to a module based on its average expression pattern across the nine cell states. Because WNT and FGF modules show no changes in expression across all cell states (most likely due to the large number of genes that fall into the relevant GO categories), we focused on investigating the effects of LIF and BMP signaling on cells in the epiblast-like C₁ and in the bi-potent ectoderm-like state C₃ (Figure 4H). Given an initial state C₁ or C₃, we calculated the probabilities that cells either remain in the same state or move to a different state in response to [LIF] and [BMP] (Materials and methods). Our simulations found that for ~98% of the models, cells that are initially in state C₁ either remained stabilized in C₁ or moved to state C₀ in response to [LIF] and [BMP] addition. However, in response to the same perturbation, the vast majority of cells in the C₃ state either transitioned to the neural crest-like state C₆ (11.2%) or stayed in the C₃ state (86.1%) (Figure 4I).

To summarize, we predict that [Oct4] expression is less sensitive to [Sox2] overexpression in state C₀ than in C₁; [Oct4] expression is less sensitive to [Snai1] overexpression in state C₁ compared to C₂; and cells in state C₃, but not in C₁, can transition to state C₆ following [LIF]+[BMP] exposure.

Importantly, we further noted that the model predictions were robust to changes in the probability cutoff for the genes we considered: although the number of gene modules changed (27 modules for a cutoff of 0.7 and 24 for 0.9), we found that the models made the same qualitative predictions (Figure 4—figure supplement 3).

Thus, by categorizing genes into different modules by their expression patterns across the observed cell states, these modules provide a starting point for modeling the gene regulatory network responsible for cell fate decisions, allowing us to make predictions for how the network gives rise to distinct phenotypic responses to the same perturbation across different cell states.

Interpretation of Sox2, Snai1, and LIF+BMP is cell state dependent

We next experimentally tested the qualitative aspects of the model’s predictions of state-dependence in cells’ responses to perturbations. We first tested how cells’ Oct4 levels respond to Sox2 overexpression in the naïve ES and epiblast-like states C₀ and C₁. We transiently transfected cells with a plasmid containing a Tet-inducible bi-directional promoter, flanked by the open reading frames of Sox2 and mCerulean, which we used as a fluorescent reporter of induction (Figure 5—figure supplement 1A). We induced overexpression in cells either in the undifferentiated C₀ state or the epiblast-like C₁ state, which correspond to Day 0 and Day 2 of differentiation, respectively (Figure 3D, Figure 5—figure supplement 1D). As a control, we used identical populations that were transfected with a plasmid containing only mCerulean under the inducible promoter. In such experiments, we typically saw mCerulean fluorescence appear approximately three hours into induction and persist for about three to four days after transfection. We therefore induced overexpression for 24 hr to minimize the effect of plasmid loss but still allow for several cell cycles to occur during induction. Following induction, we fixed and immunostained the cells for Oct4, and analyzed the results via flow cytometry. In agreement with our predictions (Figure 4D), we found that Sox2 overexpression correlates ( $R = - 0.3258, p = 1.48 \times 10^{- 13}$ ) with downregulation of Oct4 in the epiblast-like state C₁ (significant relative to control, $p = 5.72 \times 10^{- 31}$ ; see also Figure 5—figure supplement 1C), whereas this effect was not observed in undifferentiated cells (state C₀) (Figure 5A and B).

Figure 5. — (A) Comparison of the effects of *Sox2* overexpression (x-axis) on Oct4 levels (y-axis) in the naïve ES state C₀ (left) and epiblast-like C₁ state shows negative correlation between *Sox2* overexpression and Oct4 levels in the C₁ state, but not in C₀. Plots showing mCerulean (marker)-only overexpression in C₀ or C₁ are indistinguishable from *Sox2* overexpression in C₀ (Figure 5—figure supplement 1C). (B) Fraction of Oct4-high cells (y-axis; defined as greater than 2σ below the mean log of Oct4 of non-transfected control cells) plotted against binned *Sox2* overexpression level confirms model prediction (Figure 4D) that *Sox2* overexpression leads to downregulation of Oct4 in C₁ but not C₀. (C) Comparison of the effects of *Snai1* and mCerulean-only (left) overexpression on Oct4 levels (x-axis) in the epiblast-like C₁ and mesendoderm-like C₂ states (y-axis; T-low and T-high, respectively) shows downregulation of Oct4 in response to *Snai1* overexpression in the C₂ state but not in C₁. (D) Fraction of Oct4-high cells in *Snai1* overexpressing cells, normalized by this fraction in mCerulean overexpressing control cells (y-axis), plotted against binned *Snai1* overexpression level (x-axis) confirms the prediction (Figure 4G) that *Snai1* overexpression leads to greater downregulation of Oct4 in C₂ compared to C₁. (E) Live cell images of Oct4-mCitrine cells at t = 0, 6, 12, 18, 24 hr of LIF+BMP exposure. At t = 0, cells are either in state C₁ (Oct4-high) or C₃ (Oct4-low) (Figure 5—figure supplement 1E). (Scale bar = 100 μm) Cells were fixed at t = 24 hr and immunostained for *Msx2*. (F) Time series (x-axis) traces of single-cell Oct4 expression (y-axis) taken every 15 min from live cells. Each trace is colored by its relative end-point *Msx2* immunofluorescence intensity level. (G) The initial Oct4 reporter (mCitrine) intensity (y-axis) and final Msx2 immunofluorescence (x-axis) are negatively correlated. Each dot represents a single cell. 
Histogram of Oct4 reporter intensity at t = 0 levels shown in gray. Based on this histogram, we defined a range of threshold values for determining Oct4-high and –low (shown in overlapping region of orange and green along y-axis). (H) Plot showing fraction of Msx2-high (y-axis; as defined by greater than 2σ above background) confirms prediction (Figure 4I) that Msx2 is upregulated with a greater probability in the C₃ state compared to C₁ (x-axis) in response to LIF+BMP exposure.

**DOI:** http://dx.doi.org/10.7554/eLife.20487.020

We then tested the effects of Snai1 overexpression on Oct4 in the epiblast-like state C₁ and mesendoderm-like state C₂, using the same experimental framework as described above. On day three of differentiation, cell populations either contain a mixture of C₁, C₂ and (minimally) C₄ cell states, or a combination of C₁, C₃ and C₅ (or C₆), depending on the signaling conditions (Figure 3D). Using the signaling conditions that yield the former set of cell states (C₁, C₂ and C₄), we transfected cells at 2.5 days into differentiation, and drove overexpression of Snai1 12 hr later in a population consisting primarily of cells in C₁ and C₂ states (Figure 5—figure supplement 1D). After 24 hr of Snai1 overexpression and further differentiation, we fixed and immunostained the cells for T to distinguish cells in C₁ (T-low) and C₂ (T-high) states. We also immunostained the cells for Oct4 to distinguish the C₁ state from other T-low states that arise during the last 24 hr of differentiation following the initiation of induction. We found that the fraction of C₁ cells within the transfected population was significantly reduced relative to control ( $p = 1.98 \times 10^{- 13}$ ), suggesting that cells in this state had downregulated Oct4 levels in response to Snai1 overexpression. On the other hand, the fraction of C₂ cells within the transfected population and their Oct4 levels were maintained relative to control, in agreement with our predictions (Figure 4G; Figure 5C and D).

Finally, we tested whether cells in epiblast-like C₁ and bi-potent ectoderm-like C₃ states respond differently to LIF+BMP signaling, as predicted by our model. In order to investigate the relationship between a cell’s initial state and its final state in response to LIF+BMP exposure, we needed to assess cells’ initial states non-invasively. We found that 2.5 days into differentiation, we could obtain populations that consist primarily of cells in epiblast-like state C₁ and bi-potent ectoderm-like state C₃ (Figure 5—figure supplement 1E), which have high and low expression of Oct4, respectively. We therefore utilized an Oct4-mCitrine mES cell line that we had previously engineered (Thomson et al., 2011) to distinguish cells in C₁ and C₃ states after 2.5 days of differentiation. At this point, 1200 U/mL LIF and 25 ng/mL BMP4 were added to the media, after which we followed individual cells’ Oct4 expression dynamics for approximately 24 hr via live-cell microscopy, followed by fixing and immunostaining for Msx2, a unique marker gene for the neural crest-like cell state C₆ (Figure 5E and F). As predicted by the model (Figure 4I), only cells that had low Oct4 levels (and were therefore in the bi-potent ectoderm-like state C₃) prior to LIF+BMP exposure showed upregulation of Msx2 in response to LIF+BMP ( $R = - 0.5056, p = 0.0044,$ Figure 5G and H). Together, these results show that the inferred cell states reflect phenotypic discreteness in cells’ responses to perturbations, and that the gene expression changes that define these responses mirror those predicted by our model gene regulatory network.

Discussion

By using learned sparse patterns of gene expression from established experimental systems (Furchtgott et al., 2016), we can analyze single-cell transcriptomics data to uncover the gene expression dynamics of differentiation. This method naturally identifies a small set of transcription factors whose expression profiles are multimodal across neighboring cell states. Given that transcription factors are key orchestrators of gene expression and therefore cell fate decisions (Spitz and Furlong, 2012), multimodal distributions of the expression levels of even a small set of transcription factors can define cell states in a population of cells.

While cell states can be characterized by the gene expression patterns of key sets of genes, these states can only be fully validated by demonstrating distinct physiological properties. To discover distinct properties of the cell states in early mES cell differentiation, we built probabilistic models of the underlying network. Requiring these models to have discrete cell states leads to the prediction that each cell state has a distinct response to perturbations by signals and changing levels of gene expression. Thus, the cell states we discovered can be functionally defined by their responses to perturbation. Our experimental tests show, as predicted by the model network, that Oct4 is either downregulated or unaffected by overexpression of Sox2 or Snai1, depending on the cell state. Previous studies have already shown that Sox2 and Oct4, along with Klf4, constitute part of a positive feedback loop that stabilizes the pluripotent ground state (Kim et al., 2008; Young, 2011). It is also known that in undifferentiated cells, Snai1 overexpression leads to downregulation of Oct4 expression and, subsequently, to exit of pluripotency (Galvagni et al., 2015). However, our results demonstrate that these interactions are state-dependent by showing that the effective positive interactions between Sox2 and Oct4 become destabilized as Klf4 levels drop and cells transition to a primed, epiblast-like pluripotent state. Similarly, the negative interaction exerted by Snai1 on Oct4 becomes attenuated in the presence of early primitive streak genes such as T. We also predict and show that LIF+BMP exposure pushes bi-potent ectoderm-like cells toward an Msx2-positive neural crest-like state, but this effect is not seen in epiblast-like cells. 
These results are further supported by the fact that both LIF and BMP signaling pathways can be used to keep cells in the pluripotent cell state (Chambers, 2004; Tam et al., 2006; Ying and Smith, 2003), and that BMP signaling plays a significant role in the differentiation of neural crest cells (Knecht and Bronner-Fraser, 2002). Together, these findings signify that the inferred cell states directly reflect differences in cells’ responses to perturbations and show that these cell states can also be defined by their unique responses to perturbations.

Comprehensive interrogation of gene expression through RNA sequencing is impossible without the termination of cells, providing only static snapshots of gene expression during differentiation. Despite this and the complexity of the underlying network, we discover that both cell states and the sequence of cell state transitions can be accurately determined by monitoring the levels of just a few transition or marker genes. Monitoring the expression dynamics of these key genes in live cells using microscopy will allow us in the future to continuously track the cell-fate decisions of individual cells. The inferred gene modules therefore represent the ‘order parameters’ by which cell-state transition dynamics can be directly measured. Live cell microscopy experiments will also allow us to measure, in conjunction with cell state transition dynamics, changes in individual cells’ spatial environment, movement, lineage history, and cell cycle dynamics in order to address fundamental biological questions as to how these factors affect cell fate decisions. Finally, our results suggest that cell-to-cell heterogeneity within differentiating populations arises largely as a consequence of cells’ variability in their timing of cell state transitions. Our inferred cell clusters show mixing of cells from different time points (Figure 1—source data 1, Figure 2—source data 1), suggesting that the observed states themselves do not change over time and that at the population level, differentiation occurs as a change in the proportions of cells in various cell states rather than through changes in the cell states themselves (Figure 3D). Since cells interpret perturbations differently even in consecutive states (Figure 5), this suggests that heterogeneity arising from timing variability is further amplified in response to signal addition or fluctuations in gene expression level. 
These findings emphasize the importance of understanding how the timing of cell state transitions is controlled during development.

Materials and methods

Clustering and re-clustering using Seurat

Clustering was performed using Seurat (Satija et al., 2015). For the initial seed clustering, we applied Seurat to the gene expression of all 2672 transcription factors for the 288 single cells. For subsequent re-clustering steps, clustering was performed on a reduced set of genes for which $p (α_{i} = 1 \text{ or } β_{i} = 1 \mid \{g_{i}^{A, B, C}\}, T, \{C\}) > 0.5$ for at least one triplet at the previous iteration (assuming a prior odds of $𝒪_{β \mid T} (i) = 5 \times 10^{- 2}$ ). This reduced set contained between 800 and 1050 genes at each of the re-clustering steps (Figure 2—figure supplement 2A).

Seurat performs spectral t-SNE on the statistically significant principal components (PCs) of the gene expression dataset, and it determines the significance of each PC score using a randomization approach developed by Chung and Storey (Chung and Storey, 2015). Our initial seed clustering was performed using the first 10 PCs; subsequent re-clusterings used the first 8 PCs.

Finally, Seurat performs density-based clustering on the t-SNE map; we used a density parameter of G = 8 (Macosko et al., 2015).

Convergence of clustering configurations from different seed configurations

In order to test that our results were robust to the choice of seed clusters, we further used k-means clustering, a standard clustering method, which has previously been applied to identify different cell types using single-cell transcriptomics data (Buettner et al., 2015).

We started with a seed clustering configuration of 12 clusters $\{c_{1}^{0}, c_{2}^{0}, \dots, c_{12}^{0}\}$ obtained using k-means clustering, which is distinct from the seed clustering configuration obtained via Seurat (Satija et al., 2015). The number of clusters was determined using the gap statistic (Tibshirani et al., 2001). We obtained 164 sets of transitions between clusters and identified 981 transcription factors that were high probability (probability >0.5) marker or transition genes for at least one of the identified transitions. We next re-clustered the single cells in the gene expression space defined by these 981 marker or transition genes, using k-means clustering, to obtain a new cluster set $\{C_{1}\} = \{c_{1}^{1}, c_{2}^{1}, \dots, c_{10}^{1}\}$ , consisting of 10 clusters. In the next iteration, the number of clusters went down to 9, and so on. By iteratively determining the most likely sets of transitions and the corresponding most likely marker and transition genes, and then re-clustering the cells within the subspace of these genes, our algorithm converged upon the most likely set of cell clusters (Figure 2A). We found that the eventual clustering configurations obtained using k-means clustering and Seurat are the same, confirming that the seed clusters do not affect the final outcome (Figure 2—figure supplement 2A, B and C).

Framework for quantitative modeling of germ layer differentiation

Classifying genes based on their patterns of expression along the inferred lineage tree rather than by gene-gene correlations allowed us to identify gene modules (which included the transition and marker genes we inferred as well as signaling genes: BMP, WNT, LIF, see Tables S4 and S5) with similar expression patterns in successive cell-fate decisions.

Determination of gene modules

We obtained 321 transcription factors from the triplets along the tree and classified them based on their pattern across the triplets. To explain the discretization procedure, let us consider the example of Otx2, which is a transition gene for the triplet involving the C₁, C₂ and C₃ clusters, where C₁ is the intermediate cluster. Since Otx2 is expressed at high levels in clusters C₁ and C₃ and is downregulated in cluster C₂, we assigned it a value of 1 in clusters C₁ and C₃ and 0 in cluster C₂. We then repeated this local binarization process across all triplets along the lineage tree. We grouped all the genes that showed the same locally binarized expression pattern as Otx2 and obtained their average expression level across all the other clusters. Subsequently, we assigned these genes a value of 1 in a cluster if the average expression of these genes in that cluster was comparable to (within ~10% of the mean) or higher than the lower value of their average expression level in the C₁ and C₃ clusters. Some genes, such as Oct4 and Etv5, are re-used at multiple branching points, i.e. they belong to multiple triplets, either as marker genes or transition genes, and hence belong to different groups (Figure 4—figure supplement 1). Certain genes that are re-used exhibit three distinct levels of expression. For instance, Sox2 comes up as a marker gene for the C₀ cluster when we consider the triplets involving clusters C₀, C₁ and C₂, and clusters C₀, C₁ and C₃. However, it also acts as a transition gene for the triplet involving the C₁, C₂ and C₃ clusters, where Sox2 is downregulated in C₂. Such a gene expression pattern would require three distinct levels (high in C₀, medium in C₁ and C₃, and low in C₂). We classified the medium and high expression levels as 1 and the low expression level as 0. Note that we determined binary gene expression profiles by calculating the mean log2 fold-change in expression level for each group of genes. 
This way we acquired a total of 29 modules with unique binary gene expression profiles. We denote each module by a representative gene; the genes that belong to each module are shown in Figure 4—source data 1.
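As an illustration only, the thresholding rule described above could be sketched as follows. The function name `binarize_module`, the reading of "within ~10%" as a 10% tolerance below the reference level, and all numerical values are assumptions made for this sketch and are not taken from the data; the sketch also assumes positive expression values.

```python
def binarize_module(mean_expr, on_clusters, tolerance=0.10):
    """Binarize a module's mean expression across clusters.

    mean_expr   : dict mapping cluster name -> mean expression of the module
    on_clusters : clusters in which the module is 'on' in its defining triplet
    tolerance   : assumed reading of 'within ~10%' of the reference level
    """
    # Reference level: the lower of the mean levels in the 'on' clusters.
    reference = min(mean_expr[c] for c in on_clusters)
    cutoff = reference * (1.0 - tolerance)
    # A cluster is scored 1 if its mean level is comparable to or above the reference.
    return {c: int(x >= cutoff) for c, x in mean_expr.items()}


# Hypothetical mean levels for an Otx2-like module (on in C1 and C3, off in C2).
toy_levels = {"C0": 1.0, "C1": 5.0, "C2": 0.5, "C3": 4.0, "C4": 3.7, "C5": 2.0}
pattern = binarize_module(toy_levels, on_clusters=["C1", "C3"])
print(pattern)
```

In this toy case the reference level is the C₃ mean, so a cluster such as C₄ that lies just below it is still scored as 1, whereas clusters well below it are scored as 0.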

Local-field gene regulatory network model for gene modules

In order to build a quantitative model relating the gene modules, we write an N-component gene regulatory network governed by a set of differential equations:

{\dot{m}}_{i} = - \frac{m_{i}}{τ_{i}} + r_{i}^{0} + r_{i} (\vec{m}), \quad i = 1, \dots, N

(1)

where $τ_{i}$ and $r_{i}^{0}$ are respectively the lifetime and basal production rate of module $i$ ; we will rescale $τ_{i}$ = 1 and $r_{i}^{0}$ = 0 without any loss of generality. We denote the level of module $i$ as $m_{i}$ . We assume here that modules interact only by modulating each other’s rate of production, described here by rate functions $r_{i} (\vec{m})$ which depend on the state $\vec{m} = [m_{1}, \dots, m_{N}]$ of the gene regulatory network.

As above, we consider that the production rate $r_{i} (\vec{m})$ is the result of only direct interactions, in which each gene j exerts a drive on gene i which is equal to an interaction strength $J_{i j}$ (positive or negative) multiplied by the level of module j. The total drive $ϕ_{i}$ on gene i is the sum of the drives from the different modules:

ϕ_{i} (\vec{m}) = \sum_{j = 1}^{N} J_{i j} m_{j}

(2)

We now assume $r_{i}$ has a universal scaling form that is the same for all factors,

r_{i} (\vec{m}) = r [μ (ϕ_{i} - ϕ_{0})]

(3)

where $r (ϕ; ϕ_{0}, μ)$ is a monotonic sigmoidal function centered at $ϕ_{0}$ and bounded by the limits

r (ϕ) = \begin{cases} 0, & ϕ ≪ ϕ_{0} \\ 1, & ϕ ≫ ϕ_{0} \end{cases}

(4)

The sharpness of the crossover is determined by the nonlinearity parameter $μ$ . The upper bound of $r_{i} = 1$ sets the maximum sustainable expression at $m_{i} = 1$ . In the limit $μ \to \infty$ , $r (ϕ)$ becomes the Heaviside step function, and $m_{i} \in \{0, 1\}$ is binary.
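A minimal numerical sketch of Equations 1–4 is given below, assuming a logistic function as one possible choice for the sigmoid $r$ (the text only requires it to be monotonic and bounded between 0 and 1) and a forward-Euler integrator. The two-module coupling matrix, the value of $μ$, the step size, and the function names are hypothetical and chosen only for illustration.

```python
import math


def rate(phi, phi0=0.1, mu=100.0):
    """Logistic sigmoid centred at phi0: an assumed form for r in Eq. 3."""
    return 1.0 / (1.0 + math.exp(-mu * (phi - phi0)))


def simulate(J, m0, steps=2000, dt=0.01):
    """Forward-Euler integration of dm_i/dt = -m_i + r(phi_i), with tau_i = 1 and r_i^0 = 0."""
    m = list(m0)
    n = len(m)
    for _ in range(steps):
        # Total drive on each module (Eq. 2).
        phi = [sum(J[i][j] * m[j] for j in range(n)) for i in range(n)]
        m = [m[i] + dt * (-m[i] + rate(phi[i])) for i in range(n)]
    return m


# Hypothetical two-module network with mutual activation.
J_toy = [[0.0, 0.5],
         [0.5, 0.0]]
high = simulate(J_toy, [0.9, 0.8])   # starts near the 'on' state
low = simulate(J_toy, [0.05, 0.0])   # starts near the 'off' state
print(high, low)
```

With a large $μ$, trajectories started near the "on" state relax toward $m_{i} ≈ 1$ and those started near the "off" state relax toward $m_{i} ≈ 0$, which is the sense in which the levels become effectively binary.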

Suppose state ${\vec{m}}^{α} = \{m_{1}^{α}, \dots, m_{29}^{α}\}$ with expression level $m_{i}^{α}$ in module $i$ is a stable state of the network. In the limit $μ \to \infty$ , the condition for ${\vec{m}}^{α}$ to be a fixed point is:

m_{i}^{α} = H \left( \sum_{j} J_{i j} m_{j}^{α} - ϕ_{0} \right), \quad m_{i}^{α}, m_{j}^{α} \in \{0, 1\}

(5)

where $H$ is the Heaviside step function. (Note that if $ϕ_{0}$ >0 then $\vec{m} = \vec{0}$ is always a stable fixed point of the network.)

In this limit, each state ${\vec{m}}^{α}$ of the network is associated with N constraints given by inequalities of the form

m_{i}^{α} = 0 \Rightarrow \sum_{j} J_{i j} m_{j}^{α} < ϕ_{0}

(6)

m_{i}^{α} = 1 \Rightarrow \sum_{j} J_{i j} m_{j}^{α} > ϕ_{0}

(7)

If ${\vec{m}}^{α}$ is a fixed point, all N of its constraints must hold. If we know the fixed points of the network, then we can write down a system of inequalities that constrain possible values for $J_{i j}$ . Since gene-gene interactions cannot be infinitely strong, $J_{i j}$ must be bounded. We take $| J_{i j} | < 1$ and $ϕ_{0} = 0.1$ . We further vary the value of the critical drive $ϕ_{0}$ from −2 to 2 to check the robustness of the predictions. We find that all the results qualitatively hold although the individual probabilities change.
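The way each fixed point translates into N linear inequalities on $J_{i j}$ (Equations 6 and 7) can be sketched as follows. The helper names and the three-module coupling matrix are hypothetical; only the form of the inequalities and the value $ϕ_{0} = 0.1$ come from the text.

```python
def fixed_point_constraints(state):
    """Return one constraint per module i as (i, state, sign).

    sign = +1 encodes sum_j J_ij m_j > phi0 (module on, Eq. 7);
    sign = -1 encodes sum_j J_ij m_j < phi0 (module off, Eq. 6).
    """
    return [(i, tuple(state), 1 if s == 1 else -1) for i, s in enumerate(state)]


def satisfies(J, constraints, phi0=0.1):
    """Check whether a coupling matrix J obeys every listed inequality."""
    for i, m, sign in constraints:
        drive = sum(J[i][j] * m[j] for j in range(len(m)))
        if sign * (drive - phi0) <= 0:
            return False
    return True


# Hypothetical three-module network: modules 0 and 1 activate each other
# and mutually repress module 2.
J_toy = [[0.5, 0.5, -0.5],
         [0.5, 0.5, -0.5],
         [-0.5, -0.5, 0.5]]

ok_a = satisfies(J_toy, fixed_point_constraints((1, 1, 0)))
ok_b = satisfies(J_toy, fixed_point_constraints((0, 0, 1)))
ok_c = satisfies(J_toy, fixed_point_constraints((1, 0, 1)))
print(ok_a, ok_b, ok_c)
```

Each inequality involves only one row of $J$, and the state vector enters as fixed coefficients, which is why the full set of constraints is linear in $J_{i j}$.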

Linear programming

The constraints (6) and (7) placed on $J_{i j}$ by the fixed point condition are linear in $J_{i j}$ . We can take advantage of this fact and use linear programming methods (Gass, 2013) to obtain solutions for $J_{i j}$ by extremizing a linear objective function of the form

U (J_{i j}) = \sum_{i, j} a_{i j} J_{i j} = \text{constant}

(8)

where $a_{i j}$ are constant coefficients. The system of constraints defines an $N^{2}$ -dimensional polytope in $J$ -space that encloses all solutions of $J_{i j}$ consistent with the fixed-point constraints, and $U$ defines an $(N^{2} - 1)$ -dimensional hyperplane. Linear programming returns a solution for $J_{i j}$ (a point in $J$ -space) where the polytope contacts a $U$ -plane of extremal value. The solution will lie on the boundary of the polytope and is in general non-unique. There is no general principle with which to select any specific $U$ -plane as the ‘best’ objective function. Furthermore, one would like to sample points in the interior of the polytope, and not just on its surface. Here, guided by the fact that we seek perturbative solutions for $J_{i j}$ that ideally lie close to the origin, we impose a fictitious additional constraint on the polytope in the form of a hyperplane that contains the origin

\sum_{i, j} a_{i j} J_{i j} \leq 0, \quad a_{i j} \in \{0, 1\}

(9)

where the coefficients $a_{i j}$ are randomly chosen; this in effect slices the polytope in two and exposes an interior plane. Then, using the same choices of $a_{i j}$ to define a $U$ -plane, we seek a linear programming solution that maximizes $U$ , that is, a solution that lies on the now-exposed interior plane (if possible). Because these fictitious constraints radiate from the origin, points in the polytope that lie closest to the origin are sampled more densely.
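To make the idea of sampling the feasible polytope concrete, the sketch below draws coupling matrices that satisfy the fixed-point constraints for a toy network. Note that it swaps in plain rejection sampling for the linear-programming scheme described above: rows of $J$ are drawn uniformly from $(-1, 1)$ and kept only if they satisfy the inequalities, so the sampling density differs from that of the random-hyperplane method. The function names, the toy states, and the sample size are hypothetical.

```python
import random


def row_ok(row, i, states, phi0=0.1):
    """Check the fixed-point inequalities for module i in every required state."""
    if not all(abs(x) < 1.0 for x in row):
        return False
    for m in states:
        drive = sum(Jij * mj for Jij, mj in zip(row, m))
        if m[i] == 1 and not drive > phi0:
            return False
        if m[i] == 0 and not drive < phi0:
            return False
    return True


def sample_network(states, phi0=0.1, rng=random, max_tries=100000):
    """Draw one J satisfying all constraints by row-wise rejection sampling.

    Each inequality involves a single row of J, so rows can be drawn
    independently of one another.
    """
    n = len(states[0])
    J = []
    for i in range(n):
        for _ in range(max_tries):
            row = [rng.uniform(-1.0, 1.0) for _ in range(n)]
            if row_ok(row, i, states, phi0):
                J.append(row)
                break
        else:
            raise RuntimeError("no feasible row found for module %d" % i)
    return J


# Two hypothetical fixed points of a three-module network.
toy_states = [(1, 1, 0), (0, 0, 1)]
rng = random.Random(0)
ensemble = [sample_network(toy_states, rng=rng) for _ in range(100)]

# Ensemble mean of one coupling, as a simple summary statistic.
mean_J01 = sum(J[0][1] for J in ensemble) / len(ensemble)
print(mean_J01)
```

Rejection sampling is only practical here because the toy network is tiny; for 29 modules and nine states a constrained optimization approach such as the one described in the text would be far more efficient.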

Common features of the sampled networks

By using many different randomly generated fictitious constraints to sample the polytope, we can study the ensemble of model networks that all satisfy the fixed point constraints (Figure 4—source data 2), and attempt to determine whether they share any common regulatory motifs. As discussed in the main text, we sampled 10,000 solutions $J_{i j}$ that satisfied the fixed-point constraints defined by the binarized expression patterns of the known cell states. We then calculated the mean and coefficient of variation (c.v.) for each coupling. We were thus able to discover a core network between the different modules that is shared by the majority of solutions (Figure 4A).

Predictions for Sox2 and Snai1 overexpression

Our model makes predictions for what happens to the level of Oct4 when Sox2 and Snai1 are overexpressed in different cell states. Sox2 and Oct4 are both present in the C₀ and C₁ clusters. On the other hand, Snai1 is not present in C₁ and C₂ but Oct4 is present in both clusters. We perturb the Sox2 and Snai1 levels by amounts ∆s in the above-mentioned states, which leads to a change in the total drive $ϕ_{i}$ on the Oct4 level. Numerically, we vary ∆s in steps of 0.1 and for each step compute the number of models, out of the 10,000 total models, for which the Oct4 level decreases to zero. From this number we obtain the fraction of models for which the level of Oct4 goes down.
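The counting procedure can be sketched as below, under the assumption that "the Oct4 level decreases to zero" is evaluated by a single application of the threshold condition on the Oct4 module with all other modules held at their levels in the starting state. The function name, the two-module toy ensemble (index 0 standing for an Oct4-like module, index 1 for a Sox2-like module), and all coupling values are hypothetical.

```python
def fraction_target_off(ensemble, state, target, perturbed, delta, phi0=0.1):
    """Fraction of sampled networks in which the target module switches off
    when the perturbed module's level is raised by delta."""
    m = list(state)
    m[perturbed] += delta
    n_off = 0
    for J in ensemble:
        # Total drive on the target module after the perturbation.
        drive = sum(J[target][j] * m[j] for j in range(len(m)))
        if drive < phi0:
            n_off += 1
    return n_off / len(ensemble)


# Hypothetical ensemble of four 2x2 coupling matrices; in every one of them
# the unperturbed state (1, 1) is a fixed point.
toy_ensemble = [
    [[0.5, 0.3], [0.3, 0.5]],
    [[0.5, -0.3], [0.3, 0.5]],
    [[0.9, -0.45], [0.3, 0.5]],
    [[0.35, -0.1], [0.3, 0.5]],
]

deltas = [k / 10 for k in range(11)]  # perturbation sizes in steps of 0.1
fractions = [fraction_target_off(toy_ensemble, (1, 1), 0, 1, d) for d in deltas]
print(list(zip(deltas, fractions)))
```

For a module that is absent in the starting state (as Snai1 is in C₁ and C₂), the same function applies with the perturbed level rising from 0 to ∆s.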

Predictions for BMP and LIF addition

In order to predict the effect of morphogen signals in different cell states, we considered the LIF, BMP, WNT, and FGF signaling pathways, which are known to play a significant role in patterning the early embryo. We assumed that no single gene in each given pathway is sufficient to evoke a signaling response, but a response rather requires the combined presence of the various constituent genes of the pathway. We therefore grouped genes by their respective signaling pathways and assigned each group to a module based on its average expression pattern across the nine cell states. The discretization process of this mean expression pattern was the same as that used for TF genes. The signaling genes we used are shown in Figure 4—source data 1.

We next modeled the dynamics of BMP and LIF addition. By construction, the nine observed cell states (and the null state $\vec{m} = \vec{0}$ ) are fixed points for all 10,000 sampled solutions for $J_{i j}$ . However, each solution $J_{i j}$ may have additional spurious fixed points. Given that we only see nine cell states, we would expect these spurious states to be unstable. In order to overcome this problem, we used the following method.

Given a particular solution $J_{i j}$ , any arbitrary state of the network $\vec{m}$ (not necessarily a fixed point) will have dynamics obeying

m_{i} (t + 1) = H (\sum_{j} J_{i j} m_{j} (t) - ϕ_{0})

(10)

where $m_{i} (t)$ and $m_{i} (t + 1)$ are the levels of module $i$ at successive discretized time points.
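The discrete update rule of Equation 10 can be sketched as a synchronous threshold update that is iterated until the state stops changing. The function names and the three-module coupling matrix are hypothetical; because synchronous updates can in principle cycle, the sketch stops after a fixed number of steps and returns `None` if no fixed point is reached.

```python
def step(J, m, phi0=0.1):
    """One synchronous update: m_i(t+1) = H(sum_j J_ij m_j(t) - phi0)."""
    n = len(m)
    return tuple(
        1 if sum(J[i][j] * m[j] for j in range(n)) > phi0 else 0
        for i in range(n)
    )


def run_to_fixed_point(J, m, phi0=0.1, max_steps=100):
    """Iterate the update until the state no longer changes."""
    m = tuple(m)
    for _ in range(max_steps):
        nxt = step(J, m, phi0)
        if nxt == m:
            return m
        m = nxt
    return None  # no fixed point found within max_steps


# Hypothetical network with fixed points (1, 1, 0) and (0, 0, 1).
J_toy = [[0.5, 0.5, -0.5],
         [0.5, 0.5, -0.5],
         [-0.5, -0.5, 0.5]]

end_a = run_to_fixed_point(J_toy, (1, 0, 0))
end_b = run_to_fixed_point(J_toy, (0, 0, 1))
end_c = run_to_fixed_point(J_toy, (1, 0, 1))
print(end_a, end_b, end_c)
```

In this toy network an arbitrary starting state flows either to one of the two designed fixed points or to the null state.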

For each particular solution $J_{i j}$ , cells may get stuck in spurious fixed points; yet these spurious fixed points are highly unlikely to exist, since they are stable in only a small number of the sampled $J_{i j}$ . We can capture the average dynamics of different states of the network given the set of sampled solutions $\{J_{i j}\}$ by calculating the probability over all sampled solutions of moving from one arbitrary state ${\vec{m}}^{a}$ to another arbitrary state ${\vec{m}}^{b}$ . This allows us to define a 2²⁹ × 2²⁹ state-to-state transition matrix $𝒯$ :

𝒯_{b \leftarrow a} = p ({\vec{m}}^{a} \to {\vec{m}}^{b} \mid \{J_{i j}\})

(11)

If we denote as $\vec{p} (t)$ the vector of probabilities of being in the 2²⁹ different states at time $t$ , then

\vec{p} (t + 1) = 𝒯 \vec{p} (t)

(12)

In order to determine how cells in different states respond to BMP and LIF addition, we calculated the probability of moving between fixed points ${\vec{m}}^{α}$ and ${\vec{m}}^{β}$ when overexpressing some set of modules $\{m_{i}\}$ . We calculated the dynamics using the transition matrix $𝒯$ and enforced the overexpression of the set of modules (the BMP and LIF modules, respectively) at each time point, updating the probabilities $\vec{p} (t)$ accordingly. The probabilities shown in Figure 4 are after 1000 time steps.
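The ensemble-averaged dynamics of Equations 11 and 12 with an enforced module can be sketched for a toy network as follows. The sketch assumes that "enforcing overexpression" means resetting the clamped module to 1 after every update, and it enumerates all $2^{N}$ states explicitly, which is feasible only for very small N and not for the 29 modules used here. The function names and the two coupling matrices are hypothetical.

```python
from itertools import product


def step(J, m, phi0=0.1):
    """Synchronous threshold update of Eq. 10."""
    n = len(m)
    return tuple(int(sum(J[i][j] * m[j] for j in range(n)) > phi0) for i in range(n))


def transition_matrix(ensemble, n, phi0=0.1):
    """T[a][b] = fraction of sampled networks that map state a to state b."""
    T = {}
    for a in product((0, 1), repeat=n):
        row = {}
        for J in ensemble:
            b = step(J, a, phi0)
            row[b] = row.get(b, 0.0) + 1.0 / len(ensemble)
        T[a] = row
    return T


def clamp(state, clamped):
    """Force the clamped modules to level 1."""
    return tuple(1 if i in clamped else s for i, s in enumerate(state))


def propagate(T, p0, clamped, steps=1000):
    """Iterate p(t+1) = T p(t), re-imposing the clamp after every step."""
    p = {}
    for s, w in p0.items():
        s = clamp(s, clamped)
        p[s] = p.get(s, 0.0) + w
    for _ in range(steps):
        nxt = {}
        for a, w in p.items():
            for b, t in T[a].items():
                b = clamp(b, clamped)
                nxt[b] = nxt.get(b, 0.0) + w * t
        p = nxt
    return p


# Two hypothetical networks sharing the fixed points (1, 1, 0) and (0, 0, 1);
# module 2 plays the role of the added signal.
J_a = [[0.5, 0.5, -0.5], [0.5, 0.5, -0.5], [-0.5, -0.5, 0.5]]
J_b = [[0.5, 0.5, -0.95], [0.5, 0.5, -0.95], [-0.5, -0.5, 0.5]]
T = transition_matrix([J_a, J_b], n=3)
final = propagate(T, {(1, 1, 0): 1.0}, clamped={2})
print(final)
```

Because the transition probabilities are averaged over the ensemble at every step, a transition supported by only a fraction of the sampled networks still accumulates probability over many steps, which is one consequence of describing the average dynamics with a single matrix $𝒯$.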

ES-cell culture

v6.5 (RRID: CVCL_C865; passage number 18–30; mycoplasma tested negative) mouse embryonic stem cells were maintained and passaged in monolayer (non-embryoid body formation) in N2B27 basal media with signaling molecules and/or small molecules added to the basal media. ES cells were maintained in a pluripotent cell state using 1200 U/mL mLIF (murine leukemia inhibitory factor), 1 μM PD0325901 (MEK inhibitor), and 3 μM CHIR99021 (GSK inhibitor) (a.k.a. 'LIF + 2i’; Ying et al., 2008), and passaged every two days. To passage cells, we aspirated the media, added 0.01% trypsin, and incubated the plate at 37°C for 1–2 min to detach cells. The trypsin was then quenched with 0.5 mL of fetal bovine serum, and the resulting cell suspension was collected, counted, and pelleted at 200 × g for 5 min at room temperature. The supernatant was aspirated and the cells were resuspended and re-seeded onto a gelatinized tissue culture dish at a density of 10⁶ cells per 10 cm diameter plate. All cell lines were depleted of feeders and transitioned to serum-free medium over several passages prior to experiments (Ying and Smith, 2003). N2B27 was prepared as described in Gaspard et al. (2008) and Ying and Smith (2003).

ES cell differentiation

Cells were seeded at a density of 10⁶ per 10 cm diameter plate, and were not trypsinized again until they were harvested for analysis. We either exposed cells to 0.4 μM PD0325901 or 3 μM CHIR99021 and 10 ng/mL Activin A (human, rat, mouse) for 2 days or 3 days, respectively, followed by either 25 ng/mL hBmp4 or 1 μM LDN193189 (BMP antagonist) for up to two days. Media was replenished every 48 hr. Cells exposed to 0.4 μM PD0325901 gave rise to ectodermal lineages, as characterized by expression of Sox1, Pax6 (treated with LDN193189), Slug, and Msx2 (treated with hBmp4) after three days of differentiation. Cells exposed to CHIR99021 and Activin A gave rise to mesendodermal lineages (Sumi et al., 2008), as characterized by expression of T after three days of differentiation, and FoxA2 (treated with LDN193189) and Gata4 (treated with hBmp4) after four days of differentiation.

Single-cell RNA-Seq

CEL-seq libraries were prepared as previously reported (Hashimshony et al., 2012) with a few modifications. Single cells were sorted with a FACSAria into 96 well plates containing 1.2 µL 2 × CellsDirect Buffer (Life Technologies) with 0.1 µL of ERCCs diluted to 1 × 10⁻⁶ molecules (Life Technologies). Plates were frozen and stored at −80°C. For library preparation, mRNA was reverse transcribed using 0.15625 pmol of oligoT primer carrying a cell-specific 8 nt barcode and a 5 nt unique molecular identifier (UMI) (Islam et al., 2014). Barcode design ensured at least two nucleotide differences from any other barcode. Samples were lysed at 70°C for 5 min, then reverse transcribed using Superscript III for two hours at 50°C, after which primers were digested with 1 µL of ExoSAP-IT (Affymetrix). Second strand synthesis was carried out with Second Strand Synthesis Buffer, dNTPs, DNA Polymerase, and RNAse H (NEB) at 16°C for 2 hr. Single-cell cDNAs were pooled by 24 wells per library, with each library containing a water-only well and one ERCC-only well. Pools were purified with an equal volume of RNA Clean Beads (Beckman Coulter) and amplified at 37°C for 15 hr using the HiScribe T7 High Yield RNA Synthesis kit (NEB), and treated with DNAse I (Life Technologies). Amplified RNA was fragmented using the NEBNext RNA Fragmentation Module (NEB), purified with an equal volume of RNA Clean Beads, and visualized using the RNA Pico Kit on the Bioanalyzer 2100 (Agilent). The RNA fragments were repaired with Antarctic Phosphatase and Polynucleotide Kinase (NEB), and purified using an equal volume of RNA Clean Beads. cDNA libraries were made using the NEBNext Small Library Prep Kit according to the manufacturer’s instructions, except Superscript III was used for the RT step. Index primers were used in PCR amplification. 
Approximately 160–200 nmol of a pool of libraries were size selected to exclude species smaller than 180 bp on a 2% Dye Free cassette on the Pippin Prep (Roccio et al., 2013) and concentrated to approximately 14 µL. Pools were then quantified by qRT-PCR using p5 (5’-AATGATACGGCGACCACCGAGA-3’) and p7 (5’-CAAGCAGAAGACGGCATACGAGAT-3’) primers and by Bioanalyzer (DNA High Sensitivity Kit, Agilent), and sequenced on an Illumina HiSeq. The custom sequencing primer: 5’-TCTACACGTTCAGAGTTCTACAGTCCGACGATC-3’ was included with Illumina primer HP10 for sequencing. Standard Illumina primers HP12 and HP11 were used for the index read and the transcript read, respectively. PE50 kits (Illumina) were used for sequencing with read lengths of 25 nt, six nt, and 47 nt for read1 (cell barcode, UMI), index (library), and read2 (transcript), respectively. Following quantification, we discarded the data from wells that yielded below a total of 20,000 UMI (threshold based on empty well controls), which left us with 358 cells. Further, as others have recognized (Paul et al., 2015), we found that some well-to-well mixing was present with CEL-Seq multiplexed single-cell RNA-Seq. We used the data only from 288 cells because of this mixing artifact. The raw and processed RNA-seq data for the 288 cells can be found on the GEO database (accession number: GSE105054).

Immunofluorescence

Cells were grown on ibidi µ-bottom plates and fixed with 4% paraformaldehyde. Cells were permeabilized with ice-cold 100% methanol, blocked with 5% donkey serum, incubated with primary antibody, washed, and incubated with DAPI and secondary antibody coupled to Alexa488, Alexa568, or Alexa647. Images were acquired with a Zeiss 40× plan apo objective (NA 1.3) with the appropriate filter sets. Data were analyzed using custom-written code in MATLAB. Antibodies and dilutions used in this study: Klf4 (Abcam ab129473, 1:400); Nanog (eBiosciences 14–5761, 1:800); Oct4 (Santa Cruz sc-8628, 1:800; Cell Signaling 2840, 1:400); Sox2 (eBiosciences 14–9811, 1:800); Otx2 (Neuromics GT15095, 1:400); T (Brachyury) (Santa Cruz sc-17745, 1:200); FoxA2 (Cell Signaling 8186, 1:400); Gata4 (eBiosciences 14–9980, 1:400); Sox1 (Cell Signaling 4194, 1:200); Pax6 (DSHB Pax6, 1:200); Msx1+2 (DSHB 4G1, 1:200); Slug (Cell Signaling 9585, 1:200); Snai1 (Cell Signaling 2879, 1:200).

Live-cell microscopy

For live-cell time-lapse microscopy, cells were plated into N2B27 without phenol-red (plus signaling molecules and small molecules) on ibidi µ-bottom plates. Cells were imaged on a Zeiss Axiovision inverted microscope with a Zeiss 40× plan apo objective (NA 1.3), the appropriate filter sets, and an Orca-Flash 4.0 camera (Hamamatsu). The microscope was enclosed in an environmental chamber in which CO₂ and temperature were regulated at 5% and 37°C, respectively. Images were acquired every 15 min for 12–48 hr. Image acquisition was controlled by Zen (Zeiss); image analysis was done with ImageJ (NIH) and MATLAB (MathWorks). Filter sets were from Zeiss: 38 HE GFP, 43 HE DsRed, 46 HE YFP, 47 HE CFP, 49 DAPI, and 50 Cy5. Transition duration of Otx2-mCitrine cells was defined as the time between the last image at which a cell’s reporter intensity was equal to or below its intensity at t = 1 and the first image at which its intensity was equal to or above 2.2 (mean − σ of the upper mode of Otx2 reporter intensity) on the normalized scale.
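The transition-duration definition above can be sketched in Python as follows. This is an illustrative reading of the definition, not the authors' MATLAB analysis. The function name `transition_duration` is hypothetical, and the sketch assumes that the "last image at or below the initial intensity" is searched for only among frames preceding the first frame at or above the upper threshold. The 2.2 threshold and the 15 min frame interval are taken from the text.

```python
import numpy as np

def transition_duration(trace, upper=2.2, frame_interval_min=15.0):
    """Return the transition duration in minutes, or None if no transition is seen.

    trace: 1-D sequence of normalized reporter intensities, one value per frame.
    upper: upper threshold on the normalized scale (2.2 in the text).
    """
    trace = np.asarray(trace, dtype=float)
    baseline = trace[0]                       # intensity at the first time point
    above = np.flatnonzero(trace >= upper)    # frames at or above the upper threshold
    if above.size == 0:
        return None                           # the cell never reaches the upper mode
    first_high = above[0]
    # Assumption: only frames before the first high frame are considered.
    low = np.flatnonzero(trace[:first_high] <= baseline)
    if low.size == 0:
        return None                           # the cell starts at or above the threshold
    last_low = low[-1]
    return (first_high - last_low) * frame_interval_min
```

For a trace such as `[1.0, 0.9, 1.2, 0.95, 1.5, 2.0, 2.3, 2.5]`, the last frame at or below baseline is frame index 3 and the first frame at or above 2.2 is frame index 6, so the function returns 45 minutes (three 15 min intervals).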

Plasmid transfection

We cloned Sox2 or Snai1 cDNA to one side of a bi-directional Tet-on promoter (pTRE3G-BI; Clontech), to the other side of which we had cloned mCerulean cDNA. Mini-prepped plasmid was ethanol-precipitated to further concentrate it and remove any possible endotoxins. For Sox2 overexpression, cells were seeded at 100,000 cells per 35 mm diameter plate in 2 mL of either LIF+2i conditions or differentiation media (0.4 μM PD0325901 or 3 μM CHIR99021) for 1 day. 200 μL of FBS was then added to each plate and 1.8 μg of plasmid was transfected using 5.4 μL of JetPrime (Polyplus). Cells were incubated for 12 hr, then washed with PBS and replenished with fresh LIF+2i or differentiation media. We then added 3 μL of Tet-Express mixed with 2.5 μL of Intensifier reagent (Clontech). Cells were incubated in induction media for 24 hr, after which they were harvested and fixed with 4% paraformaldehyde. Following fixation, they were permeabilized with ice-cold 100% methanol and rehydrated with 1% BSA. Cells were then stained for Oct4, Otx2 and Sox2 and analyzed using flow cytometry. For Snai1 overexpression, cells were seeded at 100,000 cells per 35 mm diameter plate in 2 mL of 3 μM CHIR99021 for 2.5 days. 200 μL of FBS was then added to each plate and 1.8 μg of plasmid was transfected using 5.4 μL of JetPrime (Polyplus). Cells were incubated in transfection media for 12 hr, then washed with PBS and replenished with fresh N2B27 basal media. We then added 3 μL of Tet-Express mixed with 2.5 μL of Intensifier reagent (Clontech). Cells were incubated in induction media for 24 hr, after which they were harvested and fixed with 4% paraformaldehyde. Following fixation, they were permeabilized with ice-cold 100% methanol and rehydrated with 1% BSA. Cells were then stained for Oct4 and T and analyzed using flow cytometry.

Fluorescence-activated cell sorting

Cells were trypsinized and fixed in suspension with formaldehyde (4% final concentration, diluted in PBS), permeabilized with ice-cold 100% methanol and blocked with 5% donkey serum for 1 hr. Finally, cells were stained with primary antibodies diluted in PBS containing 1% BSA, and detected using fluorescent-tagged secondary antibodies. Flow cytometry was performed on a BD FACSAria flow cytometer equipped with 355 nm, 405 nm, 488 nm, 561 nm, and 637 nm lasers. The data acquired were analyzed using custom programs written in MATLAB.

Generation of mOTX2-Citrine reporter cell line

G4 mESCs, a 129S6 x B6 F1 hybrid line (Andras Nagy, University of Toronto), were maintained on DR4 mouse embryonic fibroblasts (MEFs). These cells (1 × 10⁷) were electroporated (Transfection Buffer, Millipore; Bio-Rad set at 250 V and 500 µF) with 5 µg of each TALEN plasmid (AI-CN301 and AI-CN302 targeting TTCCAGGTTTTGTGAAGA and TTTAAAAATCACCCACAA, respectively) and 20 µg of donor plasmid (AI-CN563). Following transfection, cells were placed on ice for 5 min, then plated onto 3 × 10 cm dishes with MEFs. Beginning 30 hr after transfection, cells were selected with hygromycin at 150 µg/mL for 3 days, then 100 µg/mL for an additional 4 days. Approximately 48 hygromycin-resistant colonies were picked and expanded for freezing and DNA preparation and analysis. Five clones were identified with targeted integration by junction PCR (5' junction primers: aagagctaagtgccgccaacagc, catcagcccgtagccgaaggtag; 3' junction primers: cacgctgaacttgtggccgttta, cagctcacctccagcccaaggta). Following expansion and fluorescence-activated cell sorting (FACS), Cerulean⁺ cells from two clones (2.1 and 2.4) were treated with Cre mRNA. After recovery and expansion, the Cerulean⁻ cells were enriched by FACS and single-cell cloned. The resulting subclones were tested for removal of the selection cassette (primers: ggtgcctattctggtcgaactggatg, atcacctctgctttgaaggccatgac). The TALENs, synthesized using the FLASH method (Reyon et al., 2012), were kindly provided by the Joung lab. Computation and modeling were performed using a cluster at Harvard University.

Software

Calculations were performed using custom-written MATLAB code (The MathWorks) on the Harvard Research Computing Odyssey cluster. Code is available at https://github.com/furchtgott/sibilant and https://github.com/sandeepc123/Gene_Regulatory_Network_Modeling. Seurat analysis was performed using the package available at https://github.com/satijalab/seurat (Macosko et al., 2015).

Acknowledgements

We thank Alex Schier, Christof Koch, Ajamete Kayakas, Joshua Levi, Carol Thomson, John Phillips, Paola Arlotta, John Calarco, Leonid Mirny and Andrew Murray for their critical feedback. SJ was funded by the Samsung Scholarship Program. We thank the Allen Institute founders, PG Allen and J Allen, and the NIH Director's Pioneer Award 5DP1MH099906-03 and National Science Foundation grant PHY-0952766 for support.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Funding Information

This paper was supported by the following grants:

Samsung to Sumin Jang.
NIH Office of the Director to Sharad Ramanathan.
Office of the Director to Sharad Ramanathan.
Allen Foundation to Sharad Ramanathan.

Additional information

Competing interests

The authors declare that no competing interests exist.

Author contributions

SJ, Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing—original draft, Writing—review and editing.

SC, Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Writing—original draft, Writing—review and editing.

LF, Data curation, Formal analysis, Investigation, Methodology, Writing—original draft, Writing—review and editing.

L-NZ, Data curation, Methodology.

AD, Validation, Methodology.

VM, Data curation, Formal analysis.

EBL, Data curation, Validation.

A-RK, Resources, Data curation.

RAM, Resources, Data curation.

LM, Resources, Data curation.

BPL, Resources, Data curation, Writing—original draft.

SR, Conceptualization, Supervision, Funding acquisition, Investigation, Methodology, Writing—original draft, Project administration, Writing—review and editing.

Additional files

Major datasets

The following dataset was generated:

Sumin Jang, Vilas Menon, Anne-Rachel Krostag, Boaz P Levi, Sharad Ramanathan. 2017. Dynamics of embryonic stem cell differentiation inferred from single-cell transcriptomics show a series of transitions through discrete cell states. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE105054. Publicly available at the NCBI Gene Expression Omnibus (accession no. GSE105054).

References

Advani M, Ganguli S. Statistical mechanics of high-dimensional inference. arXiv. 2016 1601.04650
Arnold SJ, Robertson EJ. Making a commitment: cell lineage allocation and Axis patterning in the early mouse embryo. Nature Reviews Molecular Cell Biology. 2009;10:91–103. doi: 10.1038/nrm2618.
Borgel J, Guibert S, Li Y, Chiba H, Schübeler D, Sasaki H, Forné T, Weber M. Targets and dynamics of promoter DNA methylation during early mouse development. Nature Genetics. 2010;42:1093–1100. doi: 10.1038/ng.708.
Brown L, Brown S. Zic2 is expressed in pluripotent cells in the blastocyst and adult brain expression overlaps with makers of neurogenesis. Gene Expression Patterns. 2009;9:43–49. doi: 10.1016/j.gep.2008.08.002.
Buecker C, Srinivasan R, Wu Z, Calo E, Acampora D, Faial T, Simeone A, Tan M, Swigut T, Wysocka J. Reorganization of enhancer patterns in transition from naive to primed pluripotency. Cell Stem Cell. 2014;14:838–853. doi: 10.1016/j.stem.2014.04.003.
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology. 2015;33:155–160. doi: 10.1038/nbt.3102.
Chambers I. The molecular basis of pluripotency in mouse embryonic stem cells. Cloning and Stem Cells. 2004;6:386–391. doi: 10.1089/clo.2004.6.386.
Chung NC, Storey JD. Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics. 2015;31:545–554. doi: 10.1093/bioinformatics/btu674.
Evans MJ, Kaufman MH. Establishment in culture of pluripotential cells from mouse embryos. Nature. 1981;292:154–156. doi: 10.1038/292154a0.
Fard AT, Srihari S, Mar JC, Ragan MA. Not just a colourful metaphor: modelling the landscape of cellular development using hopfield networks. Npj Systems Biology and Applications. 2016;2:16001. doi: 10.1038/npjsba.2016.1.
Furchtgott L, Melton S, Menon V, Lodato S, Ramanathan S. Discovering sparse transcription factor codes for cell states and state transitions during development. eLife. 2017:e20488. doi: 10.7554/eLife.20488.
Gadue P, Huber TL, Paddison PJ, Keller GM. Wnt and TGF-beta signaling are required for the induction of an in vitro model of primitive streak formation using embryonic stem cells. PNAS. 2006;103:16806–16811. doi: 10.1073/pnas.0603916103.
Galvagni F, Lentucci C, Neri F, Dettori D, De Clemente C, Orlandini M, Anselmi F, Rapelli S, Grillo M, Borghi S, Oliviero S. Snai1 promotes ESC exit from the pluripotency by direct repression of self-renewal genes. Stem Cells. 2015;33:742–750. doi: 10.1002/stem.1898.
Gans C, Northcutt RG. Neural crest and the origin of vertebrates: a new head. Science. 1983;220:268–273. doi: 10.1126/science.220.4594.268.
Gaspard N, Bouschet T, Hourez R, Dimidschstein J, Naeije G, van den Ameele J, Espuny-Camacho I, Herpoel A, Passante L, Schiffmann SN, Gaillard A, Vanderhaeghen P. An intrinsic mechanism of corticogenesis from embryonic stem cells. Nature. 2008;455:351–357. doi: 10.1038/nature07287.
Gass SI. Linear Programming. 5th edn. Boyd & Fraser Publishing Company; 2013.
Geula S, Moshitch-Moshkovitz S, Dominissini D, Mansour AA, Kol N, Salmon-Divon M, Hershkovitz V, Peer E, Mor N, Manor YS, Ben-Haim MS, Eyal E, Yunger S, Pinto Y, Jaitin DA, Viukov S, Rais Y, Krupalnik V, Chomsky E, Zerbib M, Maza I, Rechavi Y, Massarwa R, Hanna S, Amit I, Levanon EY, Amariglio N, Stern-Ginossar N, Novershtern N, Rechavi G, Hanna JH. Stem cells. m6A mRNA methylation facilitates resolution of naïve pluripotency toward differentiation. Science. 2015;347:1002–1006. doi: 10.1126/science.1261417.
Goller T, Vauti F, Ramasamy S, Arnold HH. Transcriptional regulator BPTF/FAC1 is essential for trophoblast differentiation during early mouse development. Molecular and Cellular Biology. 2008;28:6819–6827. doi: 10.1128/MCB.01058-08.
Hart AH, Hartley L, Sourris K, Stadler ES, Li R, Stanley EG, Tam PP, Elefanty AG, Robb L. Mixl1 is required for axial mesendoderm morphogenesis and patterning in the murine embryo. Development. 2002;129:3597–3608. doi: 10.1242/dev.129.15.3597.
Hashimshony T, Wagner F, Sher N, Yanai I. CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification. Cell Reports. 2012;2:666–673. doi: 10.1016/j.celrep.2012.08.003.
Hopfield JJ. Neurons with graded response have collective computational properties like those of two-state neurons. PNAS. 1984;81:3088–3092. doi: 10.1073/pnas.81.10.3088.
Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, Lönnerberg P, Linnarsson S. Quantitative single-cell RNA-seq with unique molecular identifiers. Nature Methods. 2014;11:163–166. doi: 10.1038/nmeth.2772.
Kanai-Azuma M, Kanai Y, Gad JM, Tajima Y, Taya C, Kurohmaru M, Sanai Y, Yonekawa H, Yazaki K, Tam PP, Hayashi Y. Depletion of definitive gut endoderm in Sox17-null mutant mice. Development. 2002;129:2367–2379. doi: 10.1242/dev.129.10.2367.
Karr JR, Sanghvi JC, Macklin DN, Gutschow MV, Jacobs JM, Bolival B, Assad-Garcia N, Glass JI, Covert MW. A whole-cell computational model predicts phenotype from genotype. Cell. 2012;150:389–401. doi: 10.1016/j.cell.2012.05.044.
Keller G. Embryonic stem cell differentiation: emergence of a new era in biology and medicine. Genes & Development. 2005;19:1129–1155. doi: 10.1101/gad.1303605.
Kim J, Chu J, Shen X, Wang J, Orkin SH. An extended transcriptional network for pluripotency of embryonic stem cells. Cell. 2008;132:1049–1061. doi: 10.1016/j.cell.2008.02.039.
Kim JK, Huh SO, Choi H, Lee KS, Shin D, Lee C, Nam JS, Kim H, Chung H, Lee HW, Park SD, Seong RH. Srg3, a mouse homolog of yeast SWI3, is essential for early embryogenesis and involved in brain development. Molecular and Cellular Biology. 2001;21:7787–7795. doi: 10.1128/MCB.21.22.7787-7795.2001.
Kim PT, Ong CJ. Differentiation of definitive endoderm from mouse embryonic stem cells. Results and Problems in Cell Differentiation. 2012;55:303–319. doi: 10.1007/978-3-642-30406-4_17.
Knecht AK, Bronner-Fraser M. Induction of the neural crest: a multigene process. Nature Reviews. Genetics. 2002;3:453–461. doi: 10.1038/nrg819.
Koch PJ, Roop DR. The role of keratins in epidermal development and homeostasis--going beyond the obvious. Journal of Investigative Dermatology. 2004;123:x–0. doi: 10.1111/j.0022-202X.2004.23495.x.
Le Douarin N, Kalcheim C. The Neural Crest. 2nd edn. Cambridge University Press; 1991. 9780521620109
Lebrecht D, Foehr M, Smith E, Lopes FJ, Vanario-Alonso CE, Reinitz J, Burz DS, Hanes SD. Bicoid cooperative DNA binding is critical for embryonic patterning in Drosophila. PNAS. 2005;102:13176–13181. doi: 10.1073/pnas.0506462102.
Li CL, Li KC, Wu D, Chen Y, Luo H, Zhao JR, Wang SS, Sun MM, Lu YJ, Zhong YQ, Hu XY, Hou R, Zhou BB, Bao L, Xiao HS, Zhang X. Somatosensory neuron types identified by high-coverage single-cell RNA-sequencing and functional heterogeneity. Cell Research. 2016;26:83–102. doi: 10.1038/cr.2015.149.
Li L, Song L, Liu C, Chen J, Peng G, Wang R, Liu P, Tang K, Rossant J, Jing N. Ectodermal progenitors derived from epiblast stem cells by inhibition of nodal signaling. Journal of Molecular Cell Biology. 2015;7:455–465. doi: 10.1093/jmcb/mjv030.
Lindsley RC, Gill JG, Kyba M, Murphy TL, Murphy KM. Canonical Wnt signaling is required for development of embryonic stem cell-derived mesoderm. Development. 2006;133:3787–3796. doi: 10.1242/dev.02551.
Lumelsky N, Blondel O, Laeng P, Velasco I, Ravin R, McKay R. Differentiation of embryonic stem cells to insulin-secreting structures similar to pancreatic islets. Science. 2001;292:1389–1394. doi: 10.1126/science.1058866.
Machta BB, Chachra R, Transtrum MK, Sethna JP. Parameter space compression underlies emergent theories and predictive models. Science. 2013;342:604–607. doi: 10.1126/science.1238723.
Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA. Highly parallel Genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161:1202–1214. doi: 10.1016/j.cell.2015.05.002.
Maetschke SR, Ragan MA. Characterizing Cancer subtypes as attractors of hopfield networks. Bioinformatics. 2014;30:1273–1279. doi: 10.1093/bioinformatics/btt773.
Merrill BJ, Pasolli HA, Polak L, Rendl M, García-García MJ, Anderson KV, Fuchs E. Tcf3: a transcriptional regulator of Axis induction in the early embryo. Development. 2004;131:263–274. doi: 10.1242/dev.00935.
Nakanishi M, Kurisaki A, Hayashi Y, Warashina M, Ishiura S, Kusuda-Furue M, Asashima M. Directed induction of anterior and posterior primitive streak by wnt from embryonic stem cells cultured in a chemically defined serum-free medium. The FASEB Journal. 2009;23:114–122. doi: 10.1096/fj.08-111203.
Nichols J, Smith A. Naive and primed pluripotent states. Cell Stem Cell. 2009;4:487–492. doi: 10.1016/j.stem.2009.05.015.
Paul F, Arkin Y, Giladi A, Jaitin DA, Kenigsberg E, Keren-Shaul H, Winter D, Lara-Astiaso D, Gury M, Weiner A, David E, Cohen N, Lauridsen FK, Haas S, Schlitzer A, Mildner A, Ginhoux F, Jung S, Trumpp A, Porse BT, Tanay A, Amit I. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell. 2015;163:1663–1677. doi: 10.1016/j.cell.2015.11.013.
Pevny LH, Sockanathan S, Placzek M, Lovell-Badge R. A role for SOX1 in neural determination. Development. 1998;125:1967–1978. doi: 10.1242/dev.125.10.1967.
Power MA, Tam PP. Onset of Gastrulation, morphogenesis and somitogenesis in mouse embryos displaying compensatory growth. Anatomy and Embryology. 1993;187:493–504. doi: 10.1007/BF00174425.
Reyon D, Tsai SQ, Khayter C, Foden JA, Sander JD, Joung JK. FLASH assembly of TALENs for high-throughput genome editing. Nature Biotechnology. 2012;30:460–465. doi: 10.1038/nbt.2170.
Roccio M, Schmitter D, Knobloch M, Okawa Y, Sage D, Lutolf MP. Predicting stem cell fate changes by differential cell cycle progression patterns. Development. 2013;140:459–470. doi: 10.1242/dev.086215.
Rojas A, De Val S, Heidt AB, Xu SM, Bristow J, Black BL. Gata4 expression in lateral mesoderm is downstream of BMP4 and is activated directly by forkhead and GATA transcription factors through a distal enhancer element. Development. 2005;132:3405–3417. doi: 10.1242/dev.01913.
Saadatpour A, Guo G, Orkin SH, Yuan G-C. Characterizing heterogeneity in leukemic cells using single-cell gene expression analysis. Genome Biology. 2014;15:1–13. doi: 10.1186/s13059-014-0525-9.
Sansom SN, Griffiths DS, Faedo A, Kleinjan DJ, Ruan Y, Smith J, van Heyningen V, Rubenstein JL, Livesey FJ. The level of the transcription factor Pax6 is essential for controlling the balance between neural stem cell self-renewal and neurogenesis. PLoS Genetics. 2009;5:e1000511. doi: 10.1371/journal.pgen.1000511.
Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nature Biotechnology. 2015;33:495–502. doi: 10.1038/nbt.3192.
Segal E, Widom J. From DNA sequence to transcriptional behaviour: a quantitative approach. Nature Reviews Genetics. 2009;10:443–456. doi: 10.1038/nrg2591.
Spitz F, Furlong EE. Transcription factors: from enhancer binding to developmental control. Nature Reviews Genetics. 2012;13:613–626. doi: 10.1038/nrg3207.
Streit A, Stern CD. Neural induction. A bird's eye view. Trends in Genetics. 1999;15:20–24. doi: 10.1016/S0168-9525(98)01620-5.
Sumi T, Tsuneyoshi N, Nakatsuji N, Suemori H. Defining early lineage specification of human embryonic stem cells by the orchestrated balance of canonical wnt/beta-catenin, activin/Nodal and BMP signaling. Development. 2008;135:2969–2979. doi: 10.1242/dev.021121.
Tada S, Era T, Furusawa C, Sakurai H, Nishikawa S, Kinoshita M, Nakao K, Chiba T, Nishikawa S. Characterization of mesendoderm: a diverging point of the definitive endoderm and mesoderm in embryonic stem cell differentiation culture. Development. 2005;132:4363–4374. doi: 10.1242/dev.02005.
Tam PP, Loebel DA, Tanaka SS. Building the mouse gastrula: signals, asymmetry and lineages. Current Opinion in Genetics & Development. 2006;16:419–425. doi: 10.1016/j.gde.2006.06.008.
Tesar PJ, Chenoweth JG, Brook FA, Davies TJ, Evans EP, Mack DL, Gardner RL, McKay RD. New cell lines from mouse epiblast share defining features with human embryonic stem cells. Nature. 2007;448:196–199. doi: 10.1038/nature05972.
Thomson M, Liu SJ, Zou LN, Smith Z, Meissner A, Ramanathan S. Pluripotency factors in embryonic stem cells regulate differentiation into germ layers. Cell. 2011;145:875–889. doi: 10.1016/j.cell.2011.05.017.
Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B. 2001;63:411–423. doi: 10.1111/1467-9868.00293.
Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotechnology. 2014;32:381–386. doi: 10.1038/nbt.2859.
Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9:2579–2605.
Vogel-Ciernia A, Wood MA. Neuron-specific chromatin remodeling: a missing link in epigenetic mechanisms underlying synaptic plasticity, memory, and intellectual disability disorders. Neuropharmacology. 2014;80:18–27. doi: 10.1016/j.neuropharm.2013.10.002.
Watabe T, Miyazono K. Roles of TGF-beta family signaling in stem cell renewal and differentiation. Cell Research. 2009;19:103–115. doi: 10.1038/cr.2008.323.
Wilson PA, Hemmati-Brivanlou A. Induction of epidermis and inhibition of neural fate by Bmp-4. Nature. 1995;376:331–333. doi: 10.1038/376331a0.
Ying QL, Smith AG. Defined conditions for neural commitment and differentiation. Methods in Enzymology. 2003;365:327–341. doi: 10.1016/s0076-6879(03)65023-8.
Ying QL, Wray J, Nichols J, Batlle-Morera L, Doble B, Woodgett J, Cohen P, Smith A. The ground state of embryonic stem cell self-renewal. Nature. 2008;453:519–523. doi: 10.1038/nature06968.
Young RA. Control of the embryonic stem cell state. Cell. 2011;144:940–954. doi: 10.1016/j.cell.2011.01.032.
Zhou Q, Chipperfield H, Melton DA, Wong WH. A gene regulatory network in mouse embryonic stem cells. PNAS. 2007;104:16438–16443. doi: 10.1073/pnas.0701014104.

eLife. 2017 Mar 15;6:e20487. doi: 10.7554/eLife.20487.025

Decision letter

Editor: Nir Yosef¹

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Dynamics of differentiation inferred from single-cell RNA-seq show a series of transitions through discrete cell states" for consideration by eLife. Your article has been favorably evaluated by Arup Chakraborty (Senior Editor) and three reviewers, one of whom, Nir Yosef (Reviewer #1) served as Guest editor. Jacob H. Hanna (Reviewer #3) agreed to share his identity.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

In their manuscript, Jang et al. propose, test, and validate a statistical framework for analyzing single-cell transcriptomics data from mouse embryonic stem (mES) cell differentiation. The first part of the analysis relies on a companion manuscript, which presented a combined method for clustering and lineage inference of single cells. By applying this method on data from multiple mouse ES subject to short inductive differentiation protocols, the authors identify several cell states, the genes that mark these states, and the genes that capture state transitions. Within these clusters of cells, the authors assert there is little variation, thus defining discrete cell states along mES cell differentiation. They then cluster the genes into modules and use the Hopfield model to identify patterns of dependencies between modules that give rise to the observed clusters as steady states. With this analysis they provide and validate three hypotheses about possible "rewiring" at different stages (i.e. when the effect of perturbing gene X on gene Y varies between cell states).

Essential revisions:

Overall, the reviewers find the methodology developed in this paper interesting, and of potential impact. However, there are several key points that need to be addressed in order to fully support the validity of this methodology and to understand its intricacies.

1) Single cell data quality:

1.1) Based on the data in Figure 1—figure supplement 1E, there seems to be a fairly large variation in the percentage of reads aligning to the transcriptome on a cluster to cluster basis. Do any of the lineages correlate with the percentage of transcriptome or genome reads? What is the significance of the differences between the clusters, like C0 and C3 for example?

1.2) More generally, we are missing a description of how the RNA-seq data were normalized. In many cases when scRNA-seq data is not normalized, we see technical factors that confound the data, and library quality can dominate clustering and dimensionality reduction. Please provide evidence that this is not the case or correct accordingly.

2) Clustering and lineage detection algorithm:

2.1) For their clustering analysis, the authors limit their focus to transcription factors. While they provide off-hand reasoning for this, it is insufficient. Transcription factors (TFs), just like any other transcript, undergo stochastic, burst-like kinetics, and are subject to high amount of variation (esp. given their typically moderate expression). Additionally, it is not a given that measuring TF mRNA, rather than a TF's downstream targets, accurately depicts the circuitry involved in cellular response or differentiation. The authors should demonstrate the effects of including genes other than transcription factors on the clustering results. Relatedly, they later include signaling molecules without a rationale for the shift.

2.2) The nature of the clustering method in which only three clusters of cells are considered at a time inherently limits the hierarchy produced by the authors' Bayesian framework (see Figure 2—figure supplement 1B, right). In this way, the final lineage tree is limited only to branching into two arms at any given differentiation step. Thus, any differentiation program that produces more than two offspring would not be properly modeled. The authors should address this limitation in their framework.

3) Application to ESC:

3.1) The parameters used for the Bayesian framework from the co-submission are missing. What is the cutoff for a triplet to count as a "transition" event? What is the cutoff for a gene to be defined as a "marker" or "transition" gene? What is the termination/convergence condition?

3.2) Since the algorithm is iterative, it might be very sensitive to slight variations in initial conditions or the parameters. In standard EM applications, a common practice is to start from many starting conditions. The authors should provide an estimate of how sensitive the results are to the algorithm's parameters (e.g., probability cutoffs) and how sensitive they are to sub-sampling of cells or genes (i.e., going beyond changing the seed set of clusters, which the authors have already done).

3.3) The results in Figure 2B-E, and especially the comparison of 2B vs. 2D are somewhat tautological. It is not clear to me what these figure panels are supposed to show that we don't already know from the definition of the process applied for choosing those genes.

3.4) What is the relationship between the experimental conditions (time/ stimulation; Figure 1—source data 1) and the inferred clusters? This point is potentially crucial for interpreting the meaning of the clusters and should be discussed.

3.5) We are missing a direct and less engineered view that will help evaluate and digest the clustering results. Specifically – please provide a global heat map figure with all genes used for the final clustering (possibly stratified according to their role as transitions or markers in different parts of the tree) vs. all cells (organized by clusters). This will also help support the statement in the first paragraph of the subsection “Differentiation occurs through a series of discrete cell state transitions”.

3.6) The authors claim that gene expression within each cell cluster does not significantly vary. They validate this by comparing the magnitude of the variance explained by the first PC to that of the first PC from 1000 sets of randomized data (FYI – unclear how 3B shows lack of significance). Why don't the authors compare the percent variance described by the first PC of each cluster to the percent variance described by first PC of randomized data?

3.7) Can the authors identify early primordial germ cell sub-population (e.g. BLIMP1+, T+, TFAP2C+ cells)? Is it discrete or is it perhaps "hiding" in one of their progenitor populations (e.g. mesendodermal cells)?

4) Validation of results:

4.1) The selection of genes in Figure 3D (immunostaining) seems somewhat biased to well-studied markers (shown in Figure 3—figure supplement 1A). Therefore, these results provide somewhat weak support for the cell states inferred from the single cell data.

4.2) In the subsection “A probabilistic model that replicates the observed discrete cell states predicts state-dependent interpretation of perturbations” the authors mention that they "categorized the 184 marker and transition genes and signaling gene groups into 23 gene modules". However, in Figure 2—figure supplement 2 it seems that the number of transition/marker genes should be around 800. Also, it is not clear how the signaling genes were selected (since the analysis up to this point focused on transcription factors). Please clarify these points.

5) Network analysis:

5.1) The use of the Hopfield model is a nice idea; however, the presentation in Figure 4A is somewhat illegible, and it is hard to evaluate the stability of the model (or parts thereof) across the 10k solutions. Please provide a more convenient way to estimate the inferred magnitude and noise for the model's parameters. For instance, a scatter plot of parameters showing mean vs. Fano factor across the 10,000 solutions; and for a few selected parameters, the complete empirical distribution.

5.2) How were the gene modules discretized? The explanation in the subsection “1. Determination of gene modules” is insufficient. Specifically – which cutoffs were used? How was gene drop-out taken into account?

5.3) The derivation of the hypotheses (subsection “A probabilistic model that replicates the observed discrete cell states predicts state-dependent interpretation of perturbations”, seventh paragraph) is not defined rigorously. Please describe clearly – what is "effective interaction strength"? How do we decide when "[X] levels are more stable to [Y] overexpression"? Specifically – which statistical cutoffs were used? What is the false discovery rate? How many other, additional hypotheses with a similar FDR can be derived using the same procedure?

eLife. 2017 Mar 15;6:e20487. doi: 10.7554/eLife.20487.026

Author response

Essential revisions:

We would like to thank the Senior and Reviewing Editors and the peer reviewers for their thoughtful comments and suggestions. The feedback and suggestions have greatly improved our manuscript as well as strengthened our conclusions. We have considered each comment and have amended the manuscript to add 3 new figure supplements (20 new subfigures), as well as edited the text as noted in the detailed responses below.

1) Single cell data quality:

As the reviewers pointed out, there is a large amount of variation in the total UMI number across cells. Because of this variation and the potential biases it can create, we subsampled an equal number of 20,000 UMIs for all 288 cells and ran all subsequent analyses on this subsampled data set. We added a sub-figure (Figure 1—figure supplement 1K) to show that, following subsampling of UMIs, cells do not show correlations with one another based on the total number of UMIs they had prior to subsampling.

We have added a sentence (end of subsection “Acquiring single-cell transcriptomics data during early differentiation”) explicitly stating that we normalize our RNA-seq data by subsampling 20,000 UMIs per cell, for all 288 cells.
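The subsampling step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes a cells-by-genes matrix of integer UMI counts and draws a fixed number of molecules per cell without replacement. The function name `subsample_umis` and the choice to raise an error for cells below the target depth are assumptions made for the example.

```python
import numpy as np

def subsample_umis(counts, depth=20_000, seed=0):
    """Downsample each cell (row) to exactly `depth` UMIs without replacement.

    counts: integer array of shape (n_cells, n_genes).
    """
    counts = np.asarray(counts, dtype=np.int64)
    totals = counts.sum(axis=1)
    if np.any(totals < depth):
        # Assumption: cells with too few UMIs are rejected rather than padded.
        raise ValueError("every cell needs at least `depth` UMIs")
    rng = np.random.default_rng(seed)
    # Each gene is one "color" in an urn; draw `depth` molecules per cell.
    return np.vstack([rng.multivariate_hypergeometric(row, depth) for row in counts])
```

After this step every cell has the same total count, so differences in sequencing depth can no longer drive cell-cell correlations.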

2) Clustering and lineage detection algorithm:

We have added one supplementary figure in the companion manuscript by Furchtgott et al. to address this comment. We analyzed the robustness of using all genes for lineage determination in comparison to using just transcription factors (TFs) and included the results in the accompanying manuscript (Figure 1—figure supplement 2A of accompanying paper Furchtgott et al., 2016).

These figures show a histogram of the number of genes displaying a “clear minimum pattern” (i.e., one cell type within a triplet has a clear minimum expression distribution relative to those of the other two cell types) – evaluated over all TFs and all genes, respectively – from 150 known developmental topologies in B- and T-cell development (Heng et al., 2008). Triplets in which the root has the most genes showing the pattern are labeled red, and triplets in which one of the leaves has the most genes showing the pattern are in blue.

When the gene expression pattern was learned using just TFs (shown on the left side), none of the triplets with more than 10 genes displaying a “clear minimum pattern” have most of these genes showing a clear minimum in the root (no red in any histogram bar except for the leftmost). However, when all the genes are used for the same analysis (shown on the right side), even for triplets with a high number (up to 600) of genes showing a clear minimum pattern, a fraction of them have most of these genes showing a clear minimum in the root. This demonstrates that the clear minimum pattern is not robust when using all genes as opposed to just transcription factors.
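The per-triplet tally behind these histograms can be illustrated with a simplified stand-in. The criterion in the text compares expression distributions; the sketch below instead assumes a much cruder rule (the lowest mean must sit below the second-lowest mean by more than a hypothetical `margin`) purely to show how genes are counted per cell type within a triplet.

```python
import numpy as np

def clear_minimum_counts(expr_by_type, margin=1.0):
    """Count, for each of three cell types, the genes with a clear minimum there.

    expr_by_type: three arrays, each of shape (n_cells_in_type, n_genes).
    A gene counts only if its lowest mean is more than `margin` below the
    second-lowest mean (a simplifying assumption, not the published test).
    """
    means = np.stack([np.asarray(x, dtype=float).mean(axis=0) for x in expr_by_type])
    ordered = np.sort(means, axis=0)              # per gene: lowest, middle, highest
    is_clear = (ordered[1] - ordered[0]) > margin
    argmin = means.argmin(axis=0)                 # which cell type holds the minimum
    return np.bincount(argmin[is_clear], minlength=3)
```

The cell type with the largest count is the one in which most genes show the clear minimum; the histograms in the text ask whether that cell type is the root or a leaf of the known topology.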

Further, we added a new figure to this manuscript: Figure 2—figure supplement 3E, F, showing that including all genes for clustering/ lineage determination leads to errors in the inferred lineage tree (although the clustering configuration remains unchanged relative to when the analysis is restricted to only transcription factors). Although the tree remains the same on the mesendodermal branch, towards the bi-potent ectodermal branch the triplet relationships we obtain are different (as well as less parsimonious given the differentiation durations) from the ones we obtain when we use just the TFs.

Lastly, we now explicitly state our rationale (subsection “A probabilistic model that replicates the observed discrete cell states predicts state-dependent interpretation of perturbations”, third paragraph) for adding signaling genes when modeling the gene regulatory network: that it is because our goal of modeling the gene regulatory network is to make specific predictions as to whether and how cells in different states respond differently to perturbations (i.e., signals as well as gene expression changes).

We thank the reviewers for bringing up this point. Our framework does not in fact inherently limit the resulting lineage tree to bifurcation fate decisions. We illustrate this point using the example in Author response image 1.

Given the set of triplets shown in Author response image 1 (left), the only possible lineage topology is one where the yellow cell type is intermediate to all four other (purple, green, blue, pink) cell types (right). Also, in Figure 2—figure supplement 3C, we show an example of a lineage tree that, based on the set of most likely triplets inferred, contains a trifurcation point. Further, in the accompanying manuscript by Furchtgott et al., the inferred lineage trees from intestinal single-cell data (Figure 3—figure supplement 2) and single-cell human brain development data (Figure 4) contain differentiation steps where three distinct cell types are produced from the same progenitor cell type.

3) Application to ESC:

The probability cutoff for a triplet to count as a transition event is 0.6. For a gene to be defined as a marker or transition gene, we use a probability cutoff of 0.5. We have added these cutoffs to the manuscript (subsection “Bayesian statistical approach discovers appropriate coordinate systems to infer cell states and state transitions”, fifth paragraph). Further, since the probability cutoffs are arbitrary, in response to this and the following comment we tested our results over a range of cutoffs (see reviewer comment 3.2).

We iterated the clustering-inference procedure until the dimension of the re-clustering subspace (i.e., the number of genes) changed by less than 10% of the total transcription factor space. We have modified the text (in the sixth paragraph of the aforementioned subsection) to make these points clear.
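The two cutoffs and the stopping rule can be written down compactly. The following is a sketch under stated assumptions: the helper names are hypothetical, the cutoff is treated as inclusive (the text does not say whether the comparison is strict), and "changed" is read as the change in the number of selected genes relative to the total number of transcription factors.

```python
def high_probability(probabilities, cutoff):
    """Return the names whose probability is at least `cutoff`.

    Per the text, cutoff is 0.6 for triplets and 0.5 for marker/transition genes.
    """
    return {name for name, p in probabilities.items() if p >= cutoff}

def gene_set_converged(previous, current, n_transcription_factors, tolerance=0.10):
    """True when the gene-set size changed by less than `tolerance` of all TFs."""
    change = abs(len(current) - len(previous))
    return change < tolerance * n_transcription_factors
```

In an iterative loop, clustering and inference would be repeated on the newly selected genes until `gene_set_converged` returns True.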

Following the reviewers’ suggestions, we tested the effects of using different probability cutoff values as well as subsampling cells and genes (subsection “Bayesian statistical approach discovers appropriate coordinate systems to infer cell states and state transitions”, seventh paragraph) (Figure 2—figure supplement 3). We show that:

1) The clustering configuration and lineage tree are unchanged within a probability cutoff value range of 0.5 to 0.96 for defining “high-probability” marker and transition genes, but at a cutoff value of 0.97, both the clustering and the lineage determination given the original clusters fail (Figure 2—figure supplement 3A and 3B). We expect this to be an effect of the number of genes used for subsequent clustering iterations becoming smaller as the probability cutoff value increases.

2) To further demonstrate this last point, we show that using 416 high-probability (p ≥ 0.96) transition and marker genes still results in the same clustering configuration (with the exception of a few cells, due to the stochastic nature of k-means clustering, which becomes more prominent as the number of genes used for clustering becomes smaller) as well as the same lineage tree. However, using the 416 genes with the highest coefficient of variation across all cells fails to produce the same results and instead gives rise to a much less parsimonious lineage tree, given what we know about the differentiation conditions and duration of the individual cells (Figure 2—figure supplement 3B).

3) Finally, we show that our results are robust to using a subset of the 288 cells: the clustering configuration as well as the lineage relationships of the individual clusters to one another remained unchanged when a) a particular cluster (C₃) was removed, or b) a random subset of 144 cells was removed (Figure 2—figure supplement 3C and 3D). Interestingly, when C₃ is removed, the algorithm connects the grandchildren cell types (C₅ and C₆) to the grandmother (C₁) directly.

Figure 2B and 2C were intended to be mere illustrations – rather than proofs of principles – of what marker and transition genes are, respectively, although we now see how the comparison between 2B and 2D could appear tautological. Following this comment as well as the next (3.4), we decided to change Figure 2D to a cell-cell correlation plot of all cells using all 889 genes that were used for the final clustering and lineage determination step.
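A cell-cell correlation plot of this kind is built from a matrix of pairwise correlations between cells, with cells ordered by cluster so that block structure becomes visible. The sketch below assumes a cells-by-genes expression matrix restricted to the selected genes and uses Pearson correlation; the function name and the choice of Pearson are assumptions, since the response does not specify the correlation measure.

```python
import numpy as np

def cell_cell_correlation(expression, cluster_labels):
    """Pearson correlation between cells, with cells sorted by cluster label.

    expression: array of shape (n_cells, n_genes).
    Returns the (n_cells, n_cells) correlation matrix and the row order used.
    """
    expression = np.asarray(expression, dtype=float)
    order = np.argsort(cluster_labels, kind="stable")
    corr = np.corrcoef(expression[order])  # rows are cells, so this is cell-by-cell
    return corr, order
```

Cells from the same cluster then appear as contiguous high-correlation blocks along the diagonal.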

Figure 2—source data 2 explains how the individual cells from each of these wells described in Figure 1—source data 1 cluster. We also added a subfigure (Figure 3—figure supplement 1A) to better illustrate the relationship between clusters and culture conditions, showing that cells from different culture conditions sometimes cluster together, as well as that cells exposed to the same conditions are sometimes assigned to different clusters.

We have added a heat map subfigure (Figure 2—figure supplement 2D) showing the expression levels of the 889 genes used for the final iteration of clustering and lineage determination for all 288 cells.

We have added a plot showing the mean and c.v. (coefficient of variation) of percent variance explained by the first principal component (PC) of each cell cluster normalized by that of randomized data (Figure 2—figure supplement 3B). We find that this value ranges between 1.0400 (C₅) and 2.2527 (C₀), with c.v.’s of ~0.01. In contrast, for all possible merged pairs of clusters, the mean and c.v. of the percent variance explained by the first PC normalized by that of randomized data were 3.2220 and 0.2967, respectively. We have referred to this new supplementary figure (subsection “Bayesian statistical approach discovers appropriate coordinate systems to infer cell states and state transitions”, tenth paragraph) in the manuscript.
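The normalized quantity reported above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the randomization is taken to be an independent shuffle of each gene across cells, the matrix is cells by genes for a single cluster, and the function names are hypothetical.

```python
import numpy as np

def pc1_fraction(matrix):
    """Fraction of total variance explained by the first principal component."""
    centered = matrix - matrix.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values[0] ** 2 / np.sum(singular_values ** 2)

def pc1_ratio_to_randomized(matrix, n_randomizations=100, seed=0):
    """Mean and c.v. of (observed PC1 fraction) / (randomized PC1 fraction)."""
    matrix = np.asarray(matrix, dtype=float)
    rng = np.random.default_rng(seed)
    observed = pc1_fraction(matrix)
    ratios = np.empty(n_randomizations)
    for k in range(n_randomizations):
        # Assumption: shuffle each gene (column) independently across cells,
        # which preserves per-gene distributions but removes gene-gene structure.
        shuffled = rng.permuted(matrix, axis=0)
        ratios[k] = observed / pc1_fraction(shuffled)
    return ratios.mean(), ratios.std() / ratios.mean()
```

A ratio near 1 indicates that the cluster has no dominant axis of variation beyond what randomized data show, while larger ratios, as for merged pairs of clusters, indicate residual structure.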

We were able to identify one cell in cluster C₇ that has above-background expression levels of BLIMP1, T, TFAP2C, and STELLA. Although this is intriguing, it is unfortunately difficult to conclude anything from a single cell.

4) Validation of results:

We have followed up on this comment by immunostaining day 3 and day 4 mesendodermal cells (as identified by T expression) for Etv5 and FoxA2 (Figure 3—figure supplement 1C). Etv5 is known to be present in pluripotent cells (Akagi et al., 2015) as well as to play a role in spermatogenesis (Tyagi et al., 2009), but to our knowledge it has so far not been implicated in early mesendodermal differentiation.

We have added a few sentences in the main text to address these results:

“From the 889 genes that were categorized as either marker or transition genes for all the high probability triplets, we chose genes involving only the triplets along the lineage tree. […] For instance, we did not consider the genes involving the triplet of C₀, C₁ and C₅ since the C₃ cluster is skipped between C₁ and C₅.”

“Further, because our goal was to test whether different cell states were functionally distinct (i.e., respond differently to the same signals and gene expression changes), we also noted the expression pattern of signaling factor genes belonging to FGF, WNT, LIF and BMP signaling pathways along the discovered lineage tree (shown in Figure 3A).”

5) Network analysis:

Following the reviewers’ suggestions, we now provide a heatmap showing the mean and c.v. of the different parameters (the Jᵢⱼ values) across the 10,000 solutions (Figure 4—figure supplement 2) to provide a convenient way to estimate the inferred magnitude and noise for the model parameters.
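The summary shown in such a heatmap can be computed directly from the stack of solutions. The sketch below assumes the solutions are stored as an array of shape (number of solutions, modules, modules); the function name and the convention of returning NaN where the mean is exactly zero are assumptions for the example.

```python
import numpy as np

def summarize_parameters(solutions):
    """Per-parameter mean and coefficient of variation across solutions.

    solutions: array of shape (n_solutions, n_modules, n_modules).
    """
    solutions = np.asarray(solutions, dtype=float)
    mean = solutions.mean(axis=0)
    std = solutions.std(axis=0)
    # c.v. = std / |mean|; left as NaN where the mean is exactly zero.
    cv = np.divide(std, np.abs(mean), out=np.full_like(mean, np.nan), where=mean != 0)
    return mean, cv
```

Parameters with a large mean magnitude and a small c.v. are the ones inferred consistently across solutions.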

We have added the following paragraph to clarify the discretization procedure:

“We obtained 321 transcription factors from the triplets along the tree and classified them based on their pattern across the triplets. […] This way we acquired a total of 29 modules with unique binary gene expression profiles.”

One key assumption of the model is that we can infer the underlying gene regulatory network from the genes that could be detected from the single-cell transcriptomics data. Gene dropout would result in a reduction in the number of genes within a module or a reduction in the number of gene modules. In order to explore how gene dropout would affect our model predictions, we sub-sampled the number of genes used to build the gene regulatory network by changing the probability cutoff for the transition and marker genes we considered. Although the number of gene modules changed (27 modules for a cutoff of 0.7 and 24 for 0.9), we found that the models made the same qualitative predictions (Figure 4—figure supplement 3).
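Grouping genes into modules with unique binary profiles, and seeing how the module count shrinks when genes are dropped, can be illustrated as below. The sketch assumes each gene has already been discretized to a 0/1 value per cluster (the discretization itself is not reproduced here), and the function name and gene labels are hypothetical.

```python
from collections import defaultdict

def group_into_modules(binary_profiles):
    """Group genes that share an identical binary expression profile.

    binary_profiles: mapping gene name -> sequence of 0/1 values, one per cluster.
    Returns a mapping profile (tuple) -> list of gene names in that module.
    """
    modules = defaultdict(list)
    for gene, profile in binary_profiles.items():
        modules[tuple(profile)].append(gene)
    return dict(modules)
```

Dropping a gene either shrinks a module or, if it was the only gene with that profile, removes the module entirely, which is why a stricter cutoff yields fewer modules.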

We would like to thank the reviewers for pointing this out. We have made several changes to this section in order to make it more comprehensible. We have explained the term “effective interaction strength” in the manuscript (subsection “A probabilistic model that replicates the observed discrete cell states predicts state-dependent interpretation of perturbations”, eighth paragraph).

In order to obtain the statistical cutoffs to determine "[X] levels are more stable to [Y] overexpression", we randomly sampled three sets of 3333 of the 10000 models. For each set we computed the relevant quantities, such as the number of models that show downregulation of Oct4 in response to Sox2 overexpression, and correspondingly, for each of these quantities computed the mean and the standard error. We have added a few sentences to clarify this point (Figure 4 captions). We have also added these error bars in Figure 4D, 4G and 4I.
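The error-bar computation described above can be sketched as follows. It assumes a boolean outcome per model (for example, whether that model shows downregulation of Oct4 in response to Sox2 overexpression) and assumes the three sets are disjoint, which the text does not state explicitly; the function name is hypothetical.

```python
import numpy as np

def subset_count_statistics(outcomes, n_sets=3, set_size=3333, seed=0):
    """Mean and standard error of per-subset counts of models with an outcome.

    outcomes: boolean array with one entry per model.
    """
    outcomes = np.asarray(outcomes, dtype=bool)
    if n_sets * set_size > outcomes.size:
        raise ValueError("not enough models for disjoint subsets")
    rng = np.random.default_rng(seed)
    # Assumption: subsets are disjoint, drawn from one random permutation.
    picked = rng.permutation(outcomes.size)[: n_sets * set_size].reshape(n_sets, set_size)
    counts = outcomes[picked].sum(axis=1)
    standard_error = counts.std(ddof=1) / np.sqrt(n_sets)
    return counts.mean(), standard_error
```

With only three subsets the standard error is itself a rough estimate, so it is best read as an indication of how stable a prediction is across models rather than as a formal confidence interval.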

Lastly, although our model can make a large number of predictions, the eventual confirmation of these predictions can only be done through experimental validation. It is therefore beyond the scope of this manuscript to compute the false discovery rate, since that would require experimentally testing hundreds of predictions.

Supplementary Materials

Figure 1—source data 1. Differentiation conditions and duration of single cells sorted into seven 96-well plates.

DOI: http://dx.doi.org/10.7554/eLife.20487.003

elife-20487-fig1-data1.docx (19.3 KB, docx)

Figure 2—source data 1. Plate and well id’s of cells belonging to each cluster.

DOI: http://dx.doi.org/10.7554/eLife.20487.006

elife-20487-fig2-data1.docx (30.8 KB, docx)

Figure 2—source data 2. Triplet probabilities of final tree.

DOI: http://dx.doi.org/10.7554/eLife.20487.007

elife-20487-fig2-data2.docx (33.8 KB, docx)

Figure 3—source data 1. Probabilities of membership in marker and transition gene classes in final tree.

Listed are, for direct triplets along the lineage tree, the genes with the highest probabilities of belonging to transition gene and marker gene classes, and their associated probabilities. Genes belonging to the classes shown in curly brackets in Figure 3C (probability greater than 0.5%) are shown.

DOI: http://dx.doi.org/10.7554/eLife.20487.012

elife-20487-fig3-data1.xlsx (36.5 KB, xlsx)

Figure 4—source data 1. Gene modules used for modeling the network.

* The [BMP] and [Aes] modules have the same binary pattern. ** The [FGF] and [WNT] modules have the same binary pattern.

DOI: http://dx.doi.org/10.7554/eLife.20487.015

elife-20487-fig4-data1.docx (37.1 KB, docx)

Figure 4—source data 2. Binary expression profiles of the gene modules used for modeling the network in the 9 cell clusters.

DOI: http://dx.doi.org/10.7554/eLife.20487.016

elife-20487-fig4-data2.docx (19.6 KB, docx)

[bib1] Advani M, Ganguli S. Statistical mechanics of high-dimensional inference. arXiv. 2016. arXiv:1601.04650.

[bib2] Arnold SJ, Robertson EJ. Making a commitment: cell lineage allocation and Axis patterning in the early mouse embryo. Nature Reviews Molecular Cell Biology. 2009;10:91–103. doi: 10.1038/nrm2618.

[bib3] Borgel J, Guibert S, Li Y, Chiba H, Schübeler D, Sasaki H, Forné T, Weber M. Targets and dynamics of promoter DNA methylation during early mouse development. Nature Genetics. 2010;42:1093–1100. doi: 10.1038/ng.708.

[bib4] Brown L, Brown S. Zic2 is expressed in pluripotent cells in the blastocyst and adult brain expression overlaps with makers of neurogenesis. Gene Expression Patterns. 2009;9:43–49. doi: 10.1016/j.gep.2008.08.002.

[bib5] Buecker C, Srinivasan R, Wu Z, Calo E, Acampora D, Faial T, Simeone A, Tan M, Swigut T, Wysocka J. Reorganization of enhancer patterns in transition from naive to primed pluripotency. Cell Stem Cell. 2014;14:838–853. doi: 10.1016/j.stem.2014.04.003.

[bib6] Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology. 2015;33:155–160. doi: 10.1038/nbt.3102.

[bib7] Chambers I. The molecular basis of pluripotency in mouse embryonic stem cells. Cloning and Stem Cells. 2004;6:386–391. doi: 10.1089/clo.2004.6.386.

[bib8] Chung NC, Storey JD. Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics. 2015;31:545–554. doi: 10.1093/bioinformatics/btu674.

[bib9] Evans MJ, Kaufman MH. Establishment in culture of pluripotential cells from mouse embryos. Nature. 1981;292:154–156. doi: 10.1038/292154a0.

[bib10] Fard AT, Srihari S, Mar JC, Ragan MA. Not just a colourful metaphor: modelling the landscape of cellular development using hopfield networks. Npj Systems Biology and Applications. 2016;2:16001. doi: 10.1038/npjsba.2016.1.

[bib11] Furchtgott L, Melton S, Menon V, Lodato S, Ramanathan S. Discovering sparse transcription factor codes for cell states and state transitions during development. eLife. 2017:e20488. doi: 10.7554/eLife.20488.

[bib12] Gadue P, Huber TL, Paddison PJ, Keller GM. Wnt and TGF-beta signaling are required for the induction of an in vitro model of primitive streak formation using embryonic stem cells. PNAS. 2006;103:16806–16811. doi: 10.1073/pnas.0603916103.

[bib13] Galvagni F, Lentucci C, Neri F, Dettori D, De Clemente C, Orlandini M, Anselmi F, Rapelli S, Grillo M, Borghi S, Oliviero S. Snai1 promotes ESC exit from the pluripotency by direct repression of self-renewal genes. Stem Cells. 2015;33:742–750. doi: 10.1002/stem.1898.

[bib14] Gans C, Northcutt RG. Neural crest and the origin of vertebrates: a new head. Science. 1983;220:268–273. doi: 10.1126/science.220.4594.268.

[bib15] Gaspard N, Bouschet T, Hourez R, Dimidschstein J, Naeije G, van den Ameele J, Espuny-Camacho I, Herpoel A, Passante L, Schiffmann SN, Gaillard A, Vanderhaeghen P. An intrinsic mechanism of corticogenesis from embryonic stem cells. Nature. 2008;455:351–357. doi: 10.1038/nature07287.

[bib16] Gass SI. Linear Programming. 5th edn. Boyd & Fraser Publishing Company; 2013.

[bib17] Geula S, Moshitch-Moshkovitz S, Dominissini D, Mansour AA, Kol N, Salmon-Divon M, Hershkovitz V, Peer E, Mor N, Manor YS, Ben-Haim MS, Eyal E, Yunger S, Pinto Y, Jaitin DA, Viukov S, Rais Y, Krupalnik V, Chomsky E, Zerbib M, Maza I, Rechavi Y, Massarwa R, Hanna S, Amit I, Levanon EY, Amariglio N, Stern-Ginossar N, Novershtern N, Rechavi G, Hanna JH. Stem cells. m6A mRNA methylation facilitates resolution of naïve pluripotency toward differentiation. Science. 2015;347:1002–1006. doi: 10.1126/science.1261417.

[bib18] Goller T, Vauti F, Ramasamy S, Arnold HH. Transcriptional regulator BPTF/FAC1 is essential for trophoblast differentiation during early mouse development. Molecular and Cellular Biology. 2008;28:6819–6827. doi: 10.1128/MCB.01058-08.

[bib19] Hart AH, Hartley L, Sourris K, Stadler ES, Li R, Stanley EG, Tam PP, Elefanty AG, Robb L. Mixl1 is required for axial mesendoderm morphogenesis and patterning in the murine embryo. Development. 2002;129:3597–3608. doi: 10.1242/dev.129.15.3597.

[bib20] Hashimshony T, Wagner F, Sher N, Yanai I. CEL-Seq: single-cell RNA-Seq by multiplexed linear amplification. Cell Reports. 2012;2:666–673. doi: 10.1016/j.celrep.2012.08.003.

[bib21] Hopfield JJ. Neurons with graded response have collective computational properties like those of two-state neurons. PNAS. 1984;81:3088–3092. doi: 10.1073/pnas.81.10.3088.

[bib22] Islam S, Zeisel A, Joost S, La Manno G, Zajac P, Kasper M, Lönnerberg P, Linnarsson S. Quantitative single-cell RNA-seq with unique molecular identifiers. Nature Methods. 2014;11:163–166. doi: 10.1038/nmeth.2772.

[bib23] Kanai-Azuma M, Kanai Y, Gad JM, Tajima Y, Taya C, Kurohmaru M, Sanai Y, Yonekawa H, Yazaki K, Tam PP, Hayashi Y. Depletion of definitive gut endoderm in Sox17-null mutant mice. Development. 2002;129:2367–2379. doi: 10.1242/dev.129.10.2367.

[bib24] Karr JR, Sanghvi JC, Macklin DN, Gutschow MV, Jacobs JM, Bolival B, Assad-Garcia N, Glass JI, Covert MW. A whole-cell computational model predicts phenotype from genotype. Cell. 2012;150:389–401. doi: 10.1016/j.cell.2012.05.044.

[bib25] Keller G. Embryonic stem cell differentiation: emergence of a new era in biology and medicine. Genes & Development. 2005;19:1129–1155. doi: 10.1101/gad.1303605.

[bib26] Kim J, Chu J, Shen X, Wang J, Orkin SH. An extended transcriptional network for pluripotency of embryonic stem cells. Cell. 2008;132:1049–1061. doi: 10.1016/j.cell.2008.02.039.

[bib27] Kim JK, Huh SO, Choi H, Lee KS, Shin D, Lee C, Nam JS, Kim H, Chung H, Lee HW, Park SD, Seong RH. Srg3, a mouse homolog of yeast SWI3, is essential for early embryogenesis and involved in brain development. Molecular and Cellular Biology. 2001;21:7787–7795. doi: 10.1128/MCB.21.22.7787-7795.2001.

[bib28] Kim PT, Ong CJ. Differentiation of definitive endoderm from mouse embryonic stem cells. Results and Problems in Cell Differentiation. 2012;55:303–319. doi: 10.1007/978-3-642-30406-4_17.

[bib29] Knecht AK, Bronner-Fraser M. Induction of the neural crest: a multigene process. Nature Reviews. Genetics. 2002;3:453–461. doi: 10.1038/nrg819.

[bib30] Koch PJ, Roop DR. The role of keratins in epidermal development and homeostasis--going beyond the obvious. Journal of Investigative Dermatology. 2004;123:x–0. doi: 10.1111/j.0022-202X.2004.23495.x.

[bib31] Nicole LD. Chaya K. The Neural Crest. 2nd edn. Cambridge University Press; 1991. 9780521620109

[bib32] Lebrecht D, Foehr M, Smith E, Lopes FJ, Vanario-Alonso CE, Reinitz J, Burz DS, Hanes SD. Bicoid cooperative DNA binding is critical for embryonic patterning in Drosophila. PNAS. 2005;102:13176–13181. doi: 10.1073/pnas.0506462102.

[bib33] Li CL, Li KC, Wu D, Chen Y, Luo H, Zhao JR, Wang SS, Sun MM, Lu YJ, Zhong YQ, Hu XY, Hou R, Zhou BB, Bao L, Xiao HS, Zhang X. Somatosensory neuron types identified by high-coverage single-cell RNA-sequencing and functional heterogeneity. Cell Research. 2016;26:83–102. doi: 10.1038/cr.2015.149.

[bib34] Li L, Song L, Liu C, Chen J, Peng G, Wang R, Liu P, Tang K, Rossant J, Jing N. Ectodermal progenitors derived from epiblast stem cells by inhibition of nodal signaling. Journal of Molecular Cell Biology. 2015;7:455–465. doi: 10.1093/jmcb/mjv030.

[bib35] Lindsley RC, Gill JG, Kyba M, Murphy TL, Murphy KM. Canonical Wnt signaling is required for development of embryonic stem cell-derived mesoderm. Development. 2006;133:3787–3796. doi: 10.1242/dev.02551.

[bib36] Lumelsky N, Blondel O, Laeng P, Velasco I, Ravin R, McKay R. Differentiation of embryonic stem cells to insulin-secreting structures similar to pancreatic islets. Science. 2001;292:1389–1394. doi: 10.1126/science.1058866.

[bib37] Machta BB, Chachra R, Transtrum MK, Sethna JP. Parameter space compression underlies emergent theories and predictive models. Science. 2013;342:604–607. doi: 10.1126/science.1238723.

[bib38] Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA. Highly parallel Genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161:1202–1214. doi: 10.1016/j.cell.2015.05.002.

[bib39] Maetschke SR, Ragan MA. Characterizing Cancer subtypes as attractors of hopfield networks. Bioinformatics. 2014;30:1273–1279. doi: 10.1093/bioinformatics/btt773.

[bib40] Merrill BJ, Pasolli HA, Polak L, Rendl M, García-García MJ, Anderson KV, Fuchs E. Tcf3: a transcriptional regulator of Axis induction in the early embryo. Development. 2004;131:263–274. doi: 10.1242/dev.00935.

[bib41] Nakanishi M, Kurisaki A, Hayashi Y, Warashina M, Ishiura S, Kusuda-Furue M, Asashima M. Directed induction of anterior and posterior primitive streak by wnt from embryonic stem cells cultured in a chemically defined serum-free medium. The FASEB Journal. 2009;23:114–122. doi: 10.1096/fj.08-111203.

[bib42] Nichols J, Smith A. Naive and primed pluripotent states. Cell Stem Cell. 2009;4:487–492. doi: 10.1016/j.stem.2009.05.015.

[bib43] Paul F, Arkin Y, Giladi A, Jaitin DA, Kenigsberg E, Keren-Shaul H, Winter D, Lara-Astiaso D, Gury M, Weiner A, David E, Cohen N, Lauridsen FK, Haas S, Schlitzer A, Mildner A, Ginhoux F, Jung S, Trumpp A, Porse BT, Tanay A, Amit I. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell. 2015;163:1663–1677. doi: 10.1016/j.cell.2015.11.013.

[bib44] Pevny LH, Sockanathan S, Placzek M, Lovell-Badge R. A role for SOX1 in neural determination. Development. 1998;125:1967–1978. doi: 10.1242/dev.125.10.1967.

[bib45] Power MA, Tam PP. Onset of Gastrulation, morphogenesis and somitogenesis in mouse embryos displaying compensatory growth. Anatomy and Embryology. 1993;187:493–504. doi: 10.1007/BF00174425.

[bib46] Reyon D, Tsai SQ, Khayter C, Foden JA, Sander JD, Joung JK. FLASH assembly of TALENs for high-throughput genome editing. Nature Biotechnology. 2012;30:460–465. doi: 10.1038/nbt.2170.

[bib47] Roccio M, Schmitter D, Knobloch M, Okawa Y, Sage D, Lutolf MP. Predicting stem cell fate changes by differential cell cycle progression patterns. Development. 2013;140:459–470. doi: 10.1242/dev.086215.

[bib48] Rojas A, De Val S, Heidt AB, Xu SM, Bristow J, Black BL. Gata4 expression in lateral mesoderm is downstream of BMP4 and is activated directly by forkhead and GATA transcription factors through a distal enhancer element. Development. 2005;132:3405–3417. doi: 10.1242/dev.01913.

[bib49] Saadatpour A, Guo G, Orkin SH, Yuan G-C. Characterizing heterogeneity in leukemic cells using single-cell gene expression analysis. Genome Biology. 2014;15:1–13. doi: 10.1186/s13059-014-0525-9.

[bib50] Sansom SN, Griffiths DS, Faedo A, Kleinjan DJ, Ruan Y, Smith J, van Heyningen V, Rubenstein JL, Livesey FJ. The level of the transcription factor Pax6 is essential for controlling the balance between neural stem cell self-renewal and neurogenesis. PLoS Genetics. 2009;5:e1000511. doi: 10.1371/journal.pgen.1000511.

[bib51] Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nature Biotechnology. 2015;33:495–502. doi: 10.1038/nbt.3192.

[bib52] Segal E, Widom J. From DNA sequence to transcriptional behaviour: a quantitative approach. Nature Reviews Genetics. 2009;10:443–456. doi: 10.1038/nrg2591.

[bib53] Spitz F, Furlong EE. Transcription factors: from enhancer binding to developmental control. Nature Reviews Genetics. 2012;13:613–626. doi: 10.1038/nrg3207.

[bib54] Streit A, Stern CD. Neural induction. A bird's eye view. Trends in Genetics. 1999;15:20–24. doi: 10.1016/S0168-9525(98)01620-5.

[bib55] Sumi T, Tsuneyoshi N, Nakatsuji N, Suemori H. Defining early lineage specification of human embryonic stem cells by the orchestrated balance of canonical wnt/beta-catenin, activin/Nodal and BMP signaling. Development. 2008;135:2969–2979. doi: 10.1242/dev.021121.

[bib56] Tada S, Era T, Furusawa C, Sakurai H, Nishikawa S, Kinoshita M, Nakao K, Chiba T, Nishikawa S. Characterization of mesendoderm: a diverging point of the definitive endoderm and mesoderm in embryonic stem cell differentiation culture. Development. 2005;132:4363–4374. doi: 10.1242/dev.02005.

[bib57] Tam PP, Loebel DA, Tanaka SS. Building the mouse gastrula: signals, asymmetry and lineages. Current Opinion in Genetics & Development. 2006;16:419–425. doi: 10.1016/j.gde.2006.06.008.

[bib58] Tesar PJ, Chenoweth JG, Brook FA, Davies TJ, Evans EP, Mack DL, Gardner RL, McKay RD. New cell lines from mouse epiblast share defining features with human embryonic stem cells. Nature. 2007;448:196–199. doi: 10.1038/nature05972.

[bib59] Thomson M, Liu SJ, Zou LN, Smith Z, Meissner A, Ramanathan S. Pluripotency factors in embryonic stem cells regulate differentiation into germ layers. Cell. 2011;145:875–889. doi: 10.1016/j.cell.2011.05.017.

[bib60] Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B. 2001;63:411–423. doi: 10.1111/1467-9868.00293.

[bib61] Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotechnology. 2014;32:381–386. doi: 10.1038/nbt.2859.

[bib62] Van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9:2579–2605.

[bib63] Vogel-Ciernia A, Wood MA. Neuron-specific chromatin remodeling: a missing link in epigenetic mechanisms underlying synaptic plasticity, memory, and intellectual disability disorders. Neuropharmacology. 2014;80:18–27. doi: 10.1016/j.neuropharm.2013.10.002.

[bib64] Watabe T, Miyazono K. Roles of TGF-beta family signaling in stem cell renewal and differentiation. Cell Research. 2009;19:103–115. doi: 10.1038/cr.2008.323.

[bib65] Wilson PA, Hemmati-Brivanlou A. Induction of epidermis and inhibition of neural fate by Bmp-4. Nature. 1995;376:331–333. doi: 10.1038/376331a0.

[bib66] Ying QL, Smith AG. Defined conditions for neural commitment and differentiation. Methods in Enzymology. 2003;365:327–341. doi: 10.1016/s0076-6879(03)65023-8.

[bib67] Ying QL, Wray J, Nichols J, Batlle-Morera L, Doble B, Woodgett J, Cohen P, Smith A. The ground state of embryonic stem cell self-renewal. Nature. 2008;453:519–523. doi: 10.1038/nature06968.

[bib68] Young RA. Control of the embryonic stem cell state. Cell. 2011;144:940–954. doi: 10.1016/j.cell.2011.01.032. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib69] Zhou Q, Chipperfield H, Melton DA, Wong WH. A gene regulatory network in mouse embryonic stem cells. PNAS. 2007;104:16438–16443. doi: 10.1073/pnas.0701014104. [DOI] [PMC free article] [PubMed] [Google Scholar]
