Skip to main content
eLife logoLink to eLife
. 2017 Mar 15;6:e20488. doi: 10.7554/eLife.20488

Discovering sparse transcription factor codes for cell states and state transitions during development

Leon A Furchtgott 1,2,*,, Samuel Melton 1,3,, Vilas Menon 4,5, Sharad Ramanathan 1,3,4,6,7,*
Editor: Nir Yosef8
PMCID: PMC5352226  PMID: 28296636

Abstract

Computational analysis of gene expression to determine both the sequence of lineage choices made by multipotent cells and to identify the genes influencing these decisions is challenging. Here we discover a pattern in the expression levels of a sparse subset of genes among cell types in B- and T-cell developmental lineages that correlates with developmental topologies. We develop a statistical framework using this pattern to simultaneously infer lineage transitions and the genes that determine these relationships. We use this technique to reconstruct the early hematopoietic and intestinal developmental trees. We extend this framework to analyze single-cell RNA-seq data from early human cortical development, inferring a neocortical-hindbrain split in early progenitor cells and the key genes that could control this lineage decision. Our work allows us to simultaneously infer both the identity and lineage of cell types as well as a small set of key genes whose expression patterns reflect these relationships.

DOI: http://dx.doi.org/10.7554/eLife.20488.001

Research Organism: Human, Mouse

Introduction

During development, pluripotent cells make a series of lineage decisions to give rise to the different cell types of the body. These lineage decisions are controlled by intra-cellular molecular networks that include transcription factors and signaling molecules. There are two fundamental challenges associated with understanding the differentiation of individual cells. The first is to identify lineage relationships: how cells and their progeny move from pluripotent through intermediate to terminally differentiated cell states. The second is to identify the key molecular drivers that allow cells to make fate decisions along their developmental trajectory.

Reconstructing cell lineages has traditionally involved prospectively tracking cells and their progeny using a variety of imaging or genetic tools (Buckingham and Meilhac, 2011; Frumkin et al., 2008; Orkin and Zon, 2008; Sulston et al., 1983). Recent progress in single-cell sequencing techniques (Grün et al., 2015; Jaitin et al., 2014; Macosko et al., 2015; Patel et al., 2014; Paul et al., 2015; Treutlein et al., 2014; Zeisel et al., 2015) allows for a complementary view of the transcriptional states of individual cells during the course of development, providing static snapshots of the dynamics of the underlying molecular network. But inferring lineage relationships and the dynamics of the underlying molecular networks has proved difficult using transcriptional data alone, in part because of the high dimensional nature of these data. Overcoming this challenge would be particularly useful for understanding the development of human organs such as the brain where traditional lineage tracing experiments are more difficult.

High dimensional data analysis techniques such as PCA, ICA, or t-SNE, are useful at reducing dimensionality. However, since the resulting axes represent linear combinations of a large number of features (for example, expression levels of each gene), interpreting the analysis or making experimental predictions is sometimes challenging. Meanwhile, traditional statistical methods such as linear multivariate regression have limited applicability for detecting patterns in high dimensional data (Advani and Ganguli, 2016; Donoho and Tanner, 2009). The challenges inherent in high-dimensional data analysis such as identifying discriminatory features are further exacerbated as the fraction of relevant features decreases (Donoho and Tanner, 2009). Computational techniques currently in use to cluster single cell data or to infer relationships among cells are built on these approaches, thereby assuming that all high-variance genes are equally relevant for pattern detection (Marco et al., 2014; Satija et al., 2015; Trapnell et al., 2014). In contrast, decades of work in developmental biology have revealed that combinations of a few transcription factors can be sufficient to experimentally perturb cell fate and developmental decisions (Gilbert, 2014; Graf and Enver, 2009; Takahashi and Yamanaka, 2006) suggesting that the expression patterns of a few genes may be most relevant for making computational inferences. Therefore, there is a need to detect patterns involving a small fraction of all genes. Unfortunately, except in the case of well-studied lineage decisions, we do not know the identity of this fraction.

In statistics, techniques relying on L1 regularization have been successful in contexts where the number of informative variables is known to be small but whose identities are unknown, both for regression problems (Baraniuk, 2007; Candès et al., 2006; Tibshirani, 1996; Wainwright, 2009) and for clustering (Witten and Tibshirani, 2010). Inspired by these successes in statistics, our aim here is to discover generalizable sparse patterns in gene expression data during development (if they exist), and to exploit these patterns to computationally infer the dynamics of cell state transitions from high-dimensional transcriptional data obtained during the course of development.

In this manuscript we analyze expression patterns among cell types with known lineage relationships in late hematopoiesis and discover a pattern in a sparse subset of genes that correlates with these relationships. We develop a Bayesian framework based on this gene expression pattern to simultaneously infer lineage transitions and the key genes that drive them. We apply this method to reconstruct the lineage tree among a different set of cell types in early hematopoietic development, and in this process identify many known drivers of early hematopoiesis, including Gata1, Cebpa and Ebf1. We further extend our method to analyze single-cell gene expression data, using genes exhibiting the discovered pattern to cluster cells from early brain development and to infer lineage relationships between these clusters. Our analysis reveals a split from early progenitors to putative neocortex and mid/hindbrain cell types, as evidenced by the mutually exclusive expression of region-specific genes such as FOXG1, LHX1, and POU3F2 (BRN2). This prediction was validated experimentally in a separate work (Yao et al., 2017). We finally discuss the advantages of using sparse patterns for making inferences and for modeling the underlying gene regulatory networks.

Results

Discovering sparse patterns correlated with lineage transitions

In order to identify gene expression patterns that are robustly predictive of lineage relationships, we analyzed gene expression data from 41 cell types during B- and T- cell development that have an experimentally established developmental lineage (Figure 1A, Heng et al., 2008). We searched for sparse patterns of gene expression amongst groups of three cell types from this collection; subsets of three are the minimal set in which measures of relative similarity can be used infer relative lineage relationships.

Figure 1. The clear minimum pattern is robustly detected in leaf cell types throughout triplet lineages in B- and T-cell development.

(A) Developmental lineage tree showing relationships among 41 cell types in B- and T-cell development (Heng et al., 2008). Three triplets – the minimal subset of the tree from which relative distances can be studied – are denoted, each including an intermediate root cell type (red) and terminal leaf cell types (blue). 150 triplets among all sets of three cell types within five steps on the lineage tree were extracted for pattern-detection. (B) For each triplet of cell types (left), each gene’s expression level can have the clear minimum pattern in either the root or the leaves (right, top box) where the distribution of gene expression levels in one cell type is well separated from the other two (left, p<0.005 in a two sample t-test); clear maximum pattern (left, bottom) in the root or the leaves: the gene has a clear maximum in one of the cell types (right, p<0.005 in a two sample t-test). (C) A histogram of the number of genes showing the clear minimum pattern among the 150 triplets with known developmental topology. Triplets in which the root has the most genes showing the pattern shown in red; triplets in which one of the leaves has the most genes showing the pattern shown in blue. None of the triplets with more than 10 genes showing the pattern have the most genes with a clear minimum in the root (no red in any histogram bar except for the left most)(D) Principal component analysis of microarray data from the cell types in B- and T- cell development does not reflect known lineage relationships in (A).

DOI: http://dx.doi.org/10.7554/eLife.20488.002

Figure 1—source data 1. Triplets used for pattern detection.
Sheet 1: The 150 related triplets used for pattern detection are listed, with the root cell type of the triplet in the first column and the two leaves in second and third columns. Cell type names follow (Heng et al., 2008). For each triplet are listed the number of genes with a clear minimum in each cell type and a clear maximum in each cell type (t-test p-value < 0.005). Also listed for each triplet are the length of the triplet, whether the triplet is a lineage progression or a cell fate decision, whether the triplet contains a terminal node, and the entropy in the clear minimum pattern and clear maximum pattern. Sheet 2: The 100 unrelated triplets are listed; cell type names follow (Heng et al., 2008). For each triplet are listed the number of genes with a clear minimum in each cell type and a clear maximum in each cell type (t-test p-value < 0.005) and the entropy in the clear minimum pattern and clear maximum pattern.
DOI: 10.7554/eLife.20488.003
Figure 1—source data 2. List of mouse transcription factors.
The list of human transcription factors used to cluster and infer lineages for the hematopoietic and intestinal development trees. The list was adapted from (Thomson et al., 2011).
DOI: 10.7554/eLife.20488.004

Figure 1.

Figure 1—figure supplement 1. Clear minimum and clear maximum patterns.

Figure 1—figure supplement 1.

(A, B) We used a two-distribution t-test to quantify the likelihood of a gene (each denoted by a single dot) showing either the clear minimum (A) or clear maximum pattern (B).We show the p-value from this t-test and the cell type in which the gene shows differential expression. The p-value cutoff is used to identify genes which show these patterns with high significance (dotted line). In this particular triplet, there are fewer genes with a clear minimum and with a clear maximum in the root of the triplet (B.T1) than in either leaf, as seen by fewer genes in the first row to the right of the cutoff. (C) A histogram of the number of genes showing the clear maximum pattern among the 150 triplets with known developmental topology. Triplets in which the root as the most genes showing the pattern are shown in blue. The clear maximum pattern does not show strong predictive power over the lineage relationships between triplets, and even when a large number of genes exhibit the pattern, the majority often reach their maximum in a leaf.
Figure 1—figure supplement 2. The clear minimum pattern is observed across different types of triplets.

Figure 1—figure supplement 2.

(A) Histogram of the number of genes showing the clear minimum pattern evaluated over all 25,194 genes among the 150 triplets with known developmental topology. Triplets in which the root has the most genes showing the pattern shown in red; triplets in which one of the leaves has the most genes showing the pattern shown in blue. (B) Histogram of the number of genes showing the clear minimum pattern evaluated over transcription factors, using FDR-adjusted p-values, among the 150 triplets with known developmental topology. Triplets in which the root has the most genes showing the pattern shown in red; triplets in which one of the leaves has the most genes showing the pattern shown in blue. (C) Number of genes showing the clear minimum pattern in the three cell types of each triplet. The number of genes in the root cell type of each triplet is indicated in blue; the number of genes in the two leaves are represented in yellow and red. The triplets are ordered according to the total number of genes showing the pattern in the triplet. (DF) A histogram of the number of genes showing the clear minimum pattern. As in Figure 1C, triplets in which the root has the most genes showing the pattern shown in red; triplets in which one of the leaves has the most genes showing the pattern shown in blue. Triplets are shaded according to different characteristics: length of triplet (D), presence of terminal node in triplet (E), cell-fate decision or lineage progression (F).
Figure 1—figure supplement 3. Distinction between related and unrelated triplets.

Figure 1—figure supplement 3.

(A) Plot, for triplets of related and unrelated cell types, of the entropy S of the fraction of genes downregulated in the 3 cell types (x-axis) and the total number of genes showing a minimum in any triplet for 100 unrelated and 150 related triplets (Figure 1—source data 1). Related triplets are shown in blue or red, depending on whether the cell type with the most genes showing the downregulated pattern is the root (red) or a leaf (blue), as in Figure 1C. Unrelated triplets are shown in green. For each triplet, we computed the entropy S=i=A,B,Cfilog(fi), where fi is the fraction of the genes which reach a clear minimum in cell type i = A, B, C. Jitter added to triplets with S = 0 for clarity. (B) Histogram of the maximum inferred probability of a non-null topology in the Bayesian algorithm for the set of related triplets (top) and the set of unrelated triplets (bottom). (C) Receiver operating characteristic (ROC) curve for the probability of a non-null topology (B) for binary classification of the related and unrelated triplets. The area under the curve (AUC) is 0.96.

We identified 150 triplets of cell types with experimentally verified lineage relationships from B- and T- cell development (Heng et al., 2008) (Figure 1—source data 1). Three such triplets are shown in Figure 1A. These triplets constituted both cell fate decisions (for example, cell type A gives rise to cell type B and C) and lineage progressions (cell type B gives rise to cell type A which then gives rise to cell type C). For each triplet, we noted which cell type was the progenitor or intermediate cell type (‘root’ cell type A) and which cell types were not (‘leaf’ cell types B and C). Note that even in the case in which cell type B gives rise to cell type A which gives rise to cell type C, we will refer to A as the ‘root’ and B and C as ‘leaves’ for that particular triplet. We analyzed transcription factor gene expression data for these triplets from the Immunological Genome Consortium (Heng et al., 2008) since transcription factors are the ultimate drivers of cell fate decisions.

Surprisingly, we found that specific expression patterns involving only single genes correlated well with lineage relationships between three cell types. Genes that are not differentially expressed within a triplet of cell types convey no information about relationships between the cell types. Therefore, we select genes with expression variability among the three cell types. The expression pattern of such genes can belong to one of only two possible patterns: (Figure 1B): the clear minimum pattern: the gene has a clear minimum level in one of the three cell types, with its distribution of gene expression levels being well-separated from the other two (p<0.005 in both two sample t-tests between the minimum cell type and the other two cell types; Figure 1—figure supplement 1A); the clear maximum pattern: the gene has a clear maximum level in one of the cell types (p<0.005 in both sample t-tests; Figure 1—figure supplement 1B). Note that both patterns can be satisfied simultaneously if the distribution of expression levels in the three cell types are all well-separated with a clear maximum and minimum. We tested if either of the two patterns correlated with the lineage topologies between the three cell types with known lineage relationships (Figure 1C, Figure 1—figure supplement 1C).

Triplets of cell types can be separated into categories based on how many genes exhibit the aforementioned patterns. 56% of the triplets contained more than 10 genes exhibiting the clear minimum pattern, and in 100% of these triplets the majority of genes with expression fitting this pattern reached their minimum expression in one of the leaves (Figure 1C) and never in the root of the triplet. The frequency with which the pattern correctly indicated the lineage relationship increased with the number of genes within a triplet exhibiting the pattern, thus suggesting a confidence measure. The genes showing the clear minimum pattern fell into two distinct groups, corresponding to whether the minimum expression level was in one or the other of the two leaves (Figure 1—figure supplement 1A; Figure 1—figure supplement 2C). Thus the expression pattern of the total set of clear minimum genes correlated with the topology. Since genes showing the clear minimum pattern correlated with lineage relationships (Figure 1C) between cell states both in the case of branches and linear sequences of cell state transitions, we refer to them as transition genes.

We further verified that the clear minimum pattern could be observed (a) in the set of all 25,194 genes (Figure 1—figure supplement 2A), (b) using FDR-adjusted p-values (Benjamini and Hochberg, 1995) (Figure 1—figure supplement 2B), (c) in triplets of different lengths along the lineage tree (Figure 1—figure supplement 2D), (d) in triplets both containing only internal nodes and including terminal nodes (Figure 1—figure supplement 2E), (e) and in both lineage progression and cell fate decision triplets (Figure 1—figure supplement 2F).

The clear maximum pattern was a poorer indicator of lineage relationships (Figure 1—figure supplement 1C). 83% of the triplets had more than 10 genes exhibiting this pattern, but 10% of those showed the majority of genes with expression fitting this pattern reaching their maximum in the root while the others did so in the leaves. Crucially, the integrity of the relationship between the clear maximum pattern and lineage topology was not correlated with the number of genes exhibiting the pattern (Figure 1—figure supplement 1C). While the clear maximum pattern did not correlate with lineage relationships, genes exhibiting this pattern identify individual cell types, and therefore we will refer to them as marker genes.

There are many examples of genes known to be functionally important for lineage decisions whose expression patterns fit the clear minimum pattern. In the case of lateral inhibition commonly used during development, progenitor cells express genes together (for example, Notch and Delta) which are differentially expressed in the differentiated states (only Notch or only Delta) (Perrimon et al., 2012) reaching the minimum expression level in one of the leaves. The same pattern is also seen in multiple examples of lineage decisions often involving mutual inhibition, where key genes expressed in the progenitor are differentially regulated in the progeny (Graf and Enver, 2009; Qi et al., 2013; Thomson et al., 2011; Zhang et al., 1999). In each of these cases, key genes reach minimal expression levels in one of the leaves of the triplets.

The observation that genes exhibiting the clear minimum pattern are correlated with the lineage topology of a triplet of cell types further revealed that (i) only this fraction of transcription factors can be useful for inferring lineage relationships, and (ii) the identity of this fraction depends on which group of three cell types were analyzed. As the subset of genes that are informative varies based on the triplet of cell types being considered, establishing lineage relationship between all cell types at once as opposed to three at a time could be challenging. Indeed, our attempt to reconstruct the lineage relationships between all cell types using methods based on PCA failed (Figure 1D).

We further evaluated the clear minimum pattern in 100 triplets in which there was no clear relation between the cell types (Figure 1—source data 1). We found that while there were a substantial number of genes exhibiting a clear minimum in one of the three cell types in unrelated triplets, their minima were evenly distributed amongst the cell types. To quantify that the minima were evenly distributed, we counted the fraction of genes fi which reached a clear minimum in cell type i = A, B, C (for A, B, and C unrelated cell types), and for each triplet, we computed the entropy S=i=A,B,Cfilog(fi). We compared the distribution of the entropy and the number of genes showing a minimum in any triplet for unrelated and related triplets (Figure 1—figure supplement 3A). The unrelated triplets have higher entropy and typically more genes with a minimum level. This suggested that unrelated triplets show a distinct pattern from the related triplets.

Using patterns to infer lineages

Together, these observations suggested a strategy for inferring the lineage topology between three cell types: each gene showing the clear minimum pattern with a minimum expression level in a particular cell type increases to the probability that this cell type is not the root of the topology (Figure 2A). We next developed a statistical machinery to systematically detect this pattern in gene expression data and to use the resulting sparse subset of genes to infer lineage relationships between three cell types at a time. We then used the inferred relationships between all sets of three cell types as constraints to determine the full developmental lineage tree.

Figure 2. Identification of topology and transition genes (showing clear minimum pattern) for each triplet of cell types.

(A) Schematic for the statistical inference of lineage topology for 3 cell types. Genes with a clear minimum pattern indicate which cell types that are not the root (see Figure 1C) and hence allow inference of the topological relationship. (B) Dot plot (each dot representing a gene) of the cell type that is most likely to have the minimum mean expression of each gene among CMP, ST and MPP as a function of the odds 𝒪i of that gene being a transition gene. Each gene votes against the topology whose root has the minimum mean among the three cell types, and this vote is weighted by the odds that the gene is a transition gene (Equation 1). Two groups of genes, labeled by their names, have high odds of being transition genes and thus cast a strong vote against CMP or MPP being the root.(C) The computed probability of the topology given gene expression data indicates 0.84 probability that ST is the intermediate type.(D) The plot of the probabilities of genes being transition genes for triplet ST/MPP/CMP, given gene expression data and that the topology is MPP – ST – CMP. The names of the 10 genes with the highest probability of being transition genes are shown. Probabilities are calculated assuming the prior odds p(βi=1)p(βi=0) = 0.05 (see main text). There are two classes of transition genes: one for which gene expression in CMP is greater than expression in MPP (regular font), and another for which gene expression in MPP is greater than expression in CMP (bold font).(E) Plot of the replicates of ST, MPP and CMP in the gene-expression space of the two classes of transition genes (with probability > 0.8). Plotted on each axis is the mean normalized log expression level of the transition genes in the class, each class is denoted in curly brackets by the name of the transition gene with the highest probability.(F) Dot plot for triplet MEP/GMP/FrBC of the cell type that is most likely to have the minimum mean expression as a function of the odds Oi of that gene being a transition gene.(G) The computed probability of the topology given gene expression data is the null hypothesis (p=0.99).

DOI: http://dx.doi.org/10.7554/eLife.20488.008

Figure 2—source data 1. Early hematopoietic cell types considered.
Listed for each early hematopoietic progenitor whose name is abbreviated in this paper are the Immunological Genome Project descriptor for the cell type, its common name and phenotype, its age and location, and the number of replicates in the data set.
DOI: 10.7554/eLife.20488.009
Figure 2—source data 2. Probabilities of topologies for triplets of hematopoietic cell types.
Listed are, for each triplet of cell types, the probabilities of the four topologies for prior odds p(βi=1)p(βi=0)=0.05; the number of topologies that reach probability p(T | {giA,B,C})>0.6 for some value of p(βi=1)p(βi=0)  between 10−6 and 102; the non-null topology that has the highest probability p(T | {giA,B,C}) over the range of prior odds (if the null topology is the most likely topology for the entire range of prior odds, the topology is marked ‘null’); and the value of highest probability p(T | {giA,B,C}) over the range of prior odds; the correct topology and triplet length in the traditional model; and the correct topology and triplet length in the Adolfsson model.
DOI: 10.7554/eLife.20488.010
Figure 2—source data 3. Probabilities of transition and marker genes for the hematopoietic lineage tree.
Listed are, for crucial triplets along the lineage tree, the 200 genes with the highest probabilities of belonging to the two transition gene classes and the root marker class, and their associated probabilities.
DOI: 10.7554/eLife.20488.011

Figure 2.

Figure 2—figure supplement 1. Probability of topology depends on prior odds.

Figure 2—figure supplement 1.

(A) Two models of the hierarchy of early hematopoietic progenitors, both built based on prospective isolation of lineage restricted progenitors, include (left) the traditional model, in which the first split strictly separates myeloid and lymphoid lineages (Akashi et al., 2000; Kondo et al., 1997; Reya et al., 2001) and (right) an alternative hierarchy proposed by Adolfsson and colleagues, in which lymphoid progenitors subsequent to the first split retain some myeloid potential (Adolfsson et al., 2005). (B) The inferred probability of different topological configurations as a function of the prior odds of genes being transition genes. At low values of the prior odds, very few genes are given substantial votes in the inference procedure. The topology MPP-ST-CMP has high probability for a range of prior odds. (C) Genes with high probability of being marker genes for MPP (bold), CMP (plaintext), or ST (italics). (D) The triplet MEP, GMP, FrBC shows no significant probability of a non-null topology for a wide range of prior odds. In this case, our inference method makes no claim as to the topological configuration based on the gene expression evidence provided. (E) Histogram of the maximum inferred probability of a non-null topology for the 165 triplets of hematopoietic progenitors. (F) Projections of the early hematopoietic progenitors along first two principal components. (PC1: 30%; PC2: 17%). Each point represents a different replicate. Note the proximity between ETP and MPP samples (red circle), which does not reflect either of the lineage models shown in (A).
Figure 2—figure supplement 2. Plots of the length of the triplets distinguishing the traditional model and in the Adolfsson model.

Figure 2—figure supplement 2.

Separate plots are shown for triplets whose inferred topology is consistent with only the Adolfsson model (A) and the traditional model (B) Triplets whose inferred topology is null are shown in turquoise; those whose inferred topology is non-null are shown in black. Jitter was added to the points for clarity.

The classifications of genes as transition or marker genes in Figure 1 were based on p<0.005 in a two-sample t-test. To implement such classifications probabilistically without arbitrary cutoffs we developed a statistical framework to infer the lineage relationships between each set of three cell types A, B, and C and find the key sets of transition genes (those genes that show the clear minimum pattern), given gene expression data {giA,B,C} in those cell types. We determined the probability of any possible topological relationship between the cell types T= {A, B, C, } referring to cell type A, B, or C being the root of the triplet, or  which corresponds to the case where the data does not suggest any lineage relationship between the three cell types because either no significant pattern could be detected, or multiple genes exhibiting the minimum pattern suggested conflicting topologies. Rather than an absolute classification of genes as showing a pattern or not, we calculated the probability p(αi=1  | giA,B,C) of each gene i being a marker gene (denoted by αi=1), i.e. gene i showing the clear maximum pattern, and the probability p(βi=1 | giA,B,C) of it being a transition gene denoted by βi=1, i.e. gene i  showing the clear minimum pattern (Materials and methods).

We derived the probability of a given topology T given the expression data as (Materials and methods):

p(T | {giA,B,C})i(1+32𝒪i[1p(μTi ismin | giA,B,C)]),

where, 𝒪i=p(βi=1 | giA,B,C)/p(βi=0 | giA,B,C) is the odds of gene i being a transition gene and thus having a unique minimum. The term p(μTi ismin | giA,B,C) is the probability that the mean μTi of the distribution of the expression levels of gene i in the root cell type T is less than the mean in the other two cell types. The odds implicitly contains the only free parameter in our analysis, the prior odds p(βi=1)p(βi=0), which defines the number of genes we expect to show the clear minimum pattern a priori, and functions as a sparsity parameter for the inference. Qualitatively, in the above equation, every gene casts a vote p(μTi ismin | giA,B,C) against the cell type T in which its mean expression is minimal being the root. Further, this vote is weighted by the odds 𝒪i of gene i being a transition gene. Thus, genes with a clearer minimum pattern get larger votes in determining which cell type is not the root. In practice, these quantities are computed numerically (Materials and methods).

We note further that if a substantial number of genes cast votes against each of the cell types, then the probability of the null topology increases. We computed the probability of obtaining the null topology among the 150 related triplets and 100 unrelated triplets from our training set. The distribution of the probability of obtaining the null topology was considerably different between the related triplets and the unrelated triplets, with an AUC of 0.96 (Figure 1—figure supplement 3B–C).

Application to hematopoietic gene expression data

We used our statistical framework to recreate the lineage of early hematopoietic differentiation. We considered 11 early hematopoietic progenitors from the ImmGen Consortium microarray data set (Heng et al., 2008) (Figure 2—source data 1). These cell types and their associated relationships were not included in the data set used earlier to study the correlations of the two patterns and lineage topologies. Several features of the early hematopoietic lineage tree are debated (Adolfsson et al., 2005; Iwasaki and Akashi, 2007) (Figure 2—figure supplement 2A). Given only the gene expression data for these different subpopulations of cells, we determined the lineage relationships and the key factors associated with each lineage decision. We calculated the probabilities of topology and marker and transition genes for the (113)=165 possible triplets of cell types using our statistical framework (Figure 2—source data 2). To illustrate our method, we first described the analysis of the expression data from two such triplets of cell types: CMP/ST/MPP and MEP/GMP/FrBC (Figure 2B–G). We then assembled the triplets to form an undirected lineage tree (Figure 3; Video 1).

Video 1. Tree-building process for early hematopoietic lineage.

Download video file (835.5KB, mp4)
DOI: 10.7554/eLife.20488.018

Animation of the triplet assembly and pruning process for reconstructing the early hematopoietic lineage. For illustrative purposes, triplets (with p>0.6) are successively selected at random (in practice, the assembly and pruning process was performed on all triplets simultaneously; the resulting tree does not depend on the order in which the triplets are selected). The nodes of the current triplet are highlighted in yellow; if a topology is recognized for the triplet, the root is shown in green and the leaves in yellow, and the triplet edges are shown in magenta. If adding the triplet causes another triplet to be pruned, the soon-to-be-pruned (i.e. offending) edge is highlighted in red. The resulting pruned graph is then shown before adding the next triplet. As more triplets are considered, more edges between nodes are added and then pruned, leading to the final tree.

DOI: http://dx.doi.org/10.7554/eLife.20488.018

Figure 3. Reconstruction of lineage tree and key transition gene for early hematopoiesis.

(A) Final lineage tree, recapitulating the inferred triplet topologies, with top inferred transition genes indicated along cell fate decisions. (B) Plot of the replicates of different cell types in the gene-expression space of the transition gene classes (probability > 0.8) for 4 cell-fate transitions along the inferred lineage tree in (A). Plotted on each axis is the mean normalized log expression level of the transition genes in the class. The axis labels and data points are color-coded according to the colors in (A). (C) Table with selected transition genes for early hematopoietic cell-fate transitions, along with references to published validations of their functional role. Genes known to be effective for reprogramming are shown in bold.

DOI: http://dx.doi.org/10.7554/eLife.20488.014

Figure 3—source data 1. Marker genes for early hematopoiesis.
Table with selected marker genes for early hematopoietic cell types, along with references to published validations of their functional role. Genes known to be effective for reprogramming are shown in bold.
DOI: 10.7554/eLife.20488.015

Figure 3.

Figure 3—figure supplement 1. Reconstruction of lineage tree from individual triplets.

Figure 3—figure supplement 1.

(A–D) Probability of all topological configurations for a triplet as a function of prior odds for all triplets between 4 cell types: MPP, ST, LT, and CMP. (A) The LT-ST-MPP triplet, with p=0.99. (B) The CMP-ST-MPP triplet, with p=0.95. (C) The LT-ST-CMP triplet, with p=0.99. (D) The MPP-LT-CMP triplet, with p=0.65. Note that the next highest topology here is the null topology. (E) These four triplets can be combined to infer a simple relationship between all four cell types that is consistent with the topological restrictions inferred from each high probability triplet.
Figure 3—figure supplement 2. Inferred lineage tree and transition genes for intestinal development.

Figure 3—figure supplement 2.

(A) Inferred lineage tree from intestinal single cell data taken from (Grün et al., 2015).Cluster identifications were taken from RaceID2 (Grün et al., 2016), and a similar tree was inferred from StemID ([Grün et al., 2016], not shown). Using 1369 transcription factors, we inferred these lineage relationships between all cell clusters with more than 10 cells per cluster. The tree differs from that in (Grün et al., 2016) only in the progression from C2 – C8 – C4, while we infer the triplet C8 – C2 – C4 with p>0.99, both of which are consistent with literature (van der Flier and Clevers, 2009), and await experimental validation. (B) High-probability transition genes with known relevance in intestinal homeostasis or differentiation taken from the top 20 transition genes for each lineage progression.

Following Equation 1, each gene votes against the topology whose central node has the minimum expression of that gene among the three cell types, and this vote is weighted by the odds that the gene is a transition gene. To illustrate this for the triplet of cell types CMP, ST and MPP, we plotted the topology each gene voted most against, i.e. the topology T for which p(μTi ismin | giCMP,ST,MPP) is the maximum, versus the odds 𝒪i of that gene being a transition gene (Figure 2B).

We find two groups of genes that are much more likely to be transition genes than any of the other genes, with values of Oi102 compared to 100 at most for other genes (Figure 2B, regular and bold fonts). These two groups of genes have a large value for either p(μCMPi ismin | giCMP,ST,MPP) or p(μMPPi ismin | giCMP,ST,MPP) and thus vote against T=CMP (cell type CMP is the intermediate) or against T=MPP (cell type MPP is the intermediate). Together these genes that have a high odds of being transition genes appear to most support topology T=STCMPSTMPP.

In fact, the intuition in Figure 2B is borne out in the calculation of p(T | {giCMP,ST,MPP}). Using Equation 1 above and assuming a sparsity parameter of 0.05, we calculate that there is an 84% chance that the topology is ST (Figure 2C; Figure 2—figure supplement 2B). Although gene Irf8 (Figure 2B, italic font) is strongly downregulated in ST and is expressed at higher levels in CMP and MPP (Figure 2C), we note that its signal is overwhelmed by the large number of genes downregulated in either CMP or MPP, illustrating the statistical nature of the framework.

For each triplet, we evaluated each gene’s probability of being a transition or marker gene (Figure 2—source data 3). Figure 2D shows the names and associated probabilities of the 12 genes most likely to be transition genes for the triplet CMPSTMPP. The transition genes fall into two groups, corresponding to the two groups in Figure 2B. One group, which includes genes Gata1, Dach1, and Gata2, has higher expression in CMP than in MPP; the other group, which includes Hlf, Tsc22d1, and Hoxa9, has higher expression in MPP. Although the values of the probabilities of the genes being transition genes vary with the value of the sparsity parameter, the relative order of different genes does not change. The genes identified include many genes previously identified as being important for lineage specification (Crispino, 2005; Gazit et al., 2013; Miyawaki et al., 2015). The transition genes we discovered thus not only have gene expression patterns that reflect the lineage decision but also include functionally important genes.

In addition to the transition genes, we identified marker genes (p(αi=1 | {gi},T)>0.8) present only in ST (including Mpl and Rai14, consistent with [Solar et al., 1998]) and then symmetrically downregulated in both CMP and MPP (Figure 2—figure supplement 1C). Marker genes for CMP include Srf, Zeb2, Rbpj and Irf8 (consistent with [Goossens et al., 2011; Kurotaki et al., 2013; Ragu et al., 2010; Robert-Moreno et al., 2005; Tamura et al., 2000]); marker genes for MPP include Satb1, consistent with (Satoh et al., 2013). Although these genes were not used to determine the topology, they are good markers for cell types ST, CMP and MPP.

Plotting the cell types using the mean expression levels of the two transition gene class captures the fork in the gene expression space associated with the cell-fate decision (Figure 2E). In contrast with the PCA analysis of the cell types (Figure 2—figure supplement 1E), in which MPP appears to be an intermediate between the hematopoietic stem cell types (LT and ST) and CMP, the projection of the cell types onto the transition-gene subspace shows that ST splits into CMP and MPP.

In contrast to the case of the triplet of cell types CMP/ST/MPP, for triplet MEP/GMP/FrBC, the distributions of genes supporting different topologies are similar (Figure 2F). Thus the most likely topology calculated using Equation 1 is the null hypothesis (99%), which is that transition genes, if they exist, do not have patterns that depend on the cellular topology (Figure 2G, Figure 2—figure supplement 1D), in which case there is insufficient evidence to classify the triplet according to a particular non-null topology. The distribution of the maximal probability of non-null topologies in the different triplets is heavily concentrated near 1, allowing for clear separation between null and non-null triplets (Figure 2—figure supplement 1E). Null topologies were identified by the algorithm for triplets with cells that are from three terminal nodes (for example, the triplet MEP/GMP/FrBC) or from triplets that contain very distantly related triplets (for example, LT/CLP/ETP) (Figure 2—source data 2).

A lineage tree for early hematopoiesis

We next reconstructed the early hematopoietic lineage tree and identified transition genes involved in the different cell state transitions using all the non-null triplet relationships as constraints on the lineage relationships between all cell types. Out of the 165 possible triplets of hematopoietic progenitors, 144 showed one single non-null topology with probability greater than 0.6 over a range of the prior odds from 10−6 to 102. We next determined an undirected graph that recapitulates all of the individual triplet topologies (note that we are only inferring triplet topologies and are not inferring directionality). For example, although triplet CMP/LT/MPP has topology CMP – LT – MPP (Figure 3—figure supplement 1A), we could determine that LT cannot be the direct progenitor of CMP or MPP, because ST is an intermediate between LT and both cell types (Figure 3—figure supplement 1B–C). We could thus 'prune' this triplet when inferring the full graph (Figure 3—figure supplement 1D–E). A visualization of the pruning process is shown in Video 1, where successive triplets are added to the graph, creating new edges and pruning others, leading to the final undirected tree. In practice, though, the pruning process was performed on all triplets simultaneously, not in succession.

The ST/CMP/MPP triplet (Figure 2B–E) immediately distinguishes between two competing models regarding the hierarchy of early hematopoietic progenitors. According to the traditional picture (Iwasaki and Akashi, 2007), MPP is the progenitor of CMP, and ST is the progenitor of MPP – therefore MPP should be an intermediate between ST and CMP and the topology of triplet ST/MPP/CMP should be ST – MPP – CMP (Figure 2—figure supplement 1A, left). According to a model suggested by Adolfsson and colleagues (Adolfsson et al., 2005), ST splits into CMP and MPP (Figure 2—figure supplement 1A, right), and the topology should be CMP – ST – MPP. We identify both CMP – ST – MPP and CMP – LT – MPP as the correct topologies, lending support to the Adolfsson model. The Adolfsson and traditional models differ in the topology of 9 triplets. The inferred expected topologies of 8 out of these nine triplets support the Adolfsson model, which led to the identification of the final tree (Figure 2—figure supplement 2).

The lineage tree that we determined is consistent with the Adolfsson model and contains three additional lineage decisions (Figure 3A–B). First, CMP gives rise to MEP (megakaryocyte/erythroid progenitor) and GMP (granulocyte/macrophage progenitor). Second, MPP gives rise to MLP (multilineage progenitor), which then splits into the GMP and CLP (common lymphoid progenitor) cell types. In the final lineage decision, CLP gives rise to the ETP (pre-T) and FrA (pre-pro-B) cell types.

For each triplet of cell types along the tree, we identified transition and marker classes of genes. Among the 14 triplets that contained only adjacent cell types, we identified on average 24 marker genes per cell type and 25 transition genes (probability threshold of 0.8, prior odds p(βi=1)/p(βi=0) =0.05). Many genes we discovered as belonging with high probability to the transition and marker classes of genes at each lineage decision are known in the literature to be functionally important genes, including classic hematopoietic regulators such as Cebpa, Sfpi1, Gata1, Satb1, Irf8 and Ebf1 (see full tables with references in Figure 3C and Figure 3—source data 1). Additionally, the genes identified include many genes successfully used in hematopoietic reprogramming experiments, including Gata2 and Pbx1 (Figure 3C). Together these observations suggest that the sparse subspace of transition and marker genes identified by our framework not only allows for accurate reconstruction of the lineage hierarchy but also constitutes a set of candidates for relevant biological functions.

As further validation of the inference method, we compared it to the method proposed by Grun et al. on a single-cell intestinal development data set (Grün et al., 2015, 2016). We inferred lineages between each cell type based on their cluster identifications, excluding clusters with fewer than 10 cells, and constructed an undirected lineage tree by taking triplets with probability > 0.6 and applying the pruning rule (Figure 3—figure supplement 2A). The only disagreement between the two methods is the progression from crypt base columnar cells (C2) to different populations of Goblet cells (C4 and C8). Grun et al. hypothesize a C2 – C8 – C4 progression, while we infer the triplet C8 – C2 – C4 with p>0.99, suggesting that the progenitor C2 gives rise to both differentiated Goblet subpopulations. Both lineage trees are supported by the literature (van der Flier and Clevers, 2009). The high probability transition genes included many factors well known for their roles in tissue homeostasis and development (Figure 3—figure supplement 2B), notably Klf4 (Yu et al., 2012), Atoh1 (VanDussen and Samuelson, 2010), Spdef (Noah et al., 2010), Foxa1/Foxa2 (Ye and Kaestner, 2009), and Tcf3 (Merrill et al., 2001).

Inferred lineage tree for human excitatory neuronal progenitors from in vitro single-cell data over 80 days of differentiation

The ease with which single-cell transcriptomic data can be generated (Grün et al., 2015; Jaitin et al., 2014; Macosko et al., 2015; Patel et al., 2014; Paul et al., 2015; Treutlein et al., 2014; Zeisel et al., 2015) presents an opportunity to understand the dynamics of the underlying networks that lead individual cells to their final fate. We studied the differentiation of stem cells both into germ layer progenitors (Jang et al., 2017) and into cortical neurons. To study the latter, we analyzed single-cell gene expression data from 2217 cells from an in vitro differentiation protocol for early human neuronal development (Yao et al., 2017). Briefly, human embryonic stem cells (hESCs) were subjected to a SMAD inhibition-based cortical induction phase, a progenitor expansion phase, and a neural differentiation phase. Single cells were sorted at 12, 26, 54, and 80 days into differentiation, and their gene expression was profiled using the SMART-Seq2 technique (Picelli et al., 2013). In the initial clustering of the single-cell data, dimensionality reduction by PCA (into 15 above-noise components) followed by t-SNE (Van Der Maaten and Hinton, 2008; Satija et al., 2015) showed separation by day and SOX2 expression (Figure 4A). However, the number of predicted clusters varied depending on the perplexity parameter (Figure 4—figure supplement 1A–B). In addition, no clear lineage or distance relationship among the putative types is immediately apparent from this clustering. Analysis of this data with other recent methods such as Monocle and Monocle2 (Trapnell et al., 2014), TSCAN (Ji and Ji, 2016), and StemID (Grün et al., 2016) did not clearly reconstruct lineage or infer key genes regulating transitions (Figure 4A – Bottom, Figure 4—figure supplement 2). Monocle2 (Figure 4—figure supplement 2A) produces a tree with complex branching, but SOX2+ progenitors and DCX+ differentiated neurons do not clearly separate.

Figure 4. Inference of lineage tree and key transitions genes using single cell expression data from in vitro differentiated developing human brain. .

(A) RNA-seq data from single cells collected at days 12, 26, 54, and 80 from a human brain in vitro differentiation protocol (Yao et al., 2017) were analyzed using a variety of existing methods. Partitioning single-cells into cell types through non-linear dimensionality reduction using t-SNE (top) depends on the perplexity parameter (set here to 5, see Figure 4 – Figure Supplement 1A-B) and does not allow for mechanistic understanding. Independent component analysis of all transcription factors with Monocle (bottom) does not show clear structure and could not inform reconstruction of lineage relationships. (B) Maximization algorithm to determine most likely cluster identities {C}{c1,c2,,cn}, sets of transitions {T}, marker genes (αi=1) and transition genes (βi=1), given single-cell gene expression data {gi}. Starting from a seed clustering scheme {C0}, iterative maximization of the conditional probabilities p({T},{αi},{βi}|{gi},{C}) and p({C}|{gi},{T},{αi},{βi}) converges to most likely set ({C},{T},{αi},{βi}) (C) Cell-cell covariance matrix between cells using only the associated high probability marker and transition genes show the final cluster assignments c0, c2 and  c3 (right) in contrast to using all transcription factors (left). (D) Selected high probability triplets of clusters plotted in the axes defined by two sets of transition gene classes for each triplet. c1 c0 c2 (top right, p(T=c0 | {gic0,c1,c2})> 0.99), plotted in transition gene class {CEBPG} also including POU3F1, POU3F2, NR2F1, NR2F2, ARX, LIN28A, TOX3, ZBTB20, PROX1 and SOX15, and class {DMRTA1} also including HES1, HES5, FOXG1, PAX6, HMGA2, SOX2, SOX3, SOX9, SOX6, SP8, OTX2, TGIF, ID4, TCF7L2, and TCFL1.  c2 c0 c3 (top left, p(T=c0 | {gic0,c2,c3})= 0.96), plotted in transition gene class {LHX2} also including FEZF2, FOXG1, HMGA1, SP8, OTX1, SOX11, GLI3, SIX3, ETV5, and class {POU3F2} also including GTF2I, HIF1A, ID1, ID3, PROX1, SALL1, SOX21, TCF12, TRPS1, ZHX2. c1 c0 c3 (bottom left, p(T=c0 | {gic0,c1,c3})>0.99), plotted in transition gene class {FOXO1} also including HMGA2, PAX6, and SOX2, and class {LHX2} also including DMRTA2, HMGA1, ARX, LIN28A, OTX2, LITAF, NANOG, POU3F1, SOX15. c6 c5 c7 (bottom right, p(T=c5 | {gic5,c6,c7})>0.99), plotted in transition gene class {PAX3} also including CRX, SOX11, EBF2, FOXP4, ASCL1, FOXO3, and SIX3, and class {ARGFX} also including DUXA, HES1, NFIB, PPARA, SOX2, SOX7, and SOX9. (E) Correlations between differentiated cell clusters (Figure 4 – Figure Supplement 4D) and bulk population samples from brain regions (in vivo developmental human data) (Miller et al., 2014). Neuronal cell types can be identified with specific spatial regions of the brain to interpret the topology of the lineage tree. Expression signatures of SOX2+ cell types c0, c2 and  c3 were dominated by pluripotency factors, and are not shown. (F) Inferred lineage tree for brain development. Genes associated with neocortical development, and mid-/hind-brain progenitors, and specific neuronal cell types are identified as high probability transition genes and are corroborated by mapping information from in vivo data. Clusters color-coded similarly to (D). D12/26/54/80 labels indicate time of collection of cells within each cell type. Prog refers to SOX2+ cells, Diff refers to SOX2-/DCX+ cells (Figure 4—figure supplement 1C–D).

DOI: http://dx.doi.org/10.7554/eLife.20488.019

Figure 4—source data 1. Final cluster identities of single cells from in vitro cortical differentiation.
Cells were assigned to a cluster based on an iterative clustering procedure. Their final cluster assignments are shown here. Cell types 8–11 contained bulk brain control cells, and non-neuronal cells which were excluded from the analysis.
DOI: 10.7554/eLife.20488.020
Figure 4—source data 2. Probabilities of topologies for triplets of single-cell clusters.
Listed are, for each triplet of clusters, the probability of a given topology and the root of the inferred topology. ‘0’ refers to the null topology. Triplets with p>0.6 were used to construct the lineage tree.
DOI: 10.7554/eLife.20488.021
Figure 4—source data 3. Probabilities of Transition and Marker Genes for the Human Brain Developmental Lineage Tree.
Listed are, for crucial triplets along the lineage tree, the genes with the p>0.8 of belonging to the two transition gene classes and the root marker class, and their associated probabilities.
DOI: 10.7554/eLife.20488.022
Figure 4—source data 4. Human Brain Development SmartSeq2 Census.
Further detail can be found Materials and methods and in Yao et al. (2017) (Supplemental Information).
DOI: 10.7554/eLife.20488.023
Figure 4—source data 5. List of Human Transcription Factors.
List of human transcription factors used to cluster and infer lineages for the cortical differentiation tree. List was adapted from (Ben-Porath et al., 2008).
DOI: 10.7554/eLife.20488.024

Figure 4.

Figure 4—figure supplement 1. Cluster identity and sparse coding in neuronal differentiation.

Figure 4—figure supplement 1.

(AB) t-SNE of developing neuronal cells displays seemingly different number of cell types depending on the perplexity parameter.(C) SOX2 expression clearly falls into bimodal distribution. Cell types with high levels of SOX2 (blue, c0,c2,  and c3) are labeled progenitor cell populations. (D) DCX expression in final clusters is also bimodal, and cell types with high expression (green, c1,c4, c6, and c7) are labeled differentiated (post-mitotic) neurons. (E) We assembled 20 triplets that were inferred to be non-null, and had a maximal distance between leaf nodes of less than 5. Amongst these triplets, we attempted to infer the correct topology using a minimal subset of genes. Restricting the inference to the N genes per triplet with the greatest odds ratio of being transition genes, the tree can be reconstructed accurately for any N ≥ 4.
Figure 4—figure supplement 2. A selection of recent lineage-determination methods for single cell transcriptomic analysis applied to an in vitro neuronal differentiation data set (Yao et al., 2017).

Figure 4—figure supplement 2.

Each of these methods was run with multiple sets of parameters in an attempt to optimize the lineage inference; however, it is possible that an unexplored parameter regime might yield more interpretable results, although it is unclear which parameters these might be. (A) Monocle2 (Trapnell et al., 2014) shows a complex tree with clear progression and multiple branches, but does not separate progenitor cells (DCX- cells) from neurons (DCX+ cells): progenitor cells are known to give rise to neurons, whereas neurons are post-mitotic and do not give rise to progenitors, in contradiction with certain portions of the tree. In addition, portions of the tree are in conflict with the general time point information specifying when cells were collected (Yao et al., 2017). (B) TSCAN (Ji and Ji, 2016) shows separation of DCX- and DCX+ cells along PC dimension 2 (note that cells are not labeled by DCX expression in this plot), but fails to capture structure among the differentiated neurons, particularly a fore-/hind-brain split. (C). StemID (Grün et al., 2016) produces a complex lineage that is not directly interpretable, and no clear separation of progenitors from neurons or forebrain-like from mid/hindbrain-like neurons is apparent.

Analyzing data from single-cell profiling presents an additional challenge relative to data from population-level profiling because cell types are not previously known and must be inferred from the data. Computationally, it is necessary define a measure of similarity in the gene expression profiles of individual cells so as to be able to cluster them and define cell states. Here again, it is necessary to identify the correct gene subspace to use for clustering. Clustering and determining lineage have typically been performed sequentially and treated as independent problems (Satija et al., 2015; Trapnell, 2015; Trapnell et al., 2014). However, we found that it is informative to solve both these problems simultaneously.

In our framework, the relevant feature set for clustering is the set of marker and transition transcription factors from the triplets with non-null topologies. But determination of these sets of genes and of the transition topologies depends itself on knowledge of the cluster identities. Following previous work on sparse clustering (Witten and Tibshirani, 2010), we simultaneously determined optimal clusters, lineage topology and sets of transition and marker genes by iteratively selecting transition and marker genes and clustering the data using this set of features (Figure 4B).

In order to utilize information about developmental distances in clustering, we iteratively maximized a joint distribution, P(T,{C}m|{gi}), over the developmental tree, T, and the set of clusters of cells, {C}m={c1m,c2m,,cKm} (Figure 4B). Starting with a clustering {C}m, we first inferred the set of genes {αi=1} and {βi=1} which were identified in high probability triplets, p(T|{gi}, {C}m)>0.6, and we then re-clustered in this new subspace to obtain the clusters for the next iteration {C}m+1.

Initial analysis based on the gap statistic (Tibshirani et al., 2001) suggested that the single-cell gene expression profiles clustered into 20 clusters of cell types. We chose a seed {C}0 for the iterated clustering procedure by intentionally over-clustering the data into 40 clusters using spectral K-medoids. By being overly discriminative in our initial clustering, we ensured that all genes with differential regulation would be classified as either marker genes or transition genes, and would be preserved in later clustering iterations. We iterated the clustering-inference procedure until the dimension of the re-clustering subspace changed by less than 10% of the total transcription factor space. In this resulting subspace of 469 genes, we finally clustered the cells into the final configuration {C}f={c0, c1,,c7} using K-medoids (Figure 4C, Figure 4—source data 1) where the number of clusters K = 8 was chosen based on the gap statistic (Tibshirani et al., 2001).

We inferred 45 high probability triplets (p(T | {gi})>0.6) between the final cell clusters (Figure 4—source data 2). Four such triplets are shown in Figure 4D, plotted in axes defined by transition genes for each triplet (Figure 4—source data 3). Starting with progenitor cell states (SOX2+, Figure 4—figure supplement 1C), we manually appended cell clusters to the tree according to their time information and in agreement with inferred topological restrictions. The first transition involves the production of day 12 neuronal cell type c1 from day 12 progenitor c0, which is observed in the triplets c1c0c2 and c1c0c3. The c1c0c2 triplet (p(T=c0 | {gic0,c1,c2,})= 0.99, Figure 4D – top left) is mediated by 47 transition genes between c0 and c1 and 87 between c0 and c2 (p(βi=1 | {gi},T)>0.8). Transition genes expressed highly in the c1c0 branch include CEBPG, POU3F1, POU3F2, NR2F1, NR2F2, ARX, LIN28A, TOX3, ZBTB20, PROX1, and SOX15 which have been previously implicated in proliferation of forebrain progenitors (Au et al., 2013; Borello et al., 2014; Cimadamore et al., 2013; Dominguez et al., 2013; Yang et al., 2015) and the migratory behaviors of ganglionic eminences (Kanatani et al., 2008; Kessaris et al., 2014; Lodato et al., 2014; Olivetti and Noebels, 2012; Reinchisi et al., 2012). The c0c2 transition genes include DMRTA1, HES1, HES5, FOXG1, PAX6, HMGA2, SOX2, SOX3, SOX9, SOX6, SP8, OTX2, TGIF, ID4, SOX3, TCF7L2, and TCFL1 which are known to be expressed in forebrain progenitors of the developing neocortex, thalamus, and hypothalamus (Abraham et al., 2013; Pozniak et al., 2010; Shimojo et al., 2011; Tzeng and de Vellis, 1998; Wang et al., 2006), and are known to establish dorsal forebrain regional identity; (Azim et al., 2009; Bani-Yaghoub et al., 2006; Borello et al., 2014; Gaston-Massuet et al., 2016; Hagey et al., 2014; Hutton and Pevny, 2011; Johansson et al., 2013; Kikkawa et al., 2013; Kishi et al., 2012; Manuel et al., 2011; Miyoshi and Fishell, 2012; Ohtsuka et al., 2001; Ross et al., 2003; Shen and Walsh, 2005; Sur and Rubenstein, 2005; Yang et al., 2015; Zembrzycki et al., 2007).

The progenitor cell types form the triplet c2c0c3 (p(T=c0|{gic0,c2,c3})= 0.96, Figure 4D – top right). The c2c0 branch is mediated by 39 transition genes including LHX2, FEZF2, FOXG1, HMGA1, SP8, OTX1, SOX11, GLI3, SIX3, and ETV5 which suggest that c2 is comprised of cortical progenitors (Appolloni et al., 2008; Greig et al., 2013; Kishi et al., 2012; Manuel et al., 2011; Raciti et al., 2013). The c0c3 branch is mediated by 55 transition genes including GTF2I, HIF1A, ID1, ID3, PROX1, POU3F2, SALL1, SOX21 TCF12, TRPS1 and ZHX2, which are associated with mesencephalon and metencephalon regional development, as well as having known involvement with midbrain/hindbrain organizer identity (Buck et al., 2001; Enkhmandakh et al., 2009; Inoue et al., 2012; Jaegle et al., 2003; Kunath et al., 2002; Lavado and Oliver, 2007; Milosevic et al., 2007; Ohba et al., 2004; Uittenbogaard and Chiaramello, 2002; Yao et al., 2017). We additionally inferred the triplet c2c0c3 (p(T=c0 | {gic0,c1,c3})> 0.99, Figure 4D – bottom left), which suggests a three way split from early progenitor c0 into early differentiated neuron c1, and progenitors c2 and c3.

The continuation of the c2 branch is inferred through triplet c0c2c4 (p(T=c2 | {gic0,c2,c4})= 0.97). The c2c4 branch includes transition genes BCL11A, EMX2, FOXP2 and RORB, which are known to be associated with the neocortex and neuronal identity (Cánovas et al., 2015; Ebisu et al., 2016; Greig et al., 2016; Jabaudon et al., 2012; Wiegreffe et al., 2015; Woodworth et al., 2016; Zembrzycki et al., 2007).The triplet c0c3c5 (p(T=c3 | {gic0,c3,c5})> 0.99) meanwhile is characterized by transition genes in the c3c5 branch including ASCL1, FOXP2, PAX3, POU3F4, ZIC1, ZIC4, HOXB2, and EN2, which have been shown to regulate fate acquisition in the midbrain/hindbrain (Agoston et al., 2012; Ang, 2006; Di Bonito et al., 2013; Elsen et al., 2008; Hegarty et al., 2013; Miller et al., 2011; Tan et al., 2014).

The c5 cell cluster differentiates into two distinct clusters of post-mitotic neurons -- c6 and c7  (p(T=c5 | {gic5,c6,c7})> 0.99, Figure 4D – bottom right). The c5c6 branch is inferred from transition genes including PAX3, CRX, SOX11, EBF2, FOXP4, ASCL1, FOXO3, and SIX3 which have strong expression in developing dopaminergic and gabaergic neurons of the midbrain (Agoston et al., 2012; Erickson et al., 2010; Pino et al., 2014; Yang et al., 2015; Yin et al., 2009; Zhang et al., 2002). Transition genes in the c5c7 branch include ARGFX, DUXA, HES1, NFIB, PPARA, SOX2, SOX7, and SOX9, which are known to be associated with the medulla oblongata region of the hindbrain (Fawcett and Klymkowsky, 2004; Kameda et al., 2011; Kumbasar et al., 2009; Madissoon et al., 2016; Matsui et al., 2000; Stolt et al., 2003).

In addition to interpreting these individual transition genes defining the major branch splits, we correlated the expression over all predicted transition and marker genes in neuronal clusters to in vivo developmental human data (Miller et al., 2014). The in vivo data comprise a representative range of microarray data sampled from different parts of the developing brain at post-conception week 15, including forebrain proliferative regions, midbrain, and hindbrain. The differences in data acquisition methods (RNAseq vs. microarray, single-cell vs heterogeneous populations) resulted in relatively low correlations overall, but there are clear associations between individual clusters and specific brain regions (Figure 4E). Specifically, c1, maps to the ganglionic eminences, suggesting an interneuron identity, whereas c5, c6 and c7 show better mapping to mid- and hindbrain structures, and c4 appears to be more closely related to neocortex. Overall, this global comparison, combined with the identification of genes with known regional expression, suggests that the inferred clusters from the in vitro data capture the diversity of differentiation into the early stages of the major neuronal lineages (Figure 4F). These lineage predictions based on our analysis techniques were verified experimentally using viral barcoding in a separate work (Yao et al., 2017).

To estimate the sparsity of the underlying network and to find a minimal subset of genes through which lineage could be inferred, we replicated the analysis while only considering a limited set of genes per triplet. We assembled a collection of 20 triplets with maximal leaf-to-leaf distance of 4 nodes, and non-null inferred topology. For each triplet, we ranked genes based on their odds of being a transition gene, 𝒪i, agnostic of the true topology of the triplet. We then replicated the inference process using only the N genes with the greatest odds (Figure 4—figure supplement 1E). We found that with as few as 4 genes per triplet, the correct lineage topology could be inferred for all of the triplets. Genes with greatest odds comprise a restricted subset of genes for further experimental investigation. Further, these findings suggest that the dynamics of expression of just four specific genes are sufficient to monitor a particular lineage decision in single cells.

Discussion

Finding an informative subspace of variables for data analysis is a general problem in machine learning, both for regression and clustering (Tibshirani, 1996; Witten and Tibshirani, 2010); the innovation in this paper is to use a statistical pattern learned from known biology to inform this subspace search. The approach we take here is complementary to methods that project expression variability onto coordinates of PCA, ICA or t-SNE maps (Marco et al., 2014; Satija et al., 2015; Trapnell et al., 2014), which are combinations of all variables. Searching for sparse representation of the dynamics has the advantage of providing interpretability and experimental direction (McGibbon and Pande, 2017). Following the dynamics of this small set of high-probability transition genes via fluorescent tagging could allow for the tracking of lineage decisions of individual cells in real time. Further, these genes provide a list of candidates for drivers of fate decisions, and hence a set of experimental hypotheses.

Not all genes give us equal information about the dynamics of differentiation. We discovered that genes showing the clear minimum pattern are most predictive of the sequence of lineage transitions during development. Although our pattern discovery and subsequent lineage reconstruction does not assume any functional role for the clear minimum pattern, we note that this pattern is shown by genes known to be regulators of development during hematopoiesis. The same pattern observed in many differentiating systems (Graf and Enver, 2009; Qi et al., 2013; Thomson et al., 2011; Zhang et al., 1999), and is consistent with mutual inhibition. Mutual inhibition, in turn, is hypothesized to play an important role in maintaining multi-stable systems and in mediating transitions between different stable states of multi-stable systems (Ferrell, 2012).

Discovery of sparse representations of the cell states and variability between them demonstrates the efficacy of low dimensional descriptions of the system. Understanding the dynamics and transitions of complex physical systems composed of a large number of variables has been driven by the discovery of low dimensional order parameters (Anderson, 1978; Landau and Lifshitz, 1951). As opposed to measuring and modeling states as high dimensional objects in their native representation, order parameters provide low dimensional descriptions of the states and dynamics, which has proven crucial in developing both qualitative and quantitative models. Finding small subsets of genes which captures the lineage transitions in cells analogously provides a low dimensional subspace that captures the dynamics in genetic networks and can be useful for modeling (Jang et al., 2017). An accompanying paper allows us to exploit this idea to extract mathematical models for the underlying molecular circuits from single cell gene expression data obtained during germ layer differentiation (Jang et al., 2017).

Materials and methods

In vitro neuronal differentiation

Single-cell transcriptomic data from the in vitro neural differentiation procedure was obtained as described in Yao et al. (2017) (Supplemental information):

hESCs were dissociated with Accutase and plated on Matrigel-coated 24-well plates at 2.5 × 105 cells/cm2 in DMEM/F12 (#11330–032), 1 × N2, 1 × B27 without vitamin A, 2 mM Glutamax, 100 μM non-essential amino acids, 0.5 mg/mL BSA, 1X Pen-Strep, and 100 μM 2-mercaptoethanol (referred to as basal medium; all from Thermo Fisher, Waltham, MA) with 20 ng/mL FGF2 (Thermo Fisher) and 2 μM thiazovivin. Cortical induction was initiated by changing to the basal medium with 5 μM SB431542 (StemRD, Burlingame, CA), 50 nM LDN193189 (Reagents Direct, Encinitas, CA) and 1 μM cyclopamine (Stemgent, Lexington, MA) (referred to as NIM). NIM was changed daily for 11 days. On day 12, cells were dissociated and seeded on Matrigelcoated 24-well plates at 5 × 105/cm2 in basal medium with 20 ng/mL FGF2 and 2 μM thiazovivin. Progenitor expansion was initiated on D13 by changing to serum-free human neural stem cell culture medium (NSCM, #A10509–01 from Thermo Fisher) containing 20 ng/mL FGF2 and 20 ng/mL EGF. NSCM was changed daily for 6 days. Cultures were passaged once more on D19 with Accutase and replated at 5 × 105 cells/cm2. On D26, cells were dissociated with Accutase and seeded on 24-well plates sequentially coated with poly-D-lysine (Millipore, Billerica, MA) and laminin (Thermo Fisher) at 1 × 105 cells/cm2 in basal medium supplemented with 20 ng/mL FGF2 and 2 μM thiazovivin. On D27, medium was changed to a 1:1 mixture of DMEM/F12 and Neurobasal medium (#21103–049) supplemented with 100 μM cAMP (Sigma-Aldrich, St Louis, MO), 10 ng/mL BDNF (R and D Systems, Minneapolis, MN), 10 ng/mL GDNF (R and D Systems) and 10 ng/mL NT-3 (R and D Systems) (referred to as ND). Cells were maintained in ND medium for four weeks until day 54 with half medium change every other day. Quality of differentiations was routinely assessed by immunostaining at D12 (PAX6 and DCX), at D26 (LHX2, SOX2, EOMES, POU3F2, and TBR1), and at D54 (MAP2 costained with TBR1, CTIP2, SATB2). In addition flow cytometry at D26 (EOMES, SOX2 and PAX6) was performed. Typically, EOMES at day 26 proved the most valuable quality control metric (~10% of cells by both flow cytometry and immunostaining) and predicted failure at D54. Specifically, when EOMES was low (<1% of cells) differentiations failed and were typically dominated by POU3F2+ cell types and/or non-neural ‘other’ cell types. These failed differentiations were eliminated from further analysis, typically ~20% of experiments (5 of 19 experiments in 2016). 50 bp paired-end Smart-Seq2 libraries were prepared from these cells as previously described (Picelli et al., 2013) and mapped as described in Thomsen et al., 2016 and Yao et al. (2017). As each cell was profiled independently (without pooling before amplification, as in methods such as Cel-Seq, STRT, or Drop-Seq), we did not observe the batch effects present in pooling-based methods. Although cells were profiled in plates (batches), there was no significant plate-related variation - this Fixed single-cell transcriptomic characteriza is most likely due to amplification being carried out separately in each well of the plate, thereby precluding cross-talk among barcodes, or plate-related differences in amplification. This is clear from the mixing of cells from different plates in the clustering. A complete census is provided (Figure 4—source data 4).

Gene expression data

Hematopoietic gene expression data were downloaded from the Immunological Genome Project (Heng et al., 2008); GEO GSE15907) and log-2 transformed. We restricted the genes considered to 1459 transcription factors. Brain development expression data were log-2 transformed, and 1460 transcription factors were individually normalized by dividing by the 90th-percentile expression value. Lists of mouse and human transcription factors are provided (Figure 1—source data 2, Figure 4—source data 5).

Software

Calculations were performed using custom written MATLAB code (The Mathworks) on the Harvard Research Computing Odyssey cluster. Code is available at https://github.com/furchtgott/sibilant. t-SNE was done using the package provided in (Van Der Maaten, 2009). Monocole was run in R (Core Team, 2015; Trapnell et al., 2014).

Description of algorithm

The algorithm proceeds according to the following steps:

  1. Find initial seed clustering configuration {C}0 using K-medoids where K is chosen to be larger than the cluster number inferred from the Gap statistic (Tibshirani et al., 2001)

  2. For all triplets of clusters, find most likely T and {αi} and {βi} given {C}m:
    1. Compute p(giA,B,C | T,αi,βi,{C}m) by integrating numerically over p(μ, σ) (Equations 6, 9, and 10).
    2. Compute p(T,{αi},{βi} | {giA,B,C},{C}m) using Equations 20 and 29.
    3. Identify mostly likely topology T and set of {αi} and {βi}.
  3. Recluster {g} using K-medoids in the space of all {αi} and {βi} for the triplets with probability p(T | {giA,B,C},{C}m)> 0.6 of being non-null. Determine new clustering configuration {C}m+1.

  4. Repeat steps 1 and 2 until convergence of {C}.

  5. Determine the most likely tree connecting cell clusters, recapitulating high-probability triplet topologies.

Each of these steps is described in the following section.

Bayesian framework for inferring cluster identities, state transitions, and marker and transition genes simultaneously

Notation; Bayes’ rule

Given gene expression data from single cells {gi}, we built a probabilistic framework to simultaneously infer cell cluster identities, {C}{cA,cB,}, the sequence of transitions T between these clusters, the key sets of marker genes {αi} that define each cell cluster, and genes {βi} that determine the sequence of transitions between clusters. We maximized the joint probability distribution of these variables given the gene expression data, p(T,{cA,cB,},{αi},{βi}{gi}) to determine the maximum likelihood estimates of these parameters.

We first consider how to solve this problem in the case in which there are three cell clusters, and we will later build a tree using all possible combinations of three cell clusters. Let the set of three cell clusters be cA, cB and cC with gene expression data {giA,B,C} for all genes (i=1 to N) and all cells. The term giA,B,C denotes the expression data for just gene i in cells in clusters cA, cB, and cC. The topology T of the relationships between cell clusters cA, cB and cC can take on four possible values: T=𝒜: cell cluster cA is in the middle (either cA is the progenitor of cB and cC, or cA is an intermediate cell type between cB and cC); T=: cell cluster cB is in the middle; T=𝒞: cell cluster cC is in the middle; or T=: we cannot determine the topology. Complementarily, for each gene i we define variables αi and βi, where αi=1 and βi=0 if the gene is a marker gene, αi=0 and βi=1 if the gene is a transition gene, and αi=βi=0 otherwise. Our task is to determine the probability p(T,{cA,cB,},{αi},{βi} | {giA,B,C}) given gene expression data for all genes {giA,B,C}.

According to Bayes’ rule, p(T,{cA,cB,},{αi},{βi} | {giA,B,C}) is proportional to the probability of the gene expression data given T, {C}, {αi} and {βi}:

p(T,{cA,cB,},{αi},{βi} | {giA,B,C})=p({giA,B,C} | T,{C},{αi},{βi}) p({αi},{βi} | T,{C}) p(T|{C}) p({C})p({giA,B,C}) (2)

The denominator of the right hand side of Equation 2 is a normalization constant. Expressions p({αi},{βi}T,{C}),  p(T|{C}), and  p({C}) are respectively the prior probabilities of {αi} and {βi} given T and {C}, the prior probability of T given {C}, and the prior probability of {C}. We assume that in the absence of any expression data, the probability that a gene is a transition or marker gene is independent of the topology and clustering configuration: p({αi},{βi}T,{C}) p(T|{C}) p({C})=p({C})p(T)i p(αi,βi), and p(T)=1/4 .

Conditional independence

In our model, we assume that knowing the clustering configuration {C}, the topology T and whether or not a gene is a marker or transition gene is sufficient to determine the probability distribution for its expression levels in each of the cell clusters. Therefore, the gene expression patterns of different genes are conditionally independent given the topology, clustering and gene type:

p({giA,B,C} | T,{C},{αi},{βi})=i p(giA,B,C | T,{C},αi, βi) (3)

Thus, Equation 2 becomes

p(T,{cA,cB,},{αi},{βi} | {giA,B,C})=p({C})p(T)ip(giA,B,C | T,{C},αi, βi) p(αi,βi)p({giA,B,C}) (4)

We maximize the evaluated p(T,{cA,cB,},{αi},{βi} | {giA,B,C}) with respect to T, {C} and each of the αi and βi to obtain the most likely relationships between cell types cA, cB and cC, as well as the genes most likely to be marker and transition genes.

Expression for p(giA,B,C | T,{C},αi=0, βi=1) (transition genes)

To infer p(T,{cA,cB,},{αi},{βi} | {giA,B,C}), we need a model to compute the probability of the gene expression data for each gene, p(giA,B,C | T,{C},αi, βi), given T, {C},αi and βi, following Equation 4. Our model for the probability distribution of the expression of a single transition gene i in the three cell types p(giA,B,C | T,{C},αi=0, βi=1) is defined solely by the geometry of the arrangement of the cell types in gene expression space, as described in the main text. For example, for T=𝒜 and βi=1, our model is that the distribution of the expression levels of giA,B,C in the three cell types A, B and C has the smallest mean value in either B or C but not in A (Figure 2A). If the distribution of the expression of gene i in cell type A is DA(giA | μAi,σAi,{C}) (we assume a log-normal distribution) with a mean μAi and standard deviation σAi, with analogous expressions for cell types B and C, then our model defining p(giA,B,C | T=𝒜,βi=1,{C}) is that either μBi< μCi and μBi< μAi or μCi< μBi and μCi< μAi, where μAi, μBi and μCi are the mean values of the expression levels of gi in cell types A, B and C. Thus,

p(giA,B,C | T=𝒜,βi=1,{C})=12{p(giA,B,C | μBi< μAi,μBi< μCi,{C})+p(giA,B,C | μCi< μAi,μCi< μBi,{C})} (5)

The terms in Equation 5 can be calculated by integrating over the prior probability distribution of the means μAi, μBi and μCi and standard deviations σAi, σBi and σCi, with the conditions on the means constraining the domains of integration:

p(giA,B,C|T=𝒜,βi=1,{C})=12μBi<μAi,μBi<μCi,σAi,σBi,σCiDA(giA|μAi,σAi)DB(giB|μBi,σBi)DC(giC|μCi,σCi)p(μAi,μBi,μCi,σAi,σBi,σCi)+12μCi<μAi,μCi<μBi,σAi,σBi,σCiDA(giA|μAi,σAi)DB(giB|μBi,σBi)DC(giC|μCi,σCi)p(μAi,μBi,μCi,σAi,σBi,σCi) (6)

Probabilities p(giA,B,C | T=,βi=1,{C}) and p(giA,B,C | T=𝒞,βi=1,{C}) are defined similarly.

In addition to topologies 𝒜, and 𝒞, we consider a null hypothesis in which transition genes have differential expression levels between states, but these levels are not correlated with any particular topology of states. This corresponds to having gene expression levels from cell-types A, B and C coming from three distributions with no restrictions on the relative order of the three means:

p(giA,B,C | T=,βi=1,{C})=μAi,μBi,  μCi,σAi,σBi,σCiDA(giA | μAi,σAi)DB(giB | μBi,σBi)DC(giC | μCi,σCi)p(μAi,μBi,μCi,σAi,σBi,σCi) (7)

Note that the probability of the data given the null hypothesis is the average of the probabilities of the data given the non-null hypotheses:

p(giA,B,C|T=,βi=1,{C})=13T=A,B,Cp(giA,B,C|T,βi=1,{C}) (8)

Note that p(giA,B,C|T,{C},αi=0,βi=1) depends on both T and {C}.

Expression for p(giA,B,C | T,{C},αi=1,βi=0) (marker genes)

Our model for marker genes assumes that the probability distribution for the expression level of such genes, p(giA,B,C | T,{C},αi=1,βi=0) to be independent of T and to be generated from distributions with two cell-types having a low value and the third a high value (for example, DAB(giA,B,C | μABi,σABi) for cell-types A and B and DC(giC | μCi,σci) for cell-type C, with the constraint μABi<μCi):

p(giA,B,C|T,{C},αi=1,βi=0)=13μABi<μCi,σABi,σCiDAB(giAB|μABi,σABi)DC(giC|μCi,σCi)p(μABi,μCi,σABi,σCi)+13μACi<μBi,σACi,σBiDAC(giAC|μACi,σACi)DB(giB|μBi,σBi)p(μACi,μBi,σACi,σBi)+13μBCi<μAi,σBCi,σAiDBC(giBC|μBCi,σBCi)DB(giA|μAi,σAi)p(μBCi,μAi,σBCi,σAi) (9)

Note that p(giA,B,C | T,{C}, αi=1,βi=0)=  p(giA,B,C | {C}, αi=1,βi=0) does not depend on T but does depend on {C}.

Expression for p(giA,B,C | T,{C}, αi=0,βi=0) (irrelevant genes)

Our model for genes that are neither marker nor transition genes is that the expression levels of such genes, p(giA,B,C | T,{C}, αi=0,βi=0), is generated from one single distribution DABC(giA,B,C | μi,σi):

p(giA,B,C | T,{C},αi=0,βi=0)=μi,σiDABC(giA,B,C | μi,σi)p(μi,σi) (10)

Note that p(giA,B,C | T,{C}, αi=0,βi=0)=  p(giA,B,C |  αi=0,βi=0) does not depend on T or {C}.

Numerical integration

Each of the probabilities on the right hand side of Equation 4 is evaluated numerically as above. We assume the distribution of the expression of gene i in cluster cDA(giA | μAi,σAi) to be log-normal. Given m log2-transformed replicate measurements giA of gene expression of gene i in cells belonging to cluster cA, the probability of the data assuming mean μAi and standard deviation σAi is:

DA(giA|μAi,σAi)=(12πσAi2)mgiAme(giAμAi)22σAi2. (11)

Distributions DB, DC, DAB, DAC, DBC and DABC are defined analogously.

We take the a priori probability distribution of μi and σi, p(μi,σi) as uniform over a certain range of means and standard deviations estimated from the data. For the log2-transformed hematopoietic gene expression data, we take 2<μi<14 and 0<σi<0.75.

We take the prior probabilities for the distributions in different cell types to be independent: p(μAi,μBi,μCi,σAi,σBi,σCi)= p(μAi,σAi) p(μBi,σBi) p(μCi,σCi). The constraints on the order of the means are enforced by the domain of integration, and the prior must be properly normalized over this domain. For example, in Equation 6,

12μBi<μAi,μBi<μCi,σAi,σBi,σCip(μAi,μBi,μCi,σAi,σBi,σCi)+12μCi<μAi,μCi<μBi,σAi,σBi,σCip(μAi,μBi,μCi,σAi,σBi,σCi)=1. (12)

Integrals were evaluated numerically in MATLAB using trapezoidal integration with step-sizes δμ = 0.05 and δσ = 0.01.

The choice of a log-normal distribution could potentially confound the results, particularly for single-cell RNA-Seq data with significant zero-inflation. This is a potential area for improvement in our algorithm, but the method can easily be adapted to model different distributions of RNA expression, such as gamma distributions (Shahrezaei and Swain, 2008; Wills et al., 2013) or beta-Poisson distributions (Delmans and Hemberg, 2016; Vu et al., 2016). In either case, the probability of the data given different topologies would be computed by numeric integration over the parameters of the distribution, for example, α and β for the Gamma distribution, by replacing the log-normal distributions in Equations 7, 9 and 10 with ones from the appropriate model.

The choice of the right parametric form for single-cell RNA expression is still an area of active research. Our choice of log-normal distributions assumes that higher order moments of the distributions (beyond standard deviation) ought to have a minimal contribution to the predictions, but we have not tested this extensively.

Although the default prior was the uniform prior, we also implemented an empirical prior p(μ, σ) by estimating the empirical distribution over all the Immgen cell types, using the kernel density estimation code provided in (Botev et al., 2010). The resulting hematopoietic lineage tree was identical. Using the kernel-density-estimated empirical prior may provide more stability in future analyses.

Probability of topology given gene expression and cluster identities p(T | {giA,B,C},{C})

We can derive the probability of the topology given the gene expression data and cluster identities p(T | {giA,B,C},{C}) by summing over all the {αi} and {βi} to find the probability of the data given topology p({giA,B,C} | T,{C}):

p({giA,B,C} | T,{C})=αi,βi p({giA,B,C} | T,{αi},{βi},{C})p({αi},{βi} | T,{C})=β1β2βNi p(giA,B,C | T,αi, βi,{C}) p(αi, βi)=i(βi p(giA,B,C | T,αi, βi,{C})p(αi,βi)) (13)
p({giA,B,C} | T,{C})=i p(giA,B,C | T,{C}), (14)

where the probability of gene expression data for gene i given topology p(giA,B,C | T,{C}) is obtained by summing p(giA,B,C | T,{C}, αi,βi) over αi and βi:

p(giA,B,C | T,{C})= p(giA,B,C | T,{C},αi=0,βi=1)p(βi=1)+p(giA,B,C | {C},αi=1,βi=0)p(αi=1,βi=0)+p(giA,B,C | αi=0,βi=0)p(αi=0,βi=0). (15)

We consider the odds of a gene being a transition gene p(βi=1)/p(βi=0) as a sparsity parameter, which we vary. We take p(αi=0 | βi=0)= p(αi=1 | βi=0)=1/2.

The probability of topology T given data is proportional to the probability of the data given topology T (using Bayes’ rule):

p(T | {giA,B,C},{C})=p({giA,B,C} | T,{C})p(T)p({giA,B,C} | {C}), (16)

where

p({giA,B,C} | {C})=Tp(T)p({giA,B,C} | T,{C}). (17)

Therefore, using Equation 14, we obtain the following expression for p(T | {giA,B,C},{C}):

p(T | {giA,B,C},{C})=p(T)ip(giA,B,C | T,{C})Tp(T)ip(giA,B,C | T,{C}). (18)

Equation 18 can be written more explicitly by rewriting Equation 15 as follows:

p(giA,B,C | T,{C})=p(giA,B,C | T,βi=0,{C})p(βi=0)+p(giA,B,C | T,βi=1,{C})p(βi=1)=p(giA,B,C | βi=0,{C}) p(βi=0)(1+p(giA,B,C | T,βi=1,{C})p(βi=1)p(giA,B,C | βi=0,{C})p(βi=0)). (19)

Here we have used the fact noted earlier that in our generating model, p(giA,B,C | T,αi=1,  βi=0, {C})=p(giA,B,C | αi=1,  βi=0,{C}) and p(giA,B,C | T,αi=0,  βi=0, {C})=p(giA,B,C | αi=0,  βi=0) do not depend on T (Equation 10). The terms ip(giA,B,C | βi=0,{C}) p(βi=0) cancel out in the numerator and denominator of Equation 18, and we can write Equation 18 in terms of ratios of the probabilities of the data given transition-gene and non-transition-gene status:

p(T | {giA,B,C},{C})=p(T)i(1+p(giA,B,C | T,βi=1,{C})p(βi=1)p(giA,B,C | βi=0,{C})p(βi=0))Tp(T)i(1+p(giA,B,C | T,βi=1,{C})p(βi=1)p(giA,B,C | βi=0,{C})p(βi=0)). (20)

We can rewrite Equation 20 as:

p(T | {giA,B,C},{C})=p(T)i(1+1p(T)𝒪i p(TgiA,B,C, βi=1,{C}))Tp(T)i(1+1p(T)𝒪i p(TgiA,B,C, βi=1,{C})), (21)

where 𝒪i is the odds that gene i is a transition gene, given clustering:

𝒪i=p(βi=1 | giA,B,C,{C})p(βi=0 | giA,B,C,{C})= p(giA,B,C | βi=1,{C})p(giA,B,C | βi=0,{C})p(βi=1)p(βi=0) (22)

and p(T | giA,B,C, βi=1,{C}) is the probability of T given only gene expression data for gene i, clustering and that gene i is a transition gene:

p(T | giA,B,C, βi=1,{C})=p(giA,B,C | βi=1,  T,{C}) p(T)p(giA,B,C | βi=1,{C}). (23)

Thus each gene’s contribution p(T | giA,B,C, βi=1,{C}) to the probability of the topology given total gene expression p(T | {giA,B,C},{C}) is weighted by the odds 𝒪i that it is transition gene.

Rewriting Equation 21 in terms of negative votes

Let us denote the probability of gene expression data for gene i given that cell cluster ξ has the distribution with minimum mean expression as p(giA,B,C | μξi ismin, {C }). For example, p(giA,B,C | μBi ismin , {C })= p(giA,B,C | μBi< μAi,μBi< μCi, {C}). Then, using p(T)=1/4 and Equations 5 and 8, we can write:

p(giA,B,C | βi=1,{C})=14[T=𝒜,,𝒞,p(giA,B,C | T,βi=0,{C })]=14[ξ=A,B,Cp(giA,B,C | μξi is min, {C })T=𝒜,,𝒞+13ξ=A,B,Cp(giA,B,C | μξi is min, {C })T=]=13[ξ=A,B,Cp(giA,B,C | μξi is min, {C })] (24)

Therefore, for T=𝒜,,𝒞, we can rewrite Equation 5 as:

p(giA,B,C | T,βi=1,{C })=12[3 p(giA,B,C | βi=1,{C }) p(giA,B,C | μTi ismin, {C })] (25)

Combining Equations 21 and 25, we derive, for T:

p(T | {giA,B,C},{C})p(T)i(1+p(giA,B,C|T,βi=1,{C})p(giA,B,C|βi=1,{C})𝒪i)p(T)i(1+12[3 p(giA,B,C βi=1,{C})p (giA,B,C μTi ismin, {C })]p(giA,B,C|βi=1,{C})𝒪i) p(T)i(1+32𝒪i[1p(μTi ismin | giA,B,C,βi=1,{C})]), (26)

where p(μTiismin |  giA,B,C,βi=1,{C}) is the probability that cell cluster T (the intermediate cluster in topology T) has the distribution with the minimum mean for gene i:

p(μTi ismin | giA,B,C,βi=1,{C})=p(giA,B,C | μTi ismin,{C }) p(μTi ismin|βi=1,{C}) p(giA,B,C | βi=1)=13p(giA,B,C | μTi ismin, {C })  p(giA,B,C | βi=1,{C}). (27)

Every gene can be thought of as casting a vote p(μTiismin | giA,B,C,βi=1,{C}) against cell type T being the intermediate, and this vote is weighted by the odds 𝒪i of the gene i being a transition gene and having a unique minimum, given the clustering. This corresponds to Equation 1 in the main text.

Expression for p(T,{αi},{βi} | {giA,B,C},{C})

Once p(T | {giA,B,C},{C}) is calculated, it is straightforward to find p(T,{αi},{βi} | {giA,B,C},{C}):

p(T,{αi},{βi} | {giA,B,C},{C})= p({αi},{βi} | {giA,B,C},T,{C})p(T | {giA,B,C},{C}), (28)

where p({αi},{βi} | {giA,B,C},T,{C}) is the probability of {αi} and {βi} given the particular topology T, clustering {C} and gene expression. Because we have assumed that gene expression patterns p(giA,B,C | T,{C}, αi,βi) are conditionally independent given T, {C}, αi and βi (Equation 3), the probabilities of being marker or transition genes αi or βi are also conditionally independent given gene expression, clustering and the topology:

p({αi},{βi} | {giA,B,C},T,{C})=p({giA,B,C} | T,{αi},{βi},{C}) p({αi},{βi} | T,{C})p({giA,B,C} | T,{C})=i p(giA,B,C | T,αi,βi,{C}) p(αi,βi | T,{C})p(giA,B,C | T,{C})=i p(αi,βi | T,{C}, giA,B,C), (29)

where p(αi,βi | T,{C}, giA,B,C) is the probability that gene i is a marker or transition gene given its gene expression, the clustering, and that the topology is T.

Choice of prior odds does not affect the most likely topology

The only free parameter in our calculation above is the prior odds of gene i being a transition gene, p(βi=1)/p(βi=0). At one extreme, if p(βi=1)/p(βi=0)0, then p(T | {giA,B,C})p(T): if we assume that none of the genes are transition genes, then knowing gene expression does not give us any new knowledge of the topology T, since only transition genes are informative about T. At the other extreme, if p(βi=1)/p(βi=0) then the null hypothesis dominates: if all genes are transition genes, then there will be negative votes against all topologies. We computed the behavior of p(T | {giA,B,C}) between these two limits to determine the sensitivity of our answer to p(βi=1)/p(βi=0).

Figure 2—figure supplement 1 shows the dependence of the probabilities p(T | {giA,B,C}) on the prior odds for triplets CMP/ST/MPP and GMP/MEP/FrBC for values of p(βi=1)/p(βi=0) between 10−8 and 102. For triplet CMP/ST/MPP the topology STCMPSTMPP dominates for p(βi=1)/p(βi=0) between 10−2 and 10, whereas for triplet MEP/GMP/FrBC there is no value of the prior odds that strongly favors a non-null topology. For most triplets, the most likely topology does not depend on the choice of prior odds; when building lineage trees, we ignore those triplets where different choices of prior odds lead to different most-likely topologies, i.e. there is more than one non-null topology that reaches probability 0.6 over the range of prior odds.

Determination of lineage tree from triplet topologies

Selection of triplets

In order to build lineage trees from the topologies we determine for each cell type, we select the triplets for which our determination of the topology is most robust. There is one free parameter in our model: the prior odds for a gene to be a transition gene in the absence of gene expression data, p(βi=1)/p(βi=0). For each triplet, we vary this parameter between 10−6 and 102 and calculate the probability of the topology given gene expression data p(T | {giA,B,C}) as a function of the prior odds.

We want to consider only triplets which showed a single dominant topology. We exclude triplets which show a weak probability for a particular topology or ones which depend on a particular choice of prior odds. We also do not consider triplets which show a strong probability for two different topologies, depending on the choice of prior odds.

There were 14 such triplets among the 165 hematopoietic triplets. Mathematically, these cases come up when the genes that are most likely to show the clear minimum pattern (furthest on the right in a ‘dot plot’) suggest one topology, but if one used a more permissive value of the sparsity parameter, a different topology wins out. One of the cell types might have a small number of genes with very high odds, but then fewer genes with moderately high odds compared to the other cell types. We did not notice a clear pattern in the identity of the triplets exhibiting this behavior, but 9 of the 14 were triplets of length five or greater in the Adolfsson model. One of the triplets with this behavior was the MLP/CMP/GMP triplet, and the dominant topology was either MLP (at low prior odds) or CMP (at higher prior odds). Interestingly, both cell types are progenitors to GMP in the Adolfsson model.

We consider triplets for which only one non-null topology has probability p(T | {giA,B,C})greater than 0.6. The probabilities of the different topologies for each triplet in the hematopoietic tree are shown in Figure 2—source data 1 and for the triplets in the cortical development tree in Figure 4—source data 2.

Pruning rule

We assemble the triplets with known topology into an undirected graph. Since we determined topologies by considering cell types three at a time, we obtain topological relationships involving both cell types that are nearest neighbors and cell types that are more distantly related. In order to reconstruct the tree, we must determine which cell types are nearest neighbors and which ones are separated by one or more intermediate cell types.

The set of inferred topologies allows us to determine which cell types are separated by intermediates. For every pair of cell types, we ask whether any of the inferred topologies features an intermediate between the two cell types. If such a topology has been inferred, we consider that the two cell types are not nearest neighbors, and that at least one other cell type is an intermediate. For example, we can ignore triplet CMP – LT – MPP because triplet LT – ST – CMP testifies that there exists an intermediate between LT and CMP, and triplet LT – ST – MPP testifies that there exists an intermediate between LT and MPP (Figure 3—figure supplement 1).

Note that this pruning rule does not assume the absence of loops. The lineage tree we infer for the hematopoietic progenitors contains a loop that includes ST to CMP to GMP on one side and ST to MPP to MLP to GMP on the other side. The loop involves triplets CMP – ST – MPP, ST – CMP – GMP, ST – MPP – MLP and MPP – MLP – GMP (we cannot determine the topology of triplet CMP/MLP/GMP). None of the triplets shows a topology that would allow us to break up the loop.

Distinguishing between two models of hematopoiesis

The topologies we infer support the model from Adolfsson et al. (2005), in which CMP splits from ST-HSC. In particular, there are several triplets that can distinguish the Adolfsson model from the traditional picture, and they support the Adolfsson model. These triplets include CMP – ST – MPP, CMP – ST – MLP, MEP – ST – MLP and CMP – LT – MPP. These triplets show that, unlike in the traditional picture, cell types CMP and MEP split from ST are not descended from MPP or MLP.

On the other hand, triplets LT – MLP – GMP, ST – MLP – GMP and MPP – MLP – GMP show that MLP is an intermediate between the earliest progenitors and GMP. See also Figure 2—figure supplement 2 and Figure 2—source data 2.

Stability analysis

The inference algorithm depends on several parameters and priors. We performed a stability analysis for both the microarray hematopoietic data and the human brain single cell data to determine the parameter ranges for which the inferred lineage tree was unchanged.

  1. Hematopoiesis
    • For the prior probability, given that a gene is not a transition gene, that it is an irrelevant gene, our default value was p(αi=0 | βi=0) =0.5, but the tree was unchanged for values of p(αi=0 |  βi=0 ) between 0.25 and 0.65.
    • For the prior probabilities of different topologies, our default value was p()= p(𝒜)= p()= p(𝒞)= 0.25. We varied p() while keeping the prior probabilities of the non-null topologies equal: p(T)=13(1p()). The tree was unchanged for values of p() between 0.1 and 0.35.
    • We used a threshold of 0.6 to consider a triplet significant for the tree-building step. The tree was unchanged for thresholds between 0.5 and 0.65.
    • A key input parameter into the algorithm is the expected prior distribution of means and standard deviations p(μ, σ) used in the numeric integration. Our default prior was a uniform prior for both means and standard deviations over reasonable ranges of these parameters. We also implemented an empirical prior p(μ, σ) by estimating the empirical distribution over all the Immgen cell types, using kernel density estimation (Botev et al., 2010). The resulting hematopoietic lineage tree was identical. Using the kernel-density-estimated empirical prior may provide more stability in future analyses.
  2.  Brain Development
    • For the prior probability, given that a gene is not a transition gene, that it is an irrelevant gene, our default value was p(αi=0 | βi=0 ) = 0.5, but the tree was unchanged for values of p(αi=0 | βi=0 ) between 0.02 and 0.97.
    • For the prior probabilities of different topologies, our default value was p()= p(𝒜)= p()= p(𝒞)= 0.25. We varied p() while keeping the prior probabilities of the non-null topologies equal: p(T)=13(1p()). The tree was unchanged for values of p() between 0.06 and 0.96.
    • We used a threshold of 0.6 to consider a triplet significant for the tree-building step. The tree was unchanged for thresholds between 0.5 and 0.8.
    • For the single cell data, we also changed the number of initial seed clusters for the iterative algorithm within a range from 30 to 45. In each case, the tree remained the same, and never more than 23% of the cells were clustered into a different cell type.

Acknowledgements

We thank Sandeep Choubey, Alex Schier, Andrew Murray, David Scadden and Toshihiko Oki for scientific discussions. We also thank Christof Koch, Andrew Murray, KC Huang, Jim Valcourt and Dann Huh for detailed comments and feedback on this work. We are grateful to Simona Lodato for helping us interpret our results and in particular help us write the section on the genes we discover to be important during cortical development. We are particularly grateful to the Senior and Reviewing Editors and to the four peer reviewers, including Nir Yosef, for their very thoughtful comments and suggestions. The study was supported by NSF-GRFP and NDSEG Fellowships (LF), an NIH Pioneer Award (SR), and the Allen Institute for Brain Science (SR).

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Funding Information

This paper was supported by the following grants:

  • National Science Foundation to Leon A. Furchtgott.

  • National Institutes of Health to Sharad Ramanathan.

  • Allen Foundation to Vilas Menon, Sharad Ramanathan.

Additional information

Competing interests

The authors declare that no competing interests exist.

Author contributions

LAF, Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing.

SM, Conceptualization, Data curation, Software, Formal analysis, Validation, Visualization, Methodology, Writing—original draft, Writing—review and editing.

VM, Validation, Visualization, Writing—review and editing.

SR, Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing.

Additional files

Major datasets

The following previously published datasets were used:

Gopalan G,2009,Immunological Genome Project,http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15907,Publicly available at the NCBI Gene Expression Omnibus (accession no: GSE15907)

Grün D,Lyubimova A,Kester L,Wiebrands K,Basak O,Sasaki N,Clevers H,van Oudenaarden A,2015,Single-Cell mRNA Sequencing Reveals Rare Intestinal Cell Types,https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62270,Publicly available at the NCBI Gene Expression Omnibus (accession no: GSE62270)

Yao Z,Mich JK,Ku S,Menon V,Krostag A,Martinez RA,Grimley JS,Wang Y,Ramanathan S,Levi BP,2016,Region-specific neural stem cell lineages revealed by single-cell rna-seq from human embryonic stem cells [Smart-seq],https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE86982,Publicly available at the NCBI Gene Expression Omnibus (accession no: GSE86982)

References

  1. Abraham AB, Bronstein R, Chen EI, Koller A, Ronfani L, Maletic-Savatic M, Tsirka SE. Members of the high mobility group B protein family are dynamically expressed in embryonic neural stem cells. Proteome Science. 2013;11:18. doi: 10.1186/1477-5956-11-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Adolfsson J, Månsson R, Buza-Vidas N, Hultquist A, Liuba K, Jensen CT, Bryder D, Yang L, Borge OJ, Thoren LA, Anderson K, Sitnicka E, Sasaki Y, Sigvardsson M, Jacobsen SE. Identification of Flt3+ lympho-myeloid stem cells lacking erythro-megakaryocytic potential a revised road map for adult blood lineage commitment. Cell. 2005;121:295–306. doi: 10.1016/j.cell.2005.02.013. [DOI] [PubMed] [Google Scholar]
  3. Advani M, Ganguli S. Statistical mechanics of optimal convex inference in high dimensions. Physical Review X. 2016;6:031034. doi: 10.1103/PhysRevX.6.031034. [DOI] [Google Scholar]
  4. Agoston Z, Li N, Haslinger A, Wizenmann A, Schulte D. Genetic and physical interaction of Meis2, Pax3 and Pax7 during dorsal midbrain development. BMC Developmental Biology. 2012;12:10. doi: 10.1186/1471-213X-12-10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Akashi K, Traver D, Miyamoto T, Weissman IL. A clonogenic common myeloid progenitor that gives rise to all myeloid lineages. Nature. 2000;404:193–197. doi: 10.1038/35004599. [DOI] [PubMed] [Google Scholar]
  6. Anderson PW. Local moments and localized states. Reviews of Modern Physics. 1978;50:191–201. doi: 10.1103/RevModPhys.50.191. [DOI] [PubMed] [Google Scholar]
  7. Ang SL. Transcriptional control of midbrain dopaminergic neuron development. Development. 2006;133:3499–3506. doi: 10.1242/dev.02501. [DOI] [PubMed] [Google Scholar]
  8. Appolloni I, Calzolari F, Corte G, Perris R, Malatesta P. Six3 controls the neural progenitor status in the murine CNS. Cerebral Cortex. 2008;18:553–562. doi: 10.1093/cercor/bhm092. [DOI] [PubMed] [Google Scholar]
  9. Au E, Ahmed T, Karayannis T, Biswas S, Gan L, Fishell G. A modular gain-of-function approach to generate cortical interneuron subtypes from ES cells. Neuron. 2013;80:1145–1158. doi: 10.1016/j.neuron.2013.09.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Azim E, Jabaudon D, Fame RM, Macklis JD. SOX6 controls dorsal progenitor identity and interneuron diversity during neocortical development. Nature Neuroscience. 2009;12:1238–1247. doi: 10.1038/nn.2387. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Bani-Yaghoub M, Tremblay RG, Lei JX, Zhang D, Zurakowski B, Sandhu JK, Smith B, Ribecco-Lutkiewicz M, Kennedy J, Walker PR, Sikorska M. Role of Sox2 in the development of the mouse neocortex. Developmental Biology. 2006;295:52–66. doi: 10.1016/j.ydbio.2006.03.007. [DOI] [PubMed] [Google Scholar]
  12. Baraniuk RG. Compressive sensing. IEEE signal processing magazine. 2007;118:12–13. [Google Scholar]
  13. Ben-Porath I, Thomson MW, Carey VJ, Ge R, Bell GW, Regev A, Weinberg RA. An embryonic stem cell-like gene expression signature in poorly differentiated aggressive human tumors. Nature Genetics. 2008;40:499–507. doi: 10.1038/ng.127. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B. 1995;57:289–3000. [Google Scholar]
  15. Borello U, Madhavan M, Vilinsky I, Faedo A, Pierani A, Rubenstein J, Campbell K. Sp8 and COUP-TF1 reciprocally regulate patterning and Fgf signaling in cortical progenitors. Cerebral Cortex. 2014;24:1409–1421. doi: 10.1093/cercor/bhs412. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Botev ZI, Grotowski JF, Kroese DP. Kernel density estimation via diffusion. The Annals of Statistics. 2010;38:2916–2957. doi: 10.1214/10-AOS799. [DOI] [Google Scholar]
  17. Buck A, Kispert A, Kohlhase J. Embryonic expression of the murine homologue of SALL1, the gene mutated in Townes--Brocks syndrome. Mechanisms of Development. 2001;104:143–146. doi: 10.1016/S0925-4773(01)00364-1. [DOI] [PubMed] [Google Scholar]
  18. Buckingham ME, Meilhac SM. Tracing cells for tracking cell lineage and clonal behavior. Developmental Cell. 2011;21:394–409. doi: 10.1016/j.devcel.2011.07.019. [DOI] [PubMed] [Google Scholar]
  19. Candès EJ, Romberg JK, Tao T. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics. 2006;59:1207–1223. doi: 10.1002/cpa.20124. [DOI] [Google Scholar]
  20. Chang CY, Pasolli HA, Giannopoulou EG, Guasch G, Gronostajski RM, Elemento O, Fuchs E. NFIB is a governor of epithelial-melanocyte stem cell behaviour in a shared niche. Nature. 2013;495:98–102. doi: 10.1038/nature11847. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Cimadamore F, Amador-Arjona A, Chen C, Huang CT, Terskikh AV. SOX2-LIN28/let-7 pathway regulates proliferation and neurogenesis in neural precursors. PNAS. 2013;110:E3017–3026. doi: 10.1073/pnas.1220176110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Core Team R. R: A Language and Environment for Statistical Computing 2015
  23. Crispino JD. GATA1 in normal and malignant hematopoiesis. Seminars in Cell & Developmental Biology. 2005;16:137–147. doi: 10.1016/j.semcdb.2004.11.002. [DOI] [PubMed] [Google Scholar]
  24. Cánovas J, Berndt FA, Sepúlveda H, Aguilar R, Veloso FA, Montecino M, Oliva C, Maass JC, Sierralta J, Kukuljan M. The specification of cortical subcerebral projection neurons depends on the direct repression of TBR1 by CTIP1/BCL11a. Journal of Neuroscience. 2015;35:7552–7564. doi: 10.1523/JNEUROSCI.0169-15.2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. de Santa Barbara P, van den Brink GR, Roberts DJ. Development and differentiation of the intestinal epithelium. Cellular and Molecular Life Sciences. 2003;60:1322–1332. doi: 10.1007/s00018-003-2289-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Delmans M, Hemberg M. Discrete distributional differential expression (D3E)--a tool for gene expression analysis of single-cell RNA-seq data. BMC Bioinformatics. 2016;17:110. doi: 10.1186/s12859-016-0944-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Di Bonito M, Narita Y, Avallone B, Sequino L, Mancuso M, Andolfi G, Franzè AM, Puelles L, Rijli FM, Studer M. Assembly of the auditory circuitry by a Hox genetic network in the mouse brainstem. PLoS Genetics. 2013;9:e1003249. doi: 10.1371/journal.pgen.1003249. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Dominguez MH, Ayoub AE, Rakic P. POU-III transcription factors (Brn1, Brn2, and Oct6) influence neurogenesis, molecular identity, and migratory destination of upper-layer cells of the cerebral cortex. Cerebral Cortex. 2013;23:2632–2643. doi: 10.1093/cercor/bhs252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Donoho D, Tanner J. Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2009;367:4273–4293. doi: 10.1098/rsta.2009.0152. [DOI] [PubMed] [Google Scholar]
  30. Duggan SP, Behan FM, Kirca M, Zaheer A, McGarrigle SA, Reynolds JV, Vaz GM, Senge MO, Kelleher D. The characterization of an intestine-like genomic signature maintained during Barrett's-associated adenocarcinogenesis reveals an NR5A2-mediated promotion of cancer cell survival. Scientific Reports. 2016;6:32638. doi: 10.1038/srep32638. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Ebisu H, Iwai-Takekoshi L, Fujita-Jimbo E, Momoi T, Kawasaki H. Foxp2 regulates identities and projection patterns of thalamic nuclei during development. Cerebral Cortex. 2016:bhw187. doi: 10.1093/cercor/bhw187. [DOI] [PubMed] [Google Scholar]
  32. Elsen GE, Choi LY, Millen KJ, Grinblat Y, Prince VE. Zic1 and Zic4 regulate zebrafish roof plate specification and hindbrain ventricle morphogenesis. Developmental Biology. 2008;314:376–392. doi: 10.1016/j.ydbio.2007.12.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Enkhmandakh B, Makeyev AV, Erdenechimeg L, Ruddle FH, Chimge NO, Tussie-Luna MI, Roy AL, Bayarsaihan D. Essential functions of the Williams-Beuren syndrome-associated TFII-I genes in embryonic development. PNAS. 2009;106:181–186. doi: 10.1073/pnas.0811531106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Erickson T, French CR, Waskiewicz AJ. Meis1 specifies positional information in the retina and tectum to organize the zebrafish visual system. Neural Development. 2010;5:22. doi: 10.1186/1749-8104-5-22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Fawcett SR, Klymkowsky MW. Embryonic expression of xenopus laevis SOX7. Gene Expression Patterns. 2004;4:29–33. doi: 10.1016/j.modgep.2003.08.003. [DOI] [PubMed] [Google Scholar]
  36. Ferrell JE. Bistability, bifurcations, and Waddington's epigenetic landscape. Current Biology. 2012;22:R458–R466. doi: 10.1016/j.cub.2012.03.045. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Fre S, Huyghe M, Mourikis P, Robine S, Louvard D, Artavanis-Tsakonas S. Notch signals control the fate of immature progenitor cells in the intestine. Nature. 2005;435:964–968. doi: 10.1038/nature03589. [DOI] [PubMed] [Google Scholar]
  38. Frumkin D, Wasserstrom A, Itzkovitz S, Stern T, Harmelin A, Eilam R, Rechavi G, Shapiro E. Cell lineage analysis of a mouse tumor. Cancer Research. 2008;68:5924–5931. doi: 10.1158/0008-5472.CAN-07-6216. [DOI] [PubMed] [Google Scholar]
  39. Garcia H, Fleyshman D, Kolesnikova K, Safina A, Commane M, Paszkiewicz G, Omelian A, Morrison C, Gurova K. Expression of FACT in mammalian tissues suggests its role in maintaining of undifferentiated state of cells. Oncotarget. 2011;2:783–796. doi: 10.18632/oncotarget.340. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Gaston-Massuet C, McCabe MJ, Scagliotti V, Young RM, Carreno G, Gregory LC, Jayakody SA, Pozzi S, Gualtieri A, Basu B, Koniordou M, Wu CI, Bancalari RE, Rahikkala E, Veijola R, Lopponen T, Graziola F, Turton J, Signore M, Mousavy Gharavy SN, Charolidi N, Sokol SY, Andoniadou CL, Wilson SW, Merrill BJ, Dattani MT, Martinez-Barbera JP. Transcription factor 7-like 1 is involved in hypothalamo-pituitary axis development in mice and humans. PNAS. 2016;113:E548–557. doi: 10.1073/pnas.1503346113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Gazit R, Garrison BS, Rao TN, Shay T, Costello J, Ericson J, Kim F, Collins JJ, Regev A, Wagers AJ, Rossi DJ, Immunological Genome Project Consortium Transcriptome analysis identifies regulators of hematopoietic stem and progenitor cells. Stem Cell Reports. 2013;1:266–280. doi: 10.1016/j.stemcr.2013.07.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Gilbert SF. Developmental biology. Sinauer; 2014. [Google Scholar]
  43. Goossens S, Janzen V, Bartunkova S, Yokomizo T, Drogat B, Crisan M, Haigh K, Seuntjens E, Umans L, Riedt T, Bogaert P, Haenebalcke L, Berx G, Dzierzak E, Huylebroeck D, Haigh JJ. The EMT regulator Zeb2/Sip1 is essential for murine embryonic hematopoietic stem/progenitor cell differentiation and mobilization. Blood. 2011;117:5620–5630. doi: 10.1182/blood-2010-08-300236. [DOI] [PubMed] [Google Scholar]
  44. Graf T, Enver T. Forcing cells to change lineages. Nature. 2009;462:587–594. doi: 10.1038/nature08533. [DOI] [PubMed] [Google Scholar]
  45. Greig LC, Woodworth MB, Galazo MJ, Padmanabhan H, Macklis JD. Molecular logic of neocortical projection neuron specification, development and diversity. Nature Reviews Neuroscience. 2013;14:755–769. doi: 10.1038/nrn3586. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Greig LC, Woodworth MB, Greppi C, Macklis JD, Abdel-Majid RM, Leong WL, Schalkwyk LC, Smallman DS, Wong ST, Storm DR. Ctip1 controls acquisition of sensory area identity and establishment of sensory input fields in the developing neocortex. Neuron. 2016;90:261–277. doi: 10.1016/j.neuron.2016.03.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Grün D, Lyubimova A, Kester L, Wiebrands K, Basak O, Sasaki N, Clevers H, van Oudenaarden A. Single-cell messenger RNA sequencing reveals rare intestinal cell types. Nature. 2015;525:251–255. doi: 10.1038/nature14966. [DOI] [PubMed] [Google Scholar]
  48. Grün D, Muraro MJ, Boisset JC, Wiebrands K, Lyubimova A, Dharmadhikari G, van den Born M, van Es J, Jansen E, Clevers H, de Koning EJ, van Oudenaarden A. De novo prediction of stem cell identity using Single-Cell transcriptome data. Cell Stem Cell. 2016;19:266–277. doi: 10.1016/j.stem.2016.05.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Hagey DW, Muhr J. Sox2 acts in a dose-dependent fashion to regulate proliferation of cortical progenitors. Cell Reports. 2014;9:1908–1920. doi: 10.1016/j.celrep.2014.11.013. [DOI] [PubMed] [Google Scholar]
  50. Hegarty SV, Sullivan AM, O'Keeffe GW. Midbrain dopaminergic neurons: a review of the molecular circuitry that regulates their development. Developmental Biology. 2013;379:123–138. doi: 10.1016/j.ydbio.2013.04.014. [DOI] [PubMed] [Google Scholar]
  51. Heng TS, Painter MW, Immunological Genome Project Consortium The immunological genome project: networks of gene expression in immune cells. Nature Immunology. 2008;9:1091–1094. doi: 10.1038/ni1008-1091. [DOI] [PubMed] [Google Scholar]
  52. Hutton SR, Pevny LH. SOX2 expression levels distinguish between neural progenitor populations of the developing dorsal telencephalon. Developmental Biology. 2011;352:40–47. doi: 10.1016/j.ydbio.2011.01.015. [DOI] [PubMed] [Google Scholar]
  53. Inoue F, Kurokawa D, Takahashi M, Aizawa S. Gbx2 directly restricts Otx2 expression to forebrain and midbrain, competing with class III POU factors. Molecular and Cellular Biology. 2012;32:2618–2627. doi: 10.1128/MCB.00083-12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Iwasaki H, Akashi K. Myeloid lineage commitment from the hematopoietic stem cell. Immunity. 2007;26:726–740. doi: 10.1016/j.immuni.2007.06.004. [DOI] [PubMed] [Google Scholar]
  55. Jabaudon D, Shnider SJ, Tischfield DJ, Galazo MJ, Macklis JD. Rorβ induces barrel-like neuronal clusters in the developing neocortex. Cerebral Cortex. 2012;22:996–1006. doi: 10.1093/cercor/bhr182. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Jaegle M, Ghazvini M, Mandemakers W, Piirsoo M, Driegen S, Levavasseur F, Raghoenath S, Grosveld F, Meijer D. The POU proteins Brn-2 and Oct-6 share important functions in schwann cell development. Genes & development. 2003;17:1380–1391. doi: 10.1101/gad.258203. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Jaitin DA, Kenigsberg E, Keren-Shaul H, Elefant N, Paul F, Zaretsky I, Mildner A, Cohen N, Jung S, Tanay A, Amit I. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science. 2014;343:776–779. doi: 10.1126/science.1247651. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Jang S, Furchtgott L, Choubey S, Zou L-N, Doyle A, Menon V, Loew E, Krostag A-R, Martinez RA, Madisen L, Levi BP, Ramanathan S. Probabilistic model of gene networks controlling embryonic stem cell differentiation inferred from single-cell transcriptomics. eLife. 2017;6:e20487. doi: 10.7554/eLife.20487. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Jenny M, Uhl C, Roche C, Duluc I, Guillermin V, Guillemot F, Jensen J, Kedinger M, Gradwohl G. Neurogenin3 is differentially required for endocrine cell fate specification in the intestinal and gastric epithelium. The EMBO Journal. 2002;21:6338–6347. doi: 10.1093/emboj/cdf649. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Jensen J, Pedersen EE, Galante P, Hald J, Heller RS, Ishibashi M, Kageyama R, Guillemot F, Serup P, Madsen OD. Control of endodermal endocrine development by Hes-1. Nature Genetics. 2000;24:36–44. doi: 10.1038/71657. [DOI] [PubMed] [Google Scholar]
  61. Ji Z, Ji H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Research. 2016;44:e117. doi: 10.1093/nar/gkw430. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Johansson PA, Irmler M, Acampora D, Beckers J, Simeone A, Götz M. The transcription factor Otx2 regulates choroid plexus development and function. Development. 2013;140:1055–1066. doi: 10.1242/dev.090860. [DOI] [PubMed] [Google Scholar]
  63. Kameda Y, Saitoh T, Fujimura T. Hes1 regulates the number and anterior-posterior patterning of mesencephalic dopaminergic neurons at the mid/hindbrain boundary (isthmus) Developmental Biology. 2011;358:91–101. doi: 10.1016/j.ydbio.2011.07.016. [DOI] [PubMed] [Google Scholar]
  64. Kanatani S, Yozu M, Tabata H, Nakajima K. COUP-TFII is preferentially expressed in the caudal ganglionic eminence and is involved in the caudal migratory stream. Journal of Neuroscience. 2008;28:13582–13591. doi: 10.1523/JNEUROSCI.2132-08.2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Katz JP, Perreault N, Goldstein BG, Lee CS, Labosky PA, Yang VW, Kaestner KH. The zinc-finger transcription factor Klf4 is required for terminal differentiation of goblet cells in the colon. Development. 2002;129:2619–2628. doi: 10.1242/dev.129.11.2619. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Kessaris N, Magno L, Rubin AN, Oliveira MG. Genetic programs controlling cortical Interneuron fate. Current Opinion in Neurobiology. 2014;26:79–87. doi: 10.1016/j.conb.2013.12.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Khan WI, Blennerhasset P, Ma C, Matthaei KI, Collins SM. Stat6 dependent goblet cell hyperplasia during intestinal nematode infection. Parasite Immunology. 2001;23:39–42. doi: 10.1046/j.1365-3024.2001.00353.x. [DOI] [PubMed] [Google Scholar]
  68. Kikkawa T, Obayashi T, Takahashi M, Fukuzaki-Dohi U, Numayama-Tsuruta K, Osumi N. Dmrta1 regulates proneural gene expression downstream of Pax6 in the mammalian telencephalon. Genes to Cells. 2013;18:636–649. doi: 10.1111/gtc.12061. [DOI] [PubMed] [Google Scholar]
  69. Kishi Y, Fujii Y, Hirabayashi Y, Gotoh Y. HMGA regulates the global chromatin state and neurogenic potential in neocortical precursor cells. Nature Neuroscience. 2012;15:1127–1133. doi: 10.1038/nn.3165. [DOI] [PubMed] [Google Scholar]
  70. Knight JM, Davidson LA, Herman D, Martin CR, Goldsby JS, Ivanov IV, Donovan SM, Chapkin RS. Non-invasive analysis of intestinal development in preterm and term infants using RNA-Sequencing. Scientific Reports. 2014;4:159–173. doi: 10.1038/srep05453. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Kondo M, Weissman IL, Akashi K. Identification of clonogenic common lymphoid progenitors in mouse bone marrow. Cell. 1997;91:661–672. doi: 10.1016/S0092-8674(00)80453-5. [DOI] [PubMed] [Google Scholar]
  72. Kumbasar A, Plachez C, Gronostajski RM, Richards LJ, Litwack ED. Absence of the transcription factor Nfib delays the formation of the basilar pontine and other mossy fiber nuclei. The Journal of Comparative Neurology. 2009;513:98–112. doi: 10.1002/cne.21943. [DOI] [PubMed] [Google Scholar]
  73. Kunath M, Lüdecke HJ, Vortkamp A. Expression of Trps1 during mouse embryonic development. Mechanisms of Development. 2002;119 Suppl 1:S117–120. doi: 10.1016/S0925-4773(03)00103-5. [DOI] [PubMed] [Google Scholar]
  74. Kurotaki D, Osato N, Nishiyama A, Yamamoto M, Ban T, Sato H, Nakabayashi J, Umehara M, Miyake N, Matsumoto N, Nakazawa M, Ozato K, Tamura T. Essential role of the IRF8-KLF4 transcription factor cascade in murine monocyte differentiation. Blood. 2013;121:1839–1849. doi: 10.1182/blood-2012-06-437863. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Landau LD, Lifshitz EM. Statistical Physics. Vol. 5. Elsevier; 1951. https://books.google.com/books?id=VzgJN-XPTRsC&pgis=1 [Google Scholar]
  76. Lavado A, Oliver G. Prox1 expression patterns in the developing and adult murine brain. Developmental Dynamics. 2007;236:518–524. doi: 10.1002/dvdy.21024. [DOI] [PubMed] [Google Scholar]
  77. Li X, Udager AM, Hu C, Qiao XT, Richards N, Gumucio DL. Dynamic patterning at the pylorus: formation of an epithelial intestine-stomach boundary in late fetal life. Developmental Dynamics. 2009;238:3205–3217. doi: 10.1002/dvdy.22134. [DOI] [PMC free article] [PubMed] [Google Scholar]
  78. Lodato S, Molyneaux BJ, Zuccaro E, Goff LA, Chen HH, Yuan W, Meleski A, Takahashi E, Mahony S, Rinn JL, Gifford DK, Arlotta P. Gene co-regulation by Fezf2 selects neurotransmitter identity and connectivity of corticospinal neurons. Nature Neuroscience. 2014;17:1046–1054. doi: 10.1038/nn.3757. [DOI] [PMC free article] [PubMed] [Google Scholar]
  79. Lorente-Trigos A, Varnat F, Melotti A, Ruiz i Altaba A. BMP signaling promotes the growth of primary human colon carcinomas in vivo. Journal of Molecular Cell Biology. 2010;2:318–332. doi: 10.1093/jmcb/mjq035. [DOI] [PubMed] [Google Scholar]
  80. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, Trombetta JJ, Weitz DA, Sanes JR, Shalek AK, Regev A, McCarroll SA. Highly parallel Genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161:1202–1214. doi: 10.1016/j.cell.2015.05.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  81. Madissoon E, Jouhilahti EM, Vesterlund L, Töhönen V, Krjutškov K, Petropoulous S, Einarsdottir E, Linnarsson S, Lanner F, Månsson R, Hovatta O, Bürglin TR, Katayama S, Kere J. Characterization and target genes of nine human PRD-like homeobox domain genes expressed exclusively in early embryos. Scientific Reports. 2016;6:28995. doi: 10.1038/srep28995. [DOI] [PMC free article] [PubMed] [Google Scholar]
  82. Manuel MN, Martynoga B, Molinek MD, Quinn JC, Kroemmer C, Mason JO, Price DJ. The transcription factor Foxg1 regulates telencephalic progenitor proliferation cell autonomously, in part by controlling Pax6 expression levels. Neural Development. 2011;6:9. doi: 10.1186/1749-8104-6-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  83. Marco E, Karp RL, Guo G, Robson P, Hart AH, Trippa L, Yuan GC. Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape. PNAS. 2014;111:E5643–E5650. doi: 10.1073/pnas.1408993111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  84. Marjoram L, Alvers A, Deerhake ME, Bagwell J, Mankiewicz J, Cocchiaro JL, Beerman RW, Willer J, Sumigray KD, Katsanis N, Tobin DM, Rawls JF, Goll MG, Bagnat M. Epigenetic control of intestinal barrier function and inflammation in zebrafish. PNAS. 2015;112:2770–2775. doi: 10.1073/pnas.1424089112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. Matsui H, Kimura A, Yamashiki N, Moriyama A, Kaya M, Yoshida I, Takagi N, Takahashi T. Molecular and biochemical characterization of a serine proteinase predominantly expressed in the medulla oblongata and cerebellar white matter of mouse brain. Journal of Biological Chemistry. 2000;275:11050–11057. doi: 10.1074/jbc.275.15.11050. [DOI] [PubMed] [Google Scholar]
  86. McGibbon RT, Husic BE, Pande VS. Identification of simple reaction coordinates from complex dynamics. The Journal of Chemical Physics. 2017;146:044109. doi: 10.1063/1.4974306. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Mellor P, Deibert L, Calvert B, Bonham K, Carlsen SA, Anderson DH. CREB3L1 is a metastasis suppressor that represses expression of genes regulating metastasis, invasion, and angiogenesis. Molecular and Cellular Biology. 2013;33:4985–4995. doi: 10.1128/MCB.00959-13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Merrill BJ, Gat U, DasGupta R, Fuchs E. Tcf3 and Lef1 regulate lineage differentiation of multipotent stem cells in skin. Genes & Development. 2001;15:1688–1705. doi: 10.1101/gad.891401. [DOI] [PMC free article] [PubMed] [Google Scholar]
  89. Miller JA, Ding SL, Sunkin SM, Smith KA, Ng L, Szafer A, Ebbert A, Riley ZL, Royall JJ, Aiona K, Arnold JM, Bennet C, Bertagnolli D, Brouner K, Butler S, Caldejon S, Carey A, Cuhaciyan C, Dalley RA, Dee N, Dolbeare TA, Facer BA, Feng D, Fliss TP, Gee G, Goldy J, Gourley L, Gregor BW, Gu G, Howard RE, Jochim JM, Kuan CL, Lau C, Lee CK, Lee F, Lemon TA, Lesnar P, McMurray B, Mastan N, Mosqueda N, Naluai-Cecchini T, Ngo NK, Nyhus J, Oldre A, Olson E, Parente J, Parker PD, Parry SE, Stevens A, Pletikos M, Reding M, Roll K, Sandman D, Sarreal M, Shapouri S, Shapovalova NV, Shen EH, Sjoquist N, Slaughterbeck CR, Smith M, Sodt AJ, Williams D, Zöllei L, Fischl B, Gerstein MB, Geschwind DH, Glass IA, Hawrylycz MJ, Hevner RF, Huang H, Jones AR, Knowles JA, Levitt P, Phillips JW, Sestan N, Wohnoutka P, Dang C, Bernard A, Hohmann JG, Lein ES. Transcriptional landscape of the prenatal human brain. Nature. 2014;508:199–206. doi: 10.1038/nature13185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  90. Miller RL, Stein MK, Loewy AD. Serotonergic inputs to FoxP2 neurons of the pre-locus coeruleus and parabrachial nuclei that project to the ventral tegmental area. Neuroscience. 2011;193:229–240. doi: 10.1016/j.neuroscience.2011.07.008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  91. Milosevic J, Maisel M, Wegner F, Leuchtenberger J, Wenger RH, Gerlach M, Storch A, Schwarz J. Lack of hypoxia-inducible factor-1 alpha impairs midbrain neural precursor cells involving vascular endothelial growth factor signaling. Journal of Neuroscience. 2007;27:412–421. doi: 10.1523/JNEUROSCI.2482-06.2007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  92. Miyawaki K, Arinobu Y, Iwasaki H, Kohno K, Tsuzuki H, Iino T, Shima T, Kikushige Y, Takenaka K, Miyamoto T, Akashi K. CD41 marks the initial myelo-erythroid lineage specification in adult mouse hematopoiesis: redefinition of murine common myeloid progenitor. Stem Cells. 2015;33:976–987. doi: 10.1002/stem.1906. [DOI] [PubMed] [Google Scholar]
  93. Miyoshi G, Fishell G. Dynamic FoxG1 expression coordinates the integration of Multipolar pyramidal neuron precursors into the cortical plate. Neuron. 2012;74:1045–1058. doi: 10.1016/j.neuron.2012.04.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  94. Muncan V, Sansom OJ, Tertoolen L, Phesse TJ, Begthel H, Sancho E, Cole AM, Gregorieff A, de Alboran IM, Clevers H, Clarke AR. Rapid loss of intestinal crypts upon conditional deletion of the Wnt/Tcf-4 target gene c-Myc. Molecular and Cellular Biology. 2006;26:8418–8426. doi: 10.1128/MCB.00821-06. [DOI] [PMC free article] [PubMed] [Google Scholar]
  95. Negishi H, Miki S, Sarashina H, Taguchi-Atarashi N, Nakajima A, Matsuki K, Endo N, Yanai H, Nishio J, Honda K, Taniguchi T. Essential contribution of IRF3 to intestinal homeostasis and microbiota-mediated Tslp gene induction. PNAS. 2012;109:21016–21021. doi: 10.1073/pnas.1219482110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  96. Noah TK, Kazanjian A, Whitsett J, Shroyer NF. SAM pointed domain ETS factor (SPDEF) regulates terminal differentiation and maturation of intestinal goblet cells. Experimental Cell Research. 2010;316:452–465. doi: 10.1016/j.yexcr.2009.09.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  97. Ohba H, Chiyoda T, Endo E, Yano M, Hayakawa Y, Sakaguchi M, Darnell RB, Okano HJ, Okano H. Sox21 is a repressor of neuronal differentiation and is antagonized by YB-1. Neuroscience Letters. 2004;358:157–160. doi: 10.1016/j.neulet.2004.01.026. [DOI] [PubMed] [Google Scholar]
  98. Ohtsuka T, Sakamoto M, Guillemot F, Kageyama R. Roles of the basic helix-loop-helix genes Hes1 and Hes5 in expansion of neural stem cells of the developing brain. Journal of Biological Chemistry. 2001;276:30467–30474. doi: 10.1074/jbc.M102420200. [DOI] [PubMed] [Google Scholar]
  99. Olivetti PR, Noebels JL. Interneuron, interrupted: molecular pathogenesis of ARX mutations and X-linked infantile spasms. Current Opinion in Neurobiology. 2012;22:859–865. doi: 10.1016/j.conb.2012.04.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  100. Orkin SH, Zon LI. Hematopoiesis: an evolving paradigm for stem cell biology. Cell. 2008;132:631–644. doi: 10.1016/j.cell.2008.01.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
  101. Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Gillespie SM, Wakimoto H, Cahill DP, Nahed BV, Curry WT, Martuza RL, Louis DN, Rozenblatt-Rosen O, Suvà ML, Regev A, Bernstein BE. Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma. Science. 2014;344:1396–1401. doi: 10.1126/science.1254257. [DOI] [PMC free article] [PubMed] [Google Scholar]
  102. Paul F, Arkin Y, Giladi A, Jaitin DA, Kenigsberg E, Keren-Shaul H, Winter D, Lara-Astiaso D, Gury M, Weiner A, David E, Cohen N, Lauridsen FK, Haas S, Schlitzer A, Mildner A, Ginhoux F, Jung S, Trumpp A, Porse BT, Tanay A, Amit I. Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell. 2015;163:1663–1677. doi: 10.1016/j.cell.2015.11.013. [DOI] [PubMed] [Google Scholar]
  103. Perrimon N, Pitsouli C, Shilo BZ. Signaling mechanisms controlling cell fate and embryonic patterning. Cold Spring Harbor Perspectives in Biology. 2012;4:a005975. doi: 10.1101/cshperspect.a005975. [DOI] [PMC free article] [PubMed] [Google Scholar]
  104. Picelli S, Björklund ÅK, Faridani OR, Sagasser S, Winberg G, Sandberg R. Smart-seq2 for sensitive full-length transcriptome profiling in single cells. Nature Methods. 2013;10:1096–1098. doi: 10.1038/nmeth.2639. [DOI] [PubMed] [Google Scholar]
  105. Pino E, Amamoto R, Zheng L, Cacquevel M, Sarria JC, Knott GW, Schneider BL. FOXO3 determines the accumulation of α-synuclein and controls the fate of dopaminergic neurons in the substantia nigra. Human Molecular Genetics. 2014;23:1435–1452. doi: 10.1093/hmg/ddt530. [DOI] [PubMed] [Google Scholar]
  106. Pozniak CD, Langseth AJ, Dijkgraaf GJ, Choe Y, Werb Z, Pleasure SJ. Sox10 directs neural stem cells toward the oligodendrocyte lineage by decreasing suppressor of fused expression. PNAS. 2010;107:21795–21800. doi: 10.1073/pnas.1016485107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  107. Qi X, Hong J, Chaves L, Zhuang Y, Chen Y, Wang D, Chabon J, Graham B, Ohmori K, Li Y, Huang H. Antagonistic regulation by the transcription factors C/EBPα and MITF specifies basophil and mast cell fates. Immunity. 2013;39:97–110. doi: 10.1016/j.immuni.2013.06.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  108. Raciti M, Granzotto M, Duc MD, Fimiani C, Cellot G, Cherubini E, Mallamaci A. Reprogramming fibroblasts to neural-precursor-like cells by structured overexpression of pallial patterning genes. Molecular and Cellular Neuroscience. 2013;57:42–53. doi: 10.1016/j.mcn.2013.10.004. [DOI] [PubMed] [Google Scholar]
  109. Ragu C, Boukour S, Elain G, Wagner-Ballon O, Raslova H, Debili N, Olson EN, Daegelen D, Vainchenker W, Bernard OA, Penard-Lacronique V. The serum response factor (SRF)/megakaryocytic acute leukemia (MAL) network participates in megakaryocyte development. Leukemia. 2010;24:1227–1230. doi: 10.1038/leu.2010.80. [DOI] [PubMed] [Google Scholar]
  110. Reinchisi G, Ijichi K, Glidden N, Jakovcevski I, Zecevic N. COUP-TFII expressing interneurons in human fetal forebrain. Cerebral Cortex. 2012;22:2820–2830. doi: 10.1093/cercor/bhr359. [DOI] [PMC free article] [PubMed] [Google Scholar]
  111. Reya T, Morrison SJ, Clarke MF, Weissman IL. Stem cells, cancer, and cancer stem cells. Nature. 2001;414:105–111. doi: 10.1038/35102167. [DOI] [PubMed] [Google Scholar]
  112. Riccio O, van Gijn ME, Bezdek AC, Pellegrinet L, van Es JH, Zimber-Strobl U, Strobl LJ, Honjo T, Clevers H, Radtke F. Loss of intestinal crypt progenitor cells owing to inactivation of both Notch1 and Notch2 is accompanied by derepression of CDK inhibitors p27Kip1 and p57Kip2. EMBO reports. 2008;9:377–383. doi: 10.1038/embor.2008.7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  113. Robert-Moreno A, Espinosa L, de la Pompa JL, Bigas A. RBPjkappa-dependent notch function regulates Gata2 and is essential for the formation of intra-embryonic hematopoietic cells. Development. 2005;132:1117–1126. doi: 10.1242/dev.01660. [DOI] [PubMed] [Google Scholar]
  114. Ross SE, Greenberg ME, Stiles CD. Basic helix-loop-helix factors in cortical development. Neuron. 2003;39:13–25. doi: 10.1016/S0896-6273(03)00365-9. [DOI] [PubMed] [Google Scholar]
  115. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nature Biotechnology. 2015;33:495–502. doi: 10.1038/nbt.3192. [DOI] [PMC free article] [PubMed] [Google Scholar]
  116. Satoh Y, Yokota T, Sudo T, Kondo M, Lai A, Kincade PW, Kouro T, Iida R, Kokame K, Miyata T, Habuchi Y, Matsui K, Tanaka H, Matsumura I, Oritani K, Kohwi-Shigematsu T, Kanakura Y. The Satb1 protein directs hematopoietic stem cell differentiation toward lymphoid lineages. Immunity. 2013;38:1105–1115. doi: 10.1016/j.immuni.2013.05.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  117. Shahrezaei V, Swain PS. Analytical distributions for stochastic gene expression. PNAS. 2008;105:17256–17261. doi: 10.1073/pnas.0803850105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  118. Shen J, Walsh CA. Targeted disruption of Tgif, the mouse ortholog of a human holoprosencephaly gene, does not result in holoprosencephaly in mice. Molecular and Cellular Biology. 2005;25:3639–3647. doi: 10.1128/MCB.25.9.3639-3647.2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  119. Shimojo H, Ohtsuka T, Kageyama R. Dynamic expression of notch signaling genes in neural stem/progenitor cells. Frontiers in Neuroscience. 2011;5:78. doi: 10.3389/fnins.2011.00078. [DOI] [PMC free article] [PubMed] [Google Scholar]
  120. Sillars-Hardebol AH, Carvalho B, Beliën JAM, de Wit M, Delis-van Diemen PM, Tijssen M, van de Wiel MA, Pontén F, Meijer GA, Fijneman RJA. CSE1L, DIDO1 and RBM39 in colorectal adenoma to carcinoma progression. Cellular Oncology. 2012;35:293–300. doi: 10.1007/s13402-012-0088-2. [DOI] [PubMed] [Google Scholar]
  121. Solar GP, Kerr WG, Zeigler FC, Hess D, Donahue C, de Sauvage FJ, Eaton DL. Role of c-mpl in early hematopoiesis. Blood. 1998;92:4–10. [PubMed] [Google Scholar]
  122. Spehlmann ME, Manthey CF, Dann SM, Hanson E, Sandhu SS, Liu LY, Abdelmalak FK, Diamanti MA, Retzlaff K, Scheller J, Rose-John S, Greten FR, Wang JYJ, Eckmann L. Trp53 deficiency protects against acute intestinal inflammation. The Journal of Immunology. 2013;191:837–847. doi: 10.4049/jimmunol.1201716. [DOI] [PMC free article] [PubMed] [Google Scholar]
  123. Stolt CC, Lommes P, Sock E, Chaboissier MC, Schedl A, Wegner M. The Sox9 transcription factor determines glial fate choice in the developing spinal cord. Genes & Development. 2003;17:1677–1689. doi: 10.1101/gad.259003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  124. Sulston JE, Schierenberg E, White JG, Thomson JN. The embryonic cell lineage of the nematode caenorhabditis elegans. Developmental Biology. 1983;100:64–119. doi: 10.1016/0012-1606(83)90201-4. [DOI] [PubMed] [Google Scholar]
  125. Sur M, Rubenstein JL. Patterning and plasticity of the cerebral cortex. Science. 2005;310:805–810. doi: 10.1126/science.1112070. [DOI] [PubMed] [Google Scholar]
  126. Takahashi K, Yamanaka S. Induction of pluripotent stem cells from mouse embryonic and adult fibroblast cultures by defined factors. Cell. 2006;126:663–676. doi: 10.1016/j.cell.2006.07.024. [DOI] [PubMed] [Google Scholar]
  127. Tamura T, Nagamura-Inoue T, Shmeltzer Z, Kuwata T, Ozato K. ICSBP directs bipotential myeloid progenitor cells to differentiate into mature macrophages. Immunity. 2000;13:155–165. doi: 10.1016/S1074-7613(00)00016-9. [DOI] [PubMed] [Google Scholar]
  128. Tan X, Zhang L, Zhu H, Qin J, Tian M, Dong C, Li H, Jin G. Brn4 and TH synergistically promote the differentiation of neural stem cells into dopaminergic neurons. Neuroscience Letters. 2014;571:23–28. doi: 10.1016/j.neulet.2014.04.019. [DOI] [PubMed] [Google Scholar]
  129. Thomsen ER, Mich JK, Yao Z, Hodge RD, Doyle AM, Jang S, Shehata SI, Nelson AM, Shapovalova NV, Levi BP, Ramanathan S. Fixed single-cell transcriptomic characterization of human radial glial diversity. Nature Methods. 2016;13:87–93. doi: 10.1038/nmeth.3629. [DOI] [PMC free article] [PubMed] [Google Scholar]
  130. Thomson M, Liu SJ, Zou LN, Smith Z, Meissner A, Ramanathan S. Pluripotency factors in embryonic stem cells regulate differentiation into germ layers. Cell. 2011;145:875–889. doi: 10.1016/j.cell.2011.05.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  131. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B. 2001;63:411–423. doi: 10.1111/1467-9868.00293. [DOI] [Google Scholar]
  132. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B. 1996;58:267–288. doi: 10.1111/j.1467-9868.2011.00771.x. [DOI] [Google Scholar]
  133. Tou L, Liu Q, Shivdasani RA. Regulation of mammalian epithelial differentiation and intestine development by class I histone deacetylases. Molecular and Cellular Biology. 2004;24:3132–3139. doi: 10.1128/MCB.24.8.3132-3139.2004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  134. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, Lennon NJ, Livak KJ, Mikkelsen TS, Rinn JL. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotechnology. 2014;32:381–386. doi: 10.1038/nbt.2859. [DOI] [PMC free article] [PubMed] [Google Scholar]
  135. Trapnell C. Defining cell types and states with single-cell genomics. Genome research. 2015;25:1491–1498. doi: 10.1101/gr.190595.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  136. Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH, Desai TJ, Krasnow MA, Quake SR. Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature. 2014;509:371–375. doi: 10.1038/nature13173. [DOI] [PMC free article] [PubMed] [Google Scholar]
  137. Tzeng SF, de Vellis J. Id1, Id2, and Id3 gene expression in neural cells during development. Glia. 1998;24:372–381. doi: 10.1002/(SICI)1098-1136(199812)24:4&#x0003c;372::AID-GLIA2&#x0003e;3.0.CO;2-B. [DOI] [PubMed] [Google Scholar]
  138. Uittenbogaard M, Chiaramello A. Expression of the bHLH transcription factor Tcf12 (ME1) gene is linked to the expansion of precursor cell populations during neurogenesis. Gene Expression Patterns. 2002;1:115–121. doi: 10.1016/S1567-133X(01)00022-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  139. van der Flier LG, Clevers H. Stem cells, self-renewal, and differentiation in the intestinal epithelium. Annual Review of Physiology. 2009;71:241–260. doi: 10.1146/annurev.physiol.010908.163145. [DOI] [PubMed] [Google Scholar]
  140. Van Der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9:2579–2605. [Google Scholar]
  141. Van Der Maaten L. Learning a parametric embedding by preserving local structure. Twelfth International Conference on Artificial Intelligence and Statistics (AI-STATS); 2009. pp. 384–391. [Google Scholar]
  142. VanDussen KL, Samuelson LC. Mouse atonal homolog 1 directs intestinal progenitors to secretory cell rather than absorptive cell fate. Developmental Biology. 2010;346:215–223. doi: 10.1016/j.ydbio.2010.07.026. [DOI] [PMC free article] [PubMed] [Google Scholar]
  143. Varelas X. The hippo pathway effectors TAZ and YAP in development, homeostasis and disease. Development. 2014;141 doi: 10.1242/dev.102376. [DOI] [PubMed] [Google Scholar]
  144. Vu TN, Wills QF, Kalari KR, Niu N, Wang L, Rantalainen M, Pawitan Y. Beta-Poisson model for single-cell RNA-seq data analyses. Bioinformatics. 2016;32:2128–2135. doi: 10.1093/bioinformatics/btw202. [DOI] [PubMed] [Google Scholar]
  145. Wainwright MJ. Sharp thresholds for High-Dimensional and noisy sparsity recovery using $\ell _{1}$ -Constrained Quadratic Programming (Lasso) IEEE Transactions on Information Theory. 2009;55:2183–2202. [Google Scholar]
  146. Wang M, Gu D, Du M, Xu Z, Zhang S, Zhu L, Lu J, Zhang R, Xing J, Miao X, Chu H, Hu Z, Yang L, Tang C, Pan L, Du H, Zhao J, Du J, Tong N, Sun J, Shen H, Xu J, Zhang Z, Chen J. Common genetic variation in ETV6 is associated with colorectal cancer susceptibility. Nature Communications. 2016;7:11478. doi: 10.1038/ncomms11478. [DOI] [PMC free article] [PubMed] [Google Scholar]
  147. Wang TW, Stromberg GP, Whitney JT, Brower NW, Klymkowsky MW, Parent JM. Sox3 expression identifies neural progenitors in persistent neonatal and adult mouse forebrain germinative zones. The Journal of Comparative Neurology. 2006;497:88–100. doi: 10.1002/cne.20984. [DOI] [PubMed] [Google Scholar]
  148. Wiegreffe C, Simon R, Peschkes K, Kling C, Strehle M, Cheng J, Srivatsa S, Liu P, Jenkins NA, Copeland NG, Tarabykin V, Britsch S. Bcl11a (Ctip1) controls migration of cortical projection neurons through regulation of sema3c. Neuron. 2015;87:311–325. doi: 10.1016/j.neuron.2015.06.023. [DOI] [PubMed] [Google Scholar]
  149. Wills QF, Livak KJ, Tipping AJ, Enver T, Goldson AJ, Sexton DW, Holmes C. Single-cell gene expression analysis reveals genetic associations masked in whole-tissue experiments. Nature Biotechnology. 2013;31:748–752. doi: 10.1038/nbt.2642. [DOI] [PubMed] [Google Scholar]
  150. Witten DM, Tibshirani R. A framework for feature selection in clustering. Journal of the American Statistical Association. 2010;105:713–726. doi: 10.1198/jasa.2010.tm09415. [DOI] [PMC free article] [PubMed] [Google Scholar]
  151. Woodworth MB, Greig LC, Liu KX, Ippolito GC, Tucker HO, Macklis JD. Ctip1 regulates the balance between specification of distinct projection neuron subtypes in deep cortical layers. Cell Reports. 2016;15:999–1012. doi: 10.1016/j.celrep.2016.03.064. [DOI] [PMC free article] [PubMed] [Google Scholar]
  152. Wullaert A, Bonnet MC, Pasparakis M. NF-κB in the regulation of epithelial homeostasis and inflammation. Cell Research. 2011;21:146–158. doi: 10.1038/cr.2010.175. [DOI] [PMC free article] [PubMed] [Google Scholar]
  153. Yang Q, Liu S, Yin M, Yin Y, Zhou G, Zhou J. Ebf2 is required for development of dopamine neurons in the midbrain periaqueductal gray matter of mouse. Developmental Neurobiology. 2015;75:1282–1294. doi: 10.1002/dneu.22284. [DOI] [PubMed] [Google Scholar]
  154. Yao Z, Mich JK, Ku S, Menon V, Krostag AR, Martinez RA, Furchtgott L, Mulholland H, Bort S, Fuqua MA, Gregor BW, Hodge RD, Jayabalu A, May RC, Melton S, Nelson AM, Ngo NK, Shapovalova NV, Shehata SI, Smith MW, Tait LJ, Thompson CL, Thomsen ER, Ye C, Glass IA, Kaykas A, Yao S, Phillips JW, Grimley JS, Levi BP, Wang Y, Ramanathan S. A Single-Cell roadmap of lineage bifurcation in human ESC models of embryonic brain development. Cell Stem Cell. 2017;20:120–134. doi: 10.1016/j.stem.2016.09.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
  155. Ye DZ, Kaestner KH. Foxa1 and Foxa2 control the differentiation of goblet and enteroendocrine L- and D-cells in mice. Gastroenterology. 2009;137:2052–2062. doi: 10.1053/j.gastro.2009.08.059. [DOI] [PMC free article] [PubMed] [Google Scholar]
  156. Yin M, Liu S, Yin Y, Li S, Li Z, Wu X, Zhang B, Ang SL, Ding Y, Zhou J. Ventral mesencephalon-enriched genes that regulate the development of dopaminergic neurons in vivo. Journal of Neuroscience. 2009;29:5170–5182. doi: 10.1523/JNEUROSCI.5569-08.2009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  157. Yu T, Chen X, Zhang W, Li J, Xu R, Wang TC, Ai W, Liu C, Ghaleb A, McConnell B. Krüppel-like factor 4 regulates intestinal epithelial cell morphology and polarity. PLoS One. 2012;7:e32492. doi: 10.1371/journal.pone.0032492. [DOI] [PMC free article] [PubMed] [Google Scholar]
  158. Zeisel A, Muñoz-Manchado AB, Codeluppi S, Lönnerberg P, La Manno G, Juréus A, Marques S, Munguba H, He L, Betsholtz C, Rolny C, Castelo-Branco G, Hjerling-Leffler J, Linnarsson S. Brain structure. cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015;347:1138–1142. doi: 10.1126/science.aaa1934. [DOI] [PubMed] [Google Scholar]
  159. Zembrzycki A, Griesel G, Stoykova A, Mansouri A. Genetic interplay between the transcription factors Sp8 and Emx2 in the patterning of the forebrain. Neural Development. 2007;2:8. doi: 10.1186/1749-8104-2-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  160. Zhang P, Behre G, Pan J, Iwama A, Wara-Aswapati N, Radomska HS, Auron PE, Tenen DG, Sun Z. Negative cross-talk between hematopoietic regulators: GATA proteins repress PU.1. PNAS. 1999;96:8705–8710. doi: 10.1073/pnas.96.15.8705. [DOI] [PMC free article] [PubMed] [Google Scholar]
  161. Zhang Y, Miki T, Iwanaga T, Koseki Y, Okuno M, Sunaga Y, Ozaki N, Yano H, Koseki H, Seino S. Identification, tissue expression, and functional characterization of Otx3, a novel member of the Otx family. Journal of Biological Chemistry. 2002;277:28065–28069. doi: 10.1074/jbc.C100767200. [DOI] [PubMed] [Google Scholar]
eLife. 2017 Mar 15;6:e20488. doi: 10.7554/eLife.20488.033

Decision letter

Editor: Nir Yosef1

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Discovering sparse transcription factor codes for cell states and state transitions during development" for consideration by eLife. Your article has been favorably evaluated by Arup Chakraborty (Senior Editor) and three reviewers, one of whom, Nir Yosef (Reviewer #1), is a member of our Board of Reviewing Editors.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This paper presents an interesting framework for identification of lineage/ developmental relationships based on whole transcriptome profiles. The basic notion is that differentially expressed genes among triplets of samples (representing either a developmental fork or a linear developmental progression) should largely exhibit a prototypical pattern that reflects the lineage relationship. The authors successfully apply this to infer lineage relationship in hematopoiesis, using bulk- level RNA-seq. They then turn to apply their method to single cell data of early neuronal development, and concomitantly infer clusters of cells (each representing a developmental state), and their organization into a lineage structure. A subset of these results is followed up in another, yet unpublished manuscript. The algorithm was also used and followed up on in a companion manuscript of ESC differentiation.

Essential revisions:

Overall, this study is innovative and provides new insights into cell-type detection and lineage relationship. The method developed here is complementary to existing pseudotime inference methods in single-cell analysis and the paper is clearly written. Nonetheless, the paper may benefit from additional valuation of the method's performance, specifically – by applications to additional data sets and by systematic comparison with existing methods. More detailed descriptions of their methods is also merited, as described below.

1) "Lineage progression" triplets: In the case of cell type B that gives rise to cell type A that gives rise to cell type C the middle node (A) is chosen as the "root"

1.1) The terminology "root cell type" and "terminal leaf cell types" seems inappropriate and confusing in this case and should be revised.

1.2) The minimum-in-leaf model makes sense in the "Lineage progression" case as it captures the common scenario of genes that get turned on or off in a lineage path, with some persistence of this state. However, in such triplets that are distant enough along the path this may contradict the single "pulse" model (i.e., transient change) proposed in other studies. To explore this, Figure 1 should be revised. Currently it only shows the majority statistics. It would be important to see the fraction of genes that conform to the minimum-in-leaf model (which can be anywhere between 51% and 100%) while contrasting:

1.2.1) "Lineage progression" vs. "cell fate decision" triplets.

1.2.2) Triplets with leaves vs. triplets that only include internal nodes.

1.2.3) Triplets with different distance in the tree.

2) The Clear-minimum-in-leaf model.

2.1) To further support the validity of the minimum-in-leaf model, the latter extension to Figure 1 (comment 1.2) should also include comparison between all related triplets vs. all unrelated triplets (i.e., representatives from three different branches).

2.2) Subsection “Discovering sparse patterns correlated with lineage transitions”, third paragraph: There are more than 2 possible patterns for differentially expressed genes between 3 cell types. In addition to the clear minimum A<B=C and clear maximum A=B<C there is also A<B<C, which is actually what is expected for lateral inhibition, discussed in the text several times. This pattern of three clearly distinct expression levels is not discussed at all in the text (in fact, it is excluded from the model, as described in subsection “1. Notation; Bayes’ Rule”, first paragraph). A clear example is HLF, mentioned in the text as higher in MPP compared to CMP, but it is much higher in the third member of the discussed triplet, ST (according to ImmGen skyline browser).

2.3) The B&T subtree used for the observation has a very simple topology of only two branches. That might limit the generality of the conclusion. Indeed, the reconstructed lineage tree of the progenitors include a loop, which is not supported by the literature. Authors should discuss this difference from known tree topology. Also, why did they use their approach to reconstruct only hematopoietic progenitors and not all ~250 ImmGen populations?

3) First algorithm.

3.1) For the formula in the third paragraph of the subsection “Using patterns to infer lineages”, but I wonder how to draw a cutoff for a tree to be 'correct'. If the method is applied to unrelated cell-types, what is distribution of the p-values?

3.2) How robust is the lineage prediction model to the initialization procedure and parameter choices? For example, how sensitive is the resulting tree on the number of initial clusters? how sensitive are the results for the algorithm's parameters (e.g., probability cutoffs) or for sub-sampling of samples or genes? A similar stability analysis should also be included for the second algorithm (extension to single cells), which can be even more sensitive due to its iterative nature.

4) Application to hematopoiesis:

4.1) It is not clear how well the algorithm can separate related from unrelated triplets? (i.e., representatives from three different branches). While this is mentioned anecdotally in Figure 2, it should be evaluated systematically, e.g., with AUROC.

4.2) Given the constraints of the algorithm, the aggregated lineage tree is undirected. While this is mentioned briefly, this key point should be better clarified in the main text.

5) Algorithm's extension to single cells.

5.1) Comparison to other methods.

Recently a number of papers have been published to infer cell lineages from single-cell gene expression data, such as Monocole (Trapnell et al. NBT, 2014), Treutlein et al. (Nature 2014), Wishbone (Setty, Nature Biotech, 2016), tSCAN (Ji, NAR, 2016), StemID (Grun, Cell Stem Cell 2016), and diffusion map (Haghverdi, Nature Methods, 2016). It will be useful to compare with these methods at least to get a sense to what degree this method provides new information (note: it looks like Figure 4B comes from Monocole, but we could not find a mention for it in the Results section).

5.2) Differential expression in single cells.

The t-test used to identify clear minimum and clear maximum patterns is appropriate for microarray analysis, but one limitation of t-test is it assumes the underlying distribution is normal, which is not true for RNAseq, and especially single-cell RNA-seq data. Specifically, single-cell RNA-seq can be severely under-powered because many genes (especially transcription factors) have low or even zero expression values due to dropout. Please discuss ways to incorporate more sophisticated tests that addresses this challenge (e.g., as in scDE), or demonstrate why this is not needed.

5.3) Related to that – the threshold used for definition of a pattern is p<0.005 in a two sample t-test. Though not specified by the authors, it seems that as there are 1500 transcription factors tested, and 3 tests performed per gene (to test the separation of each of the 3 triplet members from the other two), 4500 tests were performed per triplet. Thus, one would expect to find 22 transcription factors (4500*0.005) with a pattern that pass this threshold for each triplet in the absence of a signal. The authors have to address the multiple comparisons issue.

6) Application on Single cell data

6.1) While the single-cell data comes from another manuscript that is under review elsewhere; the information provided on this data is insufficient.

6.1.1) Please provide a compete "census" of the data – how many cells from each condition; how many reads per library, which read lengths were used?

6.1.2) How was the data processed? Should we worry about potential confounders of library quality (which have been presented and discussed time and again in the recent scRNA-seq literature)?

6.2) The algorithm should have been demonstrated first on a single cell RNA-seq datasets where the true cell type transitions tree is known, such as one of the publicly available hematopoietic progenitors single cell RNA-seq datasets.

6.3) There is no estimate of how sparse is the resulting code – how many genes can be used to decipher the topology? Can they be identified ahead? Is it possible that any group of genes of similar size (~1500) can code for cell states and cell transitions with the same success as the transcription factors?

6.4) It is not clear how the tree was generated. Was it using the generic algorithm from the previous section, or by enforcing the temporal order of the samples? If it is the latter – please describe the procedure and explain why the generic algorithm does not work.

eLife. 2017 Mar 15;6:e20488. doi: 10.7554/eLife.20488.034

Author response


Essential revisions:

Overall, this study is innovative and provides new insights into cell-type detection and lineage relationship. The method developed here is complementary to existing pseudotime inference methods in single-cell analysis and the paper is clearly written. Nonetheless, the paper may benefit from additional valuation of the method's performance, specifically – by applications to additional data sets and by systematic comparison with existing methods. More detailed descriptions of their methods is also merited, as described below.

We would like to thank the Senior and Reviewing Editors and four peer reviewers for their thoughtful comments and suggestions. The feedback and suggestions have greatly improved our manuscript as well as strengthened our conclusions and for this again we are grateful. We have considered each comment and have amended the manuscript to add 5 new figure supplements, 3 new source data items, and 1 movie, as well as edited the text as noted in the detailed responses below.

1) "Lineage progression" triplets: In the case of cell type B that gives rise to cell type A that gives rise to cell type C the middle node (A) is chosen as the "root"

1.1) The terminology "root cell type" and "terminal leaf cell types" seems inappropriate and confusing in this case and should be revised.

We would be happy to change the terminology to something clearer, but we have struggled to find an unequivocally clearer alternative. We would like to propose a few alternatives, which we feel all have advantages and disadvantages, and we hope the reviewers could advise on which conveys the meaning most accurately (or propose a better alternative): A) central node and peripheral nodes. B) intermediate node and endpoints. C) internal node and leaf nodes.

D) central node and terminal nodes.

We have also clarified our use of the terms “root” and “leaf, adding: “Note that even in the case in which cell type B gives rise to cell type A which gives rise to cell type C, we will refer to A as the “root” and B and C as “leaves” for that particular triplet.”

1.2) The minimum-in-leaf model makes sense in the "Lineage progression" case as it captures the common scenario of genes that get turned on or off in a lineage path, with some persistence of this state. However, in such triplets that are distant enough along the path this may contradict the single "pulse" model (i.e., transient change) proposed in other studies. To explore this, Figure 1 should be revised. Currently it only shows the majority statistics. It would be important to see the fraction of genes that conform to the minimum-in-leaf model (which can be anywhere between 51% and 100%) while contrasting:

1.2.1) "Lineage progression" vs. "cell fate decision" triplets.

1.2.2) Triplets with leaves vs. triplets that only include internal nodes.

1.2.3) Triplets with different distance in the tree.

We added an additional figure supplement to Figure 1 (Figure 1—figure supplement 2) to test these questions. Figure 1—figure supplement 2C shows the number of genes conforming to the pattern for each triplet, and Figure 1—figure supplement 2D-F show the triplets shaded according to decision type, presence of internal nodes, and distance. Although biologically, each of these subsets could reasonably be expected to exhibit different statistics, we found that each scenario produced aggregate statistics that were indistinguishable from Figure 1. Since Figure 1—figure supplement 2 includes plots for each of the scenarios suggested in this comment, we have left the representation in Figure 1 unchanged for the purposes of clarity. Figure 1—source data 1 has been appropriately updated.

2) The Clear-minimum-in-leaf model.

2.1) To further support the validity of the minimum-in-leaf model, the latter extension to Figure 1 (comment 1.2) should also include comparison between all related triplets vs. all unrelated triplets (i.e., representatives from three different branches).

The comment suggests we compare the model’s success on triplets with known relationships vs. those with unknown relationships to establish an expectation of the false positives from our algorithm. This is an important point that we have addressed in two ways.

First, we have repeated the analysis in Figure 1 on a set of 100 unrelated triplets from the same hematopoietic lineage tree, where there is no clear relationship between the cell types. Figure 1—source data 1 has been updated with these triplets. We found that while there were a substantial number of genes exhibiting a clear minimum in one of the three cell types in unrelated triplets, their minima were evenly distributed amongst the cell types (in contrast to what is seen in related triplets: the clear minimum pattern is not seen in the ‘root’). To quantify this, we counted the fraction of genes fiwhich reached a clear minimum in cell type i = A, B, C (For A, B, and C unrelated cell types), and for each triplet, we computed the entropy 𝑆 = −∑i=A,B,C fi log (fi) and plotted this quantity vs. the number of genes showing a minimum in any triplet for unrelated and related triplets (Figure 1—figure supplement 3A). The unrelated triplets have higher entropy and typically more genes showing the downregulation pattern.

Second, we applied our Bayesian methodology to the set of 150 related and 100 unrelated triplets and compared the distribution of the probability of obtaining a non-null topology. (Figure 1—figure supplement 3B). The two distributions diverge considerably, with an AUROC (Area Under Receiver Operating Characteristic) of 0.96, which is shown in a new supplemental figure (Figure 1—figure supplement 3C).

2.2) Subsection “Discovering sparse patterns correlated with lineage transitions”, third paragraph: There are more than 2 possible patterns for differentially expressed genes between 3 cell types. In addition to the clear minimum A<B=C and clear maximum A=B<C there is also A<B<C, which is actually what is expected for lateral inhibition, discussed in the text several times. This pattern of three clearly distinct expression levels is not discussed at all in the text (in fact, it is excluded from the model, as described in subsection “1. Notation; Bayes’ Rule”, first paragraph). A clear example is HLF, mentioned in the text as higher in MPP compared to CMP, but it is much higher in the third member of the discussed triplet, ST (according to ImmGen skyline browser).

This is an important comment and we have tried to clarify the text accordingly. First, we are defining the clear minimum pattern as A < B & A < C and the clear maximum pattern as A < C & B < C. (The p-value we calculate for the clear minimum model is the less significant of the two p-values for ttest(A,B) and ttest(A,C)). Given this definition, a gene displaying the A < B < C pattern would fulfill both the clear minimum and clear maximum patterns.

In our inference method, transition genes are similarly defined as having a minimum expression level, i.e. A < B and A < C. These constraints are enforced in the bounds of the numeric integration (point 3 of the subsection “Bayesian Framework for inferring cluster identities, state transitions, and marker and transition genes simultaneously”), and we are not constraining B=C. Indeed, Hlf is identified with high probability as a transition gene for triplet ST/MPP/CMP despite the two levels in ST and MPP. We have revised the text to discuss and clarify genes with three distinct levels (subsection “Discovering sparse patterns correlated with lineage transitions”, third paragraph).

In response to the comments we defined the A < B < C pattern as the “clear median” pattern, and we have evaluated how well the pattern correlates with the known topologies. Specifically, does the cell type with median expression level tend to be the root cell type of the triplet? We found that, similar to the clear maximum pattern, the clear median pattern was not a good predictor of the topology.

Therefore, for clarity, we did not define an additional clear median pattern in the main text, and clarified that genes with the clear median pattern belonged to both the clear minimum and clear maximum classes (in the aforementioned paragraph).

2.3) The B&T subtree used for the observation has a very simple topology of only two branches. That might limit the generality of the conclusion.

The B and T lineage tree is certainly simpler than some systems, but it contains a combination of lineage decisions as well as lineage progressions. While we were restricted by the relative paucity of well-verified lineage trees, our algorithm has been successfully tested on a) microarray data from early hematopoietic progenitors, b) single-cell gene expression data from intestinal cell differentiation (added in the revised manuscript, now in Figure 3—figure supplement 2), c) single-cell gene expression data from early germ layer differentiation (accompanying Jang et al. manuscript). The predictions in this paper regarding a forebrain/hindbrain split in neural development have been experimentally verified with viral barcoding in human brain development (Yao et al., 2017). Together, this body of evidence suggests to us that the motif we identified and the approach we developed may have more general applicability.

Indeed, the reconstructed lineage tree of the progenitors include a loop, which is not supported by the literature. Authors should discuss this difference from known tree topology.

The loop we predict in the progenitor tree is in fact supported by the Adolfsson model in the literature (Adolfsson et al., 2005). We have written a discussion of the differences between the reconstructed lineage tree and the traditional model (subsection “A lineage tree for early hematopoiesis”, second paragraph). To address the conflicting models of early hematopoietic progenitor lineages, we have compared the triplet topologies expected in the canonical model and in the Adolfsson model, and demonstrated that our predictions support the key features of the Adolfsson model (Figure 2—figure supplement 2).

Also, why did they use their approach to reconstruct only hematopoietic progenitors and not all ~250 ImmGen populations?

We did not test our model on the entirety of the ImmGen data set because not all of the lineage relationships were verified experimentally as the B- and T- cell branches and early progenitors, meaning the results would not necessarily have been informative of the performance of the algorithm.

3) First algorithm.

3.1) For the formula in the third paragraph of the subsection “Using patterns to infer lineages”, but I wonder how to draw a cutoff for a tree to be 'correct'. If the method is applied to unrelated cell-types, what is distribution of the p-values?

We thank the reviewers for this helpful suggestion. We compiled the inferred probability of triplet topologies for both related and unrelated triplets explored earlier from the B- and T- cell branches of the hematopoietic tree (see comment 2.2). We applied our Bayesian methodology to the set of 150 related and 100 unrelated triplets and compared the distributions of the probability of obtaining a non-null topology in a new supplemental subfigure (Figure 1—figure supplement 3B). As mentioned in a previous comment and addressed in the new supplemental figure, the two distributions diverge considerably, with an AUC of 0.96 (Figure 1—figure supplement 3C).

3.2) How robust is the lineage prediction model to the initialization procedure and parameter choices? For example, how sensitive is the resulting tree on the number of initial clusters? how sensitive are the results for the algorithm's parameters (e.g., probability cutoffs) or for sub-sampling of samples or genes? A similar stability analysis should also be included for the second algorithm (extension to single cells), which can be even more sensitive due to its iterative nature.

We have added a subsection “Stability analysis” to the Materials and methods section, covering both algorithms. We found that the resulting trees are robust to probability cutoff, the prior probability of having a null topology p(T), the relative probabilities of irrelevant and marker genes, and the number of initial clusters.

In a new subfigure (Figure 2—figure supplement 1E), we have shown the distribution of maximal probabilities for non-null topologies in hematopoiesis. Here we show that triplets tend to have very high probabilities (close to 1), and varying the probability cutoff for triplet consideration does not affect the overall tree construction.

Additionally, a key input parameter into the algorithm is the expected prior distribution of means and standard deviations 𝑝(𝜇, 𝜎) used in the numeric integration (LXYZ). We have been using a uniform prior for both mean and standard deviation, but this requires specification of the ranges of the mean and standard deviation. We implemented an empirical prior 𝑝(𝜇, 𝜎) by estimating the empirical distribution over all the Immgen cell types, using kernel density estimation. The resulting hematopoietic lineage tree was identical. Using the kernel density estimated empirical prior may provide more stability in future analyses.

4) Application to hematopoiesis:

4.1) It is not clear how well the algorithm can separate related from unrelated triplets? (i.e., representatives from three different branches). While this is mentioned anecdotally in Figure 2, it should be evaluated systematically, e.g., with AUROC.

We have systematically annotated each of the 165 triplets, noting the topologies expected in the traditional model and in the Adolfsson model, and the expected triplet length in each model. This information has also been added to Figure 2—source data 2. and demonstrated that our predictions support the key features of the Adolfsson model (Figure 2—figure supplement 2).

Null topologies were identified by the algorithm for triplets with cells that are from 3 terminal nodes (e.g. the triplet MEP/GMP/FrBC) or from triplets that contain very distantly related triplets (e.g. LT/CLP/ETP).

For 7 triplets with terminal nodes in there is no expected topology (e.g. MEP/GMP/FrA), the method identified a non-null topology (e.g. GMP). However, these misidentified triplets did not affect the final tree, since they were eliminated using the pruning rule (subsection “Pruning rule”), which prunes triplets that contain links between cell types that are not nearest neighbors. In the case of MEP/GMP/FrA, MEP and GMP are separated by CMP, and GMP and FrA are separated by MLP.

By our count, the Adolfsson and traditional models differ in the topology of 9 triplets. The inferred expected topologies of 8 out of these 9 triplets support the Adolfsson model, which led to the identification of the final tree.

4.2) Given the constraints of the algorithm, the aggregated lineage tree is undirected. While this is mentioned briefly, this key point should be better clarified in the main text.

This is indeed a key point that is easily missed. We have amended the text to clarify this point, stating that “we next determined an undirected graph that recapitulates all of the individual triplet topologies (note that we are only inferring triplet topologies and are not inferring directionality).”

5) Algorithm's extension to single cells.

5.1) Comparison to other methods.

Recently a number of papers have been published to infer cell lineages from single-cell gene expression data, such as Monocole (Trapnell et al. NBT, 2014), Treutlein et al. (Nature 2014), Wishbone (Setty, Nature Biotech, 2016), tSCAN (Ji, NAR, 2016), StemID (Grun, Cell Stem Cell 2016), and diffusion map (Haghverdi, Nature Methods, 2016). It will be useful to compare with these methods at least to get a sense to what degree this method provides new information (note: it looks like Figure 4B comes from Monocole, but we could not find a mention for it in the Results section).

While our method is complementary to many of these alternative approaches, we have addressed some specific areas in which our method can add insight to analysis of single cell data. We applied Monocle 2, tSCAN, and StemID to the neuronal differentiation data set in this paper and were unable to extract meaningful results as shown in a new supplemental figure (Figure 4—figure supplement 2). In each of these cases, it was difficult to extract experimentally verified hypotheses from the analysis. In particular, the forebrain/hindbrain split in early development was not correctly identified (this split has been verified experimentally in Yao et al., 2017). We have also addressed this in the main text, adding that: “Analysis of this data with other recent methods such as Monocle and Monocle2 (Trapnell et al., 2014), TSCAN (Ji and Ji, 2016), and StemID (Grün et al., 2016) did not clearly reconstruct lineage or infer key genes regulating transitions (Figure 4A – Bottom, Figure 4—figure supplement 2). Monocle2 (Figure 4—figure supplement 2A) produces a tree with complex branching, but Sox2Sox2+ progenitors and DCX+ differentiated neurons do no clearly separate”.

In addition, we performed a more in-depth comparison with StemID to explore the different conclusions one could gain from these orthogonal methods. We analyzed the intestinal cell differentiation system investigated by Grun et al., 2015 and compared our conclusions with those that they found. Using the Grun et al. cell clusters and a minimum cluster size of 10 cells, we accurately predicted the topology of each triplet in the lineage tree which they reconstruct, as shown in a new supplemental figure (Figure 3—figure supplement 3).

Our method is complementary to StemID in that for each of the triplets, we were able to identify a sparse set of transition genes which could be experimentally perturbed to explore the dynamics of the system (Figure 4—source data 2), many of which have known developmental relevance.

5.2) Differential expression in single cells.

The t-test used to identify clear minimum and clear maximum patterns is appropriate for microarray analysis, but one limitation of t-test is it assumes the underlying distribution is normal, which is not true for RNAseq, and especially single-cell RNA-seq data. Specifically, single-cell RNA-seq can be severely under-powered because many genes (especially transcription factors) have low or even zero expression values due to dropout. Please discuss ways to incorporate more sophisticated tests that addresses this challenge (e.g., as in scDE), or demonstrate why this is not needed.

We have now clarified that the single cell data is analyzed using the Bayesian formalism introduced earlier, and does not use a t-test to quantify separation between distributions. The inference methodology does, however, implicitly fit the expression data to a log-normal distribution. The choice of a log-normal distribution could potentially confound the results, particularly in cases with significant zero-inflation. This is a potential area for improvement in our algorithm, but the method can easily be adapted to model different distributions of RNA expression, such as γ distributions (Shahrezaei and Swain, 2008; Wills et al., 2013) and β-Poisson distributions (Vu et al., 2016; Delman and Hemberg, 2016). In either case, the probability of the data given different topologies would be computed by numeric integration over the parameters of the distribution, e.g. 𝛼 and 𝛽 for the Γ distribution, by replacing the log-normal distributions in Equations 6, 8 and 9 with ones from the appropriate model.

The choice of the right parametric form for single-cell RNA expression is still an area of active research. Our choice of log-normal distributions assumes that higher order moments of the distributions (beyond standard deviation) ought to have a minimal contribution to the predictions, but we have not tested this extensively.

5.3) Related to that – the threshold used for definition of a pattern is p<0.005 in a two sample t-test. Though not specified by the authors, it seems that as there are 1500 transcription factors tested, and 3 tests performed per gene (to test the separation of each of the 3 triplet members from the other two), 4500 tests were performed per triplet. Thus, one would expect to find 22 transcription factors (4500*0.005) with a pattern that pass this threshold for each triplet in the absence of a signal. The authors have to address the multiple comparisons issue.

We computed the FDR-adjusted p-values for the pattern recognition and found that the clear minimum pattern is still robust to this multiple hypothesis correction (Figure 1—figure supplement 2B). In the Bayesian inference method, the prior odds of being a transition gene serves as a sparsity parameter which directly addresses the multiple comparisons issue.

6) Application on Single cell data

6.1) While the single-cell data comes from another manuscript that is under review elsewhere; the information provided on this data is insufficient.

6.1.1) Please provide a compete "census" of the data – how many cells from each condition; how many reads per library, which read lengths were used?

6.1.2) How was the data processed? Should we worry about potential confounders of library quality (which have been presented and discussed time and again in the recent scRNA-seq literature)?

We have added a new subsection to the Materials and methods to include this information (“In Vitro Neuronal Differentiation”). In particular, as the single cells were profiled independently using the Smart- Seq2 protocol (without pooling before amplification, as in methods such as Cel-Seq, STRT, or Drop-Seq), we did not observe the batch effects present in pooling-based methods. We would also like to note that the paper detailing the collection of this data set has now been published in Cell Stem Cell(Yao et al., 2017).

6.2) The algorithm should have been demonstrated first on a single cell RNA-seq datasets where the true cell type transitions tree is known, such as one of the publicly available hematopoietic progenitors single cell RNA-seq datasets.

We applied our method to the intestinal cell differentiation from Grun et al., (Figure 4—figure supplement 3) and found that our method correctly predicts the topology of each triplet based on the Grun et al. reconstructed lineage tree. Further, we identified key genes in many of the triplets with known developmental relevance. By reducing the system to a sparse measure of distances between cell types in the lineage tree, we can provide complimentary biological insight to existing methods which would allow for further experimental investigation. We have modified the main text in the last paragraph of the subsection “A lineage tree for early hematopoiesis”, to discuss this additional analysis.

6.3) There is no estimate of how sparse is the resulting code – how many genes can be used to decipher the topology? Can they be identified ahead? Is it possible that any group of genes of similar size (~1500) can code for cell states and cell transitions with the same success as the transcription factors?

To address this suggestion, we explored the sparsity of the code based on our reconstruction. We assembled 20 triplets from brain development that were originally inferred to be non-null and had a maximal leaf-to-leaf distance no greater than 5. For each triplet, we ranked genes based on their odds of being transition genes 𝒪i. We then attempted to infer the correct topology using only the N genes per triplet with the largest odds, and varied N between 1 and 50 (Figure 4—figure supplement 1E). We find that with just 4 genes per triplet (80 total), we can correctly infer the topology of each of these triplets. Note that these genes are ranked by their odds of being transition genes, which is agnostic to the true topology of the triplet. Given assignments of cells to cell types (or clustering), these genes can be found without knowing the topology of the tree by computing the odds. We thank the reviewers for pointing out this relevant feature of the tree reconstruction. We have added discussion of the sparsity to the last paragraph of the subsection “Inferred lineage tree for human excitatory neuronal progenitors from in vitro single-cell data over 80 days of differentiation”.

6.4) It is not clear how the tree was generated. Was it using the generic algorithm from the previous section, or by enforcing the temporal order of the samples? If it is the latter – please describe the procedure and explain why the generic algorithm does not work.

We have now clarified in the main text (subsection “Inferred lineage tree for human excitatory neuronal progenitors from in vitro single-cell data over 80 days of differentiation”, sixth paragraph) that the tree was compiled using a) triplets with P>0.6, b) knowledge that Sox2Sox2+ cell types were progenitors, and DCX+ cell types were differentiated (so a DCX+ cell type could not transition to a Sox2Sox2+ cell type), c) using the pruning rule as described in the Materials and methods section. Using these rules, the tree was assembled by hand. A method for computationally generating a tree from a set of triplets would be a valuable extension of this work.

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Figure 1—source data 1. Triplets used for pattern detection.

    Sheet 1: The 150 related triplets used for pattern detection are listed, with the root cell type of the triplet in the first column and the two leaves in second and third columns. Cell type names follow (Heng et al., 2008). For each triplet are listed the number of genes with a clear minimum in each cell type and a clear maximum in each cell type (t-test p-value < 0.005). Also listed for each triplet are the length of the triplet, whether the triplet is a lineage progression or a cell fate decision, whether the triplet contains a terminal node, and the entropy in the clear minimum pattern and clear maximum pattern. Sheet 2: The 100 unrelated triplets are listed; cell type names follow (Heng et al., 2008). For each triplet are listed the number of genes with a clear minimum in each cell type and a clear maximum in each cell type (t-test p-value < 0.005) and the entropy in the clear minimum pattern and clear maximum pattern.

    DOI: http://dx.doi.org/10.7554/eLife.20488.003

    DOI: 10.7554/eLife.20488.003
    Figure 1—source data 2. List of mouse transcription factors.

    The list of human transcription factors used to cluster and infer lineages for the hematopoietic and intestinal development trees. The list was adapted from (Thomson et al., 2011).

    DOI: http://dx.doi.org/10.7554/eLife.20488.004

    DOI: 10.7554/eLife.20488.004
    Figure 2—source data 1. Early hematopoietic cell types considered.

    Listed for each early hematopoietic progenitor whose name is abbreviated in this paper are the Immunological Genome Project descriptor for the cell type, its common name and phenotype, its age and location, and the number of replicates in the data set.

    DOI: http://dx.doi.org/10.7554/eLife.20488.009

    DOI: 10.7554/eLife.20488.009
    Figure 2—source data 2. Probabilities of topologies for triplets of hematopoietic cell types.

    Listed are, for each triplet of cell types, the probabilities of the four topologies for prior odds p(βi=1)p(βi=0)=0.05; the number of topologies that reach probability p(T | {giA,B,C})>0.6 for some value of p(βi=1)p(βi=0)  between 10−6 and 102; the non-null topology that has the highest probability p(T | {giA,B,C}) over the range of prior odds (if the null topology is the most likely topology for the entire range of prior odds, the topology is marked ‘null’); and the value of highest probability p(T | {giA,B,C}) over the range of prior odds; the correct topology and triplet length in the traditional model; and the correct topology and triplet length in the Adolfsson model.

    DOI: http://dx.doi.org/10.7554/eLife.20488.010

    DOI: 10.7554/eLife.20488.010
    Figure 2—source data 3. Probabilities of transition and marker genes for the hematopoietic lineage tree.

    Listed are, for crucial triplets along the lineage tree, the 200 genes with the highest probabilities of belonging to the two transition gene classes and the root marker class, and their associated probabilities.

    DOI: http://dx.doi.org/10.7554/eLife.20488.011

    DOI: 10.7554/eLife.20488.011
    Figure 3—source data 1. Marker genes for early hematopoiesis.

    Table with selected marker genes for early hematopoietic cell types, along with references to published validations of their functional role. Genes known to be effective for reprogramming are shown in bold.

    DOI: http://dx.doi.org/10.7554/eLife.20488.015

    DOI: 10.7554/eLife.20488.015
    Figure 4—source data 1. Final cluster identities of single cells from in vitro cortical differentiation.

    Cells were assigned to a cluster based on an iterative clustering procedure. Their final cluster assignments are shown here. Cell types 8–11 contained bulk brain control cells, and non-neuronal cells which were excluded from the analysis.

    DOI: http://dx.doi.org/10.7554/eLife.20488.020

    DOI: 10.7554/eLife.20488.020
    Figure 4—source data 2. Probabilities of topologies for triplets of single-cell clusters.

    Listed are, for each triplet of clusters, the probability of a given topology and the root of the inferred topology. ‘0’ refers to the null topology. Triplets with p>0.6 were used to construct the lineage tree.

    DOI: http://dx.doi.org/10.7554/eLife.20488.021

    DOI: 10.7554/eLife.20488.021
    Figure 4—source data 3. Probabilities of Transition and Marker Genes for the Human Brain Developmental Lineage Tree.

    Listed are, for crucial triplets along the lineage tree, the genes with the p>0.8 of belonging to the two transition gene classes and the root marker class, and their associated probabilities.

    DOI: http://dx.doi.org/10.7554/eLife.20488.022

    DOI: 10.7554/eLife.20488.022
    Figure 4—source data 4. Human Brain Development SmartSeq2 Census.

    Further detail can be found Materials and methods and in Yao et al. (2017) (Supplemental Information).

    DOI: http://dx.doi.org/10.7554/eLife.20488.023

    DOI: 10.7554/eLife.20488.023
    Figure 4—source data 5. List of Human Transcription Factors.

    List of human transcription factors used to cluster and infer lineages for the cortical differentiation tree. List was adapted from (Ben-Porath et al., 2008).

    DOI: http://dx.doi.org/10.7554/eLife.20488.024

    DOI: 10.7554/eLife.20488.024

    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd

    RESOURCES