Lineage reconstruction is central to understanding tissue development and maintenance. To overcome the limitations of current techniques that typically reconstruct clonal trees using genetically encoded reporters, we report scPECLR, a probabilistic algorithm to endogenously infer lineage trees at a single-cell-division resolution by using 5-hydroxymethylcytosine (5hmC). When applied to 8-cell pre-implantation mouse embryos, scPECLR predicts the full lineage tree with greater than 95% accuracy. In addition, we developed scH&G-seq to sequence both 5hmC and genomic DNA from the same cell. Given that genomic DNA sequencing yields information on both copy number variations and single-nucleotide polymorphisms, when combined with scPECLR it enables more accurate lineage reconstruction of larger trees. Finally, we show that scPECLR can also be used to map chromosome strand segregation patterns during cell division, thereby providing a strategy to test the “immortal strand” hypothesis. Thus, scPECLR provides a generalized method to endogenously reconstruct lineage trees at an individual-cell-division resolution.
Wangsanuwat et al. develop a probabilistic algorithm to reconstruct cellular lineages at an individual-cell-division resolution by using strand-specific 5hmC measurements in single cells, and combine it with the development of a single-cell multi-omics technology to quantify 5hmC and genomic DNA from the same cell to reconstruct larger lineage trees.
Understanding lineage relationships between cells in a tissue is a central question in biology. Reconstructing lineage trees is not only fundamental to understanding tissue development, homeostasis, and repair but is also important for gaining insights into the dynamics of tumor evolution and other diseases. Genetically encoded fluorescent reporters have been a powerful approach to reconstruct the lineage of many tissues (Kretzschmar and Watt, 2012). However, these methods require the generation of complex animal models for each stem or progenitor cell type of interest, and are limited to a clonal resolution (Kretzschmar and Watt, 2012). Similarly, other techniques, such as the use of viruses (Naik et al., 2013), transposons (Sun et al., 2014; Wagner et al., 2018), Cre-loxP-based recombination (Pei et al., 2017), and CRISPR/Cas9 (Alemany et al., 2018; Kalhor et al., 2017; McKenna et al., 2016; Perli et al., 2016; Raj et al., 2018; Spanjaard et al., 2018) have also been used to genetically label cells to primarily reconstruct clonal lineages. This clonal resolution limits our ability to understand tissue dynamics at a single-cell-division resolution. Although a recent method (MEMOIR) that combined CRISPR/Cas9-mediated mutagenesis with single-molecule RNA fluorescence in situ hybridization (FISH) enabled reconstruction of lineages at a single-cell-division resolution (Frieda et al., 2017), its ability to infer lineages dropped substantially by the third cell division.
Furthermore, as these methods involve exogenous labeling, they cannot be used to directly map cellular lineages in human tissues, thereby posing a barrier to understanding human development and diseases. Although endogenous somatic mutations have been used to reconstruct lineages, their low frequency of occurrence over the whole genome makes them challenging to detect and therefore limits their application as a lineage reconstruction tool (Behjati et al., 2014; Ju et al., 2017; Lodato et al., 2015). Similarly, recent methods have used mutations within the mitochondrial genome or microsatellites to reconstruct lineages, but these approaches are also limited to a clonal resolution (Biezuner et al., 2016; Evrony et al., 2015; Ludwig et al., 2019; Xu et al., 2019). Previously, we developed a method to detect the endogenous epigenetic mark 5-hydroxymethylcytosine (5hmC) in single cells (scAba-seq) and showed that the lack of maintenance of this mark during replication resulted in older DNA strands containing higher levels of 5hmC (Mooijman et al., 2016). The ability to track individual DNA strands through cell division allowed us to deterministically reconstruct lineages that were limited to two cell divisions (Mooijman et al., 2016). Therefore, to reconstruct larger trees and overcome limitations of other methods, we report single-cell Probabilistic Endogenous Cellular Lineage Reconstruction (scPECLR), a generalized probabilistic framework to endogenously reconstruct cellular lineages at an individual-cell-division resolution by using single-cell 5hmC sequencing. This approach can be used to successfully reconstruct up to four cell divisions. To reconstruct larger trees, we developed an integrated single-cell method, scH&G-seq, to simultaneously sequence 5hmC and genomic/mitochondrial DNA from the same cell. 
By combining information from genomic variants that can be used to identify clonal subtrees within the complete tree, together with strand-specific 5hmC that enables tracking the lineage of individual cells, scH&G-seq can be generalized to endogenously reconstruct the lineage of larger trees at a single-cell-division resolution.
Genome-wide strand-specific 5hmC enables initial lineage bifurcation of individual cells into two subtrees
As proof of principle, we dissociated 8-cell mouse embryos and performed scAba-seq to quantify strand-specific genome-wide patterns of 5hmC in single cells (Figure 1A). As shown previously, a majority of 5hmC is present on the paternal genome during these stages of pre-implantation development (Inoue and Zhang, 2011; Iqbal et al., 2011; Wossidlo et al., 2011). Single cells from an 8-cell embryo displayed a mosaic genome-wide distribution with no overlap of 5hmC between the plus and minus strands of a chromosome (Figure 1B). Furthermore, for each chromosome the strand-specific 5hmC was localized to a few cells, and other cells contained undetectable levels of the mark (Figure 1B). These observations show that only one allele carries a majority of 5hmC and that we are primarily detecting 5hmC on the original paternal genome, with DNA strands synthesized in subsequent rounds of replication carrying very low levels of the mark. We used this as our basis to reconstruct cellular lineages of 8-cell embryos.
Figure 1. Strand-specific single-cell 5hmC enables initial lineage bifurcation of individual cells into two subtrees.
(A) Schematic shows a zygote with chromosomes containing high 5hmC levels (solid lines) undergoing three cell divisions. The newly synthesized strands contain very low levels of 5hmC (dotted lines). SCE events occur randomly during each cell cycle. Single cells are sequenced by using scAba-seq to quantify strand-specific 5hmC.
(B) Data showing mosaic pattern of strand-specific 5hmC in single cells obtained from an 8-cell mouse embryo. 5hmC counts within 2-Mb bins on the plus and minus strands are shown in orange and blue, respectively.
(C) OSS analysis on chromosome 7 places cell 8 in one 4-cell subtree and cells 1 and 2 in the other subtree. Performing OSS on all chromosomes places cells in one of these two 4-cell subtrees and reduces the complexity of the lineage reconstruction problem.
As the first step toward reconstructing lineage trees, we noted that the original plus and minus strands of each paternal chromosome in the 1-cell zygote will be found in distinct cells on opposite sides of the lineage tree after n cell divisions. As a result, all cells can be placed in one of two subtrees, thereby reducing the number of cell divisions to be reconstructed from n to n – 1. For example, at the 8-cell stage, the original paternal plus strand of chromosome 7 is detected in cell 8 and the minus strand is detected in cells 1 and 2 (Figure 1B). This suggests that cell 8 is on the opposite side of the tree compared with cells 1 and 2. Performing this first step of scPECLR, referred to as original strand segregation (OSS) analysis, over all the chromosomes enables us to place cells 1–4 and 5–8 on opposite sides of the lineage tree, reducing the complexity of the problem from reconstructing 3 cell divisions with 315 tree topologies to 2 cell divisions with 9 tree topologies (Figure 1C).
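The topology counts quoted above can be checked with a short counting sketch. It assumes synchronous divisions, so that a tree of n labeled cells splits into two interchangeable subtrees of n/2 cells each; the function names `count_topologies` and `count_after_oss` are illustrative and are not part of scPECLR.

```python
from math import comb


def count_topologies(n_cells: int) -> int:
    """Count lineage trees for n_cells labeled cells (n_cells a power of two),
    assuming every cell divides once per generation."""
    if n_cells < 2 or n_cells & (n_cells - 1):
        raise ValueError("n_cells must be a power of two >= 2")
    if n_cells == 2:
        return 1  # two cells can only be sisters
    half = n_cells // 2
    # Choose which cells form one subtree; halve because the two
    # subtrees are interchangeable.
    splits = comb(n_cells, half) // 2
    return splits * count_topologies(half) ** 2


def count_after_oss(n_cells: int) -> int:
    """Topologies remaining once OSS has fixed the first bifurcation
    (requires n_cells >= 4)."""
    return count_topologies(n_cells // 2) ** 2


print(count_topologies(8))   # 315
print(count_after_oss(8))    # 9
print(count_topologies(16))  # 638512875
```

Under this counting, fixing the first bifurcation removes the binomial factor, which is why OSS reduces the 8-cell problem from 315 to 3 × 3 = 9 candidate trees.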
Probabilistic lineage reconstruction using scPECLR accurately predicts 8-cell embryo trees
To reconstruct the complete lineage tree, we next used the mosaic pattern of 5hmC arising from abrupt transitions in hydroxymethylation levels among cells along the length of a chromosome. These sharp transitions in 5hmC that are shared between two cells are the result of homologous recombination during sister chromatid exchange (SCE) events in the G2 phase of a previous cell cycle (Mooijman et al., 2016). Detection of 5hmC transitions that are common to two cells therefore indicates a shared evolutionary history between these cells (Figure 1A, inset). However, although an SCE event at the 4-cell stage would imply that the cells are sisters (Figure 1C, left), one occurring at the 2-cell stage would indicate that the same pattern of 5hmC transition can also be observed between cousins (Figure 1C, right). Thus, the observation of a single shared SCE event between two cells cannot be used to immediately discriminate between sister and cousin cell configurations.
To systematically determine the likelihood of observing different tree topologies, we developed a probabilistic framework where the occurrence of SCE events is modeled as a Poisson process. The total number of SCE events is used to estimate the parameter b of the Poisson process, the rate of SCE events per chromosome per cell division, using maximum-likelihood estimation (STAR Methods). After OSS, 8-cell trees can be grouped into two 4-cell subtrees, each with three possible tree arrangements (Figure 2A). Next, we used the probabilistic model to calculate the likelihood of observing an SCE pattern for a chromosome given a tree topology. We observed a large variety of SCE patterns, ranging from commonly observed patterns, such as one or two SCE transitions shared between two cells, to more complex distributions of 5hmC between cells (Figure S1). For the most common pattern of one SCE transition between two cells, scPECLR predicts that the tree with the two cells as sisters (tree A) is twice as likely as one where the two cells are cousins (tree B or C), in good agreement with simulated data (Figure 2B and STAR Methods). Similarly, when two SCE transitions are shared between two cells, the probability that the two cells are sisters is 2–3 times higher than the probability that they are cousins, with the likelihood ratio between sister and cousin tree configurations depending on the relative position of the SCE transition on the chromosome (Figures 2C and S2; STAR Methods). More complex 5hmC distribution patterns, such as when two SCE events are shared between three cells, substantially favor the configuration of tree A (Figure 2C and STAR Methods). After the SCE pattern of each chromosome is analyzed, we can estimate the total likelihood of observing different tree topologies, assuming that the SCE events on each chromosome are independent (STAR Methods). 
Finally, the likelihood of an 8-cell tree is the product of the likelihoods of the two corresponding 4-cell subtrees (Figure 2D and Method S1).
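The way per-chromosome likelihoods combine can be illustrated with a minimal sketch. It is not the scPECLR implementation: the per-chromosome likelihood values must come from the SCE model in STAR Methods, and here they are simply supplied as inputs. The function names, the dictionary layout, and the normalization of likelihoods into probabilities (which implicitly weights all topologies equally beforehand) are assumptions made for illustration.

```python
import math
from itertools import product


def _log(x):
    """Log that maps a zero likelihood to -inf instead of raising."""
    return math.log(x) if x > 0 else -math.inf


def subtree_probabilities(per_chromosome):
    """Combine per-chromosome likelihoods {topology: likelihood} into
    normalized topology probabilities, treating chromosomes as independent.
    Assumes at least one topology has non-zero likelihood on every chromosome."""
    topologies = list(per_chromosome[0])
    # Independence across chromosomes: multiply likelihoods (sum the logs).
    log_like = {t: sum(_log(c[t]) for c in per_chromosome) for t in topologies}
    peak = max(log_like.values())
    weights = {t: math.exp(v - peak) for t, v in log_like.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def tree_probabilities(left, right):
    """Probability of each 8-cell tree as the product of its two 4-cell subtrees."""
    return {(a, b): left[a] * right[b] for a, b in product(left, right)}


# Hypothetical input: three chromosomes, each with one SCE transition shared
# by the same two cells, so tree A is twice as likely as tree B or C.
single_transition = {"A": 2.0, "B": 1.0, "C": 1.0}
left = subtree_probabilities([single_transition] * 3)
print(left)  # A: 0.8, B: 0.1, C: 0.1
```

Working in log space avoids numerical underflow when many chromosomes are multiplied together, which matters more for the 38-chromosome case than for this toy input.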
Figure 2. Endogenous 5hmC-based lineage reconstruction using scPECLR.
(A) Two cells sharing an original DNA strand (solid orange line) can either be sisters (Tree A) or cousins (trees B and C) depending on whether the SCE event occurred at the 4- to 8-cell or 2- to 4-cell stage, respectively. Newly synthesized DNA strands are shown as dashed black lines.
(B) For an SCE transition between two cells, the probability of the pair of cells being sisters versus cousins is plotted against the relative position of the SCE event on the chromosome (k11). The model prediction (black) and simulation results (yellow) are shown for chromosome 1 (N = 97 for 2-Mb bins) with b = 0.3.
(C) The probability ratio between trees A and B is shown for N = 97 and b = 0.3 for two cases: two SCE transitions shared between two cells and two SCE events shared between three cells.
(D) For the 8-cell mouse embryo in Figure 1B, the probability of observing the different topologies, rounded to four decimal places, for the two 4-cell subtrees are shown.
To test the accuracy of scPECLR, we simulated 5hmC patterns of 8-cell embryos with an SCE rate similar to the experimentally observed value (b = 0.3) and within the range found in other cell types (Falconer et al., 2012; Hongslo et al., 1991; Tateishi et al., 2003; Wu et al., 2017; Zack et al., 1977). scPECLR predicted the lineage tree correctly in 96% of all simulations (Figure 3A, left). In contrast, MEMOIR predicted the lineage tree accurately in only ~67% of the top 40% most reliably reconstructed trees, although this was based on ground truth obtained from imaging data (Figure 3A, left). This improved accuracy of scPECLR strongly suggests that endogenous strand-specific 5hmC patterns present an accurate tool to reconstruct lineage trees at an individual-cell-division resolution. Furthermore, to directly validate our method against experimental data, we combined the lineage trees predicted by scPECLR from simulated 8-cell embryos to estimate the number of SCE events at the 4-cell stage. We hypothesized that if scPECLR predicted the correct tree then it would produce a distribution of SCE events similar to that of the experimental data at the 4-cell stage. We found that the scPECLR-predicted distribution of SCE events per cell at the 4-cell stage was statistically indistinguishable from the experimentally obtained distribution in 4-cell embryos (p>0.8, two-sample Kolmogorov-Smirnov [KS] test) (Figure 3A, right). In contrast, when one of the 314 incorrect tree topologies at the 8-cell stage was sampled randomly, it resulted in a distribution of SCE events per cell that was significantly different from the experimental data (p<10⁻⁴, two-sample KS test) (Figure 3A, right). These results show that scPECLR can reconstruct three cell divisions with high accuracy. Finally, we applied scPECLR on the 8-cell mouse embryo shown in Figure 1B and other embryos to predict lineage trees with high confidence (Figure 2D and Data S1).
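For readers unfamiliar with the two-sample KS test used in this comparison, the statistic itself is simple to compute. The sketch below is a generic pure-Python illustration, not the analysis code used in this study; the function name and the example counts are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two
    empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at a data point.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical SCE counts per cell for two groups of cells.
inferred = [0, 1, 2, 3]
observed = [2, 3, 4, 5]
print(ks_statistic(inferred, observed))  # 0.5
```

Converting the statistic into a p value requires the Kolmogorov distribution; in practice a library routine such as `scipy.stats.ks_2samp` returns both the statistic and the p value.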
Figure 3. scPECLR can reconstruct 8- and 16-cell lineage trees.
(A) (Left) scPECLR accurately predicts the lineage of 96% of simulated 8-cell trees (b = 0.3). Error bars indicate the bootstrapped standard error. In comparison, MEMOIR accurately predicts 67% of the top 40% most reliably reconstructed 8-cell trees (Frieda et al., 2017). (Right) The distribution of SCE events in 4-cell embryos (blue) is not statistically different from that of 4-cell trees inferred with scPECLR starting from 8-cell trees (orange, p > 0.8), but is different from 4-cell trees inferred from a random topology at the 8-cell stage (brown, p<10⁻⁴).
(B) Percentage of simulated 8- and 16-cell trees that are correctly predicted by scPECLR for different SCE rates (b). The prediction accuracy is computed by simulating 5,000 trees. Error bars indicate the bootstrapped standard error.
(C) Percentage of 2-, 4-, and 8-cell subtrees that are accurately predicted within simulated 16-cell trees as a function of the SCE rate (b). The prediction accuracy is computed by simulating 5,000 16-cell trees. Error bars indicate the bootstrapped standard error.
(D) Construction of consensus trees. In this example, the top six tree topologies (with the highest probabilities) obtained after applying scPECLR on a 16-cell tree are shown. The relative threshold (RT) parameter is used to determine the number of topologies considered in the consensus tree analysis. With an RT of 0.5, the top 5 topologies are selected to generate a consensus tree that is consistent with all these trees. The uncertainty within the consensus tree is quantified by the number of tree topologies it contains. Red fonts indicate parts of the lineage tree that are incorrectly predicted. The tree highlighted in bold is the true tree.
(E) Simulations show that as the RT increases, the median number of topologies in the consensus tree decreases (solid lines, left axis) whereas the false discovery rate (FDR) increases (dotted lines, right axis). In these simulations, two other parameters t8 and t4 are set to 0.75 and 1.0, respectively. For details, see STAR Methods.
(F) Graph showing how the specificity of the consensus tree is related to error tolerance. As the FDR decreases, the median number of topologies contained within the consensus tree increases. Note that the lowest FDR possible for b = 0.3, 0.5, 0.7, and 1.0 are 15%, 10%, 10%, and 5%, respectively.
(G) Single-cell 5hmC sequencing data for a 16-cell mouse embryo (4-Mb bins). The consensus tree associated with this embryo is estimated to have a 15% FDR. RT, t8, and t4 are set at 0.05, 0.85, and 0.8, respectively. The consensus tree is constrained to only 180 possible topologies, a significant reduction from the more than 600 million possible trees originally.
As SCE transitions play a central role in reconstructing lineage trees with scPECLR, we next explored how the endogenous rate of SCE events influences the accuracy of the model. As expected, the accuracy of lineage reconstruction increases monotonically with increasing rates of SCE events, and greater than 98% of the simulated 8-cell trees were correctly predicted for b≥0.4 (Figure 3B and STAR Methods). These simulations were performed by using 19 paternal autosomes based on our observations in pre-implantation mouse embryos; however, most cell types carry 5hmC on both alleles, and therefore we also performed simulations with 38 chromosomes. Again, as expected, the predictive power of the model increases, and more than 98% of the simulated 8-cell trees were accurately predicted for b≥0.2 (Figure 3B). These results demonstrate that the lineage tree can be accurately predicted up to three cell divisions even with low rates of SCE events (Figure 3B).
scPECLR can be extended to reconstruct the lineage of 16-cell trees
We next extended scPECLR to reconstruct the lineage of 16-cell trees, whereby the number of possible tree topologies increases exponentially to more than 6 × 10⁸. Although the ability to predict the complete lineage tree decreases (17% accuracy for b = 0.3), large parts of the tree were reconstructed accurately, with the most common error being the misidentification of one sister pair within a 4-cell subtree (Figures 3B and 3C). For an SCE rate of b = 0.3, 83% of all 4-cell subtrees and 63% of all 2-cell subtrees were predicted correctly (Figure 3C). These results suggest that when reconstructing 16-cell trees it is important to identify parts of the tree that can be predicted with high confidence. To accomplish this, we first included all tree topologies with probabilities above a threshold in relation to the tree with the highest probability (Figure 3D). A consensus tree that is consistent with all these tree topologies is then established (Figures 3D and S3; STAR Methods). As the relative threshold is increased (i.e., we include fewer tree topologies to construct the consensus tree), the median consensus tree contains fewer topologies, resulting in a more specific consensus tree. However, this results in an increase in false discovery rate (FDR). For example, with b = 0.3 and a relative threshold of 0.1, the median consensus tree contained 24 tree topologies (Figure 3E, solid red line). The consensus trees displayed an FDR of ~26%, implying that in 26% of the simulations the consensus tree has at least some part of the lineage tree that is not consistent with the true tree (Figure 3E, dotted red line). Thus, the relative threshold allows us to tune the competing goals of specificity and accuracy of the consensus tree. 
These results show that for a certain rate of SCE events and a desired level of FDR, the median number of topologies contained in the consensus tree can be estimated, yielding insights into how much lineage information can be extracted (Figure 3F and STAR Methods). Finally, as proof of principle, we sequenced a 16-cell mouse embryo and applied scPECLR to show that we can extract partial lineage information from larger trees (Figure 3G and STAR Methods).
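The relative-threshold selection and the idea of a consensus tree can be sketched as follows. This is a simplified illustration under stated assumptions: each topology is represented by its set of clades (groups of cells descending from one ancestor), and the consensus is taken as the clades shared by every selected topology. The additional parameters t8 and t4 and the exact construction used by scPECLR are described in STAR Methods and are not reproduced here; all function names and probabilities below are hypothetical.

```python
def clades(tree):
    """Set of clades (as frozensets of cell labels) in a nested-tuple tree."""
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a single cell
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return frozenset(found)


def select_topologies(probabilities, relative_threshold):
    """Keep topologies whose probability is at least
    relative_threshold times that of the most probable topology."""
    best = max(probabilities.values())
    return [t for t, p in probabilities.items() if p >= relative_threshold * best]


def consensus_clades(topologies):
    """Clades present in every selected topology."""
    return frozenset.intersection(*topologies)


# Hypothetical probabilities for three 8-cell topologies.
probabilities = {
    clades((((1, 2), (3, 4)), ((5, 6), (7, 8)))): 0.5,
    clades((((1, 2), (3, 4)), ((5, 7), (6, 8)))): 0.3,
    clades((((1, 3), (2, 4)), ((5, 6), (7, 8)))): 0.04,
}
selected = select_topologies(probabilities, relative_threshold=0.5)
shared = consensus_clades(selected)
print(frozenset({1, 2}) in shared)  # True: supported by both selected trees
print(frozenset({5, 6}) in shared)  # False: the selected trees disagree here
```

Raising the threshold keeps fewer topologies, so more clades survive the intersection and the consensus becomes more specific, at the cost of a higher chance that one of the retained clades is wrong.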
Integrated single-cell genomic DNA and 5hmC sequencing enables reconstruction of larger lineage trees
For larger 32-cell trees, the number of possible tree topologies increases to more than 10²⁶, making it computationally very expensive to calculate the likelihood of all trees. Therefore, we extended scPECLR by developing an algorithm that efficiently searches through the tree topology space to reconstruct these larger trees. After OSS bifurcates the 32 cells into two 16-cell subtrees, we identify groups of 8 cells that when combined minimize the number of SCE events at the 4-cell stage. This algorithm relies on the strategy that incorrectly grouped cells will increase the number of SCE events at the 4-cell stage, and this subsampling enables rapid search through the tree topology space. Finally, the four groups of eight cells are reconstructed using scPECLR as described above (STAR Methods). As expected, although the ability to predict the complete lineage tree is lower than that for 16-cell trees, this method can rapidly predict subtrees within the 32-cell tree. For example, for b = 1 and 19 alleles, 2-, 4-, and 8-cell subtrees are predicted with 50%–60% accuracy, whereas the 16-cell subtrees are predicted at close to 100% accuracy (Figure 4A, solid lines). For the more general case of 38 alleles in mouse genomes, the prediction accuracy increases substantially, and 80%–95% of the 2-, 4-, and 8-cell subtrees were predicted correctly for b = 1 (Figure 4A, dotted lines).
Figure 4. Integrated single-cell 5hmC and genomic DNA sequencing can be used to endogenously reconstruct larger lineage trees.
(A) Percentage of the full lineage, along with 2-, 4-, 8-, and 16-cell subtrees, that are accurately predicted in simulated 32-cell trees as a function of SCE rates (b). The prediction accuracy is computed by simulating 2,000 trees. Solid and dotted lines indicate cells where 5hmC can be quantified in 19 or 38 chromosomes, respectively.
(B) Percentage of the full lineage, along with the subtrees, that are correctly predicted in simulated 32-cell trees as a function of SCE rates (b), by using information from both 5hmC and gDNA. Solid and dotted lines indicate prediction accuracy by using integrated information and gDNA alone, respectively. The prediction accuracy is computed by simulating 2,000 38-chromosome trees, and the rate of occurrence of genomic variants is set to 0.6 per chromosome per cell division.
(C) Schematic illustration of scH&G-seq.
(D) scH&G-seq enables simultaneous detection of gDNA and 5hmC from the same cell.
(E) Heatmap of the Euclidean distance between cells and the corresponding dendrogram. Single cells cluster into two major groups. Cells from AluI, BseRI, and dual enzyme libraries are displayed in green, orange, and blue, respectively.
(F) Heatmap of the copy number profile of single cells sorted in the same order as the dendrogram in (E).
To endogenously reconstruct large lineage trees at an individual-cell division resolution, we hypothesized that single-cell strand-specific 5hmC data combined with information on genomic variants, such as genomic copy-number variations (CNV), genomic single-nucleotide polymorphisms (SNPs), or mitochondrial SNPs, could significantly improve the prediction accuracy. Genomic variants have previously been used to reconstruct clonal lineages and, therefore, when integrated with strand-specific 5hmC could help anchor subtrees within the complete lineage tree (Biezuner et al., 2016; Evrony et al., 2015; Ludwig et al., 2019; Xu et al., 2019). To test this hypothesis, we simulated trees with genomic variants together with SCE events and found that the prediction accuracy increases dramatically compared with the use of SCE events alone (Figures 4A, 4B, S4B, and S4C; STAR Methods). For example, for b = 1, the complete 32-cell lineage tree was predicted correctly in 76% of all simulations, and the 2- to 16-cell subtrees were predicted with greater than 96% accuracy (Figure 4B). In contrast, when using 5hmC or genomic variants alone, the prediction accuracy was lower (Figures 4A and 4B). Overall, these results demonstrate that 5hmC and genomic variants together present a general strategy to accurately reconstruct large lineage trees at a single-cell division resolution.
To accomplish this goal experimentally, we developed a method to simultaneously quantify 5hmC and the genome from the same cell (scH&G-seq). Single cells are lysed, and the genomic DNA (gDNA) and mitochondrial DNA (mtDNA) are digested by using the restriction enzymes AluI and/or BseRI (Figure 4C). After stripping chromatin from gDNA, 5hmC sites are glucosylated, and these sites are thereafter digested by the restriction enzyme AbaSI (Figure 4C). Double-stranded adapters, containing a cell-specific barcode, a 5′ Illumina adapter, and T7 promoter, together with restriction enzyme-compatible overhangs, are ligated to the fragmented DNA molecules (Figure 4C). These ligated molecules are then amplified by in vitro transcription and used to prepare Illumina libraries as described previously (Hashimshony et al., 2016; Mooijman et al., 2016; Rooijers et al., 2019; Sen et al., 2021), enabling simultaneous quantification of gDNA, mtDNA, and 5hmC from the same cell.
As proof of concept, we applied scH&G-seq to single H9 human embryonic stem cells with different combinations of restriction enzymes—AluI and AbaSI, BseRI and AbaSI, or AluI, BseRI, and AbaSI—and successfully detected both gDNA/mtDNA and 5hmC from the same cell (Figures 4D and S4D). We detected a similar number of 5hmC sites per cell, when compared with scAba-seq control cells, and integration with additional restriction enzymes enabled genome-wide sequencing of gDNA/mtDNA (Figures 4D and S4D). To show that gDNA variants can infer clonal cellular relationships, we called CNVs in single cells. Hierarchical clustering identified two major clusters with a diploid and non-diploid population, with additional subgroups within the non-diploid population (Figures 4E and 4F). These results demonstrate that scH&G-seq can be used to predict large lineage trees at a single-cell-division resolution. Similarly, the high mutation rate in mtDNA has previously been used to reconstruct clonal lineage trees, and therefore we used scH&G-seq to identify mtSNPs. Although we identified nearly 40 mtSNPs in H9 cells when mapping to the reference human genome, these SNPs were observed at a frequency of close to 100%. Comparison with previously published ATAC-seq data from H9 cells together with SNP calls from another human cell line also identified the same SNPs, suggesting that these nucleotides represented the wild-type sequence (Table S1) (Diroma et al., 2020; Liu et al., 2017). Nevertheless, these results show that in addition to sequencing 5hmC in single cells, scH&G-seq can be used to obtain clonal lineage information that can together be used to reconstruct larger trees.
scPECLR can be used to infer the rate of SCE events at each cell division and test the “immortal strand” hypothesis
In addition to reconstructing lineage trees, scPECLR can also be used to infer the rate of SCE events at each cell division. For example, in 8-cell embryos, the 5hmC distribution at the 4-cell and 2-cell stages can be reconstituted on the basis of the predicted lineage, enabling us to estimate the rate of SCE events at each cell division (Figure 2D and Data S1). Although the overall SCE rate over three cell divisions for all the 8-cell mouse embryos analyzed in this study was estimated to be 0.35 events per chromosome per cell division on average, the individual SCE rates for the 1- to 2-cell, 2- to 4-cell, and 4- to 8-cell stages were 0.31, 0.24, and 0.51, respectively. Furthermore, we found that the different rates of SCE events at each cell division did not affect the prediction accuracy of scPECLR (Figure S5 and STAR Methods). These results show that scPECLR can be used to infer the rate of double-stranded DNA (dsDNA) breaks at each cell division and that the rate of SCE events can vary during development.
Finally, we explored another application of scPECLR. As scPECLR uses endogenous strand-specific 5hmC in single cells to accurately reconstruct 8-cell trees, we hypothesized that this method could quantify how paternal alleles are segregated during cell division (Figure 5A). Different stem cell populations, such as hair follicle (Huh et al., 2013), neural (Karpowicz et al., 2005), satellite muscle (Conboy et al., 2007; Rocheteau et al., 2012), and intestinal crypt stem cells (Falconer et al., 2010; Potten et al., 2002), have been shown to display non-random segregation of DNA strands that can influence cell-fate decisions. These results have led to the “immortal strand” hypothesis, which postulates that old DNA strands are retained by daughter stem cells during asymmetric cell divisions to reduce the mutational load arising from genome replication of these long-lived cells. During mouse pre-implantation development, recent reports have shown that blastomeres show biases in cell fate specification as early as the 4-cell stage (Goolam et al., 2016; White et al., 2016). Therefore, we investigated sister chromatid segregation patterns of the paternal alleles at the 4-cell stage. We first combined 5hmC data from reconstructed sister cell pairs at the 8-cell stage to generate the distribution of the oldest DNA strands at the 4-cell stage (Figure 5B). In the example shown, when comparing cells (1,2) and (3,4), the original DNA strands appear to preferentially segregate to cell (1,2). In contrast, such a non-random pattern of DNA strand segregation is not observed between sister cells (5,6) and (7,8). Quantitatively, we analyzed 14 8-cell mouse embryos (equivalent to 28 2- to 4-cell division events) to find one sister pair at the 4-cell stage that displayed statistically significant non-random segregation of DNA strands (p<0.05) (Figure 5C and STAR Methods). 
To directly validate these results, we performed scAba-seq on 13 4-cell mouse embryos (equivalent to 26 2- to 4-cell division events). We again observed a similar distribution with one sister pair displaying a statistically significant non-random segregation pattern of DNA strands (p<0.05), which was statistically indistinguishable from that observed in 8-cell embryos (p>0.8, two-sample KS test) (Figure 5C and STAR Methods). The observation of two non-random segregation events out of 27 embryos was not statistically significant (p>0.15), suggesting this level of non-random segregation at the 4-cell stage of mouse embryogenesis could arise by random chance (Figure 5D and STAR Methods). Thus, this study shows that strand-specific reconstruction of lineage trees can be a powerful approach to test the immortal strand hypothesis in different stem cell populations.
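The per-pair test for non-random segregation can be illustrated with an exact two-sided binomial test under the null hypothesis that each original strand is equally likely to go to either sister cell. This is a minimal sketch, not the code used in this study: it assumes one count per chromosome, whereas how chromosomes split by SCE events are counted is specified in STAR Methods. The function name and the example counts are hypothetical.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p value: total probability of all outcomes
    that are no more likely than the observed count k out of n."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small tolerance so that outcomes tied with the observed one are included.
    tail = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical sister pair: 16 of 19 original paternal strands go to one cell.
p_value = binomial_two_sided_p(16, 19)
print(p_value < 0.05)  # True
```

Because many sister pairs are tested, a few pairs are expected to fall below p = 0.05 by chance alone, which is why the number of significant pairs is itself compared against simulated embryos (Figure 5D).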
Figure 5. scPECLR can be used to map DNA strand segregation patterns.
(A) Schematic of DNA strand segregation patterns during cell division.
(B) Combining the experimental 5hmC data for the 8-cell embryo in Figure 1B with the lineage tree predicted by scPECLR enables the genome-wide reconstitution of 5hmC in single cells at the 4-cell stage.
(C) Testing non-random segregation of DNA strands at the 4-cell stage of mouse embryogenesis. The p values from a binomial test under a null hypothesis of random segregation show that, out of 27 embryos, two pairs of sister cells display statistically significant (p<0.05) non-random segregation of DNA strands.
(D) Twenty-seven embryos were randomly sampled 10,000 times from a pool of 100,000 simulated 4-cell embryos, generated with a constant SCE rate of b = 0.3. A cumulative distribution of the number of sister pairs that display statistically significant (p<0.05) non-random segregation within the 27 embryos is shown. Red dot indicates the experimentally observed value of 2.
Cellular lineage reconstruction plays an important role in answering fundamental questions in several areas of biology, such as immunology, cancer biology, and developmental biology. However, most current methods have two major limitations: (1) clonal lineage reconstruction cannot establish lineage relationships at the resolution of individual cell divisions; and (2) the use of transgenes involves time-intensive generation of complex animal models and is an approach that cannot be extended to map lineages in human tissues. To overcome these limitations, we developed a generalized probabilistic framework, scPECLR, to reconstruct short-term cellular lineage trees at an individual-cell-division resolution by using strand-specific single-cell 5hmC sequencing data. Using simulated 8-cell trees, scPECLR showed a prediction accuracy of 96%. Because simultaneous live-cell imaging combined with single-cell 5hmC sequencing to directly compare lineage predictions is challenging, we validated our results by showing that 8-cell trees predicted by scPECLR, and not randomly selected incorrect trees, allow us to estimate the distribution of SCE events at the 4-cell stage that is consistent with experimental data (Figure 3A). These results highlight that scPECLR is not only accurate at reconstructing short-term lineage trees at an individual-cell-division resolution but can also be used to quantify DNA strand segregation patterns and test the immortal strand hypothesis in stem cell biology.
Furthermore, scPECLR can be applied to single-cell measurements of other non-maintained epigenetic marks, such as non-CpG methylation, 5-formylcytosine, and 5-carboxylcytosine, to reconstruct lineages (Sen et al., 2021; Wu et al., 2017), and more generally to systems where the chromosome strands present in the original cell can be distinguished from subsequently synthesized strands, such as those exposed to bromodeoxyuridine (Claussin et al., 2017; Sanders et al., 2020). Finally, we show that integrating 5hmC data with information on genomic variants from the same cell (scH&G-seq) significantly improves the prediction accuracy of larger lineage trees. Importantly, the use of an endogenous epigenetic mark and genomic variants to reconstruct lineage trees suggests that this method can be directly extended to study human development.
Limitations of the study
Although scPECLR enables endogenous lineage reconstruction at a single-cell-division resolution, the method suffers from two limitations. First, it cannot be applied to cell types in which the levels of 5hmC are below the detection limit of scAba-seq and scH&G-seq. However, as scPECLR relies on the relative levels of 5hmC between the two strands of a chromosome, it can be applied to many cell types, including those with low levels of 5hmC in their genome. For example, 16-cell mouse embryos display distinct mosaic genome-wide strand-specific 5hmC patterns that enable lineage reconstruction despite undergoing global erasure of DNA methylation (Messerschmidt et al., 2014; Saitou et al., 2012) (Figure 3G). A second general limitation of reconstructing larger lineage trees at an individual-cell-division resolution is that the number of tree topologies increases exponentially, resulting in a drop in prediction accuracy with each additional cell division. However, as this work demonstrates, scPECLR, in combination with scH&G-seq, significantly improves the lineage-reconstruction accuracy of larger trees (Figure 4B). Finally, as most other lineage-reconstruction methods employing CRISPR/Cas9, viruses, transposons, or Cre-loxP resolve larger-scale clonal information, scPECLR presents a complementary approach to these methods for applications that require reconstructing smaller lineage trees at an individual-cell-division resolution.
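To give a sense of how quickly the search space grows, the following sketch counts candidate lineage trees under two simplifying assumptions that are ours rather than the study's: all cells divide synchronously, so that the tree is a complete binary tree, and the two daughters of any division are interchangeable. The function names are hypothetical.

```python
from math import comb, factorial


def count_lineage_trees(n_cells):
    """Number of distinct complete binary lineage trees on n_cells labeled cells."""
    if n_cells < 1 or n_cells & (n_cells - 1):
        raise ValueError("n_cells must be a power of two")
    if n_cells <= 2:
        return 1
    half = n_cells // 2
    # Choose which cells descend from one daughter of the root; halve the
    # count because the two daughters are interchangeable.
    return comb(n_cells, half) // 2 * count_lineage_trees(half) ** 2


def count_lineage_trees_closed_form(n_cells):
    """Equivalent closed form: n! divided by the 2^(n-1) symmetries of the tree."""
    return factorial(n_cells) // 2 ** (n_cells - 1)
```

Under these assumptions, the count rises from 3 trees for 4 cells to 315 for 8 cells and 638,512,875 for 16 cells, so each additional round of division enlarges the set of incorrect trees that a prediction must rule out.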
Lead contact
Additional information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Siddharth S. Dey.
Materials availability
This study did not generate new unique materials or reagents.
Data and code availability
The raw and processed single-cell sequencing data have been deposited at GEO and are publicly available as of the date of publication. The accession number is listed in the key resources table.
All original code for scPECLR implementation is available in this paper’s supplemental information. All original code for scH&G-seq implementation has been deposited at GitHub and is publicly available as of the date of publication. The GitHub link is listed in the key resources table.
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
Supplemental information can be found online at
