NAR Genomics and Bioinformatics. 2023 Aug 21;5(3):lqad077. doi: 10.1093/nargab/lqad077

Machine learning based lineage tree reconstruction improved with knowledge of higher level relationships between cells and genomic barcodes

Alisa Prusokiene 1, Augustinas Prusokas 2, Renata Retkute 3
PMCID: PMC10440785  PMID: 37608801

Abstract

Tracking cells as they divide and progress through differentiation is a fundamental step in understanding many biological processes, such as the development of organisms and progression of diseases. In this study, we investigate a machine learning approach to reconstruct lineage trees in experimental systems based on mutating synthetic genomic barcodes. We refine previously proposed methodology by embedding information of higher level relationships between cells and single-cell barcode values into a feature space. We test performance of the algorithm on shallow trees (up to 100 cells) and deep trees (up to 10 000 cells). Our proposed algorithm can improve tree reconstruction accuracy in comparison to reconstructions based on a maximum parsimony method, but this comes at a higher computational time requirement.

INTRODUCTION

Single-cell lineage tracing, reconstructing the relationship between individual dividing cells in a tissue or organism, has the potential to improve our understanding of many biological processes, including the major transitions in evolution (1), development from the founding zygote to a complex organism (2), stem cell properties in tissue (3), cancer cell differentiation (4) or metastasis (5), and pathways of tumour evolution (6).

The first full development lineage tree was traced for the embryonic cells of the nematode Caenorhabditis elegans by sketching the events of cell division and development histories observed directly through light microscopy (7). Recently, platforms based on cell barcoding were proposed for tracking the lineage of individual cells at a high resolution. An integrase-based synthetic barcode system, intMEMOIR, uses the serine integrase Bxb1 to perform irreversible random modification of DNA recording arrays that can be read out using fluorescence in situ hybridization imaging (8). Fluorescent reporter assays allow the rapid characterization of these recording units (9). This experimental system is coupled with a time-lapse movie of the cells as they divide to provide a ground truth lineage tree (8). Another technique, the substitution mutation-aided lineage tracing (SMALT) system, used a 3-kb readout sequence with 16 iSceI binding motifs to map single-cell resolution cell phylogenies during organ development (10). Additional strategies to achieve lineage tracing at single-cell resolution have been developed in the past few years: integration barcodes (designed as short DNA fragments placed in an expressed locus) and polylox barcodes (comprising a DNA cassette with multiple loxP sites in alternating orientations) (11). These, and future, technical advancements have to be matched by progress in relevant computational methods (12).

Various statistical methods for reconstructing gene trees and species trees (13,14) have been developed over the last few decades, but these methods have not been widely used on time and individual-cell resolved datasets. Mutations induced by cell barcoding methods are irreversible, which is different from the somatic mutations accumulated during mitotic cell division (8). Previously, we have developed a machine learning (ML) approach for cell lineage reconstruction, with results submitted to the Allen Institute Cell Lineage Reconstruction DREAM Challenge (15). This was the best-performing algorithm for the reconstruction of in vitro cell lineages of trees with <100 cells. It reached higher accuracy than other methods, such as distance-based method DCLEAR (16), maximum parsimony based method Cassiopeia-integer linear programming (17) and Cassiopeia-Greedy (17). Our proposed framework was based on embedding single-cell data into a feature space in order to train an ML algorithm to predict the probability that cells are siblings.

In this study, we expanded this method by training an ML model to predict higher level relationships between cells. We refined our methodology so that it can be applied to any number of possible edit states. To evaluate the performance of the algorithm, we used two sets of in silico data: (i) shallow trees (up to 100 cells), simulated using parametrization based on the in vitro dataset for mouse embryonic stem cells (15); and (ii) deep trees (up to 10 000 cells), simulated using parametrization based on the in vitro dataset for fly organ development (10). Further, we reconstructed trees of the in vitro dataset for mouse embryonic stem cells (15). We measured the accuracy of lineage reconstruction using the normalized Robinson–Foulds (RF) score. For trees with up to 100 cells, we additionally calculated the triplet (TRP) score, quartet (QRT) score and clustering information (CLI) score. We compared our reconstruction with a maximum parsimony method. We found that our proposed algorithm has an advantage over the maximum parsimony method in terms of reconstruction quality for both shallow trees and deep trees.

MATERIALS AND METHODS

The in vitro datasets

Mouse embryonic stem cell colonies

The data were obtained as part of the Allen Institute Cell Lineage Reconstruction Challenge through Synapse ID syn20692755 (15). In the experiment, the recording array (barcode) consisted of L = 10 recording units (8). Each recording unit was in one of three states: ground state (represented as ‘1’), a deletion (represented as ‘0’) or an inversion (represented as ‘2’) of the DNA sequence. Each colony was started from an individual cell and colony growth was observed for 48 h. The experimental data comprise (i) an array of intMEMOIR readouts as a text file (also called matrix) and (ii) the ground truth lineage for the colony. Tree-like data structures were provided as a Newick file (18). The ground truth cell lineage trees were obtained from video-microscopy data (8).

Fly organ development

A SMALT system used a barcode with 16 iSceI binding motifs spaced at equal distances throughout the sequence (10). The recording array consisted of L = 2943 recording units. Each recording unit was in one of two states: ground state (represented as ‘0’) or mutated state (represented as ‘1’). Binary sequence data were available for two specimens: one had 5002 cells and the other 5420 cells. The datasets correspond to fly organ development from embryo to late-third instar larvae (10).

The in silico datasets

Stochastic simulation of cell division and barcode editing

Accumulation of stochastic mutations during cell division was simulated similarly to (8,19). Each cell colony starts from an individual cell carrying an unedited array with a length of L units. The model assumes that cells divide synchronously and at a constant rate. The initial cell then undergoes a series of cell divisions (Figure 1A). After d divisions, the colony consists of N cells, where $N = 2^d$.

Figure 1.

The ML approach for lineage tree reconstruction. (A) Ground truth lineage. Here, we have a colony with nine cells; each cell corresponds to a tip of the lineage tree (yellow circles). All cell pairs have a common node (blue circles), showing their level of relatedness. (B) Corresponding barcode states: rows correspond to cells and columns to recorder units. A unit can be at the ground state (encoded as ‘1’), deletion (encoded as ‘0’) or inversion (encoded as ‘2’). (C) Feature construction is based on pairwise comparison of barcodes: number of units that have not been edited (F1); number of units that have the same edit (F4 and F6); number of units that have single edit (F2 and F3); and number of units that have different edits (F5). (D) Two examples of embedding single-cell data into a feature space for ML training. The upper row corresponds to level 1 (sibling cells) and the lower row corresponds to level 2 (cousin cells). State indicates whether relationship is true or false.

After every cell division, each unedited target can mutate with a given probability μ to one of several possible edited states $s \in S$. The state is chosen according to a probability $\alpha_s$. The process of recording is irreversible: once a recording unit is edited, it can no longer change (Figure 1B).
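The editing process described above can be sketched in a few lines of Python. This is a minimal illustration under the stated model assumptions (synchronous divisions, constant editing probability, irreversible edits), not the simulation code used in the study; the function name `simulate_colony` and its arguments are hypothetical.

```python
import random


def simulate_colony(n_divisions, n_units, mu, alphas, ground=1, seed=None):
    """Simulate synchronous divisions with irreversible barcode editing.

    alphas maps each edited state s to its probability alpha_s (summing to 1).
    Returns the barcodes of the 2**n_divisions cells of the final generation;
    sibling cells sit next to each other in the returned list.
    """
    rng = random.Random(seed)
    states = list(alphas)
    weights = [alphas[s] for s in states]
    cells = [[ground] * n_units]  # founder cell with an unedited array
    for _ in range(n_divisions):
        next_generation = []
        for barcode in cells:
            for _daughter in range(2):  # every cell produces two daughters
                child = list(barcode)
                for unit, state in enumerate(child):
                    # only unedited units can mutate; edits are permanent
                    if state == ground and rng.random() < mu:
                        child[unit] = rng.choices(states, weights)[0]
                next_generation.append(child)
        cells = next_generation
    return cells
```

For example, `simulate_colony(3, 10, 0.15, {0: 0.5, 2: 0.5}, seed=1)` returns eight barcodes of ten units each, using the three-state encoding of the mouse embryonic stem cell dataset.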

The probability that a recording unit has not been edited after d divisions is equal to $(1 - \mu)^d$. The ground truth lineage can provide the information needed to obtain an expected lower bound on the number of divisions each cell in a colony has undergone, for example by counting the number of ancestors for phylogenetic nodes.

The probability of an edited unit going to state s is denoted by $\alpha_s$, such that $\sum_{s \in S} \alpha_s = 1$, where S is the set of all possible edits and |S| denotes the size of set S. In the case of the mouse embryonic stem cell colonies dataset, |S| = 2 and S = {0, 2}, and in the case of the fly organ development dataset, |S| = 1 and S = {1}. The probability that a recording unit is in state s at division d, given that it was in the unedited state at division d − 1, is equal to $\mu \alpha_s$. As the state values of lineage tree internal nodes are not available, we do not know which division incurred the editing. Therefore, the probability of observing the recording unit in state s after d generations becomes $\alpha_s \left[1 - (1 - \mu)^d\right]$.
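The probability of observing a unit in state s after d divisions can be obtained by summing over the division at which the edit occurred. The derivation below is a sketch under the model's assumptions (constant μ, irreversible edits); the summation index k is introduced here for illustration only.

```latex
\begin{align*}
P(\text{state } s \text{ after } d \text{ divisions})
  &= \sum_{k=1}^{d} (1-\mu)^{k-1}\,\mu\,\alpha_s \\
  &= \mu\,\alpha_s\,\frac{1-(1-\mu)^{d}}{1-(1-\mu)} \\
  &= \alpha_s\left[1-(1-\mu)^{d}\right], \\[4pt]
(1-\mu)^{d} + \sum_{s \in S} \alpha_s\left[1-(1-\mu)^{d}\right] &= 1 .
\end{align*}
```

The last line checks that the unedited and edited outcomes together account for all of the probability, using $\sum_{s \in S} \alpha_s = 1$.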

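The total likelihood is equal to

$$\mathcal{L} = \prod_{c=1}^{C} \prod_{i=1}^{n_c} \prod_{l=1}^{L} P\left(o_{c,i,l} \mid d_{c,i}\right) \qquad (1)$$

where C is the total number of colonies, $n_c$ is the number of cells in colony c, $d_{c,i}$ is the number of divisions for cell i in colony c, $o_{c,i,l}$ is the observed state for unit l in cell i and colony c, and $P(o \mid d)$ is equal to $(1 - \mu)^d$ for an unedited unit and to $\alpha_o \left[1 - (1 - \mu)^d\right]$ for a unit in edited state $o \in S$.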
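As a hedged illustration of how such a likelihood could be evaluated, the sketch below computes a log-likelihood as a sum over colonies, cells and recording units, assuming that units are independent and that each unit is either unedited with probability (1 − μ)^d or in edited state s with probability α_s[1 − (1 − μ)^d]. The function names and the input layout (a list of colonies, each a list of `(d, barcode)` pairs) are assumptions made for this example, not part of the AMISEpi workflow.

```python
import math


def state_probability(state, d, mu, alphas, ground=1):
    """Probability of observing `state` in one unit after d divisions."""
    unedited = (1.0 - mu) ** d
    if state == ground:
        return unedited
    return alphas[state] * (1.0 - unedited)


def log_likelihood(colonies, mu, alphas, ground=1):
    """Sum of log-probabilities over colonies, cells and recording units.

    colonies: list of colonies; each colony is a list of (d, barcode) pairs,
    where d is the number of divisions the cell has undergone.
    """
    total = 0.0
    for colony in colonies:
        for d, barcode in colony:
            for state in barcode:
                total += math.log(
                    state_probability(state, d, mu, alphas, ground))
    return total
```

Working on the log scale avoids numerical underflow when many units and cells are multiplied together.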

For parameter estimation, we used the R package AMISEpi, which has an implementation of adaptive multiple importance sampling for Bayesian analysis (20). We set the priors to be uniformly distributed: $\mu \sim U[0, 1]$ and $\alpha_i \sim U[0, 1]$.

Simulation of shallow trees (up to 100 cells)

We used parameters (mutation rate and probability of mutations) estimated from the in vitro mouse embryonic stem cell dataset (15). We used data on all colonies for parameter estimation.

Simulation of deep trees (up to 10 000 cells)

We used parameters (mutation rate) estimated from the in vitro fly organ development dataset (10). We used both datasets to estimate the mutation rate per division/target. We assumed that there were 13 divisions, which gives the number of cells closest to that available in the datasets ($2^{13} = 8192$).

Lineage tree reconstruction using the ML approach (AMbeRland-TR)

Feature engineering

For illustration purposes, we assume that each barcode unit can be in one of three possible states: ground state, a deletion or an inversion (Figure 1B). We propose the following classes of features based on pairwise comparison of barcodes: number of units that have not been edited (F1); number of units that have the same edit (F4 and F6); number of units that have a single edit (F2 and F3); and number of units that have different edits (F5) (Figure 1C).
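The pairwise features can be illustrated with a short sketch for the three-state case. The encoding (‘1’ ground, ‘0’ deletion, ‘2’ inversion) follows the text; the assignment of F2 to ground/deletion and F3 to ground/inversion is an assumption made here, since the text only states that F2 and F3 count units with a single edit. The function names are hypothetical.

```python
from itertools import combinations

# Unordered state pairs mapped to feature names. F1, F4, F5 and F6 follow the
# text; which single-edit pair is F2 and which is F3 is an assumption.
FEATURES = {
    (1, 1): "F1",  # both units unedited
    (0, 1): "F2",  # one ground, one deletion (assumed)
    (1, 2): "F3",  # one ground, one inversion (assumed)
    (0, 0): "F4",  # both deleted
    (0, 2): "F5",  # one deletion, one inversion
    (2, 2): "F6",  # both inverted
}


def pair_features(barcode_a, barcode_b):
    """Count, unit by unit, how often each state pair occurs in two barcodes."""
    counts = {name: 0 for name in FEATURES.values()}
    for a, b in zip(barcode_a, barcode_b):
        counts[FEATURES[tuple(sorted((a, b)))]] += 1
    return counts


def colony_features(barcodes):
    """Feature vectors for every unordered pair of cells in a colony."""
    return {(i, j): pair_features(barcodes[i], barcodes[j])
            for i, j in combinations(range(len(barcodes)), 2)}
```

Each feature vector sums to the barcode length L, so the six counts describe how the recording units of two cells are partitioned among the possible state pairs.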

This approach can be extended to any number of possible edited states by including all possible pairwise combinations of the extended record set (the ground state together with the edited states in S) as predictors for the ML model. This should produce (|S| + 1)! predictors.

We have used the R package phangorn (21) to extract information on relatedness between cells from the ground truth cell lineage trees. For each level of the tree, this produced two lists of cell pairs for each colony: cells that share an ancestor at level t and cells that do not share an ancestor at level t. Two examples of embedding single-cell data into a feature space for ML training are shown in Figure 1D. The upper row corresponds to level 1 (sibling cells) and the lower row corresponds to level 2 (cousin cells). State indicates whether relationship is true or false.
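The idea of relatedness levels can be sketched without phangorn for a tree written as nested tuples. In this illustration the level of a cell pair is taken to be the larger number of edges separating either cell from their most recent common ancestor, so that siblings are at level 1 and cousins at level 2. This definition and the function name `pair_levels` are assumptions for the example; how pairs are then labelled true or false at each level is left to the reader.

```python
from itertools import combinations


def pair_levels(tree):
    """Map each unordered leaf pair to the level of its common ancestor.

    tree: nested tuples with leaf labels at the tips,
    e.g. (("A", "B"), ("C", "D")).
    """
    levels = {}

    def walk(node):
        # returns {leaf: number of edges between the leaf and this node}
        if not isinstance(node, tuple):
            return {node: 0}
        parts = [{leaf: h + 1 for leaf, h in walk(child).items()}
                 for child in node]
        # leaves from different children meet for the first time at this node
        for left, right in combinations(parts, 2):
            for a, height_a in left.items():
                for b, height_b in right.items():
                    levels[frozenset((a, b))] = max(height_a, height_b)
        merged = {}
        for part in parts:
            merged.update(part)
        return merged

    walk(tree)
    return levels
```

For a balanced eight-cell tree this yields level 1 for the four sibling pairs, level 2 for cousin pairs and level 3 for pairs that only meet at the root.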

ML training, prediction and interpretation

We used a gradient boosting machine (GBM) to implement the outlined ML approach. All calculations were performed in R using package gbm (22). The following options were used to train the GBM model: distribution = ‘bernoulli’; n.trees = 1000; interaction.depth = 10; n.minobsinnode = 5; cv.folds = 5; and train.fraction = 0.5.

We use the relative importance of features, partial dependence plots and individual conditional expectation plots for model interpretation. Relative importance is based on the number of times a predictor is selected when training the model (23). Higher values of relative importance indicate larger influence on the response. Individual conditional expectation curves visualize the partial relationship between the predicted response and a feature for individual datasets (24). Partial dependence plots show the relationship averaged over all observations, which makes it easier to extract expected trends (24).

Clustering

We applied a custom hierarchical clustering method for building a cell lineage tree from predicted probabilities. Clustering begins at the lowest tree level, where all clusters contain an individual cell. Each possible cell pair is then ranked according to the predicted probability that they share an ancestor at this level. At consecutively increasing levels, pairwise comparison is performed between each lower level cluster, where the calculated probability is the maximum between any elements of the two clusters. Cluster pairs are ordered again according to this probability and are assumed to have the same parent node if its value is above the estimated threshold for this level. This process is repeated until one or two clusters are left. We assume only binary trees.
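The clustering steps above can be sketched as a greedy, level-by-level merge. This is a simplified illustration, not the AMbeRland-TR implementation: the callable `prob(level, a, b)` stands in for the trained model's prediction, `thresholds` holds one assumed cut-off per level, and any clusters left over once the thresholds are exhausted are simply joined under one root.

```python
def reconstruct_tree(cells, prob, thresholds):
    """Greedy level-by-level clustering into a nested-tuple tree.

    prob(level, a, b): predicted probability that cells a and b share an
    ancestor at `level`; thresholds[level - 1] is the cut-off for that level.
    """
    clusters = [((cell,), cell) for cell in cells]  # (members, subtree)
    level = 1
    while len(clusters) > 2 and level <= len(thresholds):
        cutoff = thresholds[level - 1]
        scored = []
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # linkage: maximum probability over cross-cluster cell pairs
                p = max(prob(level, a, b)
                        for a in clusters[i][0] for b in clusters[j][0])
                scored.append((p, i, j))
        scored.sort(key=lambda item: item[0], reverse=True)
        used, merged = set(), []
        for p, i, j in scored:
            if p <= cutoff:
                break  # remaining pairs are below the threshold
            if i in used or j in used:
                continue  # each cluster gets at most one partner per level
            used.update((i, j))
            merged.append((clusters[i][0] + clusters[j][0],
                           (clusters[i][1], clusters[j][1])))
        # clusters without a partner are carried over to the next level
        merged.extend(c for k, c in enumerate(clusters) if k not in used)
        clusters = merged
        level += 1
    subtrees = [subtree for _, subtree in clusters]
    return subtrees[0] if len(subtrees) == 1 else tuple(subtrees)
```

Pairing each cluster with at most one partner per level keeps the merged nodes binary; only the final join can produce a multifurcating root when more than two clusters remain.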

Lineage tree reconstruction using a maximum parsimony tree reconstruction method

Maximum parsimony based methods try to find the minimum number of changes necessary to describe the data for a given tree. We performed a maximum parsimony reconstruction using the R package phangorn (21). An initial tree was required to start the maximum parsimony tree search. It was built using the Hamming distance between barcodes [package DescTools (25)] and unweighted pair group method with arithmetic mean clustering (26). We set the method as ‘fitch’ and the minimum number of iterations in the ratchet as 100.
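The Hamming distance used to build the starting tree simply counts the recording units at which two barcodes differ. A minimal sketch, with hypothetical function names and without the subsequent clustering step, could look as follows.

```python
def hamming(barcode_a, barcode_b):
    """Number of recording units at which two barcodes differ."""
    if len(barcode_a) != len(barcode_b):
        raise ValueError("barcodes must have the same length")
    return sum(a != b for a, b in zip(barcode_a, barcode_b))


def distance_matrix(barcodes):
    """Symmetric matrix of pairwise Hamming distances between cells."""
    n = len(barcodes)
    return [[hamming(barcodes[i], barcodes[j]) for j in range(n)]
            for i in range(n)]
```

Unlike the six pairwise features used for ML training, this distance does not distinguish which states differ, only that they do.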

Scores

We assessed the accuracy of lineage tree reconstruction using four metrics: normalized RF score, TRP score, QRT score and CLI score. All scores have values between 0 and 1, with smaller values indicating larger similarity between ground truth lineage tree and reconstructed lineage tree.

Normalized RF score

The RF distance counts the number of splits that are unique to one of the two trees (27). An RF distance of 0 indicates that all splits in both trees are the same. We used the RF.dist function from the R package phangorn to compute the normalized RF score (21).
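To make the split-counting idea concrete, the sketch below computes a normalized RF score for two trees written as nested tuples. It treats the trees as unrooted and divides the number of unshared non-trivial splits by the total number of non-trivial splits in both trees; this normalization is an assumption chosen for illustration and is not taken from the phangorn source.

```python
def leaves(node):
    """Set of leaf labels below a node of a nested-tuple tree."""
    if not isinstance(node, tuple):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Non-trivial bipartitions of the leaves, viewing the tree as unrooted."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    found = set()

    def walk(node):
        if not isinstance(node, tuple):
            return
        side = leaves(node)
        other = all_leaves - side
        if len(side) >= 2 and len(other) >= 2:
            # keep the side without the anchor so each split has one form
            found.add(other if anchor in side else side)
        for child in node:
            walk(child)

    walk(tree)
    return found


def normalized_rf(tree_a, tree_b):
    """Share of non-trivial splits that occur in only one of the two trees."""
    splits_a, splits_b = splits(tree_a), splits(tree_b)
    total = len(splits_a) + len(splits_b)
    return len(splits_a ^ splits_b) / total if total else 0.0
```

For a balanced eight-cell tree in which two cousin cells are exchanged (an assumed topology), four of the ten splits differ and the score is 0.4.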

TRP score

The TRP distance counts the number of subtrees of three taxa that are different in the two trees (28). The TRP distance was calculated using the tqDist algorithm (29) implemented in the R package Quartet (30). The TRP score was calculated by dividing the TRP distance by the total number of triplets shared between the two trees, i.e. $\binom{n}{3}$ for n shared cells (29).
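The triplet comparison can be illustrated by brute force for small rooted trees written as nested tuples. This sketch enumerates all triplets directly, which is far slower than the tqDist algorithm and is meant only to show what is being counted; treating an unresolved triplet as its own topology is an assumption made here.

```python
from itertools import combinations
from math import comb


def leaf_paths(node, path=()):
    """Map each leaf to the sequence of child indices leading to it."""
    if not isinstance(node, tuple):
        return {node: path}
    paths = {}
    for index, child in enumerate(node):
        paths.update(leaf_paths(child, path + (index,)))
    return paths


def mrca_depth(path_a, path_b):
    """Depth of the most recent common ancestor (shared path prefix)."""
    depth = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        depth += 1
    return depth


def triplet_topology(paths, a, b, c):
    """Return the closest pair among three leaves, or None if unresolved."""
    depths = {frozenset(pair): mrca_depth(paths[pair[0]], paths[pair[1]])
              for pair in ((a, b), (a, c), (b, c))}
    deepest = max(depths.values())
    closest = [pair for pair, depth in depths.items() if depth == deepest]
    return closest[0] if len(closest) == 1 else None


def triplet_score(tree_a, tree_b):
    """Fraction of shared leaf triplets whose topology differs."""
    paths_a, paths_b = leaf_paths(tree_a), leaf_paths(tree_b)
    shared = sorted(set(paths_a) & set(paths_b))
    total = comb(len(shared), 3)
    if total == 0:
        return 0.0
    differing = sum(
        triplet_topology(paths_a, *trio) != triplet_topology(paths_b, *trio)
        for trio in combinations(shared, 3))
    return differing / total
```

For a balanced eight-cell tree in which two cousin cells are exchanged (an assumed topology), 4 of the 56 triplets differ, giving a score of about 0.07.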

QRT score

The QRT distance enumerates all subsets of leaves of size 4 and counts how often the topologies induced by the four leaves agree in the two trees (31). QRT divergence was calculated using the tqDist algorithm (32,33) implemented in the R package Quartet (30). The QRT score was calculated as one minus QRT divergence.

CLI score

CLI distance is a generalized RF metric based on the information content of the largest split (34). To compute the CLI score, we use the function ClusteringInfoDistance with the option normalize = TRUE from the R package TreeDist (35).

Set-up for lineage tree reconstruction

In silico shallow trees

To determine the effect of varying the number of possible states and the number of recording units on the accuracy of lineage reconstruction, we performed simulations with the number of states varied between 2 and 6 and a recorder carrying either 10 or 20 units. For each configuration, we simulated 1000 lineage trees that were used for ML training, and an additional 100 lineage trees that were used for the testing of lineage reconstruction methods. For each simulation, we sampled the depth of lineage from the empirical distribution associated with the mouse embryonic stem cell dataset and editing rate from the fitted posterior distribution.

Mouse embryonic stem cell colonies

We used the same partition of the data as in (15), i.e. array readout data from 76 colonies along with the corresponding ground truth lineages as the training set and array readout data from 30 cell colonies as a testing set for accuracy evaluation.

In silico deep trees

We simulated division and target editing for 10 and 12 divisions, which produced colonies with 1024 and 4096 cells, respectively. For trees with 1024 cells, we trained an ML model using data obtained from 10 individual trees, but a single simulated tree was used to train the ML model for the case with 4096 cells.

RESULTS

Comparison of tree reconstruction accuracy scores

First, we investigated the performance of scores under conditions where a lineage tree is reconstructed erroneously. Figure 2 shows an example of how scores compare for a tree with eight cells and a few possible reconstructions. In case (i), cells A and C (which are cousins) have been assigned incorrectly. This resulted in the normalized RF score equal to 0.4, but had a small effect on the TRP score (0.07). In case (ii), we assign cells A and E incorrectly, which increases all scores, including the TRP score, to have values between 0.53 and 0.6. If two pairs of cells were assigned incorrectly [case (iii)], the normalized RF score is reduced by a factor of 3 and the QRT score by a factor of 2, but the value of the TRP score stayed the same as in case (ii). Next, we assumed that the number of divisions was estimated incorrectly for a pair of cells (G, H) [case (iv)]. Although this results in a different structure of the tree, only the TRP score detected the discrepancy, giving a value of 0.29, with all other metrics at value 0. So, if the quality of reconstruction were judged solely by the normalized RF score, QRT score or CLI score, this would erroneously suggest that case (iv) is a perfectly reconstructed tree. If a further two pairs of sibling cells are swapped [case (v)], this increases scores by 0.2–0.23 points. Finally, if all cells are reconstructed as sibling cells [case (vi)], the QRT score would be equal to 0.5, indicating that half of all possible quartets were reconstructed correctly.

Figure 2.

Comparison between scores for a colony with eight cells. (A) Ground truth lineage tree and six possible reconstructed tree topologies (i–vi). (B) Corresponding scores for each reconstruction from panel (A): normalized RF score, TRP score, QRT score and CLI score.

Estimating mutation rates

The mouse embryonic stem cell dataset

The resulting colonies had from 4 to 39 cells; in total, there were 1453 individual cells in the dataset. The distribution of the number of divisions in the ground truth lineage trees is shown in Figure 3A.

Figure 3.

Parameter estimation. (A) Distribution of the number of divisions in the mouse embryonic stem cell experiment dataset. (B) Posterior distribution of editing rate (μ) in the mouse embryonic stem cell experiment dataset. (C) Posterior distribution of the probability that an edited unit has the state ‘2’ (α) in the mouse embryonic stem cell experiment dataset. (D) Posterior distribution of editing rate (μ) in the fly organ development dataset.

Posterior distributions for fitted editing rate and the probability that an edited unit has a state ‘2’ are shown in Figure 3B and C, respectively. We estimated the mean of marginal posterior distributions to be μ = 0.15 and α = 0.48. For simulation of cell division and record editing, we assume that all states have the same probability, i.e. αs = 1/|S|.

The fly organ development dataset

Figure 3D shows the posterior distribution of mutation rate. We estimated the mean of marginal posterior distributions to be 0.0005. This is in agreement with (10), where it was concluded that ∼0.8–1.3 mutations were recorded on the readout sequence per cell generation.

Lineage reconstruction of in silico dataset: shallow trees

Results of simulations are shown in Figure 4. Our ML-based algorithm outperformed the maximum parsimony based method in all four metrics. Comparing mean performance, we found a 43–50% improvement in the normalized RF score, a 19–32% improvement in the QRT score and a 36–45% improvement in the CLI score.

Figure 4.

Lineage reconstruction accuracy for the in silico dataset, with L = 10 or L = 20 recording units, and the number of possible edits ranging from 2 to 6. Each bar summarizes the score of 100 lineage tree reconstruction tests. We denote our ML-based approach as ‘AMbeRland-TR’ and lineage reconstruction using the maximum parsimony approach as ‘MaxParsymony’. Accuracy metrics are normalized RF score, TRP score, QRT score and CLI score.

We found that increasing the number of target units had a stronger effect on lineage reconstruction accuracy than increasing the number of states. For a recording array using 20 targets instead of 10 targets, there was a 62% improvement in performance for the QRT score, a 56% improvement in performance for both the normalized RF score and CLI score, and a 28% improvement in performance for the TRP score. When the number of editing states was increased from 2 to 5, there was a 17% improvement in the normalized RF score for a recording array using 10 targets. There was a 68% improvement in the normalized RF score for a recording array using 20 targets. There was a 30% improvement in the QRT score for a recording array using 10 targets and a 49% improvement in the QRT score for a recording array using 20 targets.

Lineage tree reconstruction of the mouse embryonic stem cell dataset

A reconstruction was computed from the test dataset consisting of 30 cell colonies using only the intMEMOIR array readout (8). Figure 5A shows a pairwise comparison between the two methods. Most of the points are above the diagonal, indicating that our ML-based algorithm outperformed the maximum parsimony based method for all four metrics. Comparing mean performance, we found a 17–23% improvement in the normalized RF score, a 7–37% improvement in the TRP score, a 7–11% improvement in the QRT score and a 12–15% improvement in the CLI score (Figure 5B).

Figure 5.

Lineage tree reconstruction accuracy for the mouse embryonic stem cell dataset: (A) pairwise comparisons and (B) distribution of scores. We denote our ML-based approach as ‘AMbeRland-TR’ and lineage reconstruction using the maximum parsimony approach as ‘MaxParsymony’. Accuracy metrics are normalized RF score, TRP score, QRT score and CLI score.

Partial dependence plots and individual conditional expectation plots can be used to analyse the relationship between features and the response. It can be seen from partial dependence plots (red line) in Figure 6 (upper row) that the probability of cells being siblings decreases when the value of feature ‘F3’ increases up to 3. This has a biological meaning: the number of pairwise barcodes where one cell stayed in the ground state and the other had undergone editing cannot be high if the cells are siblings, and the highest probability is when there are no such barcodes. The same relationship is true for the feature ‘F5’: the probability of cells being siblings decreases with increasing numbers of pairwise barcodes where one cell had undergone deletion and the other inversion. There is a linear relationship between the probability of cells being siblings and the number of pairwise barcodes where both cells have undergone deletions (feature ‘F4’). These dependences for ‘F3’ and ‘F4’ become weaker when cells get further apart on the lineage tree (lower rows). The conditional expectation plots (grey lines) demonstrate that at the individual level, there are more complex relationships between features.

Figure 6.

Interpretation of the model for the mouse embryonic stem cell dataset. (A) Relative importance of features. (B) Average partial dependence (red line) and individual conditional expectation (grey line) of cell relatedness probability on features. Rows correspond to levels of lineage tree hierarchy, with the lowest level (siblings) on the top.

Lineage reconstruction of in silico dataset: deep trees

The performance of the AMbeRland-TR algorithm was consistent between reconstruction of trees with 1024 cells and 4096 cells, and in both cases outperformed the reconstruction by the maximum parsimony method (Figure 7A). On average, there was a 50% improvement in performance for the normalized RF score in comparison to the maximum parsimony method. Improvement in reconstruction quality comes at higher computational time requirements. It takes 54 min on average to reconstruct a tree with 4096 tips using the AMbeRland-TR algorithm, which is two times longer than the time required for the maximum parsimony method (Figure 7B).

Figure 7.

Lineage reconstruction for deep trees. (A) Normalized RF score. (B) Time requirements for the algorithm.

Computational requirements

The computational cost of the AMbeRland-TR algorithm has the following components: extracting features from barcode data; extracting relationships between cells from the ground truth trees; training the ML model for each tree level; and reconstructing trees for testing data. We have combined all procedures into two tasks: (i) training data preparation and ML model training; and (ii) testing data preparation and tree reconstruction. For the purpose of evaluating time requirements, we assume that training and testing data have only a single cell colony. For deep trees, as shown in the previous section, it is enough to train an ML model on a single tree. For shallow trees, the training dataset should contain enough trees to cover the variety of barcode combinations, but this should not be a burden as training data preparation for shallow trees is computationally fast. Overall, the time required for training and testing tasks was similar across the range of cell numbers we investigated (Figure 8). For example, for a tree with 10 000 cells, it would take ∼24 h to train the models and 24 h to reconstruct a tree. All calculations were performed on a MacBook Pro with a 2.4-GHz 8-core processor.

Figure 8.

Time requirements for the algorithm. (A) Training data preparation and ML model training. (B) Testing data preparation and tree reconstruction.

DISCUSSION

Machine learning is becoming an important tool that has considerable potential in biology (36), genetics and genomics, including the annotation of sequence elements and epigenetic, proteomic and metabolomic data (37), or solving problems arising in population and evolutionary genetics (38). So far, it has been underutilized for lineage tree reconstruction. In this study, we introduced the framework for constructing features and training an ML algorithm for experimental systems based on mutating synthetic genomic barcodes. We explored how the barcoding configuration influenced the performance of the algorithm by using a collection of simulated and biological datasets.

There is no universal method to quantify topological similarities between lineage trees (39). The most widely used score is the RF score, but it is very insensitive to discrepancies at higher levels of lineage tree structures [Figure 2 (iv)]. Assigning the correct pathway cells undergo during organism or cancer development is necessary in order to understand tissue-type differentiation. The QRT score systematically showed lower values indicating higher similarity between ground truth and reconstruction, even when reconstructed lineage trees lacked any structure [Figure 2 (vi)]. Therefore, calculating a combination of accuracy scores is necessary in assessing the quality of lineage tree reconstruction.

Our in silico lineage reconstruction experiments showed that the ML-based approach is able to take advantage of the high complexity of relationships between processes governing cell division and barcode editing. Under ideal conditions, i.e. no noise in observations and a large training dataset, it outperformed other statistical approaches by 19–62% for scores evaluating various aspects of lineage reconstruction (Figure 4). The experiments also showed that, when engineering the barcode system, it is more desirable to increase the number of recording units on barcode arrays than to increase the number of possible mutation states. However, increasing the number of edits from two states (i.e. single possible mutation) to three states (two possible mutations) could improve lineage reconstruction accuracy by at least 50% on average.

The two in vitro datasets represent different lineage tracing configurations: number of states (3 versus 2) and number of recording units (10 versus 2943). The ML approach performed better than the Hamming distance approach on both datasets. The difference was not as striking as for the simulated data. Possible explanations include the following: much smaller training dataset, larger heterogeneity between colonies or the presence of noise from experimental readouts.

From the model fitted to the mouse embryonic stem cell dataset, we can get insight into the functional relationship between experimentally observed barcode values and cell relatedness. For all levels of lineage tree hierarchy, we found that the features had a relative influence in the range from 11% to 24%; i.e. none of the six predictors had zero influence (Figure 6A). However, the ranking of features varied with the relationship level. When predicting whether cells are siblings (level 1), the highest relative influence was the number of pairwise barcodes where one cell was in the ground state and the other was inverted. This feature had lower influence for higher relatedness levels.

Our lineage reconstruction of the intMEMOIR dataset submitted to the Allen Institute Cell Lineage Reconstruction DREAM Challenge was the best-performing algorithm (mean RF score of 0.53 and TRP score of 0.52) (15). It was the best ranking method when benchmarked against a Bayesian phylogenetic framework (40). By training the ML model to predict higher level relationships between cells, we have been able to further improve performance. We achieved a mean RF score of 0.31 and a mean TRP score of 0.41 (Figure 5).

The area of lineage tracing is expanding fast, with new tools being developed at the experimental and computational levels. Jointly profiling DNA methylation, chromatin accessibility, gene expression and lineage information in single cells was made possible by developing an inducible lineage tracing mouse model with extremely large lineage barcode diversity (41). A computational pipeline that predicts cell lineages over several cell divisions solely from transcriptomic data was devised by leveraging genes displaying conserved expression levels over cell divisions (42). Another important direction is to understand cell fate transitions during development. An internal cellular clock could be recovered by integrating single-cell transcriptomics with lineage tracing (43,44). The ML approach has the potential to integrate barcode recordings with additional information, such as population dynamic parameters (40), single-cell gene expression (45), proteomics (46), microsatellite mutations (47) or clonal correlations (48).

A key limitation of our proposed approach is that it requires the ground truth dataset, i.e. recorded barcodes and tracked lineage histories. Except in the mouse embryonic stem cell experiment (15), such data are not available. One solution would be to train ML on simulated data. Having a global database with data on different lineage tracing configurations and results could be a starting point for accumulating knowledge as required for simulations. Under time and budget restrictions, having ground truth data on a single lineage tree makes it possible to train ML models, as our analysis on deep trees indicates. Future work should explore these possibilities and evaluate how to practically go about the process of training and improving ML models for reconstructing whole organ or even whole body lineage trees.

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for valuable comments that improved the quality of the paper.

Contributor Information

Alisa Prusokiene, School of Natural and Environmental Sciences, Newcastle University, Newcastle upon Tyne NE1 7RU, UK.

Augustinas Prusokas, Independent researcher, London, SW7 2BX, UK.

Renata Retkute, Department of Plant Sciences, University of Cambridge, Downing Street, Cambridge CB2 3EA, UK.

Data Availability

Analysis code used in this study can be accessed at the following URL: https://github.com/rretkute/AMbeRlandTR (permanent DOI: 10.5281/zenodo.8227619).

FUNDING

No external funding.

Conflict of interest statement. None declared.

REFERENCES

  • 1. Kapli P., Yang Z., Telford M.J. Phylogenetic tree building in the genomic age. Nat. Rev. Genet. 2020; 21:428–444.
  • 2. McKenna A., Gagnon J.A. Recording development with single cell dynamic lineage tracing. Development. 2019; 146:dev169730.
  • 3. Kretzschmar K., Watt F.M. Lineage tracing. Cell. 2012; 148:33–45.
  • 4. Ceto S., Sekiguchi K.J., Takashima Y., Nimmerjahn A., Tuszynski M.J. Neural stem cell grafts form extensive synaptic networks that integrate with host circuits after spinal cord injury. Cell Stem Cell. 2020; 27:430–440.
  • 5. Quinn J.J., Jones M.G., Okimoto R.A., Nanjo S., Chan M.M., Yosef N., Bivona T.G., Weissman J.S. Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts. Science. 2021; 371:eabc1944.
  • 6. Yang D., Jones M.G., Naranjo S., Rideout W.M., Min K.H., Ho R., Wu W., Replogle J.M., Page J.L., Quinn J.J. et al. Lineage tracing reveals the phylodynamics, plasticity, and paths of tumor evolution. Cell. 2022; 185:1905–1923.
  • 7. Sulston J.E., Schierenberg E., White J.G., Thomson J.N. The embryonic cell lineage of the nematode Caenorhabditis elegans. Dev. Biol. 1983; 100:64–119.
  • 8. Chow K.H.K., Budde M.W., Granados A.A., Cabrera M., Yoon S., Cho S., Huang T.H., Koulena N., Frieda K.L., Cai L. et al. Imaging cell lineage with a synthetic digital recording system. Science. 2021; 372:eabb3099.
  • 9. Frieda K.L., Linton J.M., Hormoz S., Choi J., Chow K.-H.K., Singer Z.S., Budde M.W., Elowitz M.B., Cai L. Synthetic recording and in situ readout of lineage information in single cells. Nature. 2017; 541:107–111.
  • 10. Liu K., Deng S., Ye C., Yao Z., Wang J., Gong H., Liu L., He X. Mapping single-cell-resolution cell phylogeny reveals cell population dynamics during organ development. Nat. Methods. 2021; 18:1506–1514.
  • 11. Chen C., Liao Y., Peng G. Connecting past and present: single-cell lineage tracing. Protein Cell. 2022; 13:790–807.
  • 12. Stadler T., Pybus O.G., Stumpf M.P.H. Phylodynamics for cell biologists. Science. 2021; 371:6526.
  • 13. Paradis E. Analysis of Phylogenetics and Evolution with R. 2012; NY: Springer.
  • 14. Felsenstein J. Inferring Phylogenies. 2004; Sunderland, MA: Sinauer Associates.
  • 15. Gong W., Granados A.A., Hu J., Jones M.G., Raz O., Salvador-Martinez I., Zhang H., Chow K.K., Kwak I.Y., Retkute R. et al. Benchmarked approaches for reconstruction of in vitro cell lineages and in silico models of C. elegans and M. musculus developmental trees. Cell Syst. 2021; 18:810–826.
  • 16. Gong W., Kim H.J., Garry D.J., Kwak I.J. Single cell lineage reconstruction using distance-based algorithms and the R package, DCLEAR. BMC Bioinformatics. 2022; 23:103.
  • 17. Jones M.G., Khodaverdian A., Quinn J.J., Chan M.M., Hussmann J.A., Wang R., Xu C., Weissman J.S., Yosef N. Inference of single-cell phylogenies from lineage tracing data using Cassiopeia. Genome Biol. 2020; 21:92.
  • 18. Cardona G., Rossello F., Valiente G. Extended Newick: it is time for a standard representation of phylogenetic networks. BMC Bioinformatics. 2008; 9:532.
  • 19. Salvador-Martinez I., Grillo M., Averof M., Telford M.J. Is it possible to reconstruct an accurate cell lineage using CRISPR recorders? eLife. 2019; 8:e40292.
  • 20. Retkute R., Touloupou P., Basanez K.G., Hollingsworth D., Spencer S.E.F. Integrating geostatistical maps and infectious disease transmission models using adaptive multiple importance sampling. Ann. Appl. Stat. 2021; 15:1980–1998.
  • 21. Schliep K.P. phangorn: phylogenetic analysis in R. Bioinformatics. 2010; 27:592–593.
  • 22. Greenwell B., Boehmke B., Cunningham J., GBM Developers. gbm: generalized boosted regression models. 2020; R package version 2.1.8.
  • 23. Friedman J.H., Hastie T., Tibshirani R. Additive logistic regression: a statistical view of boosting. Ann. Stat. 2000; 28:337–407.
  • 24. Goldstein A., Kapelner A., Bleich J., Pitkin E. Peeking inside the black box: visualizing statistical learning with plots of individual conditional expectation. J. Comput. Graph. Stat. 2015; 24:44–65.
  • 25. Doran H.C. MiscPsycho: an R package for miscellaneous psychometric analyses. 2010;
  • 26. Gronau I., Moran S. Optimal implementations of UPGMA and other common clustering algorithms. Inf. Process. Lett. 2007; 104:205–210.
  • 27. Robinson D.F., Foulds L.R. Comparison of phylogenetic trees. Math. Biosci. 1981; 53:131–147.
  • 28. Critchlow D.E., Pearl D.K., Qian C. The triples distance for rooted bifurcating phylogenetic trees. Syst. Biol. 1996; 45:323–334.
  • 29. Brodal G.S., Fagerberg R., Mailund T., Pedersen C.N.S., Sand A. Efficient algorithms for computing the triplet and quartet distance between trees of arbitrary degree. SODA ’13: Proceedings of the Twenty-Fourth Annual ACM–SIAM Symposium on Discrete Algorithms. 2013; 1814–1832.
  • 30. Smith M.R. Bayesian and parsimony approaches reconstruct informative trees from simulated morphological datasets. Biol. Lett. 2019; 15:20180632.
  • 31. Estabrook G.F., McMorris F.R., Meacham C.A. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst. Biol. 1985; 34:193–200.
  • 32. Sand A., Holt M.K., Johansen J., Brodal G.S., Mailund T., Pedersen C.N.S. tqDist: a library for computing the quartet and triplet distances between binary or general trees. Bioinformatics. 2014; 30:2079–2080.
  • 33. Smith M.R. Quartet: comparison of phylogenetic trees using quartet and split measures. 2019; R package version 1.2.2.
  • 34. Smith M.R. Information theoretic generalized Robinson–Foulds metrics for comparing phylogenetic trees. Bioinformatics. 2020; 36:5007–5013.
  • 35. Smith M.R. TreeDist: distances between phylogenetic trees. 2020; R package version 2.2.0. Comprehensive R Archive Network.
  • 36. Greener J.G., Kandathil S.M., Moffat L., Jones D.T. A guide to machine learning for biologists. Nat. Rev. Mol. Cell Biol. 2022; 23:40–55.
  • 37. Libbrecht M., Noble W. Machine learning applications in genetics and genomics. Nat. Rev. Genet. 2015; 16:321–332.
  • 38. Schrider D.R., Kern A.D. Supervised machine learning for population genetics: a new paradigm. Trends Genet. 2018; 34:301–312.
  • 39. Kim J., Rosenberg N.A., Palacios J.A. Distance metrics for ranked evolutionary trees. Proc. Natl Acad. Sci. U.S.A. 2020; 117:28876–28886.
  • 40. Seidel S., Stadler T. TiDeTree: a Bayesian phylogenetic framework to estimate single-cell trees and population dynamic parameters from genetic lineage tracing data. Proc. R. Soc. B. 2022; 289:20221844.
  • 41. Li L., Bowling S., Yu Q., McGeary S.E., Alcedo K., Lemke B., Ferreira M., Klein A.M., Wang S.H., Camargo F.D. A mouse model with high clonal barcode diversity for joint lineage, transcriptomic, and epigenomic profiling in single cells. 2023; bioRxiv doi: 10.1101/2023.01.29.526062, 31 January 2023, preprint: not peer reviewed.
  • 42. Eisele A.S., Tarbier M., Dormann A.A., Pelechano V., Suter D.M. Barcode-free prediction of cell lineages from scRNA-seq datasets. 2022; bioRxiv doi: 10.1101/2022.09.20.508646, 20 September 2022, preprint: not peer reviewed.
  • 43. Wang S.W., Herriges M.J., Hurley K., Kotton D.N., Klein A.M. CoSpar identifies early cell fate biases from single-cell transcriptomic and lineage information. Nat. Biotechnol. 2022; 40:1066–1074.
  • 44. Wang K., Hou L., Lu Z., Wang X., Zi Z., Zhai W., He X., Curtis C., Zhou D., Hu Z. Cell division history encodes directional information of fate transitions. 2022; bioRxiv doi: 10.1101/2022.10.06.511094, 07 October 2022, preprint: not peer reviewed.
  • 45. Giecold G., Marco E., Garcia S.P., Trippa L., Yuan G.C. Robust lineage reconstruction from high-dimensional single-cell data. Nucleic Acids Res. 2016; 44:e122.
  • 46. Pan X., Li H., Zhang X. TedSim: temporal dynamics simulation of single-cell RNA sequencing data and cell division history. Nucleic Acids Res. 2022; 50:4272–4288.
  • 47. Chapal-Ilani N., Maruvka Y.E., Spiro A., Reizel Y., Adar R., Shlush L.I., Shapiro E. Comparing algorithms that reconstruct cell lineage trees utilizing information on microsatellite mutations. PLoS Comput. Biol. 2013; 9:e1003297.
  • 48. Weinreb C., Klein A.M. Lineage reconstruction from clonal correlations. Proc. Natl Acad. Sci. U.S.A. 2020; 117:17041–17048.

Articles from NAR Genomics and Bioinformatics are provided here courtesy of Oxford University Press