Abstract
Charting the spatiotemporal dynamics of cell fate determination in development and disease is a long-standing objective in biology. Here we present the design, development, and extensive validation of PEtracer, a prime editing-based, evolving lineage tracing technology compatible with both single-cell sequencing and multimodal imaging methodologies to jointly profile cell state and lineage in dissociated cells or while preserving cellular context in tissues with high spatial resolution. Using PEtracer coupled with MERFISH spatial transcriptomic profiling in a syngeneic mouse model of tumor metastasis, we reconstruct the growth of individually-seeded tumors in vivo and uncover distinct modules of cell-intrinsic and cell-extrinsic factors that coordinate tumor growth. More generally, PEtracer enables systematic characterization of cell state and lineage relationships in intact tissues over biologically-relevant temporal and spatial scales.
Defining the history of cell division, differentiation, movement, and death provides fundamental insights into how complex structures emerge in multicellular life (1–3), including embryonic development (4–7), tissue maintenance by stem cells (8–12), disease progression, and tumor evolution (13–20). Accordingly, a wealth of approaches have been developed to track cell lineages with naturally-occurring mutations or engineered systems (21–32).
Initial prospective lineage tracing efforts and modern elaborations track static marks in cells– including dyes, fluorescent proteins, or recombined DNA reporter sequences–that provide views of clonal tissue architecture (21, 33–38). While these methods can preserve spatial context, they typically resolve progenitor outputs from predefined timepoints, limiting their capacity to capture dynamic cellular processes over time.
Recently engineered molecular recording technologies that pair cell lineage tracing with rich phenotyping have greatly expanded our ability to construct cellular phylogenies with high temporal resolution (39–52). Typically, these approaches stochastically install heritable lineage tracing marks (LMs) at predetermined sites, for example using Cas9-based genome editing, which can then be read out along with cell state using single-cell approaches. Systems with continuous accrual of heritable marks and sufficient diversity can enable reconstruction of vast phylogenetic trees, ideally recording the complete lineage history of all cells in a tissue (39, 43, 44, 47, 53, 54). There is growing interest in developing spatially-resolved lineage readouts (34, 40, 42, 55, 56) that leverage imaging-based cell profiling to capture multimodal cellular information with true single-cell spatial resolution at tissue scale (57–65). However, integrating the depth and timescale of evolving lineage tracing with high-dimensional, spatially resolved cell state mapping in vivo remains challenging.
Here we describe the design, engineering and rigorous validation of PEtracer, a prime editing (PE) (66–69) based lineage tracing system (Fig. 1A–C). We use PE to stochastically install immutable, predefined five nucleotide (5nt) lineage tracing marks (LMs) that can be robustly assayed by either single-cell sequencing for whole transcriptome profiling (39, 53–55) (e.g., droplet-based single-cell RNA sequencing [scRNA-seq]) or fluorescence in situ hybridization (FISH)-based imaging–specifically, multiplexed error-robust FISH (MERFISH)–for spatial transcriptomic profiling at subcellular resolution (56, 58). This readout flexibility enables both transcriptome-wide profiling and high-resolution spatial imaging. We apply PEtracer to study the 4T1 transplantable model of murine metastatic breast cancer and reconstruct the three-dimensional growth of individually-seeded tumors in vivo to dissect how cell-intrinsic and cell-extrinsic factors coordinate cellular behaviors during metastatic outgrowth. The PEtracer system represents an important step towards creating comprehensive cell fate maps of development and disease.
Fig. 1. Design and engineering of the PEtracer system.

(A) Components of the PEtracer system: the PEmax-T2A-GFP genome editing agent, lineage recording cassettes in the 3’ UTR of an mCherry gene, where each cassette contains a unique integration barcode (intBC) and three distinct edit sites that can be assayed either by droplet-based single-cell sequencing or by in situ transcription via T7 or T3 polymerization, and pegArrays, which encode all 24 LMs used for lineage tracing (eight LMs for each of the three edit sites). (B) Schematic of evolving lineage tracing using PEtracer. PEmax installs LMs at pre-determined edit sites, each lineage tracing cassettes contains three edit sites, and intBCs enumerate distinct integrated cassettes. (C) PEtracer cell state and lineage readout compatibility with both dissociative single cell readouts and those that preserve the architecture of the original tissue context. Lung image from BioRender. (D) Mean Robinson-Foulds (RF) reconstruction error for simulated phylogenies (n = 10) while varying the number of lineage marks (LMs) and normalized entropy (Hnorm) of their relative installation efficiencies. (E) Optimization of lineage mark (LM) insertion strategies (diagrams on right) across six endogenous genomic loci (RNF2, HEK3, EMX1, RUNX1, AAVS1, FANCF) using two representative five nucleotide (5nt) LMs in HEK293T cells. Four strategies for installing 5nt LMs at edit sites were tested: 1) inserting a 5nt mark between the protospacer adjacent motif (PAM) and the adjacent genomic target sequence recognized by the protospacer of our pegRNAs, 2) deleting the PAM and 3nt of the genomic target seed region (positions 18–20 of the protospacer where the PAM NGG are positions 21–23) and replacing these deleted bases with a 5nt mark, 3) performing the same edit as variant 1 but recoding the 3nt seed and PAM, and 4) replacing the 3nt seed sequence with a 5nt mark. Editing rates reflect the percentage of sequencing reads that contain the intended edit out of all aligned reads. Mean of three biological replicates ± standard deviation depicted. Sites and strategies used for subsequent experiments highlighted in yellow. Reverse transcription template (RTT) lengths are specific to each edit site. (F) Bar plot of the normalized (Norm.) LM installation efficiencies for the eight LMs in the final 24-mer pegArray (without protospacer mismatches) at each of the three finalized edit sites in B16-F10 cells (on left for each edit site) averaged across two biological replicates; Hnorm values provided for each edit site. Predicted probe hybridization specificity (Gibbs free energy = ΔG) for discriminating each of the eight LMs from one another (on right for each site). (G) Editing kinetics for protospacer mismatch variants at edit sites 1, 2, and 3 in 4T1 cells. Grey lines show all protospacer variants tested; no mismatch protospacers shown in black; protospacer mismatches selected for 2–3-week timescales shown in blue; 4–6-week protospacer mismatches in red.
In silico modeling of lineage tracing systems
Compatibility with imaging readouts guided the design and engineering of our novel lineage tracing system. In particular, assaying lineage information using FISH-based imaging requires that LMs be discriminable with a predefined probe set. Imaging readouts of current Cas9-based lineage tracers would therefore be challenging, as these systems install hundreds of varied LMs (39, 53, 54, 70, 71). We therefore used in silico simulations to assess whether fewer LMs could still yield high-quality phylogenies and to evaluate how key, experimentally-relevant parameters influence reconstruction performance (72).
We systematically varied the number of cells being traced, the number of LMs installed at each ES, the relative installation efficiencies for different LMs, the number of distinct ESs used for tree building, the fraction of ESs with a LM, and LM detection efficiency. Using simulated phylogenies, we quantified performance by calculating both the Robinson-Foulds distance (73) and mean number of triplets shared (74) between the reconstructed and ground-truth phylogenies (Materials and methods).
Reconstruction accuracy improved with both higher numbers of LMs and more balanced LM distributions (i.e. higher normalized entropy; Hnorm) (Fig. 1D and table S1) (75). However, the incremental benefit of additional LMs diminishes as their number increases (Fig. 1D). Eight well-balanced (Hnorm >0.9) LMs provide a good compromise between reconstruction accuracy and the feasibility of FISH-based imaging readouts (fig. S1A–F and tables S2,3) (76–78). As expected, low detection efficiency can degrade reconstruction performance, but this effect can be counteracted by increasing ES number (fig. S1E). Overall, our simulations demonstrate that a system with eight well-balanced LMs performs equivalently to existing Cas9-based tracers, reconstructing accurate phylogenies across a range of experimental parameters in a manner robust to imperfect detection rates (fig. S1E).
There is a tradeoff between LM installation rate and the ability to mark later cell divisions, as unmodified sites diminish over time (fig. S1G,H). Our simulations indicate that 60–80% edit saturation maximizes reconstruction accuracy, highlighting the need to precisely tune editing kinetics for different biological timescales (fig. S1E,H). Since tracing longer timescales requires slower editing rates, more ESs are needed to maintain lineage resolution comparable to that of shorter timescales given similar cell division rates. Our simulations indicate that with optimized kinetics, 75 ESs can theoretically mark 90% of divisions in a 10⁶-cell phylogeny generated by twenty doublings while 100 ESs can achieve the same resolution for 10⁹ cells from thirty doublings (fig. S1I, table S4).
Design of the PEtracer system
PE is a gene editing technology uniquely suited to satisfy the requirements established in our simulation work, as it enables precise insertion of diverse, predefined sequences in genomic DNA with minimal undesired byproducts or editing-induced toxicity (66, 79–83). A predefined set of LMs can be encoded by unique prime editing guide RNAs (pegRNAs) targeting the same ES, and high Hnorm values can be achieved by selecting pegRNAs that evenly compete during LM installation. Further, PE-installed LMs can be designed to precisely alter the protospacer adjacent motif (PAM) or protospacer seed region to prevent re-editing, thus installing predefined and immutable marks that can be read out using FISH-based imaging.
We designed a PE-based lineage tracing system (PEtracer) with three components: 1) the PEmax editor (67) whose expression can be monitored using co-expressed GFP, 2) lineage tracing cassettes (LTCs) in the 3’ UTR of an mCherry gene containing a unique DNA integration barcode (intBC) and three distinct ESs where LMs can be continuously and stochastically installed, and 3) pegRNA arrays (pegArrays) encoding 24 LMs (8 LMs per ES) linked to a BFP gene where each pegRNA encodes a distinct LM (Fig. 1A–C). We selected PE2-type editing, as opposed to PE3-type which involves opposite strand nicking, to avoid double stranded DNA break-derived insertions and deletions (indels) (66). intBCs and LMs can be read out by sequencing of polyadenylated transcripts or imaging of in situ transcribed LTCs detected using FISH probes (Fig. 1C). As LTCs are randomly integrated throughout the genome, they are spatially distinguishable and can be decoded with hybridization probes following T7 transcription in situ (56). This system is compatible with endogenous RNA MERFISH, enabling simultaneous readout of lineage and gene expression in tissue sections.
To align PEtracer performance with benchmarks established with our simulations, we performed three core technical optimizations to our system: identification of ESs and pegRNAs that enable rapid irreversible LM installation, selection of predefined LMs installed with similar efficiencies and maximally discriminable using hybridization-based approaches, and tuning LM installation kinetics to use the PEtracer system across a broad range of biological timescales.
PEtracer edit site and LM installation modality optimizations
We first determined the ES sequences, DNA editing strategy, and LM lengths that allow for highly efficient editing and accurate detection by both sequencing and imaging readouts. We reasoned that five nucleotide (5nt) marks would be installed with high efficiency by PE2 (84), should evade efficient mismatch repair detection (67) which can limit PE editing efficiencies in a cell-type-specific manner, and provide diverse sequences for robust discrimination with FISH probes (fig. S1J, table S5, and Materials and methods).
We tested four 5nt LM installation strategies at candidate ESs (Fig. 1E, right), each designed to prevent re-editing to both boost PE efficiency and ensure immutable, heritable marking (66). We installed two representative LMs using these editing strategies across six genomic loci using pegRNAs with fixed primer binding site (PBS) lengths and a range of reverse transcription template (RTT) lengths (85). We selected three loci with consistent and robust editing (Fig. 1E; RNF2, HEK3, EMX1) and further improved editing efficiency at these sites 1.5- to 3.6-fold through RTT optimization using engineered pegRNAs (epegRNAs) (fig. S2A) (85). Notably, our initial optimizations were performed at human genomic loci; thus, our pegRNAs would modify both endogenous sequences in human cells and LTC ESs. We therefore generated orthogonalized ES and epegRNA variants that maintain high editing efficiency while preventing editing of endogenous loci (figs. S2B–D and Materials and methods) (86).
As done previously (39), we constructed LTCs by inserting ES sequences in the 3’ UTR of an mCherry reporter to enable fluorescence-activated cell sorting (FACS) enrichment of cells with high numbers of LTC integrations (Fig. 1A). The orthogonalized variants of the RNF2, HEK3, and EMX1 loci termed ES1, ES2, and ES3, respectively, with all editing occurring on the same DNA strand to minimize PE-mediated generation of tandem duplications or indels (67, 87). LTCs are highly expressed for efficient capture by scRNA-seq. To facilitate detection by imaging, we introduced T7 and T3 promoters in the LTC to enable in situ transcription and thus amplification of integrated cassettes (56). Each cassette is marked by an intBC containing both a sequencing barcode (30nt) and a FISH-based imaging barcode (183nt). We generated a library of 2,171 intBC-tagged LTCs (annotated as intBC 1–2171; whitelist intBC number randomly assigned) to enable polyclonal cell line engineering where individual clones can be distinguished by their set of intBCs (table S6).
Selection of balanced LM sequences compatible with imaging-based readout
With LTCs and preliminary epegRNAs designed, we next sought to identify sets of eight 5nt LMs for each ES that were both efficiently installed and maximally discriminable by hybridization-based readouts. We tested all 1,024 possible LMs for each target locus and identified ES-specific LMs that were installed with high efficiency (fig. S2E, fig. S3A, and table S7). Among this efficiently-installed set, we computed how effectively each LM could be discriminated using hybridization probes (figs. S3A,B, table S8, and Materials and methods). For each ES, we then nominated twenty 5nt LMs and, based on retesting at orthogonalized ESs, selected the best-performing eighteen for subsequent engineering efforts (fig. S3A–C).
Finally, to ensure that each LM was installed with a similar efficiency when all epegRNAs are expressed in the same cell, we created arrays comprising eight epegRNAs each (8-mer arrays, table S9). Initial 8-mer pegArrays showed reasonably balanced performance (0.90<Hnorm<0.98; fig. S4A) which we further optimized (Hnorm for ES1: 0.98, ES2: 0.99, ES3: 0.96; fig. S4B) to construct our final 24-mer array consisting of an 8-mer for each ES (Fig. 1F). The final sets of eight LMs for each ES are predicted to be highly discriminable from one another and the unedited state based on calculated Gibbs free energy differences (Fig. 1F and table S8).
Kinetic tuning of epegRNAs to use PEtracer across diverse timescales
To identify epegRNAs that enable continuous editing over timescales ranging from several days to months, we tested a library of mismatches in the protospacers of epegRNAs targeting each ES (39). We exhaustively altered all protospacer positions except the first G for epegRNAs targeting each ES (57 per ES, Fig. 1G) and tested them with scRNA-seq-compatible lentiviral libraries (88) (Materials and methods) in two MMR-competent mouse cancer cell lines used to model metastasis (B16-F10 and 4T1) (89–91).
We identified numerous protospacer mismatches with slower LM installation rates than unmodified protospacer sequences (Fig. 1G and fig. S4C). To estimate the rate at which ESs are modified (edits/[ES*day]), we fit curves modeling the exponential decay of unmodified ESs over time, demonstrating uniform editing rates throughout the duration of the experiment (table S10). Across both cell lines, we observed expected patterns of protospacer mismatches leading to reduced editing rates, with PAM-proximal mismatches showing the largest reductions (fig. S4D). Notably, the relative editing rates for each mismatch were consistent across both lines; however, the median edit rate was 1.74-fold faster in B16-F10 cells compared to 4T1 (fig. S4E,F). We selected protospacer variants which tuned the desired editing rate to match the timescales for subsequent in vitro and in vivo experiments over 2–3 weeks (0.05–0.1 edits/[ES*day]) and 4–6 weeks (0.02–0.04 edits/[ES*day]), respectively (Fig. 1G). Notably, the protospacer variants screened here could support tracing across diverse biological systems and timescales–ranging from a single cell division even in rapidly dividing cells to many months–though cell type-specific differences in PE performance necessitate empirical selection of optimal variants.
In vitro validation of PEtracer readout using fully-edited cells
To measure PEtracer intBC and LM detection efficiency and decoding accuracy using both sequencing and imaging readouts, we generated 4T1 cells with predefined intBC-LM linkages (Fig. 2A–C and table S11). We edited cells to completion and pooled four clones to represent all LMs and unedited states for each ES (“fully-edited cells”, Fig. 2A–C). Across 6,883 cells profiled by scRNA-seq, we observed a 98.5% true positive rate for intBC detection and correctly called LMs for detected integrations with 99.9% accuracy (Fig. 2D,E and fig. S5A).
Fig. 2. Evaluation of PEtracer intBC and LM detection efficiency and decoding accuracy.

(A) Experimental schematic for generating fully-edited PEtracer cells with defined linkages between integration barcodes (intBCs) and lineage marks (LMs) to enable ground-truth evaluation of intBC and LM detection efficiency and decoding accuracy. These clones have distinct sets of integrations, typically ~10 copies and thus ~30 edit sites per clone. Representative sorting gates shown in fig. S16. Sequencer and microscope images from BioRender. (B) Droplet-based single-cell RNA sequencing (scRNA-seq) of 4 clones (6,883 cells). Color blocks on the left indicate clone identity, where clone 1 = red, clone 2 = yellow, clone 3 = green, and clone 4 = blue. Lineage tracing cassettes consisting of three edit sites are annotated with their respective intBCs at the bottom and show distinct LM combinations across the different clones. Unedited state (UE) is shown in grey; when an intBC was not detected (ND) in a cell, it is shown as white. (C) Summarized LM representation across the four combined clones. All 24 LMs are observed in the fully-edited clones 1 and 2; the unedited state is represented in clones 3 and 4. ES = edit site. (D) Confusion matrix depicting true positive, false negative, and false positive rates for detected and expected intBCs by droplet-based scRNA-seq. (E) LM detection rate (blue, left) and accuracy (red, right) for scRNA-seq data based on expected intBC-LM pairing. (F) Wide field view (left) of fully-edited and unedited clones plated for imaging-based lineage readout experiments pseudo-colored to indicate clonal identity. Color scheme matches (B), with the addition of undetermined cells (Undet.) that could not be confidently assigned to a clone shown in grey. Higher magnification of a field of view (middle) and a representative single cell (right) with DAPI nuclear staining (blue) and the two common bits (C1 = green; C2 = magenta). Integration amplicons appear as colocalized common bit puncta (white) and decoded intBCs are numbered to match our codebook (see table S6). (G) Example readouts for intBC and LM detection and decoding following in situ T7 transcription of integrated lineage tracing cassettes. Seventeen imaging rounds in three colors (50 total bits, columns) for distinct integration amplicons (rows) include two common bits to identify candidate spots followed by 21 integration barcode bits (Hamming distance 4, Hamming weight 6 code), 24 LM bits, and three bits for the unedited state. (H) Equivalent panel to (B) for imaging-based readout of 6,750 fully-edited cells using in situ T7 transcription of lineage tracing cassettes. (I) Confusion matrix as in (D) for imaging-based readout. (J) Validation and performance of the logistic regression classifier for LM assignment from imaging-based PEtracer data. Freq. = Frequency. Assignment probability cutoff of p >0.7 for decoded LMs is noted as a dashed line. (K) LM detection rate and accuracy after filtering and decoding with the LM classifier in (J).
To evaluate intBC and LM decoding by MERFISH-based imaging, we mapped 30nt sequencing to 183nt imaging barcodes for each integration using scRNA-seq (fig. S5B, and table S11). Fully-edited cells were plated at high density onto coverslips and we generated localized RNA amplicons via in situ T7 transcription (56) of genomically-integrated LTCs using a modified in-gel transcription protocol which integrates tissue clearing to enhance the accessibility of genomic loci and detection efficiency of RNA species (92, 93) (Materials and methods). We then conducted 17 rounds of 3-color imaging using hybridization probes to identify candidate amplicons, decode intBCs with a Hamming distance 4, Hamming weight 6 error correcting MERFISH code, and detect the LMs and the unedited state for each ES (Fig. 2F–H). Notably, this experiment allowed us to test all imaging readout bits. Across 5,614 cells, our optimized protocol achieved an intBC detection rate of 81.8%, maintaining a low false positive rate and improving decoding performance for dimmer spots relative to published protocols (Fig. 2I, fig. S5C–E and tables S12–16). We next trained a logistic regression classifier to predict the most likely LM from spot intensities across imaging rounds, achieving 99.4% accuracy after excluding low-confidence LMs calls, with >98.3% accuracy per LM (Fig. 2J,K and fig. S5F–H). Our simulations indicate this LM decoding performance supports accurate tree reconstruction.
Benchmarking of in vitro tree reconstruction
Next, we rigorously evaluated the accuracy of phylogenies reconstructed with PEtracer using scRNA-seq. To introduce independent, ground-truth marks of shared ancestry, we performed two rounds of lentiviral static barcoding at sequential time points (day 5 and 7 of tracing) in 4T1 cells with ~10–15 integrated LTCs and a pegArray tuned to edit over a 2–3-week timescale (Fig. 3A). We then harvested cells for scRNA-seq, capturing evolving LM data, transcriptional profiles, and static barcodes for each cell.
Fig. 3. Reconstruction of high-resolution, accurate phylogenies with PEtracer.

(A) Schematic of in vitro benchmarking experiment where phylogenies reconstructed based on the continuous accrual of lineage marks (LMs) can be evaluated for their ability to correctly group cells that share static barcodes introduced at two timepoints. Top shows the experimental design, where cells engineered with PEtracer components begin evolving lineage tracing on day zero, followed by two rounds of lentiviral infection with static barcode (BC) libraries associated with antibiotic selection markers puromycin (Puro; blue lettering) and blasticidin (Blast; red lettering). Bottom depicts how ground-truth static BC groups can be compared to reconstructed phylogenies. (B) Character matrix captured by single-cell RNA-seq (scRNA-seq) readout paired with phylogenetic tree reconstructed from LMs using the neighbor joining algorithm for the 3,108 cells in clone 4. Each integration barcode (intBC) is listed below its set of three corresponding edit sites that are colored by LM identity; UE = unedited, ND = not detected. (C) Averaged edit site modification frequencies for six clones assayed in this experiment colored by LM. (D) Phylogeny from (B) depicted as a circle with rings colored to show puromycin (inner ring) and blasticidin (outer ring) static BC groups for each cell (white indicates missing data). Phylogeny branches colored to match BC assignments. The inferred lowest common ancestor (LCA) is marked and colored to match each BC group (Puro = square; Blast = circle). (E) Averaged Fowlkes-Mallows Index (FMI) scores for puromycin and blasticidin BC groups measuring the agreement between LCA clades and observed (Obs.) static barcode labels versus randomly permuted (Perm.) barcode labels across the 6 phylogenies captured in this experiment. (F) Inferred LCA timing for puromycin and blasticidin static BC groups across 6 phylogenies. Dashed Day 5 and Day 7 lines indicate the true timing of when these BCs were introduced into cells. Averaged puromycin and blasticidin barcode FMI scores when down-sampling the number of edit sites (G) or the detection rate (H) used for tree reconstruction with six experimental phylogenies assayed using scRNA-seq. (I) Imaged coverslip of 4T1 cells growing as intermixed clones with cells pseudo-colored to indicate clone identity. x and y dimensions provided in millimeters (mm). Grey inset box highlights a polyclonal colony comprising clones 2, 14, 15, and 29. Black inset box shows clone 3 of 64, which we analyze in (J-N). (J) Histogram of intBC detection rate for cells in clone 3. The average detection rate is shown in red; a 60% cutoff for cells included in phylogenetic reconstructions is noted with a dashed line. (K) Character matrix captured by imaging readout of the 697 cells with robust intBC and LM detection in clone 3. LM annotations identical to (B) are shown across the intBCs enumerated on the x-axis. Clades 1 through 14 are colored on the phylogeny to the left. NA = not assigned a clade. (L) Nuclear masks for cells in clone 3 colored by their phylogenetic clade assignment. (M) Spatial positions of cells in clone 3 marked with specified edits that define spatial neighborhoods of cells. (N) Pairwise phylogenetic (Phylo.) versus spatial distance for cells in clone 3.
We reconstructed phylogenies for six clonal groups of cells using only the evolving LM information (Fig. 3B, fig. S6, and fig. S7A). These phylogenies show efficient LM labeling of each ES (ES1: 73.4%, ES2: 47.7%, ES3: 86.5%), with representation from all LMs (Fig. 3C). Notably, edit saturation is consistent across cells and independent of PEmax RNA levels indicating stable expression (fig. S7B) and unintended edits occur at extremely low rates (“other”; ranging from 0.07–0.54%), demonstrating high editing fidelity (Fig. 3B,C). To assess the quality of reconstructed trees, we quantified the agreement between groups of cells with a shared static barcode and phylogenetic clades defined by the inferred lowest common ancestor (LCA) of each group (Materials and methods). We reproducibly observed near-perfect agreement between LCA clades and static barcoding groups quantified using the Fowlkes-Mallows Index (FMI) (94) (0.85≤ FMI scores ≤1.00) (Fig. 3D,E, fig. S6, fig. S7A, and table S17). Both day 5 and 7 static barcode groups have consistently-high FMI scores demonstrating high accuracy at multiple depths within reconstructed phylogenies (figs. S7A).
Given the consistent editing rates demonstrated in our kinetics experiments (Fig. 1G), we could estimate the lifespan of each ancestral node in the tree and use these branch lengths to infer when static barcoding events occurred in the experiment. While these branch length estimates tend to be noisy, the inferred mean time for static barcode LCAs aligned with the timeline of the experiment for most clones (Fig. 3D,F, fig. S7A). Phylogenetic distances based on branch length estimates significantly outperformed tree-independent LM distances in separating static barcode groups, and were therefore used for all subsequent analyses (fig. S7C,D).
We performed down-sampling analyses to determine reconstruction accuracy as biologically-relevant conditions vary. Across all experimentally-generated phylogenies benchmarked by static barcoding, we found that reconstruction accuracy remained high with >20 ESs and with a detection rate of >60%–cutoffs that agree with our simulation work and which we use for subsequent reconstructions (Fig. 3G,H and tables S18,19).
Imaging-based spatial lineage analysis of motile cells
To evaluate the quality and scalability of phylogenetic reconstructions generated using PEtracer with imaging-based readouts, we profiled the growth dynamics of highly-motile 4T1 cells in vitro (95). We sparsely seeded cells with active lineage tracing components onto coverslips and allowed these cells to grow into colonies over six days (Fig. 3I and fig. S8A). From a single coverslip, we captured lineage information for 18,675 cells across 64 evolving clones which had an average of 39.5±10.1 ESs, a mean edit fraction of 76.4±27.2%, an intBC detection rate of 75.8±2.5%, and up to 1,167 cells (Fig. 3I,J, fig. S8A–D, and table S20). Colonies can be seeded by individual cells or by multiple cells that grow into one another. Indeed, in these colonies, we observed considerable intermixing between seeding clones within polyclonal colonies, highlighting substantial cell movement during colony expansion (Fig. 3I). Similarly, we observed marked intermixing between distinct phylogenetic clades within clones (Fig. 3K,L). In one representative clone, we observe that early LMs are widely distributed, whereas LMs that occur later are restricted to progressively smaller spatial regions, reflecting the limited time for dispersion since these cells shared a more recent common ancestor (Fig. 3M). Indeed, phylogenetic and spatial distances are correlated, with mean spatial distances between pairs of cells that share a common ancestor within the last three days being substantially lower than randomly permuted pairs (Fig. 3N, and fig. S8E,F). This trend is consistent irrespective of detection rate, demonstrating the accuracy of the phylogeny, a conclusion further supported by the correlation between phylogenetic and LM distances (a tree-independent metric) as well as our simulation work (fig. S1 and fig. S8F,G). Direct observation of these colonies over time substantiated clones intermixing and rapid cell movement during outgrowth (Supp. video 1). These results indicate that PEtracer imaging data can be used to reconstruct large, highly-accurate phylogenies with true single-cell spatial resolution.
Spatially resolved transcriptomic profiling of metastatic tumors in vivo
We next applied PEtracer in vivo to jointly profile cellular state, microenvironment, and phylogeny using the syngeneic 4T1 breast carcinoma cell line which recapitulates cancer cell extravasation, metastatic colonization, and growth in immunocompetent mice (89, 90, 96). We engineered a clonally diverse population of 4T1 cells with ~10–15 integrated LTCs and a pegArray tuned to edit over 4–6 weeks. scRNA-seq of these cells during in vitro culture revealed two distinct transcriptional states, distinguished by differential expression of epithelial-to-mesenchymal transition (EMT) related genes (fig. S9A,B). Clonal lung metastases were then initiated by tail vein injection of tracing cells FACS sorted either 6 or 12-days post-lentiviral induction of PEmax expression (Fig. 4A).
Fig. 4. Spatial transcriptomics of metastatic 4T1 lung tumors.

(A) Experimental schematic for generating MERFISH data from lung tumors following tail vein injection of 4T1 cells engineered with PEtracer components into immunocompetent mice. PEtracer 4T1 cells undergo evolving lineage tracing throughout tumor growth; tumor-bearing lungs are embedded for sectioning and imaged. (B) UMAP embedding of 368,722 cells sampled from 10 tumors in two animals, where each cell is colored by its assignment to the 18 distinct cell types resolved by this library. Cell type color annotations are listed below the UMAP. (C) Spatial organization of cell types in 4T1 lesions in the lungs of two mice, with a zoom-in showing nuclear segmentation masks. Cell types are colored using the annotations in (B). Mouse 1 (M1; left) has four distinct tumors (T1-T4); Mouse 2 (M2; right) has six (T1-T6). M1 T1 is bolded to indicate that this tumor is analyzed more deeply in main text Fig. 5 and fig. S14. M1 T2, M1 T4, and M2 T5 are underlined as they are analyzed further in fig. S13. Lung boundary is shown in grey and normal lung cells profiled by MERFISH (on average 154 μm from tumor boundary indicated by dashed line) are colored based on cell type annotation; tissue-level scale bars = 1 mm; inset scale bar = 100 μm. (D) Percentage and cell number counts for each cell type grouped by tumor, colors consistent with those in (B). (E) Marker gene expression (expr.) across annotated cell types. Dot color indicates the mean gene expression for a marker gene within a cell type, while dot size indicates the percentage of cells within a cell type that express that gene. (F) Heatmap of cell type densities in neighborhoods identified by Hotspot (left, row z-scored). Spatial distribution of cell type neighborhoods for M1 T1 section 2 (right). (G) Normalized cell type densities as a function of their scaled distance to the lung/tumor interface (boundary between adjacent lung and tumor; −1 corresponds to cells within the lung furthest from the lung/tumor interface and +1 corresponds to cells within the tumor furthest from the lung/tumor interface). Cell type colors consistent with legend in (B). (F, G) depict data from 9 tumors (M2 T3 excluded from analysis due to low cell number). (H) Spatial distribution of Endothelial cells, CD11chi macs., ARG1hi macs., and CAFs across M1 T1 section 2. Cell type abbreviations are as follows: Malig. = malignant, Mac. = macrophage, Treg = regulatory T cell, cDC = conventional Dendritic Cell, Neu. = Neutrophil, NK = Natural Killer cell, pDC = plasmacytoid Dendritic Cell, Endo. = Endothelial cell, CAF = Cancer Associated Fibroblast, AF1/AF2 = Alveolar Fibroblast Type 1/2, AT1/AT2 = alveolar epithelial type 1/2.
Following tumor outgrowth in vivo, we harvested the lungs of tumor-bearing mice and collected 20 μm tissue sections. Extensive experimental optimizations to prevent RNA degradation and accurately segment cells (97–102) enabled MERFISH imaging with a 124-gene library designed to annotate major cancer, immune, and stromal cell types (103–105) (Fig. 4A, table S21, and Materials and methods). Across five sections from two mice, we captured ten distinct tumors, all of which were derived from a more mesenchymal clone observed in vitro (fig. S9C–E). In total, we profiled 368,722 cells (mean: 178 transcripts/cell) with high correlation between MERFISH replicates and with bulk and scRNA-seq data (Fig. 4B–D, fig. S10A–D, and table S22) (106). Eighteen cell types were annotated based on the expression of previously-described marker genes (Fig. 4D,E and fig. S10E,F) (103–105). Overall, we observed consistent cell type representation across tumors but variable frequencies related to tumor size, with less immune infiltration in larger tumors (Fig. 4D). Larger tumors often had necrotic cores containing cellular debris, with few intact nuclei or RNA molecules detected by MERFISH (Fig. 4C and fig. S10A).
We next defined five cell type neighborhoods within 4T1 tumors using Hotspot (107) (Fig. 4F, fig. S11) and examined cell type densities relative to the lung/tumor interface, which we found to be the most informative axis of spatial organization (Fig. 4G). A lung/endothelial neighborhood with increased AT1/AT2, club, and endothelial cell densities was predominantly localized in non-tumor, normal lung tissue while the tumor core mainly consisted of malignant cells (Fig. 4F–H). We identified an immune enriched neighborhood located near the lung/tumor interface, although we observe immune infiltration throughout the tumor (Fig. 4F–H). We noted focal B cell enriched neighborhoods located outside the tumor boundary that were also enriched for ALOX15hi (arachidonate 15-lipoxygenase-high) macrophages (Fig. 4F–H, fig. S11). ARG1hi (Arginase 1) macrophages and cancer-associated fibroblasts (CAFs) were enriched at the tumor boundary furthest from the adjacent lung (distal tumor boundary; Fig. 4G–H), as described previously (108). We also observed reduced endothelial cell density within this hypoxic neighborhood at the distal tumor boundary, consistent with prior descriptions of M2-like macrophages enriched in hypoxic tumor regions (109–112) (Fig. 4F–H). These results reveal the intricate spatial organization of diverse cell types within the tumor microenvironment in this syngeneic model of cancer metastasis.
Lineage maps reveal spatial patterns of metastatic tumor growth
To dissect how cell lineage informs tissue structure, we performed our optimized in-gel T7 amplification protocol to read out lineage information in MERFISH-processed tumor sections (Fig. 5A). We first evaluated intBC and LM decoding in vivo by imaging in a primary 4T1 tumor comprising fully-edited cells, where we observed 99.6% LM readout accuracy in tumor sections (fig. S12). We noted a relationship between intBC detection efficiency, nuclei volume, and nuclei z-position (Fig. 5B). Decreased nuclei volumes at the top and bottom of the tissue slice suggest bisection of nuclei during sectioning; we therefore only included nuclei with centroid z positions >4 μm (corresponding to the average nuclei diameter) from the top or bottom of tissue sections – the average intBC detection rate for these cells was 55.2±9.7% (fig. S13A and table S23). Four clonal tumors with mean edit fractions >50% and 36±3 ESs with a common set of intBCs were selected for further analysis (fig. S13B,C and fig. S9C–E). While the presence of tumors with low edit fractions suggests that PE silencing can occur at the clone level, potentially exacerbated by selection against immunogenic PEtracer components, the narrow distribution of edit fractions within clones indicates that subclonal silencing is rare (fig. S13B). As with the in vitro 4T1 colonies, we reconstructed phylogenies using all cells with >60% intBCs detected, on average 45.2% of whole malignant cell nuclei (range 35.8%–57.5%; Fig. 5C and fig. S13D). After quality filtering, in vivo trees ranged in size from 1,265 (Mouse 2 Tumor 5 Section 1) to 19,029 cells (Mouse 1 Tumor 1; M1 T1 Sections 1–4) with near-optimal editing kinetics and LM diversity, consistent with in vitro editing with these pegRNAs (Fig. 5D,E and fig. S13E,F).
Fig. 5. In vivo phylogenetic reconstructions at tissue scale.

(A) Experimental schematic for imaging-based readout of PEtracer lineage data from samples after completion of MERFISH RNA readout. Samples undergo extensive digestion, in-gel T7 in situ amplification, and imaging of PEtracer integration barcodes (intBCs) and lineage marks (LMs). (B) intBC detection (detect.) rate as a function of cell volume and its z-position. (C) Histogram of cell count by in vivo intBC detection rate for M1 T1. The average detection rate is shown in red; a 60% cutoff for cells included for phylogenetic reconstructions is noted as a dashed line. (D) Edit site LM modification representation across all cells included in the phylogeny for M1 T1. UE = unedited. (E) Character matrix and phylogeny for the 19,029 cells with >60% intBC detection efficiency across four sections (S1 to S4) for M1 T1. intBCs are listed below the character matrix which is colored by LM identity for each edit site. LM coloring same as in (D), where additionally white = not detected. Clades 1 through 18 are colored in the phylogeny. NA = not assigned. Clade color scheme shared with (F). (F) Cells from four sections (S1-S4) colored by clade assignment. The location of the normal lung and the upper and lower portions of the lesion are annotated. Scale bar = 1 mm. (G) Spatial organization of edits in M1 T1 S1 that define phylogenetically related spatial domains. Grey cells denote other LM or unedited, white cells denote not detected. ES = edit site. (H) Pairwise phylogenetic (Phylo.) versus spatial distance for M1 T1 phylogeny. (I) Three-dimensional reconstruction of the colored clades in M1 T1. Balls indicate the average spatial position of cells in a clade; branches indicate the average position of contributing ancestors to each node. Sections S1 to S4 depicted as light outlines. Axes lengths = 1 mm. (J) Estimated number of extant cells as a function of time based on inferred branch lengths. Each clade is colored as in (E); dotted line shows an exponential fit curve of tumor growth. (K) Phylogeny from (E) for M1 T1 cells across all four sections depicted as a circle with phylogeny branches colored by clade assignment. Outer ring depicts inferred fitness for each cell. (L) Spatial organization of cell fitness across four sections for M1 T1. A spatially-confined (M), high-fitness expansion (N) is observed at the lung/tumor boundary. Cells in this expansion in S1 are shown in red, other cells shown in grey. (O) Malignant cell distance (dist.) to the tumor boundary versus fitness. Black line shows regression, dotted line mark top and bottom deciles. (P) Pearson correlation between indicated cell type density with fitness versus correlation of cell type density with distance to tumor boundary. Black line depicts regression with ribbon indicating 95% confidence interval. Mac. = macrophage, AF1 = Alveolar Fibroblast 1, cDC = conventional Dendritic Cell. See table S24 for relevant statistics.
With these tissue-scale phylogenies, we examined the coordinated effects of cell state, lineage, and spatial organization on tumor evolution in vivo. Phylogenetic clades occupy distinct spatial regions with consistent patterns observed across multiple sections (Fig. 5E,F and fig. S13G). In M1 T1, for example, intBC2147 ES1 LM3 marked an early split in the tree, with cells bearing this LM predominantly located in the lower half of the tumor, suggesting an early axis of tumor growth parallel to the lung/tumor interface (Fig. 5G). LMs defining later branches in the tree–representing cells with more recent common ancestors–are confined to progressively smaller spatial regions (Fig. 5G). Notably, while spatial intermixing of clades indicates that 4T1 cells remain motile in vivo, the correlation between pairwise phylogenetic distance and maximum spatial distance was higher than in vitro colonies, possibly reflecting reduced local cell movement due to increased cell densities in vivo (Fig. 5F,H and fig. S13G). Cells with more similar sets of LMs (lower LM distances) also tended to be spatially close, though this trend was weaker than that observed for phylogenetic distance (fig. S14A,B).
To reconstruct tumor growth in 3-D, we calculated the average spatial location of cells within each phylogenetic clade and that of ancestral nodes in the tree across multiple tissue sections of the same tumor (Fig. 5I). This analysis nominated an early phylogenetic and spatial bifurcation of tumor growth, with clades 1–7 located in the upper half of the tumor and clades 8–12 in the lower half. Clades 17 and 18 represent a phylogenetic outgroup with rich lineage sub-structure that is spatially enriched at the lung/tumor interface in the upper region of the tumor, a region with high local phylogenetic diversity (Fig. 5F,I and fig. S14C–E). Estimations of extant cell number for each clade at various time points suggested exponential growth for all clades throughout tumor progression, consistent with mostly neutral evolution at the clade level (Fig. 5J). We expect that divergence from exponential expansion at later time points is caused by subsampling due to the limited number of tumor sections we collected despite high cell recovery for individual sections, although subsampling could also mask changes in growth rates later in tumor development.
To understand drivers of cell lineage dynamics in vivo, we used tree structure to assess phylogenetic fitness, an integrated measure over time of either increased cell division or decreased cell death relative to other sub-clades. We calculated fitness for each cell (Materials and methods) and examined the spatial distribution of high-fitness cells, which we found to be enriched along the tumor periphery adjacent to the normal lung, including a highly localized expansion at the lung/tumor interface (Fig. 5K–N) (54). Overall, we observed a negative correlation between phylogenetic fitness and distance to the tumor boundary (Fig. 5O). Fitness estimates were strongly associated with mean neighbor LM distance, a tree-independent metric, that also indicated elevated growth near the tumor boundary, and showed minimal correlation with the fraction of sites edited per cell (fig. S14F–H). Given the diverse spatial distribution of immune and stromal cells, we asked whether local cell type densities correlated with cancer cell fitness. Although the density of several cell types (e.g., conventional dendritic cells; cDCs) were weakly correlated with fitness, we found that the densities of fitness-associated cell types also varied by their distance to the tumor boundary, complicating efforts to disentangle cell type density from spatial position effects (Fig. 5P, fig. S14I–K, and table S24). The wide variability in cancer cell phylogenetic fitness we observed at the tumor boundary suggests that additional cell-intrinsic and cell-extrinsic factors may cooperate to influence cancer cell dynamics and fitness.
Transcriptional and spatial determinants of cancer cell fitness
To better characterize transcriptional and spatial features that underlie differences in cancer cell fitness, we augmented our MERFISH library to assess gene programs associated with cancer progression (e.g. hypoxia, EMT, and cell cycle). Using this expanded 175-gene library, we generated MERFISH data from three sections of an additional tumor, assaying 104,219 total cells (mean: 261 transcripts/cell) (Mouse 3 Tumor 1; M3 T1; tables S22,25). Our new library recovered previous annotations and further resolved previously detected cell types into transcriptionally distinct subtypes (fig. S15A–D). Further, we identified four spatially-coherent transcriptional modules of malignant cells using Hotspot (107): 1) “leading edge” cells at the distal tumor boundary, 2) “hypoxic” cells, 3) “tumor core” cells, and 4) “lung adjacent” cells (Fig. 6A–C and fig. S15E). These modules were also associated with distinct cellular microenvironmental neighborhoods (fig. S15F).
Fig. 6. High resolution spatial mapping of cell state and lineage uncovers drivers of tumor growth.

(A) UMAP embedding of 104,219 cells sampled across three sections (S1 to S3) of a tumor from Mouse 3 (M3 T1) where the malignant cells are outlined and colored by their assignment to four distinct cancer cell modules: Module 1/Leading edge, Module 2/Hypoxic, Module 3/Tumor core, and Module 4/Lung adjacent. Full UMAP embedding where each cell is colored by its annotated cell type can be found in fig. S15A. (B) Heatmap of marker gene expression (expr.) for each cancer cell module. (C) Spatial organization of modules 1–4 across M3 T1 S1-S3. (D) Phylogeny for all malignant cells with >60% integration barcode detection efficiency across M3 T1 S1-S3 depicted as a circle. Branches are colored by clade assignment (Clades 1–18) with a color key beneath. The inner ring denotes fitness while the outer ring denotes the module assignment for each of the cells in the tree. (E) Representation (rep.) of each of the 18 phylogenetic clades across the 4 cancer cell modules. (F) Spatial organization of phylogenetic clades 1–18 across M3 T1 S1-S3. (G) Quantification of module assignment heritability as measured by Moran’s I phylogenetic autocorrelation. (H) Spatial organization of cell fitness scores across M3 T1 S1-S3. (I) Violin plots of cell fitness scores by module with embedded boxplot showing quartiles. Module 1/leading edge cell analyses = (J-L). (J) MERFISH counts for Foxm1 transcripts in M3 T1 S3 (malignant cells only). (K) Spatial organization cell cycle phase in M3 T1 S3 (malignant cells only). (L) Quantitation of cell cycle phase by cancer cell module in contrast to stromal cells in the adjacent normal lung. Module 2/hypoxic cell analyses = (M-N). (M) MERFISH counts for Vegfa transcripts in M3 T1 S3 (all cells). (N) Spatial distribution of Tumor endothelial (Endo.), ARG1hi macrophages (Mac.), and capillary 1/2 (CAP1/CAP2) endo. cells. (O) Heritability versus Pearson fitness correlation (corr.) of individual genes colored by module assignment. Module 4/lung adjacent cell analyses = (P-S). MERFISH counts for (P) Cldn4, (Q) Fgf1, and (R) Fgfbp1 transcripts in M3 T1 S3. (S) Violin plot for module 4 malignant cell fitness as a function of Cldn4 and Fgf1 expression. Abundance of cells with varying states listed below each entry. (T) Pearson correlation of malignant cell fitness with cell type densities, cell-intrinsic gene expression, or tumor features (distance (dist.) to the tumor boundary or lung-tumor interface) across the 5 tumors analyzed in this study. cDC = conventional Dendritic Cell, Treg = T regulatory cell, AT1/AT2 = alveolar epithelial type 1/2. Scale bars = 1 mm for (C), (F), and (H). Scale bars = 200 μm for (J-K), (M-N), (P-R). Significance in (I) and (S) denotes Mann-Whitney-Wilcoxon two-sided test where *p < 0.05, **p < 0.01, ***p < 0.001, ****p<0.0001. Significance in (T) denotes Student’s t-test of Pearson correlation coefficient. See tables S24, S26, and S27 for relevant statistics.
To understand how malignant cell modules emerged during tumor growth, we constructed a phylogeny for the 10,061 cells with >60% intBC detection in M3 T1 (Fig. 6D and fig. S15G). Each cancer cell module had consistent editing rates and mostly uniform clade compositions, except the “lung adjacent” module which was enriched for clades 13–16 (Fig. 6E,F, fig. S15H). We observe intermixing of malignant cell module assignments within phylogenetic clades, suggesting 4T1 cells exhibit transcriptional plasticity to adapt to microenvironmental conditions (Fig. 6D). Despite this plasticity, we find striking differences in malignant cell module heritability. Specifically, the “lung adjacent” module was the most heritable (Fig. 6G and table S26) possibly related to high expression of epithelial markers (e.g. Cdh1; E-cadherin) in contrast to “leading edge” cells which upregulate mesenchymal markers (e.g. Snai1; Snail family zinc finger 1) and therefore are more likely to move and exhibit transcriptional plasticity (Fig. 6B) (113). We next used the M3 T1 phylogenetic tree to parse how module-associated features coordinately instruct cancer cell fitness. As before, high-fitness cells were overrepresented in certain phylogenetic clades and enriched at the tumor boundary, particularly within the “lung adjacent” module (Fig. 6D,H,I).
These data suggest that malignant cell modules reflect the integrated effects of transcriptional programs and environmental cues that instruct tumor growth. “Leading edge” cells show higher expression of cell cycle-related genes (including Foxm1 [forkhead box M1], a transcription factor previously identified as a genetic dependency for 4T1 growth in vivo) (114) and occupy a cellular neighborhood enriched for CAFs at the distal boundary (Fig. 6J–L, Fig. 4G,H, and fig. S15F). The spatial and transcriptional signature of “leading edge” cells has low heritability indicating that the proliferative capacity of these cells is a transient state influenced by local factors that do not instruct long-term behaviors (Fig. 6E,G). A hypoxic microenvironment beneath “leading edge” cells defines module 2, marked by expression of angiogenesis-related genes (e.g. Vegfa; vascular endothelial growth factor A) and a sharp boundary between tumor endothelial cells and ARG1hi macrophages (Fig. 6M,N) (109–112). “Tumor core” malignant cells show no distinct transcriptional or lineage signatures. By contrast, “lung adjacent” cells have the highest fitness and this module is the most heritable in this tumor (Fig. 6G,I). Phylogenetic fitness reflects proliferative potential across several weeks and many cell divisions while cell cycle state can fluctuate in a matter of hours, which we suspect is why “lung adjacent” cells have higher fitness than “leading edge” cells despite having a slightly lower proportion of actively cycling cells (Fig. 6L).
We next sought to identify the drivers of high fitness cancer cell states, including lineage, positional, cell-intrinsic, and cell-extrinsic factors. Similar to our observations in M1/M2 tumors, we found that the densities of several cell types (e.g. regulatory T cells; Tregs) were positively correlated with phylogenetic fitness and also enriched at the tumor boundary near “lung adjacent” cells (fig. S15F,I). To identify cell-intrinsic drivers of cancer cell growth, we looked for fitness associated genes that also exhibited heritable expression patterns (Fig. 6O and tables S24,26). This analysis nominated genes associated with an epithelial-like state that are highly expressed in high-fitness “lung adjacent” malignant cells but lowly expressed elsewhere including Cldn4 (Claudin 4, an epithelial cell tight junction protein) and Fgf1/Fgfbp1 (Fibroblast Growth Factor 1 and its carrier protein Fgf Binding Protein 1) (Fig. 6O–R). To disentangle the contributions of spatial cues versus heritability in these cells, we quantified the extent to which gene expression patterns were explained by spatial proximity versus phylogenetic relatedness. Cldn4 expression was highly heritable, indicating a stable cell-intrinsic state, whereas Fgf1/Fgfbp1 expression was more spatially restricted and less heritable, consistent with their signaling functions (Fig. 6O–R and fig. S15J). Notably, Fgf1 is also expressed by non-malignant cells (fig. S15C), suggesting both paracrine and autocrine signaling. These genes illustrate how spatially- and phylogenetically-coherent programs interact to promote “lung adjacent” cell fitness, where cells co-expressing both Cldn4 and Fgf1 exhibit the highest phylogenetic fitness (Fig. 6S). These findings are reproducible across tumors, with similar sets of cell-extrinsic and cell-intrinsic factors associated with fitness, including Cldn4 and Fgfbp1 expression (Fig. 6T, fig. S15k and table S24).
Discussion
Here we describe PEtracer, an evolving lineage tracing technology that enables joint phylogenetic and cell state profiling using either single-cell sequencing or high-resolution imaging while preserving cellular tissue context. Guided by computational modeling, we engineered PEtracer to enable robust, high-resolution phylogenetic reconstructions capable of marking individual cell divisions in fast-dividing cells and tuned editing kinetics to enable lineage recording over many weeks or months. PEtracer allows us to capture key events that occur dynamically throughout biological processes without a priori knowledge of event timing. We performed rigorous benchmarking of PEtracer to establish that we can assemble high-quality phylogenies that accurately reconstruct static barcode groups across multiple time points. Using MERFISH-based imaging–chosen for its high spatial (single-cell and subcellular) resolution, high RNA detection efficiency, and minimal cell loss relative to current sequencing-based spatial methods–we collected rich cell state and lineage data from intact tissue sections (115). This approach preserved the cellular tumor microenvironmental context and enabled high-resolution reconstruction of growth patterns in individually-seeded tumors within a transplantable model of murine breast cancer lung metastasis (89, 90, 96). PEtracer is also well-suited for integration with other imaging- and sequencing-based spatial transcriptomic methods (55, 116–118).
Our tumor metastasis studies highlight how integrating cell state, lineage, and spatial information at tissue scale in vivo provides unprecedented insights into tumor evolution. Using MERFISH, we identified spatially- and transcriptionally-coherent cancer cell modules defined by distinct cell-intrinsic and cell-extrinsic programs. At the outer tumor edge, increased malignant cell division is likely driven by space for further growth and a supportive cellular neighborhood rather than heritable factors, as phylogenetic analysis reveals low heritability of this state. Notably, phylogenetic fitness and cell cycle state reflect proliferative capacity over different time scales, and leading edge cells do not have the highest fitness within the tumor despite a modestly higher proportion of proliferating cells. As leading edge cells divide, they create a hypoxic, high-density cell layer where low oxygen availability induces upregulation of mesenchymal gene expression programs (119) and high Vegfa expression which promotes angiogenesis (120). Within the more-vascularized tumor core, spatial constraints limit growth despite an available blood supply. Conversely, epithelial-like malignant cells adjacent to the normal lung adapt to a nutrient- and immune-rich niche, upregulating heritable transcriptional programs (e.g. Cldn4, Fgf1/Fgfbp1) associated with high phylogenetic fitness. These adaptive programs differ from the more mesenchymal states that enable extravasation and initial seeding of lung lesions (121). Spatially-resolved phylogenies reveal complex relationships between transcriptional state, spatial location, and lineage that would be difficult to resolve using only spatial or transcriptional neighbors to make lineage inferences due to cell migration and phenotypic plasticity. Together, our results highlight the interplay between heritable transcriptional programs and the tumor microenvironment in shaping cancer cell fitness and evolutionary dynamics.
These efforts motivate several new directions to further understand evolutionary relationships between cells in complex biological systems: i) While our work here describes a single model of cancer cell extravasation and metastatic outgrowth, other tumor models better capture cancer initiation, progression, therapeutic resistance, and the complete metastatic cascade. As with our previous Cas9-based lineage tracing system–which shares similar core components–we envision using PEtracer to study a range of tumor models as well as development (13–16, 39, 53, 54). ii) Assaying tissues with both imaging and sequencing readouts (116–118) would enable full transcriptomic imputation using phylogenetic neighbors (122–125). iii) Imaging of thicker tissue sections with concomitant computational improvements would better enable 3-D tissue reconstruction (126). iv) Novel lineage reconstruction algorithms may further improve phylogenetic reconstructions (127–130). v) Layering perturbations on top of phylogenetic reconstructions would enable direct interrogation of the drivers of cell state transitions and evolutionary dynamics (131). vi) Integrating emerging PE-based signal recording methodologies (132) would enable decoration of phylogenies with LMs associated with signaling events of interest to assess their timing and impact on fate determination.
More broadly, understanding how cellular histories shape tissue-scale characteristics is a longstanding challenge in biology. The PEtracer system enables direct interrogation of cellular dynamics across diverse biological contexts, complementing descriptive cell atlases (8, 133–142) to understand the drivers of cell fate determination and tissue organization.
Methods and Materials
In silico lineage tracing simulation
Lineage trees and associated tracing data were simulated using the Cassiopeia framework (v2.0.0) (72). A birth-death process was employed to generate lineage trees, where internal nodes represented cell divisions, branch lengths corresponded to individual cell lifetimes, and leaves represented the cells observed at the experiment’s conclusion. The time between cell divisions was modeled using a log-normal distribution (mean = 1, sigma = 0.5) and the cell death rate was set at 0.25. Each simulation was run until the desired population size was reached or all cells died. These simulated trees were subsequently used to generate tracing data under varying conditions.
To evaluate how key experimental parameters influence the accuracy of reconstructed trees, tracing data was simulated while varying the following parameters:
: The number of extant cells at the end of the experiment
: The number of edit sites available for recording information
: The number of unique lineage marks (LMs) that could be installed at each edit site
: The distribution of LM installation probabilities
: The fraction of edit sites with an LM installed at the end of the experiment
: The efficiency of LM detection at each edit site
Simulations began at the tree’s root with unedited sites. Edits occurred stochastically and were inherited according to the tree structure. To achieve the target edit fraction over a total experimental time , edits occurred at a rate given by:
The LM installed during each edit was randomly selected based on the distribution . At the end of the simulation, a fraction of the LMs was randomly dropped out to account for imperfect detection.
Distance-based tree reconstruction
For tree reconstruction with neighbor joining and UPGMA a N-by-N weighted Hamming distance matrix was computed. The distance between two cells and is defined as:
Here, represents the cell by edit site matrix of detected LMs, is the number of edit sites where both cells have an LM detected, and the distance for edit site is defined as:
if matches or either is missing
if or is unedited
otherwise (both edited and not equal)
To efficiently reconstruct large trees, optimized versions of neighbor joining and UPGMA were used (78). These algorithms employ dynamic programming to achieve an runtime and are implemented in the CCPhylo software suite (v0.8.3, https://github.com/genomicepidemiology/ccphylo).
For rooting trees generated by neighbor joining, a synthetic cell with all sites unedited was added. The tree was then rooted using this synthetic cell as the outgroup, after which the synthetic cell was removed.
Greedy tree reconstruction
We also evaluated tree reconstruction using the “vanilla” greedy algorithm from Cassiopeia (72). This algorithm builds a tree from the top down by recursively selecting lineage marks (LMs) that split the cells into progressively smaller groups. LMs are chosen based on their frequency within the group, with lower-frequency LMs favored because they are less likely to exhibit homoplasy (i.e., multiple independent occurrences of the same edit at the same edit site in separate cells). Although the greedy algorithm performed well for phylogenies labeled with the Cas9 indel LM distribution, it underperformed in distance-based tree reconstruction with only eight LMs due to increased homoplasy. Consequently, only the neighbor joining and UPGMA methods were used to reconstruct phylogenies generated with PEtracer LM data. Hybrid approaches, which pair a greedy algorithm for solving the top of the tree with a more accurate method, such as integer linear programming, for solving the bottom of the tree, can perform well but were not evaluated here since they require high-confidence greedy splits (72).
Evaluation of simulated tree reconstruction
Two complementary metrics were used to evaluate the similarity between simulated ground-truth trees and reconstructed trees with UPGMA, neighbor joining, and Greedy tree reconstruction algorithms. The first metric was Robinson-Foulds (RF) distance, which quantifies topological differences by counting the number of node partitions unique to each tree (73). The RF metric was normalized by the total number of partitions, resulting in a range between 0 and 1. As a distance metric, lower values indicate greater similarity.
The second metric was depth-normalized mean triplets correct, which compares trees based on the relative ordering of leaf triplets rather than the set of partitions. To compute this metric, 1,000 triplets were sampled from the ground-truth tree, distributed evenly across lowest common ancestor (LCA) depths to minimize bias introduced by most triplets having an LCA near the tree’s root. The fraction of triplets with conserved ordering in the reconstructed tree was then calculated with values that range from 0 to 1. As a similarity metric, higher values indicate greater similarity.
All simulations were repeated 10 times, with reconstruction accuracies and parameters for each iteration reported tables S1–3. The mean value for each metric is presented with ribbons indicating standard deviation where applicable. In each iteration, the tree topology was kept constant while the simulated tracing data varied based on the specified parameters, except when assessing the effect of tree size on reconstruction accuracy. For values presented in heatmaps, the performance of the best algorithm for the given parameter combination is reported.
LM distributions
The uniformity of LM installation probabilities was quantified using normalized entropy (), calculated with the following formula:
Here, represents the probability of the i-th LM being installed, and is the total number of lineage marks.
In simulations, the LM distributions with a given were generated by adjusting the parameter of an exponential distribution until the target was achieved. Thus, the probability of the i-th LM is given by:
with probabilities normalized to sum to 1. The CRISPR-Cas9 indel distribution used in these simulations was empirically derived from data published by Yang, Jones, et al. 2022 (54).
Edit rate simulations
A simplified simulation framework was used to model the relationship between edit rate and edit saturation over time. In this framework, it was assumed that all cells proliferate uniformly at a rate of one division per day, with no cell death. Instead of modeling all cells, which is computationally expensive because the number of cells grows exponentially, we model 100 cells over the duration of the simulation. Each parameter combination was simulated 10 times, and the mean values were reported.
To determine the edit rates required to achieve 50%, 60%, 70%, 80%, and 90% saturation, simulations were performed for edit rates ranging from 0 to 0.3 edits/day in increments of 0.001. For each experiment length, spanning 1 to 100 days, the edit rate closest to achieving the desired saturation value was identified.
To identify the minimum number of edit sites required for 70%, 80%, and 90% of cell divisions to be marked by an edit as a function of tree size, we simulated experiments from 1 to 30 days in length–corresponding to tree sizes between 21 and 230 cells. For each tree size, the number of edit sites was varied from 1 to 120 (in increments of 1), and edit rates were tested from 0 to 0.3 (in increments of 0.001). The minimum number of edit sites required to achieve the target edit fraction was determined for each tree size using the optimal edit rate for each experiment length. These values represent a lower bound on the required number of edit sites, as the simulations assume uniform division rates, perfect detection of all cells in the tree, and ideal tracing kinetics. Results from these simulations are reported in table S4.
General mammalian cell culture conditions
4T1 cells (American Type Culture Collection (ATCC) CRL-2539) were maintained in RPMI 1640 medium (Thermo Fisher Scientific) supplemented with 10% FBS and 1% Penicillin-Streptomycin-Glutamine (PSG) (Thermo Fisher Scientific). B16-F10 cells (ATCC CRL-6475) were maintained in Dulbecco’s Modified Eagle’s Medium (DMEM; Thermo Fisher Scientific) supplemented with 10% FBS and 1% PSG. HEK293T (ATCC CRL-3216) and HEK293T/17 cells (ATCC CRL-11268) cells were maintained in DMEM supplemented with 10% FBS and 1% PSG. All cells were maintained at 37°C and 5% CO2, authenticated by their suppliers, and tested negative for mycoplasma.
Mice
All mouse experiments described in this study were approved by the Massachusetts Institute of Technology Institutional Animal Care and Use Committee (IACUC, protocol numbers 0421–043-24 and 2311000598). To mitigate rejection of engineered 4T1 PEtracer cells, which are syngeneic in the Balbc/J background and express immunogenic components including PEmax (spCas9(H840A)–M-MLVRT) and mCherry (linked to lineage tracing cassettes), experiments were performed in F1 Balbc/J x SELECTIV (143) (LSL-Aavr-spCas9-mCherry) or Balbc/J x [SELECTIV x CMV-Cre (Aavr-spCas9-mCherry)] hybrids. Balbc/J (JAX #000651), SELECTIV (JAX #037553), and CMV-Cre (JAX #006054) mice were procured from JAX and bred in house. Tumor experiments were performed in female animals between 8 and 12 weeks of age. We performed tail vein lineage tracing experiments in three mice and eleven total tumors to capture biological variability. Statistical tests were not used to determine sample size and investigators were not blinded to experimental groups.
Cloning of tracing cassette and editor constructs
DNA fragments for cloning were either ordered from Integrated DNA Technologies (IDT) or Twist Bioscience as dsDNA fragments or PCRed using Phusion Green Hot Start II High-Fidelity PCR 2x Master Mix (Thermo Fisher Scientific) with the following cycle conditions: 98°C for 2 minutes, [98°C for 10 seconds, Primer specific Tm for 20 seconds, 72°C for 30 seconds/kb of amplified insert]x34, 72°C for 2 minutes. GC rich sequences like universal chromatin opening elements (UCOEs) were amplified using the KOD Xtreme Hot Start polymerase for 34 cycles following manufacturer’s protocols (Sigma Aldrich, 71975-M). All PCR-amplified products were DpnI treated following manufacturer’s protocols (New England Biolabs, NEB). DpnI-treated fragments were run out on either 1% or 2% agarose gels (VWR) supplemented with 1 μL of ethidium bromide (Thermo Fisher Scientific) per 10 mL of 1x Tris-Acetate-EDTA (TAE) buffer. Amplified DNA was extracted from bands excised from the gel using the QIAquick Gel Extraction Kit (Qiagen). Gibson assembly was used to assemble fragments into completed plasmids according to manufacturer’s instructions. Specifically, we use 5 μL of NEBuilder 2x HiFi DNA Assembly master mix (NEB) combined with 0.05pmol backbone and 0.15pmol of each insert fragment brought to a final volume of 10 μL with ultrapure nuclease free water. Assemblies were incubated for 60 minutes at 50°C prior to transformation into NEB stable competent E. coli (C3040, NEB). Transformation included a 15-minute incubation of 50 μL of competent cells and 5 μL of assembly mix prior to a 30 seconds heat shock at 42°C and a subsequent 5-minute incubation on ice before plating on 100 μg/mL carbenicillin-containing agar plates. The Illustra TempliPhi 100 amplification kit (Cytivia) was used to perform rolling circle amplification (RCA) on bacterial colonies prior to Sanger sequencing or Whole Plasmid Sequencing (WPS) (Quintara Biosciences). Plasmid DNA was isolated from colonies with the correct sequence using the PureYield Plasmid Miniprep system (Promega) or with the ZymoPure II plasmid Midiprep system (Zymo Research).
The following tracing cassette, editor constructs, and other plasmids used in this manuscript were assembled using this protocol and are listed here for ease of reference. The sequence features of plasmids are described below. Addgene plasmid numbers are provided in parentheses where appropriate.
Lentiviral PEmax editor that includes 5’-UCOE- EF1a short promoter (EFS)-PEmax-P2A-GFP-KASH-WPRE (Addgene #238541)
Lentiviral lineage tracing cassettes that includes 5’-UCOE- EF1a promoter- mCherry- T7 promoter- T3 promoter- PacI digestion site-183 nucleotide (nt) MERFISH decodable integration barcode (intBC)- Forward primer for sequencing assays-30nt sequencing intBC- PacI digestion site- Edit Site 1 and flanking sequences- Edit Site 2 and flanking sequences- Edit Site 3 and flanking sequences- Reverse primer for sequencing assays- BGH polyA-500nt common sequence (Addgene pooled library #238548)
PiggyBac lineage tracing cassettes that includes 5’-UCOE- mCherry- T7 promoter- T3 promoter- PacI digestion site-183nt MERFISH decodable intBC- Forward primer for sequencing assays- 30nt sequencing intBC- PacI digestion site- Edit Site 1 and flanking sequences- Edit Site 2 and flanking sequences- Edit Site 3 and flanking sequences- Reverse primer for sequencing assays- SV40 polyA-500nt common sequence
Lentiviral puromycin static barcoding construct that includes 5’-UCOE- EF1a promoter- puromycin resistance cassette-10 N random barcode-BGH polyA (Addgene pooled library #238546)
Lentiviral blasticidin static barcoding construct that includes 5’-UCOE- EF1a promoter- blasticidin resistance cassette-10 N random barcode-BGH polyA (Addgene pooled library #238547)
Lentiviral CROP-seq BFP epegRNA expression vector that includes 5’-EF1a promoter- BFP- WPRE- Split 3’ LTR- hU6 promoter- pegRNA- Split 3’ LTR
Three BsmBI-digestible PiggyBac 8-mer pegArray acceptor vectors and one PaqCI- digestible 24-mer pegArray acceptor vector that include 5’-UCOE- EF1a promoter-BFP-WPRE-SV40 polyA-Digestion Cassette for pegArray insertions (ES1 8-mer vector = Addgene plasmid #238542, ES2 8-mer vector = Addgene plasmid #238543, ES3 8-mer vector = Addgene plasmid #238544, 24-mer vector = Addgene plasmid #238545).
Cloning of individual pegRNAs or epegRNAs
Individual pegRNAs and epegRNAs were cloned using Gibson assembly as described above. Briefly, hU6-pegRNA-GG-acceptor backbone (Addgene, #132777) expressing a guide RNA under the human U6 promoter was digested using BsaI-HFv2 (NEB, R3733L) according to the manufacturer’s protocol. Digested backbone gel extracted as described above from a 1% agarose gel. eBlock fragments were ordered from IDT or Twist Bioscience encoding the relevant pegRNA or epegRNA sequences which were resuspended in ultrapure nuclease-free water. Gibson assemblies, transformations, colony picking, and DNA isolation from colonies were carried out as described above.
Testing of individual pegRNAs or epegRNAs
Individual pegRNAs or epegRNAs were evaluated with transfections using low passage HEK293T cells plated into tissue culture-treated (TC-treated) 96-well flat bottom dishes (Corning, CLS3799). Cells were maintained in TC-treated 15cm dishes. For transfections, media was removed, cells were rinsed once with sterile 1x PBS (Thermo Fisher Scientific), and trypsinized with 0.25% trypsin (Gibco). Following a 5-minute incubation at 37°C for 5 minutes to ensure complete trypsinization, cells were resuspended in DMEM + 10% FBS to inactivate the trypsin and carefully triturated prior to cell counting using the Countess II automated cell counter (Thermo Fisher Scientific). Cells were diluted to 18,000 cells/mL and 100 μL of diluted cells were plated into 96-well dishes. Cells were transfected 18 to 24 hours after plating when they were ~60–70% confluent. For each well, transfections were carried out by combining 1) a DNA mixture comprising 200 ng of prime editor expressing plasmid (pCMV-PE2 plasmid = Addgene plasmid #132775; pCMV-PEmax plasmid = Addgene plasmid #174820) and 50 ng of a given pegRNA or epegRNA plasmid resuspended in 5 μL of OptiMEM (Thermo Fisher Scientific) with 2) 0.5 μL of Lipofectamine 2000 (Thermo Fisher Scientific, 11668019) and 4.5 μL of OptiMEM (Thermo Fisher Scientific, 31985062). This mixture of DNA and lipid were mixed and incubated at room temperature for 15 minutes prior to adding dropwise to cells. At 72 hours post-transfection, media was removed, cells were washed with 1x PBS, and genomic DNA (gDNA) was extracted by adding 35 μL of freshly-prepared lysis buffer (10 mM Tris-HCl pH 7.5, 0.05% sodium dodecyl sulfate (SDS, Thermo Fisher Scientific), 25 μg/ml Proteinase K (Thermo Fisher Scientific)) to each well prior to a 1-hour incubation at 37°C. Following this digestion, gDNA was transferred to a 96-well PCR plate (Thermo Fisher Scientific) and incubated on a deep-well PCR machine for 30 minutes at 80°C to inactivate the Proteinase K; we term this isolation protocol the Fast Lysis procedure.
Sequencing and analysis of individual pegRNA and epegRNA editing outcomes
Genomic DNA isolated from HEK293T cells was amplified with two successive rounds of PCR as done previously (66). Briefly, PCR1 was conducted using a pool of forward and reverse primers where each primer was pooled in an equimolar fashion to enable sequencing on an Illumina MiSeq with minimal need for added PhiX. PCR1 was performed using 2 μL of Fast Lysis-isolated genomic DNA from HEK293T cells. Amplifications were carried out using Phusion Green Hot Start II High-Fidelity PCR 2x Master Mix with the following cycle conditions for the EMX1, HEK3, and RNF2 endogenous genomic loci: 98°C for 2 minutes, [98°C for 10 seconds, 65°C for 20 seconds, 72°C for 1 minute]x30, 72°C for 2 minutes. PCR1 amplicons were all checked on 2% agarose (VWR) gels with 1 μL of ethidium bromide (Thermo Fisher Scientific) per 10 mL of 1x Tris-Acetate-EDTA (TAE) buffer. PCR2 amplification was carried out using TruSeq adapter PCR primers (listed in table S28) with the following conditions: 98°C for 2 minutes, [98°C for 10 seconds, 61°C for 20 seconds, 72°C for 1 minute]x8, 72°C for 2 minutes. 2 μL of PCR1 were used as template DNA for each PCR2 reaction and 500 nM final primer concentrations were used. Following PCR2, all reactions for a given genomic locus were pooled together, run on a 2% agarose gel, and gel extracted using the QIAquick gel extraction kit (Qiagen). Concentrations of purified libraries were determined using a Qubit double-stranded DNA high sensitivity Kit (Thermo Fisher Scientific) according to the manufacturer’s instructions. Libraries were diluted to 4 nM and sequenced on a MiSeq (Illumina) using an Illumina MiSeq v2 Reagent kit.
Samples were demultiplexed with the MiSeq Reporter (Illumina). CRISPResso2 (v2.2.6) (144) was used to analyze demultiplexed reads. For all prime editing yield quantification, CRISPRessoBatch was run with default settings using the unedited allele as the reference amplicon and desired allele as Amplicon1 (-a parameter) and a 10nt quantification window on each side of the cleavage position for a total length of 20nt (-w parameter). Editing yield was calculated as: (number of Amplicon1-aligned reads)/(total reads aligned to all amplicons). For all experiments, indel yields were calculated as: (number of indel-containing reads)/(total reads aligned to all amplicons).
HEK Site 3 Genomic Locus Forward Primer Pool:
ACACTCTTTCCCTACACGACGCTCTTCCGATCT-[1N to 4N]- ATGTGGGCTGCCTAGAAAGG
HEK Site 3 Genomic Locus Reverse Primer:
TGGAGTTCAGACGTGTGCTCTTCCGATCTCCCAGCCAAACTTGTCAACC
RNF2 Genomic Locus Forward Primer Pool:
ACACTCTTTCCCTACACGACGCTCTTCCGATCT-[1N to 4N]- ACGTCTCATATGCCCCTTGG
RNF2 Genomic Locus Reverse Primer:
TGGAGTTCAGACGTGTGCTCTTCCGATCTACGTAGGAATTTTGGTGGGACA
EMX1 Genomic Locus Forward Primer Pool:
ACACTCTTTCCCTACACGACGCTCTTCCGATCT-[1N to 4N]- CAGCTCAGCCTGAGTGTTGA
EMX1 Genomic Locus Reverse Primer:
TGGAGTTCAGACGTGTGCTCTTCCGATCTCTCGTGGGTTTGTGGTTGC
Cloning of a comprehensive 1,024 5 nucleotide (nt) LM epegRNA libraries
Gibson assemblies between PCR amplified epegRNA backbones and short oligonucleotide libraries ordered from IDT were used to generate libraries representing all possible 5nt LM libraries for each of the three target sequences. First, epegRNA backbones with optimized primer binding sequence (PBS) and reverse transcription template (RTT) lengths were amplified as described in “Cloning of tracing cassette and editor constructs” using Phusion Green Hot Start II High-Fidelity PCR Master Mix with the primers listed below and 34 cycles of amplification. Two independent 5N LM libraries were ordered from IDT for each site, a top and bottom strand library. Libraries were diluted to 1 μM and 1 μL was used for each of 10 parallel Gibson assemblies with 0.05 pmol of amplified backbone for their respective target loci. All 100 μL of Gibson assembly product were cleaned up using PCR cleanup columns (Qiagen) to remove salts prior to electroporation into MegaX cells as previously described (145). Colonies were scraped into a combined pool, grown out for 8 hours in 100 mL terrific broth (TB) containing 100 μg/mL carbenicillin, and midi-prepped using the ZymoPure II Plasmid Midiprep Kit (Zymo Research) following manufacturer’s procedures.
Forward primer for all: CGCGGTTCTATCTAGTTACGCGTTAAAC
EMX1 reverse primer: CTCCCATCACATGCACCGAC
HEK3 reverse primer: TGATGGCAGAGGAAAGGAAGCC
RNF2 reverse primer: CTGAGGTGTTCGTTGCACCG
EMX1 5N library top:
GTCGGTGCATGTGATGGGAGNNNNNTTCTTCTGCTCGGACGCGGTTCTATCTAGTTACG
EMX1 5N library bottom:
CGTAACTAGATAGAACCGCGTCCGAGCAGAAGAANNNNNCTCCCATCACATGCACCGAC
HEK3 5N library top:
CTTCCTTTCCTCTGCCATCANNNNNCGTGCTCAGTCTGCGCGGTTCTATCTAGTTACG
HEK3 5N library bottom:
CGTAACTAGATAGAACCGCGCAGACTGAGCACGNNNNNTGATGGCAGAGGAAAGGAAG
RNF2 5N library top:
CGGTGCAACGAACACCTCAGNNNNNGTAATGACTAAGATGCGCGGTTCTATCTAGTTACG
RNF2 5N library bottom:
CGTAACTAGATAGAACCGCGCATCTTAGTCATTACNNNNNCTGAGGTGTTCGTTGCACCG
Testing of 1,024 LM epegRNA libraries
Testing of 1,024 LM epegRNA libraries for three test sites was carried out with transfections using low passage HEK293T cells plated into TC-treated 6-well flat bottom dishes (Corning, 353046). Cell maintenance and splitting via trypsinization were conducted identical to 96-well format transfections. Following cell counting, cells were resuspended to 400,000 cells/mL and 1 mL of cells was added to each well of a 6-well TC-treated plate to which an additional 1 mL of media was added to avoid evaporation. 18 to 24 hours after plating, three wells were transfected for each 1,024 LM library to ensure comprehensive coverage of the entire library and biological consistency. Each library was transfected as before by combining a mixture of DNA with a lipid mixture prior to a 10-minute incubation at room temperature prior to adding dropwise to 6-well plates. For 6-well formats DNA mixtures comprise 6 μg of PE2 plasmid and 1 μg of 1,024 LM library were resuspended in a final volume of 25 μL of OptiMEM while lipid mixtures comprise 15 μL of Lipofectamine 2000 and 10 μL of OptiMEM. At 72 hours post-transfection, media was removed, cells were washed with 1x PBS, and trypsinized as above to isolate cell pellets. Genomic DNA was extracted using the DNeasy blood and tissue kit (Qiagen) following manufacturer’s protocols.
Sequencing and analysis of 1,024 LM epegRNA libraries
Plasmid DNA sequences were verified to ensure coverage of all edits 1,024 LMs across the library for each of the three loci. Two successive rounds of PCR were used to generate plasmid libraries using primers listed below. For plasmid libraries, PCR1 was performed using Phusion Green Hot Start II High-Fidelity PCR 2x Master Mix with the following cycle conditions: 98°C for 2 minutes, [98°C for 10 seconds, 65°C for 20 seconds, 72°C for 1 minute] x the number of cycles was empirically determined for each sample by qPCR, 72°C for 2 minutes. 10 reactions for each plasmid library where 500 ng of the plasmid library was used for each 25 μL amplification reaction along with 500 nM primer mix.
epegRNA Plasmid Library PCR1 Forward primer pool:
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-[1N to 4N]- GAAAAAGTGGCACCGAGTCGG
epegRNA Plasmid Library PCR1 Reverse primer pool:
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-[1N to 4N]- TTTGTGATGCTCGTCAGGGG
For genomic DNA, 48 25 μL PCR1 reactions with 500 ng of gDNA and 500 nM primer mix were run to ensure broad coverage across all samples. Genomic loci primers are the same as used previously for these amplifications (“Sequencing and analysis of individual pegRNA and epegRNA editing outcomes”). Amplifications were performed using Phusion Green Hot Start II High-Fidelity PCR 2x Master Mix with the following cycle conditions: 98°C for 2 minutes, [98°C for 10 seconds, 65°C for 20 seconds, 72°C for 1 minute]x30, 72°C for 2 minutes. As before, longer extension times were used to ensure complete extension for each cycle. PCR1 amplicons were all checked on 2% agarose as above.
PCR2 amplification for both plasmid libraries and genomic DNA from transfected cells was carried out using Nextera adapter PCR primers (listed in table S29) with the following conditions: 98°C for 2 minutes, [98°C for 10 seconds, 61°C for 20s, 72°C for 1 minute]x8, 72°C for 2 minutes. 5 μL of PCR1 products were used for each PCR2 reaction and each of the reactions was independently barcoded in PCR2. Following PCR2, all reactions for each genomic DNA or plasmid sample were pooled together, run on a 2% agarose gel, and gel extracted prior to quantification, library generation, and sequencing as described above (“Sequencing and analysis of individual pegRNA and epegRNA editing outcomes”).
CRISPResso2 (v2.2.6) was used to analyze demultiplexed reads. For both input plasmid libraries and genomic DNA from edited cells, CRISPRessoBatch was run with default settings using the unedited allele as the reference amplicon and desired allele with a 5 N insertion as Amplicon1 (-a parameter) and a 10nt quantification window on each side of the cleavage position for a total length of 20nt (-w parameter). Frequencies of aligned reads were extracted from the output Alleles_frequency_table files. Aligned reads were trimmed to 25nt flanking the cleavage position (−10nt and +10nt from the cleavage position to capture 5nt insertions +10nt of additional flanking sequence for edited alleles) and counts for reads with expected 5 nt insertions were quantified. On average, 52% (range 43–61%) of aligned reads from each gDNA sample matched an expected 5nt insertion and 43% (range 30–55%) of reads matched the unedited sequence, with only 4.6% (range 1.7–8.2%) of reads representing unintended alleles (i.e. indels or sequencing/PCR errors). For the input plasmid libraries, on average 88% (range 79–93%) of reads matched the expected epegRNA sequence with a 5nt insertion. For subsequent analysis, only aligned reads matched an expected 5nt insertion were considered. Normalized counts per million values were log2 transformed, replicates averaged, and used to calculate the fold change for each 5nt insertion sequence in the edited gDNA relative to the input plasmid library. 5nt insertion sequences were ranked by log2 fold change (see table S7) to nominate the top-performing LMs at each test site.
Cloning and testing of top-performing LMs at 3 test sites
The top 94, 96, and 85 LMs for the HEK3, EMX1, and RNF2 loci were tested individually. epegRNAs for each of these LMs were cloned, plasmid DNA was isolated, transfected, sequenced and analyzed as described above (“Testing of individual pegRNAs or epegRNAs” and “Sequencing and analysis of individual pegRNA and epegRNA editing outcomes”).
Generation and testing of orthogonal edit site sequences
To make edit site sequences orthogonal to the human genome as well as the genomes of other widely used model organisms, we reasoned that by making base changes in the 3nt seed region of our edit site sequences that we could preserve the primer binding site (PBS) sequence for each edit site while largely maintaining the base identity and annealing properties of the RTT to generate novel edit site sequences that are not directly present in various genomes of interest. We generated two orthogonal sequence variants per edit site. We designed new protospacer sequences where we used the opposite Watson-Crick-Franklin pairing DNA base at each position from the endogenous human locus for the 3nt genomic seed for Ortho v1 (i.e. if the edit site sequence were 5’-GCT-3’ we would use 5’-CGA-3’), and used the reverse complement of the 3nt seed for Ortho v2 (i.e. if the sequence were 5’-GCT-3’ we would use 5’-AGC-3’).
Sequencing and analysis of orthogonal edit site sequences
To experimentally evaluate our edit site orthogonalization strategies, we generated synthetic cassettes that harbored each of the RNF2, HEK3, and EMX1 edit site sequences and necessary flanking sequences to support efficient prime editing for the original, unmodified genomic edit site sequences and for each of the two new orthogonalized edit sites. These edit sites were ordered as e-blocks from IDT and cloned using Gibson assembly into lineage tracing cassette (LTC) vectors. This vector introduces edit site sequences into the 3’ UTR of an mCherry reporter gene. These alleles were integrated into HEK293T cell genomes and cells harboring these constructs were isolated by sorting for mCherry+ cells using fluorescence activated cell sorting (FACS) two weeks after integration into the genome (representative sorting gates shown in fig. S16). Cells were subsequently transfected with epegRNAs targeting the corresponding sequences, cloned as described above (“Cloning of individual pegRNAs or epegRNAs”).
Genomic DNA was isolated using the Fast Lysis procedure described for sequencing HEK293T genomic loci 72 hours post-transfection. Two successive rounds of PCR were used to amplify both synthetic edit sites (PCR1 primers listed below) as well as endogenous genomic loci as described above (PCR1 primers listed in “Sequencing and analysis of individual pegRNA and epegRNA editing outcomes”) with 2 μL of gDNA and 500 nM primer mix using the Phusion Green Hot Start II High-Fidelity PCR 2x Master Mix with the following cycle conditions: 98°C for 2 minutes, [98°C for 10 seconds, 65°C for 20 seconds, 72°C for 30 seconds]x28, 72°C for 2 minutes. PCR1 amplicons were all checked on 2% agarose as above. PCR2 amplification was carried out using Nextera adapter PCR primers with the following conditions: 98°C for 2 minutes, [98°C for 10 seconds, 61°C for 20s, 72°C for 30 seconds]x8, 72°C for 2 minutes. 5 μL of PCR1 products were used for each PCR2 reaction and each of the reactions was independently barcoded in PCR2. All steps through sequencing using the Illumina Miseq v2 Reagent kit were performed as above (“Sequencing and analysis of individual pegRNA and epegRNA editing outcomes”). Experiments to re-test optimal LMs for orthogonalized edit site sequences were carried out using this same protocol.
LTC Forward primer pool: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG- [1N to 4N]- GAATCCAGCTAGCTGTGCAGC
LTC Reverse primer pool: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-[1N to 4N]- CCTTAGCCGCTAATAGGTGAGC
CRISPResso2 (v2.2.6) was used to analyze demultiplexed reads. For all prime editing yield quantification, CRISPRessoBatch was run with default settings using the unedited allele as the reference amplicon and desired allele as Amplicon1 (-a parameter) and a 10nt quantification window on each side of the cleavage position for a total length of 20nt (-w parameter). Editing yield was calculated as: (number of Amplicon1-aligned reads)/(total reads aligned to all amplicons). For all experiments, indel yields were calculated as: (number of indel-containing reads)/(total reads aligned to all amplicons).
In silico characterization of orthogonalized edit site sequences
To further ensure compatibility of Ortho v1 edit site sequences for use in widely used model systems, we used the Cas-OFFinder in silico prediction algorithm (86) (v2.4.1) to identify potential CRISPR/Cas9-dependent off-target loci with three or fewer mismatches or bulges to our three OrthoV1 protospacers of interest. Cas-OFFfinder nominated 69, 174, 13, and 4 Cas9-dependent off-targets with 3 or fewer mismatches or bulges (MMB) for H. sapiens, M. musculus, D. rerio, and D. melanogaster genomes, respectively. There were no predicted genomic off-targets with 1 mismatch in any of the analyzed genomes (fig. S2D). All predicted 2 mismatch off-targets were within intergenic or intronic regions of the genome with the exception of one in the promoter of the apoea gene in the D. rerio genome. Among the 251 putative genomic off-targets with 3 mismatches to our sgRNAs, only 4 were within exons of genes– 2 each in the human and mouse genomes– and 14 were in other regulatory genomic elements. When each of these 18 putative off-target loci were examined, we did not identify homologous sequences within 100nt downstream of the protospacer likely to support prime editing.
LM detection cross hybridization calculations
To select lineage marks suitable for accurate hybridization probe readout, we used NUPACK (v4.0.1.9) (146) simulations to evaluate off-target binding for various LM sequence combinations. Each simulation involved hybridizing an equimolar mixture of DNA readout sequences (20nt complementary region) with an equimolar mixture of DNA edit site sequences containing the selected LMs. Simulations were conducted using a 100-fold excess of readout sequences (10 nM vs. 0.1 nM) at 43°C in a sodium ion concentration of 0.3 M. For each LM sequence, the free energy difference (ΔG) between correct and incorrect probe binding was estimated. Cross-hybridization was quantified as the fraction of edit site molecules bound by incorrect readout sequences. Separate simulations were performed for the Ortho v1 Edit Site 1, Edit Site 2, and Edit Site 3 sequences. Results from these simulations are reported in table S8.
To investigate the effect of LM insert length on probe cross-hybridization, we simulated mixtures of edit site sequences containing random LM inserts of varying lengths (2–8nt). For each LM length, simulations were repeated 10 times with each iteration using 8 random LM sequences of the specified length plus the unedited sequence while keeping the total complementarity length and other hybridization parameters constant. The mean fraction of correct probes binding each edit site (1 – cross-hybridization fraction) is shown in fig. S1J and results for each iteration are reported in table S4.
Design and cloning of lineage tracing cassette libraries
With three edit site sequences selected, we finalized our lineage tracing cassette design. These cassettes consist of the following sequence organization: 5’-UCOE- EF1a promoter- mCherry- T7 promoter- T3 promoter- PacI digestion site- 30nt sequencing barcodes- sequencing primer- MERFISH decodable intBC- Forward primer for sequencing assays- sequencing intBC- PacI digestion site- Edit Site 1 and flanking sequences- Edit Site 2 and flanking sequences- Edit Site 3 and flanking sequences- Reverse primer for sequencing assays- polyA. For lentiviral construct designs, we included 500nt of constant sequence downstream for optimizing imaging readout experiments and used a BGH polyA sequence. For PiggyBac transposase-compatible vectors, we introduced this 500nt constant sequence before an SV40 polyA.
For testing of 24-mer pegArrays and epegRNA mismatch libraries, we generated a PiggyBac lineage tracing cassette plasmid library with a random 10nt sequencing barcode (LTCv0).
For in vitro single cell lineage tracing experiments, we generated a lentiviral lineage tracing cassette plasmid library with 125nt MERFISH barcodes with an additional 30nt sequencing barcode (LTCv1). Specifically, 125nt MERFISH barcodes were generated by choosing 6 among 21 of previously published MERFISH readout sequences (147) (B37-B66, 20nt for each) and concatenating them together with an “A” between each pair of readout sequences. 30nt sequencing barcodes were randomly generated.
For the in vitro imaging experiments and all in vivo experiments, we generated a lentiviral lineage tracing cassette plasmid library with matching 30nt sequencing barcodes and 183nt MERFISH barcodes (LTCv2). To design these orthogonal barcodes, we first randomly selected 9 of the 25nt orthogonal DNA among 240k published DNA barcodes (148), concatenated them in an randomly permuted order and selected the best permutation by checking if all 15mer are unique to this barcode. We repeated this process to select up to 6,000 candidate barcodes. We then predicted the self-binding partition function by “pfunc” from Nupack (146) and selected the candidate barcodes with highest free energy to minimize self binding. After filtering, we inserted the sequencing primer at position 30 of candidate barcodes and designated the first 30nt as sequencing barcodes, and then selected the next 183nt as MERFISH barcodes. We concatenated the 5’ sequence (T7, T3 and PacI site) and 56nt of 3’ sequence (Edit site 1) together as candidate intBCs for cross-hybridization testing. We first partitioned each 183nt MERFISH barcode into 6 non-overlapping 30nt segments, and used the reverse complement of these segments as candidate probes. To evaluate probe cross-hybridization, we enumerated the pairs of candidate intBCs, generated a Nupack Tube class with the candidate probes and these two candidate intBCs, and predicted the free energy and partition function for all possible bindings. After prediction, we summarized probe cross-hybridization for all candidate intBCs and retained intBCs with less than 1% of predicted average probe off-targeting probability. For barcodes that passed these requirements, we further filtered to have no 17nt matching to human, mouse, zebrafish and drosophila genomes.
As such, we ordered libraries of 2,171 intBCs (table S6) from Twist Bioscience and amplified this pooled library using Phusion Green Hot Start II High-Fidelity PCR 2x Master Mix with the following cycle conditions: 98°C for 2 minutes, [98°C for 10 seconds, 65°C for 20 seconds, 72°C for 1 minute] where the number of cycles for amplification were determined by quantitative PCR, followed by 72°C for 2 minutes. Given that libraries can be provided with variable quality, this qPCR determination was important for ensuring library quality. Following library amplification, samples were cleaned up with a QIAquick PCR purification column (Qiagen) according to manufacturer’s protocols. These amplified DNA fragments were assembled into PacI-digested and Quick CIP-treated (NEB) acceptor backbone via Gibson assembly using NEBuilder 2x HiFi DNA Assembly master mix as large scale 100 μL reactions described above. The entire 100 μL of Gibson assembly product was cleaned up using PCR cleanup columns (Qiagen) to remove salts prior to electroporation into MegaX cells as previously described. Colonies were scraped into a combined pool, grown out for 8 hours in 500 mL terrific broth (TB) containing 100 μg/mL carbenicillin, and midi-prepped using the ZymoPure II Plasmid Midiprep Kit (Zymo Research) following manufacturer’s procedures.
LTCv2 Cloning Forward primer: AATTAACCCTCACTAAAGGGATAATTTAATTAA
LTCv2 Cloning Reverse primer: GTCGTAATGACTAAGATGACTGCCATTAATTAA
Sequence validation and analysis of lineage tracing cassette libraries
Lentiviral plasmid libraries were sequence confirmed as described for the 1,024 epegRNA libraries with modest differences. As before, plasmid libraries were amplified with two successive rounds of PCR as done to sequence genomic loci. PCR1 was performed using Phusion Green Hot Start II High-Fidelity PCR 2x Master Mix with the following cycle conditions: 98°C for 2 minutes, [98°C for 10 seconds, 65°C for 20 seconds, 72°C for 1 minute] where the cycle number was determined by qPCR, followed by 72°C for 2 minutes, and a hold at 12°C. PCR2, gel extraction, library preparation, and sequencing were carried out as before.
intBC Sequencing Validation Forward primer:
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-[1N-4N]- GTAATTAACCCTCACTAAAGG
intBC Sequencing Validation Reverse primer:
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-[1N-4N]- CATCGATACCTAATACGACTCACTATAGGGAGAG
To assess the balance of library elements, we used the FastX toolkit (v0.0.14) to mask low quality bases (<q20) and collapse reads. We removed common flanking sequences and summed reads matching the whitelist of intBC barcodes for sequences with Hamming distance <4 to account for sequencing errors and the length of intBC sequences.
Plasmid design and cloning strategy for 8-mer and 24-mer pegArrays
To clone 24-mer arrays comprising an 8-mer or each of the edit sites, we designed a two-step Golden Gate assembly cloning strategy. For the first step, 8-mer arrays are cloned into three distinct acceptor backbones, one for each edit site where the first cloning step is carried out with the Type IIS restriction enzyme BsmBI (NEB, R0739L). These backbones can then be used for a second cloning step using the Type IIS restriction enzyme PaqCI (NEB, R0745L) for the final 24-mer assembly. The overhangs for the 8-mer assembly are the same for each of 8-mer backbones. The backbone overhangs are 5’-GGAG-3’ and 5’-CTAA-3’. The intervening 8 epegRNA inserts use the following overhangs: 5’-GGAG-INS 1-TGCC-3’, 5’-TGCC-INS 2-GCAA-3’, 5’-GCAA-INS 3-ACTA-3’, 5’-ACTA-INS 4-TTAC-3’, 5’-TTAC-INS 5-CAGA-3’, 5’-CAGA-INS 6-TGTG-3’, 5’-TGTG-INS 7-GAGC-3’, 5’-GAGC-INS 8-CTAA-3’. Fragments with these overhangs were ordered from either IDT or Twist Bioscience using table S30 as a design guide. The final 24-mer array has the three 8-mers in the following order to optimize for LM balance: Edit Site 2, Edit Site 3, then Edit site 1. To generate 24-mers, sequence verified 8-mer arrays were digested with PaqCI to generate 8-mer array fragments with the following overhangs: 5’-GGAG- [Edit Site 2 8-mers]-TGTG-3’, 5’-TGTG-[Edit Site 3 8-mers]-GAGC-3’, 5’-GAGC-[Edit Site 1 8-mers]-CTAA-3’. These fragments could then be cloned into a 24-mer acceptor backbone digested with PaqCI. All golden gate acceptor plasmids are provided on Addgene.
Cloning protocol for 8-mer pegArrays
Three different acceptor backbones were generated for 8-mer assemblies, one for each of the three edit sites. These acceptor backbones include an EF1a promoter driving a BFP reporter gene and have BsmBI type-IIS restriction cut sites. DNA fragments were ordered from Twist Bioscience that include a U6 Pol III promoter upstream of epegRNAs for each of the 8 LMs for a given edit site. Golden gate assemblies were performed using 3 nM of each U6–epegRNA fragment (typically 1 μL of each ~500nt fragment diluted to 20 ng/μL), 1.1 nM of the relevant backbone (typically 1 μL of the ~8200nt BsmBI pre-digested backbone diluted to 25 ng/μL), 2 μL of NEBridge Golden Gate Assembly Kit (BsmBI-v2) (NEB, E1602L), 2 μL T4 ligase buffer, and 7 μL of nuclease free water. Assemblies were carried out following manufacturer’s recommendations: 99 cycles of 5 minutes at 42°C and 5 minutes at 18°C before a 10-minute 60°C incubation and subsequent 4°C hold. These reactions were transformed into NEB Stable Competent E. coli (NEB, C3040), incubating on ice for 20 minutes prior to a 30 second heat shock at 42°C. After heat shock, bacteria were incubated on ice for 5 minutes prior to plating on 100 mg/mL carbenicillin resistant agar plates. Plates were incubated overnight at 30°C prior to picking colonies into 5 mL of 100 mg/mL carbenicillin-containing terrific broth, outgrowth overnight at 30°C, and DNA isolation using the Promega PureYield DNA Miniprep System. Some bacteria were expanded into 100 mL cultures overnight while isolated plasmids were diluted to 100 ng/μL and submitted for whole plasmid sequencing. DNA was isolated from these larger cultures using the ZymoPure II Plasmid Midiprep Kit. Plasmids were sequence confirmed by whole plasmid sequencing and further validated to have the correct insert by gel electrophoresis to be the correct size following digestion with PaqCI.
Cloning protocol for 24-mer pegArrays
24-mer assemblies require four pieces of DNA that include a PaqCI digested 24-mer backbone and a PaqCI digested fragment for each edit site 8-mer. PaqCI digestions were performed overnight using 4 μL of CutSmart buffer, 4 μL of PaqCI activator, 2 μL of PaqCI, and 30 μL of the plasmid of interest. These digested fragments were gel extracted using the QIAquick Gel Extraction Kit (Qiagen) following manufacturer’s recommendations. Digested fragments were assembled into 30 μL PaqCI golden gate assemblies with 3.33 nM of the PaqCI digested 8-mer fragments (typically 1 μL of the ~4000nt fragment diluted to 150 ng/μL), 1 μL of PaqCI pre-digested 24-mer acceptor backbone diluted to 25 ng/μL, 2.5 μL PaqCI enzyme, 1.25 μL PaqCI activator, 10 μL NEBridge Ligase Master Mix (M1100L), and water to bring the reaction to 30 μL total volume. Assemblies were carried out following manufacturer’s recommendations, carrying out 99 cycles of 5 minutes at 37°C and 5 minutes at 16°C before a 10-minute 60°C incubation and subsequent 4°C hold. All subsequent steps of transformation, plating, colony picking, and DNA prep and sequence confirmation were carried out as described for 8-mer assemblies above.
Testing of 8-mer and 24-mer pegArrays
We introduced PEmax editor and lineage tracing cassettes (LTCv0) into B16-F10 cells using sequential PiggyBac transposition. 1,000,000 cells were nucleofected with 1.5 μg of transfer plasmid and 150 ng of Super PiggyBac Transposase Expression Vector (System Biosciences PB210PA-1) using the SF Cell Line 4D-Nucleofector Kit L (Lonza V4XC-3024) with program CM-130 and sorted for GFP+ (PEmax+) cells and GFP+/mCherry+ (PEmax+/lineage tracing cassettes+) at least 8 days after nucleofection to ensure stable expression. We then introduced 8-mer and 24-mer pegArray variants into GFP+/mCherry+ cells using the same nucleofection conditions and sorted GFP+/mCherry+/BFP+ (PEmax+/lineage tracing cassettes+/pegArray+) at least 8 days after nucleofection. Representative sorting gates shown in fig. S16. Genomic DNA was isolated using QuickExtract DNA Extraction Solution (Biosearch Technologies, QE09050) following the manufacturer’s protocol.
Sequencing and analysis of 8-mer and 24-mer pegArrays
To assess pegRNA competition, lineage tracing cassettes were amplified and sequenced to quantify lineage mark installation efficiency. Two successive rounds of PCR were used to amplify lineage tracing cassettes with 2 μL of gDNA and 500 nM primer mix using the Phusion Green Hot Start II High-Fidelity PCR 2x Master Mix with the following cycle conditions: 98°C for 2 minutes, [98°C for 10 seconds, 65°C for 20 seconds, 72°C for 30 seconds]x28, 72°C for 2 minutes. PCR1 amplicons were all checked on 2% agarose as above. PCR2 amplification was carried out using Nextera adapter PCR primers with the following conditions: 98°C for 2 minutes, [98°C for 10 seconds, 61°C for 20 seconds, 72°C for 30 seconds]x8, 72°C for 2 minutes. 5 μL of PCR1 products were used for each PCR2 reaction and each of the reactions was independently barcoded in PCR2. Following PCR2, all reactions for each genomic DNA or plasmid sample were pooled together, run on a 2% agarose gel, and gel extracted prior to quantification, library generation, and sequencing as described above.
LTC Forward primer pool: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-[1N to 4N]- GAATCCAGCTAGCTGTGCAGC
LTC Reverse primer pool: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-[1N to 4N]- CCTTAGCCGCTAATAGGTGAGC
CRISPResso2 (v2.2.6) was used to analyze demultiplexed reads. CRISPResso was run with default settings using the unedited allele as the reference amplicon and desired allele with a 5N insertion as Amplicon1 (-a parameter) and a 10nt quantification window on each side of the cleavage position for a total length of 20nt (-w parameter). Frequencies of aligned reads were extracted from the output Alleles_frequency_table files. Aligned reads were trimmed to 25nt flanking the cleavage position (−10nt and +10nt from the cleavage position to capture 5nt insertions +10nt of additional flanking sequence for edited alleles) and counts for reads with expected 5nt insertions were quantified. For subsequent analysis, only aligned reads matched an expected 5nt insertion were considered to calculate the relative editing efficiency for each pegRNA. Read counts for each 5nt insertion and array are reported in table S9.
Cloning of a comprehensive epegRNA mismatch library
We designed a library of epegRNA protospacer mismatches that includes all possible single letter variants at positions 2–20 of an SpCas9 protospacer (where the NGG PAM is 21–23). We preserved a guanine (G) at position 1 to ensure efficient transcription by the human U6 promoter. Inspired by the CROP-seq vector (88), we cloned this epegRNA library into the 3’ UTR of a lentiviral genome (modified from Addgene plasmid #86708 to replace puromycin with BFP and introduce BsmBI epegRNA cloning site). This vector ensures that the sequence of each epegRNA could be captured in droplet-based single cell RNA sequencing experiments while producing an active copy of the epegRNA through the genome integration process. epegRNA protospacer mismatches were either synthesized and cloned by Twist Bioscience or ordered as eBlock fragments from IDT and cloned into the BsmBI-digested CROP-seq vector backbone using Gibson assembly. All protospacer mismatches are listed in table S10.
Lentiviral production
Lentiviral particles were generated as previously described (149). Briefly, HEK293T/17 cells (ATCC CRL-11268) were maintained in DMEM (Thermo Fisher Scientific) supplemented with 10% (v/v) FBS and 1% (v/v) PSG at 37°C with 5% CO2. Cells were split 1:8 two days prior to use to ensure rapid cycling. 23,000,000 cells were plated per 15cm dish on the day prior to transfection. The next day, cells were ~90% confluent and were transfected with the following transfection mix: 200 μL FuGENE HD transfection reagent and a DNA mixture with 1.3 pmol of psPAX2 (Addgene plasmid #12260), 0.72 pmol pMD2.G (Addgene plasmid #12259), and 1.64 pmol the transfer plasmid. DNA and Fugene mix was resuspended in a final volume of 3 mL of OptiMEM serum free media. This mixture was vortexed gently and incubated for 15 minutes at room temperature before adding dropwise to cells. 6 hours post-transfection, transfection media was removed and replaced with Opti-MEM I Reduced Serum Medium (OPTI-MEM) with GlutaMAX Supplement (Invitrogen, 31985088) supplemented with 5%FCS, 1 mM sodium pyruvate (Fisher Scientific), and 1x MEM nonessential amino acids (FisherScientific) (cOPTI-MEM) supplemented with 1x ViralBoost reagent (AlStem). 48 hours post-transfection, media was collected, spun at 3000g for 15 minutes to remove cell debris and concentrated using Lenti-X-Concentrator (TakaraBio, 631232) following the manufacturer’s instructions. Concentrated virus was resuspended in OPTI-MEM in 1% of the original culture volume without supplements. Lentiviral particles were subsequently aliquoted and frozen at−80°C.
epegRNA mismatch library profiling to tune lineage tracing kinetics
We introduced lineage tracing cassettes where each of the three edit sites were concatenated together (LTCv0) using PiggyBac transposition into two mismatch repair-competent cell lines, B16-F10 (a melanoma-derived line; ATCC CRL-6475) and 4T1 (a breast cancer-derived line; ATCC CRL-2539). We also introduced the PEmax editor using PiggyBac transposition and sorted for cells that were both GFP+ and mCherry+ to select for editor and tracing cassette containing cells, respectively. For PiggyBac transposition, one million cells were nucleofected with 1.5 μg of transfer plasmid and 150 ng of Super PiggyBac Transposase Expression Vector (System Biosciences, PB210PA-1). We used the SF Cell Line 4D-Nucleofector Kit L (Lonza V4XC-3024) with program CM-130 for B16-F10 cells and the SE Cell Line 4D-Nucleofector Kit L (Lonza V4XC-1024) with program CM-150 for 4T1 cells. We then treated GFP+/mCherry+ cells with lentivirus expressing the CROP-seq BFP epegRNA library, sorted BFP+ cells, and passaged these cells for 28 days, collecting samples throughout the duration of the experiment. Representative sorting gates shown in fig. S16.
Single cell RNA sequencing (scRNA-seq) was performed on harvested cells to link epegRNA mismatches with lineage tracing editing rates. scRNA-seq libraries were prepared using the 10X Genomics Chromium Next GEM Single Cell 3ʹ Kit v3.1 following the manufacturer’s instructions followed by targeted enrichment of CROP-seq epegRNA and lineage tracing cassette transcripts based on published protocols (39, 53, 150). To increase diversity in Read2 during sequencing, we used a pool of primers with an 8nt stagger following the Illumina Read2 primer binding site.
Two successive rounds of PCR were used to prepare lineage tracing cassette libraries from amplified full-length cDNA. For the first PCR reaction (PCR1), 2 μL of amplified cDNA and 600 nM primer mix listed below (LTC_10X_PCR1_0–8_F, 10X_PCR1_R) were used. Amplifications were performed using KAPA HiFi HotStart ReadyMix with the following cycle conditions: 95°C for 3 minutes, [98°C for 20s, 65°C for 15s, 72°C for 15s]x19–25, 72°C for 1 minute. PCR2 amplification was carried out using 4 μL of a 1:25 dilution of PCR1 product and 300 nM of the reverse primer (10X_PCR2_R) listed below and Nextera i7 adapter PCR primers (listed in table S29) with the following conditions: 95°C for 3 minutes, [98°C for 20s, 72°C for 30 seconds]x12, 72°C for 1 minute.
scRNA-seq Lineage tracing cassette amplification primers:
PCR1 forward (LTC_10X_PCR1_0–8_F): GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG -[1N to 8N]- GAATCCAGCTAGCTGTGCAGC
PCR1 reverse (10X_PCR1_R): ACACTCTTTCCCTACACGACG
PCR2 forward (Nextera i7): CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTCTCGTGGGCTCGG
PCR2 reverse (10X_PCR2_R): AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTC
Three successive rounds of PCR were used to prepare epegRNA libraries from amplified full-length cDNA. For the first PCR reaction (PCR1), 2 μL of amplified cDNA and 300 nM primer mix listed below (pegRNA_10X_PCR1_F, pegRNA_10X_PCR1_R) were used. Amplifications were performed using KAPA HiFi HotStart ReadyMix with the following cycle conditions: 95°C for 3 minutes, [98°C for 20s, 65°C for 15s, 72°C for 15s]x29–33, 72°C for 1 minute. PCR2 amplification was carried out using 1 μL of a 1:25 dilution of PCR1 product and 300 nM of the primer mix listed below (pegRNA_10X_PCR2_0–8_F, 10X_PCR2_R) listed below with the following conditions: 95°C for 3 minutes, [98°C for 20s, 65°C for 15s, 72°C for 15s]x19, 72°C for 1 minute. PCR3 amplification was carried out using 4 μL of a 1:25 dilution of PCR2 product and 300 nM of the reverse primer (10X_PCR2_R) listed below and Nextera i7 adapter PCR primers (listed in table S29) with the following conditions: 95°C for 3 minutes, [98°C for 20s, 72°C for 30 seconds]x12, 72°C for 1 minute.
scRNA-seq epegRNA amplification primers:
PCR1 forward (pegRNA_10X_PCR1_F): ACACTCTTTCCCTACACGACG PCR1 reverse (pegRNA_10X_PCR1_R): TTTCCCATGATTCCTTCATATTTGC
PCR2 forward (pegRNA_10X_PCR2_0–8_F): GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-[1N-to-8N]- CTTGTGGAAAGGACGAAACAC
PCR2/PCR3 reverse (10X_PCR2_R): AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTC
Following PCR amplification, lineage tracing cassette and epegRNA libraries were purified and size-selected using SPRI magnetic beads (0.9x single-sided selection) and quantified by BioAnalyzer (Agilent) to assess the size and purity of final libraries and Qubit double-stranded DNA high sensitivity Kit (Thermo Fisher Scientific) to determine the concentrations. Libraries were diluted to 4 nM and sequenced on a Nextseq 2000 (Illumina). All PCRs were first monitored to determine optimal cycle number by qPCR by adding 0.6x SYBR Green (Thermo) to the PCR reactions.
scRNA-seq data processing
All scRNA-seq data were processed using the Cellranger software package (v7.1.0). To capture lineage information, a custom reference genome was constructed by appending the 2,171 lineage tracing cassette sequences to the mouse mm10 genome build. Both gene expression and lineage tracing cassette libraries were then aligned to this reference using default parameters. Since the reference contains all 2,171 30nt barcode sequences, the intBC of each lineage tracing cassette read can be determined based on which element in the reference it aligns to.
A custom Python script was used to extract lineage information from the BAM alignment file generated by Cellranger. This script iterates through all reads aligned to a tracing cassette sequence. For each read, the LM sequence at each edit site was determined based on the alignment. For each UMI, the number of supporting reads was calculated, and the LM for each edit site was set to the LM with the most read support (ties broken randomly). UMIs were then aggregated by intBC, cellBC, and the set of LMs detected across the three edit sites to generate a final LM table.
scRNA-seq LM quality control
To generate high-quality lineage trees from scRNA-seq data, rigorous LM quality control is essential due to common issues such as PCR errors, sequencing errors, and ambient RNA contamination in tracing cassette reads. Since the levels of erroneous reads can vary between experiments–depending on sequencing depth and cell quality–and between lineage cassette integrations (intBCs) due to differences in expression levels, we applied adaptive thresholding to filter alleles in the LM table. First, for each 10X capture, we fit a two-component Gaussian mixture model to the distribution of reads/UMIs per allele and removed alleles corresponding to the lower peak of the distribution. Next, for each 10X capture and intBC, we applied the mixture model to the distribution of UMIs per allele, again filtering out alleles in the lower peak. The first step eliminates PCR and sequencing errors, which are characterized by low read support, while the second step removes alleles derived from ambient RNA, which exhibit low UMI support that varies according to the expression level of the intBC.
When a lineage tracer is read out using endogenous RNA expression, another source of error arises from the lag time between the integration of the LM into the genome and the point when RNA molecules containing that LM become the predominant species. This issue is exacerbated by the long RNA half-lives of lineage tracing cassettes, particularly in experiments conducted over short timescales, leading to errors in calling LMs that occur later in the experiment. To mitigate this problem, instead of resolving conflicting alleles by simply selecting the allele with the highest UMI count, we considered the edit state of the alleles. If the conflicting alleles differed at a single edit site–where one allele was edited and the other was not–we selected the edited allele, provided it accounted for >20% of the total UMIs for that intBC in the cell.
With only 2,171 distinct intBCs, duplicate integrations–where two tracing cassettes with the same intBC integrate into the genome of a single clone–can occasionally occur. If not filtered out, these duplicates could introduce errors, as they accrue different LMs, and we are unable to resolve which LM occurred at which integration. To address this, we identified and removed alleles from duplicate intBCs by detecting allele conflicts. For each clone and intBC, we calculated the total fraction of UMIs in cells containing more than one allele and excluded intBCs where this fraction exceeded 25%.
scRNA-seq cell quality control
Low-quality cells in scRNA-seq experiments were filtered out based on mitochondrial content and total UMI counts. Cells with >10% of UMIs assigned to mitochondrial transcripts were excluded. For the kinetics experiment, cells with <100 UMIs assigned to the transcriptome were also removed. The UMI threshold varied between experiments due to differences in transcriptome sequencing depth. For scRNA-seq of fully edited clones and lineage tracing cells with static barcodes, a threshold of 1,000 UMIs was used, while a higher threshold of 5,000 UMIs was applied to cells used in the in vitro heterogeneity analysis, reflecting their deeper sequencing coverage.
Lineage tracing enables robust and accurate detection of scRNA-seq doublets–droplets containing more than one cell–since conflicting alleles reveal the presence of multiple distinct lineages within a single droplet. Doublets were filtered out by identifying these conflicting alleles, as it is unlikely that two random cells would share the same set of LMs. Similar to the process used for removing duplicate intBCs, we calculated the total fraction of UMIs linked to intBCs with conflicting alleles for each cell and excluded cells where this fraction exceeded 25%.
epegRNA kinetics analysis
epegRNA libraries were processed using the “CRISPR Guide Capture” feature of Cellranger. epegRNA protospacer mismatch calls supported by <3 UMIs were filtered out and careful quality control performed to exclude cells with multiple epegRNA integrations as these cells could confound our estimation of protospacer mismatch kinetics. Cells with multiple epegRNA calls were removed as well as cells with lineage cassette reads where more than one edit site had an LM installed.
To calculate the edit fraction over time for each epegRNA protospacer mismatch in each cell line, we first determined the fraction of intBCs with an edit in each cell. We then calculated the mean edit fraction across cells for a given epegRNA call at each time point. Time points with fewer than 20 cells for a particular protospacer mismatch were excluded due to excessive noise in these data points.
To estimate the edit rate for each epegRNA protospacer mismatch, we fitted a saturating exponential curve to the edit fraction data. The curve is defined by the equation:
where is the edit fraction, is the time point (days), is the edit rate (edits/[edit site * day]), and is the saturation level which accounts for a small fraction of cells that do not undergo editing despite expressing the epegRNA. The curve was fitted using non-linear least squares, with constrained to the range [0.8,1] and constrained to the range [10−4,0.6]. Estimated edit rates in 4T1 and B16-F10 cells for each protospacer variant are reported in table S10.
Generation of 4T1 PEtracer cells with high numbers of integrated tracing cassettes
4T1 cells were maintained in typical culture conditions. Infections with lineage tracing cassette lentiviral libraries were carried out by plating 100,000 cells per well of a 6-well dish 18–24 hours prior to transduction. Immediately before adding the virus, media was changed on these cells to include 10ug/mL polybrene (Millipore Sigma). 50 μL of high titer virus was incubated on cells for 24 hours prior to a media change and outgrowth of cells. The brightest 10% of mCherry expressing cells were sorted 1 week after transduction to isolate cells with high integration numbers.
Representative sorting gates shown in fig. S16. This process was repeated serially to increase integration count until the average number of integrations in the population was over 10 as evaluated by qPCR using the following primer pairs and a single multiplicity of infection (MOI) line where infection rate was ~20% as an internal 1x integration control. qPCR reactions were carried out using 50 ng of genomic DNA isolated using DNeasy blood and tissue kit (Qiagen) with 500 nM primer mix and 1X DyNAmo ColorFlash SYBR Green Master Mix (Thermo) and the following cycling conditions: 95°C for 7 minutes, [95°C for 10 seconds, 60°C for 30 seconds]x40. Relative integration number was calculated by the ddCt method compared to β-actin.
Mouse β-Actin Control Forward: TTTGATGTCACGCACGATTT
Mouse β-Actin Control Reverse: AGGGCTATGCTCTCCCTCAC
mCherry Forward: AGGGCGAGATCAAGCAGAG
mCherry Reverse: CTCGTTGTGGGAGGTGATGT
Generation of 4T1 PEtracer cells edited to completion (fully-edited cells)
4T1 were nucleofected to introduce a no mismatch 24-mer pegArray (all epegRNAs in the array had protospacers with no mismatches that would slow their editing rates). Nucleofections with 4T1 cells were carried out using the Lonza 4-D nucleofector according to manufacturer’s instructions with the SE buffer, pulse code CM-150, 1.8 μg of donor pegArray plasmid and 180 ng of PiggyBac transposase plasmid (SBI). Cells were allowed to recover for 48 hours, split for >10 days to ensure loss of unintegrated plasmid, and BFP+ cells were isolated using FACS on a BD FACS Aria II (BD Biosciences). BFP+ cells were then engineered with high numbers of integrated tracing cassettes (LTCv2) by serial lentiviral infection as described above (“Generation of 4T1 PEtracer cells with high numbers of integrated tracing cassettes”). BFP+/mCherry+ cells were plated and infected with PEmax-P2A-GFP lentivirus as described for the tracing cassette lentivirus. When infecting cells with PEmax lentivirus, we aimed for 20–30% infection rates in these cells, corresponding to a single integrant per cell for the large majority of PE-containing cells. BFP+/mCherry+/GFP+ triple positive cells were FACS sorted >5 days post transduction, passaged for >3 weeks, and then re-sorted at 5 cells/well. Representative sorting gates shown in fig. S16. Clones were expanded, evaluated by bulk sequencing, and pooled to cover all unedited and edited states.
scRNA-seq of fully-edited 4T1 cells
Fully-edited 4T1 cells were passaged in RPMI as described above. Prior to droplet-based single cell sequencing using the 10X Genomics Chromium Next GEM Single Cell 3ʹ Kit v3.1, cells were split, counted, and diluted according to manufacturer’s protocols. Lineage tracing cassette libraries were generated as described above (“epegRNA mismatch library profiling to tune lineage tracing kinetics”).
We anticipated recombination during cloning and lentiviral production steps and therefore register a link between MERFISH and sequencing intBCs in fully-edited 4T1 cells by generating targeted sequencing libraries from amplified cDNA using T3 specific primers (located 5’ of both the MERFISH and sequencing intBCs).
Two successive rounds of PCR were used to prepare intBC libraries from amplified full-length cDNA. For the first PCR reaction (PCR1), 2 μL of amplified cDNA and 600 nM primer mix listed below (T3_10X_PCR1_1–4_F, 10X_PCR1_R) were used. Amplifications were performed using KAPA HiFi HotStart ReadyMix with the following cycle conditions: 95°C for 3 minutes, [98°C for 20s, 65°C for 15s, 72°C for 15s]x25, 72°C for 1 minute. PCR2 amplification was carried out using 4 μL of a 1:25 dilution of PCR1 product and 300 nM of the reverse primer (10X_PCR2_R) listed below and Nextera i7 adapter PCR primers (listed in table S29) with the following conditions: 95°C for 3 minutes, [98°C for 20 seconds, 72°C for 30 seconds]x12, 72°C for 1 minute.
scRNA-seq intBC amplification primers:
PCR1 forward (T3_10X_PCR1_1–4_F): GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG -[1N to 4N]- GTAATTAACCCTCACTAAAGG
PCR1 reverse (10X_PCR1_R): ACACTCTTTCCCTACACGACG
PCR2 forward (Nextera i7): CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTCTCGTGGGCTCGG
PCR2 reverse (10X_PCR2_R): AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTC
Following PCR amplification, lineage tracing cassette and intBC libraries were purified and size-selected using SPRI magnetic beads (0.9x single-sided selection) and quantified by BioAnalyzer (Agilent) to assess the size and purity of final libraries and Qubit double-stranded DNA high sensitivity Kit (Thermo Fisher Scientific) to determine the concentrations. Libraries were diluted to 4 nM and sequenced on a Nextseq 2000 (Illumina). All PCRs were first monitored to determine optimal cycle number by qPCR by adding 0.6x SYBR Green (Thermo) to the PCR reactions.
Reads from the T3 library were processed similarly to reads from the standard tracing cassette library, with read/UMI and UMI thresholds determined using Gaussian mixture models. The 183nt MERFISH barcode for each T3 read was identified based on alignment, while the 30nt sequencing barcode was extracted from the read sequence and corrected using a whitelist of known barcodes.
To map 183nt MERFISH barcodes to 30nt sequencing barcodes, we assigned each MERFISH barcode to a sequencing barcode in each cell based on UMI support. We then calculated the fraction of cells supporting each mapping for each clone. This approach allowed us to confidently assign MERFISH barcodes to 29 of the 31 sequencing barcodes present in the fully-edited 4T1 cells, revealing multiple instances of barcode swapping (fig. S5B and table S11). In cases of barcode swapping, integrations are referred to by their MERFISH barcode identifier.
Clone calling using intBCs
Non-negative matrix factorization (NMF) was used to call clones which we define as groups of phylogenetically related cells that share a set of intBCs. Given the UMI counts for each intBC in each cell after quality control filtering, we represented the data as a cell-by-intBC matrix . NMF was then applied to factorize into the product of two matrices, and :
Here, is a cell-by-clone matrix representing the assignment of cells to clones, while is a clone-by-intBC matrix representing the assignment of intBCs to clones. The number of clones for each experiment was manually determined based on visual inspection of the matrix .
To filter out scRNA-seq doublets, which often contain cells from distinct clones, we identified cells with intBC sets that more closely matched the combination of two clones than any single clone. First, cells were putatively assigned to clones using the matrix . Next, for each intBC, we assigned it to a clone if it was detected in >50% of the cells within that clone. Using this clone-tointBC mapping, we calculated the Jaccard similarity between the set of intBCs detected in each cell and the intBC sets for both individual clones and all pairwise combinations of clones. Cells were assigned to the clone with the highest Jaccard similarity unless the highest match was to a pair of clones, in which case the cell was marked as a doublet. If the Jaccard similarity between a cell of interest and all possible clone assignments was <0.5, the cell was marked as unassigned.
Quantification of lineage tracing data accuracy in fully-edited cells
To quantify the accuracy of lineage tracing data in fully-edited 4T1 cells, we generated a whitelist (table S11) for each clone containing the LMs installed at each edit site for each intBC. This whitelist was derived from the scRNA-seq data, where cells were grouped into four clones based on their intBCs, as described above. For each clone, we then identified the most common LM at each edit site for each intBC, which formed the basis of the whitelist.
We assessed the accuracy and sensitivity of both intBC and LM detection using this whitelist. Based on each cell’s clone assignment, we identified the expected set of intBCs and LMs and compared these to the actual intBCs and LMs detected. For intBCs, we calculated the true positive and false negative rates from the fraction of expected intBCs that were detected, and the false positive rate from the fraction of detected intBCs that were not expected. For LMs, we calculated overall call accuracy, as well as accuracy per edit site and per LM. In scRNA-seq data, the LM detection rate matches the intBC detection rate, as LMs can be called for all detected intBCs. However, in imaging data, the LM detection rate is slightly lower than the intBC detection rate, since for some amplicons the intBC can be accurately called but the LM at one or more edit sites cannot.
Microscope setup for image acquisition
Image acquisition was performed using a custom-built microscope system, as previously described (151) with some modifications. The system was built around a Nikon Ti2-U microscope body with a Nikon CFI Plan Apochromat Lambda D 60x oil immersion objective with 1.42 NA. The system also included a Nikon CFI Plan Fluor 10x objective with 0.3 NA for quick identification of the sample position. Illumination of samples was through a Lumencor CELESTA light engine (a fiber-coupled solid-state laser-based illumination system) with the following wavelengths: 405 nm, 477 nm, 545 nm, 637 nm, and 748 nm. This system was used with a penta-bandpass dichroic mirror (Semrock, FF421/491/567/659/776-Di01–25X36), a penta-bandpass emission filter (Semrock, FF01–441/511/593/684/817–25), and a penta-bandpass excitation filter (Semrock, FF01– 391/477/549/636/741–25). A scientific CMOS camera (Hamamatsu C14440 with factory calibration for single-molecule imaging) was used for image acquisition. Each camera field of view (FOV) consisted of 2,304 × 2,304 pixels, with a camera pixel corresponding to 107 nm in the X and Y dimensions in the imaging plane for the 60x oil immersion objective with 1.42 NA. Sample position in three dimensions was controlled using a XYZ stage (Ludl Electronic Products). A custom-built auto-focus system was used to maintain a constant focal plane over prolonged time intervals. This was achieved by comparing the relative position of two IR laser (Thorlabs, LP980-SF15) beams reflected from the glass-fluid interface and imaged on a separate CMOS camera (Thorlabs, CS165MU1).
These different components were controlled using a National Instruments Data Acquisition card (NI PCIe-6321, X Series DAQ) and custom software (see the “Software for controlling experimental components” section below).
MERFISH fluidics system configuration
The fluidics system consisted of several core components similar to previous designs with some modifications (151): a syringe pump to control flow speed and volume; a set of daisy-chain-connected valves to select flow-in buffers; a flow chamber in which the sample was mounted; a set of tubing, connectors, and needles; and a 3-D printed chassis to hold a 24-well deep-well plate (Agilent, 204023–100). Four MVP4 High Torque valves (Hamilton, 94747–02) with 8-way connector heads (Hamilton, 97472–01) were daisy-chain connected (port 8 of each valve is connected to the input of the next valve), and the output of the first valve was connected to the FCS2 flow chamber (Bioptechs, 03060319–2-NH-30) in which samples were placed together with a 0.5 mm thick rubber gasket (Bioptechs, DIE# F18524). The chamber output was connected to the input port of the syringe pump (Hamilton, PSD6) which controlled the flow speed and volume. The output port of the syringe pump was further connected to a waste collecting vessel.
This system was controlled using a custom software (see the “Software for controlling experimental components” section below). Overall, this system allowed for 24 rounds of hybridization. We also added two 3-way adaptor sets (Bioptechs, 162003–1) and connected these two adaptors directly with tubing to allow fluidic bypass of the FCS2 chamber, which allows us to perform washing of the fluidic system without unmounting the sample if needed.
Software for controlling experimental components
All system components were controlled using custom-built software available at https://github.com/ZhuangLab/storm-control. The software package is composed of several main modules that work in concert:
“Hal” is the software package used to control and synchronize all illumination and microscope components. We note that in some cases it is necessary to install drivers for components which are not included in this package, for example, the camera for autofocus system (Thorlabs, CS165MU1) requires Thorlabs official drivers. Specific imaging parameters including laser intensities, z-scan steps, exposure time for each frame and the sequence of shutter wrapped in xml files are loaded and executed by Hal. These parameter files were generated by Python scripts provided in our Github repository (see “Code availability” below).
“Steve” is a module used to generate mosaic images (i.e., a composite of many individual fields of view (FOV)) of a sample to select FOV for imaging in experiments.
“Kilroy” is the software used to control fluidics components, and to define pre-programmed sequences of operations to be performed as sets (e.g., the set of operations that initiate a new round of hybridization).
“Dave” can issue commands to both Hal and Kilroy and is used to automate the performance of data collection by pre-programmed the complete set of fluidics system and microscope operations used for an experiment (i.e. the order and time-lag in which fluidics and imaging steps are to be performed).
The general outline of an experiment is as follows: before the experiment starts, Hal and Kilroy are loaded with the parameters and specifications to be used. After the sample is loaded and the chamber is filled with the imaging buffer, a mosaic image of the DAPI channel is taken using Steve, and FOVs of interest are selected. A file is then generated to specify the sequence of operations throughout the entire experiment and is loaded to Dave, together with the coordinates of the selected regions of interest. The rest of the experiment is automated. If the number of rounds in an experiment exceeds the capacity of the flow system, we can specify a Dave file to automatically run up to the capacity of the system, then use the bypass tubing to clean up the fluidic system without perturbing the mounted sample. After cleaning up, all buffers are replaced and a new Dave file is created. This is repeated until all rounds of imaging are completed.
Design of intBC encoding probes
A combinatorial detection scheme for our 2,171 intBCs was inspired by MERFISH experiments (58) and used a modified Hamming distance 4, Hamming weight 6 21-bit codebook derived from covering designs (https://www.dmgordon.org/cover/, with parameters: v=21, k=6, t=5). Each of the 21 total bits in the codebook are assigned to a unique readout sequence and each intBC was assigned to a six bit code in the codebook (table S6). We designed six encoding probes per intBC consisting of a 30nt targeting sequence and three readout sequences. For each intBC, all assigned readout sequences are represented across the set of encoding probes. The 183nt MERFISH barcode within each integration barcode was partitioned into 6 non-overlapping 30nt segments. These 30nt sequences have been tested for cross-hybridization during intBC design procedures, therefore the reverse-complement of these 30nt sequences were directly used as targeting sequences in probes. Readout sequences were appended to each targeting sequence separated by an “A”, with all assigned readout sequences evenly represented across the set of encoding probes designed for that intBC. 20nt forward and reverse primers were then added to the 5’ and 3’ of these probes. Then the fully assembled probes were further screened for cross-hybridization, so that any probes with a 17nt segment reverse-complementing to other probes were removed. Full probe sequences can be found in table S12.
Design of LM and unedited probes
A unique probe was designed to hybridize to each LM installed at its cognate edit site. Specifically, edit site sequences modified with each LM were generated in silico and probes were designed to hybridize to each modified edit site. Probes were designed to be 20nt in length and centered around each 5nt inserted LM. For each LM, a specific MERFISH readout sequence was chosen and two copies of this readout sequence were appended to the 5’ side of each edit site/LM probe. All LM probes were directly purchased from IDT. Several combinations were tested and the final set of probes with the highest signal to noise ratios (SNRs) were merged. Full probe sequences can be found in table S13.
Design of tracing cassette common sequence probes
Target sequences of common bits were directly chosen from the 500nt common sequence located 3’ of the edit sites in the lineage tracing cassette (Addgene # pending). Common sequence probes were designed to target 24 30nt sequences within this 500nt common sequence. Two 20nt MERFISH readout sequences (table S16) were appended to the 5’ and 3’ of these targeting sequences flanked by a set of forward and reverse primers. Full probe sequences could be found in table S14.
Encoding probe amplification
Encoding probes were ordered as oligo pools from Twist Bioscience. Each oligo pool was first dissolved in 100 μL TE buffer (Thermo Fisher Scientific, AM9849). Encoding probes were amplified from the template library using a previously described amplification protocol (151) with additional modifications:
The initial oligo pool was amplified using limited-cycle PCR for approximately 11–15 cycles dependent on the library size, where the reverse primer used for this amplification included a T7 promoter sequence to enable in vitro transcription in the next step. PCR primer sequences are listed below.
The resulting PCR product was purified via column purification (DNA Clean & Concentrator-25, Zymo Research, D4033).
The purified PCR product underwent further amplification and conversion to RNA by a high-yield in vitro T7-mediated transcription reaction (HiScribe® T7 Quick High Yield RNA Synthesis Kit, NEB, E2050).
The resulting RNA product was purified by column (Monarch® RNA Cleanup Kit, NEB, T2050).
The purified RNA product was converted back to single-stranded DNA (ssDNA) by a reverse transcription reaction (Maxima H Minus Reverse Transcriptase, Thermo Fisher Scientific, EP0753) using the forward primer from step 1 in this protocol. The final product was purified by adding 3x volume of the self-made SPRI beads, eluted in TE buffer followed by an ethanol precipitation to concentrate to 25 nM for each probe in TE buffer. This probe stock could be stored in −20°C. All primers were purchased from IDT.
Probes for common sequence in lineage cassette were amplified by:
Forward primer: CGGGTTTCGTTGCGCACACC
Reverse primer: TAATACGACTCACTATAGGGCTTGTGCATCGCGCCAAAGA
Probes for intBC were amplified by:
Forward primer: CCCGCAATGGCTGACAACCG
Reverse primer: TAATACGACTCACTATAGGGATTGCCGCATGGTTTCCG
Probes for 124-gene MERFISH library were amplified by:
Forward primer: CCCGCAATGGCTGACAACCG
Reverse primer: TAATACGACTCACTATAGGGATTGCCGCATGGTTTCCG
Probes for 175-gene MERFISH library were amplified by two sub-pools:
Amplification of 150-gene Sub-pool Forward primer: CCCGCAATGGCTGACAACCG Amplification of 150-gene Sub-pool Reverse primer:
TAATACGACTCACTATAGGGATTGCCGCATGGTTTCCG
Amplification of 25-gene Sub-pool Forward primer: CGGGTTTCGTTGCGCACACC Amplification of 25-gene Sub-pool Reverse primer:
TAATACGACTCACTATAGGGCTTGTGCATCGCGCCAAAGA
Glass coverslip cleaning and silanization
40 mm diameter, 0.17 mm thickness glass coverslips (Bioptechs, 40–1313-03192) were first cleaned by placing coverslips in a plastic holder (Entergris, A23–0215) and performing 3× 5 minute rinses with MilliQ water in a glass dish (Electron Microscopy Sciences, 70312–31) prior to a 30-minute incubation in a chemical hood at room temperature with a mixture of equal volumes of Methanol (Sigma, 34860) and 37% HCl (Sigma-Aldrich 258148). After this incubation, 4× 5-minute MilliQ water rinses were performed prior to 2× 5-minute washes with 70% ethanol. Coverslips were gently blown dry using compressed air and left to fully dry.
Cleaned coverslips were placed in a glass holder and incubated for 30 minutes at room temperature in a mixture of 1L chloroform (Sigma-Aldrich 319988), 1 mL triethylamine (Sigma-Aldrich 81101), and 2 mL allyl-trichlorosilane (Sigma-Aldrich 107778) which were all stored and handled according to provider’s instructions. Using glass pipetting agents is essential to generating high-quality coverslips. Following this incubation, coverslips were washed by performing 2× 5-minute chloroform washes followed by a 1× 5-minute wash in 100% ethanol. After this ethanol wash, coverslips were gently blown dry by compressed air and left to fully dry. Following silanization, coverslips were stored in a dry, dark place in a sealed container with a dedicated set of desiccant cartridges (VWR, 76538–672).
Poly-D lysine coating silanized glass coverslips
Silanized coverslips were incubated with 0.1 mg/mL Poly-D lysine (PDL; Thermo Fisher Scientific, A3890401) overnight at room temperature in a 3-D printed box with parafilm sealing top. Following PDL coating, coverslips were cleaned with 3× 5-minute rinses with MilliQ water followed by a 1x rinse with 100% ethanol before being gently blown dry with compressed air and left to dry. Coverslips were stored in 3-D printed cases at 4°C with the same desiccant cartridge and sterilized with ultraviolet light for 20 minutes immediately prior to use.
Seeding of fully-edited cells prior to imaging
Fully-edited cells were cultured as described above prior to seeding 400,000 cells on cleaned, silanized, and PDL-coated 40 mm diameter round coverslips in 60 mm diameter petri-dishes (Corning, 430166) in RPMI media supplemented with 10% FBS and 0.1% PSG. Samples were then prepared for FISH-based lineage readout using either the “Zombie” or “In-gel T7” protocols described in the following sections. 59 fields of view (FOVs) were imaged for one coverslip prepared with the “Zombie” protocol and 69 FOVs for one coverslip prepared with the “In-gel T7” protocol as enabling comparison of these methods.
Sample preparation for Zombie experiments
The Zombie protocol was adapted from previously published methods (42, 56). In brief, samples were washed 2x with 1x PBS for 5 minutes each and subsequently fixed with a 3:1 (vol:vol) mix of methanol and acetic acid at room temperature for 20 minutes. Cells were then washed 4x with 1x PBS prior to a 1x wash with nuclease-free water. Samples were subsequently incubated with T7 in vitro transcription mix (MEGAscript Transcription Kit; Invitrogen) supplemented with 2.5 mM aminoallyl-UTP (Thermo Fisher Scientific, R1091) at 37°C for 16 hours. After transcription, cells were fixed with 4% formaldehyde in PBS for 20 minutes at room temperature followed by 2x washes with 1x PBS prior to the same hybridization as in the section “Lineage probe hybridization” below.
Sample preparation for in-gel T7 amplification
Samples were washed 2x with 1x PBS for 5 minutes each and then fixed by treatment with 4% paraformaldehyde in 1x PBS for 10 minutes at room temperature. Samples were subsequently washed three times with 1x PBS and stored in 1x PBS supplemented with 1:500 RNase Inhibitor Murine (NEB, M0314L) for up to 8 weeks.
The fully-edited cells fixed on coverslip were embedded in a hydrogel prior to tissue clearing to reduce tissue autofluorescence and off-target probe binding. First, the sample was incubated in gel solution consisting of 300 mM NaCl, 4% (vol/vol) of 19:1 acrylamide/bisacrylamide, 60 mM Tris-HCl pH 8, 0.1% (vol/vol) TEMED, 1/40,000 (vol/vol) 505/515 nm fiducial beads (Invitrogen F8803), and 0.5% (wt/vol) ammonium persulfate for 5 minutes at room temperature. We then placed a 60 μL droplet of the gel solution onto a hydrophobic glass slide treated with GelSlick (Lonza). The coverslip bearing the sample was then inverted onto the droplet to form a uniform layer of monomer solution. The sample was allowed to polymerize completely for >1 hour at room temperature.
The coverslip bearing the gel-embedded sample was then removed from the glass slide with a thin razor blade. The sample was then incubated in digestion buffer containing 2% (wt/vol) SDS, 0.5% (vol/vol) Triton-X-100 (Sigma, X100), and 1% (vol/vol) Proteinase K (NEB P8107S) in 2x SSC for 24 hours at 47°C and a subsequent 24 hours at 37°C. Following digestion, the sample was washed by performing 6× 20 minute washes with 2x SSC buffer supplemented with 0.2% (vol/vol) Rnase Inhibitor Murine (NEB) at room temperature. The sample was further digested in 800 mM Guanidine HCl (IBI Scientific, IB05080), 1 mM EDTA, 50 mM Tris-HCl pH=8.0, and 0.5% Triton-X-100 supplemented by 1% (vol/vol) Proteinase K for 24 hours at 37°C.
After Guanidine HCl digestion, samples were transferred to new petri-dishes with fresh 2x SSC for one quick wash followed by a 10-minute incubation in 2x SSC supplemented with 1 mM phenylmethylsulfonyl fluoride (PMSF) protease inhibitor (Thermo Fisher Scientific, 36978). Samples were then thoroughly washed by performing 5× 15-minute rinses with 2x SSC on a shaker followed by 3× 15-minute washes with nuclease free water on the shaker. The remaining water was aspirated from the coverslips, and each coverslip was inverted onto a 100 μL droplet of T7 in vitro transcription mix (MEGAscript Transcription Kit; Invitrogen AM1334) supplemented with 2.5 mM aminoallyl-UTP (Thermo Fisher Scientific, R1091) on a parafilm-covered 60 mm diameter petri-dish. Transcription was carried out for 16 hours at 37°C. Optionally, samples were remounted onto fresh T7 transcription mixture after this 16 hours incubation for an additional 3 hours at 37°C.
After transcription, samples were fixed with 4% formaldehyde solution in 1x PBS for 20 minutes at room temperature followed by 2x washes with 1x PBS. After these washes, probe hybridization was carried out as described in the section “Lineage probe hybridization” below.
Lineage probe hybridization
Following either the Zombie or in-gel T7 amplification protocols above, probes for lineage intBCs, common sequences, and LMs were hybridized similar to published MERFISH protocols (152). Two separate hybridizations were performed: in the first, probes for the intBCs and common sequences were hybridized, while in the second LM-encoding probes were hybridized.
For the first hybridization, samples were washed once with 2x SSC prior to equilibration in encoding-probe wash buffer (30% (v/v) formamide [Thermo Scientific, AM9344] in 2x SSC) supplemented with 0.1% Tween-20 (Sigma, P9416) for 10–15 minutes at room temperature. Wash buffer was then aspirated from the coverslip, and coverslips were inverted with the sample side facing down onto a 50 μL droplet of probe mixture on a parafilm-covered 60 mm petri-dish. The probe mixture comprised encoding probes for each intBCs at 1 nM and encoding probes for the two-color common sequence at 10 nM in 2x SSC with 30% v/v formamide, 0.1% wt/v yeast tRNA (Life Technologies) and 10% v/v dextran sulfate (Sigma, D8906). Probe hybridization was carried out by incubating at 37°C for 24 hours in a humidified chamber. After hybridization, samples were washed 1x with encoding-probe wash buffer for 30 minutes at 47°C to remove excess probes prior to a 1x wash with 2x SSC before proceeding directly to LM probe hybridization.
Samples were equilibrated in LM-probe wash buffer (10% (v/v) formamide in 2x SSC) supplemented with 0.1% Tween-20 for 10–15 minutes at room temperature. Wash buffer was then aspirated from the coverslip, and each coverslip was inverted with the sample side facing down onto a 50 μL droplet of probe mixture on a parafilm-covered 60 mm petri-dish as above. This probe mixture comprised 300 nM of each LM probe (directly purchased from IDT) in 2x SSC with 30% v/v formamide, 0.1% wt/v yeast tRNA (Life Technologies) and 10% v/v dextran sulfate (D8906, Sigma). Samples were incubated at 37°C for 24 hours in a humidified chamber. After hybridization, the sample was washed 2x with the encoding-probe wash buffer for 15 minutes at room temperature to remove excess probes.
Following these two hybridizations, samples were embedded in a second 4% poly-acrylamide gel as described in the section “In-gel T7 amplification for fully-edited cells” above to further immobilize T7 amplicons. After this embedding, samples can be stored at 4°C in 2x SSC supplemented with 1:1000 Rnase Inhibitor Murine for up to 4 days.
Imaging of integration barcodes and lineage marks
After lineage probe hybridization and secondary gel-embedding, samples were assembled into the FCS2 flow chamber (Bioptechs, 060319–2). Samples were imaged by the home-built imaging platform described in the “Microscope setup for image acquisition” and “MERFISH fluidics system configuration” sections. All fluid exchanges on the microscope use the custom-built fluidics system described in the “Fluidics system configuration” section. For each coverslip, nuclear staining by DAPI was used to help select FOVs of interest from mosaic images generated at 10x magnification using “Steve” as described in “Software for controlling experimental components”.
Throughout the course of imaging, we performed multiple rounds of hybridization and imaging. Specifically, we imaged 2 colocalizing bits for common sequences used to nominate potential amplions in round 0; 21-bit combinatorial scheme for integration barcode readout in rounds 1–7, and 27 sequential bits for LMs and unedited states in rounds 8–16 (three colors per round). We designed 51 adaptor probes to convert bit-specific readout sequences in the encoding probes into common readout sequences. Each adaptor probe consists of: a 5’ sequence that is complementary to one of the bit-specific readout sequences used in encoding probes, and a 3’ sequence with two copies of the reverse-complement of one common readout sequence (table S16) (151). We designed 3 common readout sequences in total. Each common readout probe contains a readout sequence that is complementary to the adaptor probe and has a fluorescent dye (Alexa750, Alexa647 or Atto565) linked to the 5’ of the oligo via a disulfide bond (table S15).
For each round, samples were imaged as follows unless otherwise described:
Samples were hybridized with 3 adaptors probes labeling 3 different bits at 100 nM in secondary hybridization buffer (2x SSC, 30% v/v formamide) for 15 minutes at room temperature.
Samples were washed with the secondary hybridization buffer for 3 minutes at room temperature.
Samples were then hybridized with a set of fluorescently labeled oligonucleotide probes termed “common readout probes” (synthesized by IDT) (table S15) at 20–25 nM in secondary hybridization buffer for 15 minutes at room temperature.
Samples were washed again with the secondary hybridization buffer, for 3 minutes at room temperature.
Samples were then incubated and kept in imaging buffer comprising 5 mM 3,4-dihydroxybenzoic acid (Sigma, P5630), 100 μM of Trolox and Trolox quinone mixture (generated by UV radiation of a Trolox solution ~50 μM of Trolox quinone final quantified by UV-vis spectrometry; Sigma, 238813), 1:500 recombinant protocatechuate 3,4-dioxygenase (rPCO; OYC Americas, 46852904), 1:1000 Murine RNase inhibitor (NEB, M0314L), 10 mM Tris-HCl pH=7.5 (Thermo Fisher Scientific, 15568025), and 5 mM NaOH (to adjust pH to 8.0; Sigma, S2770) in 2x SSC.
Samples were imaged using a 60x oil immersion objective, and signals for the common readout probes and fiducial beads were collected in the 748 nm, 637 nm, 545 nm, and 488 nm channels, respectively at a rate of 10 Hz. All channels were imaged for 50–60 consecutive 0.6 μm-thick z-stacks for each FOV of interest.
After imaging all FOVs, the signal from the previous round was extinguished by incubating the sample in cleavage buffer comprising 2x SSC, 30% formamide supplemented with 50 mM Tris (2-carboxyethyl) phosphine (TCEP; Sigma, 646547) for 10 minutes at room temperature. TCEP treatment cleaves the disulfide bond connecting fluorophores to readout probes. Cleavage buffer also contained 167 nM common readout probes without fluorophores to block unoccupied common readout sequences on the adaptor probes from interfering with the next round of hybridization.
All rounds of imaging were carried out as described above except for the first round. For the first round, before the imaging buffer step (step 5), samples were additionally washed in 2x SSC once and then incubated in 2x SSC containing 2.5 μg/mL DAPI (Thermo Fisher Scientific, D1306) for 10 minutes to stain nuclei. DAPI signal was collected in the 405 nm channel, and either the 405 nm channel or all channels in this round were imaged for 50–60 consecutive 0.6 μm-thick z-stacks for each FOV of interest, depending on the experiment. Nuclear DAPI signal was imaged together with common sequences in round 0, which was later used as reference for segmentation and fine alignments described here: “Sample alignment between MERFISH and lineage imaging”.
Nuclei segmentation for in vitro samples
Cellpose (v3.1.0) was used to segment cell nuclei in 3-D based on DAPI signal (98). For each field of view (FOV) the DAPI image was adjusted to account for the illumination profile of the microscope and then downsampled by a factor of four in the x-y plane. Cellpose was run on a GPU using the “nuclei” model with a diameter of 25 pixels and a cell probability threshold of −4. The resulting nuclei masks were converted into polygons with a tolerance of 0.5 pixels.
Nuclei polygons from all FOVs were then aggregated. Since adjacent FOVs overlap, the same nuclei could be segmented in multiple FOVs, leading to overlapping polygons that required correction. For large overlaps (polygons from neighboring z planes that overlap >20% of their area), the two polygons were merged, as they were likely the same nuclei segmented twice. For smaller overlaps, the intersecting volume was subtracted from the smaller polygon, as these were likely segmentation errors. After aggregating all polygons and correcting overlaps, polygons with a volume <500 μm3 were removed.
3-D nuclei polygons, represented as a collection of 2-D polygons for each z-slice, were used to assign T7 amplicons and MERFISH transcripts to cells, as described in detail below. For 2-D visualization, nuclei outlines were generated by computing the union of all 2-D polygons along the z-axis.
T7 amplicon detection and quantification
After T7 amplification, each lineage tracing cassette integration should be marked by a local population of RNA molecules centered at the site of transcription. These amplicons were detected using hybridization probes targeting the common sequence shared by all integrations. To maximize the detection rate, the common sequence was probed in two independent rounds of hybridization. To identify amplicons, the two hybridization images were maximum-projected, and a 2-D unsharp mask (sigma = 10 pixels) was applied to reduce background. Spherical intensity peaks were then identified in 3-D using the Laplacian of Gaussian method (sigma = 2–10 pixels, minimum intensity = 200). The spot detection parameters were optimized to prioritize recall of real amplicons, accepting false positives that could be filtered out during integration barcode decoding.
Following amplicon detection, the intensity of each hybridization round for each amplicon was quantified. A 2-D unsharp mask (sigma = 10 pixels) was applied to each channel, and then the maximum signal intensity within a 10-pixel diameter circle in the x-y plane centered at each amplicon position was calculated. This quantification was performed in 3-D to resolve amplicons that overlapped in the z-dimension. To correct for x-y drift between imaging rounds, fiducial beads imaged in each round were used for registration. The drift was calculated using phase cross-correlation between bead images in each round compared to the first round, and amplicon positions were adjusted accordingly.
T7 amplicon integration barcode decoding
The intBC for each T7 amplicon was determined by probing the 183nt MERFISH barcode across 7 rounds of 3-color imaging. After detecting potential T7 amplicons and quantifying their intensity across the 21 bits as described above, data from all fields of view were combined into a N-by-21 intensity matrix, , where is the total number of potential amplicons. With this representation, intBC decoding becomes an optimization problem, where we seek to assign each row in to the most likely intBC codeword (21-bit Hamming distance 4, Hamming weight 6 code) accounting for variability in T7 amplification and probe hybridization efficiency across rounds. Following the approach of Moffit et al. (2016) we use an expectation maximization (EM) algorithm to solve this optimization problem (153).
First, is color normalized to correct for differences in laser power and fluorophore intensity by rescaling each column so that its mean intensity is consistent across channels. Then, to account for variability in amplicon intensity and probe hybridization efficiency, is normalized by the dot product of mean spot and bit intensities:
Here, is defined as the mean intensity of each spot across the “1” bits indicated by its corresponding codeword, and is the mean intensity of each bit across spots (after normalization by spot intensity). Initial estimates are obtained by setting to the 95th percentile of each row and to the 95th percentile of each column (after scaling by ). Finally, is adjusted for each bit’s noise profile:
where is the signal-to-noise ratio for each bit; values in are clipped to 1.
In the expectation step, is used to assign each spot to an intBC codeword from the predefined list of 2,171 valid binary codewords (listed in table S6). A KDTree is used to efficiently perform a nearest-neighbor search that identifies the intBC codeword minimizing the Euclidean distance to each column in . Spots with a distance ≥1.7 between the codeword and corrected intensity vector are masked out. Spots with a distance ≥1.7 are masked out, a threshold that balances error correction with the exclusion signals that do not correspond to true integration amplicons.
In the maximization step, intBC identity assignment in the previous expectation step is used to update , , and , where is recalculated for all rows, while only spots with confidently assigned intBCs contribute to the updates of and . For each bit, is computed as the ratio of the mean intensity for spots where the bit is present (“1” in the codeword) to the mean intensity for spots where it is absent (“0” in the codeword).
After 10 iterations–sufficient for convergence–each amplicon is decoded based on its assigned intBC. Amplicons with a matching distance of ≥1.7 are filtered out and excluded from downstream analysis.
T7 amplicon lineage mark decoding
Each LM is detected by a separate hybridization probe, so decoding the eight LMs for each edit site as well as the unedited state requires 9 rounds of 3-color FISH imaging (27 bits total, 9 bits per edit site x 3 edit sites). For spots with confident intBC assignments, data from all fields of view were combined into an N-by-27 intensity matrix X, as described above. X is then color-balanced by rescaling each column so that its mean intensity is consistent across channels, correcting for differences in laser power and fluorophore brightness.
To select an LM for each edit site, we extract the relevant columns from X to form an N-by-9 intensity matrix, Xsite. Each row in Xsite is normalized so that the sum of its values equals 1, accounting for variation in overall spot intensity. While simply choosing the LM corresponding to the maximum value in each row of Xsite works well, we found that training a logistic regression classifier further improves decoding accuracy. The classifier learns to account for variations in probe affinity and cross-hybridization that affect the intensity profile of each LM across hybridization rounds.
We trained the logistic regression classifier using the known linkage between intBCs and LMs in our fully-edited cells (see “Quantification of lineage tracing data accuracy in fully-edited cells” for details). For each row in Xsite, we determined the ground-truth LM based on the decoded intBC and then trained the classifier to predict this value. In vitro decoding accuracies for both our in-gel tissue-clearing protocol and the standard Zombie protocol were evaluated using 5-fold cross-validation. The classifier was subsequently retrained on all in vitro data from the in-gel protocol and then applied to decode all subsequent experiments, including in vivo data from fully-edited cells, which serve as an out-of-sample control.
Separate classifiers were trained for each edit site. Accuracies were reported for all decoded LMs as well as for the subset of LMs confidently decoded by each classifier (with prediction probability p ≥0.7). Only LMs with p ≥0.7 were used in downstream analyses such as tree reconstruction.
Assignment of T7 amplicons to cells
Decoded T7 amplicons were assigned to nuclei segmentation masks in 3-D using the geopandas package (v1.0.1). Using the 3-D coordinates of each amplicon and the 3-D polygons representing nuclei, amplicons falling within a polygon were assigned to the corresponding nucleus, then to account for segmentation errors, amplicons located just outside a mask (<4 μm) were assigned to the nearest nucleus. The Euclidean distance from each amplicon to its assigned mask was recorded, and amplicons outside segmentation masks were only used in tree reconstructions if no amplicon with the same intBC was detected within the mask.
Imaging readout LM quality control
Similar to scRNA-seq LM quality control (see “scRNA-seq LM quality control” for reference), we identify a set of high-confidence LM calls for each intBC in every imaged cell. We first remove amplicons with poor LM detection by excluding those with a mean LM assignment probability <0.7 across the three edit sites. To account for instances where the same amplicon is detected multiple times by the spot detection algorithm, we group intBCs with identical LM sets within each cell and retain only the brightest spot from each group. To eliminate duplicate integrations (i.e., two lineage tracing cassettes sharing the same intBC), we exclude intBCs detected in more than one amplicon with different LMs in >40% of the cells for a given clone. Additionally, to remove segmentation doublets–cases where two cells are not properly separated–we filter out cells with >50% conflicting amplicons. Finally, to uniquely assign LMs to intBCs, we resolve conflicting amplicons by selecting the brightest spot in each cell, with a preference for those contained within the nuclear mask.
After quality control the accuracy of image-based intBC and LM decoding was quantified using fully-edited cells as described in the “Quantification of lineage tracing data accuracy in fully-edited cells” section.
Cloning and evaluation of puromycin/blasticidin lentiviral static barcode libraries
Lentiviral plasmids expressing puromycin or blasticidin resistance genes were cloned with Gibson assembly as described above. We introduced a 10N random barcode that could be efficiently assayed using scRNA-seq-based capture of 3’ polyadenylated transcripts. These libraries were cloned by introducing oligonucleotide libraries ordered from IDT using Gibson assembly. The same strategy used for 5N epegRNA libraries described above (see section “Cloning of a comprehensive 1,024 5 nucleotide (nt) LM epegRNA libraries” for details) was used where top and bottom oligo libraries were ordered, cloned, and transformed before libraries were pooled, expanded through outgrowth for 8 hours and Midiprepped as detailed above. Lentiviral plasmid libraries were sequence confirmed as described for the epegRNA libraries with modest differences. As before, plasmid libraries were amplified with two successive rounds of PCR as done to sequence genomic loci. PCR1 was performed using Phusion Green Hot Start II High-Fidelity PCR 2x Master Mix with the following cycle conditions: 98°C for 2 minutes, [98°C for 10 seconds, 64°C for 20 seconds, 72°C for 1 minute] where the number of cycles was empirically determined by qPR, followed by a 72°C for 2 minutes. PCR2, gel extraction, library preparation, and sequencing were carried out as before.
Puromycin 10N library top: CAGAAAGCCTGGCGCCTGACNNNNNNNNNNGTTTAAACCCGCTGATCAGCCT CGACTG
Puromycin 10N library bottom: CAGTCGAGGCTGATCAGCGGGTTTAAACNNNNNNNNNNGTCAGGCGCCAGG CTTTCTG
Blasticidin 10N library top: TTATGTGTGGGAGGGCTAACNNNNNNNNNNGTTTAAACCCGCTGATCAGCCT CGACTG
Blasticidin 10N library bottom: CAGTCGAGGCTGATCAGCGGGTTTAAACNNNNNNNNNNGTTAGCCCTCCCAC ACATAA
Forward primer for puromycin library amplification: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG -[1N to 4N]- CCTGGTGCATGACCAGAAAGC
Forward primer for blasticidin library amplification: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG -[1N to 4N]- GTGAATTGCTGCCCTCTGGTTATGTG
Reverse primer for puromycin and blasticidin library amplification: GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG -[1N to 4N]- GCACAGTCGAGGCTGATCAG
To assess the balance of library elements, we used FastX toolkit (v0.0.14) to mask low quality bases (<q20) and collapse reads. We removed common flanking sequences and summed reads for unique barcode sequences. Lentivirus was produced from these plasmid libraries as described above (“Lentiviral production”).
In vitro tracing with ground-truth barcoding readout with scRNA-seq
4T1 cells with high numbers of integrated LTCv1 tracing cassettes were nucleofected with a pegArray tuned to a 2–3-week timescale (24-mer comprises: edit site 2 protospacer mutant 8, edit site 3 protospacer mutant = 34, edit site 1 protospacer mutant = 32; see table S10). >10 days post-nucleofection, BFP+/mCherry+ double positive cells were isolated by FACS sorting and expanded. On day 0 of the experiment, tracing cassette and pegArray containing cells were subsequently infected with PEmax-T2A-GFP lentivirus with an infection rate of ~30% and thus typically 1 integrant per cell as above. 48 hours post-infection (on day 2), BFP+/mCherry+/GFP+ triple positive cells were sorted as 2 cells/well into 96-well tissue culture-treated plates (Corning). Representative sorting gates shown in fig. S16. On day 5, sorted cells were infected with 10 μL of the puromycin static barcode lentiviral library. Media was changed on Day 6 to prevent continued infection events. On day 7, cells were infected with 10 μL of the blasticidin static barcode lentiviral library. On day 8, cells were split into a fresh TC-treated 96-well plate to continue expansion of the clones. Robust wells were expanded until treatment on day 12 with 4 μg/mL blasticidin-containing selection media. Cells were passaged in blasticidin selection media until day 14 when cells were split into media containing both 4 μg/mL blasticidin and 2 μg/mL puromycin. Double selection continued until day 16, at which points cells were split, counted, and droplet-based single cell sequencing was performed by pooling one to three clones per capture for a total of 6 clones and 3 captures using the 10X Genomics Chromium Next GEM Single Cell 3ʹ Kit v3.1. Lineage tracing cassette libraries were generated as described above (“epegRNA mismatch library profiling to tune lineage tracing kinetics”).
Sequencing and analysis of in vitro tracing with ground-truth barcoding readouts
Two successive rounds of PCR were used to prepare puromycin and blasticidin barcode libraries from amplified full-length cDNA. For the first PCR reaction (PCR1), 2 μL of amplified cDNA and 600 nM primer mix listed below (Puro_10X_PCR1_1–4_F/Blast_10X_PCR1_1–4_F, 10X_PCR1_R) were used. Amplifications were performed using KAPA HiFi HotStart ReadyMix with the following cycle conditions: 95°C for 3 minutes, [98°C for 20s, 65°C for 15s, 72°C for 15s]x18–20, 72°C for 1 minute. PCR2 amplification was carried out using 4 μL of a 1:25 dilution of PCR1 product and 300 nM of the reverse primer (10X_PCR2_R) listed below and Nextera i7 adapter PCR primers (listed in table S29) with the following conditions: 95°C for 3 minutes, [98°C for 20s, 72°C for 30 seconds]x12, 72°C for 1 minute.
intBC amplification primers:
PCR1 forward (Puro_10X_PCR1_1–4_F): GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG -[1N to 4N]- CCTGGTGCATGACCAGAAAGC
PCR1 forward (Blast_10X_PCR1_1–4_F): GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG -[1N to 4N]- GTGAATTGCTGCCCTCTGGTTATGTG
PCR1 reverse (10X_PCR1_R): ACACTCTTTCCCTACACGACG
PCR2 forward (Nextera i7): CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTCTCGTGGGCTCGG
PCR2 reverse (10X_PCR2_R): AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTC
Following PCR amplification, lineage tracing cassette and Puro/Blast barcode libraries were purified and size-selected using SPRI magnetic beads (LTC libraries: 0.9x single-sided selection, Puro/Blast barcode libraries: 1.2x single-sided selection) and quantified by BioAnalyzer (Agilent) to assess the size and purity of final libraries and Qubit double-stranded DNA high sensitivity Kit (Thermo Fisher Scientific) to determine the concentrations. Libraries were diluted to 4 nM and sequenced on a Nextseq 2000 (Illumina). All PCRs were first monitored to determine optimal cycle number by qPCR by adding 0.6x SYBR Green (Thermo) to the PCR reactions.
scRNA-seq processing and quality control were performed as described in the “scRNA-seq data processing” section, with additional puromycin and blasticidin barcode libraries aligned using Cellranger to a custom reference. This reference included puromycin and blasticidin resistance cassette sequences with ‘N’s indicating the positions of the 10nt random barcodes. A custom Python script was used to extract the barcode sequence from each read aligned to these elements, generate a consensus barcode for each UMI, and aggregate UMIs by barcode for each cell. To separate barcode calls corresponding to real integrations from sequencing errors and ambient RNA, for each 10x capture a Gaussian mixture model was fitted to the distribution of UMIs per barcode per cell. Barcode calls corresponding to the lower peak were excluded as likely errors.
Quality control, clone calling, and phylogenetic tree reconstruction for each clone using neighbor joining were performed as described above.
Algorithm for calling barcode groups
A greedy algorithm was used to define barcode groups, which are clusters of cells sharing a set of puromycin or blasticidin barcode sequences. Starting with UMI counts for each barcode in each cell after quality control filtering, the algorithm iteratively performs the following steps until all cells are assigned to a barcode group or no viable seeds remain:
Select a seed: Choose the barcode detected in the highest fraction of unassigned cells that has not been used in a previous iteration.
Identify the cluster: Identify all other barcodes present in more than 80% of the cells where the seed barcode is detected.
Assign cells: Assign all unassigned cells to the barcode group if the barcodes in the cluster account for more than 80% of the total barcode UMIs in those cells.
Validate the group: Retain the barcode group if at least 5 cells are assigned; otherwise, discard the group and mark the cells as unassigned.
This algorithm accurately assigns the vast majority of cells to barcode groups, accounting for cases where groups are distinguished by multiple barcodes and instances where barcodes are occasionally reused between groups (fig. S6).
Quantification of agreement between phylogeny and barcode groups
The Fowlkes–Mallows index (FMI), a metric for quantifying the similarity between two clusterings, was used to assess the agreement between the phylogeny and barcode groups derived from puromycin and blasticidin static barcoding. FMI is defined as:
where is the number of true positives, is the number of false positives, and is the number of false negatives.
To calculate the FMI score for each clone, barcode groups were compared to clades representing the descendants of the ancestral cell that originally received each barcode. The lowest common ancestor (LCA) of each barcode group was inferred by computing the FMI score for all internal nodes in the phylogeny and selecting the node that maximized the FMI for that group. For each clone, FMI scores were calculated separately for puromycin and blasticidin barcode groups, using phylogenies reconstructed with both neighbor joining (NJ) and UPGMA. Cells lacking a barcode excluded from the FMI calculation. FMI scores were also calculated after randomly permuting the barcode group associated with each cell in the phylogeny. FMI values are reported in table S17.
Inference of ancestral LM states
To estimate branch lengths in the phylogeny and collapse branches that are not supported by a lineage mark (LM), we inferred the ancestral LM state for each internal node in the tree using the Sankoff algorithm (154). This algorithm was chosen because it allows for low-probability LM-to-LM transitions, thereby reducing the impact of noise and errors in phylogeny reconstruction on ancestral state inference. For each edit site, the algorithm minimizes the total transition cost across the tree by assigning ancestral states that are most consistent with the LMs observed at the leaves. We implemented the following transition costs:
Unedited → unedited: 0
LM → same LM: 0
Unedited → LM: 0.6
LM → unedited or different LM: 1
These constraints reflect the assumption the state of an edit site should remain the same unless an LM is installed while allowing for some LM-to-LM transitions to account for errors.
Inference of branch lengths
After reconstructing phylogenies and inferring ancestral states, we used ConvexML (v2.0.0) to estimate branch lengths in the tree (155). This maximum likelihood algorithm integrates a statistical model of the editing process with inferred ancestral LM states to efficiently estimate branch lengths. ConvexML assumes that editing occurs at a constant rate, meaning that longer branches–representing ancestral cells that take longer to divide–tend to accumulate more LMs than shorter branches. The agreement between our kinetics data and the expected saturating exponential curve suggests that the constant editing rate assumption holds true for PEtracer data for B16-F10 cells in vitro and for 4T1 both in vitro and in vivo (fig. S4E and table S10).
To generate the final phylogenies, we collapsed all branches that were not supported by a LM. This step introduced multifurcations in the tree, reflecting cases where the relationships between descendant cells could not be fully resolved, such as when one or more cell divisions were not marked by an LM. To perform this collapse, we conducted a reverse depth-first traversal of the tree’s edges. For each edge, if the inferred LM state of a parent node matched that of its child (i.e., no new LMs were acquired along that branch), we removed the edge and directly connected the child to its grandparent. This approach simplified the tree structure, preserving only branches supported by the available lineage tracing data.
Branch lengths were rescaled to reflect the tracing duration for each experiment, and node depths– defined as the number of edges from the root–were calculated for the collapsed tree. For each phylogeny, the “average tree depth” was determined by computing the mean depth of its leaves.
Plotting phylogenetic trees
Phylogenetic trees were stored using TreeData (v0.1.3, https://github.com/YosefLab/treedata) and visualized with Pycea (v0.0.1, https://github.com/YosefLab/pycea). The plots display inferred branch lengths, with internal nodes sorted by subtree size. When intBCs are shown, they are ordered based on the maximum frequency of lineage marks (LMs) at the corresponding edit sites, positioning earlier edits on the left. High-confidence clades were manually selected and numbered according to their location in the phylogeny. Note that similar clade colors do not necessarily indicate close phylogenetic relationships due to multifurcations and arbitrary sorting of internal nodes.
Downsampling analysis of phylogenetic reconstruction accuracy
To assess the impact of detection rate and number of edit sites on the accuracy of phylogenetic reconstruction, we performed a downsampling analysis on the scRNA-seq lineage tracing data with ground-truth barcodes. For each clone, we simulated detection rates ranging from 0.1 to 1 (in increments of 0.1) by randomly dropping out LMs to achieve the target detection rate. Similarly, we simulated different numbers of edit sites ranging from 5 to 45 (in increments of 5) by randomly removing edit sites. Phylogenies were then reconstructed using neighbor joining, and the Fowlkes– Mallows index (FMI) was calculated for puromycin and blasticidin barcode groups to quantify reconstruction accuracy. If the target detection rate or number of edit sites exceeded the detection rate or number of edit sites for that clone no downsampling was performed and the true detection rate or number of edit sites was reported. Each simulation was repeated 10 times per parameter combination, with FMI scores for each iteration reported in tables S19,20 and mean FMI scores reported in the text.
Comparison of pairwise phylogenetic, character, and spatial distance
To compare various distance metrics between cells in a phylogeny, 20,000 random cell pairs were selected, unless the total number of possible pairs was fewer than 20,000, in which case all pairs were used. The distances between each pair were calculated as follows:
Spatial distance: The Euclidean distance between the x-y coordinates of the two cells. For phylogenies spanning multiple sections, pairs from different sections were excluded.
LM distance: The Hamming distance between the set of LMs for each cell in the pair, as described in the “Distance-based tree reconstruction” section.
Phylogenetic distance: The LCA of the cell pair was identified using breadth-first search, and the total path length between the LCA and each of the two cells computed. We use inferred branch lengths for this calculation, so path lengths represent the estimated number of days since the LCA divided.
In vitro 4T1 colony sample preparation and imaging
4T1 cells with high numbers of LTCv2 integrated lineage tracing cassettes and a 24-mer pegArray with no protospacer mismatches that edits to completion within 5 to 7 days were generated as described above (“Generation of 4T1 PEtracer cells edited to completion (fully-edited cells)”). These cells were infected with PEmax-T2A-GFP lentivirus as described above for 1 integrant per cell on day 0 and sorted 48 hours later for high expression of the mCherry-containing tracing cassettes, consistent expression of the BFP-linked pegArray, and for consistent expression of GFP-linked PEmax. Representative sorting gates shown in fig. S16. After sorting, 11,000 actively-editing 4T1 cells were seeded on cleaned, silanized, and PDL-coated 40 mm diameter round coverslips in 60 mm diameter petri-dishes with RPMI media as described for fully-edited in vitro cell samples above.
After 6 days, cells were fixed by treating with freshly made 4% paraformaldehyde in 1x PBS for 10 minutes prior to 3x washes with 1x PBS and storage in 1x PBS supplemented with 0.2% RNase Inhibitor Murine (NEB, M0314L) for up to 8 weeks. Samples were prepared for FISH-based lineage readout as described in the “Sample preparation for In-gel T7 amplification” section and then imaged and analyzed as described above. Representative colonies of varying sizes were imaged across 262 FOVs of one coverslip.
Phylogenetic reconstruction of in vitro 4T1 colonies
Segmented nuclei were grouped into spatially distinct colonies using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm (ε = 60 μm) (156), with colonies containing fewer than 25 cells filtered out. Since colonies could originate from multiple single cells that landed next to each other and therefore consist of more than one clone, clone calling was performed within each colony as described in the “Clone calling using intBCs” section. The only difference between scRNA-seq and imaging-based clone calling was that spectral clustering was used for initial assignments instead of non-negative matrix factorization (NMF) due to the lower proportion of doublets in imaging data.
LM quality control and phylogenetic tree reconstruction using neighbor joining were performed as described above. However, based on thresholds determined through in silico benchmarking and downsampling of scRNA-seq lineage tracing data, only nuclei with >60% intBC detection were included in tree reconstruction with neighbor joining. In some cases, phylogenies contained multiple clones that could not be distinguished by the clone-calling algorithm due to similar intBC compositions. These phylogenies were manually split by clone based on visual inspection of LM data: if no single LM was shared by all cells within a phylogeny, major clades marked by distinct LMs were separated into independent phylogenies.
Branch length estimation was performed as described in the “Inference of branch lengths” section and comparison of phylogenetic and spatial distances performed as described in the “Comparison of pairwise phylogenetic, character, and spatial distance” section. Statistics including number of cells and intBC detection rate for the 64 clones are listed in table S20.
Generation of primary 4T1 tumors
4T1 cells engineered with PEmax editor and lineage tracing cassettes (LTCv0) were generated as described above (“epegRNA mismatch library profiling to tune lineage tracing kinetics”) and barcoded using low MOI lentiviral infection with an sgRNA library that co-expresses PuroR-T2A-BFP (Addgene #1000000091) as described above (“Lentiviral production” and “Generation of 4T1 PEtracer cells with high numbers of integrated tracing cassettes”). Cells were selected with puromycin (2 μg/mL) for five days starting 72 hours after lentiviral infection. 500,000 or 1,000,000 cells were subcutaneously injected into the mammary fat pad of female Balbc/J mice. Following injection, mice were monitored in accordance with IACUC guidelines prior to sacrifice at six weeks post-injection. Mice were sacrificed by CO2 asphyxiation and primary tumor, liver, and lung tissue collected.
Single cell RNA sequencing of primary 4T1 tumors
To inform the design of MERFISH gene panels, we generated scRNA-seq libraries from cancer, immune, and stromal cells isolated from primary 4T1 tumors as well as the lungs and livers of two tumor-bearing mice. Tissue was dissociated into single cell suspension by mincing followed by enzymatic digestion with 300 U/mL collagenase/100 U/mL hyaluronidase (STEMCELL Technologies) and 150 μg/mL DNase I (Sigma Aldrich) in RPMI 1640 (Thermo Fisher Scientific) at 37°C with rotation for 25 minutes (tumor) or 20 minutes (lung/liver). Homogenized tissue was smashed through a 70 μm sterile strainer and incubated with ACK Lysing Buffer (Thermo Fisher Scientific) for 5 minutes at room temperature. Single-cell suspensions were stained in flow buffer (2% fetal bovine serum in 1x PBS) with the following antibodies (1:500 dilution):
anti-CD45 (clone 30-F11, Alexa Fluor 700, Biolegend Cat. No. 103127, RRID AB_493714)
anti-CD11b (clone M1/70, APC, Biolegend Cat. No. 101211, RRID AB_312794)
anti-CD11c (clone N418, APC/Cyanine7, Biolegend Cat. No. 117323, RRID AB_830646) ● anti-TCRb (clone H57–597, PE/Cyanine7, Biolegend Cat. No. 109221, RRID AB_893627) ● anti-Ly-6G/Ly-6C (Gr-1) (clone RB6–8C5, Brilliant Violet 711, Biolegend Cat. No. 108443, RRID AB_2562549)
anti-CD19 (clone 6D5, Brilliant Violet 605, Biolegend Cat. No. 115539, RRID AB_2563067)
Cells were stained for viability using eF506 dye (1:1000, eBioscience). Cells were sorted on BD FACS Aria II for viable cancer cells (mCherry+ CD45−), stromal cells (mCherry− CD45−), neutrophils (CD45+ Gr-1+) and other immune cells (CD45+ Gr-1−). After sorting, cells were pooled together by tissue for three total captures and captured using the 10X Genomics Chromium Next GEM Single Cell 3ʹ Kit v3.1. scRNA-seq libraries were prepared following the manufacturer’s instructions and sequenced on a NovaSeq 6000 (Illumina).
scRNA-seq analysis of primary 4T1 tumors
scRNA-seq data was processed with Cellranger as described above (“scRNA-seq data processing”). Filtered gene-cellBC matrices that contained only cellBCs with UMI counts that passed the threshold for cell detection were used for further analysis. Additional analysis was performed in R (v.4.2.3) using Seurat (157) (v.4.3.0) with default function parameters unless otherwise noted. Doublets were predicted using DoubletFinder (158) (v.2.0.3). Cell types were predicted using SingleR (159) (v.2.0.0) based on mouse bulk RNA-seq reference data (MouseRNAseqData) from celldex (v.1.8.0). Cells with <200 genes detected, >10% mitochondrial RNA content, or predicted doublets from DoubletFinder and cells annotated as Erythrocytes by SingleR were excluded from analysis. We normalized and identified variable features using the Seurat functions NormalizeData and FindVariableFeatures. We then regressed out percent mitochondrial RNA content and number of UMIs and scaled data with Seurat’s ScaleData and ran PCA using variable features with RunPCA. Clusters were identified using shared-nearest-neighbor-based clustering based on the first 10 PCs with resolution = 0.2 with FindNeighbors and FindClusters. The same principal components were used to generate the UMAP projections with RunUMAP. Cell clustering for fine-resolution cell types was performed as described above for malignant/stromal, myeloid, and T cell subsets with the following parameters:
Malignant/stromal: dims = 1:10, resolution = 0.25
Myeloid: dims = 1:15, resolution = 0.25
T cells: dims = 1:10, resolution = 0.3
For cell subsets, cell cycle scoring was performed using Seurat’s CellCycleScoring and S phase/G2M phase scores were regressed out during data scaling. Cell types were assigned using SingleR annotations and manually refined based on expression of known marker genes.
In vivo PEtracer sample generation
4T1 cells with high numbers of LTCv2 integrated lineage tracing cassettes and a 24-mer pegArray with protospacer mismatches designed to tune editing to ~70% completion over a 4–6-week timescale were generated as described above (“Generation of 4T1 PEtracer cells edited to completion [fully-edited cells]”). These cells were infected with lentivirus expressing PEmax-T2A-GFP on day 0 of the experiment with an infection rate of ~30% and thus roughly 1 integrant per cell as above and sorted on day 2 for high mCherry signal, consistent BFP signal, and high GFP signal. Representative sorting gates shown in fig. S16. For mouse 1, these cells were expanded in culture for an additional week prior to a second sort on day 9 of the experiment. These cells were again expanded in vitro prior to injection into mice on day 12 of the experiment. For mouse 2 and 3, higher cell numbers were infected on day 0, an initial sort was performed on day 2 and the second sort was performed on day 5 prior to injection into animals on day 6.
Between 200,000 and 500,000 cells undergoing evolving tracing were injected into the tail vein of female Balbc/J x SELECTIV mice (mouse 1) or Balbc/J x [SELECTIV x CMV-Cre] mice (mouse 2 and mouse 3). Following injection, mice were monitored in accordance with IACUC guidelines prior to sacrifice either at a predetermined time point or when mouse health deteriorated to a point that humane sacrifice was necessary (mouse 1 = 39 days post-injection, mouse 2 = 28 days post-injection, mouse 3 = 35 days post-injection). Experimental endpoints were chosen to minimize morbidity due to excessive tumor burden while generating tumors of sufficient size for imaging and to fully utilize the recording capacity of the PEtracer system. Mice were sacrificed by CO2 asphyxiation prior to perfusion with 1% diethyl pyrocarbonate (DEPC) in sterile PBS. Following perfusion, mouse lungs were dissected out, incubated with DEPC in sterile PBS at 4°C for 3 hours, washed 2x with fresh 1x PBS, and soaked in optimal cutting compound (OCT) for 5–10 minutes at room temperature prior to degassing in a vacuum chamber. Degassed, OCT-embedded samples were frozen on dry ice and stored at −80°C until sectioning.
scRNA-seq analysis of 4T1 in vitro heterogeneity
To characterize in vitro transcriptional heterogeneity and clonal dynamics of the cells used to generate in vivo tumors, we performed scRNA-seq and data processing as described in the “scRNA-seq of fully edited 4T1 cells” and “scRNA-seq data processing” sections. Processed data were analyzed using Scanpy (v1.10.0) (160) following standard best practices: data were normalized and log-transformed, 2,000 highly variable genes were identified, and the data were scaled and reduced to 20 principal components (PCs). A 2D embedding was then generated using UMAP (min_dist = 1). Clustering was performed using the Leiden algorithm (resolution = 0.1), which identified two transcriptional clusters. Differential gene expression between clusters was assessed using the rank_genes_groups function (method = t-test), restricted to genes with mean expression >0.1 across cells. Epithelial-to-mesenchymal transition (EMT)-related genes were highlighted using the EMTome database (http://www.emtome.org/) (161).
In vivo MERFISH sample preparation
Frozen lungs were sectioned around −15 to −20°C on a cryostat (Leica CM3050s). We started collecting adjacent 20 μm tissue sections (10–20 total sections) once tumors were visible during cryosectioning. For large tumors, we performed multiple collections of tissue sections at different depths (trimming 100–200 μm between collections) to sample the entirety of the tumor volume. Tissue slices were collected on silanized and PDL-treated 40 mm coverslips, and fixed by treating with freshly made 4% paraformaldehyde in 1x PBS for 15 minutes and were washed three times with 1x PBS and stored in 70% ethanol at 4°C for at least 18 hours to permeabilize cell membranes. Tissue sections that did not directly proceed to MERFISH hybridization were stored in 70% ethanol at 4°C for no longer than 2 months. For each collection of sections, we selected a representative section with good morphology and tumor representation for MERFISH imaging and lineage readout to identify tumors with optimal editing kinetics that were selected for additional imaging. Sections with successful MERFISH and lineage imaging were included in further analysis.
Design of MERFISH encoding probes
To discriminate transcriptionally distinct cell types in 4T1 tumor and probe the related cell physiology, we assembled the 124-gene MERFISH panel from two categories:
Differentially expressed genes calculated from our annotated 10X scRNA-seq results. A Wilcoxon test was performed to identify the most significant differentially expressed genes.
Manually selected genes including canonical cell type markers.
For these candidate genes, we identified all possible 30nt probe binding sites within annotated transcripts and kept candidate probe sequences that satisfy the following criteria using the MERFISH probe designer (https://github.com/xingjiepan/MERFISH_probe_design): 68°C<Tm<83°C given the salt concentration in 2x SSC buffer; 40%<GC content<70%; no 15-mer match with rRNA and tRNA; transcriptome specificity >99%. After filtering probe binding sites, genes with fewer than 400 possible probe binding sites (roughly 500nt in transcript length) were removed.
For the selected genes, we designed a codebook by importing a covering design (https://ljcr.dmgordon.org/cover/links.html with parameters: v=16 or 18, k=4, t=3), and then converting this design into a Hamming distance 4, Hamming weight 6 N-bit Hamming code (N=16 for the 124-gene MERFISH library, N=18 for the 175-gene MERFISH library) by removing codewords with less than 4 Hamming distance to other codewords. We then assigned the Hamming codes to genes by balancing the cell-type averaged UMI counts from reference scRNA-seq data across all bits and all cell types.
We then followed standard MERFISH library design procedures to generate the probe library. For each gene, we first assembled candidate probes by appending 3 readout sequences to the probe targeting sequences (reverse-complement of the probe targeting sequences designed above). These 3 readouts were randomly chosen from 4 assigned readouts assigned to this gene as indicated in the codebook and an extra “A” was added in between readout sequences and probe targeting sequences. Next, we appended 20nt forward and reverse primers to the 5’ and 3’ of these candidate probes. We then selected 90 to 96 of the overlapping probes for each gene among all possible candidate probes while minimizing the overlap binding to the target transcripts between neighboring probes. Then these fully assembled probes were further screened for cross-hybridization and any probe sharing >17nt of reverse-complement sequence with other probes was dropped. Full probe library sequences could be found in table S21.
To improve cell type resolution and monitor tumor progression we assembled the 175-gene MERFISH library from the following categories:
Genes from our initial 124-gene MERFISH library after removing genes with no obvious spatial or cell type specific enrichment (110 genes)
Cell cycle marker genes for identifying cell cycle states (7 genes)
EMT markers (5 genes)
Hypoxia markers (2 genes)
Additional FGF pathway markers (3 genes)
Finer cell type markers from existing references (34 genes)
Hits from a CRISPR screen in 4T1 cells (114) (4 genes)
Probes for this final set of selected genes were filtered with the same design criteria as the 124-gene library described above. For this 175-gene MERFISH library we used 18 choose 4 Hamming codes. Full probe library sequences could be found in table S25.
MERFISH encoding probe hybridization and tissue clearing
Both in vitro and in vivo samples were washed once with 2x SSC prior to equilibration in encoding-probe wash buffer (30% formamide in 2x SSC supplemented with 0.1% Tween-20) for 10– 15 minutes at room temperature. Encoding-probe wash buffer was then aspirated from the coverslip, and the coverslip was inverted onto a 50 μL droplet of probe mixture on a parafilm-covered 60 mm diameter petri-dish. The probe mixture comprised 1 nM of each encoding probe for MERFISH imaging, approximately 10 nM of each encoding probe for the two-color mCherry single-molecule FISH imaging, and 1 μM of a polyA-anchor probe (/5Acryd/TTGAGTGGATGGAGTGTAATT+TT+TT+TT+TT+TT+TT+TT+TT+TT+T, where T+ represents a locked nucleic acid T nucleobase, and /5Acryd/ is a 5′ acrydite modification available through IDT) in 2x SSC with 30% v/v formamide, 0.1% wt/v yeast tRNA (approximately, Life Technologies) and 10% v/v dextran sulfate (D8906, Sigma). We then incubated the sample at 37°C for 36–48 hours. The polyA-anchor probe was hybridized to polyA sequences on polyadenylated endogenous mRNAs, thereby anchoring these RNAs to the polyacrylamide gel. After hybridization, the samples were washed in the encoding-probe wash buffer for 30 minutes at 47°C for a total of two times to remove excess encoding probes and polyA-anchor probes. The samples were washed once with 2x SSC prior to poly-acrylamide gel-embedding.
Samples were embedded and tissue cleared similarly to section “Sample preparation for In-gel T7 amplification with fully-edited 4T1 cells” with the following modifications. Following tissue clearing, samples were washed in 2x SSC buffer supplemented with 0.2% (vol/vol) Rnase Inhibitor Murine (NEB) at room temperature for 2 hours. We changed the buffer every 20 minutes to ensure sufficient washing. Samples could be stored at 2x SSC supplemented with 0.2% Rnase Inhibitor Murine at 4°C for up to 2 weeks.
MERFISH Imaging
MERFISH imaging was performed for each sample as described above in the section “Imaging of integration barcodes and lineage marks” with the following changes: two adaptors and two readout probes were used in each round of imaging; z-stack images consisting of 24 or 25 planes separated by 1.2 μm were collected for all channels; the number of imaging rounds was adjusted for each MERFISH library, specifically, for the 124-gene MERFISH library we used 16-bits (8 imaging rounds with 2 bits per round) and for the 175-gene MERFISH library we used 18-bits (9 imaging rounds with 2 bits per round). Four sections were imaged for mice 1 and 3, and 1 section was imaged for mouse 2. The number of fields of view and the set of tumors captured for each section is listed in table S22.
MERFISH decoding
All MERFISH image analysis was performed using MERlin (available at https://github.com/zhengpuas47/MERlin) (162), as previously described (152). First, we identified the locations of fiducial beads in each FOV in each round of imaging and used these locations to determine the x–y drift in the stage position relative to the first round of imaging and to align images for each FOV across all imaging rounds. We then high-pass filtered the MERFISH image stacks for each FOV to remove background, performed ten rounds of Lucy-Richardson deconvolution to tighten RNA spots, and low-pass filtered them to account for small movements in the centroid of RNAs between imaging rounds. Individual RNA molecules imaged by MERFISH were identified by our previously-published pixel-based decoding algorithm using MERlin (162). After assigning barcodes to each pixel independently, we aggregated adjacent pixels that were assigned with the same barcodes into putative RNA molecules, and then filtered the list of putative RNA molecules to enrich for correctly identified transcripts as previously described for a gross barcode misidentification rate at 5% using MERlin.
Nuclei segmentation for in vivo samples
Nuclei segmentation for in vivo samples was performed using Cellpose, following a similar approach to that used for in vitro nuclei segmentation (see “Nuclei segmentation for in vitro samples” section for reference) (98). However, modifications were necessary to accurately segment the small, densely packed nuclei observed in vivo. To enhance image sharpness and contrast of DAPI images before segmentation, we applied the Deconwolf deconvolution algorithm (99). Deconwolf was run for 100 iterations, with each field of view (FOV) divided into four tiles to facilitate computation on an Nvidia GeForce GTX 2080ti GPU. The resulting deconvolved images were segmented using Cellpose (v3.1.0) with nucleus diameter set to 17 pixels. Nuclei with a volume <100 μm3 were excluded from analysis. The number of nuclei detected per tumor section is listed in table S22.
Assignment of MERFISH-decoded transcripts to cells
Using the 3-D coordinates from MERlin, transcripts were assigned to nuclei following the same procedure as for T7 amplicons (see “Assignment of T7 amplicons to cells”). However, rather than assigning cytoplasmic transcripts to the nearest nucleus, we used Proseg (v1.1.3) to infer the most likely cell boundaries based on the assumption that cytoplasmic gene expression mirrors that of the nucleus (102). Proseg was run separately for each tissue section with the following parameters: voxel-layers = 3, ncomponents = 10, nbglayers = 10, enforce-connectivity = True, max-transcript-nucleus-distance = 10, and nuclear-reassignment-prob = 0.2. The cells identified by Proseg were then mapped back to nuclear masks based on their assigned transcripts. Each Proseg cell was assigned to the nucleus with the highest Jaccard similarity, provided a similarity >0.4, which helped filter out spurious Proseg cell calls. Transcripts outside of nuclei were assigned using this mapping if the Proseg assignment probability was >0.5, while transcripts already within nuclei retained their original assignments. The total number of assigned transcripts and mean number of transcripts per cell for each tumor section is listed in table S22.
MERFISH cell clustering and annotation
RNA count matrices were used for cell clustering and annotation with resolVI (100), a scvi-tools (163) (version 1.3.0) method for deconvolution and label transfer of spatial transcriptomics data. Separate models were trained for the 124-gene and 175-gene MERFISH libraries using sections M1 S3 and M3 S2, respectively. For model training, initial cell clustering was performed with Scanpy (v1.10.0) following standard preprocessing steps: cells with fewer than 5 genes or 20 transcripts were removed, data were normalized and log-transformed, total transcript counts were regressed out, and the data were scaled and reduced in dimensionality (20 PCs). Clusters were then identified using the Leiden algorithm (resolution = 0.6), and UMAP embeddings (min_dist = 0.25) were generated for visualization. Fine resolution clustering was achieved by re-clustering individual clusters with varying resolutions (0.1–0.4).
Labeled RNA count matrices were used as input for the resolVI model, which was trained for 100 epochs under a semi-supervised framework without downsampling counts. The trained model was used for reference mapping with the full dataset from all sections imaged using the same MERFISH library, with input data filtered, normalized, regressed, and scaled as described, and then projected. Cells were annotated by predicting labels and retaining those with a probability >0.5. Diffusion and background transcript proportions were calculated, and cells with >20% diffusion or background transcripts were removed. The latent representation from resolVI served as input for a final UMAP projection (min_dist = 0.2).
For cells profiled with the 175-gene MERFISH library cell cycle phase was classified using the “score_genes_cell_cycle” function implemented by Scanpy (v1.10.0). Dscc1, Msh2, Rad51, Rpa2 and Ung were used to score S phase, while Cdca2, Kif2c, Ncapd2, Nek2 were used to score G2/M.
Comparison of MERFISH and scRNA-seq data
To assess MERFISH data quality, we compared correlation between MERFISH replicates as well as the correlation between MERFISH, bulk RNA-seq and scRNA-seq. For comparison between MERFISH replicates, we computed the average counts/cell for each transcript and calculated the Pearson correlation between two sections (M1 S1 and M1 S2). For comparison between MERFISH and bulk RNA-seq, we obtained normalized counts per million (CPM) data from published 4T1derived lung metastasis bulk RNA-seq data (106) and calculated the Pearson correlation with average counts/cell obtained by MERFISH. To compare to scRNA-seq data described above (“Single cell RNA analysis of primary 4T1 tumors”), we computed the average normalized counts/cell per scRNA-seq or MERFISH cluster and calculated the pairwise Pearson correlation between cell types, including all cell types that were identified in both datasets.
Cell type neighborhood enrichment and density analysis
Cell neighborhood enrichment analysis was performed using Squidpy (v1.6.2). First, spatial neighborhoods were computed by identifying the 6 nearest neighbors of each cell based on 2-D Euclidean distance, then neighborhood enrichment was assessed with the “nhood_enrichment” function.
Tumor and normal lung boundaries were delineated through a series of geometric operations and represented as polygons using geopandas (v1.0.1). To construct tumor boundaries, malignant cell centroids were dilated, and the outer edge of their combined area was determined. In contrast, normal lung boundaries were defined by subtracting the tumor boundary polygons from the union of dilated AT1/AT2 and Club cell centroids.
For each cell, distances to both tumor and lung boundaries were determined using the geopandas “sjoin_nearest” function. The distance to the lung/tumor interface was defined differently depending on cell location: for cells within the tumor boundary, it was the negative of the lung boundary distance; for cells outside the tumor boundary, it was the distance to the tumor boundary. These distances were scaled per tumor in each section by dividing by the maximum interface distance observed.
Cell type densities (cells per mm2) were determined by counting the total number of cells of each type within a 0.1 mm radius circle centered on each cell and dividing by the area of that circle. Rolling averages of these densities, computed along the gradient from normal lung to the outer tumor edge, were calculated using windows of 2,000 cells then scaled by dividing by the maximum average density for each cell type.
FISH-based lineage readout after MERFISH for in vivo samples
After MERFISH imaging, samples were removed from FCS2 flow chamber, washed once with hybridization wash buffer (2x SSC and 30% v/v formamide), further digested, washed, and T7 in situ amplified as described in the“Sample preparation for In-gel T7 amplification” section. Samples were then imaged and analyzed as described above.
Sample alignment between MERFISH and lineage imaging
To align samples on the home-built imaging platform for both MERFISH and lineage readouts, during MERFISH imaging we captured DAPI images using the 10x objective across entire tissue sections. During the setup of lineage imaging, we first adjusted the orientation of samples manually to roughly align their orientations with that used for MERFISH imaging. We then captured a new DAPI mosaic image for each complete tissue section using the 10x objective. Between the DAPI images captured for each sample during MERFISH and lineage imaging, we manually picked 10 positions as featured anchoring points and recorded their x-y coordinates. From these paired coordinates we performed a singular value decomposition (SVD) of the multiplicated position matrix to acquire a rotation and translation matrix by assuming samples only experienced affine transformation. Given the rotation and translation that we calculated, we applied the transformation to the center coordinates of the FOVs imaged during MERFISH for each sample and used these transformed coordinates as the new FOV coordinates for lineage imaging.
After image acquisition, we corrected residual alignment issues and sample distortions computationally. We generated large mosaic images by combining DAPI images from all fields of view acquired during the MERFISH and lineage readouts. These mosaics were downsampled by a factor of 4, smoothed with an unsharp mask (Gaussian sigma = 5 pixels), intensity-rescaled by clipping values >98th percentile, and rotated by an angle determined using anchoring points. Residual x-y drift was corrected by using phase cross-correlation to calculate the shift between the mosaics, and the lineage mosaic was adjusted accordingly. To address non-linear distortions, TV-L1 optical flow (implemented in scikit-image v0.24.0) (164) was used to compute a vector field aligning the two mosaics. This vector field was divided into 100×100 pixel tiles, and for each tile the mean, intensity-weighted shift was calculated. The global x-y shift was then added to the local shift for each tile, and the tiles were saved. In experiments combining MERFISH and lineage imaging, nuclear masks were generated from the MERFISH images, and T7 amplicons were assigned to these masks after adjusting their 3-D coordinates based on the corresponding tile.
Generation, preparation, and imaging of primary tumors with fully-edited 4T1 cells
Primary tumors were generated as above with fully-edited 4T1 cells used for in vitro validation experiments. 1,000,000 cells were subcutaneously injected into the mammary fat pad of female Balbc/J mice. Following injection, mice were monitored in accordance with IACUC guidelines prior to sacrifice at 2.5 weeks post-injection. Mice were sacrificed by CO2 asphyxiation and tumors collected.
Primary tumors embedded in OCT were prepared as in section “In vivo MERFISH sample preparation”; Specifically, the 20 μm sections were placed on treated coverslips and treated as MERFISH samples but skipping MERFISH imaging procedures. Samples were prepared for FISH-based lineage readout as described in the “Sample preparation for In-gel T7 amplification” section and then imaged and analyzed as described above. 146 FOVs were imaged, capturing the entirety of one primary tumor in one section.
Processing of lineage data from in vivo tumor sections
T7 amplicons in lineage readout images were detected, quantified, and decoded as described above, with one key modification in the intBC decoding step. Instead of using a whitelist of all 2,171 possible intBCs, we limited it to the 36 intBCs found in the tumor-initiating clones. This smaller whitelist improved decoding under higher in vivo noise levels and allowed us to increase the maximum allowable distance between the codeword and the intensity profile to 2, thanks to the increased Hamming distances among the 36 intBCs.
LM decoding and quality control were carried out as described earlier, using a logistic regression classifier trained on fully-edited in vitro cells. For the primary tumor containing fully-edited cells, clones were identified based on their Jaccard similarity to intBCs and LMs in the fully-edited clone whitelist (see “Quantification of lineage tracing data accuracy in fully-edited cells” section for details). Cells with a Jaccard similarity <0.2 were not assigned to a clone, as they likely represent non-malignant cells lacking lineage tracing cassettes. LM decoding accuracy was then evaluated for malignant cells, serving as an out-of-sample validation of the classifier trained on in vitro data.
To account for imperfect nuclei segmentation and sample distortion between the MERFISH and lineage imaging, decoded intBCs were assigned to a whitelist of malignant cells derived from the MERFISH data. If the position of the decoded T7 amplicon was just outside of a malignant cell’s nuclear segmentation mask, it was assigned to that cell even if it was closer to a non-malignant cell, as long as the distance to the nuclear mask was less than 4 μm.
Reconstruction of in vivo tumor phylogenies
Spatial data from all tissue sections for each mouse were manually aligned in the x-y plane using tissue morphology. Cells were then grouped into spatially distinct tumors using the DBSCAN clustering algorithm (ε = 0.3 mm), with abutting tumors separated manually. Clone calling, as described in the “Clone calling using intBCs” section, was performed for each tumor to determine its clonality and the corresponding set of intBCs. As with the colony data, spectral clustering was used for clone calling instead of NMF, due to the lower proportion of doublets.
Phylogenies for malignant cells were reconstructed using neighbor joining, as detailed in the “Distance-based tree reconstruction” section, incorporating data from all sections for each tumor, except for one mouse 3 section which was excluded because of low intBC detection. For tumor 2 in mouse 2 (M2 T2), separate phylogenies were built for the two contributing clones. To account for partial nuclei generated by tissue sectioning, only nuclei with centroids located more than 5 μm from the top or bottom of the section were used. Among these intact nuclei, the intBC detection rate was calculated, and only cells with a detection rate >60% were used for tree reconstruction. To improve tree reconstruction with missing data, edit sites with LMs shared by all cells were excluded, and select edit sites with LMs defining early splits were manually upweighted by a factor of 2 before calculating the Hamming distance matrix. Statistics for the 12 reconstructed tumor phylogenies are reported in table S23.
Further analysis was performed on phylogenies with edit fractions >50% (M1 T1, M1 T2, M1 T4, M2 T5, and M1 T1). This included inference of ancestral LM states and branch lengths as described in the “Inference of branch lengths” section and comparison of phylogenetic, character, and spatial distance as described in the “Comparison of pairwise phylogenetic, character, and spatial distance” section. To compute the LM installation efficiency and account for edits that occur at different depths of the tree, we computed the overall transition probability from the unedited state to a given lineage mark based on transitions along each branch in the tree. The resulting LM installation efficiency is normalized by tumor and edit site to account for differences in LM installation rate.
Linking in vitro 4T1 cells to in vivo tumors
To investigate how in vitro heterogeneity in PEtracer 4T1 cells relates to their potential to form tumors in vivo, we mapped in vitro cells (see “scRNA-seq analysis of 4T1 in vitro heterogeneity” section) to individual tumors based on their detected intBCs. For each in vitro cell, the 30-nt sequencing barcode was matched to its corresponding 183-nt imaging barcode, as described in the “scRNA-seq of fully edited 4T1 cells” section. We then calculated the Jaccard similarity between these barcodes and the barcode whitelist associated with each tumor.
Reconstruction of tumor growth dynamics
To reconstruct the tumor growth in 3-D we estimated the z-position of sections for mouse 1 tumor 1 (M1 T1) based on section number and thickness, and then inferred the position of each phylogenetic clade as the mean 3-D position of cells within that clade. The inferred tumor origin was determined by averaging the 3-D positions of all cells in the phylogeny, and the positions of ancestral nodes between the origin and each clade’s common ancestor were inferred by averaging the positions of their descendants with the inferred tumor origin.
To estimate the number of extant cells in each clade during the growth of M1 T1, we divided the timeline from day 0 to 39 into 20 discrete timepoints. At each timepoint, we cut the phylogenetic tree at the corresponding depth and counted the number of branches for each clade. These branches represent ancestral cells that were alive at that time point.
Calculation of local LM diversity
We define local character diversity as the average pairwise LM distance between all cells within a 100 μm radius of a given cell. To compute this metric, we first identify spatial neighbors based on x-y coordinates and then extract the corresponding M × M subset from the N × N weighted Hamming distance matrix given the M nearest neighbors. The mean pairwise distance within this subset represents the local character diversity. Cells in phylogenetically diverse regions have higher local character diversity scores, as neighboring cells are less likely to share similar LMs.
Estimation of cell fitness
Phylogenetic fitness was estimated using the infer_fitness function from the Jungle package (https://github.com/felixhorns/jungle) which implements the probabilistic fitness estimation algorithm described by Neher et al. 2014 (165). Conceptually, this algorithm leverages two key assumptions:
Fitter cells divide more rapidly, producing more offspring and leading to shorter branch lengths.
Fitness changes gradually along lineages, meaning a cell’s fitness is expected to be similar to that of its ancestors and relatives.
Phylogenetic fitness was calculated using the inferred branch lengths generated by ConvexML, as described in the “Branch Length Inference” section.
As an alternative tree-independent fitness metric, we calculated the mean neighbor LM distance for each cell in the phylogeny. Using the N-by-N weighted Hamming distance matrix, computed as described in the “Distance-Based Tree Reconstruction” section, we identified the 20 nearest neighbors of each cell and calculated the mean LM distance among them. This metric is based on the expectation that fitter cells will have closely related neighbors with similar LMs, as there has been less time for additional LMs to accumulate compared to less fit cells.
Fitness correlation analysis
To identify gene expression patterns, spatial positions, and cell type neighborhoods associated with increased phylogenetic fitness, we computed the Pearson correlation between fitness and each feature for every tumor with fitness estimates. For gene expression, raw counts were used; for spatial positioning, we employed distance to the tumor boundary and the lung/tumor interface; and for cell type neighborhoods, we used local cell type density (see “Cell type neighborhood enrichment and density analysis” for details on boundary distance and density calculations). Additionally, we calculated the Pearson correlation between these features and the distance to the tumor boundary to separate the effects of cellular features from those related to tumor position. P-values were derived using a two-sided Student’s t-test and adjusted for multiple comparisons with the Benjamini-Hochberg procedure. Results from this analysis are reported in table S24.
Comparison of MERFISH libraries
To compare gene expression data generated by our 124- and 175-gene MERFISH libraries, we computed the average normalized counts/cell per annotated cluster across all samples profiled using each library and calculated the pairwise Pearson correlation between cell types, including all cell types that were identified using both libraries.
Cancer cell module analysis
Hotspot (v1.1.1) was used to call spatially coherent gene expression modules within the mouse 3 malignant cell population (107). Modules were identified using sample M3 S2. Spatially variable genes were selected based on Geary’s C autocorrelation (calculated with 100 neighbors), retaining genes with Geary’s C >0.05 and an average expression >0.1 counts per cell. Hotspot was then run with a minimum_gene_threshold=4 and fdr_threshold=0.01. For each M3 section, scores for the four modules identified in S2 were calculated using the same parameters. Each cell was assigned to the module with the highest score.
To identify cell types associated with each module, we calculated the mean density (see “Cell type neighborhood enrichment and density analysis” for reference) of each cell type across cells in each module, and then z-scored these values. For genes selected by Hotspot to distinguish between modules, we calculated their mean expression by aggregating normalized counts across cells in each module, and then z-scored these values.
Transition probabilities between Hotspot modules were calculated by first identifying the set of phylogenetic neighbors for each cell–those with a last common ancestor within 10 days. Then, for each module, we computed the fraction of these neighbors assigned to every other module, yielding the transition probabilities between modules.
Moran’s I statistic for heritability and spatial variation
We use Moran’s autocorrelation statistic to quantify the heritability and spatial variation of gene expression values and cancer cell module scores within a phylogeny. While Moran’s is commonly applied to spatial data to measure the clustering or dispersion of a variable, it can also be used in phylogenetics by replacing the spatial neighbor graph with a phylogenetic neighbor graph (166). Moran’s is defined as:
where is the z-scored variable of interest, represents the connectivity between cells and ( for neighbors and 0 otherwise), and is the number of cells. To construct the phylogenetic neighbor graph, each cell in the phylogeny is connected to all other cells that share a LCA within the last 10 days. Similarly, to construct a spatial neighbor graph, each malignant cell is connected to all other malignant cells within 100 μm. Moran’s is then computed for z-scored MERFISH counts and cancer cell module scores. P-values are calculated with a permutation test (1,000 permutations) and adjusted for multiple comparisons with the Benjamini-Hochberg procedure (167). Results from this analysis are reported in table S26.
Supplementary Material
Supplementary Materials
Acknowledgments:
We thank members of the Weissman, Yosef, and Zhuang labs for helpful discussions. We thank Jeffrey Moffitt for helpful discussions and advice on MERFISH sample preparation. We also thank the Whitehead Flow Cytometry Core, the Whitehead Genome Technology Core, the Koch Institute Hope Babette Tang Histology Facility, and the Broad Institute Klarman Cell Observatory.
Funding:
Helen Hay Whitney/HHMI fellowship and Eunice Kennedy Shriver NICHD Pathway to Independence Award NIH K99HD118574 (L.W.K.)
National Institutes of Health grant K00CA253729 (K.E.Y.)
Fayez Sarofim Fellowship of Damon Runyon Cancer Research Foundation (P. Z.)
Damon Runyon Dale Frey Award (D.Y.)
NCI Transition Career Development Award 1K22CA289207 (D.Y.)
NIH Director’s New Innovator Award 1DP2OD037078 (D.Y.)
Lung Cancer Research Foundation Leading Edge Award (D.Y.)
NCI Pathway to Independence Award NIH K99CA286968 (M.G.J.)
National Institutes of Health grant 3T32DK007191–50S1 (J.S.)
Jane Coffin Childs Memorial Fund for Medical Research (A.S.)
Harvard Society of Fellows (R.A.S. and W.E.A)
Burroughs-Wellcome Fund Career Award at the Scientific Interface (W.E.A)
Howard Hughes Medical Institute (X.Z. and J.S.W.)
National Institutes of Health (NIH) Centers of Excellence in Genomic Science (RM1HG009490) (J.S.W.)
Chan Zuckerberg Initiative 2024–346405 (5022) (J.S.W.)
The Ludwig Center at MIT (J.S.W.)
Melanoma Research Alliance (C-13384133) (J.S.W.)
Bay Area Cancer Target Discover and Development (14198SC) (J.S.W.)
Mathers Charitable Foundation (J.S.W.)
Footnotes
Competing interests: M.G.J. consults for and has equity in Vevo Therapeutics. K.E.Y. is a consultant for Cartography Biosciences and Curie Bio. D.Y. declares outside interest in DEM Biopharma. W.N.C. consults for Merck Pharmaceuticals. X.Z. is a co-founder and consultant of Vizgen, Inc. N.Y. is a consultant for Cytoreason Inc. R.A.S. and W.E.A. are consultants for Xaira Therapeutics. J.S.W. declares outside interest in 5 AM Venture, Amgen, nChroma Bio, DEM Biosciences, KSQ Therapeutics, Maze Therapeutics, Tenaya Therapeutics, Tessera Therapeutics, Thermo Fisher, Third Rock Ventures, and Xaira Therapeutics.
Data and materials availability:
Plasmids generated in this study will be available on Addgene. All scRNA sequencing data generated in this study are available under GEO accession GSE290975. All other sequencing data is available on SRA under BioProject accession PRJNA1231108. Additional processed data including MERFISH data and lineage reconstructions are available on Figshare (https://figshare.com/articles/dataset/PEtracer_Data/28473866, DOI 10.6084/m9.figshare.28473866). Custom code used in this study is available at https://github.com/jweissmanlab/PEtracer-2025 (168).
References and Notes
- 1.Stadler T, Pybus OG, Stumpf MPH, Phylodynamics for cell biologists. Science 371, eaah6266 (2021). [DOI] [PubMed] [Google Scholar]
- 2.Domcke S, Shendure J, A reference cell tree will serve science better than a reference cell atlas. Cell 186, 1103–1114 (2023). [DOI] [PubMed] [Google Scholar]
- 3.Wagner DE, Klein AM, Lineage tracing meets single-cell omics: opportunities and challenges. Nat Rev Genet 21, 410–427 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Sulston JE, Schierenberg E, White JG, Thomson JN, The embryonic cell lineage of the nematode Caenorhabditis elegans. Dev Biol 100, 64–119 (1983). [DOI] [PubMed] [Google Scholar]
- 5.Qiu C, Martin BK, Welsh IC, Daza RM, Le T-M, Huang X, et al. , A single-cell time-lapse of mouse prenatal development from gastrula to birth. Nature 626, 1084–1093 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Qiu C, Cao J, Martin BK, Li T, Welsh IC, Srivatsan S, et al. , Systematic reconstruction of cellular trajectories across mouse embryogenesis. Nat Genet 54, 328–341 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Wolpert L, Positional information and the spatial pattern of cellular differentiation. J Theor Biol 25, 1–47 (1969). [DOI] [PubMed] [Google Scholar]
- 8.Chen A, Liao S, Cheng M, Ma K, Wu L, Lai Y, et al. , Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoball-patterned arrays. Cell 185, 1777–1792.e21 (2022). [DOI] [PubMed] [Google Scholar]
- 9.Metzger RJ, Klein OD, Martin GR, Krasnow MA, The branching programme of mouse lung development. Nature 453, 745–750 (2008). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Sankaran VG, Weissman JS, Zon LI, Cellular barcoding to decipher clonal dynamics in disease. Science 378, eabm5874 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Lareau CA, Ludwig LS, Muus C, Gohil SH, Zhao T, Chiang Z, et al. , Massively parallel single-cell mitochondrial DNA genotyping and chromatin profiling. Nat Biotechnol 39, 451–461 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Treutlein B, Brownfield DG, Wu AR, Neff NF, Mantalas GL, Espinoza FH, et al. , Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq. Nature 509, 371–375 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Jones MG, Yang D, Weissman JS, New Tools for Lineage Tracing in Cancer In Vivo. Annual Review of Cancer Biology 7, 111–129 (2023). [Google Scholar]
- 14.McGranahan N, Swanton C, Clonal Heterogeneity and Tumor Evolution: Past, Present, and the Future. Cell 168, 613–628 (2017). [DOI] [PubMed] [Google Scholar]
- 15.Zahir N, Sun R, Gallahan D, Gatenby RA, Curtis C, Characterizing the ecological and evolutionary dynamics of cancer. Nat Genet 52, 759–767 (2020). [DOI] [PubMed] [Google Scholar]
- 16.Dhainaut M, Rose SA, Akturk G, Wroblewska A, Nielsen SR, Park ES, et al. , Spatial CRISPR genomics identifies regulators of the tumor microenvironment. Cell 185, 1223–1239.e20 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Baslan T, Morris JP, Zhao Z, Reyes J, Ho Y-J, Tsanov KM, et al. , Ordered and deterministic cancer genome evolution after p53 loss. Nature 608, 795–802 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Ku SY, Rosario S, Wang Y, Mu P, Seshadri M, Goodrich ZW, et al. , Rb1 and Trp53 cooperate to suppress prostate cancer lineage plasticity, metastasis, and antiandrogen resistance. Science 355, 78–83 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Mu P, Zhang Z, Benelli M, Karthaus WR, Hoover E, Chen C-C, et al. , SOX2 promotes lineage plasticity and antiandrogen resistance in TP53- and RB1-deficient prostate cancer. Science 355, 84–88 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Chan JM, Zaidi S, Love JR, Zhao JL, Setty M, Wadosky KM, et al. , Lineage plasticity in prostate cancer depends on JAK/STAT inflammatory signaling. Science 377, 1180–1191 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.VanHorn S, Morris SA, Next-Generation Lineage Tracing and Fate Mapping to Interrogate Development. Dev Cell 56, 7–21 (2021). [DOI] [PubMed] [Google Scholar]
- 22.Al Bakir M, Huebner A, Martínez-Ruiz C, Grigoriadis K, Watkins TBK, Pich O, et al. , The evolution of non-small cell lung cancer metastases in TRACERx. Nature 616, 534–542 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Frankell AM, Dietzen M, Al Bakir M, Lim EL, Karasaki T, Ward S, et al. , The evolution of lung cancer and impact of subclonal selection in TRACERx. Nature 616, 525–533 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Ludwig LS, Lareau CA, Ulirsch JC, Christian E, Muus C, Li LH, et al. , Lineage Tracing in Humans Enabled by Mitochondrial Mutations and Single-Cell Genomics. Cell 176, 1325–1339.e22 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Sottoriva A, Kang H, Ma Z, Graham TA, Salomon MP, Zhao J, et al. , A Big Bang model of human colorectal tumor growth. Nat Genet 47, 209–216 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Martincorena I, Roshan A, Gerstung M, Ellis P, Van Loo P, McLaren S, et al. , Tumor evolution. High burden and pervasive positive selection of somatic mutations in normal human skin. Science 348, 880–886 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Mitchell E, Spencer Chapman M, Williams N, Dawson KJ, Mende N, Calderbank EF, et al. , Clonal dynamics of haematopoiesis across the human lifespan. Nature 606, 343–350 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Biswas D, Birkbak NJ, Rosenthal R, Hiley CT, Lim EL, Papp K, et al. , A clonal expression biomarker associates with lung cancer mortality. Nat Med 25, 1540–1548 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Rosenthal R, Cadieux EL, Salgado R, Bakir MA, Moore DA, Hiley CT, et al. , Neoantigen-directed immune escape in lung cancer evolution. Nature 567, 479–485 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Coorens THH, Moore L, Robinson PS, Sanghvi R, Christopher J, Hewinson J, et al. , Extensive phylogenies of human development inferred from somatic mutations. Nature 597, 387–392 (2021). [DOI] [PubMed] [Google Scholar]
- 31.Park S, Mali NM, Kim R, Choi J-W, Lee J, Lim J, et al. , Clonal dynamics in early human embryogenesis inferred from somatic mutation. Nature 597, 393–397 (2021). [DOI] [PubMed] [Google Scholar]
- 32.Spencer Chapman M, Ranzoni AM, Myers B, Williams N, Coorens THH, Mitchell E, et al. , Lineage tracing of human development through somatic mutations. Nature 595, 85–90 (2021). [DOI] [PubMed] [Google Scholar]
- 33.Tzouanacou E, Wegener A, Wymeersch FJ, Wilson V, Nicolas J-F, Redefining the progression of lineage segregations during mammalian embryogenesis by clonal analysis. Dev Cell 17, 365–376 (2009). [DOI] [PubMed] [Google Scholar]
- 34.Chow K-HK, Budde MW, Granados AA, Cabrera M, Yoon S, Cho S, et al. , Imaging cell lineage with a synthetic digital recording system. Science 372, eabb3099 (2021). [DOI] [PubMed] [Google Scholar]
- 35.Livet J, Weissman TA, Kang H, Draft RW, Lu J, Bennis RA, et al. , Transgenic strategies for combinatorial expression of fluorescent proteins in the nervous system. Nature 450, 56–62 (2007). [DOI] [PubMed] [Google Scholar]
- 36.Snippert HJ, van der Flier LG, Sato T, van Es JH, van den Born M, Kroon-Veenboer C, et al. , Intestinal crypt homeostasis results from neutral competition between symmetrically dividing Lgr5 stem cells. Cell 143, 134–144 (2010). [DOI] [PubMed] [Google Scholar]
- 37.Rodrigo Albors A, Halley PA, Storey KG, Lineage tracing of axial progenitors using Nkx1–2CreERT2 mice defines their trunk and tail contributions. Development 145, dev164319 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Weinreb C, Rodriguez-Fraticelli A, Camargo FD, Klein AM, Lineage tracing on transcriptional landscapes links state to fate during differentiation. Science 367, eaaw3381 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Chan MM, Smith ZD, Grosswendt S, Kretzmer H, Norman TM, Adamson B, et al. , Molecular recording of mammalian embryogenesis. Nature 570, 77–82 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Frieda KL, Linton JM, Hormoz S, Choi J, Chow K-HK, Singer ZS, et al. , Synthetic recording and in situ readout of lineage information in single cells. Nature 541, 107–111 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.McKenna A, Findlay GM, Gagnon JA, Horwitz MS, Schier AF, Shendure J, Whole-organism lineage tracing by combinatorial and cumulative genome editing. Science 353, aaf7907 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Chadly DM, Frieda KL, Gui C, Klock L, Tran M, Sui MY, et al. , Reconstructing cell histories in space with image-readable base editor recording. bioRxiv [Preprint] (2024). 10.1101/2024.01.03.573434. [DOI] [Google Scholar]
- 43.Loveless TB, Grotts JH, Schechter MW, Forouzmand E, Carlson CK, Agahi BS, et al. , Lineage tracing and analog recording in mammalian cells by single-site DNA writing. Nat Chem Biol 17, 739–747 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Loveless TB, Carlson CK, Dentzel Helmy CA, Hu VJ, Ross SK, Demelo MC, et al. , Open-ended molecular recording of sequential cellular events into DNA. Nat Chem Biol, doi: 10.1038/s41589-024-01764-5 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Bowling S, Sritharan D, Osorio FG, Nguyen M, Cheung P, Rodriguez-Fraticelli A, et al. , An Engineered CRISPR-Cas9 Mouse Line for Simultaneous Readout of Lineage Histories and Gene Expression Profiles in Single Cells. Cell 181, 1410–1422.e27 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Li L, Bowling S, McGeary SE, Yu Q, Lemke B, Alcedo K, et al. , A mouse model with high clonal barcode diversity for joint lineage, transcriptomic, and epigenomic profiling in single cells. Cell 186, 5183–5199.e22 (2023). [DOI] [PubMed] [Google Scholar]
- 47.Choi J, Chen W, Minkina A, Chardon FM, Suiter CC, Regalado SG, et al. , A time-resolved, multi-symbol molecular recorder via sequential genome editing. Nature 608, 98–107 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Raj B, Wagner DE, McKenna A, Pandey S, Klein AM, Shendure J, et al. , Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat Biotechnol 36, 442–450 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Spanjaard B, Hu B, Mitic N, Olivares-Chauvet P, Janjuha S, Ninov N, et al. , Simultaneous lineage tracing and cell-type identification using CRISPR-Cas9-induced genetic scars. Nat Biotechnol 36, 469–473 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Alemany A, Florescu M, Baron CS, Peterson-Maduro J, van Oudenaarden A, Whole-organism clone tracing using single-cell sequencing. Nature 556, 108–112 (2018). [DOI] [PubMed] [Google Scholar]
- 51.Kalhor R, Kalhor K, Mejia L, Leeper K, Graveline A, Mali P, et al. , Developmental barcoding of whole mouse via homing CRISPR. Science 361, eaat9804 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.He Z, Maynard A, Jain A, Gerber T, Petri R, Lin H-C, et al. , Lineage recording in human cerebral organoids. Nat Methods 19, 90–99 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Quinn JJ, Jones MG, Okimoto RA, Nanjo S, Chan MM, Yosef N, et al. , Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts. Science 371, eabc1944 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Yang D, Jones MG, Naranjo S, Rideout WM, Min KHJ, Ho R, et al. , Lineage tracing reveals the phylodynamics, plasticity, and paths of tumor evolution. Cell 185, 1905–1923.e25 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Jones MG, Sun D, Min KHJ, Colgan WN, Tian L, Weir JA, et al. , Spatiotemporal lineage tracing reveals the dynamic spatial architecture of tumor growth and metastasis. bioRxiv, 2024.10.21.619529 (2024). [Google Scholar]
- 56.Askary A, Sanchez-Guardado L, Linton JM, Chadly DM, Budde MW, Cai L, et al. , In situ readout of DNA barcodes and single base edits facilitated by in vitro transcription. Nat Biotechnol 38, 66–75 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Bressan D, Battistoni G, Hannon GJ, The dawn of spatial omics. Science 381, eabq4964 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X, RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 59.Goltsev Y, Samusik N, Kennedy-Darling J, Bhate S, Hale M, Vazquez G, et al. , Deep Profiling of Mouse Splenic Architecture with CODEX Multiplexed Imaging. Cell 174, 968–981.e15 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Lin J-R, Izar B, Wang S, Yapp C, Mei S, Shah PM, et al. , Highly multiplexed immunofluorescence imaging of human tissues and tumors using t-CyCIF and conventional optical microscopes. Elife 7, e31657 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Gut G, Herrmann MD, Pelkmans L, Multiplexed protein maps link subcellular organization to cellular states. Science 361, eaar7042 (2018). [DOI] [PubMed] [Google Scholar]
- 62.Lubeck E, Coskun AF, Zhiyentayev T, Ahmad M, Cai L, Single-cell in situ RNA profiling by sequential hybridization. Nat Methods 11, 360–361 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Rao A, Barkley D, França GS, Yanai I, Exploring tissue architecture using spatial transcriptomics. Nature 596, 211–220 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Zhuang X, Spatially resolved single-cell genomics and transcriptomics by imaging. Nat Methods 18, 18–22 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Wang X, Allen WE, Wright MA, Sylwestrak EL, Samusik N, Vesuna S, et al. , Three-dimensional intact-tissue sequencing of single-cell transcriptional states. Science 361, eaat5691 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Anzalone AV, Randolph PB, Davis JR, Sousa AA, Koblan LW, Levy JM, et al. , Search-and-replace genome editing without double-strand breaks or donor DNA. Nature 576, 149–157 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Chen PJ, Hussmann JA, Yan J, Knipping F, Ravisankar P, Chen P-F, et al. , Enhanced prime editing systems by manipulating cellular determinants of editing outcomes. Cell 184, 5635–5652.e29 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Chen PJ, Liu DR, Prime editing for precise and highly versatile genome manipulation. Nat Rev Genet 24, 161–177 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Anzalone AV, Koblan LW, Liu DR, Genome editing with CRISPR-Cas nucleases, base editors, transposases and prime editors. Nat Biotechnol 38, 824–844 (2020). [DOI] [PubMed] [Google Scholar]
- 70.van Overbeek M, Capurso D, Carter MM, Thompson MS, Frias E, Russ C, et al. , DNA Repair Profiling Reveals Nonrandom Outcomes at Cas9-Mediated Breaks. Mol Cell 63, 633–646 (2016). [DOI] [PubMed] [Google Scholar]
- 71.Shen MW, Arbab M, Hsu JY, Worstell D, Culbertson SJ, Krabbe O, et al. , Predictable and precise template-free CRISPR editing of pathogenic variants. Nature 563, 646–651 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Jones MG, Khodaverdian A, Quinn JJ, Chan MM, Hussmann JA, Wang R, et al. , Inference of single-cell phylogenies from lineage tracing data using Cassiopeia. Genome Biol 21, 92 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Robinson DF, Foulds LR, Comparison of phylogenetic trees. Mathematical Biosciences 53, 131–147 (1981). [Google Scholar]
- 74.Critchlow DE, Pearl DK, Qian C, The Triples Distance for Rooted Bifurcating Phylogenetic Trees. Systematic Biology 45, 323–334 (1996). [Google Scholar]
- 75.Wang R, Zhang R, Khodaverdian A, Yosef N, Theoretical guarantees for phylogeny inference from single-cell lineage tracing. Proc Natl Acad Sci U S A 120, e2203352120 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Saitou N, Nei M, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4, 406–425 (1987). [DOI] [PubMed] [Google Scholar]
- 77.Sokal RR, Michener CD, A Statistical Method for Evaluating Systematic Relationships (University of Kansas, 1958). [Google Scholar]
- 78.Clausen PTLC, Scaling neighbor joining to one million taxa with dynamic and heuristic neighbor joining. Bioinformatics 39, btac774 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Ihry RJ, Worringer KA, Salick MR, Frias E, Ho D, Theriault K, et al. , p53 inhibits CRISPR-Cas9 engineering in human pluripotent stem cells. Nat Med 24, 939–946 (2018). [DOI] [PubMed] [Google Scholar]
- 80.Haapaniemi E, Botla S, Persson J, Schmierer B, Taipale J, CRISPR-Cas9 genome editing induces a p53-mediated DNA damage response. Nat Med 24, 927–930 (2018). [DOI] [PubMed] [Google Scholar]
- 81.Leibowitz ML, Papathanasiou S, Doerfler PA, Blaine LJ, Sun L, Yao Y, et al. , Chromothripsis as an on-target consequence of CRISPR-Cas9 genome editing. Nat Genet 53, 895–905 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Kosicki M, Tomberg K, Bradley A, Repair of double-strand breaks induced by CRISPR-Cas9 leads to large deletions and complex rearrangements. Nat Biotechnol 36, 765–771 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Enache OM, Rendo V, Abdusamad M, Lam D, Davison D, Pal S, et al. , Cas9 activates the p53 pathway and selects for p53-inactivating mutations. Nat Genet 52, 662–668 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Yu G, Kim HK, Park J, Kwak H, Cheong Y, Kim D, et al. , Prediction of efficiencies for diverse prime editing systems in multiple cell types. Cell 186, 2256–2272.e23 (2023). [DOI] [PubMed] [Google Scholar]
- 85.Nelson JW, Randolph PB, Shen SP, Everette KA, Chen PJ, Anzalone AV, et al. , Engineered pegRNAs improve prime editing efficiency. Nat Biotechnol 40, 402–410 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86.Bae S, Park J, Kim J-S, Cas-OFFinder: a fast and versatile algorithm that searches for potential off-target sites of Cas9 RNA-guided endonucleases. Bioinformatics 30, 1473–1475 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.Levesque S, Mayorga D, Fiset J-P, Goupil C, Duringer A, Loiselle A, et al. , Marker-free co-selection for successive rounds of prime editing in human cells. Nat Commun 13, 5909 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Datlinger P, Rendeiro AF, Schmidl C, Krausgruber T, Traxler P, Klughammer J, et al. , Pooled CRISPR screening with single-cell transcriptome readout. Nat Methods 14, 297–301 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.Aslakson CJ, Miller FR, Selective events in the metastatic process defined by analysis of the sequential dissemination of subpopulations of a mouse mammary tumor. Cancer Res 52, 1399–1405 (1992). [PubMed] [Google Scholar]
- 90.Dexter DL, Kowalski HM, Blazar BA, Fligiel Z, Vogel R, Heppner GH, Heterogeneity of tumor cells from a single mouse mammary tumor. Cancer Res 38, 3174–3181 (1978). [PubMed] [Google Scholar]
- 91.Fidler IJ, Biological behavior of malignant melanoma cells correlated to their survival in vivo. Cancer Res 35, 218–224 (1975). [PubMed] [Google Scholar]
- 92.Lu T, Ang CE, Zhuang X, Spatially resolved epigenomic profiling of single cells in complex tissues. Cell 185, 4448–4464.e17 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.Moffitt JR, Hao J, Bambah-Mukku D, Lu T, Dulac C, Zhuang X, High-performance multiplexed fluorescence in situ hybridization in culture and tissue with matrix imprinting and clearing. Proc Natl Acad Sci U S A 113, 14456–14461 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94.Fowlkes EB, Mallows CL, A Method for Comparing Two Hierarchical Clusterings. Journal of the American Statistical Association 78, 553–569 (1983). [Google Scholar]
- 95.Alard A, Katsara O, Rios-Fuller T, de la Parra C, Ozerdem U, Ernlund A, et al. , Breast cancer cell mesenchymal transition and metastasis directed by DAP5/eIF3d-mediated selective mRNA translation. Cell Rep 42, 112646 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Smith MCP, Luker KE, Garbow JR, Prior JL, Jackson E, Piwnica-Worms D, et al. , CXCR4 regulates growth of both primary and metastatic breast cancer. Cancer Res 64, 8604–8612 (2004). [DOI] [PubMed] [Google Scholar]
- 97.Martin BK, Qiu C, Nichols E, Phung M, Green-Gladden R, Srivatsan S, et al. , Optimized single-nucleus transcriptional profiling by combinatorial indexing. Nat Protoc 18, 188–207 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98.Stringer C, Pachitariu M, Cellpose3: one-click image restoration for improved cellular segmentation. Nat Methods, doi: 10.1038/s41592-025-02595-5 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99.Wernersson E, Gelali E, Girelli G, Wang S, Castillo D, Mattsson Langseth C, et al. , Deconwolf enables high-performance deconvolution of widefield fluorescence microscopy images. Nat Methods 21, 1245–1256 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 100.Ergen C, Yosef N, ResolVI - addressing noise and bias in spatial transcriptomics. bioRxiv [Preprint] (2025). 10.1101/2025.01.20.634005. [DOI] [Google Scholar]
- 101.Cadinu P, Sivanathan KN, Misra A, Xu RJ, Mangani D, Yang E, et al. , Charting the cellular biogeography in colitis reveals fibroblast trajectories and coordinated spatial remodeling. Cell 187, 2010–2028.e30 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.Jones DC, Elz AE, Hadadianpour A, Ryu H, Glass DR, Newell EW, Cell simulation as cell segmentation. Nat Methods 22, 1331–1342 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.Zilionis R, Engblom C, Pfirschke C, Savova V, Zemmour D, Saatcioglu HD, et al. , Single-Cell Transcriptomics of Human and Mouse Lung Cancers Reveals Conserved Myeloid Populations across Individuals and Species. Immunity 50, 1317–1334.e10 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104.Guo M, Morley MP, Jiang C, Wu Y, Li G, Du Y, et al. , Guided construction of single cell reference for human and mouse lung. Nat Commun 14, 4566 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 105.Schrörs B, Boegel S, Albrecht C, Bukur T, Bukur V, Holtsträter C, et al. , Multi-Omics Characterization of the 4T1 Murine Mammary Gland Tumor Model. Front Oncol 10, 1195 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106.Ferrer CM, Cho HM, Boon R, Bernasocchi T, Wong LP, Cetinbas M, et al. , The glutathione S-transferase Gstt1 drives survival and dissemination in metastases. Nat Cell Biol 26, 975–990 (2024). [DOI] [PubMed] [Google Scholar]
- 107.DeTomaso D, Yosef N, Hotspot identifies informative gene modules across modalities of single-cell genomics. Cell Syst 12, 446–456.e9 (2021). [DOI] [PubMed] [Google Scholar]
- 108.Casanova-Acebes M, Dalla E, Leader AM, LeBerichel J, Nikolic J, Morales BM, et al. , Tissue-resident macrophages provide a pro-tumorigenic niche to early NSCLC cells. Nature 595, 578–584 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 109.Laoui D, Van Overmeire E, Di Conza G, Aldeni C, Keirsse J, Morias Y, et al. , Tumor hypoxia does not drive differentiation of tumor-associated macrophages but rather fine-tunes the M2-like macrophage population. Cancer Res 74, 24–30 (2014). [DOI] [PubMed] [Google Scholar]
- 110.Henze A-T, Mazzone M, The impact of hypoxia on tumor-associated macrophages. J Clin Invest 126, 3672–3679 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 111.Murdoch C, Muthana M, Lewis CE, Hypoxia regulates macrophage functions in inflammation. J Immunol 175, 6257–6263 (2005). [DOI] [PubMed] [Google Scholar]
- 112.Lewis C, Murdoch C, Macrophage responses to hypoxia: implications for tumor progression and anti-cancer therapies. Am J Pathol 167, 627–635 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113.Dongre A, Weinberg RA, New insights into the mechanisms of epithelial-mesenchymal transition and implications for cancer. Nat Rev Mol Cell Biol 20, 69–84 (2019). [DOI] [PubMed] [Google Scholar]
- 114.Wang X, Tokheim C, Gu SS, Wang B, Tang Q, Li Y, et al. , In vivo CRISPR screens identify the E3 ligase Cop1 as a modulator of macrophage infiltration and cancer immunotherapy target. Cell 184, 5357–5374.e22 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 115.Moses L, Pachter L, Museum of spatial transcriptomics. Nat Methods 19, 534–546 (2022). [DOI] [PubMed] [Google Scholar]
- 116.Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, et al. , Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 117.Russell AJC, Weir JA, Nadaf NM, Shabet M, Kumar V, Kambhampati S, et al. , Slide-tags enables single-nucleus barcoding for multimodal spatial genomics. Nature 625, 101–109 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 118.Stickels RR, Murray E, Kumar P, Li J, Marshall JL, Di Bella DJ, et al. , Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat Biotechnol 39, 313–319 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 119.Tam SY, Wu VWC, Law HKW, Hypoxia-Induced Epithelial-Mesenchymal Transition in Cancers: HIF-1α and Beyond. Front Oncol 10, 486 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 120.Ferrara N, VEGF and the quest for tumour angiogenesis factors. Nat Rev Cancer 2, 795–803 (2002). [DOI] [PubMed] [Google Scholar]
- 121.Zhang Y, Weinberg RA, Epithelial-to-mesenchymal transition in cancer: complexity and opportunities. Front Med 12, 361–373 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 122.Biancalani T, Scalia G, Buffoni L, Avasthi R, Lu Z, Sanger A, et al. , Deep learning and alignment of spatially resolved single-cell transcriptomes with Tangram. Nat Methods 18, 1352–1362 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 123.Abdelaal T, Mourragui S, Mahfouz A, Reinders MJT, SpaGE: Spatial Gene Enhancement using scRNA-seq. Nucleic Acids Res 48, e107 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 124.Allen WE, Blosser TR, Sullivan ZA, Dulac C, Zhuang X, Molecular and spatial signatures of mouse brain aging at single-cell resolution. Cell 186, 194–208.e18 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 125.Satija R, Farrell JA, Gennert D, Schier AF, Regev A, Spatial reconstruction of single-cell gene expression data. Nat Biotechnol 33, 495–502 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 126.Fang R, Halpern A, Rahman MM, Huang Z, Lei Z, Hell SJ, et al. , Three-dimensional single-cell transcriptome imaging of thick tissues. Elife 12, RP90029 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 127.Mai U, Chu G, Raphael BJ, Maximum Likelihood Inference of Time-scaled Cell Lineage Trees with Mixed-type Missing Data. bioRxiv, 2024.03.05.583638 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 128.Wang S-W, Herriges MJ, Hurley K, Kotton DN, Klein AM, CoSpar identifies early cell fate biases from single-cell transcriptomic and lineage information. Nat Biotechnol 40, 1066–1074 (2022). [DOI] [PubMed] [Google Scholar]
- 129.Konno N, Kijima Y, Watano K, Ishiguro S, Ono K, Tanaka M, et al. , Deep distributed computing to reconstruct extremely large lineage trees. Nat Biotechnol 40, 566–575 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 130.Fang W, Bell CM, Sapirstein A, Asami S, Leeper K, Zack DJ, et al. , Quantitative fate mapping: A general framework for analyzing progenitor state dynamics via retrospective lineage barcoding. Cell 185, 4604–4620.e32 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 131.Saunders RA, Allen WE, Pan X, Sandhu J, Lu J, Lau TK, et al. , Perturb-Multimodal: A platform for pooled genetic screens with imaging and sequencing in intact mammalian tissue. Cell 0 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 132.Chen W, Choi J, Li X, Nathans JF, Martin B, Yang W, et al. , Symbolic recording of signalling and cis-regulatory element activity to DNA. Nature 632, 1073–1081 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 133.Tanay A, Regev A, Scaling single-cell genomics from phenomenology to mechanism. Nature 541, 331–338 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 134.Tabula Muris Consortium, Overall coordination, Logistical coordination, Organ collection and processing, Library preparation and sequencing, Computational data analysis, et al. , Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 135.Srivatsan SR, Regier MC, Barkan E, Franks JM, Packer, Grosjean P, et al. , Embryo-scale, single-cell spatial transcriptomics. Science 373, 111–117 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 136.Sampath Kumar A, Tian L, Bolondi A, Hernández AA, Stickels R, Kretzmer H, et al. , Spatiotemporal transcriptomic maps of whole mouse embryos at the onset of organogenesis. Nat Genet 55, 1176–1185 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 137.Huang X, Henck J, Qiu C, Sreenivasan VKA, Balachandran S, Amarie OV, et al. , Single-cell, whole-embryo phenotyping of mammalian developmental disorders. Nature 623, 772–781 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 138.Tabula Sapiens Consortium*, Jones RC, Karkanias J, Krasnow MA, Pisco AO, Quake SR, et al. , The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science 376, eabl4896 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 139.Cusanovich DA, Hill AJ, Aghamirzaie D, Daza RM, Pliner HA, Berletch JB, et al. , A Single-Cell Atlas of In Vivo Mammalian Chromatin Accessibility. Cell 174, 1309–1324.e18 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 140.Rozenblatt-Rosen O, Stubbington MJT, Regev A, Teichmann SA, The Human Cell Atlas: from vision to reality. Nature 550, 451–453 (2017). [DOI] [PubMed] [Google Scholar]
- 141.Regev, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, et al., The Human Cell Atlas. eLife 6, e27041 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 142.Travaglini KJ, Nabhan AN, Penland L, Sinha R, Gillich A, Sit RV, et al. , A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 587, 619–625 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 143.Zengel J, Wang YX, Seo JW, Ning K, Hamilton JN, Wu B, et al. , Hardwiring tissue-specific AAV transduction in mice through engineered receptor expression. Nat Methods 20, 1070–1081 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 144.Clement K, Rees H, Canver MC, Gehrke JM, Farouni R, Hsu JY, et al. , CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat Biotechnol 37, 224–226 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 145.Replogle JM, Bonnar JL, Pogson AN, Liem CR, Maier NK, Ding Y, et al. , Maximizing CRISPRi efficacy and accessibility with dual-sgRNA libraries and optimal effectors. eLife 11, e81856 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 146.Zadeh JN, Steenberg CD, Bois JS, Wolfe BR, Pierce MB, Khan AR, et al. , NUPACK: Analysis and design of nucleic acid systems. Journal of Computational Chemistry 32, 170–173 (2011). [DOI] [PubMed] [Google Scholar]
- 147.Xia C, Fan J, Emanuel G, Hao J, Zhuang X, Spatial transcriptome profiling by MERFISH reveals subcellular RNA compartmentalization and cell cycle-dependent gene expression. Proc Natl Acad Sci U S A 116, 19490–19499 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 148.Xu Q, Schlabach MR, Hannon GJ, Elledge SJ, Design of 240,000 orthogonal 25mer DNA barcode probes. Proc Natl Acad Sci U S A 106, 2289–2294 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 149.Schmidt R, Steinhart Z, Layeghi M, Freimer JW, Bueno R, Nguyen VQ, et al. , CRISPR activation and interference screens decode stimulation responses in primary human T cells. Science 375, eabj4008 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 150.Hill AJ, McFaline-Figueroa JL, Starita LM, Gasperini MJ, Matreyek KA, Packer J, et al. , On the design of CRISPR-based single-cell molecular screens. Nat Methods 15, 271–274 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 151.Su J-H, Zheng P, Kinrot SS, Bintu B, Zhuang X, Genome-Scale Imaging of the 3D Organization and Transcriptional Activity of Chromatin. Cell 182, 1641–1659.e26 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 152.Zhang M, Pan X, Jung W, Halpern AR, Eichhorn SW, Lei Z, et al. , Molecularly defined and spatially resolved cell atlas of the whole mouse brain. Nature 624, 343–354 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 153.Moffitt JR, Hao J, Wang G, Chen KH, Babcock HP, Zhuang X, High-throughput single-cell gene-expression profiling with multiplexed error-robust fluorescence in situ hybridization. Proceedings of the National Academy of Sciences 113, 11046–11051 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 154.Sankoff D, Minimal Mutation Trees of Sequences. SIAM Journal on Applied Mathematics 28, 35–42 (1975). [Google Scholar]
- 155.Prillo S, Ravoor A, Yosef N, Song YS, ConvexML: Scalable and accurate inference of single-cell chronograms from CRISPR/Cas9 lineage tracing data. bioRxiv [Preprint] (2023). 10.1101/2023.12.03.569785. [DOI] [Google Scholar]
- 156.Ester M, Kriegel H-P, Sander J, Xu X, “A density-based algorithm for discovering clusters in large spatial databases with noise” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (AAAI Press, Portland, Oregon, 1996)KDD’96, pp. 226–231. [Google Scholar]
- 157.Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. , Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 158.McGinnis CS, Murrow LM, Gartner ZJ, DoubletFinder: Doublet Detection in Single-Cell RNA Sequencing Data Using Artificial Nearest Neighbors. Cell Syst 8, 329–337.e4 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 159.Aran D, Looney AP, Liu L, Wu E, Fong V, Hsu A, et al. , Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat Immunol 20, 163–172 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 160.Wolf FA, Angerer P, Theis FJ, SCANPY: large-scale single-cell gene expression data analysis. Genome Biol 19, 15 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 161.Vasaikar SV, Deshmukh AP, den Hollander P, Addanki S, Kuburich NA, Kudaravalli S, et al. , EMTome: a resource for pan-cancer analysis of epithelial-mesenchymal transition genes and signatures. Br J Cancer 124, 259–269 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 162.Emanuel G, seichhorn H Babcock, leonardosepulveda, timblosser, ZhuangLab/MERlin: MERlin v0.1.6, version v0.1.6, Zenodo (2020); 10.5281/zenodo.3758540. [DOI] [Google Scholar]
- 163.Gayoso A, Lopez R, Xing G, Boyeau P, Valiollah Pour Amiri V, Hong J, et al. , A Python library for probabilistic analysis of single-cell omics data. Nat Biotechnol 40, 163–166 (2022). [DOI] [PubMed] [Google Scholar]
- 164.Pérez JS, Meinhardt-Llopis E, Facciolo G, TV-L1 Optical Flow Estimation. Image Processing On Line 3, 137–150 (2013). [Google Scholar]
- 165.Neher RA, Russell CA, Shraiman BI, Predicting evolution from the shape of genealogical trees. eLife 3, e03568 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 166.Chaligne R, Gaiti F, Silverbush D, Schiffman JS, Weisman HR, Kluegel L, et al. , Epigenetic encoding, heritability and plasticity of glioma transcriptional cell states. Nat Genet 53, 1469–1479 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 167.Rey SJ, Anselin L, “PySAL: A Python Library of Spatial Analytical Methods” in Handbook of Applied Spatial Analysis: Software Tools, Methods and Applications, Fischer MM, Getis A, Eds. (Springer, Berlin, Heidelberg, 2010; 10.1007/978-3-642-03647-7_11), pp. 175–193. [DOI] [Google Scholar]
- 168.Colgan W, Yost K, Zheng P, jweissmanlab/PEtracer-2025: v0.0.2, Zenodo (2025); 10.5281/zenodo.15635629. [DOI] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Plasmids generated in this study will be available on Addgene. All scRNA sequencing data generated in this study are available under GEO accession GSE290975. All other sequencing data is available on SRA under BioProject accession PRJNA1231108. Additional processed data including MERFISH data and lineage reconstructions are available on Figshare (https://figshare.com/articles/dataset/PEtracer_Data/28473866, DOI 10.6084/m9.figshare.28473866). Custom code used in this study is available at https://github.com/jweissmanlab/PEtracer-2025 (168).
