Abstract
Somatic mosaicism is pervasively observed in human aging, with clonal expansions of cells harboring mutations in recurrently mutated driver genes. Bulk sequencing of tissues captures mutation frequencies, but cannot reconstruct clonal architectures nor delineate how driver mutations impact cellular phenotypes. We developed single-cell Genotype-to-Phenotype sequencing (scG2P) for high-throughput, highly-multiplexed, joint capture of genotyping of mutation hotspots and mRNA markers. We applied scG2P to aged esophagus samples from six individuals and observed large numbers of clones with a single driver event, accompanied by rare clones with two driver mutations. NOTCH1 mutants dominate the clonal landscape and are linked to stunted epithelial differentiation, while TP53 mutants promote clonal expansion through both differentiation biases and increased cell cycling. Thus, joint single-cell highly multiplexed capture of somatic mutations and mRNA transcripts enables high resolution reconstruction of clonal architecture and associated phenotypes in solid tissue somatic mosaicism.
Introduction
Somatic evolution leads to clonal outgrowths in cancer(1,2), as well as across normal(3,4) and diseased non-malignant(5–7) human tissues. Genomic profiling of phenotypically normal esophagus (PNE) has uncovered a diverse landscape of driver-mutated clones increasing with age, replacing up to 80% of the epithelium by age 60(3,8,9). Despite these diverse clones, PNE tissue maintains histological integrity, cell type composition, and function(10,11). PNE driver mutations overlap with those found in esophageal squamous cell carcinoma (ESCC), including NOTCH1, TP53, NOTCH2, NOTCH3, and FAT1(3,8). Some driver mutations are more frequent in PNE versus ESCC, indicating that clones can be selected against in carcinogenesis. In the aging esophagus, NOTCH1 mutants outcompete tumor cells, but maintain tissue integrity similar to wild-type cells(10,11).
Non-malignant clonal mosaicism (CM) studied through bulk sequencing of microdissections has left two major gaps in understanding somatic mosaicism. First, bulk sequencing is limited in resolving clonal hierarchies of mutations, which broadly align with either nested (two driver mutations within the same clone) or branched (driver mutations impact sibling clones) structures(9). Determining whether non-malignant tissue clones contain multiple driver mutations, or if driver mutations affect distinct clones requires multiplexed mutational capture at single-cell resolution. Second, previous studies focused on DNA sequencing of somatic mutations, without capturing their phenotypic impact in primary samples. Exploring phenotypic effects of somatic mutations requires single-cell multi-modality technologies that can link genotype and phenotypic readouts (e.g., transcriptome) at single-cell resolution.
Single-cell genotype-to-phenotype mapping using plate-based(12) or droplet-based(13,14) methods revealed that CM drivers often disrupt differentiation hierarchies in blood, where ease of sampling allows prior bulk genotyping to direct limited targeted mutational capture with single-cell multimodal methods. Viably frozen whole cells from blood additionally ensure higher mRNA content, unlike the single-nucleus approaches needed for archival solid tissues, facilitating genotyping from single-cell RNA sequencing (scRNAseq). In contrast, genotype-to-phenotype mapping in solid tissue CM is more challenging. In PNE, mutations are often distributed across genes, hindering hotspot capture genotyping. Moreover, clones are comparatively small and spatially segregated, limiting mutation profiling before single-cell analysis. Finally, archival solid tissues require nuclei extraction, which lowers mRNA content and limits the application of single-cell genotype-to-phenotype mapping of somatic clones in solid tissues.
To address this challenge, we developed single-cell Genotype-to-Phenotype sequencing (scG2P), a single-cell approach for highly multiplexed capture of multiple recurrently mutated driver gene regions to decipher mosaicism in solid tissue, defining cell states with matched mRNA in PNE. High-throughput microfluidics enables the profiling of thousands of cells required to capture smaller clones in solid tissue. Cell mixing analysis demonstrates accurate co-capture of 118 genomic regions between 160–260 bp and 56 mRNA transcripts. We profiled >10,000 single nuclei and cells from aged PNE samples from six donors, detecting somatic variants across six driver genes (NOTCH1, TP53, NOTCH2, NOTCH3, FAT1, PPM1D). We resolved the clonal architecture of driver mutants, finding most clones having single driver mutations, accompanied by rare instances of clones with two driver mutations. Using the matched transcriptional information, we assigned cells with epithelial differentiation stages and cell cycle scores to define driver-specific phenotypes. NOTCH1-mutated clones show stalled epithelial differentiation, while TP53-mutant clones have both stalled differentiation and increased proliferation. We capture clones with NOTCH1 loss of heterozygosity (LOH) using germline variants, tracing phenotypic biases of NOTCH1-mutant cells that acquire additional NOTCH1 mutations or LOH. These results provide key insights into somatic evolution to deliver a novel framework for deciphering the functional consequences of somatic mutations in solid tissues.
Results
Co-capture of highly multiplexed somatic mutation profiling and targeted mRNAs in single cells
Capturing the diversity of somatic mutants in solid tissue requires targeting a range of loci along driver genes. We utilized a single-cell DNA sequencing (scDNAseq) technology developed to capture mutational hotspots(15–18). This microfluidic platform uses double encapsulation, where the first encapsulation releases DNA content through cell lysis, followed by targeted amplification, and a second encapsulation step adds cell barcodes to targeted amplicons. To link genotypes to cell states, we modified the assay for reverse transcription of mRNA targets during the first encapsulation step, adding capture handles to cDNA amplicons for downstream barcoding. During second encapsulation, we add cell barcodes to the DNA and RNA amplicons, followed by multiplexed PCR amplification. To differentiate transcripts from off-target gDNA capture, we designed RNA amplicons to cover exon-exon junctions (Fig. 1A). Barcoded RNA and DNA amplicons are separated with streptavidin bead pulldown post emulsion amplification for separate library preparation. The final barcoded DNA and RNA amplicon libraries enable linking somatic mutations with RNA signals through shared cell barcodes.
Figure 1. Targeted capture of mutation hotspots and RNA in single cells.

A) Schematic representation of the scG2P workflow. Dissociated cells or nuclei undergo cell lysis and targeted reverse transcription (RT), followed by barcode addition and targeted loci amplification in two sequential encapsulations. DNA and RNA amplicons are separated by streptavidin bead capture for separate library preparation. This workflow combines DNA mutation hotspot capture to reconstruct clonal architecture with exon-exon capture of RNA targets for single-cell genotype to phenotype linkage (created with BioRender.com). B) Previously reported mutations per codon and tiling amplicon coverage across NOTCH1 (top) and TP53 (bottom) protein position. The number of mutations per codon, as previously reported from bulk sequencing of the esophagus (Yokoyama et al.(8)), is displayed across the protein positions in the upper plots. A 15-bp rolling window is used to determine the number of mutations per codon. In the middle plots, black bars represent the presence of an amplicon covering the locus in the scG2P panel, while grey bars represent the absence of an amplicon covering the locus. Domains for NOTCH1 and TP53 are indicated below (EGF = Epidermal Growth Factor-like repeats, LNR = LIN12/Notch Repeat, ANK = Ankyrin Repeats, TAD = transactivation domain, DBD = DNA binding domain, OLD = oligomerization domain). The percentage of GC content across 15-bp windows is displayed in black (lower plots). Red lines indicate GC content of the genotyping amplicons corresponding to the regions captured by the panel. Grey dashed lines represent the mean GC content across all amplicons in the gene. Amplicon designs for FAT1, NOTCH2, NOTCH3, and PPM1D are provided in Supplementary Figure 1B. C) Heatmap of filtered variants detected in mixing study. Cell lines HCT116, KYSE270, and KYSE410 were mixed and processed using scG2P. 
The heatmap displays the detected filtered variants for each individual cell, clustered by cell line based on the variant allele frequencies (VAFs) of the DNA variants. Variants annotated in red were independently validated using whole-exome sequencing data from the Cancer Cell Line Encyclopedia (CCLE) database. The genotype of each variant is indicated as homozygous (HOM), heterozygous (HET), wild-type (WT), or missing. D) (Left) RNA expression-based uniform manifold approximation projection (UMAP) for cell line mixing experiment. HCT116 (n = 234 cells), KYSE270 (n = 403 cells), and KYSE410 (n = 355 cells) are colored according to their assigned cell line identity, determined by k-means clustering of the variant allele frequencies of DNA variants. (Right) Violin plots displaying the RNA expression levels (centered log ratio) of four marker genes (KRT23, KRT5, KRT7, and EPCAM) across the three cell lines. E) Confusion matrix comparing RNA-based clustering labels (predicted) and DNA-based clustering labels (ground truth) of the cell line mixing study. The matrix displays the percentage of cells assigned to each cell line based on RNA expression profiles compared to the ground truth DNA-based assignments. Diagonal elements represent correctly classified cells, while off-diagonal elements indicate misclassifications. The mean accuracy across all cell lines is 0.95.
We tested this technology in PNE, where somatic mosaicism is pervasive in aging donors. We designed amplicon panels capturing frequently mutated sites across PNE driver genes NOTCH1, NOTCH2, NOTCH3, PPM1D, FAT1, and TP53(3,8) (Supplementary Table 1). NOTCH1, NOTCH2, NOTCH3, TP53 have key roles in keratinocyte differentiation and contain positively selected mutations in both aging esophagus and sun-exposed skin(19–21), and NOTCH1 mutations confer competitive advantage without disrupting the epithelium structure(10). Although GC content or primer interactions resulted in variable capture efficiency of tiled regions, the amplicons covered >400 previously reported mutations in PNE in >21Kb of genomic DNA across the six genes (Fig. 1B, Supplementary Fig. 1A, B).
To design an informative mRNA panel to assign cell states, we performed scRNAseq and spatial transcriptomics on PNE samples. Human esophageal tissue comprises layers of basal epithelial cells farthest from the lumen that serve as a reservoir of progenitor cells. The next layers are composed of proliferating suprabasal cells that migrate towards the lumen, with an increasingly differentiated phenotype marked by flattening and formation of a tight barrier across the lumen. Differentiated cells are shed as cells migrate, maintaining homeostasis and cell density(10,22). We performed scRNAseq on dissociated single cells from endoscopy punch biopsies from three older donors with high alcohol and/or tobacco exposure (Supplementary Table 2) and annotated cell types based on marker genes (Supplementary Fig. 2A–C) from an esophageal single-cell reference dataset(23,24). We captured the established trajectory of epithelial cell differentiation, spanning early basal, differentiating suprabasal, and differentiated states (Supplementary Fig. 2B, C). We also performed spatial transcriptomics (ST; 10x Visium) to assay sections from optimal cutting temperature compound (OCT)-embedded biopsy punches from three donors (Supplementary Table 2). Using annotations from the single-cell datasets, we categorized spatial barcodes as clusters of cell types, observing basal, suprabasal, and differentiated epithelial cell type regions (Supplementary Fig. 2D). Trajectory analysis of ST barcodes recapitulated these differentiation transcriptional signatures, comprising markers of basal cells (KRT5, KRT15), suprabasal cells (KRT13, KRT4), and differentiated cells (S100A9, SPRR3), and confirmed TP63, SOX2, and COL17A1 as regulators of epithelial differentiation in basal and suprabasal cells (Supplementary Fig. 2E, F), as previously reported(23,24).
No notable differences in cell composition or clonal expansions were captured using these high-resolution methods, underscoring the need for single-cell genotyping. Using these datasets, we designed a targeted mRNA panel of 56 markers (Supplementary Table 3; Methods) to enable identification of cell types, epithelial differentiation stages, and markers of stemness and proliferation in mutant vs. wild-type cells. We verified that these 56 mRNA targets alone can recapitulate the cell types observed from scRNAseq (Supplementary Fig. 2G).
To validate scG2P performance, we performed mixing experiments with two esophageal (KYSE-270 and KYSE-410) and a colon (HCT-116) cell lines. Valid cell barcodes were determined by read depth and coverage uniformity across the DNA panel (Methods), and variants were filtered based on read depth, allelic frequency, and genotyping quality (Supplementary Table 4)(15–18). Cell lines were identified by clustering based on variant allele frequencies (VAFs) of filtered variants (Fig. 1C, Supplementary Table 4). We identified nine single-nucleotide variants (SNVs) that were private to individual cell lines. Five of these were variants previously identified by whole-exome sequencing (WES) in the Cancer Cell Line Encyclopedia (CCLE) database (Fig. 1C). The remaining four mutations (one intronic, three coding) likely represent mutations acquired by individual cell lines during culture (Supplementary Fig. 3A). We also detected in each cell line three additional shared germline variants with high population frequency in the gnomAD database (Supplementary Fig. 3A, Supplementary Table 4). Genotyping accuracy (percentage of correct WES variant call assigned) was 92% (SEM: 3.64%) (Supplementary Fig. 3B). We compared VAFs of the known WES and phased variants (Methods) to determine distributions of false positives and allelic dropout (ADO) rates, and estimated a median ADO of <10% (Supplementary Fig. 3C, D). To assess amplicon sequencing efficiency, we plotted all amplicon reads per cell in relation to amplicon GC content and amplicon length across all driver genes. As the percentage of GC content increases, there were more amplicons that did not pass the minimum threshold of 10 reads per amplicon per cell needed for genotyping, while there was no correlation between number of reads per amplicon per cell and insert length (Supplementary Fig. 3E, F). 
Our DNA panel achieved genotyping efficiency and cell recovery similar to those of previous scDNAseq panels(15–18), with the addition of mRNA capture (Supplementary Table 5).
To integrate mRNA data, we aligned mRNA amplicons and removed reads that did not align to primer positions. To distinguish between genomic DNA contamination and captured transcript, we filtered out reads with alignment to intronic regions (Supplementary Fig. 4A). We generated a cell by gene counts matrix using the filtered reads for downstream analysis and observed three clusters representing each cell line (Fig. 1D, Supplementary Table 6). The number of RNA reads that passed filtering was highly correlated with DNA reads in the cell line mixing (Supplementary Fig. 4B). We assigned cell line identity based on expression level of cell line-specific marker genes in each cluster (Fig. 1D). To determine genotype-phenotype linkage accuracy, we used genotype assignments as ground truth compared to RNA-based clustering, achieving 0.95 accuracy (Fig. 1E, Supplementary Fig. 4C). While scG2P does not contain UMIs(25), we minimized PCR biases by designing primers with similar melting temperatures, amplicon lengths, GC content, and specificity. We compared our pseudo-bulk mRNA levels to bulk RNAseq from CCLE, obtaining Spearman’s correlation of 0.72–0.76 across cell lines (Supplementary Fig. 4D).
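The genotype-phenotype linkage accuracy described above can be illustrated with a minimal Python sketch. It treats DNA-based cell line labels as ground truth and compares them with RNA-based cluster labels. The function names and toy labels are hypothetical, and averaging the per-cell-line accuracies without weighting is an assumption about how the reported mean accuracy was defined.

```python
from collections import Counter

def per_line_accuracy(dna_labels, rna_labels):
    """Fraction of cells in each DNA-defined cell line whose RNA label agrees."""
    totals = Counter(dna_labels)
    correct = Counter(d for d, r in zip(dna_labels, rna_labels) if d == r)
    return {line: correct[line] / totals[line] for line in totals}

def mean_accuracy(dna_labels, rna_labels):
    """Unweighted mean of the per-line accuracies (assumed definition)."""
    acc = per_line_accuracy(dna_labels, rna_labels)
    return sum(acc.values()) / len(acc)

# Toy labels for ten hypothetical cells (not data from the study).
dna = ["HCT116"] * 4 + ["KYSE270"] * 4 + ["KYSE410"] * 2
rna = ["HCT116"] * 4 + ["KYSE270"] * 3 + ["KYSE410"] * 3

print(per_line_accuracy(dna, rna))
print(mean_accuracy(dna, rna))
```

The per-line values correspond to the diagonal of a row-normalized confusion matrix such as the one in Fig. 1E.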
We tested the feasibility of increasing the number of captured transcripts in scG2P. Mixing Y79 and K562 cells, we included amplicons to capture 906 transcripts, while maintaining genotyping efficiency and cell type assignment (Supplementary Fig. 5A–C). 298/906 of transcripts suffered dropout, but overall correlation with bulk RNAseq signal of the remaining 608 transcripts was similar to the smaller sized panel (Spearman’s correlation of 0.92 for K562; 0.93 for Y79). Transcripts that suffered from dropout had increased GC content (Supplementary Fig. 5D). In future panel design, this issue can be circumvented by targeting alternate lower GC content exon-exon junctions in transcripts with multiple junctions.
Somatic clones with single driver mutations predominate in PNE
Bulk micro-dissection sequencing revealed prevalent somatic mosaicism in the esophageal epithelium; yet precise clonal structure (clone size, mutation ordering) cannot be fully resolved in bulk sequencing. To define the clonal architecture and link somatic mutations with cellular phenotypes in PNE, we isolated single nuclei from OCT-embedded punch biopsies from five donors of Japanese ancestry as input for scG2P (Methods). Barcoded amplicons were used to call cells and variants, followed by genotyping and clonal reconstruction (Supplementary Fig. 6A). We used as input 40,000–120,000 (median: 50,000) nuclei per donor and called 713–1,747 (median: 1,105) cells per sample (Methods; Supplementary Fig. 6B). Variants were filtered for genotyping cells using the following criteria: total read counts >= 10 per cell(15), alternate allele count >= 3(16), and genotyping quality >30. We removed variants with genotyping calls in <50% of cells and removed cells in which <50% of potential variants reported informative genotypes (genotyping completeness < 50%, Methods).
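The filtering criteria above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the function names and data layout are hypothetical, calls with one or two alternate reads are treated as uninformative, and variants are filtered before cells.

```python
def genotype_call(depth, alt_count, gq, min_depth=10, min_alt=3, min_gq=30):
    """Classify one cell-by-variant observation as 'MUT', 'WT', or None (missing)."""
    if depth < min_depth or gq <= min_gq:
        return None          # fails read depth (>= 10) or genotyping quality (> 30)
    if alt_count >= min_alt:
        return "MUT"         # alternate allele count >= 3
    if alt_count == 0:
        return "WT"
    return None              # assumption: 1-2 alternate reads is ambiguous

def filter_matrix(calls, min_fraction=0.5):
    """calls maps cell -> {variant: call}. Drop variants called in fewer than
    min_fraction of cells, then cells with fewer than min_fraction of the
    remaining variants called (order of the two steps is an assumption)."""
    cells = list(calls)
    variants = sorted({v for c in cells for v in calls[c]})
    kept_variants = [
        v for v in variants
        if sum(calls[c].get(v) is not None for c in cells) / len(cells) >= min_fraction
    ]
    kept_cells = [
        c for c in cells
        if kept_variants
        and sum(calls[c].get(v) is not None for v in kept_variants) / len(kept_variants)
        >= min_fraction
    ]
    return {c: {v: calls[c].get(v) for v in kept_variants} for c in kept_cells}

# Toy example with four hypothetical cells and three hypothetical variants.
calls = {
    "c1": {"v1": "MUT", "v2": "WT", "v3": None},
    "c2": {"v1": "WT", "v2": "WT", "v3": None},
    "c3": {"v1": None, "v2": None, "v3": "WT"},
    "c4": {"v1": "WT", "v2": None, "v3": None},
}
filtered = filter_matrix(calls)
print(sorted(filtered))        # cells that pass genotyping completeness
print(sorted(filtered["c1"]))  # variants that pass the call-rate filter
```

In this toy input, variant v3 is called in only one of four cells and is removed, after which cell c3 has no informative genotypes and is also removed.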
Due to variable amplicon coverage across genes (Supplementary Fig. 6C), we assessed amplicon performance across driver genes relative to GC content and measured mappability using Alignability scores(26) (Supplementary Fig. 7A). Amplicon subsets performed consistently across samples. Amplicon efficiency was negatively impacted when GC content > 55%, while mappability of low-performing amplicons remained similar (Supplementary Table 7). This suggests that GC content of amplicons, rather than mappability, amplicon length (Supplementary Fig. 3F), or technical noise across cells, more likely affects amplicon performance.
To further assess differences in gene capture, we measured the proportion of amplicons with enough depth to be genotyped per cell in each driver gene, observing high rates of amplicon capture, although with lower capture of NOTCH1 and NOTCH3 (Supplementary Fig. 7B). Using previous mutation data(8) (Fig. 1B, Supplementary Fig. 1B), we showed that amplicons cover all known mutations in PPM1D and FAT1, 95% of mutations in TP53, 96% in NOTCH2, 81% in NOTCH3, and 86% in NOTCH1 (Supplementary Fig. 7C). However, dropout and decreased amplicon capture efficiency at certain regions may prevent mutation detection even if an amplicon covers the region. To estimate potentially missed mutations due to lower locus genotyping, we multiplied the number of mutations in each locus(8) by the mean genotyping efficiency at that locus, summed these estimates of captured mutations, and divided by the total number of mutations in each gene (Supplementary Fig. 7D). The highest estimated rates of missed mutations occur in NOTCH1 (31%) and NOTCH3 (26%), which have the worst-performing amplicons.
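The missed-mutation estimate can be written as a short calculation. The sketch below assumes that the missed fraction is one minus the efficiency-weighted captured fraction, and that loci without an amplicon contribute an efficiency of zero; the function name and toy numbers are hypothetical.

```python
def missed_mutation_fraction(loci):
    """loci: iterable of (n_reported_mutations, mean_genotyping_efficiency).

    Returns 1 - sum(n * efficiency) / sum(n) for one gene.
    """
    loci = list(loci)
    total = sum(n for n, _ in loci)
    if total == 0:
        raise ValueError("gene has no reported mutations")
    captured = sum(n * eff for n, eff in loci)
    return 1.0 - captured / total

# Toy gene with three hypothetical loci (not values from the study):
# 10 mutations at 90% efficiency, 5 at 40%, and 5 in an uncovered locus.
toy_gene = [(10, 0.9), (5, 0.4), (5, 0.0)]
print(missed_mutation_fraction(toy_gene))
```

Weighting by the number of previously reported mutations means that a poorly performing amplicon over a hotspot is penalized more than one over a rarely mutated region.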
We detected 6–22 (median: 19) nonsynonymous somatic variants per sample (Fig. 2A, B, Supplementary Table 8). As expected from previous bulk data(3,8), mutations in NOTCH1, TP53, and NOTCH2, the first, second, and fifth most frequently mutated genes, respectively, in bulk PNE profiling(8), were detected in all samples (Fig. 2B).
Figure 2. Single-cell mutational landscape in aging esophagus.

A) Bar plot displaying the number of nonsynonymous mutations detected in each driver gene across all donor samples. B) Bar plot showing the contribution of nonsynonymous mutations by each driver gene for each donor sample. C) Illustration of the clonal structure map and clone fractions detected in ESO-5 using single-cell genotyping data. Terminal nodes (colored circles) represent distinct subclones defined by distinct variants across driver genes. Circle size represents the cell fraction of each subclone, ordered by size. Branch lengths are scaled to reflect the acquisition of a single mutation, displaying single mutant and double mutant (n = 2) clones. D) Fraction of cells harboring 0, 1, or 2 mutations per clone for cells passing the 50% genotyping completeness threshold for each donor sample. E) Fraction of cells with mutation in indicated driver gene, or combination of mutations in driver genes, for each donor sample.
To account for ADO, we filtered cells based on genotyping completeness (percentage of variant loci genotyped in a cell of all variant loci detected in the sample). Minimum genotyping completeness was set at 50% (Methods). While increased genotyping completeness stringency lowered the number of cells with high dropout, this also decreased total cells analyzed (Supplementary Fig. 8A). We used VAFs as features to assign cells to genotype clusters based on density-based clustering, merging similar clusters to obtain clonal assignment (Methods, Supplementary Fig. 6A). We tested a second clonal inference approach by adapting a reinforcement learning model(15). We used both methods to construct clonal structure, observing the same structure across all samples (Fig. 2C, Supplementary Fig. 8B, Methods). The median genotyping completeness of cells within a clone did not affect clone size estimates (Supplementary Fig. 8C). At 50% genotyping completeness, 31–89% (median: 45%) of cells had at least one driver mutation across donors, where most mutant cells had one mutation and clones with two mutations were observed in only two samples (Fig. 2D, E). To determine whether mutant cell fractions were underestimated due to dropout, we applied more stringent (>80%) genotyping completeness, at which the number of mutant cells increased slightly (34–92%; median: 52%) (Supplementary Fig. 8D). However, matching parental and double mutant clones in ESO-5 were identified at similar proportions compared to clonal analysis with 50% genotyping completeness (Supplementary Fig. 8E), suggesting that dropout does not drastically hinder detection of double mutant clones.
To quantify the size and diversity of clonal populations, we calculated the clone fraction (CF), defined as the proportion of cells within a clone out of all cells in that sample. CF was highly variable, reflecting a heterogeneous clonal landscape, with maximum CF ranging from 11–16% (median: 12%) across samples (Fig. 3A). A NOTCH1-mutated clone was the largest clone in 4/5 donors, comprising 10–16% of cells per sample (Fig. 2E, 3A), with 3–8 smaller NOTCH1-mutated clones per sample. At least one TP53-mutated clone was detected in all samples, and a TP53-mutated clone was the largest clone in ESO-5. The oldest donor (ESO-5, 71 years old) had many NOTCH3 mutations (Fig. 2C, E, 3A), which were absent in ESO-3 and were 3- to 5-fold less frequent than NOTCH1 mutations in the remaining three donors. NOTCH3 mutation rate variability was previously reported(3), potentially reflecting effects of age, environmental factors, or genetic background. The proportion of wild-type cells trended lower with age, while clonal diversity trended higher (Shannon Index; Methods; Fig. 3B), suggesting that the age-associated increase in esophageal mutation burden is due to large numbers of mutant clones of varied size, rather than a single large dominant clone. Larger cohorts will be required to confirm these findings. To assess the growth advantage imparted by mutations across clones, we stratified clone sizes by driver mutation, observing that NOTCH1 and TP53 mutations resulted in larger clone fractions than NOTCH2, NOTCH3, or FAT1 mutations (Fig. 3C). To compare with previous data(8), we re-plotted estimated clone fractions, stratifying clone sizes by driver mutations (Supplementary Fig. 8F). Similar to our results, bulk microdissections(8) contained the highest clone fractions with TP53 and NOTCH1 mutations, which occurred in high mutation frequency regions(8) (Fig. 3D, Supplementary Fig. 9), further suggesting a relationship between selection of clones with NOTCH1 and TP53 mutations and their larger sizes. The increased clonal diversity during aging (Fig. 3B) indicates that there is continued expansion of multiple clonal populations competing in aging tissue, consistent with the model where neighboring mutant clones of similar fitness compete in the spatially ordered esophageal epithelium to maintain tissue homeostasis over time(10), resulting in a lack of dominant clones with high enough selective advantage to drive down clonal diversity with age.
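Clone fraction and the Shannon index can be computed from per-cell clone assignments as in the minimal Python sketch below. The clone labels are hypothetical. Whether wild-type cells are excluded from the diversity calculation and which logarithm base is used are assumptions, since those details are in the Methods rather than in this section.

```python
import math
from collections import Counter

def clone_fractions(assignments):
    """Fraction of all cells in the sample assigned to each clone label."""
    counts = Counter(assignments)
    n_cells = len(assignments)
    return {clone: count / n_cells for clone, count in counts.items()}

def shannon_index(assignments, exclude=("WT",)):
    """Shannon entropy (natural log) over mutant clones.

    Assumption: wild-type cells are excluded before computing proportions.
    """
    counts = Counter(a for a in assignments if a not in exclude)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Toy sample of ten hypothetical cells: six wild-type and two equal-sized clones.
cells = ["WT"] * 6 + ["NOTCH1_a"] * 2 + ["TP53_a"] * 2
print(clone_fractions(cells))
print(shannon_index(cells))
```

A sample dominated by one clone yields an index near zero, whereas many clones of comparable size yield a higher index, which is the pattern the text describes for older donors.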
Figure 3. Mutant driver gene clonal architecture of the aging esophagus.

A) Quantification of the clonal fraction for each clone detected across five donors. Circles represent the clones, colored by individual driver mutation or double mutation, that are detected within each donor (x-axis, random order within sample). The y-axis represents the fraction of total cells that each clone constitutes within each sample (number of mutant cells in a clone divided by total number of cells in a sample). B) (Top) Scatter plot illustrating the Pearson correlation between the proportion of wild-type cells and donor age (R = 0.84). (Bottom) Scatter plot illustrating the Pearson correlation between clonal diversity, computed by the Shannon entropy index, and donor age (R = 0.48). Each point represents a donor, and the lines represent the linear fit. C) Clone fractions of cells with single driver mutation. Box plots display the distribution of mean mutant cell fractions of clones for each mutated gene across all donors (Center line, median; box, IQR; whisker, 1.5*IQR). Only cells with single mutations were included. D) Fraction of cells with mutation in the NOTCH1 (top) and TP53 (bottom) genes for ESO-(1–5). The cell fractions (left y-axis) of detected variants for each sample are plotted along the protein positions of NOTCH1 and TP53. Each variant is annotated with its coding impact (missense, synonymous, in frame, or frameshift mutation). The mutant cell fraction plot is overlaid with the hotspot mutation density, representing the number of mutations per codon (right y-axis) as previously reported by Yokoyama et al.(8) from bulk sequencing of the esophageal microdissections. NOTCH1 and TP53 domains are indicated with annotations from UniProt (EGF = Epidermal Growth Factor-like repeats, LNR = LIN12/Notch Repeat, ANK = Ankyrin Repeats, TAD = transactivation domain, DBD = DNA binding domain, OLD = oligomerization domain). 
The genotyping efficiency is displayed below each plot using a color scale that indicates the proportion of cells genotyped at each locus. Note that fraction of cells with mutation is represented by the clone fraction in all single-mutant clones, whereas in 2 out of 5 donors with detected double mutant clones, the mutant fraction is the sum of the parent clone and double mutant subclone fractions.
Bulk sequencing of PNE microdissections revealed examples of branching evolution of subclones after initial clonal expansion from a driver mutation for clones sufficiently large to apply pigeonhole principle assumptions(3,27). To investigate subclonal architecture, we leveraged our single-cell resolution to detect double-mutant clones that would be too small for the pigeonhole principle to apply, identifying double mutants in ESO-3 and ESO-5. In ESO-3, we detected a clone with mutations in NOTCH1 and TP53 (CF = 6%), but not its ancestor clone (Supplementary Fig. 8B). In ESO-5, we detected a clone with mutations in TP53 and NOTCH1 (CF = 8%) and its TP53-mutated parent clone (CF = 12%) (Fig. 2C). Additionally, we observed a second clone with co-mutations in NOTCH1 and FAT1 (CF = 4%) with the FAT1-mutant parent clone (CF = 6%; Fig. 2C). The ESO-5 TP53-mutated clone that produced the double mutant subclone was the largest clone in this donor (Fig. 2C, 3A), suggesting ongoing evolution in larger clones. Altogether, our scDNAseq strategy allows us to determine clonal structure in PNE, resolving clones with double mutations and unveiling a highly mutated esophageal epithelium dominated by a large number of single mutant clones driven primarily by NOTCH1 and TP53 mutations.
Genotype-phenotype mapping of somatic mutant clones in the PNE
Our scDNAseq strategy revealed a large number of clones of diverse sizes, driven primarily by single driver mutations, in the aging PNE. With the associated epithelial mRNA panel, we sought to further decompose cell type biases to gain insights into possible mechanisms supporting selection advantage in expanded clones. We annotated the cell clusters using canonical marker genes, identifying basal, basal DCN+, basal proliferative, suprabasal, suprabasal KLF4+, mature suprabasal, and fibroblast cells (Fig. 4A, Supplementary Table 9). We computed a cycling score using our panel’s cycling genes (UBE2C, MKI67, TOP2A) and a differentiation score as the difference between the early epithelial module score (KRT15, TP63, COL17A1, KRT14) and the late epithelial module score (SPRR3, S100A8, S100A9, KRT4, KRT13) (Fig. 4B, Supplementary Table 10). We confirmed that the differentiation and cell cycling modules generalize using reference scRNAseq. We re-analyzed scRNAseq data using only RNA panel transcripts, performing cluster analysis, assigning cell types (basal, basal proliferative, and suprabasal), and calculating differentiation and cycling gene module scores using the same gene modules (Supplementary Table 11). As in our scG2P data, the scRNAseq data showed the expected increase in differentiation score from basal to suprabasal cells and an increase in cycling score in basal proliferative cells (Supplementary Fig. 10A, B). We compared the cell cycling score to the G2M score used widely in scRNAseq and the differentiation scores to pseudotime values in the scRNAseq data, and observed strong correlation in both comparisons (Supplementary Fig. 10C, D), confirming that both gene modules are applicable to scRNAseq and scG2P data.
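The cycling and differentiation scores can be sketched in Python using the gene sets named above. The scoring function here is a plain mean of normalized expression, which is an assumption and may differ from the module-scoring method in the Methods. The sign convention (late minus early) is also an assumption, chosen so that the score increases from basal to suprabasal cells as the text reports. The toy expression values are hypothetical.

```python
CYCLING = ("UBE2C", "MKI67", "TOP2A")
EARLY = ("KRT15", "TP63", "COL17A1", "KRT14")
LATE = ("SPRR3", "S100A8", "S100A9", "KRT4", "KRT13")

def module_score(expr, genes):
    """Mean normalized expression over the module genes present in this cell."""
    values = [expr[g] for g in genes if g in expr]
    return sum(values) / len(values) if values else 0.0

def cycling_score(expr):
    return module_score(expr, CYCLING)

def differentiation_score(expr):
    """Late minus early module score (assumed sign convention)."""
    return module_score(expr, LATE) - module_score(expr, EARLY)

# Toy normalized expression for two hypothetical cells.
basal_like = {
    "KRT15": 2.0, "TP63": 1.0, "COL17A1": 1.0, "KRT14": 2.0,
    "SPRR3": 0.0, "S100A8": 0.0, "S100A9": 0.0, "KRT4": 0.5, "KRT13": 0.5,
    "UBE2C": 0.0, "MKI67": 0.0, "TOP2A": 0.0,
}
suprabasal_like = {
    "KRT15": 0.5, "TP63": 0.5, "COL17A1": 0.5, "KRT14": 0.5,
    "SPRR3": 2.0, "S100A8": 2.0, "S100A9": 2.0, "KRT4": 2.0, "KRT13": 2.0,
    "UBE2C": 0.0, "MKI67": 0.0, "TOP2A": 0.0,
}
print(differentiation_score(basal_like), differentiation_score(suprabasal_like))
```

Because the panel has no UMIs, the input values would need to be normalized (for example, centered log ratio as used in the figures) before scoring.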
Figure 4. Mapping cellular phenotypes to esophageal clones.

A) (Left) Uniform manifold approximation and projection (UMAP) of single cells from ESO-(1–5), clustered using RNA expression, colored by annotated cell types. (Right) Dot plot displaying cell type marker genes (x-axis) used for cell type annotation (y-axis). Dot size represents the percentage of cells expressing the marker gene, and color scale indicates the mean expression level (centered log ratio) within each cell type. B) Violin plots comparing cell cycling scores (left) and differentiation scores (right) across assigned cell types. Cell cycle and differentiation scores were calculated as gene-module scores based on cycling and differentiation gene sets expression from the RNA panel (Supplementary Tables 9,10). C) Diffusion map of epithelial single cells merged from ESO-1, ESO-2, ESO-3, and ESO-5, annotated by cell types (left) and overlaid with trajectory scores (right). The trajectory scores represent the differentiation stage of each cell along the inferred pseudotime trajectory. Fibroblasts from A were excluded from this analysis. D) Clonal composition (fraction of mutant clones comprising the total sample) calculated across the pseudotime diffusion map to assess the relative proportion of clones with different driver mutations throughout differentiation stages (from early to late) for ESO-5 sample. Clone fractions are summed by driver gene mutation and the relative abundance is plotted against pseudotime quantiles (from C) representing the differentiation trajectory.
Next, to examine the relationship between PNE driver mutations and differentiation, we mapped epithelial cells along diffusion components (Fig. 4C) and projected the clonal proportions across driver mutations along pseudotime deciles reflecting differentiation state to visualize cell state changes of mutants (Fig. 4D). We observed that clones with mutations in TP53 (ESO-2, ESO-5), NOTCH1+TP53 mutant (ESO-3), and TP53+NOTCH1 and FAT1+NOTCH1 double mutants (ESO-5) were enriched in earlier differentiation states (Fig. 4D, Supplementary Fig. 11). Through calculating the differential contribution of the mutant clones to cell populations across the stages of epithelial differentiation, we can assess differentiation bias as a potential mechanism for clonal fitness advantage, providing a quantitative framework to define how driver mutations alter cell state proportions across PNE differentiation.
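The projection of clonal proportions onto pseudotime deciles can be sketched as below. This is an illustrative sketch under stated assumptions: cells are represented as (pseudotime, driver label) pairs, deciles are formed by rank so that each bin holds roughly the same number of cells, and the function name `decile_composition` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def decile_composition(cells, n_bins=10):
    """cells: list of (pseudotime, driver_label).

    Cells are ranked by pseudotime and split into n_bins equal-sized rank bins.
    Returns one {label: fraction of cells in the bin} dict per bin, early to late.
    """
    ranked = sorted(cells, key=lambda c: c[0])
    n = len(ranked)
    bins = defaultdict(Counter)
    for rank, (_, label) in enumerate(ranked):
        bins[min(rank * n_bins // n, n_bins - 1)][label] += 1
    composition = []
    for b in range(n_bins):
        total = sum(bins[b].values())
        composition.append(
            {lab: cnt / total for lab, cnt in bins[b].items()} if total else {}
        )
    return composition


# Toy sample (hypothetical): 20 cells, the 6 earliest carry a TP53 mutation.
toy_cells = [(i / 20, "TP53" if i < 6 else "WT") for i in range(20)]
for decile, fractions in enumerate(decile_composition(toy_cells), start=1):
    print(decile, fractions)
```

A driver whose fraction is highest in the first deciles and falls towards the last ones would correspond to the enrichment in earlier differentiation states described above.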
Widespread Notch1 loss in mice has minimal effects on epithelial structure and cell dynamics, but increases mutant clonal fitness(10). The dividing progenitor cells in the basal compartment generate either two progenitor cells, two differentiating cells, or one of each(15); wild-type tissue retains equal proportions of both progenitor and differentiated cells to maintain homeostasis, but mutations that bias differentiation to progenitor-like cells can drive clonal expansion. Murine lineage tracing experiments have demonstrated that heterozygous Trp53 mutants induce higher rates of progenitor production while maintaining epithelium structure(28,29). To assess the impact of mutations on differentiation status, we compared differentiation scores across wild-type cells and clonal populations, finding a wide distribution of differentiation biases across mutant clones compared to wild-type cells (Fig. 5A, Supplementary Fig. 12, Methods). For instance, 7/11 TP53-mutant clones and 18/28 NOTCH1-mutant clones had lower differentiation scores compared to wild-type cells, indicating that the majority of these mutant clones exhibited bias towards earlier differentiation states. Notably, the largest clone in every sample had early cell state bias, and double-mutant clones also showed increased bias towards early differentiation, despite different mutant combinations (Fig. 5A, Supplementary Fig. 12). Finally, when clones were stratified by mutated gene, TP53, NOTCH1, and NOTCH3 mutants were most biased towards early differentiation states (Fig. 5B). Thus, these driver gene mutations in human PNE are associated with differentiation biases that may underlie clonal expansions, similar to animal models. This phenotypic mapping helps explain how the acquisition of additional driver mutations induces clones that span larger spatial areas(8), as all double mutant clones were biased towards earlier differentiation states.
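The clone-level comparison to wild-type cells can be illustrated with a short sketch. It assumes that per-cell differentiation scores are min-max scaled within a sample and then averaged per clone, and that a clone is called early-biased when its mean falls below the wild-type mean; the actual procedure is defined in the Methods, and the function name `clone_bias`, the clone labels, and the toy scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def clone_bias(cells):
    """cells: list of (clone_label, differentiation_score) from ONE sample.

    Returns {clone: mean min-max-scaled score}; "WT" labels wild-type cells.
    """
    scores = [s for _, s in cells]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # guard against a sample with constant scores
    by_clone = defaultdict(list)
    for clone, s in cells:
        by_clone[clone].append((s - lo) / span)
    return {clone: mean(values) for clone, values in by_clone.items()}


# Toy sample with made-up scores (hypothetical).
sample = [("WT", 0.2), ("WT", 0.8), ("WT", 1.0),
          ("TP53_a", -0.5), ("TP53_a", 0.1),
          ("NOTCH1_a", 0.9), ("NOTCH1_a", 1.5)]

means = clone_bias(sample)
early_biased = sorted(c for c, m in means.items()
                      if c != "WT" and m < means["WT"])
print(means)
print("early-biased clones:", early_biased)
```

Scaling within each sample keeps the comparison relative to that donor's own wild-type cells, which is what allows counts such as 7/11 TP53-mutant clones to be pooled across donors.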
Figure 5. Esophageal epithelium is colonized by diverse somatic mutant clones with phenotypic biases.

A) Mean normalized differentiation scores calculated from the differentiation module score (y axis) capturing esophageal epithelial differentiation stages (Methods) of cells assigned to clones based on driver gene variant for each donor sample (x axis, random order within sample). Dot size represents the clone fraction (number of mutant cells in a clone divided by total number of cells in a sample) and color indicates the mutant driver gene. The differentiation scores are normalized through min-max scaling within each sample to allow for comparison across samples. B) Mean differentiation and cycling scores of cells assigned to clones with single driver mutation, aggregated by mutant driver genes across all samples. Each point represents aggregation of cells with mutation in a specific driver gene, with error bars indicating the standard error of the mean (SEM) across cells. C) Clones from the ESO-5 sample projected onto the differentiation and cell cycling score axes (mean score of cells assigned to clones). Each dot represents a clone, with the dot size proportional to the clone fraction and color indicating the mutant driver gene. WT clone fraction is fixed at 0.1 for visualization purposes. True WT proportions represented in Fig. 2D. D) Diffusion map from Fig. 4C overlaid with kernel density estimates over those dimensions for cells with TP53:p.R135W mutation. E) Volcano plot comparing differentially expressed transcripts between TP53-mutant cells and wild-type cells. Horizontal dotted line represents FDR < 0.05; vertical dotted lines represent average log2 fold change (log2FC) > 0.15. F) Clones from the ESO-5 sample projected according to their mean expression of KLF5 and TP63. Dot size represents the clone fraction, and color indicates the mutant driver gene. WT cells are fixed at clone fraction of 0.1 and used to visualize as comparison to clone scores.
While differentiation biases may contribute to clonal expansions, driver mutations may also directly impact cell proliferation as an alternative route to enhanced fitness. We therefore compared the differentiation state and cell cycling activity of single-mutant clones, uncovering that while TP53 mutants showed both early differentiation and higher cell cycling activity, NOTCH1 mutants only showed early differentiation bias (Fig. 5B). For double-mutant clones (detected in two samples), we observed bias towards early differentiation or increased proliferation. Specifically, in ESO-5, a FAT1+NOTCH1 mutant clone and its parent FAT1-mutant clone were both highly skewed towards early differentiation state and increased proliferation score. Another TP53+NOTCH1 mutant clone and its parental TP53-mutant clone had the highest cell cycling score in this sample (Fig. 5C). In ESO-3, a TP53+NOTCH1 double mutant clone and a TP53-mutant clone both were highly biased towards early differentiation relative to wild-type cells (Supplementary Fig. 12). These data suggest that driver gene mutations may promote clonal expansion through mechanisms that stall differentiation or increase proliferation, or through the combination of both.
ESO-5 contained cells with a TP53:p.R135W mutation that are phenotypically biased along the epithelial differentiation trajectory (Fig. 5D). Although the increased fitness of TP53-mutant cells has been documented, particularly in ESCC(29), the molecular basis in the PNE context is not well understood. When compared to wild-type cells in the same donor, TP53-mutant cells overexpressed basal cell markers (KRT14, KRT15), cycling genes (UBE2C, MKI67), and keratinocyte differentiation regulators (TP63, KLF5) (Fig. 5E). Notably, p53 normally interacts with regulators of differentiation, including p63 and members of the KLF family (e.g., KLF4, KLF5) that are commonly dysregulated in esophageal cancers(30,31). As such, disrupted p63 and KLF signaling may underlie the bias towards earlier differentiation states in these TP53-mutant clones. Indeed, in our oldest donor ESO-5, cells in the large TP53-mutant clone (along with its subclone harboring an additional NOTCH1 mutation) overexpressed TP63 and KLF5 compared to wild-type cells (Fig. 5F; see Supplementary Fig. 13 for TP63 and KLF5 expression profiles for other donors). In mice, p53 loss in esophageal epithelial layers has been shown to increase p63 levels, allowing the cells to activate p21 in lieu of p53 to maintain tissue integrity(32). In addition, TP63 expression marks stem and progenitor basal cells(33), and has been hypothesized to act as a switch from proliferating basal to differentiating suprabasal cells(32). These reports suggest that the increase in TP63 expression that we observe in our TP53-mutant clones could allow these cells to respond to DNA stress, maintain an early progenitor state and stall differentiation. In turn, these phenotypes could increase the fitness of TP53-mutant cells over wild-type cells.
Together, we demonstrate that linking phenotypic readouts to a cell’s genotype is required to uncover how stalled differentiation and enhanced proliferation can enable mutant cells to gain fitness advantages in PNE without drastically disrupting tissue composition or cell dynamics.
Assessment of loss of heterozygosity at driver gene loci in PNE
Loss of heterozygosity (LOH) events have been frequently observed in aged esophagus tissue in bulk sequencing data(3,8). To address this aspect of esophageal somatic mosaicism, we sought to leverage germline heterozygous single-nucleotide polymorphisms (SNPs) captured by our panel to determine driver gene LOH events. As our five samples lacked adjacent heterozygous germline SNPs (potentially due to Japanese ancestry(34)), we analyzed an additional donor (ESO-6), collecting four punch biopsies over multiple regions. Sorted single-cell suspensions (EPCAM+, DAPI-) from freshly isolated punch biopsies allowed us to recover 5,850 cells passing 50% genotyping completeness (Methods), a 5-fold increase over previous samples, with 4,910 cells passing 80% genotyping completeness (Supplementary Fig. 14A). We detected 1,141 cells with driver mutations, including two TP53-mutant clones and seven NOTCH1-mutant clones, one of which had acquired a second NOTCH1 mutation (Fig. 6A, Supplementary Table 12). This sample contained multiple NOTCH1 germline SNPs, which we leveraged to detect LOH at the NOTCH1 locus. Using the five germline SNPs included in our panel, the status of NOTCH1 LOH was determined for 4,976/5,850 cells (85%; Methods), with LOH events impacting either parental allele in 1,697 (34%) cells (LOH with only the A [n=432] or B [n=1,265] allele retained; Fig. 6B, Supplementary Fig. 14B, C, Supplementary Table 13). Notably, the clone with two NOTCH1 SNV mutations did not harbor LOH, potentially indicating bi-allelic loss that would obviate the selective pressure for LOH. Of the cells with LOH, 48% (811/1,697) harbored driver SNV mutations (Supplementary Fig. 14C), compared with 22% (732/3,279) of cells without NOTCH1 LOH. The cells with SNVs and NOTCH1 LOH included four clones with a NOTCH1 SNV, consistent with LOH amplifying the mutated allele. In addition, we captured a parental heterozygous NOTCH1-mutant clone along with its descendant clone that acquired NOTCH1 LOH (Fig. 6B).
That many cells with NOTCH1 LOH do not harbor a NOTCH1 SNV is somewhat surprising, as the expectation is that an initial point mutation followed by LOH leads to biallelic loss of NOTCH1, imparting a selective advantage to the cell. In part, our observations may be due to incomplete capture of SNVs due to ADO, or to less efficient amplicons leading to incomplete gene coverage. Indeed, NOTCH1 amplicon efficiency per cell (fraction of NOTCH1 amplicons with >= 10 reads per cell) is lower in cells with NOTCH1 LOH and without NOTCH1 mutations than in wild-type cells without any NOTCH1 mutations or in cells with both NOTCH1 mutations and NOTCH1 LOH (Supplementary Fig. 14D). Similarly, our sensitivity analysis showed lower cell numbers with increasing genotype completeness, indicating a higher likelihood that cells with NOTCH1 LOH may have decreased detection of NOTCH1 SNVs (Supplementary Fig. 14E). However, the finding of NOTCH1 LOH without a detected NOTCH1 SNV is also consistent with prior literature, where sequencing of 0.05 mm² microdissections of esophageal biopsies (King et al., bioRxiv, 2023) showed that 30% of samples with NOTCH1 copy-number aberration had no detectable NOTCH1 SNV, suggesting that clones with NOTCH1 LOH may not always have biallelic loss of NOTCH1. While the exact mechanism remains unknown, we speculate that variants in non-coding regions, or non-genetic mechanisms of gene silencing, may compromise NOTCH1 function in these cells.
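The proportions quoted above can be re-derived from the reported counts with a few lines of arithmetic. The sketch below uses only the cell counts given in the text; the odds ratio is an added illustration of effect size and is not a statistic reported by the authors.

```python
# Counts reported in the text for ESO-6.
genotyped_cells = 5850          # cells passing 50% genotyping completeness
loh_status_called = 4976        # cells with a determined NOTCH1 LOH status
loh_a, loh_b = 432, 1265        # LOH with the A or B allele retained
loh_cells = loh_a + loh_b
no_loh_cells = loh_status_called - loh_cells
loh_with_snv = 811              # LOH cells with a driver SNV
no_loh_with_snv = 732           # non-LOH cells with a driver SNV

frac_called = loh_status_called / genotyped_cells
frac_loh = loh_cells / loh_status_called
frac_snv_given_loh = loh_with_snv / loh_cells
frac_snv_given_no_loh = no_loh_with_snv / no_loh_cells

# Illustrative effect size (not reported in the source).
odds_ratio = (loh_with_snv * (no_loh_cells - no_loh_with_snv)) / (
    (loh_cells - loh_with_snv) * no_loh_with_snv
)

print(f"LOH status called: {frac_called:.0%}")
print(f"Cells with LOH: {frac_loh:.0%}")
print(f"Driver SNV given LOH: {frac_snv_given_loh:.0%}")
print(f"Driver SNV given no LOH: {frac_snv_given_no_loh:.0%}")
print(f"Odds ratio: {odds_ratio:.2f}")
```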
Figure 6. Uncovering clones with NOTCH1 loss of heterozygosity.

A) Illustration of the clonal SNV structure map of ESO-6 generated by identifying driver gene variants from the single-cell genotyping data and determining mutant cell fraction. Terminal nodes (colored circles) represent distinct subclones defined by distinct variants across driver genes. Circle size represents the cell fraction of each subclone. Branch lengths are scaled to reflect the acquisition of a single mutation, displaying single mutant and double mutant clones. B) Germline NOTCH1 SNPs (n = 5, X-axis) are genotyped to define loss of heterozygosity at NOTCH1 across clones (NOTCH1 SNP VAF). In addition, SNV VAF is shown for each mutation within the indicated clone. C) Clones from the ESO-6 sample projected onto the differentiation and cell cycling score axes (mean score of cells assigned to clones). Each dot represents a clone, with the dot size proportional to the clone fraction, shape indicating NOTCH1 LOH status, and color indicating the mutant driver gene. Specific clones are highlighted with bold borders: the NOTCH1 single mutation clone with its subclone with two NOTCH1 mutations with red borders, and the NOTCH1 single mutation clone with its subclone that acquired LOH with light brown borders. D) Scaled differentiation score with Standard Error of Mean (SEM) of cells classified as wild-type (WT), cells with one NOTCH1 mutation only (NOTCH1 SNV), cells with NOTCH1 LOH, cells with both NOTCH1 SNV mutation and LOH, and cells with two NOTCH1 SNV mutations in ESO-6. LOH A = NOTCH1 loss of heterozygosity with A allele retained; LOH B = NOTCH1 loss of heterozygosity with B allele retained; WT = wild type.
Leveraging the matched RNA panel, we performed genotype-to-phenotype analysis for cells from ESO-6 to assess the effect of mutations (driver variant and/or LOH event at NOTCH1) on cell cycling and differentiation score (Fig. 6C, Supplementary Table 14). Relative to scores for wild-type cells (no mutation across six driver genes and no LOH at NOTCH1), NOTCH1 LOH clones, including the two clones that additionally had NOTCH1 SNV mutations, had lower differentiation scores, consistent with our initial five samples, suggesting that cells with NOTCH1 aberrations have stalled differentiation. The double NOTCH1-mutant clone had an even lower differentiation score, with shifts towards earlier differentiation and increased cycling compared to the single-mutant parental clone (Fig. 6C). Similarly, the NOTCH1-mutant subclone that gained NOTCH1 LOH had a shift towards earlier differentiation compared to its parental clone. Direct comparison of differentiation scores showed that cells with complete loss of NOTCH1, including those with NOTCH1 SNV mutation + NOTCH1 LOH and double NOTCH1 SNV mutation, had earlier differentiation scores compared to wild-type, single NOTCH1-mutant or NOTCH1 LOH (but no NOTCH1 SNV mutation) cells (Fig. 6D).
To expand the LOH analysis to the other donor samples, which lacked sufficient adjacent heterozygous SNPs for LOH determination, we used the zygosity of a detected variant itself as a proxy for LOH within driver genes. Using known LOH events in ESO-6, we could classify NOTCH1 mutations in clones as homozygous with NOTCH1 LOH when the clonal VAF was greater than 0.85 (Methods), which we confirmed by using the germline NOTCH1 variants to determine LOH status for the eight NOTCH1-mutant clones (Supplementary Fig. 15A). Using this clonal cutoff, we assigned zygosity to mutations in all driver mutant clones across all donors (Supplementary Fig. 15B). While this method is unlikely to identify more complex scenarios where some cells with the SNV harbor an LOH event at that locus and some do not, our findings were largely consistent with bulk sequencing results(3,8), as homozygous mutations were highly frequent in NOTCH1, with >50% of cells and >30% of clones classified as having homozygous mutations (Supplementary Fig. 15C). FAT1 and NOTCH3 had one detected clone with a homozygous mutation each, whereas TP53 had two such clones. Altogether, our analysis of zygosity at driver genes aligns with previous reports showing that LOH is most common at NOTCH1 in the aging PNE.
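The clonal VAF cutoff can be expressed as a simple rule. The sketch below assumes that the clonal VAF is computed by pooling alternate and total reads over all cells assigned to a clone; how the clonal VAF is actually computed is specified in the Methods, and the function name and read counts here are hypothetical.

```python
LOH_VAF_CUTOFF = 0.85  # threshold stated in the text


def classify_zygosity(alt_reads, total_reads, cutoff=LOH_VAF_CUTOFF):
    """Call a clone's mutation homozygous (proxy for LOH) if its pooled VAF exceeds the cutoff.

    Assumption: alt_reads and total_reads are summed over all cells of the clone.
    Returns None when the clone has no coverage at the variant site.
    """
    if total_reads == 0:
        return None
    vaf = alt_reads / total_reads
    return "homozygous" if vaf > cutoff else "heterozygous"


# Made-up pooled read counts for three hypothetical clones.
print(classify_zygosity(alt_reads=940, total_reads=1000))  # VAF 0.94
print(classify_zygosity(alt_reads=480, total_reads=1000))  # VAF 0.48
print(classify_zygosity(alt_reads=0, total_reads=0))       # no coverage
```

As noted in the text, a single clone-level cutoff cannot separate clones in which only some cells carry the LOH event, since those would yield intermediate pooled VAFs.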
Discussion
Bulk genomic characterization of solid tissue CM has revealed the extent of clonal expansions, but not at resolutions required to identify all incidences of co-occurrence of mutations or mutation order in clones. In bulk deep-targeted sequencing of PNE, only 25 out of 844 microdissections had large enough clones to determine co-occurring driver mutations in the same cells through inference methods(3,8). Furthermore, bulk sequencing methods lack the capability to capture phenotypic differences between clones that elucidate mechanisms of fitness. A single-cell perspective can identify presence and order of nested mutations, along with capturing cell states of clones.
We developed scG2P to jointly capture genotypes at mutational hotspots and mRNA phenotype markers. Using scG2P, we analyzed normal esophagus biopsies from aged donors to delineate clonal structure and mutation complexity. By targeting the six most highly mutated driver genes, we detected a driver mutation in more than half of cells, aligning with previous bulk studies. By resolving clonal hierarchies, we address a key question in CM. While the same driver mutations are observed in cancer, malignant clones typically harbor multiple drivers(35,36). In contrast, we show through our single-cell DNA sequencing that clones in normal tissue harbor mostly single driver mutations, with only rare instances of two driver mutations in the same clone.
By adding targeted mRNA capture to genotyping in scG2P, we demonstrate the first genotype-to-phenotype mapping of solid tissue CM. The application of scG2P to PNE samples allowed the measurement of RNA markers of differentiation and cell cycle, and linked them to driver mutation identity, number of mutations and clone size. We inferred clonal fitness based on clone fraction, and found that clones with the highest fitness were biased towards early differentiation phenotypes. NOTCH1 mutants had stunted differentiation compared to wild-type cells within the same sample. In contrast, TP53 mutants showed both differentiation biases and increased proliferation as drivers of clonal outgrowth. We further link TP53 mutants to increased TP63 expression, suggesting a pathway to increased clonal fitness, as elevated levels of p63 have been observed both in esophageal tumors and in adjacent normal tissues, consistent with a role as an early event in carcinogenesis through a field effect(37). Finally, rare double mutant clones showed the highest phenotypic differences compared to wild-type cells, with both greater proliferation and more pronounced differentiation biases. These data support the model in which acquisition of multiple drivers is required for a stepwise malignant transformation process, and suggest that tissue integrity of the esophagus is maintained despite the overwhelming clonal driver gene colonization because clones only rarely harbor more than one driver.
We note several limitations. First, our analysis likely underestimates the true extent of clonal diversity and tissue colonization due to the technical challenges associated with high-throughput single-cell mutational profiling and effectively capturing GC-rich regions. However, our sensitivity analysis confirmed that we observe similar clonal proportions when increasing coverage stringency. Second, LOH events at the NOTCH1 locus are prevalent in aged PNE(3,8). scG2P detected a portion of the expected NOTCH1 LOH events from freshly isolated whole cells from a sample that harbored several adjacent heterozygous germline SNPs. An extension of the panel to include genomic loci that include germline SNPs across driver genes and adjacent genomic regions in chromosome 9q (NOTCH1) and 17p (TP53) could more accurately detect these events (Mays et al., bioRxiv, 2024). Last, our mRNA panel does not cover the full transcriptome, although we have demonstrated proof-of-concept expansion to >600 transcripts, with the potential to increase to transcriptome wide scale by incorporating existing protocols(38). Integration of whole transcriptome capture with this assay would increase the discovery power of novel markers and phenotypic changes associated with mutant genotypes. Future technology development will aim to increase the breadth of the genome and transcriptome capture at single-cell resolution, while retaining the advantages of throughput afforded by droplet microfluidic empowered single-cell genomics that can profile thousands of cells in a single experiment.
In summary, uncovering clonal evolution in normal tissues requires the ability to link genotype to phenotype in primary human samples. Our technology addressed a critical gap in single-cell genotype-phenotype mapping, where technologies have been limited in throughput(12,39) and in targeting isolated hotspots(13,14,40) by combining highly multiplexed genotyping across entire driver genes and phenotypic analysis using targeted mRNA capture to interrogate clonal diversification and differentiation biases. Moreover, scG2P addresses specific challenges in human solid tissue mosaicism by enabling the study of nuclei from archival tissues, with the throughput and multiplexing capability required for the study of clonally complex samples. This framework is poised to advance our understanding of CM as one of the most exciting frontiers in human genetics, with major opportunities for discovery related to cancer(8,15), aging(3,8), non-malignant chronic disease(5,6,41), and pre-cancer states(42–44).
Methods
Esophageal biopsy collection
We enrolled patients who underwent therapeutic or diagnostic endoscopy for upper gastrointestinal symptoms at Kyoto University Hospital. We obtained written informed consent from all patients. The studies were conducted in accordance with the Ethical Guidelines for Life Science and Medical Research Involving Human Subjects (Japan, 2021), which are based on the principles of the Declaration of Helsinki, and were approved by the institutional review board of Kyoto University (G0645). Characteristics of these patients and healthy individuals are summarized in Supplementary Information. A history of heavy alcohol drinking (HIGH indicated by ≥396 g alcohol per week) or tobacco smoking (HIGH indicated by ≥30 pack-years) was reported in these individuals (Supplementary Table 2), who were considered to have positive lifestyle ESCC risks (high-risk individuals). Biomaterials, including esophageal tissues, were newly collected from 11 individuals using endoscopic biopsy.
Tissue processing
Single cell dissociation
Tissue biopsies were rinsed in ice cold 1X PBS and kept in ice cold Keratinocyte Serum Free Medium (KSFM, ThermoFisher) immediately after collection until processing. Tissues were placed in 500 μL ice cold 1X PBS and minced with surgical scissors. 500 μL of 0.25% trypsin (ThermoFisher) was added to the sample, which was incubated at 37 °C with vortexing every 10 minutes. The reaction was stopped by adding 1 mL of KSFM with 10% FBS, and the suspension was strained through a 70 μm strainer. Cells were collected by centrifugation at 300 g for 5 minutes and washed with Cell Staining Buffer (Biolegend). Cells were incubated with FcX (Biolegend) for 10 minutes at 4 °C and then stained with FITC - anti-human EPCAM (BioLegend CAT #324204; RRID: AB_756078) and PE - anti-CD45 (BioLegend, CAT # 368510; RRID:AB_2566370) for 30 minutes at 4 °C before sorting for CD45-negative and EPCAM-positive cells.
Single nuclei dissociation
Tissue biopsies that were OCT embedded were trimmed of excess OCT, and then loaded on Singulator along with final concentration of 1.0 U RNAse inhibitor (Protector, Sigma) and 1mM DTT using standard nuclei extraction protocol. Sucrose (final concentration 250 mM) was added to nuclei output to mix and spun down at 500 g, 4 °C, for 5 minutes. Nuclei were washed with nuclei wash buffer (1X DPBS, 1.0 U RNAse inhibitor, 1mM DTT, 1% BSA) twice and filtered through a 40 μm strainer. Nuclei were stained with DAPI and counted on a Countess.
10x Chromium Single-cell RNAseq / 10x Visium Spatial Gene Expression
Single-cell RNAseq was carried out according to the manufacturer’s standard protocols using the 10x Chromium 3’ Single Cell Gene Expression protocol. For 10x Visium assays, tissue biopsies were OCT embedded and stored. On the day of analysis, excess OCT was trimmed, and the tissue was cryosectioned to 10 μm thickness and placed on Visium Spatial Gene Expression slides. Spatial transcriptomics was then carried out according to the manufacturer’s standard protocols. FASTQ files generated from Illumina sequencing were input to cellranger (v7.1.0, Chromium scRNAseq; RRID:SCR_023221) and spaceranger (v1.3.1, Visium spatial transcriptomics; RRID:SCR_025848) with standard parameters, aligning to hg38.
Cellranger and spaceranger outputs were exported to the R package Seurat(45) (v4.0.0; RRID:SCR_007322) for downstream analysis. Standard pre-processing workflow was used for QC and to select cells (number of genes > 200, and % mitochondrial reads < 5). Log normalization was applied to the data, with the top 2,000 variable features used for principal component analysis. The top 15 principal components were used for graph-based clustering and non-linear dimensional reduction to uniform manifold approximation projection (UMAP). The reference esophageal scRNAseq dataset from Madissoon et al.(23) was downloaded from CellxGene (RRID:SCR_021059), and a Jaccard similarity score was calculated between clusters and cell type annotations provided by the authors. Cluster annotation was performed manually by examining Jaccard similarity scores. The top differentially expressed genes were determined by Seurat’s FindMarkers function and compared to marker genes provided in Madissoon et al.(23)
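The Jaccard comparison between clusters and published annotations can be sketched as below. This is a minimal sketch assuming the similarity is computed on sets of cell barcodes (the cells in a cluster versus the cells carrying a given author annotation); the function name `jaccard_table`, the barcodes, and the labels are hypothetical.

```python
from collections import defaultdict


def jaccard_table(cluster_of, annotation_of):
    """Both arguments map cell barcode -> label.

    Returns {(cluster, annotation): |intersection| / |union|} over cell sets.
    """
    clusters, annotations = defaultdict(set), defaultdict(set)
    for cell, label in cluster_of.items():
        clusters[label].add(cell)
    for cell, label in annotation_of.items():
        annotations[label].add(cell)
    return {
        (c, a): len(c_cells & a_cells) / len(c_cells | a_cells)
        for c, c_cells in clusters.items()
        for a, a_cells in annotations.items()
    }


# Toy example with six hypothetical barcodes.
cluster_of = {"c1": 0, "c2": 0, "c3": 0, "c4": 1, "c5": 1, "c6": 1}
annotation_of = {"c1": "basal", "c2": "basal", "c3": "suprabasal",
                 "c4": "suprabasal", "c5": "suprabasal", "c6": "suprabasal"}

table = jaccard_table(cluster_of, annotation_of)
best = {c: max((a for (cc, a) in table if cc == c), key=lambda a: table[(c, a)])
        for c in set(cluster_of.values())}
print(best)
```

Taking the highest-scoring annotation per cluster gives a starting point only; the text describes the final cluster annotation as manual.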
Cell lines
KYSE-410 (Sigma-Aldrich CAT # 94072023–1VL; RRID:CVCL_1352), KYSE-270 (Sigma-Aldrich CAT #94072021–1VL; RRID:CVCL_1350), HCT-116 (ATCC CAT #CCL-247; RRID:CVCL_E7EB), K562 (ATCC CAT #CCL-243; RRID:CVCL_K562) and Y79 (ATCC CAT #HTB-18; RRID:CVCL_ZF06) were cultured according to manufacturer instructions before loading onto Tapestri platforms. Cell lines in culture were screened biweekly for mycoplasma contamination using the MycoAlert PLUS Mycoplasma Detection Kit (Lonza, #LT07–703). Reference RNAseq datasets for KYSE-410, KYSE-270, and HCT-116 were obtained from public datasets (Cancer Cell Line Encyclopedia, https://sites.broadinstitute.org/ccle/). K562 and Y79 cell lines were sent to Lexogen for RNA extraction and mRNA sequencing.
Whole-exome sequencing (WES) data for each cell line were downloaded from DepMap portal (depmap.org, DepMap Public 22Q1; RRID:SCR_017655). Mutation data from Cancer Cell Line Encyclopedia (CCLE; RRID:SCR_013836) for each cell line (CCLE_mutations) were extracted to filter for variants intersecting with our scG2P mutation panel.
scG2P RNA/DNA
DNA panel design
Our DNA panel was designed using the Mission Bio Tapestri Designer (RRID:SCR_025736) by inputting all mutations detected in Yokoyama et al.(8) to generate amplicon primers (hg19). The Mission Bio panels consist of amplicons ranging in size from 175 to 275 bp generated using primers of 18–35 bp in length. Amplicons can have decreased targeting performance in regions with high %GC content or that are highly repetitive. The final DNA panel size was composed of 118 DNA amplicons covering 7 driver genes (Supplementary Table 1). Notably, while previous scDNAseq panels have largely focused on isolated mutational hotspots, we designed amplicons tiling across entire gene coding regions to better characterize the mutational landscape. No variants were detected in ZFP36L2 (1 amplicon), which was not included in further analysis.
RNA panel design
A panel of 86 targets was originally designed to capture esophagus-specific mRNA targets (hg19 alignment). The expanded panel was designed to capture expressed genes in leukemic and retinoblastoma cell lines. For this design, the most frequent exon-exon junction was selected for each gene. The transcript of interest was confirmed to contain the targeted exon-exon junction, and the amplicon was designed to span these exons, so that amplification from cDNA can be distinguished from amplification from gDNA by the presence of intron sequence in the amplicon. Targets were removed if they could not include an exon-exon junction.
The primers were designed as 12–30 nt with ~20 nt preferable, within a GC range of 50–60%, and for amplicon lengths of 200–400 bp. A single set of capture primers was selected for each gene target (Supplementary Table 3). These primers were confirmed not to contain any common SNP sites at the 5 bases located at the 3’ end of the primer. The Gibbs free energy was also calculated and any RNA reverse primers with a low ΔG (ΔG < −6 kcal/mol using the OligoAnalyzer tool on https://www.idtdna.com/pages/tools/oligoanalyzer; RRID:SCR_001363) were redesigned. Additional checks were performed to minimize primer interactions. To assess all primers (DNA and RNA) for interactions, primers were checked to ensure that the 8 bases of the 3’ end were not an exact match with the 8 bases of the 3’ end of any other primer. A more stringent primer interaction check for the primers used for reverse transcription was performed only with the RNA reverse primers. The Multiple Primer Analyzer tool (Thermofisher) was used to confirm that the 5 bases at the 3’ end did not have an exact complementary match to any other RNA reverse primer, including the adaptor sequence. Further targets were excluded if they were not amplified or non-specifically amplified in the cell line mixing experiment.
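Two of the primer checks described above lend themselves to a short sketch: the length and GC design window, and the test that no two primers share the same 8 bases at the 3’ end. This is an illustration only; it does not reproduce the ΔG calculation or the complementarity check performed with the external tools, and the function names and primer sequences are hypothetical.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G and C bases in a primer sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_design_window(seq, min_len=12, max_len=30, gc_lo=0.50, gc_hi=0.60):
    """Length (12-30 nt) and GC (50-60%) window stated in the text."""
    return min_len <= len(seq) <= max_len and gc_lo <= gc_fraction(seq) <= gc_hi


def shared_three_prime_ends(primers, k=8):
    """primers: {name: sequence}. Returns groups of names sharing an identical 3' k-mer."""
    groups = defaultdict(list)
    for name, seq in primers.items():
        groups[seq.upper()[-k:]].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Made-up primer sequences for illustration (not from the panel).
primers = {
    "p1": "GACCTGAAGCTGTACGAGCT",
    "p2": "TTGCAGGCATCATACGAGCT",  # same last 8 bases as p1
    "p3": "AATTATATAATTAA",        # GC content far below the window
}

for name, seq in primers.items():
    print(name, len(seq), round(gc_fraction(seq), 2), passes_design_window(seq))
print("shared 3' ends:", shared_three_prime_ends(primers))
```

Grouping primers by their 3’ k-mer finds all exact matches in a single pass, which keeps the check practical when DNA and RNA primer pools are assessed together.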
Primer PCR handles were added to the RNA primers. The forward primers had 5’-GTACTCGCAGTAGTC-3’ appended to the 5’ end of the gene-specific primer and the reverse primers had 5’-GTGATACACGACTATGAGCGCTA-3’ appended to the 5’ end of the gene-specific primer.
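Appending the PCR handles is a simple concatenation at the 5’ end of each gene-specific primer, sketched below. The handle sequences are those given in the text; the function name `add_handles` and the gene-specific sequences are hypothetical.

```python
# Handle sequences as given in the text (written 5' to 3').
FWD_HANDLE = "GTACTCGCAGTAGTC"
REV_HANDLE = "GTGATACACGACTATGAGCGCTA"


def add_handles(fwd_gene_specific, rev_gene_specific):
    """Prepend the forward/reverse PCR handle to the 5' end of each gene-specific primer."""
    return FWD_HANDLE + fwd_gene_specific.upper(), REV_HANDLE + rev_gene_specific.upper()


# Made-up gene-specific sequences for illustration.
fwd_full, rev_full = add_handles("GACCTGAAGCTGTACGAGCT", "TTGCAGGCATCATACCAGGA")
print(fwd_full)
print(rev_full)
```

Because the handle sits at the 5’ end, the 3’ end of the full-length primer is still the gene-specific sequence.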
scG2P RNA+DNA protocol
Single cells or nuclei were counted by Trypan blue exclusion on a Countess 3 and resuspended in nuclei wash buffer (1X DPBS, 1.0 U Protector RNase inhibitor, 1 mM DTT, 1% BSA) at a concentration of 1,000–3,000 cells per μL to reach a minimum volume of 35 μL. The Mission Bio v2 chemistry was used for this protocol. Step one of the Mission Bio Tapestri protocol was performed by adding the cell suspension and encapsulation oil v2 (Mission Bio) to the Tapestri cartridge along with a mastermix containing reverse transcriptase and cell lysis reagents. This mastermix contained final concentrations of 10 U/μL of SuperScript IV (Thermo Fisher #18091050), 5 mM DTT (Thermo Fisher #18091050), 0.5 mM each dNTP (Thermo Fisher #18091050), 1X reaction buffer diluted from the supplied 5X stock (Thermo Fisher #18091050), 2 U/μL RNase inhibitor (Thermo Fisher #18091050), 1% NP-40 Surfact-Amps™ Detergent Solution (Thermo Fisher #85124), 0.04 mg/mL Proteinase K (Sigma #3115828001) and 14.6 μM RNA reverse primer pool. The Proteinase K was diluted in nuclease-free water prior to adding it to the mastermix and was added immediately before loading on the Tapestri cartridge to avoid degradation of the enzymes.
Encapsulation protocol (Mission Bio) was run following manufacturer’s instructions to generate the first emulsion. A gel loading tip was used to remove excess encapsulation oil from the emulsion tube. The lysis and digestion protocol was run on a preheated thermocycler with the following thermocycler protocol: 1) 50 °C for 60 minutes; 2) 80 °C for 10 minutes; 3) 4 °C hold.
Emulsions were loaded onto Tapestri cartridge with Barcoding Mix (293.8 μL Barcoding Master Mix v2 with final concentrations of 0.21 μM DNA forward primer pool, 2.1 μM DNA reverse primer pool, 0.42 μM RNA forward primer pool), Barcoding beads, and Barcoding oil v2 (Mission Bio). Barcoding program was run to generate a second emulsion and UV light was used to cleave off barcode-containing forward primers from Barcoding beads prior to PCR amplification. A gel loading tip was used to remove excess barcoding oil from the emulsion tubes.
Targeted PCR was performed using the following thermocycler protocol: 1) 98 °C for 6 minutes; 2) 10 cycles of 95 °C for 30 seconds, 72 °C for 10 seconds, 61 °C for 6 minutes, and 72 °C for 20 seconds; 3) 12 cycles of 95 °C for 30 seconds, 72 °C for 10 seconds, 51 °C for 4.5 minutes, and 72 °C for 20 seconds; 4) 72 °C for 2 minutes, 4 °C hold.
After targeted PCR, 15 μL of an extraction agent (Mission Bio) was added to samples to break emulsion. DNA clean up buffer and clean up enzyme were added to sample and incubated at 37 °C for 60 minutes at 350 RPM. Two AMPure XP library clean-up steps were performed at 0.72X.
To isolate the RNA PCR product, M-270 Streptavidin beads (ThermoFisher) were washed in Binding and Washing Buffer (10 mM Tris-HCl, 1 mM EDTA, 2 M NaCl). An RNA Biotin Oligo (/5BioTinTEG/GTGATACACGACTATGAGCGCTA/3C6/) was added to the PCR product at a final concentration of 0.7 μM. The PCR product with the RNA Biotin Oligo was incubated at 99 °C for 5 minutes and transferred immediately to ice for 5 minutes. The Streptavidin beads were then added and incubated at room temperature for 30 minutes before magnetic separation of the DNA and RNA fractions. A 0.72X AMPure cleanup was then performed on the DNA fraction.
Both fractions underwent library amplification from the entire product using Library Mix (Mission Bio). The DNA PCR product was amplified with the Mission Bio kit indexes (Mission Bio) and the RNA PCR product was amplified with custom indices using the following thermocycler protocol: 1) 95 °C for 3 minutes; 2) 15 cycles of 98 °C for 20 seconds, 62 °C for 20 seconds, 72 °C for 45 seconds; 3) 72 °C for 2 minutes and 4 °C hold.
All libraries were sequenced with paired-end 150 bp on Illumina Novaseq 6000 at the Weill Cornell Medicine Genomics Core.
The cell line mixing experiment was sequenced to a depth of 400X per amplicon per cell for RNA and 160X per amplicon per cell for DNA. For the esophageal samples, bulk microdissection data from biopsies of similar size indicated that clone fractions are typically below 0.05, so we increased DNA sequencing depth to 300X per amplicon per cell to improve detection power.
Data processing
FASTQ files from the DNA libraries were processed using the Tapestri pipeline (Mission Bio; RRID:SCR_025736) for adapter trimming, barcode correction, sequence alignment, cell calling, and variant calling. RNA libraries were processed for adapter trimming and sequence alignment with STAR (RRID:SCR_004463). Both DNA and RNA libraries were aligned to hg19.
Cell calling
Cell finding from the DNA library was performed using a “Correlation UMAP” algorithm in Tapestri. Given a barcode by amplicon read count matrix, two metric matrices were derived, normalized read counts and correlation-coverage. Normalized read counts were generated by first normalizing each amplicon read count to its mean plus one across all barcodes, then normalizing these values to the median read counts of the top 10% of all barcodes plus a small constant. Correlation-coverage was derived from calculating the log10 of the mean coverage for the barcode across all amplicons and the $r^2$ of the per-amplicon coverage for the barcode with that of the experiment as a whole. These two feature sets were then combined and scaled to have equal weighting under L1-distance. UMAP was then performed on this combined feature set. The resulting UMAP was then clustered with HDBSCAN and the resulting clusters were used to identify the cell cluster based on quality control metrics.
Variant calling
Mission Bio’s Tapestri-2.0.1 pipeline (RRID:SCR_025736) employs a barcode decoding step to demultiplex the sequence data, followed by BWA mapping of the sequences. Variant calling was performed on a per-cell basis with GATK HaplotypeCaller (v4.1.7.0; RRID:SCR_001876). The set of gVCFs were then merged and converted to a MissionBio object for further analysis.
Single-cell genotype calling
For our single-cell genotyping strategy, we first assigned genotypes to single cells, then filtered at the variant level for the quality of the variant and how frequently it is genotyped, followed by filtering at the cell level.
In the cell line mixing experiment, variants for genotyping cells required the following criteria: total read counts >= 10 per cell(15), alternate allele count >= 3(16), and genotyping quality score (GATK) >30. We further removed variants where genotyping calls were present in less than 50% of cells and removed cells with <50% informative genotypes (Genotyping completeness <50%).
In the esophagus samples, assigning a genotype to each variant in a cell required the following criteria: total read count >= 10 per cell (filtering on amplicon read depth as in Miles et al.(15)), alternate allele count >= 3 (filtering on mutant alleles as in Morita et al.(16)), and genotyping quality score >= 30 (as in GATK Best Practices(46)). Variants that did not pass these filters in a cell were annotated as no genotype call for that cell. We removed variants for which more than 50% of cells had no genotype call to generate a filtered variant list. After variant filtering, we removed cells that had no genotype call for more than 50% of the filtered variants to generate our final cell by genotype matrices. To determine whether mutant cell fractions could be underestimated due to dropout, we reanalyzed the dataset with a more stringent genotyping completeness filter (>80% genotyping completeness).
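The three filtering stages above (per-cell genotype assignment, variant-level filtering, cell-level filtering) can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the input layout, and the toy values are hypothetical, and we assume that a site with adequate depth and quality but fewer than three alternate reads is called wild type.

```python
from typing import Dict, Optional, Tuple

# One observation per (cell, variant): (total_depth, alt_count, genotype_quality)
Obs = Tuple[int, int, int]


def call_genotype(obs: Optional[Obs], min_depth: int = 10,
                  min_alt: int = 3, min_gq: int = 30) -> Optional[int]:
    """Return 1 (mutant), 0 (wild type) or None (no genotype call)."""
    if obs is None:
        return None
    depth, alt, gq = obs
    if depth < min_depth or gq < min_gq:
        return None
    # Assumption: adequate depth and quality with < min_alt alternate reads is wild type.
    return 1 if alt >= min_alt else 0


def filter_calls(obs: Dict[str, Dict[str, Obs]],
                 max_missing: float = 0.5) -> Dict[str, Dict[str, Optional[int]]]:
    """Apply variant-level, then cell-level, completeness filters."""
    cells = list(obs)
    variants = sorted({v for c in cells for v in obs[c]})
    calls = {c: {v: call_genotype(obs[c].get(v)) for v in variants} for c in cells}
    # Variant filter: drop variants with no call in more than max_missing of cells.
    kept_variants = [
        v for v in variants
        if sum(calls[c][v] is None for c in cells) / len(cells) <= max_missing
    ]
    # Cell filter: drop cells with no call at more than max_missing of kept variants.
    kept_cells = {}
    for c in cells:
        n_missing = sum(calls[c][v] is None for v in kept_variants)
        if kept_variants and n_missing / len(kept_variants) <= max_missing:
            kept_cells[c] = {v: calls[c][v] for v in kept_variants}
    return kept_cells


# Toy input (made-up values): variant B is dropped, then cell c3 is dropped.
example = {
    "c1": {"A": (50, 20, 99), "B": (40, 0, 99)},
    "c2": {"A": (30, 0, 60), "B": (5, 0, 10)},
    "c3": {"A": (2, 0, 5), "B": (3, 1, 5)},
}
filtered = filter_calls(example)
```

Raising `max_missing` stringency for cells (for example 0.2, corresponding to >80% completeness) reproduces the shape of the more stringent reanalysis described above.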
To assess genotyping quality with the designed panel, we computed two per-cell metrics: amplicon efficiency and genotyping efficiency. Amplicon efficiency was the number of amplicons with >= 10 reads in a cell divided by the total number of amplicons in the panel (n = 118); an amplicon with >= 10 reads could be genotyped in that cell, either to call a variant or to call the cell wild type at that site. We also calculated gene-specific amplicon efficiency, the per-cell fraction of amplicons in a given gene (e.g., NOTCH1) with >= 10 reads. Genotyping efficiency was the fraction of filtered variants with a genotype call in a cell (versus no call at that variant).
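As a small worked example of these per-cell metrics, the sketch below computes amplicon efficiency, gene-specific amplicon efficiency, and genotyping efficiency for one cell. The function names and the toy read counts are hypothetical.

```python
def amplicon_efficiency(read_counts, min_reads=10):
    """Fraction of amplicons with >= min_reads reads in one cell."""
    return sum(r >= min_reads for r in read_counts) / len(read_counts)


def gene_amplicon_efficiency(counts_by_amplicon, amplicon_gene, gene, min_reads=10):
    """Same metric restricted to the amplicons that tile one gene."""
    reads = [n for amp, n in counts_by_amplicon.items() if amplicon_gene[amp] == gene]
    return amplicon_efficiency(reads, min_reads)


def genotyping_efficiency(calls):
    """Fraction of filtered variants with a genotype call (not None) in one cell."""
    return sum(g is not None for g in calls) / len(calls)


# Toy cell with four amplicons (made-up values and amplicon names).
counts = {"amp1": 12, "amp2": 0, "amp3": 30, "amp4": 9}
genes = {"amp1": "NOTCH1", "amp2": "NOTCH1", "amp3": "TP53", "amp4": "TP53"}
overall = amplicon_efficiency(list(counts.values()))
notch1 = gene_amplicon_efficiency(counts, genes, "NOTCH1")
geno = genotyping_efficiency([1, 0, None, 0])
```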
We note that high GC content presents a challenge across single-cell amplification methods, regardless of primer design (targeted or random). Indeed, scWGS (15X coverage per cell) generated using primary template-directed amplification(39) (PTA) that uses random hexamers showed lower average coverage at NOTCH1 compared to other driver genes (Supplementary Fig. 16).
Clonal reconstruction and genotyping completeness
Clonal reconstruction proceeded in two steps. First, we implemented an unsupervised iterative soft-clustering algorithm to assign cells to cluster genotype states that maximized each cell’s membership vector. For sites where the genotype was missing in a cell, the weight was redistributed evenly across all other valid sites in our distance metric. Second, we implemented a reinforcement learning clonal trajectory model based on Miles et al.(15) to reconstruct clonal structure. Briefly, a reward matrix was constructed with rewards proportional to the genotype state prevalence observed in the sample. The action space was restricted to transitions in which a single site is mutated and the infinite sites assumption is satisfied. A Q-learning agent was trained using an epsilon-greedy policy with experience replay to learn the value of each genotype state transition. The resulting Q-matrix was then transformed into a genotype state graph, with states as nodes and mutational events as edges weighted by their Q-value. The optimal path to each genotype state observed in the sample was extracted to produce the resulting optimal clonal trajectory. We filtered out clones that were not reproducible with bootstrapping methods(15). Additionally, to examine whether clone sizes were correlated with genotyping completeness (Supplementary Fig. 8C), we thresholded for cells with >50% and >80% genotyping across all variants and observed no clone size biases based on genotyping completeness. We performed this analysis specifically for clones with nested double mutations where the single mutant clone was also detectable in the same sample (Supplementary Fig. 8D). To account for the possibility of allelic dropout at the second site, we used stringent filtering to confidently detect single and nested double mutant clones, and verified that the ratios of both were consistent between low and high genotyping completeness thresholds.
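The trajectory model can be illustrated with a toy tabular Q-learning agent. This is a hedged sketch under our own assumptions, not the implementation from Miles et al.(15) or from this study: genotype states are sets of mutated sites, each action mutates one not-yet-mutated site (so no reversion and no recurrent mutation), the reward for entering a state is its observed cell fraction, and all hyperparameters and counts are illustrative.

```python
import random
from itertools import permutations


def train_q(counts, sites, episodes=2000, alpha=0.1, gamma=0.9,
            eps=0.2, batch=8, seed=0):
    """Learn Q[(state, site)] for single-site transitions between genotype states."""
    rng = random.Random(seed)
    total = sum(counts.values())
    q, replay = {}, []

    def actions(state):
        return [s for s in sites if s not in state]

    for _ in range(episodes):
        state = frozenset()  # every episode starts from wild type
        while actions(state):
            acts = actions(state)
            if rng.random() < eps:  # explore
                site = rng.choice(acts)
            else:  # exploit the current estimate
                site = max(acts, key=lambda a: q.get((state, a), 0.0))
            nxt = state | {site}
            replay.append((state, site, counts.get(nxt, 0) / total, nxt))
            # Experience replay: update Q from a random minibatch of past transitions.
            for s0, a0, r0, s1 in rng.sample(replay, min(batch, len(replay))):
                future = max((q.get((s1, a), 0.0) for a in actions(s1)), default=0.0)
                old = q.get((s0, a0), 0.0)
                q[(s0, a0)] = old + alpha * (r0 + gamma * future - old)
            state = nxt
    return q


def best_path(q, target):
    """Highest-scoring order of single-site mutations from wild type to target."""
    best, best_score = None, float("-inf")
    for order in permutations(sorted(target)):
        state, score, path = frozenset(), 0.0, [frozenset()]
        for site in order:
            score += q.get((state, site), 0.0)
            state = state | {site}
            path.append(state)
        if score > best_score:
            best, best_score = path, score
    return best


# Toy sample (made-up counts): a NOTCH1 clone with a nested NOTCH1+TP53 subclone.
example_counts = {
    frozenset(): 600,
    frozenset({"NOTCH1"}): 300,
    frozenset({"NOTCH1", "TP53"}): 100,
}
q_table = train_q(example_counts, ["NOTCH1", "TP53"])
path = best_path(q_table, frozenset({"NOTCH1", "TP53"}))
```

In this toy case the preferred route to the double mutant passes through the observed NOTCH1 single mutant rather than through an unobserved TP53 single mutant, which is the nested structure the reward design is meant to favor. Enumerating permutations is only practical for a handful of sites; a shortest-path search over the state graph would be the natural replacement for larger panels.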
To ensure clone reproducibility, we adapted the bootstrapping analysis over 100,000 samplings to calculate 95% confidence intervals for the presence of each clone, and filtered out clones for which the lower bound of the 95% CI was < 1% of the total cell population. We used a fractional threshold because fewer total cells were analyzed here than in previous studies in blood, which set the minimum threshold to an absolute count of at least 10 cells(15), and because previous analysis showed that the detection rate of scDNAseq for cell line spike-in experiments was 1%(47). We adapted the reinforcement learning model from Miles et al.(15) to utilize a reward matrix, with the restriction that cells that acquire a mutation cannot revert it to the wild-type state. The resulting Q-matrix was then transformed into a genotype state graph with states as nodes and mutational events as edges weighted by their Q-value.
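The bootstrap filter can be sketched as below: cells are resampled with replacement, each clone's fraction is recomputed per resample, and a clone is kept only if the lower bound of its percentile 95% CI is at least 1%. This is an illustration under stated assumptions (percentile intervals, hypothetical function names, made-up clone sizes); the default number of resamples is reduced from the 100,000 used in the study to keep the example fast.

```python
import random
from collections import Counter


def bootstrap_clone_ci(clone_labels, n_boot=1000, seed=0):
    """Percentile 95% CI of each clone's cell fraction under cell resampling."""
    rng = random.Random(seed)
    n = len(clone_labels)
    clones = sorted(set(clone_labels))
    fractions = {c: [] for c in clones}
    for _ in range(n_boot):
        resample = Counter(rng.choices(clone_labels, k=n))
        for c in clones:
            fractions[c].append(resample.get(c, 0) / n)
    ci = {}
    for c in clones:
        vals = sorted(fractions[c])
        ci[c] = (vals[int(0.025 * (n_boot - 1))], vals[int(0.975 * (n_boot - 1))])
    return ci


def reproducible_clones(clone_labels, min_fraction=0.01, **kwargs):
    """Keep clones whose lower CI bound is at least min_fraction of all cells."""
    ci = bootstrap_clone_ci(clone_labels, **kwargs)
    return [c for c, (lower, _) in ci.items() if lower >= min_fraction]


# Toy sample (made-up sizes): the 5-cell TP53 clone (0.5%) is filtered out.
labels = ["WT"] * 700 + ["NOTCH1"] * 295 + ["TP53"] * 5
kept = reproducible_clones(labels, n_boot=200)
```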
Microdissection DNA sequencing data were obtained and filtered for estimated clone sizes from biopsy punches of 4 mm², which corresponds most closely to our samples. Estimated clone sizes were quantified based on non-synonymous mutations in selected driver genes. While direct comparison of estimated clone sizes using VAF in bulk sequencing versus single-cell methods has shown high correlation in blood samples(15,16,47), there are larger differences when comparing small clone sizes. Notably, in normal solid tissue mosaicism, a majority of detected clones have, as expected, small clone fractions below 0.1, indicating that clone sizes as determined by bulk versus single-cell methods in solid tissues may not align as directly as they do in blood.
RNA Sequencing Reads Filtering
To mitigate potential gDNA contamination in the scRNA panel, we developed a computational pipeline, PRIMR (https://github.com/landau-lab/PRIMR), that selectively identifies reads spanning an exon-exon junction. First, the BAM file output by STAR (v2.7.0; RRID:SCR_004463) was filtered to include only reads with no more than 5 bases aligned to an intronic region. This BAM file was then partitioned into amplicon-specific BAM files, where each file corresponded to a distinct amplicon (e.g., a gene from the panel). These BAM files were then sorted by read name (QNAME). Next, a paired-end BED-like file was generated, with each row representing a read and its mate (R1/R2) from the same fragment. The start and end positions of the reads were updated using the CIGAR string information, accounting for soft-clipped bases in the position calculations. This approach allowed for the identification of reads derived from gDNA, which mapped to the intronic region but had bases soft-clipped by the splice-aware STAR aligner so that they aligned exclusively to the exonic region. By utilizing the CIGAR-corrected positions, fragments mapping to intronic regions were filtered out, retaining only those fragments that exhibited an exon-exon junction in at least one of their reads [i.e., read 1 (R1) or read 2 (R2)]. These fragments were subsequently used to compute the corrected RNA counts matrix (Supplementary Fig. 4A).
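The core idea of the junction filter can be sketched in a few lines. This is a simplified illustration and not the PRIMR implementation: positions are 0-based half-open, soft-clipped bases are projected onto the reference as if they continued the alignment, exons are assumed non-overlapping, and the function names and coordinates are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def projected_blocks(pos, cigar):
    """Reference blocks covered by a read, with soft clips projected outward.

    Returns (blocks, spliced), where spliced is True if the CIGAR has an N op.
    """
    blocks, ref, spliced = [], pos, False
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            blocks.append((ref, ref + n))
            ref += n
        elif op == "D":
            ref += n
        elif op == "N":  # skipped region, i.e. a splice junction
            spliced = True
            ref += n
        elif op == "S":
            if not blocks:  # leading clip, projected upstream of the alignment start
                blocks.append((pos - n, pos))
            else:  # trailing clip, projected downstream of the alignment end
                blocks.append((ref, ref + n))
        # I, H and P consume no reference bases and are ignored here.
    return blocks, spliced


def intronic_bases(blocks, exons):
    """Number of projected bases that fall outside every exon interval."""
    total = 0
    for start, end in blocks:
        covered = sum(max(0, min(end, e_end) - max(start, e_start))
                      for e_start, e_end in exons)
        total += (end - start) - covered
    return total


def is_junction_read(pos, cigar, exons, max_intronic=5):
    blocks, spliced = projected_blocks(pos, cigar)
    return spliced and intronic_bases(blocks, exons) <= max_intronic


def keep_fragment(read1, read2, exons):
    """Keep a fragment if at least one mate is a clean exon-exon junction read."""
    return is_junction_read(*read1, exons) or is_junction_read(*read2, exons)


exons = [(100, 200), (300, 400)]  # toy two-exon amplicon
spliced_ok = is_junction_read(150, "50M100N50M", exons)
gdna_like = is_junction_read(180, "20M30S", exons)  # clipped at the exon edge, no junction
```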
Cell Line Mixing Data Processing and Analysis
DNA Data Processing and Analysis
Raw DNA sequencing data were extracted from the h5 file generated by the Mission Bio Mosaic pipeline. To reduce the dimensionality of the DNA VAF data, the UMAP algorithm implemented in the run_umap function of the Mission Bio Mosaic package (v3.4.0; RRID:SCR_017388) was used. This allowed us to visualize the high-dimensional data in a two-dimensional space. Subsequently, k-means clustering with k=3 was performed to identify the three distinct cell lines based on the allele frequencies of SNVs. Clusters were labeled using the SNVs that were unique to each cell-line, as shown in Fig. 1C.
RNA Data Processing and Analysis
For RNA data processing, the PRIMR R package (https://github.com/landau-lab/PRIMR), developed by our team, was employed to filter the raw sequencing reads to retain only exon-exon junction reads (method detailed above). Droplets identified as cells by the Tapestri cell caller and expressing more than 100 RNA reads were retained. To address potential multiplets and exclude remaining empty droplets, barcodes expressing a number of unique features beyond the 95th percentile or below the 5th percentile of cells were removed. The processed RNA counts matrix was normalized using a Centered-Log-Ratio transformation across cells as defined in the Seurat package (v4.0.0, RRID:SCR_007322). Principal Component Analysis (PCA) was then performed on all features to reduce dimensionality. Further dimensionality reduction was achieved using the RunUMAP function with the first 5 principal components, and parameters seed.use and min.dist set to 123 and 0.01, respectively. To identify clusters of cells, the Louvain algorithm was employed, calculating k-nearest neighbors and constructing the shared nearest neighbor (SNN) graph using the FindNeighbors function. The FindClusters function with a resolution parameter of 0.15 was used to determine the clusters.
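For readers unfamiliar with the transformation, the sketch below shows a centered log-ratio of the form log1p(x / exp(mean(log1p(x)))), which to our understanding mirrors Seurat's CLR definition, together with the percentile-based barcode filter. The axis along which CLR is applied, the linear-interpolation percentile, and all names and values are assumptions made for illustration.

```python
import math


def clr(x):
    """CLR of one count vector: log1p(x / exp(sum(log1p(x[x > 0])) / len(x)))."""
    if not x:
        return []
    geo = math.exp(sum(math.log1p(v) for v in x if v > 0) / len(x))
    return [math.log1p(v / geo) for v in x]


def percentile(values, p):
    """p-th percentile with linear interpolation between order statistics."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100
    lo, hi = math.floor(k), math.ceil(k)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)


def filter_by_feature_count(n_features, low=5, high=95):
    """Keep barcodes whose unique-feature count lies within [low, high] percentiles."""
    lo_cut = percentile(list(n_features.values()), low)
    hi_cut = percentile(list(n_features.values()), high)
    return [bc for bc, n in n_features.items() if lo_cut <= n <= hi_cut]


# Toy data: 101 barcodes with 1..101 detected features.
toy = {f"bc{i}": i for i in range(1, 102)}
kept_barcodes = filter_by_feature_count(toy)
normalized = clr([1, 1])
```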
Visualization and Cell Line Assignment Accuracy
A plot of the UMAP space demonstrated clear separation of the three cell lines, marked by expression of their respective markers KRT5, KRT7, and KRT23. The accuracy of the cell line assignment was calculated by computing the ratio of cells assigned to the same cell lines using both the RNA and DNA panels and the total number of cells.
Correlation Analysis between scG2P scRNA Data and CCLE Bulk Reference Datasets
Bulk RNA sequencing data for HCT116 (RRID:CVCL_E7EB), KYSE270 (RRID:CVCL_1350), and KYSE410 (RRID:CVCL_1352) cell lines were obtained from the CCLE database (RRID:SCR_013836) and the reported log2(TPM+1) values were used for the analysis. Bulk RNA sequencing data for Y79 (RRID:CVCL_ZF06) and K562 (RRID:CVCL_K562) were obtained from Lexogen and log2(TPM+1) values were calculated. scG2P single-cell RNA sequencing data were pseudo-bulked based on the DNA VAF clustering labels, using the average of the Centered Log-Ratio values normalized across features. Correlations were computed using the Spearman correlation coefficient and visualized per cell line using scatter plots.
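The pseudo-bulk and rank-correlation steps can be sketched as follows. This is a minimal pure-Python illustration with hypothetical function names and toy values; in practice a library routine for Spearman correlation would normally be used.

```python
import math


def pseudobulk(expr_by_cell, labels):
    """Average normalized expression per gene over cells sharing a label."""
    sums, n = {}, {}
    for cell, vec in expr_by_cell.items():
        lab = labels[cell]
        if lab not in sums:
            sums[lab], n[lab] = [0.0] * len(vec), 0
        sums[lab] = [a + b for a, b in zip(sums[lab], vec)]
        n[lab] += 1
    return {lab: [v / n[lab] for v in sums[lab]] for lab in sums}


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


# Toy example: two genes, three cells, two DNA-derived cluster labels.
cells = {"c1": [1.0, 2.0], "c2": [3.0, 4.0], "c3": [5.0, 6.0]}
labels = {"c1": "A", "c2": "A", "c3": "B"}
bulk_like = pseudobulk(cells, labels)
rho = spearman([1, 2, 3, 4], [10, 20, 30, 40])
```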
Correlation Analysis between scG2P Single-Cell Data and Drop-seq Dataset
HCT116 (RRID:CVCL_E7EB) single-cell data sequenced from the Park et al.(48) study using the Drop-seq protocol were downloaded from the GEO database (GSE149224). Data filtering was performed using the GSE149224_meta.information.csv file to retain only HCT116 cells with a dose equal to 0. scG2P single-cell RNA sequencing data were pseudo-bulked based on the DNA VAF clustering labels, using the average of the Centered Log-Ratio values normalized across features. Correlations were computed using the Spearman correlation coefficient and visualized per cell line using scatter plots.
Patient Samples Data Processing and Analysis
DNA Data Processing and Analysis
Raw DNA sequencing data were extracted from the h5 file generated by the Mission Bio Mosaic pipeline. To reduce the dimensionality of the DNA VAF data, the UMAP algorithm implemented in the run_umap function of the Mission Bio Mosaic package (v3.4.0; RRID:SCR_017388) was used. This allowed us to visualize the high-dimensional data in a two-dimensional space. Subsequently, k-means clustering with k=3 was performed to identify the three distinct cell lines based on the allele frequencies of SNVs. Clusters were labeled using the SNVs that were unique to each cell-line, as shown in Fig. 1C.
RNA Data Processing and Analysis
Raw RNA data for each patient were processed by the PRIMR pipeline (https://github.com/landau-lab/PRIMR) to select only for fragments exhibiting an exon-exon junction. Corrected RNA counts matrices were merged and droplets identified as cells by the Tapestri cell caller were retained. Cells expressing fewer than 50 RNA reads were removed from the analysis. The RNA counts matrix was normalized using a Centered-Log-Ratio transformation across cells as defined in the Seurat package. To account for variations in RNA sequencing depth, the ScaleData Seurat function was applied with the vars.to.regress parameter set to nCount_RNA; both the center and do.scale parameters were set to False. Dimensionality reduction was performed using the RunUMAP function on the normalized matrix, with parameters seed.use and min.dist set to 123 and 0.5. The FindNeighbors Seurat function was used to compute the SNN graph. Subsequently, the FindClusters function was employed to cluster cells using the Louvain algorithm, with a resolution parameter set to 0.15.
Cell Cycling and Differentiation Module Scores
Cell Cycling and Differentiation module scores were calculated using the AddModuleScore Seurat function. Briefly, the function calculates the average expression levels of each module at the single-cell level, subtracted by the aggregated expression of control feature sets. All analyzed features are binned based on averaged expression, and the control features are randomly selected from each bin. For this analysis, we used 10 control features per analyzed feature and 5 bins of aggregate expression levels for all analyzed features.
To validate the signature used with the RNA panel against transcriptome-wide scRNAseq data, we used publicly available 10x Genomics scRNAseq data for epithelial cells from Madissoon et al.(23), obtained from dissociated esophagus samples. The cell by gene counts matrix was subset to only the transcripts included in the RNA panel for downstream cluster analysis and cell type assignment (basal, basal proliferative and suprabasal). Gene module scores for differentiation and cycling were calculated using AddModuleScore from Seurat(45), as had been done for the scG2P data. The G2M score, which is commonly used as a measure of cell proliferation in scRNAseq data, was calculated alongside the cell cycling score in the scRNAseq data, and the correlation of the two scores was measured in the three cell types. The first two principal components of the scRNAseq data were used as input to calculate the slingshot pseudotime(49) from basal to suprabasal cells, and cells were binned in quintiles along the pseudotime trajectory. Differentiation scores of cells in each bin were plotted and a Pearson’s correlation was measured.
Diffusion map and trajectory inference
Fibroblasts were removed from the gene counts matrix and a diffusion map was generated from the neighbors graph of the scRNA data as implemented in scanpy(50) (v1.9.0; RRID:SCR_018139). Diffusion pseudotime was calculated along the diffusion components, rooted in our basal cell cluster. Clonal abundance fish plots were generated by calculating the density of mutant cells with driver gene mutations over pseudotime quantiles, representing the trajectory of differentiation from basal to differentiated cells. ESO-4 was not used for diffusion map construction due to the lower number of features detected compared to other samples.
NOTCH1 loss of heterozygosity calling
NOTCH1 variants that were called in intronic regions (n = 5) were matched in GnomAD (RRID:SCR_014964) to confirm germline variant status. Cells that passed filtering in ESO-6 were re-clustered using the Mosaic pipeline (DBSCAN function) using only the VAFs of the five germline variants, with three main clusters emerging that represented wild-type, LOH A, and LOH B. In total, 4,976/5,850 cells were assigned to one of the three clusters, with a median of five called genotypes from the five germline SNPs per cell and 96.5% of cells having at least four variants genotyped.
Estimation of zygosity in detected variants
Mean clone VAFs of each variant were calculated, and the clone was assigned as heterozygous or homozygous using a cutoff of VAF > 0.85. The heterozygous SNPs in NOTCH1 amplicons in ESO-6 provided an opportunity for ground truth determination regarding the relationship between clonal VAF and zygosity. In this donor, we detected eight total NOTCH1-mutant clones (four with LOH and four without LOH; Supplementary Fig. 15A). Using those labels, we performed 1,000-fold bootstrapping on clone-mean VAFs of the NOTCH1 driver mutation to identify the operating point that best separated LOH vs non-LOH (measured by Adjusted Rand Index and F1). A threshold of 0.85 for the clone-mean VAF maximized separation (F1 > 0.9). We applied this empirically-derived threshold to assign homozygous/hemizygous vs heterozygous status to driver variants in other esophagus samples (ESO-1-5).
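The threshold selection can be illustrated with a simple F1 scan over candidate cutoffs on labelled clone-mean VAFs. The sketch below omits the bootstrapping and the Adjusted Rand Index, uses made-up VAF values rather than data from ESO-6, and uses hypothetical function names; it only shows how a cutoff such as 0.85 can be scored and then applied.

```python
def f1_score(vafs, is_loh, threshold):
    """F1 for predicting LOH when clone-mean VAF exceeds the threshold."""
    pairs = list(zip(vafs, is_loh))
    tp = sum(1 for v, loh in pairs if v > threshold and loh)
    fp = sum(1 for v, loh in pairs if v > threshold and not loh)
    fn = sum(1 for v, loh in pairs if v <= threshold and loh)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_thresholds(vafs, is_loh, candidates):
    """All candidate cutoffs that reach the maximum F1."""
    scores = {t: f1_score(vafs, is_loh, t) for t in candidates}
    top = max(scores.values())
    return [t for t in candidates if scores[t] == top]


def assign_zygosity(clone_mean_vaf, cutoff=0.85):
    if clone_mean_vaf > cutoff:
        return "homozygous/hemizygous"
    return "heterozygous"


# Illustrative values only: four clones without LOH and four with LOH.
toy_vafs = [0.45, 0.50, 0.52, 0.60, 0.90, 0.93, 0.95, 0.98]
toy_loh = [False, False, False, False, True, True, True, True]
top = best_thresholds(toy_vafs, toy_loh, [0.55, 0.65, 0.75, 0.85, 0.95])
```

With cleanly separated toy data several cutoffs tie at the maximum F1; resampling the labelled clones, as described above, is one way to choose a stable operating point among them.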
Primary template-directed amplification of epithelial cells
Single cell suspensions were stained, washed 2 times with staining buffer (BioLegend #42020), incubated with FcX for FcR blocking for 10 minutes at 4 °C, then incubated with anti-human CD45 (APC-Cy7, BioLegend #368515, RRID:AB_2566375) and anti-human EPCAM (FITC, BioLegend #324203, RRID:AB_756077) antibodies. Samples were washed 3 times with Cell Staining Buffer and sorted at the Weill Cornell Sorting Core Facility using single-cell sorting mode on the BD Symphony S6 into 96-well plates pre-filled with Cell Buffer (BioSkryb).
Primary template-directed amplification was performed using the BioSkryb ResolveOME kits. Briefly, cells underwent a reverse transcription reaction, followed by a stop reaction. Cells then underwent nuclear lysis to access genomic DNA, DNA neutralization, followed by whole-genome amplification with the primary template-directed amplification(39) for 10 hours. Affinity bead separation was used to separate DNA and cDNA fractions. The DNA fraction was used as input for fragmentation, A-tailing, and ligation of sequencing adapters. Libraries were cleaned up with ResolveOME beads and the library was amplified for 10 cycles according to manufacturer’s protocol, before a final clean up step. The quantified library was sequenced on Ultima Genomics UG100 to 15X genome-wide coverage per cell.
500-kb windows were used to determine coverage along driver gene exons of NOTCH1, NOTCH2, NOTCH3, TP53, FAT1, PPM1D, and ZFP36L2.
Plotting
All schematics were generated in Biorender. Plots were generated with ggplot2 (v3.4.4; RRID:SCR_014601) in R (v4.2.1; RRID:SCR_001905) or seaborn (RRID:SCR_018132) and matplotlib (RRID:SCR_008624) in python (RRID:SCR_008394).
Supplementary Material
Statement of significance:
Joint single-cell capture of somatic mutations and mRNA transcripts reconstructs clonal architecture and associated phenotypes of the phenotypically normal esophagus, providing the first single-cell genotype-phenotype map of this clonally mosaic tissue to accelerate our understanding of human somatic evolution in solid tissues and provide a window into early cancerous states.
Acknowledgments
We thank H. R. He at Weill Cornell, J. Park and all members of the Landau laboratory. D.A.L. is supported by the Burroughs Wellcome Fund Career Award for Medical Scientists, Vallee Scholar Award, Blood Cancer United Scholar Award and the Mark Foundation Emerging Leader Award. This work was also supported by the National Cancer Institute (R33 CA267219), the National Human Genome Research Institute and the Center of Excellence in Genomic Science (RM1HG011014) and the National Institutes of Health Common Fund Somatic Mosaicism Across Human Tissues (UG3NS132139–01). This work is supported by the Japan Agency for Medical Research and Development (AMED): The Core Research for Evolutional Science and Technology (CREST) (JP22gm1110011 to S.O.) and the Moonshot Research and Development Program (JP22zf0127009 to S.O.), AMED (JP23ck0106798 to N.K.), the Japan Society for the Promotion of Science (JSPS), Scientific Research on Innovative Areas (JP15H05909 to S.O.), JSPS KAKENHI (JP20H03660 to A.Y.) and the Japan Science and Technology Agency (JST): Fusion Oriented Research for disruptive Science and Technology (FOREST) Program (JPMJFR215V to N.K.) and Moonshot Research and Development Program (JPMJMS2022–25 to N.K.). This work was made possible by the MacMillan Family Foundation and the MacMillan Center for the Study of the Non-Coding Cancer Genome at the New York Genome Center. This work was supported in part by the Clinical and Biospecimen and Research Core of the Columbia University Digestive and Liver Disease Research Center (P30 DK132710).
Footnotes
Conflicts of interest:
D.J.Y. has received travel support from Ultima Genomics and Bioskryb Genomics, outside of this work. D.A.L. serves on the Scientific Advisory Board of Mission Bio, Pangea, Alethiomics, Montage, Ultima and Veracyte. D.A.L. has received prior research funding from 10x Genomics, Illumina, and Ultima Genomics unrelated to the current manuscript. D.D. and S.W. are employees of Mission Bio. D.D. is listed as an inventor on a granted patent (US patent 11365441) and a submitted patent (US patent application 16/839,057). S.W. is listed as an inventor on a submitted patent (US patent application 16/936,378). No other authors report competing interests.
Data availability
Raw data are available from the European Genome-phenome Archive (EGA; accession number EGAS50000001429) in the form of raw FASTQs and h5 files for each sample.
Code availability
The PRIMR R package for RNA data processing to identify exon-exon junction reads is available from GitHub at https://github.com/landau-lab/PRIMR.
Notebooks for all the computational analyses are available at https://github.com/landau-lab/scG2P.
References
- 1.Landau DA, Carter SL, Stojanov P, McKenna A, Stevenson K, Lawrence MS, et al. Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell. 2013;152:714–26. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Landau DA, Tausch E, Taylor-Weiner AN, Stewart C, Reiter JG, Bahlo J, et al. Mutations driving CLL and their evolution in progression and relapse. Nature. Nature Publishing Group; 2015;526:525–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Martincorena I, Fowler JC, Wabik A, Lawson ARJ, Abascal F, Hall MWJ, et al. Somatic mutant clones colonize the human esophagus with age. Science. 2018;362:911–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Lee-Six H, Øbro NF, Shepherd MS, Grossmann S, Dawson K, Belmonte M, et al. Population dynamics of normal human blood inferred from somatic mutations. Nature. Nature Publishing Group; 2018;561:473–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Olafsson S, McIntyre RE, Coorens T, Butler T, Jung H, Robinson PS, et al. Somatic Evolution in Non-neoplastic IBD-Affected Colon. Cell. 2020;182:672–684.e11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Anglesio MS, Papadopoulos N, Ayhan A, Nazeran TM, Noë M, Horlings HM, et al. Cancer-Associated Mutations in Endometriosis without Cancer. N Engl J Med. 2017;376:1835–48. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Ng SWK, Rouhani FJ, Brunner SF, Brzozowska N, Aitken SJ, Yang M, et al. Convergent somatic mutations in metabolism genes in chronic liver disease. Nature. 2021;598:473–8. [DOI] [PubMed] [Google Scholar]
- 8.Yokoyama A, Kakiuchi N, Yoshizato T, Nannya Y, Suzuki H, Takeuchi Y, et al. Age-related remodelling of oesophageal epithelia by mutated cancer drivers. Nature. Nature Publishing Group; 2019;565:312–7. [DOI] [PubMed] [Google Scholar]
- 9.Kakiuchi N, Ogawa S. Clonal expansion in non-cancer tissues. Nat Rev Cancer. Nature Publishing Group; 2021;21:239–56. [DOI] [PubMed] [Google Scholar]
- 10.Abby E, Dentro SC, Hall MWJ, Fowler JC, Ong SH, Sood R, et al. Notch1 mutations drive clonal expansion in normal esophageal epithelium but impair tumor growth. Nat Genet. Nature Publishing Group; 2023;55:232–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Colom B, Alcolea MP, Piedrafita G, Hall MW, Wabik A, Dentro SC, et al. Spatial competition shapes the dynamic mutational landscape of normal esophageal epithelium. Nat Genet. 2020;52:604–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Macaulay IC, Haerty W, Kumar P, Li YI, Hu TX, Teng MJ, et al. G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nat Methods. 2015;12:519–22. [DOI] [PubMed] [Google Scholar]
- 13.Nam AS, Kim K-T, Chaligne R, Izzo F, Ang C, Taylor J, et al. Somatic mutations and cell identity linked by Genotyping of Transcriptomes. Nature. Nature Publishing Group; 2019;571:355–60. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Cortés-López M, Chamely P, Hawkins AG, Stanley RF, Swett AD, Ganesan S, et al. Single-cell multi-omics defines the cell-type-specific impact of splicing aberrations in human hematopoietic clonal outgrowths. Cell Stem Cell. 2023;30:1262–1281.e8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Miles LA, Bowman RL, Merlinsky TR, Csete IS, Ooi AT, Durruthy-Durruthy R, et al. Single-cell mutation analysis of clonal evolution in myeloid malignancies. Nature. Nature Publishing Group; 2020;587:477–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Morita K, Wang F, Jahn K, Hu T, Tanaka T, Sasaki Y, et al. Clonal evolution of acute myeloid leukemia revealed by high-throughput single-cell genomics. Nat Commun. 2020;11:5327. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Nadeu F, Royo R, Massoni-Badosa R, Playa-Albinyana H, Garcia-Torre B, Duran-Ferrer M, et al. Detection of early seeding of Richter transformation in chronic lymphocytic leukemia. Nat Med. Nature Publishing Group; 2022;28:1662–71. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Leighton J, Hu M, Sei E, Meric-Bernstam F, Navin NE. Reconstructing mutational lineages in breast cancer by multi-patient-targeted single-cell DNA sequencing. Cell Genomics. 2023;3:100215. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Yugawa T, Handa K, Narisawa-Saito M, Ohno S, Fujita M, Kiyono T. Regulation of Notch1 Gene Expression by p53 in Epithelial Cells. Mol Cell Biol. 2007;27:3732–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Ohashi S, Natsuizaka M, Yashiro-Ohtani Y, Kalman RA, Nakagawa M, Wu L, et al. NOTCH1 and NOTCH3 coordinate esophageal squamous differentiation through a CSL-dependent transcriptional network. Gastroenterology. 2010;139:2113–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Sakamoto K, Fujii T, Kawachi H, Miki Y, Omura K, Morita K, et al. Reduction of NOTCH1 expression pertains to maturation abnormalities of keratinocytes in squamous neoplasms. Lab Investig J Tech Methods Pathol. 2012;92:688–702. [DOI] [PubMed] [Google Scholar]
- 22.Rochman M, Wen T, Kotliar M, Dexheimer PJ, Ben-Baruch Morgenstern N, Caldwell JM, et al. Single-cell RNA-Seq of human esophageal epithelium in homeostasis and allergic inflammation. JCI Insight. 2022;7:e159093. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Madissoon E, Wilbrey-Clark A, Miragaia RJ, Saeb-Parsy K, Mahbubani KT, Georgakopoulos N, et al. scRNA-seq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation. Genome Biol. 2019;21:1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Busslinger GA, Weusten BLA, Bogte A, Begthel H, Brosens LAA, Clevers H. Human gastrointestinal epithelia of the esophagus, stomach, and duodenum resolved at single-cell resolution. Cell Rep. 2021;34:108819. [DOI] [PubMed] [Google Scholar]
- 25.Lindenhofer D, Bauman JR, Hawkins JA, Fitzgerald D, Yildiz U, Jung H, et al. Functional phenotyping of genomic variants using joint multiomic single-cell DNA–RNA sequencing. Nat Methods. 2025;22:2032–41.
- 26.Derrien T, Estellé J, Marco Sola S, Knowles DG, Raineri E, Guigó R, et al. Fast Computation and Applications of Genome Mappability. PLoS ONE. 2012;7:e30377.
- 27.Martincorena I, Roshan A, Gerstung M, Ellis P, Van Loo P, McLaren S, et al. High burden and pervasive positive selection of somatic mutations in normal human skin. Science. 2015;348:880–6.
- 28.Murai K, Skrupskelyte G, Piedrafita G, Hall M, Kostiou V, Ong SH, et al. Epidermal Tissue Adapts to Restrain Progenitors Carrying Clonal p53 Mutations. Cell Stem Cell. 2018;23:687–699.e8.
- 29.Murai K, Dentro S, Ong SH, Sood R, Fernandez-Antoran D, Herms A, et al. p53 mutation in normal esophagus promotes multiple stages of carcinogenesis but is constrained by clonal competition. Nat Commun. 2022;13:6206.
- 30.Yang Y, Goldstein BG, Chao H-H, Katz J. KLF4 and KLF5 regulate proliferation, apoptosis and invasion in esophageal cancer cells. Cancer Biol Ther. 2005;4:1216–21.
- 31.Yang Y, Bhargava D, Chen X, Zhou T, Dursuk G, Jiang W, et al. KLF5 and p53 comprise an incoherent feed-forward loop directing cell-fate decisions following stress. Cell Death Dis. 2023;14:1–10.
- 32.Suliman Y, Opitz OG, Avadhani A, Burns TC, El-Deiry W, Wong DT, et al. p63 Expression Is Associated with p53 Loss in Oral-Esophageal Epithelia of p53-deficient Mice. Cancer Res. 2001;61:6467–73.
- 33.Daniely Y, Liao G, Dixon D, Linnoila RI, Lori A, Randell SH, et al. Critical role of p63 in the development of a normal esophageal and tracheobronchial epithelium. Am J Physiol-Cell Physiol. 2004;287:C171–81.
- 34.Haga H, Yamada R, Ohnishi Y, Nakamura Y, Tanaka T. Gene-based SNP discovery as part of the Japanese Millennium Genome Project: identification of 190,562 genetic variations in the human genome. Single-nucleotide polymorphism. J Hum Genet. 2002;47:605–10.
- 35.Tomasetti C, Marchionni L, Nowak MA, Parmigiani G, Vogelstein B. Only three driver gene mutations are required for the development of lung and colorectal cancers. Proc Natl Acad Sci. 2015;112:118–23.
- 36.Bailey MH, Tokheim C, Porta-Pardo E, Sengupta S, Bertrand D, Weerasinghe A, et al. Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell. 2018;173:371–385.e18.
- 37.Hu H, Xia S-H, Li A-D, Xu X, Cai Y, Han Y-L, et al. Elevated expression of p63 protein in human esophageal squamous cell carcinomas. Int J Cancer. 2002;102:580–3.
- 38.Marshall JL, Doughty BR, Subramanian V, Guckelberger P, Wang Q, Chen LM, et al. HyPR-seq: Single-cell quantification of chosen RNAs via hybridization and sequencing of DNA probes. Proc Natl Acad Sci. 2020;117:33404–13.
- 39.Gonzalez-Pena V, Natarajan S, Xia Y, Klein D, Carter R, Pang Y, et al. Accurate genomic variant detection in single cells with primary template-directed amplification. Proc Natl Acad Sci. 2021;118:e2024176118.
- 40.Nam AS, Chaligne R, Landau DA. Integrating genetic and non-genetic determinants of cancer evolution by single-cell multi-omics. Nat Rev Genet. 2021;22:3–18.
- 41.Nanki K, Fujii M, Shimokawa M, Matano M, Nishikori S, Date S, et al. Somatic inflammatory gene mutations in human ulcerative colitis epithelium. Nature. 2020;577:254–9.
- 42.Ross-Innes CS, Becq J, Warren A, Cheetham RK, Northen H, O’Donovan M, et al. Whole-genome sequencing provides new insights into the clonal architecture of Barrett’s esophagus and esophageal adenocarcinoma. Nat Genet. 2015;47:1038–46.
- 43.Jaiswal S, Fontanillas P, Flannick J, Manning A, Grauman PV, Mar BG, et al. Age-related clonal hematopoiesis associated with adverse outcomes. N Engl J Med. 2014;371:2488–98.
- 44.Yoshizato T, Dumitriu B, Hosokawa K, Makishima H, Yoshida K, Townsley D, et al. Somatic Mutations and Clonal Hematopoiesis in Aplastic Anemia. N Engl J Med. 2015;373:35–47.
- 45.Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, et al. Comprehensive Integration of Single-Cell Data. Cell. 2019;177:1888–1902.e21.
- 46.Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, Del Angel G, Levy-Moonshine A, et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinforma. 2013;43:11.10.1–11.10.33.
- 47.Pellegrino M, Sciambi A, Treusch S, Durruthy-Durruthy R, Gokhale K, Jacob J, et al. High-throughput single-cell DNA sequencing of acute myeloid leukemia tumors with droplet microfluidics. Genome Res. 2018;28:1345–52.
- 48.Park SR, Namkoong S, Friesen L, Cho C-S, Zhang ZZ, Chen Y-C, et al. Single-Cell Transcriptome Analysis of Colon Cancer Cell Response to 5-Fluorouracil-Induced DNA Damage. Cell Rep. 2020;32:108077.
- 49.Street K, Risso D, Fletcher RB, Das D, Ngai J, Yosef N, et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genomics. 2018;19:477.
- 50.Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15.
Data Availability Statement
Raw data are available from the European Genome-phenome Archive (EGA; accession number EGAS50000001429) as raw FASTQ and h5 files for each sample.
The PRIMR R package, which processes RNA data to identify exon-exon junction reads, is available on GitHub at https://github.com/landau-lab/PRIMR.
Notebooks for all the computational analyses are available at https://github.com/landau-lab/scG2P.
