This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2024 Mar 8:2024.03.06.583775. [Version 1] doi: 10.1101/2024.03.06.583775

De novo detection of somatic variants in long-read single-cell RNA sequencing data

Arthur Dondi 1,2, Nico Borgsmüller 1,2, Pedro F Ferreira 1,2, Brian J Haas 3, Francis Jacob 4, Viola Heinzelmann-Schwarz 4; Tumor Profiler Consortium, Niko Beerenwinkel 1,2,*
PMCID: PMC10942462  PMID: 38496441

Abstract

In cancer, genetic and transcriptomic variations generate clonal heterogeneity, possibly leading to treatment resistance. Long-read single-cell RNA sequencing (LR scRNA-seq) has the potential to detect genetic and transcriptomic variations simultaneously. Here, we present LongSom, a computational workflow leveraging LR scRNA-seq data to call de novo somatic single-nucleotide variants (SNVs), copy-number alterations (CNAs), and gene fusions to reconstruct the tumor clonal heterogeneity. For SNV calling, LongSom distinguishes somatic SNVs from germline polymorphisms by reannotating marker gene expression-based cell types using called variants and applying strict filters. Applying LongSom to ovarian cancer samples, we detected clinically relevant somatic SNVs that were validated against single-cell and bulk panel DNA-seq data and could not be detected with short-read (SR) scRNA-seq. Leveraging somatic SNVs and fusions, LongSom found subclones with different predicted treatment outcomes. In summary, LongSom enables de novo detection of SNVs, CNAs, and fusions, and thereby the study of cancer evolution, clonal heterogeneity, and treatment resistance.

Introduction

Cancer cells accumulate genomic variations, such as single-nucleotide variants (SNVs), copy number alterations (CNAs), and gene fusions during their lifetime, leading to subpopulations with distinct genotypes. Together with changes in the tumor microenvironment, genomic variations result in distinct phenotypes, such as expression patterns (Lappalainen et al. 2013). Intratumor heterogeneity, i.e., the existence of cancer subpopulations with distinct genotypes and phenotypes, is presumed to be a leading cause of therapy resistance and one of the main reasons for poor overall survival in cancer patients with metastatic disease (Jamal-Hanjani et al. 2015; Ramón Y Cajal et al. 2020). The adaptive mechanisms underlying therapy resistance are of both genetic (SNVs, CNAs, gene fusions, etc.) and non-genetic (epigenetic, transcriptomic, microenvironment, etc.) origin. The first step to identifying therapy-resistant subclones is to capture those genetic and transcriptomic variants through sequencing (Mansoori et al. 2017; Marine et al. 2020). Unraveling different subpopulations is particularly challenging with bulk techniques; however, the advent of single-cell sequencing technologies has significantly improved our ability to decipher intratumor heterogeneity within complex tissue samples (Dagogo-Jack and Shaw 2018).

In scDNA-seq data, cancer cell subpopulations are inferred from SNVs and CNAs, which are conventionally obtained from exome or whole-genome sequencing approaches (Roth et al. 2016; Duan et al. 2018). In scRNA-seq, gene expression patterns are commonly used to differentiate between cell types or cancer cell subpopulations. However, relying solely on gene-level expression may be insufficient, as cells can express different isoforms, resulting in different phenotypes (Ding et al. 2020). Isoform-specific cancer resistance can be induced, for example, through alternative splicing (Mitra et al. 2009; Chen et al. 2022), polyadenylation (Guo et al. 2022), or large genomic rearrangements leading to gene fusions (Amatu et al. 2016; Lei et al. 2018; Cesi et al. 2018). These interlinked features need to be examined together, thus requiring complete isoform coverage (Foord et al. 2023). High-throughput droplet-based scRNA-seq protocols (10X Genomics Chromium) capture reads via their 3’ polyA tails. In short-read (SR) scRNA-seq, this results in a heavy coverage bias towards the 3’ ends, as only a few hundred base pairs of each molecule are sequenced. Long-read (LR) scRNA-seq, in contrast, sequences full-length RNA molecules, and thus can access gene expression at the isoform level (Joglekar et al. 2021; Al’Khafaji et al. 2023; Dondi et al. 2023).

Linking genetic to transcriptomic variations is crucial to understanding treatment resistance in cancer (Vasan et al. 2019). However, this is challenging with SR sequencing, as genetic variations are difficult to recover from SR scRNA-seq data due to capture bias, while scDNA-seq cannot assess gene expression. Recently, DNA-free de novo scRNA SNV (Muyas et al. 2023; Zhang et al. 2023) and CNA (Serin Harmanci et al. 2020; Gao et al. 2021, 2023) calling methods were developed for SR sequencing, compensating for the 3’ capture bias by pooling large numbers of cells or sequencing at very high read depths per cell. However, SR sequencing is unsuited to detect isoforms or gene fusions. We have shown in recent work that LR scRNA-seq, being less sensitive to capture bias, is better suited to detecting genetic variations than SR scRNA-seq (Dondi et al. 2023). Furthermore, LR scRNA-seq can simultaneously detect SNVs, CNAs, fusions, and gene isoform expression in the same cells (Dondi et al. 2023; Shiau et al. 2023).

In this study, we present LongSom, a computational workflow for calling de novo somatic SNVs, fusions, and CNAs in LR scRNA-seq, and integrating them to reconstruct the samples’ clonal heterogeneity. Applying LongSom to omentum metastasis samples obtained from three chemo-naive high-grade serous ovarian cancer (HGSOC) patients, we show that it can detect clinically relevant somatic SNVs validated against scDNA and panel data, whereas SR scRNA-seq fails to do so. We demonstrate that by leveraging somatic SNVs and fusions, LongSom can detect subclones with different predicted treatment outcomes, and that those subclones are highly concordant with gene expression clusters and CNA subclones. Additionally, we find that tumor microenvironment cells are contaminated by cancer cell-derived mitochondria.

Results

Overview of LongSom

We developed LongSom, a workflow for detecting genetic variants in LR scRNA-seq data without requiring matched DNA sequencing and finding cancer subclones based on these. Briefly, LongSom takes BAM files as input, calls SNVs in pseudo-bulk and fusions and CNAs in single cells with the Trinity Cancer Transcriptome Analysis Toolkit (CTAT, https://github.com/NCIP/Trinity_CTAT), and then reconstructs the clonal heterogeneity using the Bayesian non-parametric clustering method BnpC (Borgsmüller et al. 2020).

LongSom first calls candidate SNV loci in a pseudo-bulk generated by aggregating LR scRNA-seq data from all cells, using CTAT-Mutations (https://github.com/NCIP/ctat-mutations), which we enhanced here for scRNA-seq and long isoform reads (see Methods). Next, to distinguish between somatic and germline variants, the variant allele frequency (VAF) is calculated for each candidate locus and each cell, and cells are grouped into cancer or non-cancer cells based on marker-gene expression. SNVs detected across multiple cell types are considered germline polymorphisms. Accordingly, if cancer cells are misannotated as non-cancer cells, SNVs will wrongly be filtered out as germline variants (false negatives). To account for this, LongSom first defines a set of cancer-specific variants (SNVs and fusions). SNVs are defined as cancer-specific if their VAF is high in cancer, low in non-cancer, and, when available, zero in normal sample cells (Methods). Fusions are detected using CTAT-LR-fusion (https://github.com/TrinityCTAT/CTAT-LR-fusion) (Qin et al. 2024) and cancer-specific fusions are those expressed in more than 5% of cancer cells and less than 1% of non-cancer cells. LongSom reannotates cells as cancer cells if they harbor at least two cancer-specific variants (Figure 1, Methods).
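The reannotation rule described above can be sketched as follows. This is a minimal illustration and not LongSom's actual code: the function names, the data layout (a mapping from cell barcode to the set of variants observed in that cell), and the choice to leave all other cells with their original label are assumptions. Only the fusion thresholds (more than 5% of cancer cells, less than 1% of non-cancer cells) and the two-variant rule come from the text; the SNV criteria for cancer specificity are given in Methods and are not reproduced here.

```python
def cancer_specific_fusions(cell_fusions, is_cancer,
                            min_cancer_frac=0.05, max_noncancer_frac=0.01):
    """Fusions expressed in >5% of cancer cells and <1% of non-cancer cells.

    cell_fusions: dict cell -> set of fusion names (hypothetical layout).
    is_cancer:    dict cell -> bool, the marker-gene-based annotation.
    """
    cancer = [c for c in cell_fusions if is_cancer[c]]
    noncancer = [c for c in cell_fusions if not is_cancer[c]]
    selected = set()
    for fusion in set().union(*cell_fusions.values()):
        frac_c = sum(fusion in cell_fusions[c] for c in cancer) / max(len(cancer), 1)
        frac_n = sum(fusion in cell_fusions[c] for c in noncancer) / max(len(noncancer), 1)
        if frac_c > min_cancer_frac and frac_n < max_noncancer_frac:
            selected.add(fusion)
    return selected


def reannotate(cell_variants, is_cancer, cancer_specific, min_variants=2):
    """Relabel a cell as cancer if it harbors >= min_variants cancer-specific variants.

    Assumption: cells below the threshold keep their original label.
    """
    return {
        cell: True if len(variants & cancer_specific) >= min_variants else is_cancer[cell]
        for cell, variants in cell_variants.items()
    }
```

In this sketch, a cell initially labeled non-cancer that carries, say, one cancer-specific fusion and one cancer-specific SNV would be relabeled as cancer, whereas a cell carrying only one of them would not.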

Figure 1: Overview of LongSom.

LongSom’s methodology for detecting somatic SNVs, fusions, and CNAs and subsequently inferring cancer subclones in LR scRNA-seq data from individual patients. (1) SNV candidates are detected from pseudo-bulk samples. (2) Population germline SNVs and SNVs present in normal samples (optional) are filtered out. (3) A cell-SNV matrix based on the remaining SNV candidates is computed. (4) A cell-fusion matrix is computed. (5) Using high-confidence cancer fusions and SNVs, cells are reannotated. (6) Following reannotation, SNVs present in non-cancer cells (germline) are filtered out. (7) Cells are clustered based on somatic fusions and SNVs. In parallel, (8) gene expression per cell is computed, (9) CNAs are detected, (10) cells are clustered based on CNAs, and (11) CNA clones are incorporated into the clustered matrix of fusions and SNVs.

After cell reannotation, LongSom performs germline SNV filtering in five steps: (A) It filters SNV loci detected in the matched normal, when available. (B) It filters SNV loci from the gnomAD database (Chen et al. 2024) with a frequency of at least 0.01% in the total population. (C) After cell-type reannotation, it filters SNV loci that were called in more than 1% of the non-cancer cells. (D) SNV loci where less than 1% of the non-cancer cells are covered by at least one read are filtered. This step helps to filter germline SNVs not detected due to low expression in non-cancer cells. (E) Finally, adjacent SNV loci within a 10,000 bp distance are filtered, as these are likely to be misalignment artifacts in low-complexity regions. Of note, steps (C) and (E) are not applied to mitochondrial SNVs. Finally, LongSom keeps somatic loci that are mutated in a minimum of five cancer cells or 5% of cancer cells (user-defined parameters) and filters loci matching known RNA-editing sites.
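As an illustration of the cell-count-based filters above, the following sketch applies steps (C), (D), and (E) and the final cancer-cell threshold to per-locus summaries. It is not LongSom's implementation: the dictionary layout and function name are hypothetical, steps (A) and (B) are omitted because they need external data (matched normal, gnomAD), step (E) is assumed to be evaluated among loci that survive (C) and (D), and the final threshold is read as "either criterion suffices", which is one possible interpretation of the text.

```python
def filter_loci(loci, n_noncancer, n_cancer,
                min_cells=5, min_frac=0.05, max_dist=10_000):
    """Return names of loci that pass sketches of steps C, D, E and the final threshold.

    Each locus is a dict with keys: name, chrom, pos, noncancer_mutated,
    noncancer_covered, cancer_mutated (all cell counts). Hypothetical layout.
    """
    kept = []
    for loc in loci:
        mito = loc["chrom"] == "chrM"
        # Step C: called in more than 1% of non-cancer cells (not applied to chrM).
        if not mito and loc["noncancer_mutated"] / n_noncancer > 0.01:
            continue
        # Step D: fewer than 1% of non-cancer cells covered by at least one read.
        if loc["noncancer_covered"] / n_noncancer < 0.01:
            continue
        kept.append(loc)

    # Step E: drop nuclear loci with another remaining locus within max_dist bp.
    def has_neighbor(a):
        return any(b is not a and b["chrom"] == a["chrom"]
                   and abs(b["pos"] - a["pos"]) <= max_dist for b in kept)

    kept = [a for a in kept if a["chrom"] == "chrM" or not has_neighbor(a)]

    # Final threshold: mutated in >= min_cells cancer cells OR >= min_frac of them
    # (the "or" reading is an assumption; both values are user-defined parameters).
    return [a["name"] for a in kept
            if a["cancer_mutated"] >= min_cells
            or a["cancer_mutated"] / n_cancer >= min_frac]
```

Exempting chrM from steps (C) and (E) mirrors the text, so a mitochondrial locus seen in many non-cancer cells can still be retained in this sketch.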

Finally, LongSom infers the clonal structure of the samples using two different approaches. One approach leverages the detected SNVs and fusions as input for the Bayesian non-parametric clustering method BnpC (Borgsmüller et al. 2020). The other approach predicts CNAs based on gene expression in cancer cells and defines subclusters using inferCNV (https://github.com/broadinstitute/infercnv) (Methods).

Cell-type reannotation improves somatic SNV detection sensitivity

We applied LongSom to previously published (Dondi et al. 2023) SR and LR scRNA-seq data of five omental samples obtained from three chemo-naive HGSOC patients: P1, P2, and P3. Three samples were derived from HGSOC omental metastases and two from matching distal tumor-free omental tissues (normal). After cell-type reannotation (Methods), the reannotated cell types were always more similar to the expression-based clustering (Jaccard similarity score in patient P1: 0.97, P2: 0.99, P3: 0.97) than the previous annotation derived from marker-gene expression (Jaccard similarity score in patient P1: 0.94, P2: 0.98, P3: 0.76), supporting the reannotation (Figure 2a). We found that 6%, 2%, and 21% of the cells that we annotated as cancer were previously annotated as non-cancer cells in the tumor biopsy samples of patients P1, P2, and P3, respectively (Figure 2b). The tumor biopsy of patient P3 had only 10% cancer cells (Dondi et al. 2023), which could explain the high level of cell misannotation. In the following, cancer or non-cancer cells refer to the reannotated cell types.
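For reference, the Jaccard similarity between two sets of cells is the size of their intersection divided by the size of their union. The sketch below assumes the score compares the set of cells labeled as cancer by an annotation with the set of cells in the expression-derived cancer cluster; the exact sets compared in the study are described in Methods, and the barcodes here are invented toy values.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of cell IDs."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as identical (a convention, not from the text).
    return len(a & b) / len(union) if union else 1.0


# Toy example: the marker-based annotation misses one cell of the expression cluster.
expression_cluster = {"cell1", "cell2", "cell3", "cell4"}
marker_annotation = {"cell1", "cell2", "cell3"}
reannotation = {"cell1", "cell2", "cell3", "cell4"}

score_marker = jaccard(marker_annotation, expression_cluster)   # 3 / 4
score_reannot = jaccard(reannotation, expression_cluster)       # 4 / 4
```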

Figure 2: Cell-type reannotation improves somatic SNVs detection sensitivity.

a. UMAP embeddings of LR scRNA-seq expression per patient. Cells are colored by annotation status; light-red cells show cells predicted as non-cancer using marker gene expression-based annotation, and as cancer using high-confidence somatic variant reannotation. b. Confusion matrices of cells predicted as cancer or non-cancer using marker genes, and cells reannotated as cancer or non-cancer by LongSom, colored and annotated by the percentage of the total number of cells in each category. E.g., the bottom left square represents cells previously annotated as non-cancer that were reannotated as cancer (false negative cancer cells). c. Number of SNVs found per patient, with or without cell type reannotation before filtering germline SNPs. d. Boxplots of the mean VAF per covered SNV locus of each cell, per patient, colored by their annotation status. Boxes display the first to third quartile with the median as a horizontal line, whiskers encompass 1.5 times the interquartile range, and data beyond that threshold are indicated as outliers. P values were calculated using a two-sided Student’s t-test between groups and are described with the following symbols: n.s.: P > 0.05, *: P ≤ 0.05, **: P ≤ 0.01, ***: P ≤ 0.001. e. Waffle plot representing each somatic SNV detected, colored by genomic region and effect on the coding sequence.

After cell-type reannotation and germline polymorphism filtering, we found 32, 50, and 62 somatic SNVs and 4, 16, and 2 somatic fusions in patients P1, P2, and P3, respectively (Supplementary Tables 1, 2). In patient P1, a variant at locus chr21:8455886 was manually identified as a technical false positive due to mismapping in a highly repetitive region, and it was excluded from further analyses. Without cell type reannotation, we only found 23/32 (72%), 48/50 (96%), and 19/62 (31%) of those SNVs in P1, P2, and P3, respectively, and no additional SNV was discovered (Figure 2c). In patient P3, numerous cancer cells were misannotated as non-cancer cells before reannotation (Figure 2a,b), causing 69% of the somatic SNVs to be wrongly removed during germline SNV filtering (false negatives, Figure 2c). Cells reannotated from non-cancer to cancer cells showed a mean VAF across somatic SNVs significantly different from cells annotated as non-cancer cells in both methods (P<0.001 in all patients, two-tailed two-sample t-test), but not from cells annotated as cancer in both methods (P>0.05 in all patients), thus further supporting the cell-type reannotation (Figure 2d). Out of the 144 somatic SNVs identified, we found 32.6% of variants in or affecting coding regions (2.1% start or stop codon gain (n=3), 1.4% splice region (n=2), 22.2% missense (n=32), and 6.9% synonymous variants (n=10)) and 67.4% in non-coding regions (17.4% 3’ or 5’ UTR (n=25), 40.3% intron (n=58), and 9.7% intergenic variants (n=14)) (Figure 2e).

Validation of LR scRNA-seq-derived SNVs using scDNA-seq data

For validation of the SNVs detected using LongSom, we employed scDNA-seq data from matched omental metastases for each patient. First, we inferred the cellular copy number profiles based on the scDNA-seq data and identified fully diploid subclones (likely non-cancer) and aneuploid subclones (likely cancer) from these (Kuipers et al. 2020) (Methods). We found two aneuploid clones in patient P1, one in P2, and two in P3 (Figure 3a-c). In each somatic locus detected by LongSom, we estimated the mean VAF of diploid and aneuploid scDNA subclones by generating pseudo-bulks. We assumed that a scDNA subclone supported an SNV if the mean VAF was greater than 10% at the respective locus.
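The support criterion can be illustrated with a short sketch. It assumes that the pseudo-bulk VAF of a subclone at a locus is the number of variant reads summed over its cells divided by the total number of reads summed over its cells; the function names and the per-cell `(alt_reads, total_reads)` layout are hypothetical, and only the 10% threshold comes from the text.

```python
def pseudobulk_vaf(cells):
    """Pseudo-bulk VAF at one locus from per-cell (alt_reads, total_reads) pairs.

    Returns None when no read covers the locus in the subclone.
    """
    cells = list(cells)
    alt = sum(a for a, _ in cells)
    total = sum(t for _, t in cells)
    return alt / total if total else None


def supports_snv(cells, threshold=0.10):
    """A subclone supports an SNV if its pseudo-bulk VAF exceeds the threshold."""
    vaf = pseudobulk_vaf(cells)
    return vaf is not None and vaf > threshold


# Toy read counts for one locus (invented values).
aneuploid_cells = [(3, 5), (0, 2), (2, 3)]   # 5 variant reads out of 10
diploid_cells = [(0, 4), (0, 6)]             # 0 variant reads out of 10

somatic_like = supports_snv(aneuploid_cells) and not supports_snv(diploid_cells)
```

Under this reading, an SNV supported only by aneuploid subclones is consistent with a somatic origin, while support in diploid subclones suggests a germline polymorphism.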

Figure 3: Somatic SNVs detected in scRNA are validated as somatic in scDNA.

a,b,c. scDNA-seq copy number values per subclone in a. patient 1, b. patient 2, and c. patient 3 data. Subclones with multiple copy number alterations are aneuploid (likely cancer), while copy number neutral subclones are diploid (likely non-cancer). d,e,f Venn diagrams of somatic SNVs supported (VAF>10%) in scDNA cancer subclones (purple), scDNA non-cancer subclones (green), and both (brown). g. scDNA cancer subclones coverage per somatic locus identified in scRNA, categorized by whether they are found mutated in cancer subclones (Yes) or not (No). h. Number of mutated reads in scDNA subclones per somatic loci identified in LR scRNA, categorized by cancer and non-cancer scDNA subclones.

Overall, 55% (n=79) of the somatic SNVs detected in LR scRNA were found exclusively in scDNA aneuploid subclones and were therefore likely somatic (Figure 3d-f). In all cases where SNVs were not detected in scDNA aneuploid subclones, the scDNA-seq coverage was <10x (Figure 3g). The 10% (n=15) of SNVs detected in diploid scDNA subclones (suggesting germline polymorphism) were all in patient P2, except one in patient P1 which was only supported by one read (Figure 3h). No normal LR scRNA-seq sample was available for patient P2, and non-cancer cells were mainly T cells with an overall low gene expression (Joglekar et al. 2021), possibly explaining why germline SNVs were insufficiently filtered. This finding highlights the utility of matched normal samples for filtering germline variants sufficiently.

Somatic mitochondrial reads contaminate tumor microenvironment cells in scRNA-seq and scDNA-seq data

As somatic mitochondrial SNVs can also be used for genotype and clonal reconstruction (Miller et al. 2022), LongSom also detects them. Due to the high mitochondrial RNA expression in cells (Osorio and Cai 2021), somatic mitochondrial SNVs (mtSNVs) were amongst the most covered across all cell types and patients in the HGSOC dataset. We found three somatic mitochondrial SNVs in patient P1 (chrM:3092:T>C, chrM:5179:T>C, chrM:16192:C>T), three in patient P2 (chrM:2573:G>A, chrM:4308:G>A, chrM:16065:G>A), and none in patient P3 (Supplementary Table 1).

In patient P1, at locus chrM:3092, all covered cancer cells exhibited a >99% VAF in scRNA data, while non-cancer cells showed heteroplasmy (VAF ranging between 0–40%, with 28% of cells mutated, median VAF when mutated 4%) (Figure 4a). However, all cells from distal samples exhibited a VAF <1% (>99% cells covered), ruling out a germline SNV. We detected the same mutational profile in matching scDNA-seq data: amongst the diploid subclones, the average VAF was 5%, while the average VAF in aneuploid subclones was >99% (Figure 3a). At locus chrM:5179, cancer cells were either mutated (n = 44, median VAF = 49.7%) or not (n = 30, median VAF <0.1%) in scRNA-seq data, suggesting the presence of two subclones. In non-cancer cells, the VAF ranged from 0 to 33% (11% cells mutated, median VAF when mutated 3%), and all cells from distal samples exhibited again a VAF <1% (>99% cells covered, Figure 4b). In the matched scDNA-seq data, at locus chrM:5179, the VAF of aneuploid subclone C3 (Figure 3a) was 42%, while it was <1% in the second aneuploid subclone C1 and 2% in the diploid subclones, confirming the subclone specificity of this somatic SNV (Figure 4c).

Figure 4: Non-cancer cells are contaminated by cancer cells mitochondria.

a,b, VAF of cells in patient P1 at loci a. chrM:3092:C and b. chrM:5179:C, categorized by reannotated cell types. Color gradient represents the number of variant allele reads per cell. Cells with more than 100 mutated reads are represented with 100 mutated reads. N refers to the number of cells with at least one read covering the locus. c. Number of reads supporting the reference or alternative allele in patient P1’s scDNA aneuploid (cancer) subclones C1 and C3 at locus chrM:5179, normalized by the number of cells per subclone. d,e, VAF of cells in patient P2 at loci d. chrM:2573:C and e. chrM:4308:C. f. Log aggregated mutated reads in non-cancer cells, as a function of log aggregated mutated reads in cancer cells for loci chrM:3092:C, chrM:5179:C and chrM:16192:T in patient P1, and chrM:2573:C, chrM:4308:C and chrM:16065:A in patient 2.

In patient P2, locus chrM:2573 showed the same pattern, with a mean VAF of 75% in cancer cells compared to a mean VAF of 2% in non-cancer cells. This SNV was observed in 20% of the non-cancer cells, with a mean VAF of 5% (SD=3.6) when mutated (Figure 4d). No matching normal sample was available for this patient. In scDNA, the aneuploid subclone had a VAF of 19% while the diploid subclone had a VAF of 81%. At locus chrM:4308, cancer cells had a mean VAF of 97%, and non-cancer cells had a VAF ranging between 0 and 100% (19% cells mutated, mean VAF when mutated 88% (SD=25)). All non-cancer cells mutated at this locus had only one variant allele read (Figure 4e).

In summary, we observed mitochondrial SNVs with high VAF in cancer cells and lower (but non-zero) VAF in non-cancer cells from the same scRNA-seq and scDNA-seq samples. Remarkably, cells from distal (normal) scRNA samples had a VAF of zero at those loci, suggesting that the mutated mitochondrial reads found in non-cancer cells originate from cancer cells. This phenomenon could be explained by biological mechanisms such as intercellular mitochondrial transfer (Liu et al. 2021), or by technical contamination, such as mitochondria from dead cancer cells being captured together with non-cancer cells during single-cell encapsulation. The correlation between the number of mutated mitochondrial SNV reads found in cancer and in non-cancer cells supports the contamination hypothesis (Figure 4f). To account for this contamination, LongSom does not apply step (C) of germline filtering (filtering of loci that were called in more than 1% of the non-cancer cells) to mitochondrial SNVs.

LR scRNA-seq enables somatic SNV detection with higher sensitivity than SR scRNA-seq

Next, we compared LR and SR scRNA-seq for SNV detection. The HGSOC study had LR and SR scRNA-seq data from the same cells available, and while the SR dataset had 4.3 times more sequenced reads compared to LR (mean 117.4k vs. 26.9k reads per cell), it had 3.5 times fewer mapped bases (mean 16Gb mapped vs. 33Gb mapped) due to shorter read length (Supplementary Figure 1a,b). When we applied LongSom to SR scRNA-seq, we found only 4/32 (13%), 9/50 (18%), and 9/62 (15%) somatic SNVs in patients P1, P2, and P3, respectively, and no new SNV was detected (Supplementary Figure 1c). Additionally, only 1/4 (25%), 9/17 (53%), and 1/2 (50%) fusions were detected in SR scRNA-seq data from patients P1, P2, and P3, respectively (Qin et al. 2024).

We computed sparse cell-variant matrices from each patient’s LR data, using cells as columns and somatic SNVs and fusions as rows (Methods, Figure 5a-c). For comparison, we computed the same matrices using SR scRNA-seq data (Figure 5d-f). On average, 13.7% (standard deviation (SD) = 1.2) of the matrix positions were covered by at least one long read, whereas only 4.7% (SD = 1.2) were covered by at least one short read (Supplementary Figure 1d-f). However, the coverage depends on cell-type-specific expression, and certain cell types, for example T cells, rarely express the mutated genes (Figure 5a-f). In cancer cells, on average 27.9% (SD = 7.5) of the matrix positions were covered by at least one long read, whereas only 8.1% (SD = 0.8) were covered by at least one short read (Figure 5g-i). On average, LR covered 94.8% (SD = 3) of the positions covered by SR and covered 3.4 times as many positions in total (SD = 0.6), in line with the 3.5 times more bases mapped in LR.
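The coverage comparison can be sketched on two toy read-depth matrices of equal shape (rows are variants, columns are cells). The function name, the nested-list representation, and the depth values are illustrative assumptions; a position counts as covered when at least one read overlaps it, as in the text.

```python
def coverage_summary(lr_depth, sr_depth):
    """Compare covered positions between LR and SR depth matrices of equal shape."""
    pairs = [(l > 0, s > 0)
             for lr_row, sr_row in zip(lr_depth, sr_depth)
             for l, s in zip(lr_row, sr_row)]
    n = len(pairs)
    lr_cov = sum(l for l, _ in pairs)          # positions with >= 1 long read
    sr_cov = sum(s for _, s in pairs)          # positions with >= 1 short read
    both = sum(l and s for l, s in pairs)      # positions covered by both
    return {
        "lr_fraction": lr_cov / n,
        "sr_fraction": sr_cov / n,
        # Share of SR-covered positions that LR also covers.
        "sr_also_in_lr": both / sr_cov if sr_cov else None,
    }


# Toy matrices: 2 variants x 4 cells (invented read depths).
lr = [[3, 0, 1, 2],
      [0, 5, 0, 1]]
sr = [[1, 0, 0, 0],
      [0, 0, 1, 0]]
summary = coverage_summary(lr, sr)
```

The three returned quantities correspond to the fraction of positions covered by LR, the fraction covered by SR, and the share of SR-covered positions that LR recovers.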

Figure 5: Patient-specific cell-variant matrices created from LR and SR scRNA-seq.

a-f. Matrices of somatic SNVs and fusions (rows) by single cells (columns) computed using LR scRNA-seq from the tumor biopsy of (a) patient P1, (b) P2, and (c) P3, and using SR scRNA-seq of (d) patient P1, (e) P2, and (f) P3, ordered by gene expression-derived cell types. VAF is depicted as a gradient from white (no mutated reads, VAF=0) to red (only mutated reads, VAF=1). Grey indicates no coverage in the cell at a given locus. Rows are colored by the scDNA VAF of aggregated diploid and aneuploid cells at the loci: SNVs with high aneuploid VAF and low diploid VAF are somatic in scDNA data. RNA fusions do not give a direct indication of the DNA breakpoint, thus we could not assess their presence in scDNA data, and they appear in pink. Columns are colored by marker-expression-derived cell types (top row) and cell types reannotated by LongSom (bottom row). g,h,i. Venn diagrams of matrix positions covered in the cancer cells in (g) patient P1, (h) patient P2, and (i) patient P3, colored by sequencing data modality. Total positions equal n variants x m cancer cells. Blue indicates positions with coverage >0 in LR and 0 in SR. Red indicates positions with coverage 0 in LR and coverage >0 in SR. Purple indicates positions with coverage >0 in both LR and SR. Grey indicates positions with coverage 0 in both LR and SR.

LongSom detects panel-validated resistance-associated variants

The three patients also underwent bulk panel DNA sequencing (Methods), where 29 SNVs were found (Supplementary Table 3). All three patients had at least one somatic SNV called in TP53 (including a variant introducing a stop codon in patient P3) with a VAF >30%, and patient P1 had a second TP53 SNV detected with VAF 1%. LongSom detected all TP53 somatic SNVs with VAF >30% in LR scRNA-seq. The remaining SNVs detected in the panel were not retained with our method for the following reasons: they were either germline variants found in normal and non-cancer cells (n=5), detected in cancer but with insufficient coverage in non-cancer cells (n=3), detected but not in enough cancer cells (n=7), not detected despite sufficient coverage (n=3), or not covered (n=8) (Supplementary Table 3). Overall, 62% of the SNVs detected in the panel also found support in scRNA data. Of note, two deletions were found in panel sequencing, and they were detected manually in the LR scRNA-seq data. LongSom detected none of the panel SNVs in SR scRNA-seq data.
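The 62% figure can be reproduced from the counts in the paragraph above. The grouping below is an inference rather than something stated explicitly: panel SNVs are counted as supported in scRNA data if the variant was observed at all (retained by LongSom, or observed but filtered), and as unsupported if it was not detected despite coverage or not covered. The category names are hypothetical labels.

```python
# Counts from the text; 29 panel SNVs in total, 26 of which are itemized as not retained.
panel = {
    "retained_by_longsom": 3,               # 29 - (5 + 3 + 7 + 3 + 8)
    "germline_in_normal_and_noncancer": 5,
    "low_noncancer_coverage": 3,
    "too_few_cancer_cells": 7,
    "not_detected_despite_coverage": 3,
    "not_covered": 8,
}

# Categories in which no variant read was observed in scRNA data (assumed grouping).
unsupported = {"not_detected_despite_coverage", "not_covered"}

total = sum(panel.values())
supported = sum(n for name, n in panel.items() if name not in unsupported)
support_pct = round(100 * supported / total)   # 18 of 29 panel SNVs
```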

In addition to TP53, LongSom was able to detect SNVs predicted as clinically relevant in genes not included in the bulk panel (Methods). In patient P1, we found missense variants predicted as pathogenic in the apoptosis regulator genes CCAR2 (Arg722Trp) and FAM129B (Leu508Pro), and another missense variant in the ferroptosis regulator ALDH3A2 (Val321Leu) (Supplementary Table 4). ALDH3A2 is a tumor suppressor in multiple cancers (Xia et al. 2023) and ALDH3A2, CCAR2, and FAM129B are all associated with treatment resistance in ovarian cancer (Cheng et al. 2019; Iyer et al. 2022; Dong et al. 2023). In patient P2, the chrM:4308 G>A variant was predicted as likely pathogenic. In patient P3, we detected a missense variant in AHCY, as well as a pathogenic NIF3L1 variant and a missense variant in the resistance-associated gene KDM6B (He et al. 2019). In SR scRNA-seq data, LongSom found no clinically relevant variants in patients P1 and P3, and only found chrM:4308 G>A in patient P2.

LongSom identifies subclones in LR scRNA-seq data

Next, LongSom inferred the clonal structure of the tumors based on the SNVs and fusions it detected using BnpC. LongSom also inferred the clonal structure from CNA profiles in the same cells, using inferCNV (Supplementary Figure 2a-c, Methods). We also clustered the cells based on their gene expression, manually annotated the cancer clusters, and used those clusters as transcriptomic validation. Finally, we used the subclones inferred from scDNA as external validation (Figure 3).

In patient P2, LongSom found one cancer clone using mutations and fusions, and this clone coincided very well with the aneuploid CNA clone found in scRNA (Jaccard similarity = 98%) and the gene-expression-based cancer cluster (Jaccard similarity = 97%, Supplementary Figures 2b, 3a). Similarly, in scDNA-seq data we only found one aneuploid CNA clone (Figure 3b). Therefore, all available data modalities point toward a monoclonal cancer population in this patient. Using SR scRNA-seq, LongSom reconstructed the clonal structure with lower accuracy than LR due to low coverage (Jaccard similarity BnpC clone - cancer cluster = 85%, Supplementary Figure 3b).

In patient P3, LongSom found one clone, coinciding with the scRNA gene expression-based cancer cluster (Jaccard similarity = 93%, Supplementary Figure 4a). However, two aneuploid subclones were detected in both scDNA-seq and scRNA-seq data using CNA analysis (Figure 3c, Supplementary Figures 2c,4a). This difference could be due to the difficulty of calling subclones in a low number of cancer cells (n=42 after reannotation) or due to inter-sample heterogeneity. Using SR data, the clustering resulted in a subclone only partially covering the expression-based cancer cluster, and many individual cells formed singleton subclones (Jaccard similarity BnpC subclone - cancer cluster = 36%, Supplementary Figure 4b).

In patient P1, LongSom found two cancer subclones, referred to as A and B, as well as a subclone composed of one cell that we assigned to subclone B (Figure 6a). The larger subclone A (n = 40 cells) was predominantly defined by an SNV at locus chrM:5179 and the smaller subclone B (n = 34 cells) was mainly defined by the fusion SMG7--CH507–513H4.1. In expression-based UMAP embedding, cancer cells formed two distinct expression clusters that highly overlapped the genotypic cancer subclones found based on SNVs and fusions (Jaccard similarity BnpC subclone A - expression cluster 1 = 79%, BnpC subclone B - expression cluster 2 = 72%, Figure 6a-d). CNA subclones and expression clusters were also very similar (Jaccard similarity inferCNV subclone A - expression cluster 1 = 89%, inferCNV subclone B - expression cluster 2 = 85%), likely because they are both derived from gene expression (Figure 6a,e, Supplementary Figure 3a). Clonal assignments based on SNVs and fusions and on CNA data were also similar (Jaccard similarity subclone A = 72%, subclone B = 68%). In patient P1’s matched scDNA data, we also found two aneuploid (cancer) subclones based on CNA profiles (Figure 3a), and only one of the subclones harbored the SNV chrM:5179:T>C (Figure 4c), concordantly with LR scRNA-seq data. Unfortunately, the three other subclone-defining SNVs had insufficient scDNA-seq coverage, and we could not detect any reads supporting the variant allele at those loci; therefore, we could not confirm their subclonality. Using SR scRNA-seq, LongSom also identified cancer subclones in patient P1, mainly based on chrM:5179 status. However, as the fusion SMG7--CH507–513H4.1 was not detected in SR, multiple cancer cells clustered with non-cancer cells (Jaccard similarity BnpC SR subclone A - expression cluster 1 = 70%, BnpC SR subclone B - expression cluster 2 = 57%, Supplementary Figure 5).

Figure 6: Analysis of intra-tumor heterogeneity using somatic variants detected in LR scRNA-seq in Patient P1.

a. BnpC clustering of single cells from the tumor biopsy of patient P1 (columns) by somatic SNVs and fusions (rows). VAF is depicted as a gradient from white (no mutated reads, VAF=0) to red (only mutated reads, VAF=1). Grey indicates no coverage in the cell at a given locus. Rows are colored by the scDNA VAF of aggregated diploid and aneuploid cells at the loci: SNVs with high aneuploid VAF and low diploid VAF are somatic in scDNA data. Fusions appear in pink. Columns are colored from top to bottom by cell types reannotated by LongSom, CNAs subclones, expression clusters, and BnpC clusters b,c. UMAP embedding of patient P1 gene expression data, colored by (b) cell types reannotated by LongSom and (c) subclones inferred from somatic SNVs and fusions. The dashed line indicates the manual separation between cancer clusters 1 and 2. d,e. Confusion matrix of cells in each expression-derived cancer cluster (rows) and (d) cells in the subclones inferred from BnpC, and (e) cells in the subclones inferred from inferCNV (columns), colored by the percentage of the total number of cells in each subclone and annotated by the absolute numbers. f. Volcano plot of differentially expressed genes identified between subclones B and A. Keratin genes downregulated in subclone B are annotated. g. ScisorWiz representation of CHPF isoforms in subclones A and B. Colored areas are exons, whitespace areas are intronic space, not drawn to scale, and each horizontal line represents a single read colored according to subclones.

Subclones identified in patient P1 have differing predicted treatment outcomes

To explore the potential therapeutic resistance of the subclones identified in Patient P1 by LongSom, we investigated the genomic and transcriptomic variations between them. The ALDH3A2 pathogenic variant identified earlier was exclusively expressed in subclone A, while the CCAR2 pathogenic variant was exclusive to subclone B (Supplementary Table 4). Remarkably, ALDH3A2 is a ferroptosis inhibitor and its loss of function would lower cisplatin resistance (Dong et al. 2023), while CCAR2 is a suppressor of homologous recombination, and its loss would lead to resistance against PARP inhibitors (Iyer et al. 2022). Therefore, based on SNVs, subclone A is more likely to be treatment-sensitive, while subclone B is more likely to be treatment-resistant. Fusions SMG7--CH507–513H4.1 and GS1–279B7.2--GNG4 were exclusively expressed in subclone B (Figure 6a); however, their pathogenicity is difficult to predict. On the transcriptomic level, subclone B had notably downregulated expression of keratin genes KRT8 and KRT18, two epithelial markers used to differentiate HGSOC cells from non-cancer cells (Figure 6f). When compared to cancer subclones in patients P2 and P3, KRT8 and KRT18 were downregulated in subclone B, but not upregulated in subclone A, thus confirming a downregulation in subclone B (Supplementary Figure 6a,b). It has been shown in vitro that KRT8 and KRT18 have a protective effect against cell death (Bozza et al. 2018), and loss of KRT8 and KRT18 leads to increased invasiveness but also cisplatin sensitivity (Fortier et al. 2013). Based on keratin expression, subclone B is therefore more likely to be chemosensitive than subclone A. We additionally investigated differential isoform usage, and while both subclones were mostly similar, we found significant differences in CHPF (Figure 6g), MYL6, the tumor suppressor BTG2, and NUTM2B-AS1 (Supplementary Figure 6c-e); however, we could not predict their pathogenicity.
None of the subclone-exclusive variants or isoforms were detected in Patient P1 SR scRNA-seq data.

Discussion

SNVs, CNAs, fusions, gene expression, isoform expression, and the micro-environment composition can all affect cancer treatment outcomes (Marine et al. 2020). Assessing all of these variations simultaneously from a single patient sample is particularly relevant in a clinical setting, where biological material is limited. Here, we show for the first time that this is possible using LR scRNA-seq data and we introduce LongSom, a workflow for detecting de novo somatic SNVs, fusions, and CNAs in LR scRNA-seq. When applied to data from three HGSOC patients, it detected panel- and scDNA-seq-validated SNVs, including clinically relevant TP53, ALDH3A2, and CCAR2 SNVs. By integrating SNVs and fusions, LongSom successfully reconstructed the clonal heterogeneity and linked variant-defined subclones to CNA-defined subclones and gene expression clusters. Finally, in each subclone, we identified differentially expressed genes as well as subclone-specific SNVs with different implications for treatment resistance. Thus, we demonstrated that LR scRNA-seq is suitable for predicting treatment outcomes.

The cell-type reannotation step implemented in LongSom, based on the somatic variation profile of cells, led to the detection of up to 2.4 times more somatic SNVs (patient P3) and significantly reduced the false-negative rate without sacrificing sensitivity for precision. In general, cell-type annotation is an open challenge due to overlapping, poorly expressed, or incomplete marker genes, e.g., in ovarian cancer omentum metastases (Lähnemann et al. 2020; Van Egeren et al. 2021, 2022). Our proposed reannotation can improve existing methods and shows the potential of combining genomic variants with transcriptomic cell typing.

To our knowledge, LongSom is the first method combining de novo detection of SNVs and fusions from the same cell to reconstruct clonal heterogeneity. Besides nuclear SNVs, which are commonly obtained from RNA-seq (Muyas et al. 2023; Zhang et al. 2023) and DNA-seq (Zafar et al. 2016), LongSom also calls mitochondrial SNVs. In the analyzed HGSOC dataset, the mitochondrial SNVs were called in most cancer and non-cancer cells, and some high-confidence fusion calls were expressed in most clones or subclones (P2: IGF2BP2::TESPA1, P1: SMG7::CH507–513H4.1, etc.), making them ideal variations for cell-type reannotation and clustering. Furthermore, both can be clinically relevant (Amatu et al. 2016; Lei et al. 2018; Cesi et al. 2018; Dentro et al. 2021), as we demonstrated in Patient P2. Mitochondrial RNA is particularly abundant in cancer cells, especially HGSOC (Yuan et al. 2023), and an increasing number of methods are leveraging mitochondrial variants for clonal reconstruction or validation (Kwok et al. 2022; Miller et al. 2022; Gao et al. 2023). However, we demonstrated that mitochondrial SNVs require special filtering thresholds, as non-cancer cells were frequently contaminated by cancer mitochondrial reads. We assume that entire cancer mitochondria might contaminate non-cancer cells, as we observed mitochondrial SNVs in both scRNA and scDNA-seq data. Whether these mitochondria originate from a biological mechanism, e.g. intercellular transfer from cancer cells into non-cancer cells for microenvironment revitalization (Liu et al. 2021; Zampieri et al. 2021), or from technical contamination, e.g. cancer mitochondria encapsulated jointly with non-cancer cells during single-cell preparation, requires further investigation.

One limitation of this study is the lack of isoform and fusion annotation in the literature, resulting from the difficulty of detecting them in SR scRNA-seq (Dentro et al. 2021), making it challenging to explore the biological implications of subclone-specific isoforms or fusions. To fully exhaust the possibilities of LR scRNA-seq, characterizing more isoforms and fusions will be necessary in the future. Furthermore, a population-level database dedicated to fusions, similar to gnomAD (Chen et al. 2024) for SNVs, would be beneficial to filter germline fusions. We believe that the reliable detection of isoforms and fusions with LR scRNA-seq is the first step toward this goal.

Despite rapid progress in the LR scRNA-seq field (Al’Khafaji et al. 2023; Dondi et al. 2023; Joglekar et al. 2023; Marx 2023), multiple technical limitations remain unsolved, limiting the potential of downstream analysis. First, variant detection remains challenging due to the sparsity and low coverage of scRNA-seq assays. LongSom excludes SNVs called in non-cancer cells to filter germline SNVs, possibly leading to false negative somatic SNVs, as shown by the matched panel-seq. Second, read coverage is also uneven within a transcript, as transcripts produced by 10X Genomics Chromium remain incomplete on the 5’ end (Hsu et al. 2022). Third, RNA-seq is inherently limited to detecting only expressed SNVs and fusions. Nevertheless, LongSom detected a large fraction of variants in intronic or even intergenic regions. Last, indels are the most common source of errors in LR scRNA-seq data and are therefore frequently excluded from analyses, ours included (Shiau et al. 2023). To further improve the genomic analyses of scRNA-seq data, algorithms for filtering technical indels while detecting somatic indels are required, especially as technical indels can lead to false positive somatic SNVs (Ahsan et al. 2021).

In summary, we proposed a workflow for detecting multiple genetic variants (SNVs, CNAs, fusions) in LR scRNA-seq, enabling the reconstruction of clonal heterogeneity and the linkage of clonal genotypes to treatment-resistance phenotypes. LR scRNA-seq provides a unique snapshot of the cellular mechanisms by capturing multiple genomic and transcriptomic readouts from the same cell, including expressed isoforms, fusion transcripts, SNVs, and CNAs, more effectively than with any other sequencing technique. With decreasing costs and increasing data size in LR scRNA-seq, we envision that LR scRNA-seq will become more common, potentially facilitating a better understanding of the processes underlying cancer treatment resistance. LongSom can be a valuable first step in guiding these analyses.

Methods

scRNA expression analysis

Gene expression counts

LR gene expression counts were generated as described previously (Dondi et al. 2023). Briefly, we preprocessed the BAM files using scIsoPrep (https://github.com/cbg-ethz/scIsoPrep/tree/master) and generated a gene expression-cell matrix. UMI counts were quality-controlled and cells and genes were filtered to remove mitochondrial and ribosomal contaminants. Cells for which over 50% of the reads mapped to mitochondrial genes and cells with fewer than 400 genes expressed were removed. By default, all non-protein-coding genes, genes coding for ribosomal and mitochondrial proteins, and genes that were expressed in fewer than 20 cells were removed. Subsequently, counts were normalized with sctransform (Hafemeister and Satija 2019), regressing out cell cycle effects and library size as non-regularized dependent variables.
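The cell and gene thresholds described above can be illustrated with a short sketch. It assumes that per-cell summary statistics and a gene-by-cell UMI count table are already available; the function names and data layout are hypothetical and do not reproduce the scIsoPrep implementation. The removal of non-protein-coding, ribosomal, and mitochondrial genes is not shown.

```python
def passes_cell_qc(mito_fraction, n_genes, max_mito_fraction=0.5, min_genes=400):
    """Keep a cell unless over 50% of its reads are mitochondrial
    or fewer than 400 genes are expressed."""
    return mito_fraction <= max_mito_fraction and n_genes >= min_genes


def genes_to_keep(counts, min_cells=20):
    """Return genes expressed (count > 0) in at least `min_cells` cells.

    `counts` maps gene name -> {cell barcode -> UMI count} (hypothetical layout).
    """
    kept = []
    for gene, per_cell in counts.items():
        n_expressing = sum(1 for umi in per_cell.values() if umi > 0)
        if n_expressing >= min_cells:
            kept.append(gene)
    return kept


# Hypothetical toy input: GENE_A is expressed in 25 cells, GENE_B in 3 cells.
toy_counts = {
    "GENE_A": {f"cell_{i}": 2 for i in range(25)},
    "GENE_B": {f"cell_{i}": (1 if i < 3 else 0) for i in range(25)},
}
kept_genes = genes_to_keep(toy_counts)
```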

Marker gene expression-based annotation

Cells were annotated with scROSHI (Prummer et al. 2023) using ovarian cancer marker gene lists. Marker genes are available at https://github.com/ETH-NEXUS/scAmpi_single_cell_RNA/blob/master/required_files/ovarian/celltype_list_ovarian.gmx. Cells labeled “HGSOC” were treated as cancer cells, and cells labeled “Mesothelial.cells”, “Fibroblast”, “T.NK.cells”, “B.cells”, “Myeloid.cells”, or “Endothelial.cells” were treated as non-cancer cells.

Clustering and visualization

Similar cells were grouped using Seurat FindClusters (Hao et al. 2024), and clusters with a majority (>90%) of non-cancer cells were grouped together as “non-cancer”. The results of the clustering and cell typing were visualized in a low-dimensional representation using Uniform Manifold Approximation and Projection (UMAP).

Differential gene expression analysis

Differential expression was computed using Seurat FindMarkers (Hao et al. 2024), which uses a Wilcoxon test, corrected for multiple testing using the Bonferroni correction. A threshold of corrected P-value <0.01 and abs(log2(fold change)) >1 was used for significance.
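The significance rule above can be written out as a small sketch. It assumes that raw P-values and log2 fold changes have already been computed (in this study by Seurat FindMarkers) and only illustrates the Bonferroni correction and thresholding. The function name, the gene labels other than KRT8, and all values are hypothetical, and the number of tests is taken to be the number of genes in the input list, which is an assumption.

```python
def significant_genes(results, alpha=0.01, min_abs_log2fc=1.0):
    """Select genes with Bonferroni-corrected P < alpha and |log2FC| > min_abs_log2fc.

    `results` is a list of (gene, raw_p_value, log2_fold_change) tuples.
    """
    n_tests = len(results)
    selected = []
    for gene, raw_p, log2fc in results:
        # Bonferroni correction: multiply by the number of tests, cap at 1.
        corrected_p = min(1.0, raw_p * n_tests)
        if corrected_p < alpha and abs(log2fc) > min_abs_log2fc:
            selected.append(gene)
    return selected


# Hypothetical values, for illustration only.
toy_results = [
    ("KRT8", 1e-6, -2.3),    # small P-value and large effect: kept
    ("GENE_X", 1e-6, 0.4),   # small P-value but small effect: dropped
    ("GENE_Y", 0.02, 3.0),   # large effect but corrected P = 0.06: dropped
]
hits = significant_genes(toy_results)
```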

Differential isoform usage analysis

Isoform classification and quantification were performed using scIsoPrep. Differential isoform testing was performed using a χ2 test as previously described in Scisorseqr (Joglekar et al. 2021). Differentially used isoforms were visualized using ScisorWiz (Stein et al. 2022).

Somatic variants calling in LR scRNA-seq data with LongSom

To call somatic variants in LR scRNA-seq, we developed LongSom, a workflow implemented in python3 using Snakemake (Köster and Rahmann 2012) and available at https://github.com/cbg-ethz/LongSom.

Preprocessing

Long reads with a minimum quality of Q20 were de-concatenated, adapter-trimmed, and demultiplexed; polyA tails were then trimmed and, finally, UMIs were deduplicated using scIsoPrep (https://github.com/cbg-ethz/scIsoPrep/tree/master) as described previously (Dondi et al. 2023). All deduplicated reads belonging to cells passing the filters (cells for which under 50% of the reads mapped to mitochondrial genes and with more than 400 genes expressed; see Dondi et al. 2023) were then pooled together in a pseudo-bulk fashion. Gene expression-based cell types were derived from the same work (Dondi et al. 2023).

SNV calling in LR scRNA-seq data using CTAT-Mutations

First, LongSom calls somatic SNVs in the tumor and (when available) normal biopsy pseudo bulks, using the CTAT-Mutations pipeline v4.0.0 (https://github.com/NCIP/ctat-mutations/releases/tag/CTAT-Mutations-v4.0.0), which we enhanced to enable compatibility with long reads and to report variants according to single-cell barcodes. When executed with the option --is_long_reads, minimap2 (Li 2018) is used to align long isoform reads to the reference genome hg38 (instead of the STAR aligner used with shorter Illumina RNA-seq reads), followed by our implementation of the GATK best practices for variant calling using RNA-seq (https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels). Loci flagged as RNA-editing sites or supported by fewer than 5 mutated reads are filtered out. For generating variant reports at single-cell resolution, allele-supporting reads annotated with cell barcodes and UMIs were captured from the aligned reads, tallied, and reported for downstream integration with cell typing and related metadata.

Fusion calling in LR scRNA-seq data using CTAT-LR-Fusion

LongSom detects fusions at the single-cell level using CTAT-LR-fusion v0.13.0 (https://github.com/TrinityCTAT/CTAT-LR-fusion/releases/tag/ctat-LR-fusion-v0.13.0) with standard options (Qin et al. 2024).

Cell-variant matrices construction

LongSom defines three groups based on the marker-expression-based cell types: cancer cells in the tumor biopsy (in this study, HGSOC cells), non-cancer cells in the tumor biopsy (in this study: mesothelial cells, fibroblasts, T cells, myeloid cells, B cells, and endothelial cells) and, if available, normal cells from the normal biopsy. For each of those groups, LongSom builds a cell-variant matrix with n cells (columns) and m SNVs + p fusions (rows). For SNV rows, the matrices are filled as follows: if at least one read is covering the locus in a cell, a VAF is computed for this cell (with a value ranging from 0 to 1), otherwise, the position is a missing value. A cell is defined as “mutated” at an SNV locus if it has a VAF >= 0.3. For fusion rows, the matrices are filled as follows: a cell with at least one fusion read is considered “mutated” for this fusion (value = 1), otherwise, it is a missing value.
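The filling rules for the cell-variant matrix can be sketched as follows. This is a minimal illustration assuming that per-cell read counts at each locus are already tallied; the function names, the use of None for missing values, and the toy read counts are hypothetical and are not taken from the LongSom source code.

```python
def snv_value(alt_reads, total_reads):
    """VAF in [0, 1] if at least one read covers the locus, otherwise missing (None)."""
    if total_reads < 1:
        return None
    return alt_reads / total_reads


def fusion_value(fusion_reads):
    """1 if the cell has at least one fusion read, otherwise missing (None)."""
    return 1 if fusion_reads >= 1 else None


def is_mutated(value, min_vaf=0.3):
    """A cell is 'mutated' at a locus if its value is present and >= 0.3."""
    return value is not None and value >= min_vaf


# Hypothetical (alternative reads, total reads) per cell at one SNV locus.
snv_reads = {"cell_1": (3, 10), "cell_2": (0, 4), "cell_3": (0, 0)}
snv_row = {cell: snv_value(alt, total) for cell, (alt, total) in snv_reads.items()}
mutated_cells = [cell for cell, value in snv_row.items() if is_mutated(value)]
```

In this toy row, cell_1 is mutated (VAF = 0.3), cell_2 is covered but not mutated (VAF = 0), and cell_3 is a missing value.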

Cell type reannotation

To improve the cell type annotation, LongSom defines a set of “high-confidence cancer variants”. To be a “high-confidence cancer variant”, an SNV needs to (1) be mutated in more than 5% of cancer cells, (2) be mutated in >20% of the cancer cells covering the locus, (3) have >1% of non-cancer cells covering the locus, (4) be mutated in less than 5% of the non-cancer cells covering the locus, and (5) be mutated in 0 normal cells (optional). For mitochondrial SNVs, due to the contaminations observed, LongSom does not follow those rules. Instead, a mitochondrial SNV is a “high-confidence cancer variant” if:

$$\left(\%\ \text{of cancer cells mutated}\right) - \left(\%\ \text{of non-cancer cells mutated}\right) > 20\%$$

To be a “high-confidence cancer variant”, a fusion needs to be found in more than 5 cancer cells and less than 5% of the non-cancer cells. LongSom then reannotates the cell types by defining as “cancer” any cell mutated in more than two of the “high-confidence cancer variants”, and as “non-cancer” all the other cells in the tumor biopsy.
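The rules for nuclear SNVs and the reannotation step can be summarized in a short sketch. It assumes that the relevant cell counts per locus are already available and covers only criteria (1) to (5) for nuclear SNVs and the "more than two variants" rule; mitochondrial SNVs and fusions follow the separate rules given in the text and are not shown. All function and argument names are hypothetical.

```python
def is_high_confidence_snv(n_cancer, n_cancer_cov, n_cancer_mut,
                           n_noncancer, n_noncancer_cov, n_noncancer_mut,
                           n_normal_mut=0):
    """Apply criteria (1)-(5) for a nuclear SNV, given cell counts at one locus."""
    if 0 in (n_cancer, n_cancer_cov, n_noncancer, n_noncancer_cov):
        return False  # the fractions below would be undefined
    return (
        n_cancer_mut / n_cancer > 0.05                 # (1) >5% of all cancer cells mutated
        and n_cancer_mut / n_cancer_cov > 0.20         # (2) >20% of covered cancer cells mutated
        and n_noncancer_cov / n_noncancer > 0.01       # (3) >1% of non-cancer cells covered
        and n_noncancer_mut / n_noncancer_cov < 0.05   # (4) <5% of covered non-cancer cells mutated
        and n_normal_mut == 0                          # (5) optional: no mutated normal cells
    )


def reannotate(mutated_variants, high_confidence_variants):
    """Label a cell 'cancer' if it is mutated in more than two high-confidence variants."""
    n_hits = len(set(mutated_variants) & set(high_confidence_variants))
    return "cancer" if n_hits > 2 else "non-cancer"


# Hypothetical variant identifiers, for illustration only.
high_conf = {"v1", "v2", "v3", "v4"}
label_a = reannotate({"v1", "v2", "v3"}, high_conf)
label_b = reannotate({"v1", "v2", "v9"}, high_conf)
```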

Final somatic variants call set and matrix

After cell reannotation, LongSom rebuilds two cell-variant matrices using the annotated cancer and non-cancer labels. LongSom then filters germline polymorphisms (rows) from the variant matrices in five steps: (A) It filters SNV loci detected in the matched normal, when available. (B) It filters SNV loci from the gnomAD database (Chen et al. 2024) with a frequency of at least 0.01% in the total population. (C) After cell-type reannotation, it filters SNV loci that were called in more than 1% of the non-cancer cells. (D) SNV loci where fewer than 1% of the non-cancer cells are covered by at least one read are filtered. This step helps to filter germline SNVs not detected due to low expression in non-cancer cells. (E) Finally, adjacent SNV loci within a 10,000 bp distance are filtered, as these are likely to be misalignment artifacts in low-complexity regions. Of note, steps (C) and (E) are not applied to mitochondrial SNVs. Finally, LongSom keeps somatic loci that are mutated in a minimum of five cancer cells or 5% of cancer cells (user-defined parameters).
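Step (E) can be illustrated with a short sketch. It rests on two assumptions that the text does not spell out: that every locus with a neighboring SNV locus on the same chromosome is removed (rather than only one locus of each pair), and that "within 10,000 bp" is inclusive. Mitochondrial loci (here assumed to be named chrM) are exempt, as stated above. The function name and the toy coordinates are hypothetical.

```python
def filter_adjacent_loci(loci, max_distance=10_000, exempt_chrom="chrM"):
    """Drop SNV loci that have another SNV locus within `max_distance` bp
    on the same chromosome; loci on `exempt_chrom` are always kept.

    `loci` is a list of (chromosome, position) tuples.
    """
    positions_by_chrom = {}
    for chrom, pos in loci:
        positions_by_chrom.setdefault(chrom, []).append(pos)

    kept = []
    for chrom, positions in positions_by_chrom.items():
        positions.sort()
        for i, pos in enumerate(positions):
            near_previous = i > 0 and pos - positions[i - 1] <= max_distance
            near_next = i + 1 < len(positions) and positions[i + 1] - pos <= max_distance
            if chrom == exempt_chrom or not (near_previous or near_next):
                kept.append((chrom, pos))
    return kept


# Hypothetical coordinates, for illustration only.
toy_loci = [("chr1", 1_000), ("chr1", 5_000), ("chr1", 50_000),
            ("chrM", 5_179), ("chrM", 5_200)]
kept_loci = filter_adjacent_loci(toy_loci)
```

In this toy input, the two chr1 loci 4,000 bp apart are removed, the isolated chr1 locus is kept, and both mitochondrial loci are kept regardless of their distance.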

Cancer and non-cancer cell-variant matrices containing only somatic SNVs and fusions are then concatenated to create the final cell-variant matrix. SNVs are sorted in decreasing order by:

$$\text{Diff} = \left(\text{mean }\%\text{ of covered cancer cells mutated}\right) - \left(\text{mean }\%\text{ of covered non-cancer cells mutated}\right)$$

Clonal detection based on SNVs and fusions

LongSom uses the cell-variant matrices as input for Bayesian non-parametric clustering (BnpC) (Borgsmüller et al. 2020) to detect subclones in cancer samples, with arguments -n 16 --steps 1000 --DPa_prior [1,1] --conc_update_prob 0 --param_prior [1,1].

Clonal detection based on CNAs

LongSom first computes cell-gene matrices using featureCounts from Subread v2.0.6 (https://subread.sourceforge.net/) with parameters -L, using hg38 and gencode v36 as reference. It then uses those matrices as input for inferCNV to detect CNA subclones (https://github.com/broadinstitute/infercnv). For running CreateInfercnvObject, reannotated non-cancer cells are used as a reference, and the parameter min_max_counts_per_cell=c(1e3,1e7) is used. For running inferCNV, the parameters cutoff=0.1 and leiden_resolution=0.01 are used. The CNA profiles displayed in this study are the ones obtained from the Hidden Markov Model learned by inferCNV.

scDNA analysis

Preprocessing and clonal reconstruction

Using annotated cell types, we re-computed the cell-variant matrices as well as the percentage of cells mutated, the percentage of cells covered, and the percentage of covered cells mutated, for each locus. We then called the final somatic SNVs set at all loci mutated in more than 5% of cancer cells, mutated in less than 1% of the non-cancer cells (min. 1% non-cancer cells covered), and mutated in no normal cells. We obtained copy number profiles and detected the main clonal structure of samples using SCICoNE (Kuipers et al. 2020). Subclones were considered as cancer subclones if they had an aneuploid CNA profile, and as non-cancer subclones if they had a fully diploid CNA profile.

Variant allele calling in scDNA subclones

We investigated all loci from the final somatic SNV set in scDNA subclones in a pseudobulk manner. Cancer subclones were pooled together as well as non-cancer subclones because the coverage was low (<10x per subclone). scDNA subclones with a mean VAF>10% in an SNV locus were considered as supporting the SNV.

Clinically relevant SNVs

Clinically relevant SNVs were detected using the CTAT-Mutations pipeline (https://github.com/NCIP/ctat-mutations/releases/tag/CTAT-Mutations-v4.0.0). Briefly, an SNV was considered clinically relevant if it met at least one of the following conditions: it was flagged as pathogenic by ClinVar (Landrum et al. 2014), the CHASMplus (Tokheim and Karchin 2019) P-value was <0.05, the VEST (Carter et al. 2013) P-value was <0.05, or FATHMM (Rogers et al. 2018) flagged it as “CANCER” or “PATHOGENIC”.
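The decision rule above is a logical OR over four annotation sources, which can be sketched as follows. The function and argument names are hypothetical and do not correspond to CTAT-Mutations output fields; missing annotations are represented by None, which is an assumption of this sketch.

```python
def is_clinically_relevant(clinvar_pathogenic=False, chasmplus_p=None,
                           vest_p=None, fathmm_label=None):
    """Return True if at least one of the four conditions is met."""
    return bool(
        clinvar_pathogenic                                    # flagged as pathogenic by ClinVar
        or (chasmplus_p is not None and chasmplus_p < 0.05)   # CHASMplus P-value < 0.05
        or (vest_p is not None and vest_p < 0.05)             # VEST P-value < 0.05
        or fathmm_label in {"CANCER", "PATHOGENIC"}           # FATHMM flag
    )


# Hypothetical annotations, for illustration only.
relevant = is_clinically_relevant(chasmplus_p=0.01)
not_relevant = is_clinically_relevant(chasmplus_p=0.2, vest_p=0.3, fathmm_label="NEUTRAL")
```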

Panel validation

To investigate LongSom somatic SNV calls, we used the FoundationOne®CDx targeted NGS panel (Milbury et al. 2022) in matched bulk DNA samples. SNVs detected in the bulk DNA panel but not by LongSom were independently investigated in scRNA-seq data to detect variant allele read support.

Acknowledgments

We thank Joanna Hård for her help with mitochondrial mutation contamination. We thank Ivan Topolsky for his support with cloud computing. We thank Lara Fuhrmann for naming LongSom. A.D. and N.Bo were supported by the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement (#766030 to N.Be). B.J.H. was supported by the National Cancer Institute grant U24CA180922.

Footnotes

Code availability

LongSom is available at https://github.com/cbg-ethz/LongSom.

Conflict of interest

The authors declare no competing interests.

Data availability

The raw sequencing files, as well as the associated analysis files reported in this study are available in the European Genome-phenome Archive (EGA) under the accession number EGAS00001006807. Gencode v36 gene annotation used in this study is available at https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.annotation.gtf.gz. All additional information will be made available upon reasonable request to the authors. Marker genes for cancer and non-cancer cells are available at https://github.com/ETH-NEXUS/scAmpi_single_cell_RNA/blob/master/required_files/ovarian/celltype_list_ovarian.gmx.

Bibliography

  1. Ahsan MU, Liu Q, Fang L, Wang K. 2021. NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks. Genome Biol 22: 261.
  2. Al’Khafaji AM, Smith JT, Garimella KV, Babadi M, Popic V, Sade-Feldman M, Gatzen M, Sarkizova S, Schwartz MA, Blaum EM, et al. 2023. High-throughput RNA isoform sequencing using programmed cDNA concatenation. Nat Biotechnol.
  3. Amatu A, Sartore-Bianchi A, Siena S. 2016. NTRK gene fusions as novel targets of cancer therapy across multiple tumour types. ESMO Open 1: e000023.
  4. Borgsmüller N, Bonet J, Marass F, Gonzalez-Perez A, Lopez-Bigas N, Beerenwinkel N. 2020. BnpC: Bayesian non-parametric clustering of single-cell mutation profiles. Bioinformatics 36: 4854–4859.
  5. Bozza WP, Zhang Y, Zhang B. 2018. Cytokeratin 8/18 protects breast cancer cell lines from TRAIL-induced apoptosis. Oncotarget 9: 23264–23273.
  6. Carter H, Douville C, Stenson PD, Cooper DN, Karchin R. 2013. Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics 14 Suppl 3: S3.
  7. Cesi G, Philippidou D, Kozar I, Kim YJ, Bernardin F, Van Niel G, Wienecke-Baldacchino A, Felten P, Letellier E, Dengler S, et al. 2018. A new ALK isoform transported by extracellular vesicles confers drug resistance to melanoma cells. Mol Cancer 17: 145.
  8. Cheng K-C, Lin R-J, Cheng J-Y, Wang S-H, Yu J-C, Wu J-C, Liang Y-J, Hsu H-M, Yu J, Yu AL. 2019. FAM129B, an antioxidative protein, reduces chemosensitivity by competing with Nrf2 for Keap1 binding. EBioMedicine 45: 25–38.
  9. Chen C, Zhao S, Zhao X, Cao L, Karnad A, Kumar AP, Freeman JW. 2022. Gemcitabine resistance of pancreatic cancer cells is mediated by IGF1R dependent upregulation of CD44 expression and isoform switching. Cell Death Dis 13: 682.
  10. Chen S, Francioli LC, Goodrich JK, Collins RL, Kanai M, Wang Q, Alföldi J, Watts NA, Vittal C, Gauthier LD, et al. 2024. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625: 92–100.
  11. Dagogo-Jack I, Shaw AT. 2018. Tumour heterogeneity and resistance to cancer therapies. Nat Rev Clin Oncol 15: 81–94.
  12. Dentro SC, Leshchiner I, Haase K, Tarabichi M, Wintersinger J, Deshwar AG, Yu K, Rubanova Y, Macintyre G, Demeulemeester J, et al. 2021. Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes. Cell 184: 2239–2254.e39.
  13. Ding S, Chen X, Shen K. 2020. Single-cell RNA sequencing in breast cancer: Understanding tumor heterogeneity and paving roads to individualized therapy. Cancer Commun (Lond) 40: 329–344.
  14. Dondi A, Lischetti U, Jacob F, Singer F, Borgsmüller N, Coelho R, Tumor Profiler Consortium, Heinzelmann-Schwarz V, Beisel C, Beerenwinkel N. 2023. Detection of isoforms and genomic alterations by high-throughput full-length single-cell RNA sequencing in ovarian cancer. Nat Commun 14: 7780.
  15. Dong H, He L, Sun Q, Zhan J, Li J, Xiong X, Zhuang L, Wu S, Li Y, Yin C, et al. 2023. Inhibit ALDH3A2 reduce ovarian cancer cells survival via elevating ferroptosis sensitivity. Gene 876: 147515.
  16. Duan M, Hao J, Cui S, Worthley DL, Zhang S, Wang Z, Shi J, Liu L, Wang X, Ke A, et al. 2018. Diverse modes of clonal evolution in HBV-related hepatocellular carcinoma revealed by single-cell genome sequencing. Cell Res 28: 359–373.
  17. Foord C, Hsu J, Jarroux J, Hu W, Belchikov N, Pollard S, He Y, Joglekar A, Tilgner HU. 2023. The variables on RNA molecules: concert or cacophony? Answers in long-read sequencing. Nat Methods 20: 20–24.
  18. Fortier A-M, Asselin E, Cadrin M. 2013. Keratin 8 and 18 loss in epithelial cancer cells increases collective cell migration and cisplatin sensitivity through claudin1 up-regulation. J Biol Chem 288: 11555–11571.
  19. Gao R, Bai S, Henderson YC, Lin Y, Schalck A, Yan Y, Kumar T, Hu M, Sei E, Davis A, et al. 2021. Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. Nat Biotechnol 39: 599–608.
  20. Gao T, Soldatov R, Sarkar H, Kurkiewicz A, Biederstedt E, Loh P-R, Kharchenko PV. 2023. Haplotype-aware analysis of somatic copy number variations from single-cell transcriptomes. Nat Biotechnol 41: 417–426.
  21. Guo Q, Wang H, Duan J, Luo W, Zhao R, Shen Y, Wang B, Tao S, Sun Y, Ye Q, et al. 2022. An alternatively spliced p62 isoform confers resistance to chemotherapy in breast cancer. Cancer Res 82: 4001–4015.
  22. Hafemeister C, Satija R. 2019. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol 20: 296.
  23. Hao Y, Stuart T, Kowalski MH, Choudhary S, Hoffman P, Hartman A, Srivastava A, Molla G, Madad S, Fernandez-Granda C, et al. 2024. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat Biotechnol 42: 293–304.
  24. He C, Sun J, Liu C, Jiang Y, Hao Y. 2019. Elevated H3K27me3 levels sensitize osteosarcoma to cisplatin. Clin Epigenetics 11: 8.
  25. Hsu J, Jarroux J, Joglekar A, Romero JP, Nemec C, Reyes D, Royall A, He Y, Belchikov N, Leo K, et al. 2022. Comparing 10x Genomics single-cell 3’ and 5’ assay in short-and long-read sequencing. BioRxiv.
  26. Iyer DR, Harada N, Clairmont C, Jiang L, Martignetti D, Nguyen H, He YJ, Chowdhury D, D’Andrea AD. 2022. CCAR2 functions downstream of the Shieldin complex to promote double-strand break end-joining. Proc Natl Acad Sci USA 119: e2214935119.
  27. Jamal-Hanjani M, Quezada SA, Larkin J, Swanton C. 2015. Translational implications of tumor heterogeneity. Clin Cancer Res 21: 1258–1266.
  28. Joglekar A, Foord C, Jarroux J, Pollard S, Tilgner HU. 2023. From words to complete phrases: insight into single-cell isoforms using short and long reads. Transcription 1–13.
  29. Joglekar A, Prjibelski A, Mahfouz A, Collier P, Lin S, Schlusche AK, Marrocco J, Williams SR, Haase B, Hayes A, et al. 2021. A spatially resolved brain region- and cell type-specific isoform atlas of the postnatal mouse brain. Nat Commun 12: 463.
  30. Köster J, Rahmann S. 2012. Snakemake--a scalable bioinformatics workflow engine. Bioinformatics 28: 2520–2522.
  31. Kuipers J, Tuncel MA, Ferreira P, Jahn K, Beerenwinkel N. 2020. Single-cell copy number calling and event history reconstruction. BioRxiv.
  32. Kwok AWC, Qiao C, Huang R, Sham M-H, Ho JWK, Huang Y. 2022. MQuad enables clonal substructure discovery using single cell mitochondrial variants. Nat Commun 13: 1205.
  33. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, Vallejos CA, Campbell KR, Beerenwinkel N, Mahfouz A, et al. 2020. Eleven grand challenges in single-cell data science. Genome Biol 21: 31.
  34. Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR. 2014. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res 42: D980–5.
  35. Lappalainen T, Sammeth M, Friedländer MR, ‘t Hoen PAC, Monlong J, Rivas MA, Gonzàlez-Porta M, Kurbatova N, Griebel T, Ferreira PG, et al. 2013. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506–511.
  36. Lei JT, Shao J, Zhang J, Iglesia M, Chan DW, Cao J, Anurag M, Singh P, He X, Kosaka Y, et al. 2018. Functional Annotation of ESR1 Gene Fusions in Estrogen Receptor-Positive Breast Cancer. Cell Rep 24: 1434–1444.e7.
  37. Liu D, Gao Y, Liu J, Huang Y, Yin J, Feng Y, Shi L, Meloni BP, Zhang C, Zheng M, et al. 2021. Intercellular mitochondrial transfer as a means of tissue revitalization. Signal Transduct Target Ther 6: 65.
  38. Li H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34: 3094–3100.
  39. Mansoori B, Mohammadi A, Davudian S, Shirjang S, Baradaran B. 2017. The different mechanisms of cancer drug resistance: A brief review. Adv Pharm Bull 7: 339–348.
  40. Marine J-C, Dawson S-J, Dawson MA. 2020. Non-genetic mechanisms of therapeutic resistance in cancer. Nat Rev Cancer 20: 743–756.
  41. Marx V. 2023. Method of the year: long-read sequencing. Nat Methods 20: 6–11.
  42. Milbury CA, Creeden J, Yip W-K, Smith DL, Pattani V, Maxwell K, Sawchyn B, Gjoerup O, Meng W, Skoletsky J, et al. 2022. Clinical and analytical validation of FoundationOne®CDx, a comprehensive genomic profiling assay for solid tumors. PLoS ONE 17: e0264138.
  43. Miller TE, Lareau CA, Verga JA, DePasquale EAK, Liu V, Ssozi D, Sandor K, Yin Y, Ludwig LS, El Farran CA, et al. 2022. Mitochondrial variant enrichment from high-throughput single-cell RNA sequencing resolves clonal populations. Nat Biotechnol 40: 1030–1034.
  44. Mitra D, Brumlik MJ, Okamgba SU, Zhu Y, Duplessis TT, Parvani JG, Lesko SM, Brogi E, Jones FE. 2009. An oncogenic isoform of HER2 associated with locally disseminated breast cancer and trastuzumab resistance. Mol Cancer Ther 8: 2152–2162.
  45. Muyas F, Sauer CM, Valle-Inclán JE, Li R, Rahbari R, Mitchell TJ, Hormoz S, Cortés-Ciriano I. 2023. De novo detection of somatic mutations in high-throughput single-cell profiling data sets. Nat Biotechnol.
  46. Osorio D, Cai JJ. 2021. Systematic determination of the mitochondrial proportion in human and mice tissues for single-cell RNA-sequencing data quality control. Bioinformatics 37: 963–967.
  47. Prummer M, Bertolini A, Bosshard L, Barkmann F, Yates J, Boeva V, Tumor Profiler Consortium, Stekhoven D, Singer F. 2023. scROSHI: robust supervised hierarchical identification of single cells. NAR Genom Bioinform 5: lqad058.
  48. Qin Q, Popic V, Yu H, White E, Khorgade A, Shin A, Wienand K, Dondi A, Beerenwinkel N, Vazquez F, et al. 2024. CTAT-LR-fusion: accurate fusion transcript identification from long and short read isoform sequencing at bulk or single cell resolution. BioRxiv.
  49. Ramón Y Cajal S, Sesé M, Capdevila C, Aasen T, De Mattos-Arruda L, Diaz-Cano SJ, Hernández-Losa J, Castellví J. 2020. Clinical implications of intratumor heterogeneity: challenges and opportunities. J Mol Med 98: 161–177.
  50. Rogers MF, Shihab HA, Mort M, Cooper DN, Gaunt TR, Campbell C. 2018. FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics 34: 511–513.
  51. Roth A, McPherson A, Laks E, Biele J, Yap D, Wan A, Smith MA, Nielsen CB, McAlpine JN, Aparicio S, et al. 2016. Clonal genotype and population structure inference from single-cell tumor sequencing. Nat Methods 13: 573–576.
  52. Serin Harmanci A, Harmanci AO, Zhou X. 2020. CaSpER identifies and visualizes CNV events by integrative analysis of single-cell or bulk RNA-sequencing data. Nat Commun 11: 89.
  53. Shiau C-K, Lu L, Kieser R, Fukumura K, Pan T, Lin H-Y, Yang J, Tong EL, Lee G, Yan Y, et al. 2023. High throughput single cell long-read sequencing analyses of same-cell genotypes and phenotypes in human tumors. Nat Commun 14: 4124.
  54. Stein AN, Joglekar A, Poon C-L, Tilgner HU. 2022. ScisorWiz: visualizing differential isoform expression in single-cell long-read data. Bioinformatics 38: 3474–3476.
  55. Tokheim C, Karchin R. 2019. Chasmplus reveals the scope of somatic missense mutations driving human cancers. Cell Syst 9: 9–23.e8.
  56. Van Egeren D, Escabi J, Nguyen M, Liu S, Reilly CR, Patel S, Kamaz B, Kalyva M, DeAngelo DJ, Galinsky I, et al. 2021. Reconstructing the lineage histories and differentiation trajectories of individual cancer cells in myeloproliferative neoplasms. Cell Stem Cell 28: 514–523.e9.
  57. Van Egeren D, Kamaz B, Liu S, Nguyen M, Reilly CR, Kalyva M, DeAngelo DJ, Galinsky I, Wadleigh M, Winer ES, et al. 2022. Transcriptional differences between JAK2-V617F and wild-type bone marrow cells in patients with myeloproliferative neoplasms. Exp Hematol 107: 14–19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Vasan N, Baselga J, Hyman DM. 2019. A view on drug resistance in cancer. Nature 575: 299–309. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Xia J, Li S, Liu S, Zhang L. 2023. Aldehyde dehydrogenase in solid tumors and other diseases: Potential biomarkers and therapeutic targets. MedComm 4: e195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Yuan Y, Ju YS, Kim Y, Li J, Wang Y, Yoon CJ, Yang Y, Martincorena I, Creighton CJ, Weinstein JN, et al. 2023. Author Correction: Comprehensive molecular characterization of mitochondrial genomes in human cancers. Nat Genet 55: 1078. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Zafar H, Wang Y, Nakhleh L, Navin N, Chen K. 2016. Monovar: single-nucleotide variant detection in single cells. Nat Methods 13: 505–507. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Zampieri LX, Silva-Almeida C, Rondeau JD, Sonveaux P. 2021. Mitochondrial transfer in cancer: A comprehensive review. Int J Mol Sci 22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Zhang T, Jia H, Song T, Lv L, Gulhan DC, Wang H, Guo W, Xi R, Guo H, Shen N. 2023. De novo identification of expressed cancer somatic mutations from single-cell RNA sequencing data. Genome Med 15: 115. [DOI] [PMC free article] [PubMed] [Google Scholar]

Data Availability Statement

The raw sequencing files and the associated analysis files reported in this study are available in the European Genome-phenome Archive (EGA) under accession number EGAS00001006807. The GENCODE v36 gene annotation used in this study is available at https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_36/gencode.v36.annotation.gtf.gz. Marker genes for cancer and non-cancer cells are available at https://github.com/ETH-NEXUS/scAmpi_single_cell_RNA/blob/master/required_files/ovarian/celltype_list_ovarian.gmx. All additional information will be made available upon reasonable request to the authors.


Articles from bioRxiv are provided here courtesy of Cold Spring Harbor Laboratory Preprints
