eLife. 2022 Mar 1;11:e73520. doi: 10.7554/eLife.73520

Robust and annotation-free analysis of alternative splicing across diverse cell types in mice

Gonzalo Benegas 1, Jonathan Fischer 2, Yun S Song 3,4,5
Editors: Eduardo Eyras6, James L Manley7
PMCID: PMC8975553  PMID: 35229721

Abstract

Although alternative splicing is a fundamental and pervasive aspect of gene expression in higher eukaryotes, it is often omitted from single-cell studies due to quantification challenges inherent to commonly used short-read sequencing technologies. Here, we undertake the analysis of alternative splicing across numerous diverse murine cell types from two large-scale single-cell datasets—the Tabula Muris and BRAIN Initiative Cell Census Network—while accounting for understudied technical artifacts and unannotated events. We find strong and general cell-type-specific alternative splicing, complementary to total gene expression but of similar discriminatory value, and identify a large volume of novel splicing events. We specifically highlight splicing variation across different cell types in primary motor cortex neurons, bone marrow B cells, and various epithelial cells, and we show that the implicated transcripts include many genes which do not display total expression differences. To elucidate the regulation of alternative splicing, we build a custom predictive model based on splicing factor activity, recovering several known interactions while generating new hypotheses, including potential regulatory roles for novel alternative splicing events in critical genes like Khdrbs3 and Rbfox1. We make our results available using public interactive browsers to spur further exploration by the community.

Research organism: Mouse

eLife digest

Cells are the basic building blocks of all living things. There are numerous types of cells, and each cell has its own machinery to fulfill a specialised role. Despite their different purposes, most cells contain the same instructions, stored as DNA, on how to assemble the proteins needed to perform their intended functions. Cell types often vary in the frequency that each gene is read, leading to different quantities of proteins produced.

Moreover, a process known as alternative splicing enables cells to build multiple proteins from the same gene. It works by joining fragments of a gene’s code in various combinations. The resulting RNA sequences are molecular templates that cells use to assemble proteins.

Analysing these RNA sequences reveals which genes are switched on in different tissues of the body, and what proteins are being made. However, despite recent advancements, alternative splicing is rarely studied in single cells because of some sizeable technical challenges.

Benegas, Fischer and Song developed a computational toolkit designed to handle the unique challenges of analysing alternative splicing events in single cells. The analysis pipeline, called scQuint, was tested on two large datasets that capture cell-to-cell differences in the brain and other tissues of mice.

Nearly all the cell types studied exhibited clear differences in alternative splicing, such that cell types could be distinguished based on their splicing profiles. Intriguing patterns of splicing were highlighted in some immune cells and certain types of neurons. Across cell types, the genes with unique splicing patterns were often not the same as those with unique activity patterns, indicating that gene expression and alternative splicing are two complementary processes. New types of alternative splicing events were also identified. Benegas et al. also developed a statistical model to probe the roles of splicing regulators in different cell types.

In summary, the scQuint toolkit overcomes critical technical challenges typically encountered when analysing alternative splicing in single cells. It also reveals new insights about mechanisms of alternative splicing. The results are open access, made available using public interactive browsers, which should spur on other researchers to interrogate how alternative splicing differs in single cells.

Introduction

The past decade’s advances in single-cell genomics have enabled the data-driven characterization of a wide variety of distinct cell populations. Despite affecting more than 90% of human pre-mRNAs (Wang et al., 2008), isoform-level variation in gene expression has often been ignored because of quantification difficulties when using data from popular short-read sequencing technologies such as 10x Genomics Chromium and Smart-seq2 (Picelli et al., 2014). Long-read single-cell technologies, which greatly simplify isoform quantification, are improving (Byrne et al., 2017; Gupta et al., 2018; Volden and Vollmers, 2020; Lebrigand et al., 2020; Joglekar et al., 2021), but remain more costly and lower-throughput than their short-read counterparts. For these reasons and others, short-read datasets predominate and we must work with short reads to make use of the rich compendium of available data. In response, researchers have developed several computational methods to investigate splicing variation despite the sizable technical challenges inherent to this regime. A selection of these challenges and methods are summarized in the Appendix.

To complement single-cell gene expression atlases, we analyze alternative splicing in large single-cell RNA-seq (scRNA-seq) datasets from the Tabula Muris consortium (Schaum et al., 2018) and BRAIN Initiative Cell Census Network (BICCN) (Yao et al., 2021). These data span a broad range of mouse tissues and cell types, and remain largely unexplored at the level of transcript variation. During our initial analyses, we encountered pervasive coverage biases, a heretofore largely unappreciated mode of technical variation which greatly confounds biological variation across cell types. Unsatisfied with the performance of current methods when confronted by these biases, we implemented our own quantification, visualization, and testing pipeline, named scQuint (single-cell quantification of introns), which allowed us to continue our analyses in a robust, annotation-free, and computationally tractable manner. Parts of the scQuint pipeline are based on adaptations of the bulk RNA-seq alternative splicing analysis method LeafCutter (Li et al., 2018) to handle the unique challenges of scRNA-seq data. As we demonstrate in subsequent sections, our modifications in the quantification, statistical modeling, and optimization procedures lead to improved robustness, scalability, and calibration when working with data from single cells (Figure 2—figure supplement 2, also see Materials and methods).

Applying scQuint to these datasets, we find a strong signal of cell-type-specific alternative splicing and demonstrate that cell type can be accurately predicted given only splicing proportions. Moreover, our annotation-free approach enables us to detect a large quantity of cell-type-specific novel splicing events. In certain cell types, particularly the neuron subclasses, as many as 30% of differential splicing events that we detect are novel. In general, across the many considered cell types and tissues in both datasets, we find only a narrow overlap between the top differentially expressed and the top differentially spliced genes within a given cell type, illustrating the complementarity of splicing to expression. Our examination of neurons in the primary motor cortex suggests that splicing distinguishes neuron classes and subclasses as readily as does expression. We showcase alternative splicing patterns specific to the GABAergic (inhibitory) and Glutamatergic (excitatory) neuron classes as well as the subclasses therein. The implicated transcripts include key synaptic molecules and genes which do not display expression differences across subclasses. In developing marrow B cells, we find alternative splicing and novel transcription start sites (TSS) in critical transcription factors such as Smarca4 and Foxp1, while further investigation reveals dissimilar trajectories for expression and alternative splicing in numerous genes across B cell developmental stages. These findings buttress our belief in the complementary nature of these processes and provide clues to the regulatory architecture controlling the early B cell life cycle. To facilitate easy exploration of these datasets and our results, we make available several interactive browsers as a resource for the genomics community.

Finally, to advance our understanding of alternative splicing regulation, we build a statistical machine learning model to predict splicing events by leveraging both the expression levels and splicing patterns of splicing factors across cell types. This model recovers several known regulatory interactions such as the repression of splice site four exons in neurexins by Khdrbs3, while generating new hypotheses for experimental follow-up. For example, in addition to the regulatory effect of the whole-gene Khdrbs3 expression, the model predicts a regulatory role for a novel alternative TSS in this gene. In aggregate, our results imply that alternative splicing serves as a complementary rather than redundant component of transcriptional regulation and support the mining of large-scale single-cell transcriptomic data via careful modeling to generate hypothetical regulatory roles for splicing events.

Results

Methods overview

Robust, annotation-free quantification based on alternative introns

Most methods rely on the assumption that coverage depth across a transcript is essentially uniform (e.g., Akr1r1, Figure 1—figure supplement 1a). We instead found that Smart-seq2 data (Picelli et al., 2014) frequently contain sizable fractions of genes with coverage that decays with increasing distance from the 3’ ends of transcripts. For example, in mammary gland basal cells from the Tabula Muris dataset (Schaum et al., 2018), Ctnbb1 shows a gradual drop in coverage (Figure 1—figure supplement 1b) while Pdpn displays an abrupt reduction halfway through the 3’ UTR (Figure 1—figure supplement 1c). That the magnitude of these effects varies across technical replicates (plates) suggests they could be artifacts, possibly related to degradation or interrupted reverse transcription. Similar coverage bias artifacts are also apparent in the BICCN primary motor cortex data (Yao et al., 2021; Figure 1—figure supplement 2).

Such coverage biases affect gene expression quantification, and in some cases these batch effects are sufficient to comprise a significant proportion of the observed variation in expression levels. For the Tabula Muris mammary gland dataset, a low-dimensional embedding of cells based on gene expression reveals that some cell type clusters exhibit internal stratification by plate (Figure 1a). A subsequent test of differential gene expression between plate B002438 and all other plates returns 2870 significant hits after correction for multiple hypothesis testing, and all manually inspected differentially expressed genes exhibit these types of coverage biases. Perhaps unsurprisingly, quantification at the transcript level is apt to be even more sensitive to these artifacts than gene-level quantification, especially if it is based on coverage differences across the whole length of the transcript. The UMAP embeddings of isoform proportions (kallisto by Bray et al., 2016), exon proportions (DEXSeq by Anders et al., 2012), 100 bp bin coverage proportions (ODEGR-NMF by Matsumoto et al., 2020) or junction usage proportions across the whole gene (DESJ by Liu et al., 2021) depict a plate clustering pattern which scrambles the anticipated cell type clusters (Figure 1b–e).

Figure 1. Clustering patterns by cell type and plate in the mammary gland from a three month-old female mouse in Tabula Muris.

Cell embeddings based on different features were obtained by running PCA (gene expression) or VAE (the rest) followed by UMAP and subsequently colored by cell type (left column) and the plate in which they were processed (right column). (a) Gene expression, quantified using featureCounts (log-transformed normalized counts). (b) Isoform proportions. Isoform expression was estimated with kallisto and divided by the total expression of the corresponding gene to obtain isoform proportions. (c) Coverage proportions of 100 base-pair bins along the gene, as proposed by ODEGR-NMF. (d) Exon proportions, as proposed by DEXSeq. (e) Intron proportions across the whole gene, as proposed by DESJ. (f) Alternative intron proportions quantified by LeafCutter. (g) Alternative intron proportions (for introns sharing a 3’ acceptor site) as quantified by scQuint.

Figure 1—figure supplement 1. Coverage artifacts in mammary gland basal cells from Tabula Muris.

Aggregate read coverage of basal cells is shown for three genes in two female mice: 3_38_F, processed in three different plates, and 3_56_F, processed in two different plates. Visualization on the UCSC Genome Browser. (a) Akr1r1, with relatively uniform coverage, as expected. (b) Ctnbb1, with a gradual drop in coverage away from the 3’ end. The rate of coverage decay varies across plates. (c) Pdpn, with a sudden drop in coverage halfway through the 3’ UTR. The magnitude of the drop varies across plates.
Figure 1—figure supplement 2. Technical artifacts in BICCN Cortex.

Aggregate read coverage in Pdpk1 in two cell types, Vip and Sst, further separated according to the ‘batch’ metadata label into two groups. The first group contains cells from batches R8S4-180530 and R8S4-180524, while the second group contains the remaining batches. Cells from different batches belong to different mice and were processed on different dates. In all groups, coverage decreases rapidly in the 3’ UTR of the isoform with the longest 3’ UTR, eventually reaching zero. Additionally, the relative coverage in this region compared to the rest of the gene (seeming to originate from a different isoform with a short 3’ UTR) varies drastically across batches, and consistently in different cell types. In principle, this could be due to biological differences between mice from different batches, but an explanation based on technical factors such as amplification bias may be more plausible.

With these considerations in mind, we sought to quantify transcript variation in a fashion that would be more robust to coverage differences along the transcript. Although some bulk RNA-seq methods such as RSEM (Li and Dewey, 2011) can model positional bias, they do so globally rather than in the gene-specific manner we encounter. One potential approach is alternative intron quantification as performed by bulk RNA-seq methods MAJIQ (Vaquero-Garcia et al., 2016), JUM (Wang and Rio, 2018), and LeafCutter (Li et al., 2018). Promisingly, quantification via LeafCutter (Figure 1f) yields an embedding that displays less clustering by plate than the other approaches we tried. We therefore based scQuint’s quantification approach on LeafCutter’s, with the key difference of restricting to alternative introns which share a common 3’ acceptor site (Figure 2). This results in alternative splicing events that are equidistant from the 3’ end of transcripts and which are less affected by the coverage biases we observed in scRNA-seq data. The embedding of cells based on our quantification approach (Figure 1g) shows less clustering by plate than LeafCutter and other methods.
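The grouping step described above can be sketched as follows. This is a minimal illustration rather than scQuint's actual implementation: the tuple layout `(chrom, start, end, strand)`, the function name `group_by_acceptor`, and the toy coordinates are all assumptions made for this example. It also assumes that the 3’ acceptor site is the intron end on the + strand and the intron start on the - strand.

```python
from collections import defaultdict


def group_by_acceptor(introns):
    """Group introns that share a 3' acceptor site.

    introns: iterable of (chrom, start, end, strand) tuples (hypothetical layout).
    Returns a dict mapping (chrom, acceptor, strand) to a sorted list of introns,
    keeping only groups with at least two introns (i.e. alternative introns).
    """
    groups = defaultdict(set)
    for chrom, start, end, strand in introns:
        # The acceptor is the downstream end of the intron in transcript orientation.
        acceptor = end if strand == "+" else start
        groups[(chrom, acceptor, strand)].add((chrom, start, end, strand))
    return {key: sorted(members) for key, members in groups.items() if len(members) >= 2}


# Toy coordinates, invented for illustration only.
introns = [
    ("chr1", 100, 500, "+"),
    ("chr1", 300, 500, "+"),  # shares acceptor 500 with the intron above
    ("chr1", 600, 900, "+"),  # no alternative partner, so it is dropped
    ("chr2", 200, 700, "-"),
    ("chr2", 200, 400, "-"),  # shares acceptor 200 on the minus strand
]
alt_groups = group_by_acceptor(introns)
```

Because every intron in a group ends at the same acceptor, the split reads supporting them lie at the same distance from the 3’ end of the transcript, which is the property the text relies on for robustness to coverage decay.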

Figure 2. Overview of scQuint.

(a) Intron usage is quantified from split reads in each cell, with introns sharing 3’ splice sites forming alternative intron groups. (b) Genome-wide intron usage is mapped into a low dimensional latent space using a Dirichlet-Multinomial VAE. Visualization of the latent space is done via UMAP. (c) A Dirichlet-Multinomial GLM tests for differential splicing across conditions such as predefined cell types or clusters identified from the splicing latent space.

Figure 2—figure supplement 1. Splicing latent space when alternative intron counts are shuffled.

To verify that absolute gene expression does not affect the splicing latent space, we perturbed the BICCN Cortex dataset by resampling alternative intron counts with a fixed proportion in all cells (the proportions in different alternative intron groups varied and were sampled from a uniform Dirichlet distribution). In this scenario, different cell types still vary in their gene expression levels but not in their splicing patterns. As hoped, the splicing latent space does not distinguish between cell types, indicating it is only capturing differences in splicing proportions rather than changes in absolute gene expression.
Figure 2—figure supplement 2. Comparison with LeafCutter.

(a) Quantification runtime. Time to perform intron quantification on BICCN Cortex dataset, including cell subsampling to understand effect of number of cells. (b) Differential splicing runtime and memory usage. We randomly split all 6220 BICCN Cortex cells into two equally sized groups and performed differential splicing between them. Runtime (left) and memory usage (right) are displayed. (c) Differential splicing p-value calibration. In the same random split of (b), the null hypothesis of no difference in splicing proportions holds, and we expect the distribution of p-values to be uniform. The quantile-quantile plot of p-values obtained with scQuint shows their distribution is indeed uniform, suggesting that the model is well-calibrated under the null; this is not true for p-values obtained by LeafCutter. All experiments were performed on a Skylake processor (2 × 16 cores @ 2.1 GHz) with 96 GB of RAM.

Another advantage of alternative intron quantification is the ability to easily discover novel alternative splicing events. Whereas short reads generally cannot be associated with specific transcript isoforms, nor even exons if they partially overlap, split reads uniquely associate with a particular intron. Consequently, intron-based quantification does not depend on annotated transcriptome references and permits the discovery of novel alternative splicing events. This is important since, as detailed later, we estimate up to 30% of cell-type-specific differential splicing events are novel. Other annotation-free methods have been applied to single-cell short-read full-length data, but they do not provide a statistical test for differential splicing between two groups of cells (Appendix 1—table 1).

We do not recommend using scQuint to analyze alternative splicing in 10x Genomics Chromium data given its strong 3’ transcript bias and evidence suggesting that these data can detect about half the number of junctions detected by Smart-seq2 (Wang et al., 2021). This imposes a fundamental limit on the number of transcripts that can be distinguished, and we expect alternative intron quantification to be sub-optimal in this setting. Nonetheless, several approaches for differential transcript usage in 10x data have been developed: Sierra (Patrick et al., 2020), SpliZ (Olivieri et al., 2020), and a kallisto-based approach which could be adapted for this task (Ntranos et al., 2019).

Dimensionality reduction with Variational Autoencoder

To perform dimensionality reduction using splicing profiles, we developed a novel Variational Autoencoder (VAE) (Kingma and Welling, 2014) with a Dirichlet-Multinomial noise model, a natural distribution for sparse, overdispersed count data (Figure 2b, Materials and methods). For example, the often encountered ‘binary’ splicing (Buen Abad Najar et al., 2020) can be modeled by fitting a concentration parameter close to zero. VAEs are flexible and scalable generative models which have been successfully applied to analyze gene expression (Lopez et al., 2020) but have not yet been employed to investigate alternative splicing. To verify that we prevent leakage of gene expression information into our splicing profiles, we applied our VAE to embed a shuffled dataset obtained by resampling alternative intron counts with a fixed proportion in all cells. This shuffled dataset contained expression variability between cells but no splicing differences, and, as hoped, the resulting splicing latent space did not distinguish among cell types, indicating that it captures differences in splicing proportions rather than changes in absolute gene expression (Figure 2—figure supplement 1). We compared the latent space obtained with the VAE to the one obtained using Principal Component Analysis (PCA), a standard dimensionality reduction technique used in the LeafCutter and BRIE2 software packages. The VAE better distinguishes cell types than PCA (Figure 3), especially in the mammary gland and diaphragm.
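To make the noise model concrete, the sketch below evaluates the Dirichlet-Multinomial log-probability of the read counts in a single intron group and shows how a concentration parameter close to zero favors ‘binary’ (all-or-nothing) intron usage. It is only an illustration of the distribution, not of the VAE itself. The function names and the parameterization `alpha = proportion * concentration` are assumptions for this example.

```python
from math import exp, lgamma


def dirichlet_multinomial_logpmf(counts, alpha):
    """Log-probability of intron read counts within one intron group.

    counts: non-negative integer read counts, one per intron in the group.
    alpha:  positive Dirichlet parameters, one per intron.
    """
    n = sum(counts)
    a = sum(alpha)
    # log n! plus the normalizing ratio log Gamma(a) - log Gamma(n + a).
    logp = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    for x, al in zip(counts, alpha):
        logp += lgamma(x + al) - lgamma(al) - lgamma(x + 1)
    return logp


def alpha_from(proportions, concentration):
    """Hypothetical helper: mean proportions scaled by a concentration."""
    return [p * concentration for p in proportions]


low = alpha_from([0.5, 0.5], 0.1)     # near-zero concentration
high = alpha_from([0.5, 0.5], 100.0)  # high concentration, close to multinomial

binary_low = dirichlet_multinomial_logpmf([10, 0], low)
mixed_low = dirichlet_multinomial_logpmf([5, 5], low)
binary_high = dirichlet_multinomial_logpmf([10, 0], high)
mixed_high = dirichlet_multinomial_logpmf([5, 5], high)
```

With the same mean proportions, the all-or-nothing outcome `[10, 0]` is more probable than the balanced outcome `[5, 5]` under the low concentration, and the ordering reverses under the high concentration.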

Figure 3. Comparison of splicing latent spaces obtained with PCA and VAE.

Cells from (a) the cortex, (b) mammary gland and (c) diaphragm are projected into a latent space using PCA or VAE and visualized using UMAP. Cell type labels are obtained from the original data sources and are based on clustering in the expression latent space. The VAE is able to better distinguish cell types in the splicing latent space than PCA.

Differential splicing hypothesis testing with Generalized Linear Model

To test for differential splicing across cell types or conditions, we adopt a Dirichlet-Multinomial Generalized Linear Model (GLM) coupled with a likelihood-ratio test (Figure 2c, Materials and methods). We do so by adapting one of LeafCutter’s proposed models for bulk RNA-seq to the scRNA-seq setting and apply it to our Smart-seq2 intron quantification. Namely, due to the sparse nature of scRNA-seq splicing data, we implement a more parsimonious statistical model featuring gene-level rather than intron-level parameters. Furthermore, we adjust the model-fitting algorithm at the initialization and optimization stages (see Materials and methods). After our modifications, we obtain well-calibrated p-values whereas those from LeafCutter’s original differential splicing model are anti-conservative (Figure 2—figure supplement 2) and perhaps prone to extra false positives if applied directly to scRNA-seq data. We also find improvements in computational cost, both in runtime and memory usage (Figure 2—figure supplement 2).
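The final step of a likelihood-ratio test can be sketched as below. The sketch assumes the two maximized log-likelihoods are already available and that the statistic is compared against the standard asymptotic chi-square reference distribution. The log-likelihood values and the degrees of freedom in the example are hypothetical and not taken from the paper. Fitting the Dirichlet-Multinomial GLM itself is not shown.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with positive integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return exp(-x / 2) * total
    # Odd df: normal tail term plus a finite series.
    term, total = 1.0, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2) * total


def likelihood_ratio_test(loglik_null, loglik_alt, df):
    """Return (statistic, p-value) for nested models differing by df parameters."""
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    return stat, chi2_sf(stat, df)


# Hypothetical values: the null model shares splicing proportions across groups,
# while the alternative allows group-specific proportions.
stat, pvalue = likelihood_ratio_test(loglik_null=-105.0, loglik_alt=-102.0, df=2)
```

Under a well-calibrated model, p-values computed this way on a random split of cells should be approximately uniform, which is the check reported for scQuint in Figure 2—figure supplement 2.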

As described in Materials and methods, we generated synthetic data in order to benchmark scQuint against three other methods that also offer two-sample tests for differential transcript usage proportions: BRIE2 and DTUrtle, both designed for scRNA-seq, and LeafCutter, designed for bulk RNA-seq (Figure 4). While the choice of an appropriate simulation model for scRNA-seq data is very much an open area of debate, particularly at the transcript level, we attempted to recreate a challenging setting for inference by assuming low coverage (1–2X) and high overdispersion (variance-to-mean ratio of 8). We performed three in silico experiments to assess performance under the differing conditions of even transcript coverage, unannotated events, and coverage decay across the transcript. In the case of even coverage, scQuint, LeafCutter, and BRIE2 perform similarly and identify events accurately, while DTUrtle is slightly behind. scQuint does only slightly worse with low cell counts and low coverage, which is probably a trade-off for the robustness that comes from only using reads from junctions sharing 3’ acceptor sites. Next, we recreated the unannotated setting by masking the reference given to methods. Only scQuint and LeafCutter are able to perform differential transcript usage testing in this setting, and, as expected, they performed nearly identically to the annotated setting with even coverage. Lastly, we created a setting where transcript coverage decays with distance from the 3’ end in one of the two groups, mirroring a pattern we often saw in the real data analyzed for this paper. Here, scQuint outperforms the other tested methods by a wide margin, with performance improving at higher coverages, unlike other methods. These results validate that scQuint is robust to both incomplete annotations and coverage decay while only paying a modest penalty relative to other methods under ideal conditions (even coverage and annotated events).

Figure 4. Evaluation of differential splicing test on simulated data.

ROC AUC for detecting differential transcript usage between two groups, based on the p-value produced by different methods. Unannotated: the transcript reference given to methods is masked. Coverage decay: coverage decay with distance to the 3’ end of the transcript is induced in one of the two groups.
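The evaluation metric named in this caption can be sketched with the pairwise (Mann-Whitney) formulation of ROC AUC, in which smaller p-values count as stronger evidence for differential usage. The function name and the toy p-values and labels are invented for illustration and are not results from the paper.

```python
def roc_auc_from_pvalues(pvalues, is_differential):
    """ROC AUC when events are ranked by ascending p-value.

    pvalues:         one p-value per simulated event.
    is_differential: ground-truth flags (True if the event was simulated
                     with a real difference between the two groups).
    """
    pos = [p for p, y in zip(pvalues, is_differential) if y]
    neg = [p for p, y in zip(pvalues, is_differential) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative event")
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos < p_neg:
                wins += 1.0  # true event ranked ahead of a null event
            elif p_pos == p_neg:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example with three true events and three null events.
pvalues = [0.001, 0.02, 0.5, 0.04, 0.3, 0.9]
labels = [True, True, True, False, False, False]
auc = roc_auc_from_pvalues(pvalues, labels)
```

An AUC of 0.5 corresponds to a test whose p-values carry no information about which events are truly differential, and an AUC of 1 corresponds to perfect ranking.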

Augmenting cell atlases with splicing information

We applied scQuint to two of the largest available Smart-seq2 datasets. The first comprehensively surveys the mouse primary motor cortex (BICCN Cortex) (Yao et al., 2021) while the second contains over 100 cell types distributed across 20 mouse organs (Tabula Muris) (Schaum et al., 2018; Table 1). We detect more alternative introns in BICCN Cortex neurons than in the entire broad range of cell types present in Tabula Muris (which includes neurons but in much smaller number). This observation comports with previous findings that the mammalian brain has exceptionally high levels of alternative splicing (Yeo et al., 2004). Booeshaghi et al. (2021) analyzed BICCN Cortex at the transcript level, but focused on changes in absolute transcript expression rather than proportions. While the authors indirectly find some differences in transcript proportions by inspecting genes with no differential expression, this is not a systematic analysis of differential transcript usage. Meanwhile, only microglial cells in Tabula Muris (Nip et al., 2020) have been analyzed at the transcript level. (Tabula Muris also contains 10x Chromium data analyzed at the transcript level [Patrick et al., 2020].)

Table 1. Overview of analyzed datasets.

Table 1—source data 1. Number of cells per cell type and donor in BICCN Cortex.
Table 1—source data 2. Number of cells per tissue and donor in Tabula Muris.
Dataset      | Cells  | Tissues | Cell types | Individuals | Genes  | Alt. introns | Unannotated
BICCN Cortex | 6220   | 1       | 11         | 45          | 26,488 | 39,357       | 29%
Tabula Muris | 44,518 | 23      | 117        | 8           | 27,348 | 29,965       | 25%

As a community resource, we provide complementary ways to interactively explore splicing patterns present in these datasets (Figure 5), available at (https://github.com/songlab-cal/scquint-analysis, Benegas, 2021a) with an accompanying tutorial video. The UCSC Genome Browser (Kent et al., 2002) permits exploration of alternative splicing events within genomic contexts such as amino acid sequence, conservation score, or protein binding sites, while allowing users to select different length scales for examination. We additionally leverage the cell×gene browser (Megill et al., 2021) (designed for gene expression analysis) to visualize alternative intron PSI (percent spliced-in, defined as the proportion of reads supporting an intron relative to the total in the intron group) via cell embeddings. Further, one can generate histograms to compare across different groups defined by cell type, gender, or even manually selected groups of cells. These tools remain under active development by the community, and we hope that both the genome- and cell-centric views will soon be integrated into one browser.
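The PSI definition given above (reads supporting an intron divided by the total reads in its intron group) can be written out as a short sketch. The function name, the `min_total` threshold, and the intron labels and counts are hypothetical. Returning `None` for a group with no reads reflects the fact that PSI is undefined in cells where the intron group is not covered.

```python
def intron_group_psi(counts, min_total=1):
    """Compute PSI for each intron in one alternative intron group of one cell.

    counts:    dict mapping intron identifier -> split-read count.
    min_total: minimum total reads in the group for PSI to be defined
               (hypothetical threshold for this sketch).
    Returns a dict of PSI values, or None if the group has too few reads.
    """
    total = sum(counts.values())
    if total < min_total:
        return None
    return {intron: count / total for intron, count in counts.items()}


# Toy counts for three introns sharing a 3' acceptor site in one cell.
cell_counts = {"intron_a": 6, "intron_b": 3, "intron_c": 1}
psi = intron_group_psi(cell_counts)

# A cell with no split reads in this group has undefined PSI.
uncovered = intron_group_psi({"intron_a": 0, "intron_b": 0, "intron_c": 0})
```

By construction, the PSI values within a group sum to one, so they describe relative intron usage independently of the absolute expression level of the gene.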

Figure 5. Interactive visualizations of splicing patterns.

As an example, a skipped exon in Myl6. (a) The UCSC Genome browser visualization of this locus. Bottom: annotated isoforms of Myl6, including a skipped exon. Center: aggregate read coverage in three cell types with varying inclusion levels of the skipped exon. Top: three alternative introns that share a 3’ acceptor site. The identified intron’s proportion corresponds to the skipped exon’s inclusion level. (b) cell×gene browser visualization of the marked intron’s proportions (Myl6_chr10:128491034–128491720). Center: intron proportion for each cell in the UMAP expression embedding. Sides: intron proportion histogram for (left) different cell types and (right) all cells.

Cell-type-specific splicing signal is strong and complementary to gene expression

Primary motor cortex

We first explored the splicing latent space of BICCN Cortex cells by comparing it to the usual expression latent space (Figure 6a). Cells in the splicing latent space strongly cluster by cell type (annotated by Yao et al., 2021 based on gene expression). A similar analysis was recently performed (Feng et al., 2021) on a different cortex subregion in which most, but not all, neuron subclasses could be distinguished based on splicing profiles (e.g., L6 CT and L6b could not be separated). However, the authors only considered annotated skipped exons, a subset of the events we quantify, and used a different dimensionality reduction technique.

Figure 6. Splicing patterns in BICCN Cortex.

(a) Expression and splicing latent spaces, visualized using UMAP. The expression (splicing) latent space is defined by running PCA (VAE) on the gene expression (alternative intron proportion, PSI) matrix. Cell types separate well in both latent spaces. (b) PSI of selected introns (left) and expression (log-transformed normalized counts) of their respective genes (right) averaged across cell types. Top: introns distinguishing Glutamatergic and GABAergic neuron classes. Bottom: introns distinguishing neuron subclasses. (c–e) Sashimi plots (Garrido-Martín et al., 2018) of specific alternative splicing events, displaying overall read coverage with arcs indicating usage of different introns (certain introns are shrunk for better visualization). (c) Novel skipped exon in Pgm2. (d) Novel alternative transcription start site (TSS) in Rbfox1. (e) Annotated skipped exon (SE) in Nrxn1.

Figure 6—source data 1. Intron coordinates for panel (b).

Figure 6—figure supplement 1. Marker genes for cell types in BICCN Cortex.

Mean (log-transformed) expression for some of the top differentially expressed genes in each cell type.
Figure 6—figure supplement 2. PSI distribution of Pgm2_32951.

Only the six individuals with the highest number of cells are displayed. Cell types in which an individual has PSI defined in fewer than three cells are marked N/A. Per the experimental design of this dataset, the top 3 individuals have only Glutamatergic cell types sequenced, while the bottom 3 have only GABAergic.
Figure 6—figure supplement 3. PSI distribution of Rbfox1_26172.

Only the six individuals with the highest number of cells are displayed. Cell types in which an individual has PSI defined in fewer than three cells are marked N/A. Per the experimental design of this dataset, the top 3 individuals have only Glutamatergic cell types sequenced, while the bottom 3 have only GABAergic.
Figure 6—figure supplement 4. PSI distribution of Nrxn1_8067.

Only the six individuals with the highest number of cells are displayed. Cell types in which an individual has PSI defined in fewer than three cells are marked N/A. Per the experimental design of this dataset, the top 3 individuals have only Glutamatergic cell types sequenced, while the bottom 3 have only GABAergic.

Figure 6b (top left) highlights some differentially spliced genes between Glutamatergic and GABAergic neurons, including the glutamate metabotropic receptor Grm5 as well as Shisa9/Ckamp44, which associates with AMPA ionotropic glutamate receptors (von Engelhardt et al., 2010). The expression pattern of these genes, meanwhile, does not readily distinguish the neuron classes (Figure 6b, top right). In Pgm2, a gene of the glycolysis pathway thought to be regulated in the developing cortex by mTOR (Schüle et al., 2021), we discover a novel exon preferentially included in Glutamatergic neurons (Figure 6c, Figure 6—figure supplement 2).

Our differential splicing test reveals thousands of cell-type-specific splicing events (further discussed below in subsection Comparison of selected tissues), highlighting marker introns that distinguish neuron subclasses, while the expression of their respective genes does not; for example, compare the bottom left and bottom right panels of Figure 6b. Genes that better distinguish cell types at the expression level can be seen in Figure 6—figure supplement 1. As another example of the many novel events we discover, we showcase a novel alternative transcription start site in Rbfox1, a splicing factor known to regulate cell-type-specific alternative splicing in the brain (Wamsley et al., 2018; Figure 6d, Figure 6—figure supplement 3). This novel TSS (exon chr16:5763871–5763913, intron Rbfox1_26172), which lies in a highly-conserved region, is (partially) used by only L6b neurons. We are also able to detect well-known cell-type-specific alternatively spliced genes such as Nrxn1, which encodes a key pre-synaptic molecule (Figure 6e, Figure 6—figure supplement 4; Fuccillo et al., 2015). In this case, we observe an exon (known as splice site 2) exclusively skipped in Vip and Lamp5 neurons.

General patterns in Tabula Muris

We next turned our attention to Tabula Muris, which comprises a wide variety of organs and cell types from across the entire body. As before, we initially compared the expression and splicing latent spaces using UMAP (Figure 7a). This revealed broadly consistent clusters between projections, but a visible shift in the global layout of these clusters. In particular, whereas cell types were better separated in the expression projection, cell classes (e.g., endothelial, epithelial, immune) formed more coherent clusters in the splicing projection.

Figure 7. Global analysis of Tabula Muris.

(a) UMAP visualization of the expression (left) and splicing (right) latent spaces. Each dot is a cell, colored by organ, and overlays indicate the primary cell type comprising that cluster. (b) Tanglegram comparing dendrograms of major cell types based on distances in the expression (left) and splicing (right) latent spaces, highlighting functional classes with specific colors.

To supplement our qualitative comparison of UMAP projections with a more rigorous approach, we built dendrograms and a tanglegram using the respective distances between cells in each of the expression and splicing latent spaces (Figure 7b). Despite minor shifts, the dendrograms resemble one another, and most subtree structure is preserved. Their entanglement, a quantitative measure of the discrepancy between hierarchical clusterings, is only 6%, indicating a high degree of similarity. (For comparison, the entanglement value between the dendrogram for all expressed genes and that for transcription factors is 11% [Schaum et al., 2018].) As in the UMAP visualization, immune cells group together more closely in the splicing dendrogram. However, unlike in the UMAP projection, we observe that several types of pancreatic cells cluster together with neurons, a cell type with which they have long been believed to share an evolutionary origin (Le Roith et al., 1982). Notably, the left dendrogram in Figure 7b shows that hepatocytes are clear outliers in the expression latent space. We suspect this may be due to technical differences from using 96-well plates rather than the 384-well plates used for other cell types.
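As an illustration of how an entanglement score can be computed, the sketch below follows one common definition (to our understanding, the one used by tanglegram tools such as the R package dendextend): leaves of the left tree are numbered 1 to n, the same labels are looked up in the right tree, and the L-norm discrepancy is divided by the worst case in which one ordering is the reverse of the other. The function name, the default exponent, and the toy labels are assumptions made for illustration, not the exact procedure used for Figure 7b.

```python
def entanglement(order_left, order_right, L=1.5):
    """Entanglement between two leaf orderings of the same labels.

    Returns 0.0 for identical orderings and 1.0 when one is the reverse
    of the other. The exponent L is an assumed default.
    """
    if sorted(order_left) != sorted(order_right):
        raise ValueError("both orderings must contain the same labels")
    n = len(order_left)
    if n < 2:
        return 0.0
    # Position (1..n) of every label in the right-hand dendrogram.
    pos_right = {label: i + 1 for i, label in enumerate(order_right)}
    # Left-hand leaves sit at positions 1..n; compare with the right-hand positions.
    observed = sum(
        abs((i + 1) - pos_right[label]) ** L
        for i, label in enumerate(order_left)
    )
    # Worst case: the right-hand ordering is the exact reverse of the left one.
    worst = sum(abs((i + 1) - (n - i)) ** L for i in range(n))
    return observed / worst


# Toy usage with hypothetical cell-type labels.
left = ["B cell", "T cell", "neuron", "hepatocyte"]
right = ["B cell", "neuron", "T cell", "hepatocyte"]
score = entanglement(left, right, L=1)
```

A low score means the two dendrograms can be drawn side by side with few crossing lines, which is the sense in which the 6% value above indicates similar hierarchies.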

B cell development in the marrow

We then focused on developing B cells from the bone marrow in Tabula Muris. In the splicing latent space, we found that immature B cells are harder to distinguish from the other B cell subpopulations (Figure 8a), reflecting less refined splicing programs or limitations in transcript capture efficiency. Immature B cells also have fewer differential splicing events when compared to the other stages of B cell development (Figure 8b). The top differential splicing events we identified throughout development displayed splicing trajectories mostly independent from the trajectories of gene expression (Figure 8c). We highlight alternative TSSs (one of them novel) in two transcription factors essential for B cell development: Smarca4, encoding BRG1 (Bossen et al., 2015; Figure 8d, Figure 8—figure supplement 1); and Foxp1 (Hu et al., 2006; Figure 8e, Figure 8—figure supplement 2). While Foxp1 expression peaks in pre-B cells and does not follow a monotonic trend over developmental stages, the alternative TSS is progressively included throughout B cell development. Combining gene-level expression with TSS usage, which can influence translation rate, provides a more nuanced characterization of the expression patterns of these important transcription factors. Some other differentially spliced genes with well-known roles in B cell development are Syk (Cornall et al., 2000), Dock10 (García-Serna et al., 2016), Selplg/Psgl-1 (González-Tajuelo et al., 2020), and Rps6ka1 (Stein et al., 2017).

Figure 8. Splicing in developing marrow B cells from Tabula Muris.

B cell developmental stages include pro-B, pre-B, immature B, and naive B. (a) Expression versus splicing latent space, as defined previously. In the splicing latent space, some cell types (pro-B) are better distinguished than others (immature B). (b) Number of differential splicing events when comparing a B cell stage vs. the rest. (c) PSI of some introns that are differentially spliced throughout development, together with expression of the respective genes (log-transformed normalized counts). Expression and splicing can have very different trajectories. (d) Sashimi plot of novel alternative transcription start site (TSS) in Smarca4. The novel TSS has maximum usage in pre-B cells, and then decays, while the expression peaks at pro-B cells. (e) Sashimi plot of an annotated alternative TSS in Foxp1. The proximal TSS is increasingly used as development progresses, while the expression peaks at pre-B cells.

Figure 8—source data 1. Intron coordinates for panel (c).

Figure 8—figure supplement 1. PSI distribution of Smarca4_28720.

Figure 8—figure supplement 2. PSI distribution of Foxp1_11076.

Epithelial and endothelial cell types across organs

Having compared different cell types within organs, we analyzed putatively similar cell types which are present in multiple organs to investigate splicing variation associated with tissue environment and function. We find many alternative introns with strong PSI differences across epithelial cell types, including several which are novel (Figure 9a). Conversely, apart from those in the brain, endothelial cell types fail to display such striking differences (Figure 9b). These patterns are consistent with the UMAP projection and dendrogram, both of which suggested less heterogeneity among endothelial than epithelial cells (Figure 7).

Figure 9. Alternative splicing patterns across epithelial and endothelial cell types.

(a–b) PSI of selected introns (left) and expression (log-transformed normalized counts) of the corresponding genes (right) averaged across cell types. Novel intron groups are marked with (*). (a) Introns distinguishing epithelial cell types. (b) Introns distinguishing endothelial cell types. (c) Sashimi plot of an alternative TSS in Itpr1. (d) Sashimi plot of a complex alternative splicing event in Khk.

Figure 9—source data 1. Intron coordinates for panel (a).
Figure 9—source data 2. Intron coordinates for panel (b).

Figure 9—figure supplement 1. Full-gene view of novel alternative TSS in Itpr1.

Large intestine secretory cells aggregate read coverage visualized in the UCSC Genome Browser.
Figure 9—figure supplement 2. PSI distribution of Itpr1_26257.

Only six individuals with highest number of cells are displayed. Marked N/A are cell types where the individuals have PSI defined in fewer than three cells.
Figure 9—figure supplement 3. PSI distribution of Khk_24896.

Only six individuals with highest number of cells are displayed. Marked N/A are cell types where the individuals have PSI defined in fewer than three cells.

Our analysis revealed a novel alternative TSS in Itpr1 (Figure 9c, Figure 9—figure supplement 2), an intracellular calcium channel in the endoplasmic reticulum, which regulates secretory activity in epithelial cells of the gastrointestinal tract (Lemos et al., 2020). This novel TSS yields a shorter protein isoform (full view in Figure 9—figure supplement 1) which preserves the transmembrane domain, though it is unclear whether this isoform is functional. Notably, it is the predominant isoform in large intestine secretory cells, and these cells express Itpr1 at the highest level among all epithelial cell types in the dataset. All nine novel alternative splicing events in Figure 9a are alternative TSSs, with four affecting the 5’ UTR and five affecting the coding sequence.

Figure 9d (PSI distribution in Figure 9—figure supplement 3) illustrates a complex alternative splicing event in Khk involving the well-studied exons 3a and 3c (Hayward and Bonthron, 1998). Khk catalyzes the conversion of fructose into fructose-1-phosphate, and the two protein isoforms corresponding to either exon 3a or 3c inclusion differ in their thermostability and substrate affinity (Asipu et al., 2003). While the literature describes these exons as mutually exclusive, the transcriptome reference includes transcripts where neither or both may be included. Although we did not find cell types with high inclusion rates for both exons, we did see multiple cell types where both exons are predominantly excluded, for example, epithelial cells from the large intestine. Other differentially spliced genes are involved in cellular junctions, which are particularly important in epithelial tissue. These include Gsn, Eps8, Tln2, Fermt3, and Mapre2.

Comparison of selected tissues

Because of the breadth of the Tabula Muris dataset, we can look for general trends across a diverse array of tissues and cell types. Table 2 summarizes differential expression and splicing for some of the cell types and tissues with the largest sample sizes. First, we note the intersection between the top 100 most differentially expressed and top 100 most differentially spliced genes (ranked by p-value) is consistently low. This means that most differentially spliced genes, which might be of critical importance in a biological system, will go unnoticed if a study only considers differential expression. Second, L5 IT neurons have a larger fraction of genes with differential splicing relative to the number of differentially expressed genes.

Table 2. Summary of differential expression and splicing for select cell types with the largest sample sizes.

The overlap between the top 100 differentially expressed genes and the top 100 differentially spliced genes is low, indicating that splicing provides complementary information. In addition, L5 IT neurons have a higher ratio of differentially spliced genes to differentially expressed genes than the other cell types. Diff. spl. genes: number of differentially spliced genes between the cell type and other cell types in the same tissue. Diff. exp. genes: number of differentially expressed genes between the cell type and other cell types in the same tissue. See Materials and methods for details on the tests for differential splicing and expression.

Tissue | Total # cells | # cell types | Cell type | # cells | Diff. spl. genes | Diff. exp. genes | Ratio | Top-100 overlap
Brain Non-Myeloid | 3049 | 6 | Oligodendrocyte | 1390 | 880 | 8835 | 0.10 | 4
Cortex | 6220 | 10 | L5 IT | 1571 | 1447 | 6402 | 0.23 | 2
Heart | 4144 | 6 | Endothelial cell of coronary artery | 1126 | 465 | 7108 | 0.07 | 5
Large Intestine | 3729 | 5 | Enterocyte of epithelium | 1112 | 586 | 10,786 | 0.05 | 2
Marrow | 4783 | 10 | Hematopoietic stem cell | 1363 | 692 | 9909 | 0.07 | 2
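
The last two columns of Table 2 are simple derived quantities. The sketch below shows how they could be computed, assuming each test yields a per-gene p-value stored in a dictionary; the function names and the toy p-values are hypothetical and are not taken from the scQuint code base.

```python
def spl_to_exp_ratio(n_diff_spl, n_diff_exp):
    """Ratio of differentially spliced to differentially expressed genes."""
    return n_diff_spl / n_diff_exp


def top_k_overlap(de_pvalues, ds_pvalues, k=100):
    """Number of genes shared by the k lowest-p-value genes of each test.

    Both arguments map gene name -> p-value (assumed input format).
    """
    top_de = set(sorted(de_pvalues, key=de_pvalues.get)[:k])
    top_ds = set(sorted(ds_pvalues, key=ds_pvalues.get)[:k])
    return len(top_de & top_ds)


# Ratio column for the Cortex / L5 IT row, using the counts in Table 2.
ratio_l5_it = round(spl_to_exp_ratio(1447, 6402), 2)

# Toy example of the overlap computation with made-up p-values.
de = {"GeneA": 0.001, "GeneB": 0.01, "GeneC": 0.5}
ds = {"GeneB": 0.002, "GeneD": 0.003, "GeneA": 0.9}
shared = top_k_overlap(de, ds, k=2)
```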

We found many more cell-type-specific differential splicing events in the cortex than in the marrow, as expected (Yeo et al., 2004), as well as a higher proportion of events involving novel junctions, which can reach 30% (Figure 10a). Differences in proportion of novel junctions should be interpreted with care, however, since they can be affected by sequencing depth and number of cells, both of which vary between the two tissues. Very similar patterns are seen when grouping differential splicing events that occur in the same gene (Figure 10b). Most differential splicing events that we detected with alternative introns fall in the coding portion of the gene, with high proportions in the 5’ UTR (Figure 10c). This is a property of our quantification approach and does not reflect the total number of alternative splicing events in different gene regions; still, the relative proportion can be compared across tissues. We find an increased proportion of differentially spliced non-coding RNA in the cortex, the majority of which are previously unannotated events. To systematically evaluate how well cell types can be distinguished in the expression and splicing latent spaces, we calculated the ROC AUC score for the one-versus-all classification task for each cell type in each tissue using a binary logistic regression model (Figure 10d). Since cell type labels were defined using gene expression values, near-perfect classification is to be expected using the expression latent space. Classification based only on the splicing latent space is very good in general, suggesting that cell-type-specific differential splicing is rather pervasive. A few cell types were more challenging to classify correctly using splicing patterns alone. One such example is immature B cells, a reflection of the lower degree of separation observed in the embedding of Figure 8a.
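
The evaluation described above has two ingredients that are easy to get wrong: splitting by individual rather than by cell, and scoring a one-versus-rest task. The following sketch covers only those two pieces, in plain Python. It assumes that classifier scores (for example, predicted probabilities from a logistic regression fit on the training cells) are already available; the helper names and toy values are hypothetical.

```python
def split_by_individual(individuals, test_individuals):
    """Return (train_idx, test_idx) so that no individual appears in both sets."""
    held_out = set(test_individuals)
    train_idx = [i for i, ind in enumerate(individuals) if ind not in held_out]
    test_idx = [i for i, ind in enumerate(individuals) if ind in held_out]
    return train_idx, test_idx


def one_vs_rest_labels(cell_types, target):
    """Binary labels: 1 for the target cell type, 0 for every other type."""
    return [1 if ct == target else 0 for ct in cell_types]


def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative.

    Ties contribute one half, which matches the usual trapezoidal AUC.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative cell")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy usage: four test cells with scores from some previously fitted classifier.
labels = one_vs_rest_labels(["pro-B", "pre-B", "pro-B", "naive B"], "pro-B")
auc = roc_auc(labels, [0.9, 0.8, 0.2, 0.1])
```

Holding out whole individuals keeps animal-specific batch effects from inflating the score, which is why the split is done on individuals rather than on cells.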

Figure 10. Patterns across tissues.

(a) Number of differential splicing events detected in each cell type. Cortex cell types have more differential splicing events and larger proportions of novel events (those involving an intron absent from the reference). (b) Number of genes with a detected differential splicing event, for different cell types. (c) Number of differential splicing events in different gene regions aggregated over cell types (duplicate events removed). Cortex cell types have higher proportions of events in coding regions and non-coding RNAs. Note: y-axes are not on the same scale. (d) ROC AUC score for classification of each cell type versus the rest based on either the expression or splicing latent space, using logistic regression, training and testing in non-overlapping sets of individuals. The score for splicing-based classification is near-perfect in most cell types with some exceptions such as immature B cells in the marrow.

Finding splicing factors associated with specific alternative splicing events

Several splicing factors have been identified as regulators of specific alternative splicing events, but most regulatory interactions remain unknown (see Vuong et al., 2016 for a review focused on the brain). To complement expensive and laborious knockout experiments, we sought to generate regulatory hypotheses by analyzing the correlation between splicing outcomes and splicing factor variation across cell types. Focusing on a subset of highly expressed genes in BICCN primary motor cortex neurons, we fit a sparse linear model regressing PSI of skipped exons on both expression and splicing patterns of splicing factors (Figure 11a and Figure 11—figure supplement 1). Our model recovers several known regulatory interactions such as Khdrbs3/Slm2/T-Star’s repression of splice site 4 (SS4) in neurexins, modulating their binding with post-synaptic partners (Traunmüller et al., 2016). Additionally, the proportion of a novel alternative TSS (though annotated in the human reference) in Khdrbs3 (Figure 11b, Figure 11—figure supplement 2) is negatively associated with SS4 in Nrxn1 and Nrxn3. This novel isoform lacks the first 30 amino acids of the Qua1 homodimerization domain and could affect dimerization, which modulates RNA affinity (Feracci et al., 2016). The model also recovers the known regulation of a skipped exon in Camta1, a transcription factor required for long-term memory (Bas-Orth et al., 2016), by Rbfox1 (Pedrotti et al., 2015). The skipping of exon 5 (E5) of Grin1, which controls long-term synaptic potentiation and learning (Sengar et al., 2019), is known to be regulated by Mbnl2 and Rbfox1 (Vuong et al., 2016). 
The model associates Grin1 E5 PSI with the expression of Rbfox1 but not Mbnl2; however, it does suggest an association with the PSI of two skipped exons in Mbnl2 (Figure 11c, Figure 11—figure supplements 3 and 4) and further implicates the inclusion level of the novel alternative TSS in Rbfox1 reported above (Rbfox1_26172, chr16:5763912–6173605, Figure 6d). These results help clarify the disparate impacts of expression and alternative splicing in splicing factors, and encourage the use of regression models to suggest candidate regulators of cell-type-specific alternative splicing. Such computationally generated hypotheses are particularly valuable for splicing events in splicing factors because experimentally perturbing specific exons is more difficult than perturbing whole genes.

Figure 11. Associations between splicing factors and alternative splicing.

(a) Regression analysis of exon skipping based on expression and splicing of splicing factors, using the BICCN mouse primary motor cortex dataset. Left panel: mean PSI of skipped exons across cell types. Bottom panel: mean z-scores of selected splicing factor features across cell types, including whole-gene expression (gene name) and PSI of alternative introns (gene name and numerical identifier). Center panel: regression coefficients (log-odds) of each splicing factor feature used to predict skipped exon PSI in our sparse Dirichlet-Multinomial linear model. (b) Novel alternative TSS in Khdrbs3. (c) Annotated skipped exons in Mbnl2.

Figure 11—source data 1. Intron coordinates are available for panel (a).

Figure 11—figure supplement 1. Full plot of associations between splicing factors and alternative splicing.

Regression analysis of exon skipping based on expression and splicing of splicing factors, using the BICCN mouse primary motor cortex dataset. Left panel: mean PSI of skipped exons across cell types. Bottom panel: mean z-scores of selected splicing factor features across cell types, including whole-gene expression (gene name) and PSI of alternative introns (gene name and numerical identifier). Center panel: regression coefficients (log-odds) of each splicing factor feature used to predict skipped exon PSI in our sparse Dirichlet-Multinomial linear model.
Figure 11—figure supplement 2. PSI distribution of Khdrbs3_25689.

Only six individuals with highest number of cells are displayed. Marked N/A are cell types where the individuals have PSI defined in fewer than three cells. Per the experimental design of this dataset, the top three individuals have only Glutamatergic cell types sequenced, while the bottom three have only GABAergic.
Figure 11—figure supplement 3. PSI distribution of Mbnl2_25376.

Only six individuals with highest number of cells are displayed. Marked N/A are cell types where the individuals have PSI defined in fewer than three cells. Per the experimental design of this dataset, the top three individuals have only Glutamatergic cell types sequenced, while the bottom three have only GABAergic.
Figure 11—figure supplement 4. PSI distribution of Mbnl2_25378.

Only six individuals with highest number of cells are displayed. Marked N/A are cell types where the individuals have PSI defined in fewer than three cells. Per the experimental design of this dataset, the top three individuals have only Glutamatergic cell types sequenced, while the bottom three have only GABAergic.

Discussion

In this study, we introduce scQuint, a toolkit for the quantification, visualization, and statistical inference of alternative splicing in full-length scRNA-seq data without the need for annotations. This allows us to successfully extend the analysis of two single-cell atlases to the level of alternative splicing, overcoming the usual technical challenges as well as coverage artifacts and incomplete annotations. Our results, which we make available for public exploration via interactive browsers, indicate the presence of strong cell-type-specific alternative splicing and previously unannotated splicing events across a broad array of cell types. In most cases, splicing variation is able to differentiate cell types just as well as expression levels. We also note a striking lack of overlap between the most strongly differentially expressed and spliced genes (Table 2), suggesting that expression and splicing are complementary rather than integrated processes. Moreover, this complementarity may also manifest temporally, as we show in developing B cells in the marrow. Another outstanding question is the functional significance of isoforms, and we find that most differential splice sites appear in the coding sequence with a sizeable minority also mapping to 5’ UTRs. The apparent predilection for events to occur in these regions rather than 3’ UTRs poses questions about the role of splicing in protein synthesis from translational regulation to the formation of polypeptide chains. Answering these questions requires a more precise understanding of how variation in UTRs and coding sequences affects final protein output as well as the biophysical characteristics of protein isoforms and their roles in different biological systems. These factors, combined with the large fraction of unannotated events in several cell types, should encourage tissue specialists to more deeply consider the contribution of transcript variation to cell identity and cell and tissue homeostasis.

Despite the clear association between splicing and cell identity, our analyses have yet to produce instances in which clustering in the splicing latent space reveals new cell subpopulations not visible in the expression latent space. This, of course, does not preclude this possibility in other settings where alternative splicing is known to be important, such as in specific developmental transitions or disease conditions. Nevertheless, our current experience leads us to believe that gene expression and splicing proportions provide two different projections of the same underlying cell state. Incidentally, RNA Velocity (La Manno et al., 2018) estimates can be distorted by alternative splicing, and Bergen et al., 2020 discuss incorporating isoform proportions into the model as a future direction.

To support our understanding of cell-type-specific splicing, we implemented a regularized generalized linear regression model which exploits the natural variation of splicing factors in different cell types. We recovered a number of previously identified (via knockout experiments) regulatory interactions and propose novel regulatory interactions involving genes known to play important regulatory roles. A key component of our analysis is the decision to include both the expression and alternative splicing patterns of splicing factors as features in the model. Consequently, we infer that several alternative splicing events in splicing factors themselves (some previously unannotated) contribute to their regulatory activity. Our model thus provides several opportunities for follow-up and does so with an increased granularity that distinguishes between effects due to expression and splicing differences. To facilitate further exploration of these data, we have uploaded our results to cell and genome browsers (linked at https://github.com/songlab-cal/scquint-analysis; Benegas, 2021a; copy archived at swh:1:rev:97dc31babf2a585666af4a38b1e4aa59a92bbf87).

Our experience analyzing these large datasets, initially with prior methods and then scQuint, has led to a series of general observations regarding the analysis of splicing in scRNA-seq data. As most analyses use full-length short-read protocols because of the cost of long-read data and the necessary focus on the 3’ end of transcripts in most UMI-based techniques, we restrict our attention to the full-length short-read setting and its incumbent challenges. For example, low transcript capture efficiency introduces additional technical noise into isoform quantification (Arzalluz-Luque and Conesa, 2018; Westoby et al., 2020; Buen Abad Najar et al., 2020), and incomplete transcriptome annotations result in discarded reads and reduced sensitivity to cross-cell differences (Westoby et al., 2020). Nonetheless, we considered several methods (summarized in Appendix 1—table 1) to analyze transcript variation in short-read, full-length scRNA-seq. We found each of the classes of current methods to be problematic in the context of our datasets for varying reasons. Methods which depend on transcript annotations (Bray et al., 2016; Qiu et al., 2017; Huang and Sanguinetti, 2017; Hu et al., 2020; Yan et al., 2015; Wen et al., 2020; Liu et al., 2021; Huang and Sanguinetti, 2021; Tekath and Dugas, 2021) cannot easily identify unannotated alternative splicing events. In large collections of previously unsurveyed cell types, these may comprise a sizable fraction of events. Indeed, we found up to 30% of differential splicing events were unannotated in certain cell types. Annotation-free approaches are also available, but they either do not provide a formal statistical test for differential transcript usage across conditions (Song et al., 2017; Ling et al., 2020; Nip et al., 2020; Welch et al., 2016), or only do so in a specialized manner (Matsumoto et al., 2020), reducing their potential impacts. 
Finally, methods’ different approaches to quantification are affected by coverage biases to varying degrees. Some methods may thus lead to erroneous inference of cell clusters due to technical rather than biological variation. Until the prevalence and severity of coverage biases are better understood, we advocate quantifying transcript variation in a robust manner.

Recent and future experimental advances will catalyze the study of isoform variation in single cells. For instance, Smart-seq3 (Hagemann-Jensen et al., 2020) allows sequencing of short reads from the entire length of a gene together with unique molecular identifiers, improving mRNA capture and allowing for the filtering of PCR duplicates; however, experiments show that less than 40% of reads can be unambiguously assigned to a single (annotated) isoform. Ultimately, long-read scRNA-seq will provide the definitive picture of isoform variation between cells. Until then, there is much biology to be studied using short-read protocols, and variation at the transcript level should not be disregarded.

Materials and methods

Datasets

Tabula Muris data (Schaum et al., 2018) have accession code GSE109774. Cells were filtered to those from 3-month-old mice present in this collection: https://czb-tabula-muris-senis.s3-us-west-2.amazonaws.com/Data-objects/tabula-muris-senis-facs-processed-official-annotations.h5ad (filtering details in Tabula Muris Consortium, 2020). BICCN Cortex data (Yao et al., 2021) were downloaded from https://assets.nemoarchive.org/dat-ch1nqb7 and filtered as in Booeshaghi et al., 2021.

Simulation

A preliminary set of exon skipping events was obtained by running briekit-event from the BRIE2 software package. For each event, one pair of transcripts was selected if they only differed on the skipped exon, resulting in 561 pairs, each from a different gene. Reads were simulated using Polyester (Frazee et al., 2015), which allows us to control overdispersion and induce different kinds of biases. For roughly half of the genes, differential transcript usage (DTU) was induced by overexpressing one transcript 1.5-fold in one of the two conditions. The number of reads was generated using a highly overdispersed negative binomial distribution with variance equal to eight times the mean. To simulate coverage decay in one of the conditions, the option bias="cdnaf" was added. To ensure coverage decays as a function of absolute distance to the 3’ end of the transcript, reads were generated no farther away from the 3’ end than the minimum of the lengths of the two alternative transcripts. The Area Under the Receiver Operating Characteristic Curve (ROC AUC) for classifying genes into DTU vs. non-DTU was computed using the p-values from each method, excluding genes that were not tested by a given method (e.g., because of a minimum reads threshold).
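
To make the simulation settings concrete, the sketch below works out what "variance equal to eight times the mean" implies for the parameters of a negative binomial, and how a 1.5-fold overexpression of one transcript changes the isoform proportions of a two-transcript gene. The parameterization (number of failures before r successes, success probability p) and the function names are assumptions for illustration; the simulation itself was run with Polyester in R, whose interface is not reproduced here.

```python
def nb_size_and_prob(mean, variance_ratio=8.0):
    """Negative binomial (size r, success probability p) with Var = variance_ratio * mean.

    Uses mean = r(1 - p)/p and variance = r(1 - p)/p^2, so variance/mean = 1/p.
    """
    if mean <= 0 or variance_ratio <= 1:
        raise ValueError("need mean > 0 and variance_ratio > 1")
    p = 1.0 / variance_ratio
    r = mean / (variance_ratio - 1.0)
    return r, p


def induce_dtu(proportions, index, fold=1.5):
    """Scale the transcript at `index` by `fold` and renormalize the proportions."""
    scaled = [q * fold if i == index else q for i, q in enumerate(proportions)]
    total = sum(scaled)
    return [s / total for s in scaled]


# A gene with an expected 70 reads: r = 10, p = 0.125, variance = 560.
r, p = nb_size_and_prob(70.0)
variance = r * (1 - p) / p ** 2

# Two equally used transcripts become a 60/40 split after 1.5-fold overexpression.
shifted = induce_dtu([0.5, 0.5], index=0)
```

A 1.5-fold change is therefore a fairly subtle shift in PSI (from 0.5 to 0.6 in this toy case), which is part of what makes the benchmark demanding under heavy overdispersion.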

Quantification

The bioinformatic pipeline was implemented using Snakemake (Köster and Rahmann, 2012). Raw reads were trimmed from Smart-seq2 adapters using Cutadapt (Martin, 2011) before mapping to the GRCm38/mm10 genome reference (https://hgdownload.soe.ucsc.edu/goldenPath/mm10/chromosomes/) and the transcriptome reference from Ensembl release 101 (ftp://ftp.ensembl.org/pub/release-101/gtf/mus_musculus/Mus_musculus.GRCm38.101.gtf.gz). Alignment was done using STAR (Dobin et al., 2013) in two-pass mode allowing novel junctions as long as they were supported by reads with at least 20 base pair overhang (30 if they are non-canonical) in at least 30 cells. Also, multimapping and duplicate reads were discarded using the flag --bamRemoveDuplicatesType UniqueIdentical (while this can remove duplicates from the second PCR step of Smart-seq, it will not remove duplicates from the first PCR step). Soft-clipped reads were removed as well. Additionally, reads were discarded if they belonged to the ENCODE region blacklist (Amemiya et al., 2019) (downloaded from https://github.com/Boyle-Lab/Blacklist/raw/master/lists/mm10-blacklist.v2.bed.gz).

Gene expression was quantified using featureCounts (Liao et al., 2014), and total-count normalized such that each cell had 10,000 reads (as in the Scanpy (Wolf et al., 2018) tutorial). Intron usage was quantified using split reads with an overhang of at least six base pairs. Introns were discarded if observed in fewer than 30 cells in BICCN Cortex or 100 cells in Tabula Muris. Introns were grouped into alternative intron groups based on shared 3’ splice acceptor sites. Introns not belonging to any alternative intron group were discarded. Additionally, we decided to subset our analysis to introns with at least one of their donor or acceptor sites annotated, so we could assign a gene to each intron and facilitate interpretation for our specific analyses.
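
The filtering and grouping steps in this paragraph can be sketched as follows. Introns are represented as hypothetical `(chrom, start, end, strand)` tuples and `cell_counts` maps each intron to the number of cells in which it was observed; both are assumed data layouts rather than the scQuint data structures. The 3’ splice acceptor site is taken to be the intron end on the plus strand and the intron start on the minus strand.

```python
from collections import defaultdict


def acceptor_site(chrom, start, end, strand):
    """Genomic coordinate of the 3' splice (acceptor) site of an intron."""
    return (chrom, end if strand == "+" else start, strand)


def group_introns(introns, cell_counts, min_cells):
    """Group introns by shared acceptor site.

    Introns seen in fewer than `min_cells` cells are dropped first, then
    introns left without any alternative (singleton groups) are discarded.
    """
    groups = defaultdict(list)
    for intron in introns:
        if cell_counts.get(intron, 0) < min_cells:
            continue
        groups[acceptor_site(*intron)].append(intron)
    return {
        site: sorted(members)
        for site, members in groups.items()
        if len(members) >= 2
    }


# Toy usage with made-up coordinates and the Tabula Muris threshold of 100 cells.
introns = [
    ("chr1", 100, 500, "+"),
    ("chr1", 200, 500, "+"),
    ("chr1", 50, 500, "+"),   # too rare, filtered out
    ("chr1", 600, 900, "+"),  # no alternative intron, discarded
    ("chr2", 100, 300, "-"),
    ("chr2", 100, 400, "-"),
]
cell_counts = dict(zip(introns, [120, 150, 10, 500, 200, 101]))
groups = group_introns(introns, cell_counts, min_cells=100)
```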

Dimensionality reduction

To run PCA, we worked with alternative intron proportions (PSI, Percent Spliced In) rather than their absolute counts, as the latter would be confounded by gene expression differences. We first introduce some notation:

  • $c$: cell identifier

  • $g$: intron group identifier

  • $y_g^{(c)}$: vector of counts of introns in intron group $g$ and cell $c$

  • $\mathrm{normalize}(x) = x / \mathrm{sum}(x)$: function that divides each entry of a vector by the total sum.

Then, PSI can be defined as:

$$\mathrm{PSI}_g^{(c)} = \mathrm{normalize}\left(y_g^{(c)}\right)$$

However, given the sparsity of single-cell data, a very high proportion of alternative intron groups will have no reads in a given cell, leaving PSI undefined. More generally, an intron group may contain few reads, resulting in defined but noisy PSI estimates. To navigate this issue, we introduce a form of empirical shrinkage towards a central value. We first define the ‘global PSI’ by aggregating reads from all cells and normalizing. Then, we add this global PSI as a pseudocount vector to each cell before re-normalizing to obtain each cell’s shrunken PSI profile (these are non-uniform pseudocounts adding up to one).

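$$\mathrm{PSI}_g^{(\mathrm{global})} = \mathrm{normalize}\left(\sum_c y_g^{(c)}\right)$$
$$\mathrm{SMOOTHED\_PSI}_g^{(c)} = \mathrm{normalize}\left(y_g^{(c)} + \mathrm{PSI}_g^{(\mathrm{global})}\right)$$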

We then run standard PCA on the cell-by-intron-smoothed PSI matrix.
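
The smoothing described above can be written in a few lines for a single intron group. This is a minimal plain-Python sketch under the assumption that counts are stored as one list per cell; the function names are illustrative and do not correspond to the scQuint API.

```python
def normalize(x):
    """Divide each entry of a vector by the total sum."""
    total = sum(x)
    if total == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [v / total for v in x]


def global_psi(counts_per_cell):
    """PSI obtained by pooling the reads of one intron group over all cells."""
    pooled = [sum(column) for column in zip(*counts_per_cell)]
    return normalize(pooled)


def smoothed_psi(counts_per_cell):
    """Add the global PSI as a pseudocount vector to each cell, then renormalize."""
    pseudo = global_psi(counts_per_cell)
    return [
        normalize([y + q for y, q in zip(cell, pseudo)])
        for cell in counts_per_cell
    ]


# One intron group with two introns, observed in three cells.
counts = [[3, 1], [0, 0], [0, 4]]
psi = smoothed_psi(counts)
```

The second cell has no reads, so its raw PSI is undefined; after smoothing it falls back to the global PSI, while cells with more reads stay close to their own proportions.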

The VAE was implemented using PyTorch (Paszke et al., 2019) and scvi-tools (Gayoso et al., 2021). The following is the generative model, repeated for each cell (we drop the superscript indexing the cell in z, p, y and n):

  1. Sample the latent cell state: $z \sim \mathrm{Normal}(0, I)$

  2. For each intron group $g$:

    1. Obtain the underlying intron proportions: $p_g = \mathrm{softmax}(f_g(z))$

    2. Sample the intron counts conditional on the total observed count $n_g$: $y_g \mid n_g \sim \mathrm{DirichletMultinomial}(n_g, \alpha_g p_g)$

Here $f_g$, known as the decoder, can be any differentiable function, including linear mappings and neural networks. $\alpha_g$ is a scalar controlling the amount of dispersion. We optimize a variational posterior on cell latent variables $q(z \mid y)$ (Gaussian with diagonal covariance, given by an encoder neural network) as well as point estimates of the global parameters $f_g$ and $\alpha_g$. The encoder takes as input the smoothed PSI values, as in PCA, but the likelihood is based on the raw intron counts. The objective to maximize is the evidence lower bound (ELBO), consisting of a reconstruction term and a regularization term:

$$\mathrm{ELBO}(y) = \mathbb{E}_{z \sim q(z \mid y)}\left[\log p(y \mid z)\right] - \mathrm{KL}\left(q(z \mid y) \,\|\, p(z)\right),$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the Kullback–Leibler divergence. Optimization is performed using Adam (Kingma and Ba, 2015), a stochastic gradient descent method. To avoid overfitting when there are relatively few cells with respect to the number of features, we considered a linear decoder (Svensson et al., 2020), as well as a $\mathrm{Normal}(0, \sigma)$ prior on the entries of the decoder matrix. Hyperparameters were tuned using reconstruction error on held-out data and are described in Table 3.
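For a Gaussian variational posterior with diagonal covariance and a standard normal prior, the regularization term of the ELBO has a well-known closed form. The sketch below shows it, together with a single-sample ELBO estimate. The function names and the log-variance parameterization are assumptions for the example and are not taken from the scQuint code.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( Normal(mu, diag(exp(log_var))) || Normal(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def elbo_estimate(log_likelihood_at_sample, mu, log_var):
    """Single-sample ELBO estimate: reconstruction term minus KL term.

    log_likelihood_at_sample: log p(y | z) evaluated at one z drawn from q(z | y),
    for instance a sum of Dirichlet-Multinomial log-probabilities over intron groups.
    """
    return log_likelihood_at_sample - kl_to_standard_normal(mu, log_var)
```

The KL term is zero exactly when the posterior equals the prior, so it penalizes latent codes that drift from Normal(0, I) without support from the data.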

Table 3. VAE hyperparameters.

Dataset Decoder Layers σ Latent dimension
BICCN Cortex Linear 1 26.8 18
Tabula Muris Non-linear 2 - 34

Differential splicing test

Our differential splicing test across conditions (such as cell types) is based on a modified version of the Dirichlet-Multinomial Generalized Linear Model proposed in LeafCutter (Li et al., 2018) for bulk RNA-seq. For each intron group g with L alternative introns:

  • $y_g$ is a vector of counts for each of the $L$ introns;

  • The independent variable, $x$, equals 0 in one condition and 1 in the other;

  • $a_g, b_g \in \mathbb{R}^{L-1}$ are the intercept and coefficients of the linear model;

  • $\alpha_g$ is a dispersion parameter shared across conditions; and

  • the function $\mathrm{softmax}: (z_1, \ldots, z_{L-1}) \mapsto \left(\frac{e^{z_1}}{1+\sum_{i=1}^{L-1} e^{z_i}}, \ldots, \frac{e^{z_{L-1}}}{1+\sum_{i=1}^{L-1} e^{z_i}}, \frac{1}{1+\sum_{i=1}^{L-1} e^{z_i}}\right)$ maps from $\mathbb{R}^{L-1}$ to the $(L-1)$-dimensional probability simplex.

The Dirichlet-Multinomial Generalized Linear Model then proceeds as follows:

  1. Obtain the underlying intron proportions: $p_g = \mathrm{softmax}(a_g + b_g x)$

  2. Sample the intron counts conditioned on the total observed, $n_g$: $y_g \mid n_g \sim \mathrm{DirichletMultinomial}(n_g, \alpha_g p_g)$

We implemented this model in PyTorch and optimized it using L-BFGS (Liu and Nocedal, 1989).
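The likelihood of this model for a single observation can be written with the standard library alone. The sketch below is an illustration under assumptions: the function names are hypothetical, plain Python replaces PyTorch, and no optimizer is included. It uses the softmax with an implicit zero logit for the last intron, as defined above, and the standard Dirichlet-Multinomial probability mass function.

```python
import math


def softmax_with_reference(z):
    """Map L-1 logits to L proportions; the last category has logit 0."""
    m = max(max(z), 0.0)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z] + [math.exp(-m)]
    total = sum(exps)
    return [e / total for e in exps]


def dirichlet_multinomial_logpmf(y, alpha):
    """Log-probability of counts y under DirichletMultinomial(sum(y), alpha)."""
    n, a = sum(y), sum(alpha)
    out = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for yk, ak in zip(y, alpha):
        out += math.lgamma(yk + ak) - math.lgamma(yk + 1) - math.lgamma(ak)
    return out


def glm_loglik(y, x, a, b, alpha_g):
    """Log-likelihood of one count vector y under the GLM with covariate x."""
    p = softmax_with_reference([ai + bi * x for ai, bi in zip(a, b)])
    return dirichlet_multinomial_logpmf(y, [alpha_g * pk for pk in p])
```

Summing `glm_loglik` over cells gives the objective that an optimizer such as L-BFGS would maximize over `a`, `b`, and `alpha_g`.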

To test for differential splicing across the two conditions, we compare the following two hypotheses:

  • Null hypothesis $H_0: b_g = 0$

  • Alternative hypothesis $H_1: b_g \neq 0$

We use the likelihood-ratio test, whose test statistic is asymptotically distributed as a $\chi^2$ random variable with $L-1$ degrees of freedom under $H_0$. Finally, we correct p-values for multiple testing using the Benjamini-Hochberg FDR procedure (Benjamini and Hochberg, 1995).
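The testing step can be sketched as below, assuming the maximized log-likelihoods under $H_0$ and $H_1$ are already available. All function names are hypothetical. The chi-squared survival function is computed for integer degrees of freedom with the standard recurrence for the regularized upper incomplete gamma function, so the example needs no third-party packages; in practice a statistics library would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """P(X > x) for a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    t = x / 2.0
    # Start from Q(1, t) = exp(-t) for even df, or Q(1/2, t) = erfc(sqrt(t)) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))
    # Recurrence: Q(a + 1, t) = Q(a, t) + t^a * exp(-t) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(t) - t - math.lgamma(a + 1.0))
        a += 1.0
    return q


def lrt_pvalue(loglik_null, loglik_alt, n_introns):
    """Likelihood-ratio test p-value with L - 1 degrees of freedom."""
    statistic = max(0.0, 2.0 * (loglik_alt - loglik_null))
    return chi2_sf(statistic, n_introns - 1)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An intron group would then be called significant at a 5% FDR when its adjusted p-value falls below 0.05.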

The differences with LeafCutter are the following:

  • LeafCutter groups introns that share a 5’ donor or 3’ acceptor site while scQuint groups introns that share a 3’ acceptor site.

  • LeafCutter has a vector of concentration parameters, one for each intron, while scQuint uses a single concentration parameter per intron group.

  • The LeafCutter and scQuint optimization procedures were implemented separately and differ in initialization strategies as well as L-BFGS hyperparameters.

Latent space analysis

The expression latent space was obtained by running PCA with 40 components on log-transformed and normalized gene expression values. The splicing latent space was obtained by running the VAE on the alternative intron count matrix (or equivalent features, e.g., Kallisto transcript counts, DEXSeq exon counts). Both latent spaces were visualized using UMAP (McInnes et al., 2018). In the comparison of Figure 1, we used our own implementation of the quantifications proposed by ODEGR-NMF, DEXSeq, and DESJ for ease of application to large single-cell datasets.

Dendrograms were constructed using hierarchical clustering (R function hclust) based on the Euclidean distance between the median latent space embeddings of the cells of each type. Tanglegram and entanglement were calculated using the dendextend R package, with the step2side method, as also described in Schaum et al., 2018.

Reported scores for cell type classification within a tissue were obtained by running a binary logistic regression classifier over different splits of cells into train and test sets. To assess generalization across individuals, we ensured the same individual was not present in both train and test sets.
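One way to keep individuals from appearing in both sets is to split at the level of individuals rather than cells. The sketch below illustrates that idea only. The function name, the test fraction, and the seeding are assumptions, and the exact splitting scheme used for the reported scores is not specified in the text.

```python
import random


def split_by_individual(cell_individuals, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with no individual in both sets.

    cell_individuals: list giving the individual (e.g. mouse ID) of each cell.
    test_fraction: assumed fraction of individuals held out for testing.
    """
    individuals = sorted(set(cell_individuals))
    rng = random.Random(seed)
    rng.shuffle(individuals)
    n_test = max(1, round(len(individuals) * test_fraction))
    test_ids = set(individuals[:n_test])
    train = [i for i, ind in enumerate(cell_individuals) if ind not in test_ids]
    test = [i for i, ind in enumerate(cell_individuals) if ind in test_ids]
    return train, test
```

Repeating the split with different seeds gives several train and test partitions over which classification scores can be averaged.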

Cell-type-specific differential splicing

For differential splicing testing between a given cell type and the rest of the tissue, we only considered introns expressed in at least 50 cells and intron groups with at least 50 cells from both of the conditions. We called an intron group ‘differentially spliced’ if it was both statistically significant using a 5% FDR and if it contained an intron with a PSI change greater than 0.05. We considered a differentially spliced intron group as unannotated if it contained an unannotated intron with a PSI change greater than 0.05. Differential expression was performed using the Mann-Whitney test. A gene was considered differentially expressed if it was statistically significant using a 5% FDR and if the fold change was at least 1.5.
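The calling rules in this paragraph amount to two simple predicates, sketched below. The function names are hypothetical, and the use of a strict inequality for the FDR threshold is an assumption.

```python
def is_differentially_spliced(fdr, psi_a, psi_b, fdr_threshold=0.05, min_delta_psi=0.05):
    """Significant at the FDR threshold and at least one intron with a large PSI change.

    psi_a, psi_b: PSI vectors of the intron group in the two conditions.
    """
    max_delta = max(abs(a - b) for a, b in zip(psi_a, psi_b))
    return fdr < fdr_threshold and max_delta > min_delta_psi


def is_unannotated_event(psi_a, psi_b, annotated, min_delta_psi=0.05):
    """True if some unannotated intron in the group has a PSI change above the cutoff.

    annotated: list of booleans, one per intron in the group.
    """
    return any(
        (not ann) and abs(a - b) > min_delta_psi
        for a, b, ann in zip(psi_a, psi_b, annotated)
    )
```

Requiring an effect-size cutoff in addition to significance guards against calling very small PSI shifts that reach significance only because of large cell numbers.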

For selection of marker genes or introns, we proceeded in a semi-automated fashion. For each cell type, we first filtered to keep only significant genes or introns and then ranked them by effect size. We picked a certain number of genes or introns from the top of this list for each cell type, while ensuring there were no repetitions.

Splicing factor regression analysis

We obtained 75 mouse splicing factors using the Gene Ontology term ‘alternative mRNA splicing, via spliceosome’ (http://amigo.geneontology.org/amigo/term/GO:0000380). A skipped exon annotation, processed by BRIE (Huang and Sanguinetti, 2017), was downloaded from https://sourceforge.net/projects/brie-rna/files/annotation/mouse/gencode.vM12/SE.most.gff3/download. Instead of using single cells as replicates, we partitioned the BICCN primary motor cortex dataset into roughly 200 clusters of 30 cells each that were pooled to create pseudobulks, aiming to reduce variance in the expression and splicing of splicing factors used as covariates in the model. We filtered target exon skipping events to those defined in at least 95% of the replicates, and those having a PSI standard deviation of at least 0.2. We used log-transformed normalized expression and PSI of alternative splicing events as input features. We chose to keep the PSI of only one intron per intron group to avoid the presence of highly correlated features and improve clarity, even if some information from non-binary events is lost. Input features were filtered to those having standard deviation of at least 0.05, and then standardized. A lasso Dirichlet-Multinomial GLM was fit to the data (in this instance, the model reduces to a Beta-Binomial because skipped exons are binary events), with the sparsity penalty selected via cross-validation. As a first approach, we fit a regular lasso linear regression model on PSI instead of raw counts, resulting in roughly similar patterns in the coefficients. Figure 11c shows the coefficients of the lasso Dirichlet-Multinomial model for the top 30 targets with the highest variance explained by the regular lasso model, all above 68%.
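The target filtering step (defined in at least 95% of pseudobulk replicates, PSI standard deviation of at least 0.2) can be sketched as follows. The function name, the use of `None` for undefined PSI, and the choice of the sample standard deviation are assumptions made for this example.

```python
import statistics


def keep_target(psi_values, min_defined_fraction=0.95, min_sd=0.2):
    """Decide whether an exon skipping event is kept as a regression target.

    psi_values: PSI of the event in each pseudobulk replicate, None if undefined.
    """
    defined = [v for v in psi_values if v is not None]
    # The event must be defined in a large enough fraction of replicates.
    if len(defined) < min_defined_fraction * len(psi_values) or len(defined) < 2:
        return False
    # The event must also vary enough across replicates to be worth modeling.
    return statistics.stdev(defined) >= min_sd
```

The same pattern, with a threshold of 0.05, would apply to the filtering of input features described above.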

Code and data availability

scQuint is implemented in Python and is available at https://github.com/songlab-cal/scquint (Benegas, 2021b; copy archived at swh:1:rev:a9db6454e13d42af25f47deee19e201e74d2bdd0). Differential splicing results and access to cell and genome browsers, together with the code to reproduce our results, are available at https://github.com/songlab-cal/scquint-analysis (Benegas, 2021c; copy archived at swh:1:rev:97dc31babf2a585666af4a38b1e4aa59a92bbf87). Processed alternative intron count matrices are provided in the AnnData format (anndata.readthedocs.io) for easy manipulation with Scanpy (Wolf et al., 2018), Seurat (Stuart et al., 2019), and other tools.

Acknowledgements

We would like to thank Angela Oliveira Pisco, Spyros Darmanis, and Kif Liakath-Ali for helpful discussions. We also thank the Chan Zuckerberg Biohub for hosting our cell×gene sessions and Aaron McGeever for assistance. This research is supported in part by grant number R35-GM134922 from NIH and grant number CZF2019-002449 from the Chan Zuckerberg Initiative Foundation. YSS is a Chan Zuckerberg Biohub Investigator.

Appendix 1

Overview of available methods for alternative splicing analysis in full-length scRNA-seq data

Due to experimental considerations, the analysis of transcript variation in 10x Chromium data is mostly restricted to the 3’ end of genes; in contrast, Smart-seq2 and other full-length, short-read protocols theoretically enable characterization of transcript variation along the whole gene. Nevertheless, numerous challenges impede such analyses in practice. For example, low transcript capture efficiency introduces additional technical noise into transcript quantification (Arzalluz-Luque and Conesa, 2018; Westoby et al., 2020; Buen Abad Najar et al., 2020), and incomplete transcriptome annotations result in discarded reads and reduced sensitivity to cross-cell differences (Westoby et al., 2020). Some authors have even recommended avoiding the analysis of alternative splicing in single-cell RNA sequencing (scRNA-seq) data until such obstacles can be suitably overcome (Westoby et al., 2020). Despite these difficulties, several methods (summarized in Appendix 1—table 1) have sought to analyze transcript variation in short-read, full-length scRNA-seq. Many methods, including kallisto (Bray et al., 2016), Census (Qiu et al., 2017), BRIE (Huang and Sanguinetti, 2017), SCATS (Hu et al., 2020), Quantas (Yan et al., 2015), VALERIE (meant only for visualization) (Wen et al., 2020), DESJ (Liu et al., 2021), BRIE2 (Huang and Sanguinetti, 2021) and DTUrtle (Tekath and Dugas, 2021), depend on transcript annotations and consequently cannot easily identify unannotated alternative splicing events, which may comprise a sizable fraction of events. Currently available annotation-free methods, such as ODEGR-NMF (Matsumoto et al., 2020), Expedition (Song et al., 2017), ASCOT (Ling et al., 2020), SingleSplice (Welch et al., 2016) and RNA-Bloom (Nip et al., 2020), do not provide a statistical test for differential transcript usage across conditions. Appendix 1—table 1 summarizes this information and makes the comparison of different methods easier.

Appendix 1—table 1. Summary of methods available to analyze transcript variation in short-read full-length scRNA-seq.

Annotation-free: Does quantification require an accurate transcriptome reference? Differential transcript usage: Does the method provide a two-sample test for differences in transcript proportions? Some methods, denoted by (*), provide other statistical tests. Quantas requires cells to be aggregated into known subgroups of each group and therefore does not perform a test at the single-cell level. SingleSplice tests for alternative splicing within a single population. kallisto and ODEGR-NMF test for differential transcript expression, i.e., changes in absolute transcript expression rather than their proportions. Census tests for differential transcript usage along a pseudotime trajectory.

Method Annotation-free Differential transcript usage
Quantas [80] *
SingleSplice [76] *
kallisto [10] *
Census [56] *
BRIE [27]
Expedition [60]
ODEGR-NMF [46] *
SCATS [26]
RNA-Bloom [49]
ASCOT [41]
DESJ [43]
BRIE2 [28]
DTUrtle [65]
scQuint

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Yun S Song, Email: yss@berkeley.edu.

Eduardo Eyras, Australian National University, Australia.

James L Manley, Columbia University, United States.

Funding Information

This paper was supported by the following grants:

  • National Institutes of Health R35-GM134922 to Gonzalo Benegas, Yun S Song.

  • Chan Zuckerberg Initiative CZF2019-002449 to Gonzalo Benegas, Yun S Song.

Additional information

Competing interests

No competing interests declared.

Author contributions

Conceptualization, Investigation, Methodology, Software, Visualization, Writing - original draft.

Conceptualization, Investigation, Supervision, Writing - original draft.

Conceptualization, Funding acquisition, Investigation, Supervision, Writing – review and editing.

Additional files

Transparent reporting form

Data availability

All data analyzed in this study are publicly available and URL links are provided in the Materials and methods section of our manuscript. Our source code as well as all results represented in figures and tables are publicly available on our lab's GitHub repositories: https://github.com/songlab-cal/scquint (copy archived at swh:1:rev:a9db6454e13d42af25f47deee19e201e74d2bdd0) and https://github.com/songlab-cal/scquint-analysis (copy archived at swh:1:rev:97dc31babf2a585666af4a38b1e4aa59a92bbf87).

The following previously published datasets were used:

Schaum et al 2018. Tabula Muris. NCBI Gene Expression Omnibus. GSE109774

Yao et al 2021. BRAIN Initiative Cell Census Network Cortex. nemoarchive. dat-ch1nqb7

References

  1. Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: identification of problematic regions of the genome. Scientific Reports. 2019;9:1–5. doi: 10.1038/s41598-019-45839-z.
  2. Anders S, Reyes A, Huber W. Detecting differential usage of exons from RNA-seq data. Genome Research. 2012;22:2008–2017. doi: 10.1101/gr.133744.111.
  3. Arzalluz-Luque Á, Conesa A. Single-cell RNAseq for the study of isoforms-how is that possible? Genome Biology. 2018;19:1496. doi: 10.1186/s13059-018-1496-z.
  4. Asipu A, Hayward BE, O’Reilly J, Bonthron DT. Properties of normal and mutant recombinant human ketohexokinases and implications for the pathogenesis of essential fructosuria. Diabetes. 2003;52:2426–2432. doi: 10.2337/diabetes.52.9.2426.
  5. Bas-Orth C, Tan YW, Oliveira AMM, Bengtson CP, Bading H. The calmodulin-binding transcription activator CAMTA1 is required for long-term memory formation in mice. Learning & Memory (Cold Spring Harbor, N.Y.) 2016;23:313–321. doi: 10.1101/lm.041111.115.
  6. Benegas G. Software Heritage; 2021a. https://archive.softwareheritage.org/swh:1:dir:76b0f21fb892d9f0aee2da2dc71d887513443966;origin=https://github.com/songlab-cal/scquint-analysis;visit=swh:1:snp:c040f4511378d09901aa51c8e0d2bd3a2e5a72f7;anchor=swh:1:rev:97dc31babf2a585666af4a38b1e4aa59a92bbf87
  7. Benegas G. Software Heritage; 2021b. https://archive.softwareheritage.org/swh:1:dir:e94c3f55c512acb3216e2e83dbc35cae75aa8ddb;origin=https://github.com/songlab-cal/scquint;visit=swh:1:snp:4680be565cc000bc06e30fb468919a9f1de9180a;anchor=swh:1:rev:a9db6454e13d42af25f47deee19e201e74d2bdd0
  8. Benegas G. Software Heritage; 2021c. https://archive.softwareheritage.org/swh:1:dir:76b0f21fb892d9f0aee2da2dc71d887513443966;origin=https://github.com/songlab-cal/scquint-analysis;visit=swh:1:snp:c040f4511378d09901aa51c8e0d2bd3a2e5a72f7;anchor=swh:1:rev:97dc31babf2a585666af4a38b1e4aa59a92bbf87
  9. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. 1995;57:289–300. doi: 10.1111/j.2517-6161.1995.tb02031.x.
  10. Bergen V, Lange M, Peidli S, Wolf FA, Theis FJ. Generalizing RNA velocity to transient cell states through dynamical modeling. Nature Biotechnology. 2020;38:1408–1414. doi: 10.1038/s41587-020-0591-3.
  11. Booeshaghi AS, Yao Z, van Velthoven C, Smith K, Tasic B, Zeng H, Pachter L. Isoform cell-type specificity in the mouse primary motor cortex. Nature. 2021;598:195–199. doi: 10.1038/s41586-021-03969-3.
  12. Bossen C, Murre CS, Chang AN, Mansson R, Rodewald HR, Murre C. The chromatin remodeler Brg1 activates enhancer repertoires to establish B cell identity and modulate cell growth. Nature Immunology. 2015;16:775–784. doi: 10.1038/ni.3170.
  13. Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology. 2016;34:525–527. doi: 10.1038/nbt.3519.
  14. Buen Abad Najar CF, Yosef N, Lareau LF. Coverage-dependent bias creates the appearance of binary splicing in single cells. eLife. 2020;9:e54603. doi: 10.7554/eLife.54603.
  15. Byrne A, Beaudin AE, Olsen HE, Jain M, Cole C, Palmer T, DuBois RM, Forsberg EC, Akeson M, Vollmers C. Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nature Communications. 2017;8:1–11. doi: 10.1038/ncomms16027.
  16. Cornall RJ, Cheng AM, Pawson T, Goodnow CC. Role of Syk in B-cell development and antigen-receptor signaling. PNAS. 2000;97:1713–1718. doi: 10.1073/pnas.97.4.1713.
  17. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England) 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
  18. Feng H, Moakley DF, Chen S, McKenzie MG, Menon V, Zhang C. Complexity and graded regulation of neuronal cell-type-specific alternative splicing revealed by single-cell RNA sequencing. PNAS. 2021;118:e2013056118. doi: 10.1073/pnas.2013056118.
  19. Feracci M, Foot JN, Grellscheid SN, Danilenko M, Stehle R, Gonchar O, Kang H-S, Dalgliesh C, Meyer NH, Liu Y, Lahat A, Sattler M, Eperon IC, Elliott DJ, Dominguez C. Structural basis of RNA recognition and dimerization by the STAR proteins T-STAR and Sam68. Nature Communications. 2016;7:1–12. doi: 10.1038/ncomms10355.
  20. Frazee AC, Jaffe AE, Langmead B, Leek JT. Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics (Oxford, England) 2015;31:2778–2784. doi: 10.1093/bioinformatics/btv272.
  21. Fuccillo MV, Földy C, Gökce Ö, Rothwell PE, Sun GL, Malenka RC, Südhof TC. Single-Cell mRNA profiling reveals cell-type-specific expression of neurexin isoforms. Neuron. 2015;87:326–340. doi: 10.1016/j.neuron.2015.06.028.
  22. García-Serna AM, Alcaraz-García MJ, Ruiz-Lafuente N, Sebastián-Ruiz S, Martínez CM, Moya-Quiles MR, Minguela A, García-Alonso AM, Martín-Orozco E, Parrado A. Dock10 regulates CD23 expression and sustains B-cell lymphopoiesis in secondary lymphoid tissue. Immunobiology. 2016;221:1343–1350. doi: 10.1016/j.imbio.2016.07.015.
  23. Garrido-Martín D, Palumbo E, Guigó R, Breschi A. ggsashimi: Sashimi plot revised for browser- and annotation-independent splicing visualization. PLOS Computational Biology. 2018;14:e1006360. doi: 10.1371/journal.pcbi.1006360.
  24. Gayoso A, Lopez R, Xing G, Boyeau P, Wu K, Jayasuriya M, Melhman E, Langevin M, Liu Y, Samaran J, Misrachi G, Nazaret A, Clivio O, Xu C, Ashuach T, Lotfollahi M, Svensson V, Beltrame E, Talavera-López C, Pachter L, Theis FJ, Streets A, Jordan MI, Regier J, Yosef N. Scvi-Tools: A Library for Deep Probabilistic Analysis of Single-Cell Omics Data. bioRxiv. 2021 doi: 10.1101/2021.04.28.441833.
  25. González-Tajuelo R, González-Sánchez E, Silván J, Muñoz-Callejas A, Vicente-Rabaneda E, García-Pérez J, Castañeda S, Urzainqui A. Relevance of PSGL-1 expression in B cell development and activation. Frontiers in Immunology. 2020;11:588212. doi: 10.3389/fimmu.2020.588212.
  26. Gupta I, Collier PG, Haase B, Mahfouz A, Joglekar A, Floyd T, Koopmans F, Barres B, Smit AB, Sloan SA, Luo W, Fedrigo O, Ross ME, Tilgner HU. Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cells. Nature Biotechnology. 2018;36:1197–1202. doi: 10.1038/nbt.4259.
  27. Hagemann-Jensen M, Ziegenhain C, Chen P, Ramsköld D, Hendriks G-J, Larsson AJM, Faridani OR, Sandberg R. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nature Biotechnology. 2020;38:708–714. doi: 10.1038/s41587-020-0497-0.
  28. Hayward BE, Bonthron DT. Structure and alternative splicing of the ketohexokinase gene. European Journal of Biochemistry. 1998;257:85–91. doi: 10.1046/j.1432-1327.1998.2570085.x.
  29. Hu H, Wang B, Borde M, Nardone J, Maika S, Allred L, Tucker PW, Rao A. Foxp1 is an essential transcriptional regulator of B cell development. Nature Immunology. 2006;7:819–826. doi: 10.1038/ni1358.
  30. Hu Y, Wang K, Li M. Detecting differential alternative splicing events in scRNA-seq with or without Unique Molecular Identifiers. PLOS Computational Biology. 2020;16:e1007925. doi: 10.1371/journal.pcbi.1007925.
  31. Huang Y, Sanguinetti G. BRIE: transcriptome-wide splicing quantification in single cells. Genome Biology. 2017;18:123. doi: 10.1186/s13059-017-1248-5.
  32. Huang Y, Sanguinetti G. BRIE2: computational identification of splicing phenotypes from single-cell transcriptomic experiments. Genome Biology. 2021;22:1–15. doi: 10.1186/s13059-021-02461-5.
  33. Joglekar A, Prjibelski A, Mahfouz A, Collier P, Lin S, Schlusche AK, Marrocco J, Williams SR, Haase B, Hayes A, Chew JG, Weisenfeld NI, Wong MY, Stein AN, Hardwick SA, Hunt T, Wang Q, Dieterich C, Bent Z, Fedrigo O, Sloan SA, Risso D, Jarvis ED, Flicek P, Luo W, Pitt GS, Frankish A, Smit AB, Ross ME, Tilgner HU. A spatially resolved brain region- and cell type-specific isoform atlas of the postnatal mouse brain. Nature Communications. 2021;12:1–16. doi: 10.1038/s41467-020-20343-5.
  34. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D. The human genome browser at UCSC. Genome Research. 2002;12:996–1006. doi: 10.1101/gr.229102.
  35. Kingma DP, Welling M. Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations (ICLR). 2014.
  36. Kingma DP, Ba J. Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR). 2015.
  37. Köster J, Rahmann S. Snakemake--a scalable bioinformatics workflow engine. Bioinformatics (Oxford, England) 2012;28:2520–2522. doi: 10.1093/bioinformatics/bts480. [DOI] [PubMed] [Google Scholar]
  38. La Manno G, Soldatov R, Zeisel A, Braun E, Hochgerner H, Petukhov V, Lidschreiber K, Kastriti ME, Lönnerberg P, Furlan A, Fan J, Borm LE, Liu Z, van Bruggen D, Guo J, He X, Barker R, Sundström E, Castelo-Branco G, Cramer P, Adameyko I, Linnarsson S, Kharchenko PV. RNA velocity of single cells. Nature. 2018;560:494–498. doi: 10.1038/s41586-018-0414-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Le Roith D, Shiloach J, Roth J. Is there an earlier phylogenetic precursor that is common to both the nervous and endocrine systems? Peptides. 1982;3:211–215. doi: 10.1016/0196-9781(82)90080-8. [DOI] [PubMed] [Google Scholar]
  40. Lebrigand K, Magnone V, Barbry P, Waldmann R. High throughput error corrected Nanopore single cell transcriptome sequencing. Nature Communications. 2020;11:1–8. doi: 10.1038/s41467-020-17800-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Lemos FO, Guerra MT, Leite M. Inositol 1,4,5 trisphosphate receptors in secretory epithelial cells of the gastrointestinal tract. Current Opinion in Physiology. 2020;17:169–174. doi: 10.1016/j.cophys.2020.08.003. [DOI] [Google Scholar]
  42. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:1–16. doi: 10.1186/1471-2105-12-323. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Li YI, Knowles DA, Humphrey J, Barbeira AN, Dickinson SP, Im HK, Pritchard JK. Annotation-free quantification of RNA splicing using LeafCutter. Nature Genetics. 2018;50:151–158. doi: 10.1038/s41588-017-0004-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics (Oxford, England) 2014;30:923–930. doi: 10.1093/bioinformatics/btt656. [DOI] [PubMed] [Google Scholar]
  45. Ling JP, Wilks C, Charles R, Leavey PJ, Ghosh D, Jiang L, Santiago CP, Pang B, Venkataraman A, Clark BS, Nellore A, Langmead B, Blackshaw S. ASCOT identifies key regulators of neuronal subtype-specific splicing. Nature Communications. 2020;11:1–12. doi: 10.1038/s41467-019-14020-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Liu DC, Nocedal J. On the limited memory BFGS method for large scale optimization. Mathematical Programming. 1989;45:503–528. doi: 10.1007/BF01589116. [DOI] [Google Scholar]
  47. Liu S, Zhou B, Wu L, Sun Y, Chen J, Liu S. Single-cell differential splicing analysis reveals high heterogeneity of liver tumor-infiltrating T cells. Scientific Reports. 2021;11:1–12. doi: 10.1038/s41598-021-84693-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Lopez R, Gayoso A, Yosef N. Enhancing scientific discoveries in molecular biology with deep generative models. Molecular Systems Biology. 2020;16:20199198. doi: 10.15252/msb.20199198. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.Journal. 2011;17:10. doi: 10.14806/ej.17.1.200. [DOI] [Google Scholar]
  50. Matsumoto H, Hayashi T, Ozaki H, Tsuyuzaki K, Umeda M, Iida T, Nakamura M, Okano H, Nikaido I. An NMF-based approach to discover overlooked differentially expressed gene regions from single-cell RNA-seq data. NAR Genomics and Bioinformatics. 2020;2:lqz020. doi: 10.1093/nargab/lqz020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. McInnes L, Healy J, Saul N, Großberger L. UMAP: Uniform Manifold Approximation and Projection. Journal of Open Source Software. 2018;3:861. doi: 10.21105/joss.00861. [DOI] [Google Scholar]
  52. Megill C, Martin B, Weaver C, Bell S, Prins L, Badajoz S, McCandless B, Pisco AO, Kinsella M, Griffin F, Kiggins J, Haliburton G, Mani A, Weiden M, Dunitz M, Lombardo M, Huang T, Smith T, Chambers S, Freeman J, Cool J, Carr A. Cellxgene: A Performant, Scalable Exploration Platform for High Dimensional Sparse Matrices. bioRxiv. 2021 doi: 10.1101/2021.04.05.438318. [DOI]
  53. Nip KM, Chiu R, Yang C, Chu J, Mohamadi H, Warren RL, Birol I. RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes. Genome Research. 2020;30:1191–1200. doi: 10.1101/gr.260174.119. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Ntranos V, Yi L, Melsted P, Pachter L. A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nature Methods. 2019;16:163–166. doi: 10.1038/s41592-018-0303-9. [DOI] [PubMed] [Google Scholar]
  55. Olivieri JE, Dehghannasiri R, Salzman J. The SpliZ Generalizes “Percent Spliced In” to Reveal Regulated Splicing at Single-Cell Resolution. bioRxiv. 2020 doi: 10.1101/2020.11.10.377572. [DOI] [PMC free article] [PubMed]
  56. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L. PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems; 2019. pp. 8026–8037. [Google Scholar]
  57. Patrick R, Humphreys DT, Janbandhu V, Oshlack A, Ho JWK, Harvey RP, Lo KK. Sierra: discovery of differential transcript usage from polyA-captured single-cell RNA-seq data. Genome Biology. 2020;21:1–27. doi: 10.1186/s13059-020-02071-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Pedrotti S, Giudice J, Dagnino-Acosta A, Knoblauch M, Singh RK, Hanna A, Mo Q, Hicks J, Hamilton S, Cooper TA. The RNA-binding protein Rbfox1 regulates splicing required for skeletal muscle structure and function. Human Molecular Genetics. 2015;24:2360–2374. doi: 10.1093/hmg/ddv003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Picelli S, Faridani OR, Björklund AK, Winberg G, Sagasser S, Sandberg R. Full-length RNA-seq from single cells using Smart-seq2. Nature Protocols. 2014;9:171–181. doi: 10.1038/nprot.2014.006. [DOI] [PubMed] [Google Scholar]
  60. Qiu X, Hill A, Packer J, Lin D, Ma YA, Trapnell C. Single-cell mRNA quantification and differential analysis with Census. Nature Methods. 2017;14:309–315. doi: 10.1038/nmeth.4150.
  61. Schaum NM, Karkanias J, Neff NF, May AP, Quake SR, Wyss-Coray T, Darmanis S, Batson J, Botvinnik O, Chen MB. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature. 2018;562:367–372. doi: 10.1038/s41586-018-0590-4.
  62. Schüle M, Butto T, Dewi S, Schlichtholz L, Strand S, Gerber S, Endres K, Schweiger S, Winter J. mTOR Driven Gene Transcription Is Required for Cholesterol Production in Neurons of the Developing Cerebral Cortex. International Journal of Molecular Sciences. 2021;22:6034. doi: 10.3390/ijms22116034.
  63. Sengar AS, Li H, Zhang W, Leung C, Ramani AK, Saw NM, Wang Y, Tu Y, Ross PJ, Scherer SW, Ellis J, Brudno M, Jia Z, Salter MW. Control of Long-Term Synaptic Potentiation and Learning by Alternative Splicing of the NMDA Receptor Subunit GluN1. Cell Reports. 2019;29:4285–4294. doi: 10.1016/j.celrep.2019.11.087.
  64. Song Y, Botvinnik OB, Lovci MT, Kakaradov B, Liu P, Xu JL, Yeo GW. Single-Cell Alternative Splicing Analysis with Expedition Reveals Splicing Dynamics during Neuron Differentiation. Molecular Cell. 2017;67:148–161. doi: 10.1016/j.molcel.2017.06.003.
  65. Stein M, Dütting S, Mougiakakos D, Bösl M, Fritsch K, Reimer D, Urbanczyk S, Steinmetz T, Schuh W, Bozec A, Winkler TH, Jäck HM, Mielenz D. A defined metabolic state in pre B cells governs B-cell development and is counterbalanced by Swiprosin-2/EFhd1. Cell Death and Differentiation. 2017;24:1239–1252. doi: 10.1038/cdd.2017.52.
  66. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, Hao Y, Stoeckius M, Smibert P, Satija R. Comprehensive Integration of Single-Cell Data. Cell. 2019;177:1888–1902. doi: 10.1016/j.cell.2019.05.031.
  67. Svensson V, Gayoso A, Yosef N, Pachter L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics (Oxford, England). 2020;36:3418–3421. doi: 10.1093/bioinformatics/btaa169.
  68. Tabula Muris Consortium. A single-cell transcriptomic atlas characterizes ageing tissues in the mouse. Nature. 2020;583:590–595. doi: 10.1038/s41586-020-2496-1.
  69. Tekath T, Dugas M. Differential transcript usage analysis of bulk and single-cell RNA-seq data with DTUrtle. Bioinformatics (Oxford, England). 2021;37:3781–3787. doi: 10.1093/bioinformatics/btab629.
  70. Traunmüller L, Gomez AM, Nguyen TM, Scheiffele P. Control of neuronal synapse specification by a highly dedicated alternative splicing program. Science (New York, N.Y.). 2016;352:982–986. doi: 10.1126/science.aaf2397.
  71. Vaquero-Garcia J, Barrera A, Gazzara MR, González-Vallinas J, Lahens NF, Hogenesch JB, Lynch KW, Barash Y. A new view of transcriptome complexity and regulation through the lens of local splicing variations. eLife. 2016;5:e11752. doi: 10.7554/eLife.11752.
  72. Volden R, Vollmers C. Highly Multiplexed Single-Cell Full-Length CDNA Sequencing of Human Immune Cells with 10X Genomics and R2C2. bioRxiv. 2020. doi: 10.1101/2020.01.10.902361.
  73. von Engelhardt J, Mack V, Sprengel R, Kavenstock N, Li KW, Stern-Bach Y, Smit AB, Seeburg PH, Monyer H. CKAMP44: a brain-specific protein attenuating short-term synaptic plasticity in the dentate gyrus. Science (New York, N.Y.). 2010;327:1518–1522. doi: 10.1126/science.1184178.
  74. Vuong CK, Black DL, Zheng S. The neurogenetics of alternative splicing. Nature Reviews. Neuroscience. 2016;17:265–281. doi: 10.1038/nrn.2016.27.
  75. Wamsley B, Jaglin XH, Favuzzi E, Quattrocolo G, Nigro MJ, Yusuf N, Khodadadi-Jamayran A, Rudy B, Fishell G. Rbfox1 Mediates Cell-type-Specific Splicing in Cortical Interneurons. Neuron. 2018;100:846–859. doi: 10.1016/j.neuron.2018.09.026.
  76. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456:470–476. doi: 10.1038/nature07509.
  77. Wang Q, Rio DC. JUM is a computational method for comprehensive annotation-free analysis of alternative pre-mRNA splicing patterns. PNAS. 2018;115:E8181–E8190. doi: 10.1073/pnas.1806018115.
  78. Wang X, He Y, Zhang Q, Ren X, Zhang Z. Direct Comparative Analyses of 10X Genomics Chromium and Smart-seq2. Genomics, Proteomics & Bioinformatics. 2021;19:253–266. doi: 10.1016/j.gpb.2020.02.005.
  79. Welch JD, Hu Y, Prins JF. Robust detection of alternative splicing in a population of single cells. Nucleic Acids Research. 2016;44:e73. doi: 10.1093/nar/gkv1525.
  80. Wen WX, Mead AJ, Thongjuea S. VALERIE: Visual-based inspection of alternative splicing events at single-cell resolution. PLOS Computational Biology. 2020;16:e1008195. doi: 10.1371/journal.pcbi.1008195.
  81. Westoby J, Artemov P, Hemberg M, Ferguson-Smith A. Obstacles to detecting isoforms using full-length scRNA-seq data. Genome Biology. 2020;21:1–19. doi: 10.1186/s13059-020-01981-w.
  82. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biology. 2018;19:1–5. doi: 10.1186/s13059-017-1382-0.
  83. Yan Q, Weyn-Vanhentenryck SM, Wu J, Sloan SA, Zhang Y, Chen K, Wu JQ, Barres BA, Zhang C. Systematic discovery of regulated and conserved alternative exons in the mammalian brain reveals NMD modulating chromatin regulators. PNAS. 2015;112:3445–3450. doi: 10.1073/pnas.1502849112.
  84. Yao Z, Liu H, Xie F, Fischer S, Adkins RS, Aldridge AI, Ament SA, Bartlett A, Behrens MM, Van den Berge K, Bertagnolli D, de Bézieux HR, Biancalani T, Booeshaghi AS, Bravo HC, Casper T, Colantuoni C, Crabtree J, Creasy H, Crichton K, Crow M, Dee N, Dougherty EL, Doyle WI, Dudoit S, Fang R, Felix V, Fong O, Giglio M, Goldy J, Hawrylycz M, Herb BR, Hertzano R, Hou X, Hu Q, Kancherla J, Kroll M, Lathia K, Li YE, Lucero JD, Luo C, Mahurkar A, McMillen D, Nadaf NM, Nery JR, Nguyen TN, Niu SY, Ntranos V, Orvis J, Osteen JK, Pham T, Pinto-Duarte A, Poirion O, Preissl S, Purdom E, Rimorin C, Risso D, Rivkin AC, Smith K, Street K, Sulc J, Svensson V, Tieu M, Torkelson A, Tung H, Vaishnav ED, Vanderburg CR, van Velthoven C, Wang X, White OR, Huang ZJ, Kharchenko PV, Pachter L, Ngai J, Regev A, Tasic B, Welch JD, Gillis J, Macosko EZ, Ren B, Ecker JR, Zeng H, Mukamel EA. A transcriptomic and epigenomic cell atlas of the mouse primary motor cortex. Nature. 2021;598:103–110. doi: 10.1038/s41586-021-03500-8.
  85. Yeo G, Holste D, Kreiman G, Burge CB. Variation in alternative splicing across human tissues. Genome Biology. 2004;5:1–15. doi: 10.1186/gb-2004-5-10-r74.

Editor's evaluation

Eduardo Eyras

This paper presents a new method to study known and novel alternative splicing events at the single-cell level and perform differential analysis across cell types. The method addresses current challenges in the analysis of splicing in single cells related to technical variation and experimental biases. Performing one of the most comprehensive studies to date with data from different mice, this work expands the body of splicing events that potentially define individual cell types.

Decision letter

Editor: Eduardo Eyras

Our editorial process produces two outputs: i) public reviews designed to be posted alongside the preprint for the benefit of readers; ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Robust and annotation-free analysis of alternative splicing across diverse cell types in mice" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and it was agreed that a resubmission that fully addresses all of the concerns raised would be suitable for further consideration for publication in eLife. The Reviewing Editor has drafted the following to help you prepare a revised submission.

Essential revisions:

The reviewers have indicated that although the work might be of interest to researchers working in alternative splicing, the method requires significant additional testing and benchmarking, and the novelty of the findings must be made more clear. The reviewers have provided multiple suggestions to improve this and other aspects of the manuscript.

Reviewer #1 (Recommendations for the authors):

Authors have applied their method to two big scRNA-Seq datasets and have reported multiple biological discoveries from their computational analysis. However, the presentation and validation of the results should be improved. I elaborate on my comments below:

One of my major concerns is about the evaluation and benchmarking analysis of scQuint. Authors have particularly reviewed some of the existing methods in the appendix, but they provided no comparison between the performance of scQuint and those methods. Particularly authors have mentioned on page 18 that they have previously analyzed these datasets with prior methods, but they did not provide any comparison between their findings and those by other methods. In its current form, it is extremely difficult to judge the sensitivity and specificity of scQuint or whether it is a new contribution to the field. Thus, the paper's contribution is to run a standard, published analysis on a single cell dataset. There is no functional or experimental validation to support or refute the findings. Further, there is no computational validation, in terms of testing whether the predictions in this dataset hold in other data.

One limitation of the current method is that it requires more statistical tests, which could lower statistical power because of the multiple hypothesis testing issue: it needs to perform a separate test for each gene/cell type pair, whereas some other methods need only one test per gene to find "genes" with cell-type-specific splicing.

Authors have used the same model as in leafcutter for their analysis. However, they claim that they are getting better p-value and clustering results compared to leafcutter. It is not clear why their method should perform better than leafcutter.

One of the major advantages of the tabula muris dataset is that it contains data from multiple mouse individuals (i.e., biological replicates), which can be leveraged to show the reproducibility of the biological findings across biological replicates. However, authors did not take advantage of this in presenting their results. I highly recommend that authors show that their results can be replicated across mouse individuals, by visualizing their results as stratified by donor ID. Reproducibility is important for distinguishing between a real reproducible biological signal and a biological/technical noise particularly for the unannotated splicing events as they might be a product of splicing noise.

Authors have applied their method to only SS2 and not to any 10x data. I believe that the tabula muris dataset contains 10x data as well. While I agree that 10x is more challenging than SS2 for splicing analysis, it is still a valuable resource for splicing analysis as it has higher throughput compared to SS2 and can better capture rare cell types. I recommend that authors comment on the applicability of their method to 10x in the paper and, if their method is applicable, show how their current results compare to the results based on 10x data.

Since the paper is about analyzing splicing in single cells, I think it is extremely valuable to show the variation at the "single-cell level" (rather than pseudobulked cell-type level) via box or violin plots. This is extremely important as it is not clear from the current plots (e.g., figure 5 c,d,e or figure 8c,d) that the splicing event was observed in how many cells in each cell type and what is the range of read counts per single cell in each cell type. As I mentioned earlier it is extremely difficult to judge the reproducibility and single-cell variation of the visualized splicing events in figures as the data is aggregated across all cells within the cell type from separate donors.

Authors mention that there is little overlap between differentially expressed and differentially spliced genes but on the other hand they say that the clusters based on splicing and expression latent space are highly consistent with each other. I think they should comment on why this is possible, is it because the same cluster has different markers in each space. If so, is it possible to highlight a few clusters and show their marker genes based on splicing and expression changes?

On page 9, authors say that they detected thousands of cell-type-specific events; however, they do not provide more specifics about these events. How many events exactly? Across how many distinct genes (also, what fraction of genes, and is this fraction consistent with previous studies)? And across how many distinct cell types? Also, it is not clear how the examples in figure 5 were chosen. Are they among the top genes? What are the top genes? Are they genes known to have cell-type-specific splicing?

The paper lacks any experimental validation on the discovered splicing events. It is extremely important to show through experimental/FISH validations that these events are not computational artifacts and can be detected in the cell types.

It is not clear how (and how many) splicing events in the B cell trajectory were identified. Do you report any event that is differential in any of the B cell states as a cell with alternative splicing in the B cell trajectory? And again, how were these examples chosen? Are they among the top genes in the B cell trajectory?

Authors say that they detected many more events in cortex and also higher fraction of unannotated events in cortex, is this because cortex has been more deeply sampled compared to other tissues (Table 2)? Authors should account for sampling depth differences between cell types to see which one is really more enriched in alternative splicing events.

For Figure 9C, what is the AUC if the model is trained on one individual and used for prediction on another mouse?

Is not the higher fraction of events in 5' UTRs vs 3' UTR a result of the bias in your method? As you only consider events with shared 3' SS and not events with shared 5' sites in your analysis?

How did authors account for the coverage-dependent bias (as reported in https://elifesciences.org/articles/54603) which could cause spurious splicing bimodality in scRNA-Seq?

Reviewer #2 (Recommendations for the authors):

To demonstrate the significance of the approach, a more complete performance evaluation, for example, using synthetic data, is recommended, as well as a comparison to alternative methods regarding biological significance.

eLife. 2022 Mar 1;11:e73520. doi: 10.7554/eLife.73520.sa2

Author response


Reviewer #1 (Recommendations for the authors):

Authors have applied their method to two big scRNA-Seq datasets and have reported multiple biological discoveries from their computational analysis. However, the presentation and validation of the results should be improved. I elaborate on my comments below:

One of my major concerns is about the evaluation and benchmarking analysis of scQuint. Authors have particularly reviewed some of the existing methods in the appendix, but they provided no comparison between the performance of scQuint and those methods. Particularly authors have mentioned on page 18 that they have previously analyzed these datasets with prior methods, but they did not provide any comparison between their findings and those by other methods. In its current form, it is extremely difficult to judge the sensitivity and specificity of scQuint or whether it is a new contribution to the field. Thus, the paper's contribution is to run a standard, published analysis on a single cell dataset. There is no functional or experimental validation to support or refute the findings. Further, there is no computational validation, in terms of testing whether the predictions in this dataset hold in other data.

Thank you for raising these concerns. We agree that a more rigorous evaluation of scQuint will improve the manuscript. Unfortunately, it is difficult to formally compare all of the methods we discuss because of differences in their requirements and output. For instance, nearly all methods either require annotations or don’t have two-sample tests for differential transcript usage proportions (DTU); neither of these conditions applies to scQuint. We hence reported our qualitative finding that embeddings based on the quantifications of various methods were hampered by technical biases on these datasets (Figure 1 and its two associated supplementary figures), which led us to have doubts about their performance.

We now introduce an in silico benchmarking procedure to demonstrate scQuint’s performance compared with other methods where such a comparison is feasible (Figure 4 in the section Differential splicing hypothesis testing with Generalized Linear Model on page 6). We had intended to include SCATS and DESJ, but were forced to exclude them due to difficulties getting their software to run (the same issues have been noted by other users on Github). To summarize, we found scQuint was comparable to the top methods in DTU testing when coverage artifacts were absent and vastly outperformed them when artifacts were present.

Per your later suggestion, we now look across different mice to help support the reliability of our findings (discussed in response to comments 4 and 6). Experimental validation, while ideal, has generally not been undertaken by relevant methods (with a couple of exceptions) and would require a time horizon and scope far beyond our intent for this work. We have begun one such collaboration, however. (See also the response to comment 9.)

One limitation of the current method is that it requires more statistical tests, which could lower statistical power because of the multiple hypothesis testing issue: it needs to perform a separate test for each gene/cell type pair, whereas some other methods need only one test per gene to find "genes" with cell-type-specific splicing.

We understand such a test can be advantageous in certain situations. Nevertheless, we decided to test each cell type against the rest, as is the usual approach in the field (like, for example, in the tutorials for the popular single-cell analysis frameworks Seurat and Scanpy).
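To make the multiple-testing burden raised in this comment concrete, the sketch below builds one-vs-rest labels for a cell type and adjusts a set of p-values for the number of tests performed. The response does not state which correction scQuint applies, so the Benjamini-Hochberg procedure used here is an assumption for illustration only, and the function names are hypothetical.

```python
def one_vs_rest_labels(cell_types, target):
    """Return 1 for cells of the target type and 0 for all other cells."""
    return [1 if ct == target else 0 for ct in cell_types]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed correction, see lead-in)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the adjusted
    # values stay monotone in the original ranking.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# One test per (gene, cell type) pair: the number of p-values entering the
# correction grows with the number of cell types.
n_genes, n_cell_types = 1000, 20
n_tests_one_vs_rest = n_genes * n_cell_types
n_tests_per_gene = n_genes
```

Under this sketch, the one-vs-rest design corrects over `n_genes * n_cell_types` p-values rather than `n_genes`, which is the trade-off the comment describes.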

Authors have used the same model as in leafcutter for their analysis. However, they claim that they are getting better p-value and clustering results compared to leafcutter. It is not clear why their method should perform better than leafcutter.

Thank you for pointing out this ambiguity. We adapted several aspects of LeafCutter’s approach to make it more suitable for the scRNA-seq context, and this is the source of differences in accuracy and scalability. We now include a short description of these modifications in the Methods section and refer to these changes in the appropriate locations of the Introduction and Results. Specifically, we use a modified intron quantification (LeafCutter groups introns that share 5’ donor or 3’ acceptor sites while scQuint groups those sharing 3’ acceptor sites). Moreover, LeafCutter employs a vector of concentration parameters where entries correspond to introns whereas scQuint utilizes a single concentration parameter for each intron group. Finally, the optimization procedure of scQuint was implemented separately and makes a number of different choices, notably adopting a different initialization strategy and using different hyperparameters. The aggregate effect of these changes improves the robustness, estimability, memory requirement, and speed of the quantification and testing procedure when applied to scRNA-seq reads. See Figure 2—figure supplement 2.
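As an illustration of the grouping rule described above (introns sharing a 3' acceptor site are placed in the same group), a minimal sketch follows. It is not taken from the scQuint code base: the tuple layout, the function name, and the choice to keep only groups with at least two introns are assumptions made for this example.

```python
from collections import defaultdict


def group_by_acceptor(introns):
    """Group introns that share a 3' acceptor site.

    `introns` is an iterable of (chrom, start, end, strand) tuples
    (hypothetical layout). Returns a dict keyed by (chrom, strand, acceptor).
    """
    groups = defaultdict(list)
    for chrom, start, end, strand in introns:
        # The 3' acceptor is the downstream end of the intron in transcript
        # orientation: the larger coordinate on '+', the smaller on '-'.
        acceptor = end if strand == "+" else start
        groups[(chrom, strand, acceptor)].append((chrom, start, end, strand))
    # Assumption: a group is only informative if it has alternatives.
    return {key: members for key, members in groups.items() if len(members) >= 2}


example_introns = [
    ("chr1", 100, 500, "+"),
    ("chr1", 200, 500, "+"),  # same acceptor (500) as the intron above
    ("chr1", 600, 900, "+"),  # no alternative, dropped
    ("chr2", 50, 300, "-"),
    ("chr2", 50, 400, "-"),   # same acceptor (50) on the minus strand
]
intron_groups = group_by_acceptor(example_introns)
```

Grouping on a single splice site type means each intron belongs to exactly one group, which keeps the per-group count vectors non-overlapping.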

One of the major advantages of the tabula muris dataset is that it contains data from multiple mouse individuals (i.e., biological replicates), which can be leveraged to show the reproducibility of the biological findings across biological replicates. However, authors did not take advantage of this in presenting their results. I highly recommend that authors show that their results can be replicated across mouse individuals, by visualizing their results as stratified by donor ID. Reproducibility is important for distinguishing between a real reproducible biological signal and a biological/technical noise particularly for the unannotated splicing events as they might be a product of splicing noise.

Absolutely, thank you for this great suggestion. We went through many genes, visualizing the results across donors and found that the splicing patterns were well-conserved. We now include supplementary figures demonstrating this for several of the events we highlighted in the manuscript (Figure 6—figure supplements 2,3,4; Figure 8—figure supplements 1,2; Figure 9—figure supplements 2,3; Figure 11—figure supplements 2,3,4). We further built a cell type classifier with no overlap in donors between the training and test sets to see if cell-type-specific patterns were preserved across mice, finding that the classifier performed just as well as when all mice were used in both the training and test sets. The updated panels Figure 10d/e now show these results.
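The donor-disjoint evaluation described above can be sketched as follows. This is a hedged illustration only: the function name, the test fraction, and the seed are hypothetical, and the classifier itself is omitted.

```python
import random


def donor_disjoint_split(donor_ids, test_fraction=0.3, seed=0):
    """Split cell indices so that no donor appears in both train and test.

    `donor_ids` holds one donor label per cell. Whole donors are assigned
    to the test set; the remaining donors form the training set.
    """
    donors = sorted(set(donor_ids))
    rng = random.Random(seed)
    rng.shuffle(donors)
    n_test = max(1, round(len(donors) * test_fraction))
    test_donors = set(donors[:n_test])
    train_idx = [i for i, d in enumerate(donor_ids) if d not in test_donors]
    test_idx = [i for i, d in enumerate(donor_ids) if d in test_donors]
    return train_idx, test_idx


cell_donors = ["m1", "m1", "m2", "m3", "m3", "m4", "m2", "m4"]
train_idx, test_idx = donor_disjoint_split(cell_donors)
```

Splitting by donor rather than by cell is what allows the comparison with the setting in which all mice appear in both the training and test sets.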

Authors have applied their method to only SS2 and not to any 10x data. I believe that the tabula muris dataset contains 10x data as well. While I agree that 10x is more challenging than SS2 for splicing analysis, it is still a valuable resource for splicing analysis as it has higher throughput compared to SS2 and can better capture rare cell types. I recommend that authors comment on the applicability of their method to 10x in the paper and, if their method is applicable, show how their current results compare to the results based on 10x data.

This is indeed an important topic to discuss. We have added the text below to the manuscript to address your comment. Moreover, while it is time consuming to process and analyze all the 10x data from Tabula Muris, we did try a pilot analysis on a smaller, related 10x dataset, finding rather few alternative introns. For now, we think this is the safest advice.

“We do not recommend using scQuint to analyze alternative splicing in 10x Genomics Chromium data given its strong 3' transcript bias and evidence suggesting these data can detect about half the number of junctions as Smart-seq2 (Wang et al. 2021). While this imposes a fundamental limit on the number of isoforms that can be distinguished, several approaches for differential transcript usage in 10x data have been developed: Sierra (Patrick et al. 2020), SpliZ (Olivieri et al. 2021), and a kallisto-based approach which could be adapted for this task (Ntranos et al. 2019). While a systematic benchmark is missing, we expect alternative intron quantification to be sub-optimal in this setting.”

Since the paper is about analyzing splicing in single cells, I think it is extremely valuable to show the variation at the "single-cell level" (rather than pseudobulked cell-type level) via box or violin plots. This is extremely important as it is not clear from the current plots (e.g., figure 5 c,d,e or figure 8c,d) that the splicing event was observed in how many cells in each cell type and what is the range of read counts per single cell in each cell type. As I mentioned earlier it is extremely difficult to judge the reproducibility and single-cell variation of the visualized splicing events in figures as the data is aggregated across all cells within the cell type from separate donors.

We want to emphasize that all analyses (with the exception of the splice factor regression model in Figure 11) use the single cell data directly and are not pseudobulked. We now make this more explicit in the manuscript. The new supplementary figures (Figure 6—figure supplements 2,3,4; Figure 8—figure supplements 1,2; Figure 9—figure supplements 2,3; Figure 11—figure supplements 2,3,4) to which we alluded in our response to comment 4 show the distribution of introns across donors for these particular events and are broadly representative of the general trend of consistency we see across genes. These display the strong cell-type specificity and consistency across individuals. We did try box and violin plots, but given the relatively binary nature of the data within cell types, they didn’t yield particularly informative visualizations.

Authors mention that there is little overlap between differentially expressed and differentially spliced genes but on the other hand they say that the clusters based on splicing and expression latent space are highly consistent with each other. I think they should comment on why this is possible, is it because the same cluster has different markers in each space. If so, is it possible to highlight a few clusters and show their marker genes based on splicing and expression changes?

Thank you for raising this question, as it is an interesting and important one which gets at the biological phenomena at play. Indeed, we found that the markers are generally distinct in the expression and splicing spaces despite both data modalities yielding highly consistent clusterings. Figures 6b and 9a-b show that the patterns of alternative splicing across cell types generally do not correlate with the expression of corresponding genes across the same cell types. Table 2 also shows the extremely narrow overlap of the most prominent markers in these two spaces for 5 different tissues. We have also added Figure 6—figure supplement 1 to display the expression marker genes for the cortex cell types. Inspection reveals very little overlap with the splicing markers.

On page 9, authors say that they detected thousands of cell-type-specific events; however, they do not provide more specifics about these events. How many events exactly? Across how many distinct genes (also, what fraction of genes, and is this fraction consistent with previous studies)? And across how many distinct cell types? Also, it is not clear how the examples in figure 5 were chosen. Are they among the top genes? What are the top genes? Are they genes known to have cell-type-specific splicing?

Figure 10 contains information about the number of differential splicing events and differentially spliced genes across different cell types in both the Cortex and Tabula Muris datasets. We have inserted a parenthetical remark in the sentence you mentioned to direct readers to the later section that contains this figure and a discussion of the information it contains.

The splicing events in Figure 6 (the former figure 5) panel b were chosen after ranking by inferred effect size with some manual curation to ensure cell-type specificity. Genes for panels c-e (and for subsequent Sashimi plots) were selected from our pool of cell-type-specific novel events based on the clarity of alternative splicing/transcription, and genes were prioritized if there was prior knowledge of roles in relevant biological processes.

The paper lacks any experimental validation on the discovered splicing events. It is extremely important to show through experimental/FISH validations that these events are not computational artifacts and can be detected in the cell types.

We agree that experimental validation is a vital avenue to pursue, and we have begun a collaboration to investigate some of the computationally inferred splicing differences we found in neuronal cells. However, for reasons of the necessary time and our intended scope, we do not feel that it is appropriate to include that as part of this manuscript. We also note that the large majority of computational methods papers for splicing analysis do not perform any experimental validation (we found only 2 out of 13 did so). Our hope with this work is to help generate reasonable hypotheses for follow-up experimentation given the expense of such an endeavor.

It is not clear how (and how many) splicing events in the B cell trajectory were identified. Do you report any event that is differential in any of the B cell states as a cell with alternative splicing in the B cell trajectory? And again, how were these examples chosen? Are they among the top genes in the B cell trajectory?

We have inserted panel b of Figure 8 (formerly Figure 7) to explicitly give the number of identified splicing events for each B cell stage. Examples were chosen by selecting genes we identified as containing differential splicing/transcription start events across B cell stages which also have previously identified roles (based on expression) in B cell development. This allowed a comparison of the progression of splicing and expression in genes where both processes seem to play a role in development.

Authors say that they detected many more events in cortex and also higher fraction of unannotated events in cortex, is this because cortex has been more deeply sampled compared to other tissues (Table 2)? Authors should account for sampling depth differences between cell types to see which one is really more enriched in alternative splicing events.

We have updated the relevant section so that it provides readers better context for our results. In particular, we note that the higher rate of unannotated events in the cortex is consistent with previous observations. Because of the one-vs-all testing procedure, the results are affected by the sequencing depths and the number of cell types, which vary across tissues. This makes a comprehensive evaluation difficult, but we now mention that these factors may complicate a direct interpretation of the number of identified events. The newly added items are underlined.

“We found many more cell-type-specific differential splicing events in the cortex than in the marrow, as expected (Yeo et al. 2004), as well as a higher proportion of events involving novel junctions, which can reach 30% (Figure 10a). Differences in proportion of novel junctions should be interpreted with care, however, since they can be affected by sequencing depth and number of cells, both of which vary between the two tissues. Very similar patterns are seen when grouping differential splicing events that occur in the same gene (Figure 10b).”

For Figure 9C, what is the AUC if the model is trained on one individual and used for prediction on another mouse?

Figure 10d now shows the AUC when the classifier is trained and tested on different sets of mice. The values are extremely similar to when there was no partition of individuals between the training and testing data.

Is not the higher fraction of events in 5' UTRs vs 3' UTR a result of the bias in your method? As you only consider events with shared 3' SS and not events with shared 5' sites in your analysis?

We now include a note that our method of intron quantification influences the 5’ vs 3’ ratio we observe:

“Most differential splicing events that we detected with alternative introns fall in the coding portion of the gene, with high proportions in the 5' UTR (Figure 10c). This is a property of our quantification approach and does not reflect the total number of alternative splicing events in different gene regions; still, the relative proportion can be compared across tissues.”

How did authors account for the coverage-dependent bias (as reported in https://elifesciences.org/articles/54603) which could cause spurious splicing bimodality in scRNA-Seq?

We also encountered this issue while working with these datasets, and it was an important consideration to select a model robust to such technical noise. The Dirichlet-Multinomial model we chose is able to handle this bimodality by fitting a concentration parameter close to zero. To clarify this for readers, we have inserted the following sentence:

“For example, the often encountered “binary” splicing (Najar et al. 2020) can be modeled by fitting a concentration parameter close to zero.”
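The effect of the concentration parameter can be checked numerically with a short sketch. It assumes a Dirichlet-Multinomial whose Dirichlet parameters are the concentration multiplied by the mean intron proportions; this parametrization and the function names are assumptions for illustration and are not taken from the scQuint code.

```python
from math import exp, lgamma


def dm_logpmf(counts, proportions, concentration):
    """Log-probability of intron counts under a Dirichlet-Multinomial with
    Dirichlet parameters concentration * proportions (assumed form)."""
    n = sum(counts)
    alphas = [concentration * p for p in proportions]
    total = sum(alphas)
    log_coef = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    log_norm = lgamma(total) - lgamma(n + total)
    log_terms = sum(lgamma(c + a) - lgamma(a) for c, a in zip(counts, alphas))
    return log_coef + log_norm + log_terms


def prob_all_or_nothing(n_reads, concentration):
    """Probability that every read in a cell supports a single intron, for a
    two-intron group with equal mean usage."""
    p = [0.5, 0.5]
    return exp(dm_logpmf([n_reads, 0], p, concentration)) + exp(
        dm_logpmf([0, n_reads], p, concentration)
    )


# Concentration near zero: almost all probability mass sits on cells that
# use one intron exclusively ("binary" splicing).
binary_like = prob_all_or_nothing(10, 0.01)
# Large concentration: counts behave much like a plain multinomial, and
# exclusive usage of one intron across 10 reads becomes rare.
smooth = prob_all_or_nothing(10, 100.0)
```

In both settings the mean usage of each intron is 0.5; only the cell-to-cell spread around that mean changes with the concentration.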

Reviewer #2 (Recommendations for the authors):

To demonstrate the significance of the approach, a more complete performance evaluation, for example, using synthetic data, is recommended, as well as a comparison to alternative methods regarding biological significance.

We thank the reviewer for their careful reading of our manuscript and for noting its strengths and weaknesses. We have revised the paper with a new computational benchmarking section and a better discussion of the biological novelty and insights yielded by our method in this analysis.

Associated Data


    Data Citations

    1. Schaum et al 2018. Tabula Muris. NCBI Gene Expression Omnibus. GSE109774
    2. Yao et al 2021. BRAIN Initiative Cell Census Network Cortex. nemoarchive. dat-ch1nqb7

    Supplementary Materials

    Table 1—source data 1. Number of cells per cell type and donor in BICCN Cortex.
    Table 1—source data 2. Number of cells per tissue and donor in Tabula Muris.
    Figure 6—source data 1. Intron coordinates for panel (b).
    Figure 8—source data 1. Intron coordinates for panel (c).
    Figure 9—source data 1. Intron coordinates for panel (a).
    Figure 9—source data 2. Intron coordinates for panel (b).
    Figure 11—source data 1. Intron coordinates for panel (a).
    Transparent reporting form

    Data Availability Statement

    scQuint is implemented in Python and is available at https://github.com/songlab-cal/scquint (Benegas, 2021b; copy archived at swh:1:rev:a9db6454e13d42af25f47deee19e201e74d2bdd0). Differential splicing results and access to cell and genome browsers, together with the code to reproduce our results, are available at https://github.com/songlab-cal/scquint-analysis (Benegas, 2021c; copy archived at swh:1:rev:97dc31babf2a585666af4a38b1e4aa59a92bbf87). Processed alternative intron count matrices are provided in the AnnData format (anndata.readthedocs.io) for easy manipulation with Scanpy (Wolf et al., 2018), Seurat (Stuart et al., 2019), and other tools.

    All data analyzed in this study are publicly available, and URL links are provided in the Materials and methods section of our manuscript. Our source code as well as all results represented in figures and tables are publicly available in our lab's GitHub repositories: https://github.com/songlab-cal/scquint (copy archived at swh:1:rev:a9db6454e13d42af25f47deee19e201e74d2bdd0) and https://github.com/songlab-cal/scquint-analysis (copy archived at swh:1:rev:97dc31babf2a585666af4a38b1e4aa59a92bbf87).

    The following previously published datasets were used:

    Schaum et al 2018. Tabula Muris. NCBI Gene Expression Omnibus. GSE109774

    Yao et al 2021. BRAIN Initiative Cell Census Network Cortex. nemoarchive. dat-ch1nqb7


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
