Abstract
Motivation
Existing nanopore single-cell data analysis tools showed severe limitations in handling current data sizes.
Results
We introduce scywalker, an innovative and scalable package developed to comprehensively analyze long-read sequencing data of full-length single-cell or single-nuclei cDNA. We developed novel scalable methods for cell barcode demultiplexing and single-cell isoform calling and quantification and incorporated these in an easily deployable package. Scywalker streamlines the entire analysis process, from sequenced fragments in FASTQ format to demultiplexed pseudobulk isoform counts, into a single command suitable for execution on either server or cluster. Scywalker includes data quality control, cell type identification, and an interactive report. Assessment of datasets from the human brain, Arabidopsis leaves, and previously benchmarked data from mixed cell lines demonstrates excellent correlation with short-read analyses at both the cell-barcoding and gene quantification levels. At the isoform level, we show that scywalker facilitates the direct identification of cell-type-specific expression of novel isoforms.
Availability and implementation
Scywalker is available on github.com/derijkp/scywalker under the GNU General Public License (GPL) and at https://zenodo.org/records/13359438/files/scywalker-0.108.0-Linux-x86_64.tar.gz.
1 Introduction
Single-cell and single-nuclei (collectively called single-cell hereafter) transcriptome sequencing has revolutionized our understanding of processes in health and disease, especially in heterogeneous tissue like the human brain (Piwecka et al. 2023). Alternative isoforms, with variation in start sites, splicing of exons, or transcript ends, are highly prevalent in the transcriptome of complex eukaryotes (Wang et al. 2008, Park et al. 2018). However, the most commonly used single-cell short-read sequencing methods only result in data from the 5ʹ or 3ʹ ends of the transcripts. Even with short-read sequencing covering the entire gene length, there is no direct observation of full-length transcripts and only a partial reconstruction of the splice diversity.
In contrast, long-read sequencing methods from Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) enable the sequencing of full-length transcripts and the identification and quantification of all isoform variations. Bulk long-read transcriptome sequencing consistently leads to the discovery of novel isoforms (Clark et al. 2020, Glinos et al. 2022, Heberle et al. 2023) but lacks the cell-type resolution and may miss isoforms from lowly abundant cell types. Therefore, there is great value in optimizing long-read single-cell transcriptomic methods. Initial strategies to deal with the higher error rate of long-read sequencing included combining short-read sequencing data from the same library (Gupta et al. 2018, Lebrigand et al. 2020), rolling circle amplification (Volden et al. 2018), or dimer nucleotide blocks for barcodes and UMIs (Philpott et al. 2021). Methods not relying on short-read sequencing data arose with decreasing nucleotide-level error rates (Tian et al. 2021, You et al. 2023, Oxford Nanopore Technologies 2024). These methods were tested on relatively small datasets and, in our experience, did not scale to the large datasets (>10 000 cells per sample) currently produced. We developed scywalker to address this issue, creating a more scalable package for adequate analysis of this rich data source using novel methods and extensive parallelization while simultaneously improving the accuracy of the results and utility. In contrast to existing tools, scywalker can, in one command, provide analysis of multiple samples, generating ready-to-use per-cell expression counts and multi-sample per-cell-type summed pseudobulk counts, allowing easy comparison of gene and isoform expression over multiple samples and cell types.
2 Materials and methods
2.1 Single-nuclei and single-cell transcriptome sequencing
Four fresh frozen human brain samples (BA10) were provided by the NeuroBiobank of the Born-Bunge Institute (IBB-Neurobiobank), Wilrijk (Antwerp), Belgium; ID: BB190113. The study was approved by the Ethics Committee of the University Hospital Antwerp and the University Antwerp (20/10/107). Nuclei isolation from human brain samples was performed using an adapted density gradient protocol (Habib et al. 2016). A different protease inhibitor (1× cOmplete EDTA-free protease inhibitor) was used, and the lysis time was reduced to 2 min.
Arabidopsis (Arabidopsis thaliana L. Heynh.) cv. Columbia-0 (Col-0) was grown in a controlled-environment growth chamber (Weiss Technik). For isolation of leaf protoplasts (∼5 excised leaves from four-week-old seedlings pooled per sample), a “tape-sandwich” method was used, where the abaxial epidermis of leaves was peeled using adhesive tape (Scotch® Magic™, 3M) and leaves were immediately immersed in a cell-wall degrading enzymatic buffer in protoplasting buffer (0.6 M mannitol, 20 mM KCl, 10 mM CaCl2, 20 mM MES, 0.1% BSA, 1.0% cellulase R10, 0.3% macerozyme) at pH 5.7. The enzymatic reaction was performed for 1 h at room temperature with gentle rotation (30 rpm) in the dark. The solution was filtered through pre-wet 70 μm nylon mesh, collected by centrifugation at 200×g for 6 min, and resuspended in wash buffer (0.6 M mannitol, 20 mM KCl, 10 mM CaCl2, 20 mM MES).
The remainder of the protocol is the same for leaf and brain samples. Gel bead-in-emulsion (GEM) generation and droplet barcoding were performed according to the 10x Genomics protocols with Chromium Next GEM Single Cell 3ʹ Kit v3.1 and Chromium Next GEM Chip G Single Cell Kit, aiming for 10 000 cells per sample. Unfragmented cDNA was prepared for nanopore sequencing on the ONT PromethION (P24) with a combination of the primers from SQK-PCS111, rapid adapters from EXP-RAA114, and auxiliary vials for loading on R10.4.1 flow cells from EXP-AUX003. Data was basecalled on the ONT PromethION (P24) using the Guppy basecaller (v7.0.9) with the SUP model. In parallel, the droplets were further processed for short-read sequencing using the 10x Chromium Next GEM Single Cell 3ʹ Reagent Kits v3.1 (Dual Index) per the manufacturer’s protocol (10x user guide CG000315). The libraries were sequenced on the Illumina NovaSeq 6000 v1.5 sequencing kit using S4 flow cells, targeting 40 000 reads per cell. The sequencing data was processed with the Cell Ranger pipeline. For the analysis of the human samples, the GRCh38 reference was used (ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz). The GENCODE V42 annotations as downloaded from the UCSC Genome Browser (genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=wgEncodeGencodeCompV42) were used as reference transcripts. For the Arabidopsis samples, TAIR10 (ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz) was used as the reference genome, and the Ensembl release 56 annotations (ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.56.gtf.gz) were used as transcript references.
2.2 Implementation of the scywalker pipeline
The scywalker pipeline (Supplementary Fig. S1) is implemented within the GenomeComb framework (github.com/derijkp/genomecomb/), a toolset for efficiently analyzing various types of sequencing data.
2.2.1 Scywalker droplet barcoding
All barcode files are merged, summarized, and sorted by read count. The top barcode (largest read count) is taken as a correct droplet barcode, and the remaining barcodes are searched for barcodes with one difference using cost-only dynamic programming with a cost cut-off of 1. This search is precomputed in parallel in 50 batches. All hits will be assigned to the initial droplet barcode and removed from the list for further processing. This procedure is then repeated with the following (free) barcode in the list until it has fewer than (by default) 20 reads, or until (by default) 100 000 droplets have already been identified. If a whitelist of all potential droplet barcodes is given, only barcodes in this list will be processed. The resulting mapping of read barcodes to droplet/cell barcodes will be combined with the FASTQs and their barcoding files to generate FASTQs where the corrected (droplet) barcode and UMI are prepended before the read name and added as FASTQ comments.
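The stepwise procedure described above can be illustrated with a minimal Python sketch. It is not scywalker's implementation: the function names are hypothetical, the one-difference test is a simple linear scan that stands in for the cost-only dynamic programming (we assume that "one difference" covers a single substitution, insertion, or deletion), and the optional whitelist and the parallel precomputation in batches are left out.

```python
from collections import Counter


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one substitution, insertion, or deletion."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equally long) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]  # one extra base in b at position i


def assign_droplet_barcodes(read_barcodes, min_reads=20, max_droplets=100_000):
    """Map every observed barcode to a droplet barcode (hypothetical helper).

    Barcodes are visited from highest to lowest read count. Each free barcode
    becomes a droplet barcode and absorbs all free barcodes within one edit.
    """
    counts = Counter(read_barcodes)
    ranked = sorted(counts, key=lambda bc: (-counts[bc], bc))
    mapping = {}
    n_droplets = 0
    for seed in ranked:
        if seed in mapping:
            continue  # already absorbed by a higher-ranked droplet barcode
        if counts[seed] < min_reads or n_droplets >= max_droplets:
            break  # stopping criteria from the text (defaults 20 and 100 000)
        mapping[seed] = seed
        n_droplets += 1
        for other in ranked:
            if other not in mapping and within_one_edit(seed, other):
                mapping[other] = seed
    return mapping


if __name__ == "__main__":
    reads = ["AAAA"] * 30 + ["AAAT"] * 3 + ["CCCC"] * 25 + ["CCC"] * 2 + ["GGGG"] * 5
    print(assign_droplet_barcodes(reads))
```

In this toy input, GGGG stays unassigned because it has fewer than 20 reads and is not within one edit of a droplet barcode. The quadratic inner loop is only acceptable for illustration; a real dataset needs the batched, precomputed search described in the text.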
2.2.2 Scywalker gene and isoform calling
The barcoded FASTQ files are aligned to the reference genome with minimap2 (Li 2018) using the splice preset in separate jobs for each file and coordinate sorted using a modified version of gnu-sort. The presorted alignments are merged using the gnu-sort mergesort, resulting in one sorted alignment file in bam format.
Genes and isoforms are then identified and counted by postprocessing the results of IsoQuant (Prjibelski et al. 2023), which is run in a highly parallelized manner. Scywalker splits the alignments into smaller regions (default target size 5 Mb, but splits can only happen in 250 kb regions without known genes) that are processed separately to improve efficiency and reduce peak memory use. These are run in separate jobs instead of threads (to allow distribution over a cluster). IsoQuant is run on each region alignment, and reads are analyzed as bulk data (command included in the Supplementary code). Scywalker then adapts the IsoQuant output: IsoQuant produces separate results for transcripts in the given reference set (known) and for predicted novel models not in the reference set. These are merged into one. In this process, the identifiers given to novel isoforms and genes by IsoQuant are changed to a unique identifier based on the genomic location to avoid collisions when separately processed regions are later merged. The IsoQuant read_assignment file (for the reference calls) is also adapted by merging in the transcript_model_reads data (changing the assignment of reads supporting novel transcripts), putting information about detected polyA and classifications in separate columns, and adding information about the completeness of the match. The read_assignment files can be used later to calculate single-cell counts because the cell barcodes are incorporated through the read names.
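The region splitting can be pictured with the following Python sketch. It rests on assumptions that the text does not spell out: a split is placed at the midpoint of a gene-free gap of at least 250 kb, and only once the current region has reached the target size. The function name and the coordinate convention (0-based, half-open) are hypothetical.

```python
def split_regions(chrom_len, genes, target=5_000_000, min_gap=250_000):
    """Split one chromosome into regions without cutting through genes.

    genes: list of (start, end) gene intervals on the chromosome.
    Returns a list of (start, end) regions covering [0, chrom_len).
    """
    regions = []
    region_start = 0
    prev_end = 0  # end of the right-most gene seen so far
    for gene_start, gene_end in sorted(genes):
        gap = gene_start - prev_end
        if gap >= min_gap:
            cut = (prev_end + gene_start) // 2  # middle of the gene-free gap
            if cut - region_start >= target:
                regions.append((region_start, cut))
                region_start = cut
        prev_end = max(prev_end, gene_end)
    regions.append((region_start, chrom_len))
    return regions


if __name__ == "__main__":
    genes = [
        (1_000_000, 1_100_000),
        (1_150_000, 1_200_000),
        (6_000_000, 6_100_000),
        (6_500_000, 6_600_000),
        (13_000_000, 13_200_000),
    ]
    for region in split_regions(20_000_000, genes):
        print(region)
```

Because every cut falls in a gene-free gap, each gene ends up in exactly one region, which is what allows the regions to be processed as independent jobs and merged afterwards.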
Scywalker uses a custom algorithm to analyze organelle genomes. Only gene locations are loaded into memory to minimize memory use, while the numerous read alignments are loaded and analyzed individually. For each read alignment, the CIGAR string is parsed to define the different alignment regions of the read, and the percent overlap with genes is calculated. If multiple overlaps exist, the read is assigned to the nonribosomal RNA gene with the highest percent overlap. A file similar to the IsoQuant read_assignment file is produced. In the assignment_events field of the file, one of the following is recorded to indicate how the match/read relates to a reference transcript: mono_exon_match if the alignment's start and end coordinates differ by at most 3 bases from the known gene location, mono_exon_enclosed for alignments entirely within the gene (but not a match), and mono_exon_overlap for alignments overlapping at least 5% of the gene. This information is later considered when calculating final read counts: reads with mono_exon_overlap are included in the total weighted count, while they are left out of unique counts.
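As an illustration of this organelle read assignment, a minimal Python sketch follows. It is an approximation under stated assumptions rather than scywalker's code: coordinates are taken as 0-based and half-open, the percent overlap is computed relative to the gene length, rRNA genes are only used when no other gene overlaps, and all function names are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(start, cigar):
    """Reference blocks covered by a read, split at N (skipped region) operations."""
    blocks, pos, block_start = [], start, start
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=XD":  # these operations consume reference bases
            pos += n
        elif op == "N":  # skipped reference region: close the current block
            if pos > block_start:
                blocks.append((block_start, pos))
            pos += n
            block_start = pos
        # I, S, H, and P do not consume reference bases
    if pos > block_start:
        blocks.append((block_start, pos))
    return blocks


def overlap_pct(blocks, gene_start, gene_end):
    """Percent of the gene covered by the aligned blocks (assumed definition)."""
    covered = sum(max(0, min(e, gene_end) - max(s, gene_start)) for s, e in blocks)
    return 100.0 * covered / (gene_end - gene_start)


def classify(start, end, gene_start, gene_end, pct, tolerance=3, min_pct=5.0):
    """Return the assignment event, using the thresholds given in the text."""
    if abs(start - gene_start) <= tolerance and abs(end - gene_end) <= tolerance:
        return "mono_exon_match"
    if start >= gene_start and end <= gene_end:
        return "mono_exon_enclosed"
    if pct >= min_pct:
        return "mono_exon_overlap"
    return None


def assign_read(start, cigar, genes):
    """genes: list of (name, start, end, is_rrna). Returns (gene, event, pct) or None."""
    blocks = aligned_blocks(start, cigar)
    if not blocks:
        return None
    hits = [(overlap_pct(blocks, gs, ge), name, gs, ge, rrna) for name, gs, ge, rrna in genes]
    hits = [h for h in hits if h[0] > 0]
    if not hits:
        return None
    non_rrna = [h for h in hits if not h[4]]
    pct, name, gs, ge, _ = max(non_rrna or hits, key=lambda h: h[0])
    return name, classify(blocks[0][0], blocks[-1][1], gs, ge, pct), pct


if __name__ == "__main__":
    genes = [("rrn16", 100, 1600, True), ("psbA", 1500, 2500, False)]
    print(assign_read(1490, "1000M", genes))
```

The gene names and coordinates in the usage example are made up for the illustration. Only the gene list is held in memory, and each alignment is handled on its own, mirroring the memory-saving design described above.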
After the regions are processed separately, the read assignment info is concatenated, sorted on the read name (which starts with barcode and UMI info), and processed per block of assignments concerning the same read (has the same barcode and UMI, or the same read name if those were not found). Reads uniquely assigned to one isoform are copied directly into the final read assignment file. Multiple assignments in a block are first filtered: assignments less informative than others (significantly shorter, not in a gene, while others match a gene/isoform) in the block are removed. Additional information that will serve as correction factors in the counting step is added to the read assignments: umicount (the number of reads having this same UMI), ambiguity (the number of isoforms supported by the read), gambiguity (the number of genes/locations supported), while information such as covered_pct (how much of the isoform is covered by the read) and inconsistency (level of inconsistency the alignment shows with the isoform structure) will also be used in providing different types of counts. Finally, the read assignments file is (re)sorted according to genomic location.
Scywalker returns different types of counts as separate fields in the results. These are calculated simultaneously by parsing the read assignments file and keeping a tally for each type of count and cell. A weight is added to the appropriate tallies for each read assignment. For the “weighted” isoform count, a weight of 1/(ambiguity*umicount) is added, while the “unique” count only takes reads into account that uniquely support one transcript, and “strict” only counts unique reads that cover ≥90% of the assigned transcript. Similar results are provided in the “aweighted,” “aunique,” and “astrict” columns, but limited to reads with a detected polyA tail. For the default gene “count” a weight of 1/(gambiguity*umicount) is added, including for reads mapping to introns of the genes (which is also the default for Cell Ranger), while “nicount” only counts reads matching an isoform.
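The weighting scheme for the isoform counts can be sketched in Python as follows. This is a simplified illustration with hypothetical names, not the actual implementation. The text only gives the weight for the "weighted" count explicitly; the sketch assumes that "unique" and "strict" counts are UMI-corrected with a weight of 1/umicount, which is an assumption.

```python
from collections import defaultdict

COUNT_TYPES = ("weighted", "unique", "strict", "aweighted", "aunique", "astrict")


def tally_isoform_counts(assignments, strict_pct=90.0):
    """Tally count types per (cell, isoform) from read assignment records.

    Each record is a dict with the keys cell, isoform, ambiguity, umicount,
    covered_pct, and polya (True if a polyA tail was detected).
    """
    tallies = defaultdict(lambda: dict.fromkeys(COUNT_TYPES, 0.0))
    for a in assignments:
        t = tallies[(a["cell"], a["isoform"])]
        weighted = 1.0 / (a["ambiguity"] * a["umicount"])
        # Assumption: unique reads are corrected for UMI duplicates with 1/umicount.
        unique = 1.0 / a["umicount"] if a["ambiguity"] == 1 else 0.0
        strict = unique if a["covered_pct"] >= strict_pct else 0.0
        t["weighted"] += weighted
        t["unique"] += unique
        t["strict"] += strict
        if a["polya"]:  # same tallies, restricted to reads with a polyA tail
            t["aweighted"] += weighted
            t["aunique"] += unique
            t["astrict"] += strict
    return dict(tallies)


if __name__ == "__main__":
    records = [
        {"cell": "c1", "isoform": "A", "ambiguity": 1, "umicount": 1, "covered_pct": 95, "polya": True},
        {"cell": "c1", "isoform": "A", "ambiguity": 2, "umicount": 1, "covered_pct": 50, "polya": False},
        {"cell": "c1", "isoform": "B", "ambiguity": 2, "umicount": 1, "covered_pct": 50, "polya": False},
    ]
    for key, counts in tally_isoform_counts(records).items():
        print(key, counts)
```

A read that supports two isoforms contributes 0.5 to the weighted count of each, so the weighted counts of a cell still sum to its number of distinct molecules. The gene-level "count" would follow the same pattern with gambiguity in place of ambiguity.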
2.2.3 Scywalker cell filtering and typing
Droplets that contain nuclei are determined based on the EmptyDrops algorithm (Lun et al. 2019). To improve cell clustering, novel genes detected by scywalker are excluded. Cells are clustered at a resolution of 0.5 using Seurat’s K-nearest-neighbor graph-based approach based on the gene expression levels. The algorithm groups cells into communities based on similarity in their feature profiles. A high resolution identifies a relatively large number of smaller clusters, while a low resolution results in fewer clusters that each group together more cells. In our case, the resolution is set slightly lower than the default in Seurat (0.8) to avoid over-clustering (identifying a large number of small communities) in this initial clustering step and to focus only on capturing the large, overarching cell populations in the data. This makes their automatic annotation in the following step more robust.
Each cell is assigned to one of the cell types in a user-provided marker file using ScType (Ianevski et al. 2022) and scSorter (Guo and Li 2021). Users can instead opt to give only the tissue from which the sample originates; in this case, only ScType will be run, using its general (human) marker database. The results are tab-separated group files (one for each typer) that link each cell to a cell type. These are then used together with the per-cell gene and isoform files to generate pseudobulk files giving gene and isoform counts for each cell type (in separate columns). When running on a project with multiple samples, scywalker will make multi-sample, multi-cell-type count tables. Novel isoforms from different samples that have the same junctions but differ at the ends will be merged into one line. The resulting pseudobulk files can be used directly for downstream analysis. Users can also inspect (recommended) and adapt the automatically assigned cell annotations, e.g. to provide a more in-depth annotation based on expert knowledge. After recording adapted cell type annotations in a group file, a user can easily re-run scywalker’s pseudobulk module to obtain pseudobulk matrices for the newly assigned populations.
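The pseudobulk step amounts to summing per-cell counts over all cells assigned to the same cell type. The Python sketch below shows this with in-memory dictionaries; the function name and data layout are hypothetical and do not reflect scywalker's file formats.

```python
from collections import defaultdict


def pseudobulk(cell_counts, cell_types):
    """Sum per-cell counts into one column per cell type.

    cell_counts: {cell_barcode: {feature: count}} for genes or isoforms.
    cell_types:  {cell_barcode: cell_type}, as read from a group file.
    Cells without an assigned type are skipped.
    """
    table = defaultdict(lambda: defaultdict(float))
    for cell, features in cell_counts.items():
        cell_type = cell_types.get(cell)
        if cell_type is None:
            continue
        for feature, count in features.items():
            table[feature][cell_type] += count
    return {feature: dict(columns) for feature, columns in table.items()}


if __name__ == "__main__":
    counts = {
        "c1": {"GENE1": 2, "GENE2": 1},
        "c2": {"GENE1": 3},
        "c3": {"GENE2": 4},
    }
    types = {"c1": "neuron", "c2": "neuron", "c3": "astrocyte"}
    print(pseudobulk(counts, types))
```

Re-running the same function with an adapted cell-to-type mapping corresponds to re-running the pseudobulk module after manual curation of the annotations.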
2.2.4 Scywalker reporting
Alignment metrics from the minimap2 bam file are analyzed using cramino in spliced mode (De Coster and Rademakers 2023), additionally complemented with plots and metrics from identified nuclei and transcript assignment and summarized in an HTML report. The report provides key metrics of the sequencing and cell and gene identification, several interactive plots, such as a knee plot and quality control filters, and information on the read length, including a histogram and a comparison of read lengths against the number of exons detected per read. Furthermore, a histogram shows the number of genes identified per cell and the overlap between reads and known or novel genes, antisense, intergenic, or intronic intervals. Finally, the report includes the UMAP for cell type identification.
2.3 Comparison to existing methods
We compared the scywalker pipeline against the following existing workflows: BLAZE-FLAMES (You et al. 2023; versions v1.1.0 and v0.1, respectively) and the ONT wf-single-cell v0.2.8 (Oxford Nanopore Technologies 2024). Gene level UMI counts for BLAZE-FLAMES were obtained by summing transcript counts per gene. While scywalker also outputs a UMI-corrected total read count table per cell (independent of the transcript or gene level counts), for an accurate comparison of UMI counts per cell between workflows we used cell level UMI counts obtained by summing the gene level UMI counts, which are available in all compared workflows. Performance across workflows was evaluated based on compute time and on the correlation of the UMI counts per cell and per gene with the short-read quantification. For these correlation comparisons, we only considered the UMI counts per cell and per gene if they were higher than 1 in either of the compared workflows. Moreover, we only included the genes commonly available in the gene annotation references (i.e. GTF/GFF3 files) of the compared workflows, excluding rRNA genes from the comparisons for the plant samples. All software was run using standard parameters (code in Supplementary code), except that a maximum edit distance (MAX_DIST) of 1 was used instead of the advised MAX_DIST = 2 setting in the match_cell_barcode step of the FLAMES workflow for the brain sample because of the higher sequencing accuracy. We also tested with MAX_DIST = 2, which resulted in a substantially lower correlation both for the cell UMI counts comparison (R = 0.287 versus R = 0.888, Supplementary Fig. S6) and for the gene/cell UMI counts comparison (R = 0.644 versus R = 0.868, Supplementary Fig. S6). The correlation plots were generated in R (v4.2.0) using the following packages: ggplot2 (Wickham 2016), ggpairs [as implemented in GGally (Schloerke et al. 2024)], and ggpubr (Kassambara 2023).
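The filtering rule used for the correlation comparisons can be made concrete with a short Python sketch. The published analysis was done in R; this version only illustrates the logic. It assumes that a Pearson correlation is computed on untransformed counts and that an entry missing from one workflow counts as 0. Function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def filtered_correlation(short_counts, long_counts, min_count=1):
    """Correlate UMI counts, keeping entries above min_count in either workflow.

    short_counts, long_counts: {key: umi_count}, where a key is a cell barcode
    or a (cell barcode, gene) pair.
    Returns (R, number of entries used).
    """
    keys = sorted(set(short_counts) | set(long_counts))
    pairs = [(short_counts.get(k, 0), long_counts.get(k, 0)) for k in keys]
    pairs = [(x, y) for x, y in pairs if x > min_count or y > min_count]
    xs, ys = zip(*pairs)
    return pearson(xs, ys), len(pairs)


if __name__ == "__main__":
    short = {"a": 2, "b": 4, "c": 6, "d": 1}
    long_ = {"a": 4, "b": 8, "c": 12, "d": 0}
    print(filtered_correlation(short, long_))
```

Entry "d" in the example is dropped because its count does not exceed 1 in either workflow, whereas an entry seen in only one workflow with a count above 1 would be kept and paired with 0.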
2.4 Differential transcript usage between brain cell types
2.4.1 Annotation of brain cell types
Scywalker’s single-cell module creates a Seurat object using the retained cells after EmptyDrops filtering. It then follows the standard Seurat processing workflow with default parameters, except for a slightly lower resolution of 0.5 during clustering. Clusters were then annotated using scSorter and ScType, with the human brain marker genes for the major cell populations (excitatory neurons, inhibitory neurons, astrocytes, microglia, oligodendrocytes, OPCs, endothelial cells, and pericytes, Supplementary Table S2). These markers were selected from a list of literature-based genes based on their segregating capacity, which was determined using Garnett (Pliner et al. 2019). Markers were only used for annotation if they had an ambiguity of <3% in the human brain dataset. Counts were summed per cell population based on these annotations.
2.4.2 Differential transcript usage between neuronal and glial cells
We used the resulting pseudobulk counts (weighted) as a metric to assess scywalker’s performance at the transcript level. Only transcripts detected in at least half of the samples within a given cell type were retained for subsequent analysis. Transcript counts of the four studied samples were summed for each cell type. The plots to visualize the performance were generated using R (v4.3.2) and the following packages: ggplot2 (Wickham 2016), ggpubr (Kassambara 2023), and ggupset. DTU analysis was conducted using the DRIMSeq R package (Nowicka and Robinson 2016), implementing a Dirichlet-multinomial model. We compared neuronal and glial cell types, with the analysis adjusted for individual variations and conditions (Alzheimer’s Disease or neurologically normal). A two-stage statistical procedure with stageR (Van den Berge et al. 2017) accounted for multiple testing. In brief, the DRIMSeq gene P-values were analyzed in a screening stage to determine which genes show signs of DTU (gDTU). In gDTUs, the DRIMSeq transcript-level P-values were individually tested for DTU in a confirmation stage, resulting in a corrected P-value for each gene and each transcript of a significant gene. The differential transcript usage plots were made using the viz_transcripts tool included with scywalker, which is based on ggtranscript (Gustavsson et al. 2022) and ggplot2 (Wickham 2016).
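The quantities that enter the DTU analysis can be illustrated with a small Python sketch: the detection filter (a transcript is kept if detected in at least half of the samples) and the per-gene transcript proportions. The statistical testing itself was done with DRIMSeq and stageR in R and is not reproduced here. Function names are hypothetical, and "detected" is assumed to mean a count above zero.

```python
def detected_in_half(sample_counts):
    """True if the transcript has a count above zero in at least half of the samples."""
    detected = sum(1 for count in sample_counts if count > 0)
    return detected >= len(sample_counts) / 2


def transcript_proportions(gene_counts):
    """Convert transcript counts into within-gene proportions.

    gene_counts: {gene: {transcript: count}} for one group (e.g. neurons).
    Genes with a total count of zero are left out.
    """
    proportions = {}
    for gene, transcripts in gene_counts.items():
        total = sum(transcripts.values())
        if total > 0:
            proportions[gene] = {t: c / total for t, c in transcripts.items()}
    return proportions


if __name__ == "__main__":
    # Made-up counts for one gene with two transcripts in two groups.
    neurons = {"GENE1": {"tx1": 30, "tx2": 10}}
    glia = {"GENE1": {"tx1": 10, "tx2": 30}}
    print(transcript_proportions(neurons))
    print(transcript_proportions(glia))
```

A gene shows differential transcript usage when these proportions differ between groups beyond what sample-to-sample variation explains, which is the question the Dirichlet-multinomial model addresses.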
3 Results
3.1 Overview of the scywalker workflow
Scywalker is an integrated workflow for analyzing long-read single-cell sequencing data, currently tailored to the 10x Genomics microfluidics platform. Scywalker orchestrates a complete workflow from FASTQ to cell-type demultiplexed gene and isoform discovery and quantification. It consists of three main modules (Supplementary Fig. S1).
In the first step, 10x droplet barcodes are detected and assigned to all reads using a novel method: analogous to BLAZE and wf-single-cell, scywalker starts by sorting barcodes according to the number of reads they were found in. Unlike BLAZE and wf-single-cell, which select cell barcodes using a cut-off (based on the expected number of cells), scywalker detects droplet-barcodes (i.e. putative cell barcodes as well as empty droplets) using a stepwise procedure, to be used for the statistical identification of valid cell barcodes (Lun et al. 2019) later after gene quantification. The top barcode is selected as an accurate droplet barcode, and all barcodes with a single nucleotide difference are searched and added to this top barcode, eliminating them from further selection. This process is repeated for the remaining barcodes until the highest-ranked remaining barcode has 20 reads or fewer. As such, the software detects droplet barcodes well into the “empty drops” levels needed by the sophisticated EmptyDrops algorithm (Lun et al. 2019) to determine which barcodes represent real cells at a later stage. As identifying mismatched barcodes is integrated into the barcode detection, the module can immediately create barcoded FASTQ files where each read has its corrected droplet barcode and UMI sequence added.
In the second module, the droplet barcoded reads are aligned to the reference genome, and isoforms (and genes) are first detected and quantified using a method based on IsoQuant bulk analysis (Prjibelski et al. 2023) for the nonorganelle chromosomes. Gene counting for organelles is done separately using a specific method (see Section 2) because organelles often pose problems to standard isoform callers due to the different transcription structure (e.g. polycistronic) and extreme read counts. IsoQuant and organelle transcript identification produce files specifying the assignment of reads (and thus cell barcodes) to isoforms. This data is used together with the isoform output to produce the quantification of isoforms and genes per droplet barcode in the final step of this module. Based on extra information in the read assignment data (including detection of the polyA tail, completeness of the match, and support for more than one isoform), different types of counts are included in the result: weighted (1/n for reads supporting n isoforms), unique (counting only reads uniquely supporting the isoform), strict (counting only unique reads that cover at least 90% of the isoform), and the analogous counts using only reads for which a polyA tail was detected.
Using various approaches initially developed for short-read single-cell analysis, the third module starts by filtering out low-quality and empty droplets from the output of the previous step (Lun et al. 2019), and the filtered results are processed using Seurat (Butler et al. 2018) for normalization, scaling, and clustering. The cell barcodes are automatically assigned to cell types using ScType and scSorter. Finally, pseudobulk counts per cell type are generated based on these assignments. Scywalker can also be used to generate pseudobulk results for user-supplied groupings. When analyzing multiple samples together, scywalker will also make multi-sample, multi-cell type count tables, matching novel isoforms with the same junctions that may have differences at the ends between different samples.
Scywalker supports scalable parallelization. In order to reduce memory use and improve performance, most steps are subdivided into smaller jobs (indicated by overlapping boxes in Supplementary Fig. S1), which are efficiently distributed over different processing cores, either on the same computer or over different computers in a cluster. Scywalker is distributed as a portable application directory that contains all dependencies and runs on any Linux system without installing or setting up environments. This facilitates workflow installation and execution, especially on clusters where root access is not always possible, the external network is not necessarily available, and systems may be heterogeneous.
3.2 Scywalker accurately quantifies cells and genes
We isolated single nuclei from four adult human brains and performed droplet barcoding and cDNA generation using 10x Chromium, aiming to get ∼10 000 nuclei per sample (hereafter referred to as samples brain1 to brain4). The resulting libraries were sequenced on ONT PromethION (P24) and Illumina NovaSeq 6000 (for run details, see Supplementary Table S1). Short-read sequencing data is not needed for the scywalker analysis, but it provides a reference for evaluating the long-read analysis results on the cell and gene count level. We compared the UMI-corrected read counts per cell barcode found by scywalker in the long-read data with those found by Cell Ranger in the short-read data. The correlation between the two (Supplementary Fig. S2) is very high (R = 0.984, 0.993, 0.995, 0.994, respectively for brain1, brain2, brain3, and brain4), showing that scywalker accurately finds cell barcodes and assigns reads to them. Furthermore, scywalker also produces accurate results at the gene count per cell level when compared to the short-read data (Fig. 1). While some genes are only found in either the short-read or the long-read data, these are, with a few exceptions, lowly expressed genes, and the overall correlation between short-read and long-read data is high (R = 0.934, 0.932, 0.931, 0.914, respectively for brain1, brain2, brain3, and brain4).
Figure 1.
Scywalker UMI counts per gene and cell compared to their respective short-read Cell Ranger results for the four human brain samples. Sample-specific Pearson correlation coefficients (R) are shown on the upper left corners of each panel. y-axis, scywalker UMI counts per gene and cell from long-read sequencing data; x-axis, Cell Ranger UMI counts per gene and cell from short-read sequencing data. SRS, short-read sequencing; UMI, unique molecular identifier.
3.3 Comparison to other software
While we had successfully tested the wf-single-cell pipeline provided by ONT [formerly known as Sockeye (Oxford Nanopore Technologies 2024)] on smaller datasets, it could not handle the size of the single-nuclei brain dataset (>10 000 nuclei, >100M reads per sample). As an alternative, we tried the combination of BLAZE (You et al. 2023) for barcode discovery and FLAMES (Tian et al. 2021) for isoform analysis, which ran successfully but took over a week to complete. To properly compare scywalker to both wf-single-cell and BLAZE-FLAMES, and to show that scywalker also works on less accurate data from earlier iterations of the sequencing chemistry and basecaller (Guppy 3.1.5), we downloaded the scmixology2 (GSE154870) dataset, containing single-cell data of a mixture of equal proportions of five human lung adenocarcinoma cell lines sequenced with short and long reads (Tian et al. 2021, You et al. 2021). We used this smaller dataset (target 183 cells, 22M reads) to benchmark the different packages.
Scywalker shows a higher correlation (R = 0.942) with the short read data than wf-single-cell (R = 0.734) and BLAZE-FLAMES (R = 0.65) at the cell identification level (Supplementary Fig. S3). The largest differences for scywalker were cells detected only in the ONT data, while the other two packages mainly missed cells found in the short read data. Moreover, at the level of gene counts per cell, scywalker also shows a higher correlation (R = 0.89) than wf-single-cell (R = 0.765) and BLAZE-FLAMES (R = 0.563) (Supplementary Fig. S4). Scywalker includes downstream analysis, including cell type identification. Scywalker found the expected five clusters and could assign them to the five cell lines using a custom marker set (Supplementary Fig. S5). Using the brain1 dataset for comparison confirmed that the correlation of the cell (Supplementary Fig. S6) and gene quantification (Supplementary Fig. S7) with the short-read results was also higher for scywalker (R = 0.984 and R = 0.934) than for BLAZE-FLAMES (R = 0.888 and R = 0.868). A comparison of the run times for the smaller scmixology2 dataset, tested on a system with a 24-core EPYC 7443P and 512G memory, shows comparable results for the three tools, with BLAZE-FLAMES being the fastest (3h19), scywalker second (3h37), and wf-single-cell the slowest (5h41). The analysis of the larger brain1 sample using scywalker took 14h33, or around four times as long, on the same benchmark system. However, the analysis of this larger dataset using BLAZE-FLAMES took 172h06, or over 50 times as long as for the smaller scmixology2 dataset, demonstrating the improved scalability of scywalker.
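The scaling factors quoted above follow directly from the reported run times, as the short Python calculation below shows. The helper name is hypothetical; the run times are those given in the text.

```python
def to_minutes(run_time):
    """Convert a run time written as '14h33' into minutes."""
    hours, minutes = run_time.split("h")
    return int(hours) * 60 + int(minutes)


# Reported run times: (scmixology2, brain1) per tool.
run_times = {
    "scywalker": ("3h37", "14h33"),
    "BLAZE-FLAMES": ("3h19", "172h06"),
}

for tool, (small, large) in run_times.items():
    ratio = to_minutes(large) / to_minutes(small)
    print(f"{tool}: brain1 took {ratio:.1f} times as long as scmixology2")
```

For scywalker the ratio is about 4.0, and for BLAZE-FLAMES about 51.9.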
3.4 Differential usage of transcripts between brain cell types
Scywalker successfully assigned the different brain nuclei cell types (Fig. 2, Supplementary Fig. S8) for the four brain samples, generated pseudobulk files for both gene and isoform counts, and combined these in multi-sample, multi-celltype gene and isoform count files.
Figure 2.
UMAP plot generated by scywalker based on gene counts, showing cell-type assignments by ScType in different colors for the brain1 sample.
To evaluate the results of scywalker at the isoform level, we first compared the number of identified isoforms across cell types using the pseudobulk counts from the four brain samples. Excitatory neurons exhibited the highest number of identified isoforms, of which 11.3% were novel (i.e. not in the GENCODE V42 reference set), followed by inhibitory neurons and oligodendrocytes (Supplementary Fig. S9A). Endothelial cells, the cell type with the lowest cell count, also exhibited the fewest identified isoforms (Supplementary Fig. S9B). Additionally, we observed a correlation between the number of isoforms and the count of novel isoforms (R = 0.99, P-value = 1.9×10⁻⁵) (Supplementary Fig. S9B). Scywalker was also able to identify multiple transcripts per gene, as shown in Supplementary Fig. S9C. Among the identified isoforms, considered if present in a minimum of 2 samples in at least one cell type (188 806 isoforms), ∼40% were present in at least five distinct cell types, and 26% were neuron-specific (Supplementary Fig. S9D).
Next, we performed a differential transcript usage (DTU) analysis to assess the capability of scywalker to detect isoform expression variation. We compared the proportions of isoforms within each gene between neuronal cell types (excitatory and inhibitory neurons) and glial cell types (oligodendrocytes, astrocytes, OPCs, and microglia). We found 301 genes with DTU (gDTU) (FDR < 0.05), allowing the identification of neuron-specific isoforms of specific genes such as SEPTIN8 (gFDR = 6.16×10⁻⁴⁶, ENST00000378719.7 tFDR = 1.09×10⁻⁵⁷) (Fig. 3A). Septins are known to undergo extensive alternative splicing. In the case of SEPTIN8, it was previously suggested that some of its transcripts are ubiquitously expressed across tissues, while others show most expression in the central nervous system (Hall et al. 2005). Additionally, in 24 other genes, novel isoforms were found to be differentially used between neurons and glia, e.g. for PNKD (gFDR = 2.21×10⁻¹⁶, novel transcript tFDR = 1.38×10⁻⁹, Fig. 3B).
Figure 3.
Transcript proportion and exon representation of two gDTUs as generated by scywalker, (A) for SEPTIN8 and (B) for PNKD. In the left panel, all the analyzed isoforms of the gene are shown, with gray rectangles representing the exons of each transcript and red rectangles indicating all potential exon positions for reference. The right panel illustrates the observed percentage abundance of each transcript within groups (glia or neurons) for each brain sample.
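The quantity compared in a DTU analysis, the proportion of each isoform within its gene, can be illustrated with a minimal Python sketch. The function name, the gene and transcript labels, and the counts below are hypothetical and serve only as an illustration; the sketch computes proportions from pseudobulk counts and does not perform the statistical testing behind the reported FDR values.

```python
from collections import defaultdict


def isoform_proportions(counts):
    """Return each isoform's share of its gene's total count.

    counts: {(gene, transcript): read count} for one group or sample.
    (Hypothetical input layout, chosen for illustration only.)
    """
    gene_totals = defaultdict(float)
    for (gene, _transcript), n in counts.items():
        gene_totals[gene] += n
    return {
        (gene, transcript): (n / gene_totals[gene] if gene_totals[gene] > 0 else 0.0)
        for (gene, transcript), n in counts.items()
    }


# Hypothetical pseudobulk counts for one gene in two groups.
neurons = {("GENE1", "tx_a"): 80, ("GENE1", "tx_b"): 20}
glia = {("GENE1", "tx_a"): 10, ("GENE1", "tx_b"): 30}

neuron_props = isoform_proportions(neurons)
glia_props = isoform_proportions(glia)

# Difference in usage per isoform (neurons minus glia).
usage_shift = {key: neuron_props[key] - glia_props[key] for key in neuron_props}
```

In this toy example, tx_a accounts for most reads of the gene in neurons but only a minority in glia; a shift of this kind, observed consistently across samples, is what a DTU test evaluates.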
3.5 Scywalker extends beyond human nanopore sequencing
To showcase its broad applicability in the analysis of nonhuman species, we also tested scywalker on two large plant single-cell datasets. For this, protoplasts were prepared from mature A. thaliana leaves, followed by the generation of cell-barcoded cDNA using the 10x Chromium, obtaining ∼10 000 cells per sample. The cDNA was subsequently sequenced on both NovaSeq 6000 and ONT PromethION P24. The short-read data (377M reads and 467M reads) was analyzed using Cell Ranger, while the long-read data (220M and 230M reads) was analyzed with scywalker. As for the human brain samples, the short-read (Cell Ranger) and long-read (scywalker) UMI-corrected read counts per cell showed excellent correlation (R = 0.973 and R = 0.988 for plant1 and plant2, respectively) (Supplementary Fig. S10).
Similarly, for the plant leaf samples, the UMI gene counts were compared for all genes (except rRNA genes) in cells that had more than a single count in at least one of the two datasets. Here, the correlation between scywalker and Cell Ranger counts is also high (R = 0.89 and R = 0.898 respectively for plant1 and plant2) (Supplementary Fig. S11), thus demonstrating that scywalker provides reliable single-cell expression metrics in human and plant long-read scRNA-seq samples.
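A comparison of this kind can be sketched in a few lines of Python. The sketch assumes that R denotes Pearson's correlation coefficient and reads the filter as keeping (cell, gene) entries with more than a single count in at least one dataset; both are assumptions, as are the function names and the toy counts. The exclusion of rRNA genes is not shown.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def paired_counts(short_counts, long_counts, min_count=2):
    """Pair counts per (cell, gene), keeping entries that reach
    min_count in at least one of the two datasets (assumed filter)."""
    keys = sorted(set(short_counts) | set(long_counts))
    kept = [
        k for k in keys
        if max(short_counts.get(k, 0), long_counts.get(k, 0)) >= min_count
    ]
    return (
        [short_counts.get(k, 0) for k in kept],
        [long_counts.get(k, 0) for k in kept],
    )


# Hypothetical UMI counts keyed by (cell barcode, gene).
short_read = {("c1", "g1"): 10, ("c1", "g2"): 1, ("c2", "g1"): 4, ("c2", "g2"): 0}
long_read = {("c1", "g1"): 20, ("c1", "g2"): 0, ("c2", "g1"): 8, ("c2", "g2"): 6}

xs, ys = paired_counts(short_read, long_read)
r = pearson_r(xs, ys)
```

Entries absent from one dataset are treated as zero counts, so that a gene detected by only one technology lowers the correlation rather than being silently dropped.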
After data preprocessing and cell gene count estimation, scywalker performed follow-up single-cell analysis, including clustering and assignment of cell types. After providing a custom list of cell-type marker genes (Supplementary Table S3), scywalker successfully assigned major leaf cell types (Fig. 4, Supplementary Fig. S12), including the epidermis (Fig. 4B), mesophyll (Fig. 4C), bundle sheath (Fig. 4D), and other leaf cell types.
Figure 4.
(A) UMAP plot generated by scywalker based on gene counts, showing cell-type assignments by ScType in colors for plant sample 1. (B–D) The UMAP plots show the expression of leaf tissue markers for the epidermis [LTPG1 (Clark and Bohnert 1999)], mesophyll [ESM1 (Zhang et al. 2021)], and the bundle sheath [THA2 (Procko et al. 2022)].
At the transcript level, longer reads allow a more accurate assignment of reads to isoforms and the identification of novel isoforms. Ideally, reads would cover an entire isoform. However, due to various experimental factors (fragmentation, early template switching), this is not the case for most reads (Supplementary Fig. S13). Comparison of plant2 (average read length 967.72 bp) to plant1 (average read length 771.63 bp) clearly illustrates that extracting longer fragments to sequence improves isoform coverage. For plant1, 9.23% of informative reads cover at least 80% of their isoform, rising to 18.63% for plant2.
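The coverage statistic reported above can be expressed as a short Python sketch. The input layout (aligned bases and isoform length per read) and the function name are assumptions for illustration; how scywalker derives these values internally is not described here.

```python
def fraction_near_complete(reads, min_fraction=0.8):
    """Fraction of reads covering at least min_fraction of their isoform.

    reads: iterable of (covered_bases, isoform_length) pairs
    (hypothetical layout, one pair per informative read).
    """
    reads = list(reads)
    if not reads:
        return 0.0
    n_ok = sum(
        1
        for covered, length in reads
        if length > 0 and covered / length >= min_fraction
    )
    return n_ok / len(reads)


# Hypothetical reads: two of the four cover at least 80% of their isoform.
example_reads = [(850, 1000), (500, 1000), (950, 1000), (300, 1500)]
coverage_fraction = fraction_near_complete(example_reads)
```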
While originally developed for nanopore data, the workflow extends to PacBio HiFi data, including those prepared using the MAS-seq method after read segmentation (Al’Khafaji et al. 2024). Using a publicly available HG002 10X Genomics lymphoblastoid cell line dataset from PacBio, we demonstrate that the correlation between PacBio and short-read data is high at both the cell level (R = 0.986, Supplementary Fig. S14) and the per-cell gene-count level (R = 0.886, Supplementary Fig. S15). We also show cell typing results using the publicly available PacBio PBMC dataset (Peripheral Blood Mononuclear Cells, Supplementary Fig. S16).
4 Discussion
Long-read single-cell sequencing is essential for complete transcriptome reconstruction and for the identification of cell-type-specific alternative isoforms. Compared to short-read sequencing, long reads provide a better resolution at the gene level in organisms with incomplete or inaccurate annotation toward the transcript ends. At the transcript level, more reads are unambiguously assigned to a transcript, offering far better chances of detecting and reconstructing novel isoforms.
As a pilot experiment, we sequenced single-nuclei transcriptomes from four human brain samples on the PromethION P24 platform (ONT). In order to obtain a sufficient number of cells from cell types present at lower abundance and to reduce potential sampling errors at that level, we aimed to obtain >10 000 nuclei per sample. The runs generated, on average, >100 million reads per sample. Available analytical workflows did not scale well to this number of cell barcodes and reads. We developed the scywalker workflow, making use of novel algorithms and extensive parallelization, which could handle these data readily, analyzing one sample in <15 h on a single 24-core server, and is further scalable to high throughput by running on a cluster. Scalability is important as the field evolves toward datasets with more cells at higher depths as the technology matures. The initial implementation of the pipeline was tailored to nanopore sequencing of single-cell libraries from the 10× Genomics microfluidics system but also supports PacBio sequencing data, and is sufficiently flexible to be adapted to accommodate technologies that result in a different read structure, if the need arises.
Although long-read sequencing provides more extensive results at the transcript level, short-read sequencing is the current standard for single-cell gene quantification due to its lower cost, higher per-nucleotide accuracy, and well-established analysis tools. Therefore, we also sequenced the same libraries on an Illumina short-read platform to serve as a reference for benchmarking cell detection and gene quantification in the long-read data. We used the correlation between long-read and short-read data as a measure of the accuracy (even though it is probable that the long-read results are more accurate due to less multi-mapping for some genes). Using this approach, we showed that, combined with the latest improvements in nanopore sequencing accuracy, scywalker’s algorithms could generate single-cell gene counts that are very highly correlated (R > 0.9) with results from short reads and that these can also be successfully used for downstream analyses such as cell typing, thus obviating the need for dual sequencing. Even for the older and less accurate iterations of nanopore sequencing data, the correlation for the scywalker results was high (R = 0.89), and the results were usable for downstream analysis. Regarding cell and gene level accuracy, scywalker also outperformed the other tools (in the instances where they could be tested). In order to ensure the general applicability of scywalker outside of human (or even animal) data, we also generated two plant single-cell datasets (A. thaliana), which were analyzed successfully by scywalker, also showing very high correlation with their respective short-read results.
Isoform discovery in scywalker is based on IsoQuant, a proven tool showing good results in bulk long-read RNA-seq analysis (Prjibelski et al. 2023). In scywalker, the initial discovery is performed without taking cell barcodes into account, i.e. on bulk cDNA. This way, isoforms that have low expression in individual cells but are generally present are not discarded due to too low read counts. Based on the read assignments (to isoforms) and their barcodes, gene and isoform counts per cell are generated afterward. Due to experimental issues such as fragmentation, degradation, and premature incorporation of the template-switching oligonucleotide, even in long-read sequencing, most reads do not represent complete isoforms (e.g. Supplementary Fig. S13) and thus cannot always be assigned unambiguously to specific transcripts. Scywalker incorporates several ways of dealing with this ambiguity, calculating counts by (i) using reduced weight for ambiguous read assignments, (ii) counting only uniquely assigned reads, or (iii) counting only (nearly) complete reads; the last option is useful, for example, for assessing actual support for novel isoforms with different ends. One of the main goals of long-read sequencing is comparisons at the transcript level, and scywalker provides considerable utility toward that goal. The workflow includes determining cell types, generating pseudobulk data (per cell type), and creating a multi-sample pseudobulk count matrix. As illustrated in the results, this can be used for downstream analysis, such as differential isoform usage.
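The three counting strategies for ambiguous reads can be contrasted in a minimal Python sketch. This is not scywalker's implementation: the equal split of a read's weight over its candidate isoforms, the 0.9 coverage threshold, the input layout, and all names are assumptions made for illustration.

```python
from collections import defaultdict


def count_isoforms(reads, mode="weighted", min_coverage=0.9):
    """Count reads per isoform under one of three illustrative strategies.

    reads: iterable of (candidate_isoforms, coverage), where coverage is
    the fraction of the isoform spanned by the read (hypothetical layout).
    mode:
      "weighted": every read is split equally over its candidates
                  (assumed weighting scheme).
      "unique":   only reads with exactly one candidate are counted.
      "complete": only reads with coverage >= min_coverage are counted.
    """
    if mode not in {"weighted", "unique", "complete"}:
        raise ValueError(f"unknown mode: {mode}")
    counts = defaultdict(float)
    for candidates, coverage in reads:
        if not candidates:
            continue
        if mode == "unique" and len(candidates) != 1:
            continue
        if mode == "complete" and coverage < min_coverage:
            continue
        weight = 1.0 / len(candidates)
        for isoform in candidates:
            counts[isoform] += weight
    return dict(counts)


# Hypothetical read assignments.
example_reads = [
    (["tx_a"], 0.95),          # unique and nearly complete
    (["tx_a", "tx_b"], 0.40),  # ambiguous fragment
    (["tx_b"], 0.60),          # unique but partial
]

weighted = count_isoforms(example_reads, mode="weighted")
unique = count_isoforms(example_reads, mode="unique")
complete = count_isoforms(example_reads, mode="complete")
```

In this toy example the weighted mode retains information from all three reads, whereas the complete mode keeps only the single read that spans nearly the whole isoform, which is the stricter evidence one would want when judging support for a novel transcript end.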
High-throughput nanopore single-cell sequencing has great potential for uncovering detailed isoform usage across cell types and samples. Scywalker unlocks this potential by providing a highly scalable pipeline capable of handling nanopore samples comprising more than 10 000 cells and over 100 million reads each, while obtaining excellent accuracy in cell identification and gene quantification. Furthermore, scywalker incorporates automated cell-type assignment, pseudobulk isoform count generation, and sample comparison in this single-command end-to-end workflow. We also demonstrated that the results can be directly used to identify differential transcript usage, including the detection of novel transcripts.
Acknowledgements
The authors acknowledge Robin Pottie and Jolien De Block for excellent technical assistance and Kai Xun Chan for discussions. The authors acknowledge Hayden Christensen, Carrie Fisher, and Mark Hamill for inspiring the project. The authors acknowledge the NeuroBiobank of the Born-Bunge Institute (IBB-Neurobiobank, Antwerp), Belgium; ID: BB190113 for providing the brain samples used in this work. We thank the VIB Single Cell Core and VIB Nucleomics core for support and access to the instrument park (vib.be/technologies/).
Contributor Information
Peter De Rijk, Neuromics Support Facility, VIB Center for Molecular Neurology, VIB, Universiteitsplein 1, Antwerp, 2610, Belgium; Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium.
Tijs Watzeels, Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium; Complex Genetics of Alzheimer’s Disease Group, VIB Center for Molecular Neurology, Universiteitsplein 1, 2610, Antwerp, Belgium.
Fahri Küçükali, Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium; Complex Genetics of Alzheimer’s Disease Group, VIB Center for Molecular Neurology, Universiteitsplein 1, 2610, Antwerp, Belgium.
Jasper Van Dongen, Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium; Complex Genetics of Alzheimer’s Disease Group, VIB Center for Molecular Neurology, Universiteitsplein 1, 2610, Antwerp, Belgium.
Júlia Faura, Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium; Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, Universiteitsplein 1, Antwerp, 2610, Belgium.
Patrick Willems, Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, Zwijnaarde, 9052, Belgium; VIB Center for Plant Systems Biology, VIB, Technologiepark 71, Zwijnaarde, 9052, Belgium; Department of Biomolecular Medicine, Ghent University, Corneel Heymanslaan 10, Ghent, 9000, Belgium; VIB Center for Medical Biotechnology, VIB, Technologiepark-Zwijnaarde 75, Ghent, 9052, Belgium.
Lara De Deyn, Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium; Complex Genetics of Alzheimer’s Disease Group, VIB Center for Molecular Neurology, Universiteitsplein 1, 2610, Antwerp, Belgium.
Lena Duchateau, Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium; Complex Genetics of Alzheimer’s Disease Group, VIB Center for Molecular Neurology, Universiteitsplein 1, 2610, Antwerp, Belgium.
Carolin Grones, Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, Zwijnaarde, 9052, Belgium; VIB Center for Plant Systems Biology, VIB, Technologiepark 71, Zwijnaarde, 9052, Belgium.
Thomas Eekhout, Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, Zwijnaarde, 9052, Belgium; VIB Center for Plant Systems Biology, VIB, Technologiepark 71, Zwijnaarde, 9052, Belgium; VIB Single Cell Core, VIB, Technologiepark-Zwijnaarde 71, Ghent, 9052, Belgium.
Tim De Pooter, Neuromics Support Facility, VIB Center for Molecular Neurology, VIB, Universiteitsplein 1, Antwerp, 2610, Belgium; Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium.
Geert Joris, Neuromics Support Facility, VIB Center for Molecular Neurology, VIB, Universiteitsplein 1, Antwerp, 2610, Belgium; Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium.
Stephane Rombauts, Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, Zwijnaarde, 9052, Belgium; VIB Center for Plant Systems Biology, VIB, Technologiepark 71, Zwijnaarde, 9052, Belgium.
Bert De Rybel, Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, Zwijnaarde, 9052, Belgium; VIB Center for Plant Systems Biology, VIB, Technologiepark 71, Zwijnaarde, 9052, Belgium.
Rosa Rademakers, Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium; Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, Universiteitsplein 1, Antwerp, 2610, Belgium.
Frank Van Breusegem, Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark 71, Zwijnaarde, 9052, Belgium; VIB Center for Plant Systems Biology, VIB, Technologiepark 71, Zwijnaarde, 9052, Belgium.
Mojca Strazisar, Neuromics Support Facility, VIB Center for Molecular Neurology, VIB, Universiteitsplein 1, Antwerp, 2610, Belgium; Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium.
Kristel Sleegers, Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium; Complex Genetics of Alzheimer’s Disease Group, VIB Center for Molecular Neurology, Universiteitsplein 1, 2610, Antwerp, Belgium.
Wouter De Coster, Department of Biomedical Sciences, University of Antwerp, Universiteitsplein 1, Antwerp, 2610, Belgium; Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, Universiteitsplein 1, Antwerp, 2610, Belgium.
Author contributions
P.D.R., M.S., W.D.C., R.R., and K.S. conceived the project. P.D.R. devised the algorithms and designed most of the software. T.W. implemented cell filtering and typing, and W.D.C. implemented the report. P.D.R., F.K., and J.V.D. performed benchmarking. P.D.R., T.W., and J.V.D. did testing. F.K. performed comparisons of the different workflows. P.W. analyzed the plant results. T.W. and J.F. analyzed differential transcript usage. J.V.D., L.D.D., and L.D. generated the brain single-nuclei libraries, while C.G., T.E., and B.D.R. generated the plant data, with S.R. and F.V.B. coordinating plant data generation and analysis. T.D.P. and G.J. performed all nanopore sequencing, coordinated by M.S. M.S. also coordinated software development. K.S. coordinated the brain data generation and analysis, acquired funding, and administered the project. W.D.C. coordinated data analysis, writing of the manuscript and administered the project. P.D.R., W.D.C., T.W., F.K., J.F., and P.W. drafted the manuscript with input from M.S. and L.D.D. All authors reviewed and provided feedback on the manuscript.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
W.D.C. and M.S. have received free consumables and travel reimbursements from Oxford Nanopore Technologies. The other authors report no conflict of interest. Oxford Nanopore Technologies supported this work by providing free PromethION flow cells for sequencing the single-cell plant transcriptomes.
Funding
This work was partly funded by the VIB (Flanders Institute for Biotechnology, Belgium), the University of Antwerp, the Alzheimer’s Association [AARG-20-683760], the Alzheimer Research Foundation Flanders [SAO-FRA 2021/0032], the Fund for Scientific Research Flanders [FWO G030718N], the Centres of Excellence in Neurodegeneration [CoEN6005], and the legacy Maesschalk. P.W. and W.D.C. are recipients of a postdoctoral fellowship from Fonds Wetenschappelijk Onderzoek (FWO) [12T1722N and 12ASR24N, respectively]. L.D.D. receives a PhD fellowship [BOF DOCPRO 44697], and F.K. a postdoctoral fellowship [BOF 49758] from the University of Antwerp Research Fund. J.F. receives a Holloway Postdoctoral Fellowship [2022-001] from the Association for Frontotemporal Degeneration (AFTD). This work was supported by the European Research Council [ERC StG TORPEDO; 714055 and ERC CoG PIPELINES; 101043257 to B.D.R. and T.E.], and by an FWO project grant [FWO G007723N to F.V.B.].
Data availability
The human brain dataset is available in the European Genome-Phenome Archive (EGA) under accession EGAS50000000537; access is controlled by the VIB Data Access Committee. The Arabidopsis data is available at EBI ArrayExpress under the accession number E-MTAB-13866. The scmixology2 (GSE154870) dataset was obtained from the Sequence Read Archive under accession identifier SRR12282457 for the short read data and SRR12282458 for the long read data. The public PacBio data from HG002 LCL was obtained from downloads.pacbcloud.com/public/dataset/MAS-Seq/DATA-Revio-Kinnex-HG002-10x5p/1-Sreads/segmented.bam, the corresponding short-read data from downloads.pacbcloud.com/public/dataset/MAS-Seq/DATA-Revio-Kinnex-HG002-10x5p/Illumina, and the public PacBio data from PBMCs was downloaded from downloads.pacbcloud.com/public/dataset/Kinnex-single-cell-RNA/DATA-MAS-SQ2-PBMC_10kcells/1-Sreads/segmented.bam.
References
- Al’Khafaji AM, Smith JT, Garimella KV. et al. High-throughput RNA isoform sequencing using programmed cDNA concatenation. Nat Biotechnol 2024;42:582–6.
- Butler A, Hoffman P, Smibert P. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol 2018;36:411–20.
- Clark AM, Bohnert HJ. Cell-specific expression of genes of the lipid transfer protein family from Arabidopsis thaliana. Plant Cell Physiol 1999;40:69–76.
- Clark MB, Wrzesinski T, Garcia AB. et al. Long-read sequencing reveals the complex splicing profile of the psychiatric risk gene CACNA1C in human brain. Mol Psychiatry 2020;25:37–47.
- De Coster W, Rademakers R. NanoPack2: population-scale evaluation of long-read sequencing data. Bioinformatics 2023;39:btad311.
- Glinos DA, Garborcauskas G, Hoffman P. et al. Transcriptome variation in human tissues revealed by long-read sequencing. Nature 2022;608:353–9.
- Guo H, Li J. scSorter: assigning cells to known cell types according to marker genes. Genome Biol 2021;22:69.
- Gupta I, Collier PG, Haase B. et al. Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cells. Nat Biotechnol 2018;36:1197–202.
- Gustavsson EK, Zhang D, Reynolds RH. et al. ggtranscript: an R package for the visualization and interpretation of transcript isoforms using ggplot2. Bioinformatics 2022;38:3844–6.
- Habib N, Li Y, Heidenreich M. et al. Div-Seq: single-nucleus RNA-Seq reveals dynamics of rare adult newborn neurons. Science 2016;353:925–8.
- Hall PA, Jung K, Hillan KJ. et al. Expression profiling the human septin gene family. J Pathol 2005;206:269–78.
- Heberle BA, Brandon JA, Page ML. et al. Mapping medically relevant RNA isoform diversity in the aged human frontal cortex with deep long-read RNA-seq. Nat Biotechnol 2024.
- Ianevski A, Giri AK, Aittokallio T. et al. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun 2022;13:1246.
- Kassambara A. ggpubr: ‘ggplot2’ Based Publication Ready Plots. 2023. https://rpkgs.datanovia.com/ggpubr/authors.html#citation
- Lebrigand K, Magnone V, Barbry P. et al. High throughput error corrected nanopore single cell transcriptome sequencing. Nat Commun 2020;11:4025.
- Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018;34:3094–100.
- Lun ATL, Riesenfeld S, Andrews T. et al.; Participants in the 1st Human Cell Atlas Jamboree. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol 2019;20:63.
- Nowicka M, Robinson MD. DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics. F1000Res 2016;5:1356.
- Oxford Nanopore Technologies. WF-Single-Cell. 2024. https://github.com/epi2me-labs/wf-single-cell
- Park E, Pan Z, Zhang Z. et al. The expanding landscape of alternative splicing variation in human populations. Am J Hum Genet 2018;102:11–26.
- Philpott M, Watson J, Thakurta A. et al. Nanopore sequencing of single-cell transcriptomes with scCOLOR-seq. Nat Biotechnol 2021;39:1517–20.
- Piwecka M, Rajewsky N, Rybak-Wolf A. et al. Single-cell and spatial transcriptomics: deciphering brain complexity in health and disease. Nat Rev Neurol 2023;19:346–62.
- Pliner HA, Shendure J, Trapnell C. et al. Supervised classification enables rapid annotation of cell atlases. Nat Methods 2019;16:983–6.
- Prjibelski AD, Mikheenko A, Joglekar A. et al. Accurate isoform discovery with IsoQuant using long reads. Nat Biotechnol 2023;41:915–8.
- Procko C, Lee T, Borsuk A. et al. Leaf cell-specific and single-cell transcriptional profiling reveals a role for the palisade layer in UV light protection. Plant Cell 2022;34:3261–79.
- Schloerke B, Cook D, Larmarange J. et al. GGally: Extension to ‘ggplot2’. 2024. https://ggobi.github.io/ggally/authors.html#citation
- Tian L, Jabbari JS, Thijssen R. et al. Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biol 2021;22:310.
- Van den Berge K, Soneson C, Robinson MD. et al. stageR: a general stage-wise method for controlling the gene-level false discovery rate in differential expression and differential transcript usage. Genome Biol 2017;18:151.
- Volden R, Palmer T, Byrne A. et al. Improving nanopore read accuracy with the R2C2 method enables the sequencing of highly multiplexed full-length single-cell cDNA. Proc Natl Acad Sci USA 2018;115:9726–31.
- Wang ET, Sandberg R, Luo S. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 2008;456:470–6.
- Wickham H. ggplot2: Elegant Graphics for Data Analysis. New York: Springer, 2016.
- You Y, Tian L, Su S. et al. Benchmarking UMI-based single-cell RNA-seq preprocessing workflows. Genome Biol 2021;22:339.
- You Y, Prawer YDJ, De Paoli-Iseppi R. et al. Identification of cell barcodes from long-read single-cell RNA-seq with BLAZE. Genome Biol 2023;24:66.
- Zhang T-Q, Chen Y, Wang J-W. et al. A single-cell analysis of the Arabidopsis vegetative shoot apex. Dev Cell 2021;56:1056–74.e8.