Significance
Alternative polyadenylation (APA) is an RNA-processing mechanism that produces transcripts with distinct 3′ ends from the same gene. We developed single-cell polyadenylation sequencing (scPolyA-seq) to characterize APA at single-cell resolution. Interestingly, cell cycle genes were significantly enriched among genes with high variation in polyA site usage, as well as among genes showing a polyA site usage switch after cell synchronization. In particular, by systematically analyzing APA dynamics along the cell cycle, we found multiple gene clusters with different dynamic patterns of polyA site usage, indicating that APA is an important layer of cell cycle regulation. Deletion of one polyA site in several genes, such as MSL1 and SCCPDH, slowed or accelerated cell cycle progression, further supporting the view that APA is an important cell cycle regulator.
Keywords: single-cell RNA-sequencing, single-cell polyadenylation sequencing, alternative polyadenylation, cell cycle, cell cycle synchronization
Abstract
Alternative polyadenylation (APA) plays an important role in posttranscriptional gene regulation, affecting processes such as transcript stability and translation efficiency. However, APA dynamics at the single-cell level remain largely unexplored. Here, we developed single-cell polyadenylation sequencing, a strand-specific approach for sequencing the 3′ end of transcripts, to investigate the landscape of APA at the single-cell level. By analyzing several cell lines, we found that many genes using multiple polyA sites in bulk data tend to use only one polyA site in each single cell. Interestingly, cell cycle genes were significantly enriched among genes with high variation in polyA site usage. Furthermore, the 414 genes showing a polyA site usage switch after cell synchronization were enriched for cell cycle genes, whereas the differentially expressed genes after cell synchronization were not. We further identified 812 genes showing polyA site usage changes between neighboring cell cycle phases, which were grouped into six clusters, with cell phase-specific functional categories enriched in each cluster. Deletion of one polyA site in MSL1 and SCCPDH resulted in slower and faster cell cycle progression, respectively, supporting an important role for the polyA site usage switch in the cell cycle. These results indicate that APA is an important layer of cell cycle regulation.
Polyadenylation is essential for eukaryotic mRNA maturation, and most genes contain multiple polyA sites (1, 2). The process by which a gene uses different polyA sites to generate different transcripts is called alternative polyadenylation (APA). APA produces many transcripts with varying 3′ ends, which expands the diversity of products derived from a single gene. Advances in next-generation sequencing enabled genome-wide mapping of polyA sites, which further revealed the features of polyA sites and the regulation of APA. APA is involved in many aspects of posttranscriptional regulation, such as mRNA maturation, mRNA stability, cellular RNA decay, and the cellular localization of RNA (1–5). APA plays an important role in cell proliferation (6), development (5, 7), neural function (8), immune response (9), and aging (10). Furthermore, recent studies found that tumor cells switch polyA site usage compared with normal cells in various cancers (11–15); in particular, cancer cells prefer to use the proximal polyA site to avoid microRNA-mediated repression (11).
On the other hand, cell-to-cell variation is crucial to various biological processes, including embryonic development, cell homeostasis, and tissue function (16–20). Even seemingly homogeneous cell populations have been reported to display cell-to-cell heterogeneity in response to environmental stimulation (21). Single-cell RNA-sequencing (scRNA-seq), which provides unprecedented resolution for exploring cell-to-cell variation in gene expression, has become an ideal approach for identifying cellular heterogeneity, searching for novel cell types, exploring developmental processes, inferring the regulators underlying various processes, and investigating the tissue microenvironment (16, 20, 22–26). While scRNA-seq focuses on quantifying RNA expression at the single-cell level (16, 22, 27, 28), transcript isoforms, including those produced by APA, have been ignored in most studies. Although several studies investigated polyA sites at the single-cell level (29–32), they either analyzed only cell type-specific APA (31) or only characterized polyA site usage (32). PolyA site usage and its dynamics at the single-cell level remain largely unexplored, partially because of limitations in data quality.
To enhance our understanding of APA at the single-cell level, we developed single-cell polyadenylation sequencing (scPolyA-seq). Using the Fluidigm C1 HT integrated fluidic circuit (IFC) system, we conducted scPolyA-seq on cells from three cell lines, namely MDA-MB-468, HeLa, and mouse embryonic fibroblasts (MEFs). We explored the features of polyA site usage at the single-cell level and APA dynamics during the cell cycle.
Results
scPolyA-seq Generates High-Quality Data.
To develop scPolyA-seq, we modified the template-switching reverse transcription method (Smart-seq2) (33) to capture and sequence the 3′ end of transcripts (Fig. 1A). In brief, primers containing polyT and a cell barcode for labeling the 3′ end of transcripts were added after cell lysis. Full-length cDNA was synthesized by template-switching reverse transcription, then amplified and tagmented with Tn5 transposases. Afterward, DNA fragments at the 3′ end of the cDNA were captured by targeted PCR, and sequencing indices were added for amplification. The libraries were sequenced, and polyA-supporting reads were then mapped to the genome for polyA site identification.
We set out to analyze APA in three cell lines: a triple-negative breast cancer cell line (MDA-MB-468), a cervical adenocarcinoma cell line (HeLa), and MEFs. The Fluidigm C1™ Single-Cell Auto Prep System (Fluidigm, South San Francisco, CA, USA) was used for scPolyA-seq. Cell suspensions of MDA-MB-468 (containing ~10% HeLa cells) and MEFs (containing ~10% synchronized HeLa cells) were loaded into two independent inlets of a C1 HT IFC, so that up to 400 cells could be captured for each cell suspension (Fig. 1B). Each well was carefully examined under a white field microscope (SI Appendix, Fig. S1A). The scPolyA-seq libraries were prepared and sequenced. Reads were demultiplexed to single cells using the cell barcodes and mapped to the reference genome (hg19 or mm10). After filtering and quality control, 557 single cells (222 human cells and 335 MEFs) (Methods and SI Appendix, Fig. S1 B and D) were retained for further analysis. Two bulk libraries were constructed manually from a small number of MDA-MB-468 cells following the scPolyA-seq protocol.
The read counts of the pooled scPolyA-seq data and the bulk data were highly correlated (r = 0.957), suggesting high reproducibility of scPolyA-seq (Fig. 1C). On average, scPolyA-seq generated 1.46 million reads per cell, 96-fold more than 10× Genomics (Student’s t test, P = 1.68 × 10^−77) (Fig. 1D). Furthermore, scPolyA-seq detected 6,379 expressed genes per cell on average, 1.87-fold more than 10× Genomics (Student’s t test, P = 4.33 × 10^−92) (Fig. 1D). As expected, scPolyA-seq generated far fewer reads and detected far fewer genes per cell than Smart-seq2 (SI Appendix, Fig. S1F). Compared with 10× Genomics and Smart-seq2, the reads from scPolyA-seq were more concentrated at the 3′ end of genes (Fig. 1E). MDA-MB-468, unsynchronized HeLa, and synchronized HeLa cells were visualized by uniform manifold approximation and projection (UMAP) (Fig. 1F and SI Appendix, Fig. S2A). Each cell subpopulation exhibited subpopulation-specific expressed genes (Fig. 1G and SI Appendix, Fig. S2B).
Identification and Annotation of PolyA Sites in Single Cells.
We developed scPolyA-pipe for de novo identification of polyA sites using scPolyA-seq data. In brief, scPolyA-pipe performs peak calling on the pooled reads from all cells for identifying the potential polyA sites. The reads on each potential polyA site in each cell were counted for further analyses (SI Appendix, Fig. S3 A and C). Using scPolyA-pipe, we identified 20,222 polyA sites after merging adjacent polyA sites. Motif enrichment analyses of sequences around polyA sites showed hexamer motif A[U/A]UAAA and its variants significantly overrepresented at 20 nt upstream of polyA sites (Fig. 2A and SI Appendix, Fig. S3 D and E), consistent with previous studies (34, 35).
The largest fraction of de novo polyA sites identified by scPolyA-seq coincided with known polyA sites (46.96%), followed by sites in 3′ UTRs (28.63%) and exons (12.18%) (Fig. 2B). More than half of these polyA sites were located within 12 nt of the nearest site in PolyA_DB2 (35) (SI Appendix, Fig. S3 F and G). These polyA sites were assigned to 9,454 genes, of which 43.4% contained at least two polyA sites (Fig. 2C).
Genes Using Multi-PolyA Sites in Bulk Data Are Prone to Use Only One PolyA Site in Each Cell.
About 43.4% of genes used ≥2 polyA sites in our pooled scPolyA-seq data (Fig. 2C). We therefore asked whether the observed multi-polyA site usage is constitutive in single cells or simply results from different cells using different polyA sites. We analyzed the simplest case, in which each gene has one distal polyA site and one proximal polyA site. We measured polyA site usage of each gene across these cells using the distal polyA site usage index (DPAU), similar to PDUI in Xia et al. (11). DPAU ranges from 0 to 1, with higher DPAU representing higher distal polyA site usage. The density of DPAU showed a prominent bimodal curve (Fig. 2D and SI Appendix, Fig. S4A), implying that a gene tends to use only one polyA site in each cell, even if it uses multiple polyA sites in the pooled scPolyA-seq data. Genes with average DPAU near 0 or 1 showed low cell-to-cell variation in polyA site usage, while genes with average DPAU near 0.5 showed high cellular heterogeneity in polyA site usage (Fig. 2E). We defined mono-polyA site usage in a cell as the case in which reads on one polyA site account for >95% of the total reads on the gene. We found that the majority of genes showed a high fraction of mono-polyA site usage (SI Appendix, Fig. S4B). The most significantly enriched GO terms among genes using both polyA sites in a single cell were protein localization to endoplasmic reticulum (P = 7.41 × 10^−4) and endomembrane system organization (P = 2.57 × 10^−3) (SI Appendix, Fig. S4 C and D), potentially indicating that these genes use APA to direct transcripts to different locations.
Variations in APA and Expression Level Are Negatively Correlated.
We further found that variation in DPAU is negatively correlated with gene expression level (Fig. 2F). Genes with high variation in DPAU and low expression were enriched in cell cycle (P = 1.74 × 10^−18), proteolysis involved in cellular protein catabolic process (P = 5.75 × 10^−18), membrane trafficking (P = 2.34 × 10^−16), and DNA repair (P = 3.89 × 10^−16) (Fig. 2G). Genes with low variation in DPAU and high expression were enriched in housekeeping gene-related GO terms, such as oxidative phosphorylation (P = 1.66 × 10^−33), translation (P = 8.13 × 10^−31), and metabolism of RNA (P = 1.17 × 10^−29) (Fig. 2H). We conducted GO analyses on the two cell cycle gene sets in Fig. 2 G and H. “Cell cycle” genes in Fig. 2G were significantly enriched in GO terms associated with cell cycle process and regulation (SI Appendix, Fig. S11A), whereas “cell cycle” genes in Fig. 2H were significantly enriched in DNA repair-associated GO terms (SI Appendix, Fig. S11B). Thus, different subsets of cell cycle-associated genes differ in their APA and expression behaviors. Overall, these results indicate that housekeeping genes have high expression and low APA dynamics, while cell cycle-associated genes have low expression and high APA dynamics.
Relationships between APA and Expression Level.
To evaluate the relationship between expression level and polyA site usage, we indexed the polyA site usage preference of each gene by calculating the generalized distal polyA site usage index (gDPAU) and conducted correlation tests between gDPAU and expression level for each gene. We identified a total of 817 genes showing significant correlations between gDPAU and expression level, among which 222 genes showed positive correlations and 595 genes showed negative correlations (Fig. 3A and Dataset S1). The genes showing positive correlations include NUTF2 and CDKN2B, while the genes showing negative correlations include BLOC1S5 and KRAS (Fig. 3B). Significantly more genes showed negative correlations between gDPAU and expression level than positive correlations (binomial test, P < 10^−16). Therefore, genes tend to use the proximal polyA site when their expression increases, potentially because usage of proximal polyA sites is more efficient than usage of distal polyA sites, and/or because switching to the proximal polyA site may allow escape from miRNA-induced repression (11). Indeed, the expression levels of genes showing negative correlations with gDPAU were significantly higher than those of genes showing positive correlations with gDPAU (Fig. 3C).
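As a quick check on the binomial test reported above, the following sketch computes an exact two-sided binomial P value for 222 positive versus 595 negative correlations under the null hypothesis that both directions are equally likely. It is an illustrative reimplementation using only the Python standard library, not the authors' code (the original statistics were run in R), and the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial P value for k successes in n trials (null p = 0.5).

    With p = 0.5 the distribution is symmetric, so the two-sided P value
    is twice the smaller tail, capped at 1.
    """
    lower = min(k, n - k)
    # Exact integer count of outcomes in the smaller tail.
    tail = sum(comb(n, i) for i in range(lower + 1))
    return min(1.0, 2 * tail / 2 ** n)


n_positive, n_negative = 222, 595  # counts reported in the text
p_value = binomial_two_sided_p(n_positive, n_positive + n_negative)
print(p_value < 1e-16)
```

The integer arithmetic keeps the tail sum exact before the final division, which avoids underflow in the individual terms.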
The genes showing negative correlations between expression level and gDPAU were enriched in metabolism of RNA (P = 6.92 × 10^−27), cell cycle (P = 3.02 × 10^−23), and DNA repair (P = 4.68 × 10^−18) (Fig. 3D), potentially indicating that APA plays an important role in these biological processes. The genes showing no correlation between gDPAU and expression level were enriched in mRNA processing (P = 6.17 × 10^−9), peptide biosynthetic process (P = 6.61 × 10^−9), translation (P = 3.72 × 10^−8), and RNA catabolic process (P = 3.89 × 10^−8) (SI Appendix, Fig. S5 C and D), most of which are housekeeping gene-related GO terms. This further suggests that housekeeping genes are unlikely to be affected by APA. The genes showing positive correlations between gDPAU and expression were enriched in metabolism of RNA (P = 2.33 × 10^−3), Netrin-1 signaling (P = 1.53 × 10^−2), RNA splicing (P = 1.53 × 10^−2), and cellular responses to stress (P = 3.10 × 10^−2) (Fig. 3E). Metabolism of RNA is the most significantly enriched GO term among both the negatively and the positively correlated genes, indicating that polyA site usage and expression are most strongly interdependent for metabolism of RNA-associated genes, whether the correlation is positive or negative.
Changes in PolyA Site Usage and Gene Expression after Cell Synchronization.
We identified 1,019 differentially expressed genes (DEGs) between unsynchronized and synchronized HeLa cells (Fig. 4A and Dataset S2), which were enriched in cellular response to growth factor stimulus (P = 1.05 × 10^−11), response to wounding (P = 2.00 × 10^−9), response to acid chemical (P = 6.61 × 10^−9), cellular response to external stimulation (P = 1.05 × 10^−7), and response to mechanical stimulus (P = 9.77 × 10^−8) (Fig. 4B). We further identified 414 genes showing a significant polyA site usage switch between unsynchronized and synchronized HeLa cells (Fig. 4C and Dataset S3), which were enriched in cell cycle-associated GO terms, including cell cycle mitotic (P = 2.88 × 10^−7) and cell cycle G2/M phase transition (P = 1.41 × 10^−5) (Fig. 4D). For example, IFT20 (Fig. 4E) and UBFD1 (Fig. 4F) are more likely to use the proximal polyA site after cell synchronization, while GRB10 (Fig. 4G) tends to use the distal polyA site after cell synchronization.
There is very little overlap between DEGs and polyA site usage-switched genes (SI Appendix, Fig. S5 A and B), potentially indicating that expression level and APA are regulated by different mechanisms. To better understand their relationship, we examined the DEGs and polyA site usage-switched genes after cell synchronization in one plot (Fig. 4H). We found that the up-regulated genes were more likely to use the proximal polyA sites (P = 6.5 × 10^−3; binomial test), while the down-regulated genes were more likely to use the distal polyA sites (P = 0.04; binomial test) (Fig. 4I), consistent with recent studies (11, 36–39).
Dynamics of PolyA Site Usage Switching along the Cell Cycle.
Each cell was assigned a cell cycle phase based on its cell cycle score (SI Appendix, Fig. S6), similar to Macosko et al. (22). To explore the dynamics of polyA site usage during the cell cycle, we identified genes showing polyA site usage switches between neighboring cell cycle phases in MDA-MB-468 (Dataset S5). Venn diagrams show that the majority of these genes switched polyA site usage in only one or two comparisons (Fig. 5A), potentially indicating that polyA site usage is very dynamic during the cell cycle. These genes were enriched in cell cycle (P = 2.89 × 10^−12), regulation of cellular protein localization (P = 1.05 × 10^−8), and transcriptional regulation by TP53 (P = 1.78 × 10^−7) (Fig. 5B). We further clustered these genes using the correlation of the average gDPAU of each gene across five cell phases, resulting in six gene clusters (Fig. 5C). The genes in the six clusters are high APA variation genes (SI Appendix, Fig. S11 D–F). Each of the six gene clusters was enriched in specific functional GO terms (Fig. 5D), and the polyA site usage dynamics of the six gene clusters during the cell cycle were distinct from each other (Fig. 5E and SI Appendix, Fig. S7A). For example, cluster1 was enriched in transcription-associated genes, which prefer distal polyA sites in the S phase. Cluster2 was enriched in regulation of dephosphorylation genes, which prefer distal polyA sites in the S phase and the M phase. Cluster3 was enriched in DNA repair and DNA replication genes, which prefer distal polyA sites in the S phase and the G2M phase. Cluster4 was enriched in cell division genes, which prefer proximal polyA sites in the S phase and the M phase. Cluster5 was enriched in regulation of cellular protein localization genes, which prefer distal polyA sites in the M phase. Cluster6 was enriched in nucleus organization genes, which prefer proximal polyA sites in the S phase.
The polyA site usage switch is a form of posttranscriptional regulation that can respond quickly to environmental stimulation, which may explain why it plays an important role in the cell cycle (Fig. 5 D and E and Dataset S6). The genes showing a polyA site usage switch during the cell cycle in HeLa were also grouped into six clusters, and the gDPAU of these six clusters showed dynamic patterns similar to those in MDA-MB-468 (SI Appendix, Fig. S7B), indicating that polyA site usage switches during the cell cycle are conserved between different cell lines/types.
Interestingly, we found that the gDPAUs of cluster1 and cluster6 were negatively correlated (SI Appendix, Fig. S7C), indicating that transcription-associated genes and nucleus organization-associated genes show opposite patterns of polyA site usage. In particular, cluster1 and cluster6 reached their maximum and minimum gDPAU, respectively, in the S phase (Fig. 5E). Furthermore, the negative correlation of gDPAUs between cluster3 and cluster5 indicated that DNA repair/replication-associated genes and cellular protein localization-associated genes show opposite patterns of polyA site usage, and the negative correlation of gDPAUs between cluster2 and cluster4 indicated that regulation of dephosphorylation-associated genes and cell division-associated genes also show opposite patterns of polyA site usage (SI Appendix, Fig. S7C). These results indicate that different gene categories have different polyA site usage preferences during the cell cycle.
More Genes Show PolyA Site Usage Changes than Expression Changes during the Cell Cycle.
We further analyzed RNA-seq data from MCF-7 cells synchronized at the G1, S, and M phases (40). We identified 1,235 genes showing a significant polyA site usage switch between G1 and S, which is 4.9 times the number of DEGs between G1 and S (251 genes). In fact, in every pairwise comparison between cell phases, the number of genes showing a polyA site usage switch was much larger than the number of DEGs (P = 0.004, paired t test on log-transformed counts) (Fig. 6A). PolyA site usage-switched gene sets between cell phases were significantly enriched in cell cycle-related terms, such as cell cycle, mitotic cell cycle, chromatin organization, metabolism of RNA, and DNA metabolic process (Fig. 6B), while DEGs were not significantly enriched in cell cycle-related terms. These results support the conclusion that polyA site usage is strongly associated with the cell cycle while expression level is not, consistent with the aforementioned results (Fig. 4 A–D).
Deletion of PolyA Sites Affects Cell Cycle Progression.
We hypothesized that the polyA site usage switch plays an important role in cell cycle progression, given the strong association we found between polyA site usage and the cell cycle. We deleted the distal polyA sites of MSL1, SCCPDH, ZC3HAV1, and NUP160 in MDA-MB-468 cells by CRISPR/Cas9 to analyze their effect on the cell cycle. Wild-type and polyA site-deleted cells were synchronized by arresting them at the G1/S transition with a double thymidine block. The arrested cells were then released, and the cell cycle was analyzed by flow cytometry after propidium iodide (PI) staining. For example, MSL1 uses only the distal polyA site in the S phase (Fig. 6 C and E); deleting the distal polyA site of MSL1 could therefore block the switch from the proximal to the distal polyA site in the S phase and potentially slow cell cycle progression. Indeed, we found slowed cell cycle progression after deletion of the distal polyA site of MSL1 (Fig. 6F). Similarly, deletion of the distal polyA sites of ZC3HAV1 and NUP160 also slowed the cell cycle (SI Appendix, Fig. S8 A and B). On the other hand, SCCPDH uses both the distal and the proximal polyA sites in most phases but only the proximal site in the M phase (Fig. 6 D and E). We found accelerated cell cycle progression after deletion of the distal polyA site of SCCPDH (Fig. 6F), potentially indicating that exclusive use of the proximal polyA site accelerates cell cycle progression.
Discussion
Although the polyA tail has long been targeted for mRNA capture, scRNA-seq has focused on quantifying gene expression while ignoring polyA site usage. In particular, almost all 3′-end scRNA-seq methods sequence the other end of the targeted fragment rather than the polyA end, to avoid potential sequencing errors caused by the polyA stretch. Therefore, it is very difficult to directly identify polyA sites for APA analyses using conventional scRNA-seq. Unlike conventional scRNA-seq, scPolyA-seq produces reads that are much closer to the 3′ end of transcripts and include a high fraction of polyA-supporting reads (SI Appendix, Fig. S3H). This high fraction of polyA-supporting reads greatly facilitates identification of the exact location of polyA sites in the genome, making scPolyA-seq more suitable for APA analyses. Furthermore, scPolyA-seq generated about 96-fold more reads per cell than 10× Genomics on average, which gives scPolyA-seq enough statistical power to detect polyA sites in single cells. The experimental protocol of scPolyA-seq is similar to that of SCRB-seq (41), although scPolyA-seq and SCRB-seq were designed for mapping polyA sites and quantifying gene expression levels, respectively. Compared with SCRB-seq, scPolyA-seq has a much higher fraction of polyA site supporting (PASS) reads (SI Appendix, Fig. S3H), which greatly facilitates APA analyses at the single-cell level. Overall, scPolyA-seq provides an effective approach for analyzing the cellular heterogeneity of APA and an opportunity to explore the relationship between gene expression and polyA site usage.
So far, there has been no systematic study of APA dynamics during the cell cycle, potentially because of technological limitations in detecting polyA site dynamics at single-cell resolution. After finding that cell cycle was the most significantly enriched GO term among genes whose polyA site usage changed between unsynchronized and synchronized HeLa cells, we reasoned that polyA site usage might be highly dynamic during the cell cycle. We analyzed the genes showing a polyA site usage switch between neighboring cell cycle phases and identified six gene clusters showing distinct APA dynamic patterns during the cell cycle. We further found that many more genes showed a polyA site usage switch than an expression-level change during the cell cycle, indicating that the polyA site usage switch is strongly associated with the cell cycle. Changes in cell cycle progression caused by deletion of the distal polyA site of MSL1, SCCPDH, ZC3HAV1, and NUP160 further support the view that the polyA site usage switch plays an important role in cell cycle progression. The polyA site usage switch is a form of posttranscriptional regulation that can respond quickly to environmental stimulation, which may explain why it plays an important role in the cell cycle. Identification of the six gene clusters with specific APA dynamic patterns significantly improved our understanding of APA dynamics. Further analysis showed that the results based on genes with two polyA sites (SI Appendix, Figs. S9 and S10) were essentially consistent with those based on genes with multiple polyA sites, indicating that genes with two or more polyA sites follow similar polyA site usage rules. These findings suggest that polyA site usage could be a good target for cell cycle intervention, which may facilitate therapy for cancers and other diseases.
Materials and Methods
Cell Culture and Cell Synchronization by Double Thymidine Block.
MEFs were prepared following our recent study (42). Cell culture and cell synchronization of the MDA-MB-468, HEK293T, and HeLa cell lines are described in SI Appendix, Materials and Methods.
scPolyA-seq.
We used Fluidigm C1™ Single-Cell Auto Prep System (Fluidigm, South San Francisco, CA, USA) for scPolyA-seq library preparation. Library construction followed the Fluidigm manual and protocol (43), except the steps adapted to scPolyA-seq. In brief, MDA-MB-468, MEF, and HeLa cells were collected and resuspended. Cell suspensions of MEF (mixed with 10% synchronized HeLa cells) and MDA-MB-468 (mixed with 10% HeLa cells) were loaded into two independent inlets of a Fluidigm high-throughput integrated fluidic circuit (HT IFC) to capture single cells, thus up to 400 cells could be captured for each cell suspension.
The C1 Single-Cell mRNA Seq HT Reagent Kit was used for cDNA synthesis. After cell lysis, primers with polyT and a cell barcode were added to obtain full-length cDNA by template-switching reverse transcription, followed by amplification and tagmentation with Tn5 transposases. The Nextera XT DNA Library Preparation Kit (Illumina) was used to construct the scPolyA-seq library. DNA fragments at the 3′ end of the cDNA were enriched by targeted PCR. Libraries from the same columns were pooled and sequenced on an Illumina HiSeq2500 to obtain 150-bp paired-end reads.
scPolyA-seq Data Preprocessing and Quality Control.
The scPolyA-seq data were demultiplexed to single cells using a script provided by Fluidigm (mRNASeqHT_demultiplex.pl). Data quality was checked with FastQC v0.11.7 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc). The number of reads for each cell was essentially consistent with the cell status in each well based on the microscopy check of the IFC (SI Appendix, Fig. S1 A and B). Cells with <3.2 × 10^5 reads were filtered out (SI Appendix, Fig. S1C). Reads were filtered and trimmed using Trim Galore version 0.6.1 (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Reads from each cell were mapped to the human reference genome (hg19) with STAR version 2.5.2b (44). The expression levels of coding genes from GENCODE v30 (45) were quantified with htseq-count (46).
The scPolyA-seq library was strand-specific (SI Appendix, Fig. S1E), which helps avoid ambiguity in overlapping regions of the genome when identifying polyA sites. Cells were removed as potential outliers if their read count or number of expressed genes deviated from the respective median by more than 3 × the median absolute deviation (MAD). Cells were also removed if their proportion of mitochondrial RNA was an outlier. Counts per million (CPM) was used to quantify gene expression levels, since scPolyA-seq only sequences the 3′ end of each transcript (Fig. 1E).
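The outlier rule and the CPM normalization described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. It assumes the raw (unscaled) MAD computed on untransformed values; the original analysis may have used a scaled MAD or log-transformed values, and the function names are hypothetical.

```python
from statistics import median

N_MADS = 3.0  # threshold from the text: 3 x MAD around the median


def mad_outlier_flags(values, n_mads=N_MADS):
    """Return one boolean per value: True if it lies more than n_mads * MAD from the median.

    Assumption: raw (unscaled) MAD on untransformed values.
    """
    values = list(values)
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return [abs(v - center) > n_mads * mad for v in values]


def counts_per_million(gene_counts):
    """Convert a {gene: read count} mapping for one cell into CPM."""
    total = sum(gene_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in gene_counts}
    return {gene: count * 1_000_000 / total for gene, count in gene_counts.items()}


# Toy example: the last cell has far more reads than the others.
reads_per_cell = [1_200_000, 1_500_000, 1_400_000, 1_600_000, 1_300_000, 9_000_000]
print(mad_outlier_flags(reads_per_cell))
print(counts_per_million({"geneA": 30, "geneB": 70}))
```

CPM is used without length normalization because reads pile up at the 3′ end regardless of transcript length.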
Identification of Cell Types and Cell Populations.
We generated scPolyA-seq data for four cell populations, namely MDA-MB-468, MEF, synchronized HeLa, and unsynchronized HeLa. MEFs and synchronized HeLa cells loaded on the same inlet of the IFC were separated by mapping to the hg19_mm10 mega reference genome using STAR (22) and plotting uniquely mapped reads (SI Appendix, Fig. S1D). Synchronized HeLa cells, together with MDA-MB-468 and unsynchronized HeLa cells from the other inlet of the IFC, were projected on UMAP with Seurat 3.1.5 (47, 48) (SI Appendix, Fig. S2).
Identification of PolyA Sites.
We developed a de novo polyA site identification pipeline with Snakemake (49), named scPolyA-pipe, to identify polyA sites from sequencing data (https://github.com/WangJL2021/scPolyA-seq). To accurately locate polyA sites, we selected only PASS reads, which contain either ≥10 consecutive ‘A’s or ≥6 consecutive ‘A’s at the 3′ end, for further analyses (SI Appendix, Fig. S3 A and B). The portion of each PASS read immediately upstream of the consecutive ‘A’s, which potentially corresponds to the sequence preceding the polyA site, was extracted and mapped to the human reference genome (hg19). The uniquely mapped reads from each cell were saved in a BAM file. All BAM files were merged into one file, in which the 3′ end of each read was considered the polyA cleavage position. These polyA cleavage positions were used to infer polyA sites, similar to Hoque et al. (50) (SI Appendix, Fig. S3C). In brief, polyA cleavage positions were merged into one polyA site cluster if they were on the same strand and within 24 nt of each other. If a cluster extended >24 nt, the site with the highest number of cleavage positions was identified, and sites located more than 12 nt from this site were reclustered. This process was repeated until all polyA site clusters spanned <24 nt. We obtained a total of 740,973 polyA sites/clusters, excluding mitochondrial DNA. To eliminate the effect of internal priming, we removed polyA sites with six consecutive As, or with more than 15 As, in the 20-nt region downstream of the cleavage site in the genome. We kept only the polyA sites supported by at least 10% of cells. Finally, 20,222 polyA sites passed quality control and were used for further analyses.
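The iterative clustering and the internal priming filter described above can be sketched as follows. This is an illustrative reading of the text, not the scPolyA-pipe code. It assumes that "within 24 nt of each other" refers to the gap between neighboring cleavage positions, that positions have already been separated by chromosome and strand, and that the downstream sequence is given in transcript orientation. All function names are hypothetical.

```python
MAX_SPAN = 24     # nt: a cluster wider than this is split (from the text)
PEAK_RADIUS = 12  # nt: positions kept around the peak when splitting


def group_by_gap(positions, max_gap=MAX_SPAN):
    """Group sorted positions, starting a new group when the gap exceeds max_gap."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= max_gap:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups


def split_cluster(cluster, counts):
    """Recursively split a sorted cluster until every piece spans at most MAX_SPAN."""
    if cluster[-1] - cluster[0] <= MAX_SPAN:
        return [cluster]
    # Anchor on the position with the most cleavage events.
    peak = max(cluster, key=lambda p: counts[p])
    core = [p for p in cluster if abs(p - peak) <= PEAK_RADIUS]
    rest = [p for p in cluster if abs(p - peak) > PEAK_RADIUS]
    pieces = [core]
    for group in group_by_gap(rest):  # recluster what lies outside the core
        pieces.extend(split_cluster(group, counts))
    return sorted(pieces)


def cluster_cleavage_positions(counts):
    """counts maps cleavage position -> read count for ONE chromosome and strand."""
    clusters = []
    for group in group_by_gap(counts):
        clusters.extend(split_cluster(group, counts))
    return clusters


def is_internal_priming(downstream_20nt):
    """Flag a site whose 20-nt downstream genomic region is A-rich."""
    seq = downstream_20nt.upper()
    return "AAAAAA" in seq or seq.count("A") > 15


toy_counts = {100: 5, 110: 20, 118: 3, 130: 8, 140: 2, 300: 7}
print(cluster_cleavage_positions(toy_counts))
```

In the toy example, the first five positions form one 40-nt cluster that is split around the peak at position 110, while position 300 remains a cluster of its own.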
Features of Identified PolyA Sites.
Sequences of 100 nt upstream and downstream of polyA sites were extracted, and the base frequency was plotted (SI Appendix, Fig. S3E). In order to identify the enriched motifs, the 60 nt upstream sequences of each polyA site extracted by bedtools getfasta (51) were submitted to MEME (52) to discover motifs (SI Appendix, Fig. S3D).
PolyA Site Annotation.
PolyA sites were annotated according to hg19 GENCODE v30. We annotated polyA sites located within ±10 bp of the 3′ end of transcripts as known polyA sites. PolyA sites located in exons, introns, 3′ UTRs, and the 1 kb downstream of 3′ UTRs were annotated as exon, intron, UTR3, and Extended3UTR, respectively. For polyA sites falling into multiple categories, we set the priority as Known pA > 3′UTR > Extended3UTR > Exon > Intron > Intergenic.
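The priority rule for sites that fall into several categories amounts to picking the highest-ranked label. The sketch below illustrates this with hypothetical label strings based on the category names in the text (using "UTR3" for 3′ UTR); it is not taken from the authors' pipeline.

```python
# Priority order from the text, highest first. Label strings are illustrative.
PRIORITY = ["Known pA", "UTR3", "Extended3UTR", "Exon", "Intron", "Intergenic"]
RANK = {label: index for index, label in enumerate(PRIORITY)}


def resolve_annotation(categories):
    """Return the highest-priority category among those a polyA site overlaps.

    A site with no recognized overlap is treated as intergenic.
    """
    hits = [c for c in categories if c in RANK]
    if not hits:
        return "Intergenic"
    return min(hits, key=lambda c: RANK[c])


print(resolve_annotation(["Intron", "UTR3"]))
print(resolve_annotation(["Exon", "Known pA", "Intron"]))
print(resolve_annotation([]))
```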
DPAU Index and gDPAU.
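For genes with two polyA sites, we used DPAU to quantify the fraction of distal polyA site usage for each gene in each cell. E.g., in cell i, the DPAU of gene g can be calculated as:

$$\mathrm{DPAU}_{g,i} = \frac{R^{\mathrm{distal}}_{g,i}}{R^{\mathrm{distal}}_{g,i} + R^{\mathrm{proximal}}_{g,i}}$$

where $R^{\mathrm{distal}}_{g,i}$ and $R^{\mathrm{proximal}}_{g,i}$ are the numbers of reads supporting the distal and the proximal polyA site of gene g in cell i, respectively.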
DPAU ranges from 0 to 1. In each cell, if a gene’s total supporting reads at both sites are less than 5, this gene’s DPAU in this cell will be regarded as noise and noted as NA. Otherwise, this cell will be considered as a valid observation.
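A minimal sketch of the DPAU calculation with the read-count cutoff is given below. It assumes that DPAU is the fraction of a gene's reads in a cell that fall on the distal site, which is consistent with the stated range of 0 to 1 and with higher values meaning higher distal usage. `None` stands in for NA, and the function name is hypothetical.

```python
from typing import Optional

MIN_READS = 5  # cells with fewer supporting reads for the gene are treated as NA


def dpau(proximal_reads: int, distal_reads: int) -> Optional[float]:
    """Distal polyA site usage of one gene in one cell, or None (NA) if too few reads."""
    total = proximal_reads + distal_reads
    if total < MIN_READS:
        return None  # regarded as noise, not a valid observation
    return distal_reads / total


print(dpau(proximal_reads=1, distal_reads=9))
print(dpau(proximal_reads=2, distal_reads=2))
```

Returning `None` rather than 0 keeps low-coverage cells out of downstream means and variances instead of biasing them toward proximal usage.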
For genes with more than two polyA sites, gDPAU was used to quantify the tendency toward distal polyA site usage for each gene. It is the sum of the read-count fractions of a gene’s polyA sites, each weighted by a location index. For example, for a gene with n (n ≥ 2) polyA sites, let p1, p2, …, pn denote the fractions of reads at each site, ordered from the 5′-end to the 3′-end. Then:
$$\mathrm{gDPAU} = \sum_{k=1}^{n} \frac{k-1}{n-1}\, p_k$$
When n = 2, gDPAU = DPAU. In each cell, if the total number of reads supporting all polyA sites of a gene was less than 5, the gene’s gDPAU in that cell was regarded as noise and recorded as NA. Otherwise, the cell was considered a valid observation.
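One weighting that matches this description and reduces to DPAU when n = 2 gives the k-th site (counted from the 5′-end) a weight of (k − 1)/(n − 1). The sketch below assumes that form; the function name is hypothetical, and `None` stands in for NA.

```python
def gdpau(site_reads, min_reads=5):
    """Location-weighted distal usage for a gene with n >= 2 polyA sites.

    site_reads lists the read counts per polyA site, ordered 5' to 3'.
    The weight of the k-th site (1-based) is assumed to be (k - 1) / (n - 1).
    """
    n = len(site_reads)
    if n < 2:
        raise ValueError("gDPAU needs at least two polyA sites")
    total = sum(site_reads)
    if total < min_reads:
        return None  # too few reads: treated as noise (NA)
    # enumerate() is 0-based, so its index k equals (1-based index) - 1.
    return sum((k / (n - 1)) * (reads / total)
               for k, reads in enumerate(site_reads))
```

Under this weighting, a gene whose reads all fall on the most proximal site scores 0, one whose reads all fall on the most distal site scores 1, and `gdpau([3, 9])` returns 0.75, the same value as the two-site DPAU.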
Identifying Changes in APA between Groups.
A chi-squared test was used to assess the significance of gDPAU changes between groups, with the read counts at each polyA site pooled within each group. The Benjamini–Hochberg procedure was used for multiple-comparison correction. If the absolute difference in mean gDPAU between two groups was >0.3 and the adjusted P value was <0.01, the gene’s APA change between the two groups was deemed significant.
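The testing procedure can be sketched with the Python standard library alone. The analyses in this study were run in R, so the code below is an independent illustration rather than the authors' implementation. It assumes a contingency table with one row per group and one column per polyA site, applies no continuity correction (the text does not say whether one was used), and uses hypothetical function names.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-half), 1.0               # closed form for df = 2
    else:
        q, a = math.erfc(math.sqrt(half)), 0.5    # closed form for df = 1
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while 2 * a < df:
        q += half ** a * math.exp(-half) / math.gamma(a + 1)
        a += 1.0
    return min(q, 1.0)


def chi2_test(table):
    """Pearson chi-squared test; rows are groups, columns are polyA sites.
    Assumes every row and column total is positive."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_sums) - 1)
    return stat, chi2_sf(stat, df)


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted, running_min = [0.0] * m, 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # rank of this P value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_significant(mean_gdpau_a, mean_gdpau_b, adjusted_p,
                   min_diff=0.3, alpha=0.01):
    """Apply the two thresholds stated in the text."""
    return abs(mean_gdpau_a - mean_gdpau_b) > min_diff and adjusted_p < alpha
```

In use, `chi2_test` would be called once per gene on the pooled site counts of the two groups, the resulting P values passed together to `bh_adjust`, and each gene then checked with `is_significant`.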
Visualization and Enrichment Analyses.
Reads, polyA sites, and genes across cells were visualized with IGV 2.8.2 (53). Gene Ontology (GO) enrichment analysis was performed with Metascape (54). False discovery rate (FDR) correction was used for multiple-comparison correction.
Statistical Analysis.
Statistical tests were performed and plots were generated using R version 3.6.0 (2019-04-26).
Code Availability.
The source code and scripts for processing scPolyA-seq are maintained in the GitHub code repository: https://github.com/WangJL2021/scPolyA-seq.
Acknowledgments
This study was supported by the National Key Research and Development Program of China (2021YFA0909300, 2021YFF1200900), the National Natural Science Foundation of China (32170646), the Shenzhen Science and Technology Program (KQTD20180411143432337), and the Shenzhen Innovation Committee of Science and Technology (JCYJ20180504170158430, JCYJ20220818100401003, ZDSYS20200811144002008). We thank the Center for Computational Science and Engineering at Southern University of Science and Technology for computational support.
Author contributions
W.J. and T.N. designed research; J.W., W.C., W.Y., and F.R. performed research; J.W., W.C., and H.Z. contributed new reagents/analytic tools; J.W. and W.H. analyzed data; T.N., Y.Q., N.H., and W.J. supervised this study; and J.W. and W.J. wrote the paper.
Competing interests
The authors declare no competing interest.
Footnotes
This article is a PNAS Direct Submission.
Contributor Information
Yuanming Qi, Email: qym@zzu.edu.cn.
Ni Hong, Email: hongn@mail.sustech.edu.cn.
Ting Ni, Email: tingni@fudan.edu.cn.
Wenfei Jin, Email: jinwf@sustech.edu.cn.
Data, Materials, and Software Availability
All raw and processed sequencing data have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE178264. Single-cell RNA-seq data by Smart-seq2 were from GEO under accession no. GSE134311. The RNA-seq datasets of synchronized MCF-7 in G1, S, and M phases were from GEO under accession no. GSE94479.
References
- 1. Derti A., et al., A quantitative atlas of polyadenylation in five mammals. Genome Res. 22, 1173–1183 (2012).
- 2. Wang R., Zheng D., Yehia G., Tian B., A compendium of conserved cleavage and polyadenylation events in mammalian genes. Genome Res. 28, 1427–1441 (2018).
- 3. Gruber A. J., Zavolan M., Alternative cleavage and polyadenylation in health and disease. Nat. Rev. Genet. 20, 599–614 (2019).
- 4. Gruber A. J., et al., A comprehensive analysis of 3′ end sequencing data sets reveals novel polyadenylation signals and the repressive role of heterogeneous ribonucleoprotein C on cleavage and polyadenylation. Genome Res. 26, 1145–1159 (2016).
- 5. Lianoglou S., Garg V., Yang J. L., Leslie C. S., Mayr C., Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression. Genes Dev. 27, 2380–2396 (2013).
- 6. Sandberg R., Neilson J. R., Sarma A., Sharp P. A., Burge C. B., Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science 320, 1643–1647 (2008).
- 7. Ji Z., Lee J. Y., Pan Z., Jiang B., Tian B., Progressive lengthening of 3′ untranslated regions of mRNAs by alternative polyadenylation during mouse embryonic development. Proc. Natl. Acad. Sci. U.S.A. 106, 7028–7033 (2009).
- 8. Miura P., Shenker S., Andreu-Agullo C., Westholm J. O., Lai E. C., Widespread and extensive lengthening of 3′ UTRs in the mammalian brain. Genome Res. 23, 812–825 (2013).
- 9. Jia X., et al., The role of alternative polyadenylation in the antiviral innate immune response. Nat. Commun. 8, 14605 (2017).
- 10. Shen T., et al., Alternative polyadenylation dependent function of splicing factor SRSF3 contributes to cellular senescence. Aging (Albany NY) 11, 1356–1388 (2019).
- 11. Xia Z., et al., Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3′-UTR landscape across seven tumour types. Nat. Commun. 5, 5274 (2014).
- 12. Wang L., et al., Dissecting the heterogeneity of the alternative polyadenylation profiles in triple-negative breast cancers. Theranostics 10, 10531–10547 (2020).
- 13. Ye C., Zhou Q., Hong Y., Li Q. Q., Role of alternative polyadenylation dynamics in acute myeloid leukaemia at single-cell resolution. RNA Biol. 16, 785–797 (2019).
- 14. Mayr C., Bartel D. P., Widespread shortening of 3′UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 138, 673–684 (2009).
- 15. Park H. J., et al., 3′ UTR shortening represses tumor-suppressor genes in trans by disrupting ceRNA crosstalk. Nat. Genet. 50, 783–789 (2018).
- 16. Qin P., et al., Integrated decoding hematopoiesis and leukemogenesis using single-cell sequencing and its medical implication. Cell Discov. 7, 2 (2021).
- 17. Jin W., et al., Genome-wide detection of DNase I hypersensitive sites in single cells and FFPE tissue samples. Nature 528, 142–146 (2015).
- 18. Ren G., et al., CTCF-mediated enhancer-promoter interaction is a critical regulator of cell-to-cell variation of gene expression. Mol. Cell 67, 1049–1058.e1046 (2017).
- 19. Lai B., et al., Principles of nucleosome organization revealed by single-cell micrococcal nuclease sequencing. Nature 562, 281–285 (2018).
- 20. Wang W., Ren G., Hong N., Jin W., Exploring the changing landscape of cell-to-cell variation after CTCF knockdown via single cell RNA-seq. BMC Genomics 20, 1015 (2019).
- 21. Shalek A. K., et al., Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature 510, 363–369 (2014).
- 22. Macosko E. Z., et al., Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
- 23. Paul F., et al., Transcriptional heterogeneity and lineage commitment in myeloid progenitors. Cell 163, 1663–1677 (2015).
- 24. Tirosh I., et al., Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature 539, 309–313 (2016).
- 25. Yu Y., et al., Single-cell RNA-seq identifies a PD-1(hi) ILC progenitor and defines its development pathway. Nature 539, 102–106 (2016).
- 26. Sebe-Pedros A., et al., Cnidarian cell type diversity and regulation revealed by whole-organism single-cell RNA-Seq. Cell 173, 1520–1534.e1520 (2018).
- 27. Jaitin D. A., et al., Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343, 776–779 (2014).
- 28. Klein A. M., et al., Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
- 29. Velten L., et al., Single-cell polyadenylation site mapping reveals 3′ isoform choice variability. Mol. Syst. Biol. 11, 812 (2015).
- 30. Ye C., Lin J., Li Q. Q., Discovery of alternative polyadenylation dynamics from single cell types. Comput. Struct. Biotechnol. J. 18, 1012–1019 (2020).
- 31. Shulman E. D., Elkon R., Cell-type-specific analysis of alternative polyadenylation using single-cell transcriptomics data. Nucleic Acids Res. 47, 10027–10039 (2019).
- 32. Hu Y., et al., Single-cell RNA cap and tail sequencing (scRCAT-seq) reveals subtype-specific isoforms differing in transcript demarcation. Nat. Commun. 11, 5148 (2020).
- 33. Picelli S., et al., Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181 (2014).
- 34. Wang R., Nambiar R., Zheng D., Tian B., PolyA_DB 3 catalogs cleavage and polyadenylation sites identified by deep sequencing in multiple genomes. Nucleic Acids Res. 46, D315–D319 (2018).
- 35. Lee J. Y., Yeh I., Park J. Y., Tian B., PolyA_DB 2: mRNA polyadenylation sites in vertebrate genes. Nucleic Acids Res. 35, D165–D168 (2007).
- 36. Ha K. C. H., Blencowe B. J., Morris Q., QAPA: A new method for the systematic analysis of alternative polyadenylation from RNA-seq data. Genome Biol. 19, 45 (2018).
- 37. Rosonina E., Bakowski M. A., McCracken S., Blencowe B. J., Transcriptional activators control splicing and 3′-end cleavage levels. J. Biol. Chem. 278, 43034–43040 (2003).
- 38. Ji Z., et al., Transcriptional activity regulates alternative cleavage and polyadenylation. Mol. Syst. Biol. 7, 534 (2011).
- 39. Nagaike T., et al., Transcriptional activators enhance polyadenylation of mRNA precursors. Mol. Cell 41, 409–418 (2011).
- 40. Liu Y., et al., Transcriptional landscape of the human cell cycle. Proc. Natl. Acad. Sci. U.S.A. 114, 3473–3478 (2017).
- 41. Soumillon M., Cacchiarelli D., Semrau S., van Oudenaarden A., Mikkelsen T. S., Characterization of directed differentiation by high-throughput single-cell RNA-seq. bioRxiv [Preprint] (2014). 10.1101/003236. Accessed March 05, 2014.
- 42. Chen W., et al., Single-cell transcriptome analysis reveals six subpopulations reflecting distinct cellular fates in senescent mouse embryonic fibroblasts. Front. Genet. 11, 867 (2020).
- 43. DeLaughter D. M., The use of the fluidigm C1 for RNA expression analyses of single cells. Curr. Protoc. Mol. Biol. 122, e55 (2018).
- 44. Dobin A., et al., STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
- 45. Frankish A., et al., GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
- 46. Anders S., Pyl P. T., Huber W., HTSeq–a python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
- 47. Stuart T., et al., Comprehensive integration of single-cell data. Cell 177, 1888–1902.e1821 (2019).
- 48. Zhou B., Jin W., Visualization of single cell RNA-seq data using t-SNE in R. Methods Mol. Biol. 2117, 159–167 (2020).
- 49. Koster J., Rahmann S., Snakemake-a scalable bioinformatics workflow engine. Bioinformatics 34, 3600 (2018).
- 50. Hoque M., et al., Analysis of alternative cleavage and polyadenylation by 3′ region extraction and deep sequencing. Nat. Methods 10, 133–139 (2013).
- 51. Quinlan A. R., BEDTools: The swiss-army tool for genome feature analysis. Curr. Protoc. Bioinformatics 47, 11–34 (2014).
- 52. Bailey T. L., Johnson J., Grant C. E., Noble W. S., The MEME suite. Nucleic Acids Res. 43, W39–W49 (2015).
- 53. Thorvaldsdottir H., Robinson J. T., Mesirov J. P., Integrative genomics viewer (IGV): High-performance genomics data visualization and exploration. Brief Bioinform. 14, 178–192 (2013).
- 54. Zhou Y., et al., Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat. Commun. 10, 1523 (2019).