Abstract
Mutations in GBA1 cause Gaucher disease and are the most important genetic risk factor for Parkinson’s disease. However, analysis of transcription at this locus is complicated by its highly homologous pseudogene, GBAP1. We show that >50% of short RNA-sequencing reads mapping to GBA1 also map to GBAP1. Thus, we used long-read RNA sequencing in the human brain, which allowed us to accurately quantify expression from both GBA1 and GBAP1. We discovered significant differences in expression compared to short-read data and identified currently unannotated transcripts of both GBA1 and GBAP1. These included protein-coding transcripts from both genes that were translated in human brain but lacked the known lysosomal function, yet accounted for almost a third of transcription. Analyzing brain-specific cell types using long-read and single-nucleus RNA sequencing revealed region-specific variations in transcript expression. Overall, these findings suggest nonlysosomal roles for GBA1 and GBAP1 with implications for our understanding of the role of GBA1 in health and disease.
Long-read RNA sequencing uncovers unexpected protein-coding roles for GBA1 and GBAP1, exhibiting tissue and cell type selectivity.
INTRODUCTION
The human genome contains regions that evade comprehensive analysis through short-read sequencing technologies and thus remain poorly studied. While these difficulties can be attributed to challenges with sequencing (e.g., high GC content), they are most commonly the result of duplicated genomic regions (1). This leads to sequencing reads aligning to multiple genomic locations due to a high degree of sequence similarity, a phenomenon known as multimapping. Given that defective gene copies with high sequence similarity to their parent genes, termed pseudogenes, are frequently found in the human genome, this is a common problem (2).
While the impact of multimapping has been investigated in the context of pathogenic variant detection and can cause variants to be “missed” using conventional analyses (3), the effect of multimapping on transcriptomic analyses has received less attention despite the problem being similar in nature (4). This is surprising given the considerable number of genes affected, many of which are implicated in human disease. Short-read RNA sequencing (RNA-seq) has been crucial to our understanding of transcript annotation, gene expression, and its tissue and cell type–specific regulation. However, a major challenge in analyzing these datasets is that reads often cannot be mapped unambiguously to either the parent gene or the pseudogene, which makes it difficult both to annotate parent-pseudogene pairs and to quantify their expression accurately.
Here, we focused on the disease-relevant example of GBA1 and its expressed pseudogene GBAP1. GBA1 encodes glucocerebrosidase (GCase), a lysosomal hydrolase (5) that degrades the glycosphingolipid, glucosylceramide (6). Biallelic mutations in GBA1 result in decreased GCase activity causing Gaucher disease (GD) with glycosphingolipid excess in the brain and soma (7–11). Notably, family members of adults with GD face an increased risk of developing Parkinson’s disease (PD) (12). Furthermore, heterozygous mutations in GBA1 are among the most important genetic risk factors for PD (13–16), contributing to a more rapid progression of motor and nonmotor symptoms (17–21), and they also appear to be important predictors for nonmotor symptom progression after deep brain stimulation surgery in patients with PD (18, 22). Adding to this intricate landscape, it is noteworthy that “mild” and “severe” heterozygous GBA1 mutations exhibit differential impacts on the risk and age at onset (AAO) of PD (23).
To address the limitations of short sequencing reads, which seldom span multiple splice junctions (24), we used long-read RNA-seq to examine human brain regions and induced pluripotent stem cell (iPSC)-derived brain cells in depth. Our focus was on GBA1 and GBAP1, and we discovered significant differences in gene expression compared to short-read RNA-seq. Moreover, we identified a large number of novel transcripts from both genes, including novel protein-coding transcripts. We supported these findings by integrating short-read RNA-seq data, biochemistry, and proteomic data, which validated the novel protein-coding transcripts and confirmed that GBAP1 is translated in cells and human brain. Furthermore, we used both long-read sequencing and annotation-agnostic short-read sequencing data and found that inaccuracies in annotation are common among parent genes. Figure 1 summarizes our analyses.
RESULTS
Pseudogenes are commonly expressed and alternatively spliced across human tissues
We started by quantifying pseudogenes from GENCODE (v38) annotation to investigate their impact on transcriptomic analyses. We identified a total of 14,709 pseudogenes in the human genome (2, 25), which can be divided into processed pseudogenes (n = 10,666) and unprocessed pseudogenes (n = 3565), derived from retrotransposition of processed mRNAs and segmental duplications, respectively (Fig. 2A). To date, 10,370 pseudogenes have been confidently assigned to 3665 unique parent genes (table S1) (26). We found that 734 (20.0%; Fig. 2B) parent genes were linked to 1015 Online Mendelian Inheritance in Man (OMIM) phenotypes, accounting for 17.0% of all OMIM disease genes (https://omim.org/) (27).
To examine pseudogene expression across tissues, we used uniquely mapped short-read RNA-seq data generated by the Genotype-Tissue Expression (GTEx) (28, 29) Consortium (v8, accessed 10 November 2021). We found that 64.7% of pseudogenes are expressed in ≥1 tissue (Fig. 2C) and that, on average, 25.7 ± 2.5% of pseudogenes are expressed per tissue (n = 41; fig. S1). We then assessed the percentage of expressed pseudogenes that are alternatively spliced (>1 transcript expressed) across human brain, heart, and lung samples using publicly available long-read RNA-seq data. On average, we found that 54.8 ± 2.6% of unprocessed pseudogenes and 13.5 ± 3.4% of processed pseudogenes are alternatively spliced (Fig. 2D). Together, this is consistent with the observation that a proportion of pseudogenes are of functional importance (30).
Multimapping results in significant underestimation of GBA1 expression in human brain
We next examined the sequence similarity between pseudogenes and their parent genes as a way to investigate the potential functionality and complicating effects of the widespread expression and alternative splicing of pseudogenes. Our findings revealed that pseudogenes share, on average, 80.0 ± 13.4% sequence similarity with the coding sequence (CDS) of their parent genes (Fig. 3A). As a result, genomic regions containing pseudogenes have the potential to confound transcriptomic analyses in all human tissues for a considerable proportion of protein-coding genes, including many that are causally linked to disease.
To explore this hypothesis in detail, we focused on the parent-pseudogene pair, GBA1-GBAP1 (31). This choice was driven by the following: (i) the high sequence similarity of GBA1-GBAP1 of 96%, which we reasoned would make both genes prone to inaccuracies in gene expression measures and transcript annotation (Fig. 3A); (ii) GBAP1’s broad tissue expression (determined using RNA-seq data provided by GTEx), which means that simply masking its specific genomic region during mapping would be incorrect (fig. S2); and (iii) GBA1 has been extensively studied due to its widely known role in disease, and its pseudogene is well recognized.
We began by studying GBA1 and GBAP1 expression using gene-level measures from human tissues (n = 41) available through GTEx. Counter to previous reverse transcription polymerase chain reaction (RT-PCR)–based quantifications showing that GBA1 is expressed at significantly higher levels than GBAP1 (32), we found GBA1 and GBAP1 expression to be equivalent in many tissues (fig. S3), including the human brain (log2 fold change = 0.9 ± 0.5) (Fig. 3B). We questioned whether this observation could be explained by multimapping reads, which are often discarded in standard processing and so do not contribute to gene-level quantification of expression in many publicly available datasets [e.g., GTEx (28), PsychENCODE (33), and recount3 (34)]. To explore this question, we reanalyzed publicly available short-read RNA-seq of human anterior cingulate cortex samples derived from 18 individuals (n, control = 5, PD, with or without dementia = 13) (35). Using this high-depth dataset [100–base pair (bp) paired-end reads, with a mean depth of 182.9 ± 14.9 million read pairs per sample], we assessed the proportion of reads that uniquely mapped to GBA1. We found that only 41.7 ± 11.2% of all reads mapped to GBA1 were uniquely mapped (fig. S4A), with 96.0 ± 2.0% of multimapped reads also aligning to GBAP1 (fig. S4B). Considering that most reads mapped to GBA1 and GBAP1 are not used for quantification, we concluded that long-read RNA-seq would be required to assess their relative expression. Therefore, we applied direct cDNA Oxford Nanopore sequencing [Oxford Nanopore Technologies (ONT)] to pooled human frontal lobe (n individuals = 26) and hippocampus samples (n individuals = 27) (total library size: 42.7 million and 48.0 million reads, respectively) and found higher expression of GBA1 (numerator) compared to GBAP1 (denominator) (frontal lobe, log2 fold change = 2.3; hippocampus, log2 fold change = 3.1). 
That is, quantification with short-read RNA-seq wrongly estimated the relative expression of this parent-pseudogene pair by a 2- to 3-log2 fold difference (frontal cortex, Grubbs’ test statistic = 3.58, P = 0.03; hippocampus, Grubbs’ test statistic = 4.27, P < 0.01, Grubbs test for one outlier) (Fig. 3C).
Long-read RNA-seq reveals unannotated transcripts for GBA1 and GBAP1 with no dominant transcript in the human brain
The inaccuracies in quantification suggested that high dependence on short-read RNA-seq technologies may have also led to inaccuracies in GBA1 and GBAP1 transcript structures. To address this, we performed targeted Pacific Biosciences (PacBio) isoform sequencing (Iso-Seq) (fig. S5A) on 12 human brain regions. Brain tissue was used because of GBA1’s importance in neurological disease (13–16, 36, 37) and previous evidence suggesting that transcriptome annotation is most incomplete in human brain (38). We used PacBio Iso-Seq because circular consensus sequencing (CCS) gives it >99% base pair accuracy, which, in turn, allows accurate mapping. To ensure that full-length reads were generated from mature mRNA alone, we used high-quality polyadenylated RNA (RNA integrity number > 8) pooled from multiple individuals per tissue (table S2). GBA1 and GBAP1 cDNAs were enriched using biotinylated hybridization probes designed against exonic and intronic genic regions (fig. S6) to ensure that few assumptions were made regarding transcript structure. Collapsing mapped reads resulted in 2368 GBA1 and 3083 GBAP1 unique transcripts, each supported by ≥2 full-length HiFi reads across all samples (fig. S7, A and B). After QC (quality control) and filtering for a minimum of 0.3% transcript usage per sample (equating to a mean of 43.4 to 11,127.2 and 15.4 to 1161.3 full-length HiFi reads for GBA1 and GBAP1, respectively), we identified 32 GBA1 and 48 GBAP1 transcripts (Fig. 4), thus providing the most reliable annotation of GBA1 and GBAP1 transcription to date.
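The usage filter described above can be illustrated with a minimal Python sketch. It assumes that usage is the percentage of a gene's full-length reads assigned to each transcript within a sample, and that a transcript is retained if it reaches the threshold in at least one sample; the text does not state this retention rule, so it is an assumption, as are the function names and the example counts.

```python
def transcript_usage(counts):
    """Percent usage per transcript for one sample.

    counts: dict mapping transcript ID -> full-length read count at one gene.
    """
    total = sum(counts.values())
    return {tx: 100.0 * n / total for tx, n in counts.items()}


def filter_by_usage(samples, min_usage_pct=0.3):
    """Return transcripts reaching min_usage_pct in at least one sample.

    samples: dict mapping sample name -> counts dict (see transcript_usage).
    The "at least one sample" rule is an assumption made for illustration.
    """
    kept = set()
    for counts in samples.values():
        for tx, pct in transcript_usage(counts).items():
            if pct >= min_usage_pct:
                kept.add(tx)
    return kept


# Hypothetical read counts, not taken from the study.
example = {
    "region_1": {"tx_a": 990, "tx_b": 8, "tx_c": 2},
    "region_2": {"tx_a": 500, "tx_b": 499, "tx_c": 1},
}
kept = filter_by_usage(example)  # tx_c never reaches 0.3% and is dropped
```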
Next, we examined the identified transcripts for coding potential, nonsense-mediated decay (NMD) and similarity with the existing annotation from GENCODE to categorize transcripts into the following five categories: (i) coding known (alternate 3′/5′ end), (ii) coding novel, (iii) NMD novel, (iv) noncoding known, and (v) noncoding novel (Materials and Methods and fig. S5B). We noted that 24 of the 32 identified GBA1 transcripts and all 48 identified GBAP1 transcripts were absent from GENCODE (Fig. 4, A and D).
Contrary to the expectation that most protein-coding genes express one dominant transcript (39–41), we did not find a dominant GBA1 or GBAP1 transcript across any of the 12 brain regions sequenced. The most highly expressed GBA1 transcript (PB.845.2786; a full splice match to ENST00000368373) only corresponded to a mean of 38.4 ± 7.6% of total transcription at the locus (Fig. 4B). Although less surprising for a pseudogene, the most highly expressed transcript of GBAP1 (noncoding novel) only corresponded to a mean of 14.0 ± 5.0% of total transcription at the locus (Fig. 4E).
Collectively 25 novel protein-coding transcripts of GBA1 and its pseudogene GBAP1 are identified
We found that of all the coding transcripts detected, 18 GBA1 transcripts had a novel open reading frame (ORF) and 7 GBAP1 transcripts were predicted to encode a protein, despite GBAP1 being classified as a pseudogene (Fig. 4, A and D). Since usage of unannotated 5′ transcription start sites (TSSs) was a common feature of GBA1 and GBAP1 transcripts with novel ORFs (fig. S8), we focused on validating these sites using cap analysis gene expression (CAGE) peaks [defined by FANTOM5 (42, 43)]. We found that, despite the fact that CAGE sequencing only captures the first 20 to 30 nucleotides from the 5′-end (unique mapping only), 57% (n = 4) and 50% (n = 9) of novel GBA1 and GBAP1 5′ TSSs, respectively, were located within 50 bp of CAGE peaks, providing additional confidence in calling of these transcripts. Moreover, we validated all novel ORFs through additional targeted Iso-Seq of GBA1 and GBAP1 in iPSC-derived cortical neurons (n = 6), astrocytes (n = 3), and microglia (n = 3). In summary, we were able to detect GBA1 and GBAP1 transcripts with novel ORFs using a different RNA-seq technology and validate them in an independent dataset.
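As a rough illustration of the proximity check against CAGE peaks, the Python sketch below tests whether a TSS lies within 50 bp of any peak interval. It assumes 1-based inclusive coordinates on the same chromosome and strand and measures distance to the nearest peak edge; the function name and coordinates are hypothetical and this is not the analysis code used in the study.

```python
def near_cage_peak(tss, peaks, max_dist=50):
    """Return True if tss lies within max_dist bp of any CAGE peak.

    tss: 1-based position of the transcription start site.
    peaks: iterable of (start, end) tuples, 1-based inclusive, assumed to be
           on the same chromosome and strand as the TSS.
    A TSS inside a peak has distance 0.
    """
    for start, end in peaks:
        if start - max_dist <= tss <= end + max_dist:
            return True
    return False


def fraction_supported(tss_positions, peaks, max_dist=50):
    """Fraction of TSSs that have a CAGE peak within max_dist bp."""
    hits = sum(near_cage_peak(t, peaks, max_dist) for t in tss_positions)
    return hits / len(tss_positions)


# Hypothetical coordinates for illustration only.
peaks = [(140, 160), (1000, 1020)]
support = fraction_supported([100, 80, 210, 211], peaks)
```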
To explore the coding potential of GBA1 and GBAP1 transcripts with novel ORFs, we used a sequence-based approach along with AlphaFold2 (44) (which accurately predicts GBA1 structure; fig. S9). We focused on the most highly expressed GBA1 (n = 3) and GBAP1 (n = 2) ORFs (Fig. 5, A and B). Although protein isoforms of both genes were predicted to have highly similar tertiary structures at the C terminus, we predicted that all protein products would be unlikely to have GCase activity due to the partial/full loss of key enzymatic sites or the absence of the lysosomal targeting sequence (LIMP-2 interface region; Fig. 5, C to H, and fig. S10) (45, 46). To assess the coding potential of these novel GBA1 and GBAP1 transcripts, we amplified the ORFs and cloned them into a vector with a C-terminal FLAG-tag. We transfected these vectors into H4 cells with homozygous knockout of GBA1 and found translation of all transcripts as detected with both an anti-FLAG antibody and an antibody directed to the conserved C terminus (Fig. 6A and fig. S11). However, none of these transcripts encoded protein isoforms with GCase activity, including those transcribed from GBAP1 (Fig. 6B). We also found no evidence to suggest that these protein isoforms inhibited constitutive GCase activity in H4 parental cells expressing GBA1 (Fig. 6C). Nevertheless, this will require further corroboration by the use of an artificial GCase substrate compatible with live imaging to directly determine GCase activity in the lysosomal compartment [e.g., (47, 48)] and/or the quantification of GCase substrate levels by liquid chromatography–mass spectrometry (LC-MS).
Moreover, immunohistochemical analysis conducted on H4 GBA1 knockout and the H4 parental line, expressing endogenous GBA1, revealed the absence of lysosomal localization for PB.845.525 (GBAP1), PB.845.2627, and PB.845.2629 (both GBA1 isoforms affecting the Glyco_hydro_30 domain + signal peptide). Conversely, some degree of lysosomal localization was observed for PB.845.1693 (GBAP1) and potentially PB.845.2954 (GBA1 isoform affecting signal peptide) (Fig. 6D and fig. S12, A and B).
Notably, only about 20 to 30% of the expressed GCase construct distributes to the lysosome, as opposed to approximately 50 to 60% of endogenous GCase (fig. S12C), which may be attributed to artifactual protein trafficking due to tagging and overexpression. Addressing this caveat and fully understanding the distribution of the different forms will require specific antibodies for probing the putative endogenous proteins and/or LC-MS analysis of organelle-specific proteomes (e.g., LysoIP) to identify specific peptides. Because these antibodies are not currently available and we had no access to such organelle-specific datasets, we interrogated public bulk mass spectrometry datasets of human prefrontal cortex (49) and human embryonic stem cell-derived microglia-like cell lines (hMGLs) (50). Since novel GBA1 isoforms have no unique sequences that differentiate them, we focused on GBAP1 isoforms. We found proteomic support for GBAP1 (PB.845.1693) within the datasets with a protein Q value of <0.01. In particular, we identified the amino acid sequence QWALDGAEYR, which is unique to GBAP1 and was not found when searched against the reviewed UniProt human protein dataset. This shows translation of GBAP1 within the human prefrontal cortex and in hMGLs.
To explore the impact this has on variant interpretation, we conducted an analysis of genetic variants spanning the entire GBA1 gene, encompassing all variants cataloged in ClinVar and the GBA1-PD browser. We discovered that most pathogenic variants are not present in the first two exons of the MANE select transcript (ENST00000368373), despite these data primarily originating from whole-genome sequencing. However, when they are present in these exons, they lead to a more severe phenotype.
These initial exons encode the signal peptide, which plays a critical role in transporting the protein across the membrane of the rough endoplasmic reticulum. Consistent with our data, when these exons are not transcribed, the resulting protein lacks GCase activity. Therefore, variants in these exons appear to be linked to a more pronounced clinical outcome, while those situated in later exons exhibit a broader spectrum of phenotypes, ranging from severe GD to PD risk.
GBA1 and GBAP1 transcripts show cell type selectivity in human brain
We found that novel protein-coding transcripts of GBA1 without predicted GCase activity were common, collectively accounting for between 15.8% (cerebellum) and 31.7% (caudate nucleus) of transcription from the GBA1 locus. Notably, we found that only 48% of transcription in the caudate nucleus was predicted to encode a protein isoform with GCase activity. This high variability in the usage of GBA1 transcripts with novel ORFs across the human brain led us to hypothesize that these transcripts may have high cell type specificity. To test this, we used both 5′ single-nucleus RNA-seq (snRNA-seq) of human dorsolateral prefrontal cortex (DLPFC) and targeted PacBio Iso-Seq of human iPSC-derived brain-relevant cell types. Our analysis revealed cell type–selective differences in the expression of GBA1 and GBAP1 (Fig. 7).
Specifically, we used 5′ snRNA-seq of DLPFC to assess the expression of GBA1 and GBAP1 in various cell types, including astrocytes, excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, and oligodendrocyte precursor cells (OPCs) (Fig. 7A). Our analysis showed an absence of signal at the first exon of PB.845.2888 (GBA1) in microglia, along with an overall lower expression of novel GBA1 transcripts in microglia and OPCs (Fig. 7B). We found that microglia showed significantly lower relative expression of shorter GBA1 ORFs lacking GCase activity (PB.845.2954 and PB.845.2888) compared to neurons or astrocytes, using PacBio Iso-Seq of human iPSC-derived neurons, astrocytes, and microglia (Fig. 7D).
Likewise, our analysis revealed that excitatory neurons had higher expression of GBAP1 ORF transcripts as compared to microglia, using 5′ snRNA-seq of DLPFC (Fig. 7C). Further, using PacBio Iso-Seq of human iPSC-derived neurons, astrocytes, and microglia, we found significant cell type–specific differences in GBAP1 ORF usage, with lower utilization of all GBAP1 ORFs in microglia compared to excitatory neurons and astrocytes (Fig. 7E). Additionally, our profiling of H3K4me3 mark in neurons using CUT&RUN (51) supported transcriptional activity at the 5′ TSSs of GBAP1 ORF transcripts (fig. S13).
Inaccurate annotation is frequent among parent genes across human tissues
We have shown substantial inaccuracies in annotation of the parent gene GBA1. However, we wanted to explore the scope of this problem. To do so, we compared inaccuracies in annotation of all 3665 parent genes compared with other protein-coding genes (including paralogs). Initially, we used public long-read RNA-seq data from 29 samples (n, brain = 9, heart = 16, and lung = 6; table S3) to assess the proportion of transcripts per gene, with at least one novel splice site in the CDS that would result in a novel ORF. Despite a low sequencing depth (mean, 2.2 ± 0.9 million full-length reads per sample), we found a significant increase in such events among parent genes compared to other protein-coding genes (parent genes = 23.9 ± 11.5%; protein-coding genes = 22.7 ± 11.4%; two-sided Wilcoxon rank sum test P < 0.01; Fig. 8A). We extended this analysis to a greater number of samples (n = 7595) and human tissues (n = 41, GTEx) using annotation-agnostic short-read RNA-seq analyses to quantify the proportion of parent genes with evidence of novel splicing (Materials and Methods). On the basis of the identification of novel expressed genomic regions (38) and novel splice site usage, we found that the proportion of genes with incomplete annotation was significantly higher among parent genes compared to other protein-coding genes (novel expression regions: parent genes = 13.9 ± 1.4%; protein-coding genes = 10.8 ± 1.3%; two-sided Wilcoxon rank sum test P < 0.01; Fig. 8B; splice site usage: parent genes = 66.5 ± 3.5%; protein-coding genes = 54.8 ± 4.3%; two-sided Wilcoxon rank sum test P < 0.01; Fig. 8C). This observation was consistent across all tissues analyzed (fig. S14).
DISCUSSION
Here, we show that widespread expression and alternative splicing of pseudogenes in human tissues has limited our understanding of both pseudogene and parent gene transcription with a substantial impact on our appreciation of gene function. Our long-read RNA-seq analysis of the parent gene GBA1 and its pseudogene GBAP1 demonstrated notable diversity in transcription and showed that, contrary to expectation (40, 41), no single transcript dominated expression of either gene in human brain. This analysis involved sequencing of polyA-selected RNA and subsequent QC to mitigate the possibility of nascent RNA inclusion. A substantial portion of transcription from both loci was novel, leading to the identification of novel protein-coding transcripts with tissue- and cell type–specific biases in usage. Together, these findings have a substantial impact on our understanding of the potential mechanisms through which genetic variation at the GBA1-GBAP1 locus could explain phenotypic diversity in GD and modulate disease risk and expressivity in PD.
Although current annotation is known to be incomplete, especially in the brain (38), the extent of transcriptional variety and novelty at parent gene loci was surprising, and particularly so at GBA1. After all, GCase dysfunction has been implicated in human disease since 1965 (6) and mutations in GBA1 have been described since 1987 (11), making GBA1 one of the most studied genes in the genome. Nonetheless, we found that as much as 31.7% of GBA1 transcription in the caudate nucleus may be translated into novel protein isoforms that do not localize in lysosomes and, consequently, lack GCase activity. PD is primarily characterized by the degeneration of dopaminergic neurons in the substantia nigra pars compacta (SNpc). SNpc projects dopamine to the striatum, encompassing the caudate nucleus, which, together with the putamen, constitutes a pivotal part of the basal ganglia. Dysfunction in basal ganglia circuitry is a notable feature of PD, and the connection between reduced GCase activity and PD (13–16) makes these findings even more noteworthy. Moreover, this has implications for variant interpretation. We found that most pathogenic, and risk, variability at GBA1 would also be translated on novel protein isoforms that do not localize in lysosomes and, thus, lack GCase activity. However, those variants that affect the signal peptide, needed for lysosomal localization and GCase function, seemingly cause a more severe disease. Thus, understanding the specific transcript usage within disease-relevant tissues, and how that relates to GCase activity, might also help in understanding divergent phenotype-genotype relationships for both GD and PD.
While most analyses have focused on GBA1-GBAP1, we also demonstrate that inaccuracies in annotation were significantly more common across parent genes as compared to other protein-coding genes (Fig. 8) and were not restricted to this example. High sequence similarity within the genome and subsequent multimapping of short RNA-seq reads have affected our understanding of many genes, including those already causally linked to disease. Such loci are predictable using sequence similarity analyses, the technology to resolve these “problem” loci is available, and the impact on our understanding of disease is likely to be significant. As exemplified by GBA1-GBAP1, our limited understanding of transcription from this locus results in errors in quantification of gene expression and all dependent analysis from differential gene expression in disease to quantitative trait loci detection. Beyond a research setting, inaccuracies in annotation will affect variant interpretation and consequently diagnostic yield for some disease-associated genes. Finally, and perhaps most importantly, inaccuracies in transcript annotation impact on our understanding of gene function. Directed by our long-read RNA-seq results, we have found that some GBAP1 transcripts are more highly expressed in neurons and astrocytes, share a similar predicted three-dimensional (3D) protein structure to GBA1, have protein products that do not localize to lysosomes, and lack GCase activity. Yet, we find robust evidence of translation of such GBAP1 transcripts in human brain using high-throughput mass spectrometry data (49). Extrapolating these findings to GBA1, where mass spectrometry data were uninformative, would suggest a nonlysosomal function for both GBA1 and GBAP1 in brain and particularly in neurons.
We propose that improving our understanding of the molecular functions of parent-pseudogene pairs will become increasingly important to the development and success of RNA-targeting therapies. Accurate annotation is required at the tissue and cell level to design effective antisense oligonucleotides (ASOs) or gene therapies. Furthermore, some pseudogenes may represent particularly high-value therapeutic targets due to their potential to operate as genetic modifiers of Mendelian disorders. Nusinersen, which targets the splicing of former pseudogene SMN2, is a highly successful treatment for spinal muscular atrophy (52). Thus, a deeper understanding of pseudogene function could lead to innovative therapeutic strategies.
Our results suggest that novel GBA1 isoforms, particularly those lacking GCase activity, may contribute to phenotypic diversity in GD and PD. Further experimentation using top-down proteomics to accurately detect and quantify these novel isoforms and molecular biology techniques to investigate their subcellular localization and functional properties would be necessary to validate this hypothesis. Additionally, our results raise the possibility that novel GBA1 transcripts, particularly those lacking GCase activity, may have alternative functions. Further experimentation would be required to definitively establish the functional roles of these novel GBA1 transcripts, but these findings suggest that GBA1 may have a broader range of functions than previously appreciated.
Together, our findings from the GBA1-GBAP1 study demonstrate the need for thorough reexamination of transcription in duplicated genomic regions, such as parent-pseudogene pairs. By using accurate full-length transcript sequencing, we are able to resolve these complex loci with unprecedented detail, leading to novel transcript discovery and, as a result, new insights into the functionality of human diseases.
MATERIALS AND METHODS
Pseudogenes and parental genes
Pseudogene and parent gene annotations
Pseudogene annotations were obtained from GENCODE v38 (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/) (25). We included all HAVANA annotated pseudogenes excluding polymorphic pseudogenes. Biotypes were clustered using the “gene_type” column so that “IG_V_pseudogene,” “IG_C_pseudogene,” “IG_J_pseudogene,” “IG_pseudogene,” “TR,” “TR_J_pseudogene,” “TR_V_pseudogene,” “transcribed_unitary_pseudogene,” “unitary_pseudogene” = “Unitary”; “rRNA_pseudogene,” “pseudogene” = “Other”; “transcribed_unprocessed_pseudogene,” “unprocessed_pseudogene,” “translated_unprocessed_pseudogene” = “Unprocessed”; “processed_pseudogene,” “transcribed_processed_pseudogene,” “translated_processed_pseudogene” = “Processed.” Parent genes have previously been inferred (26) and were obtained from psiCube (http://pseudogene.org/psicube/index.html).
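The biotype clustering above amounts to a lookup table, sketched here in Python for illustration. The sketch assumes standard GENCODE spellings of the “gene_type” values and omits the entry printed only as “TR,” because its full name is not given in the text; the function and variable names are hypothetical.

```python
# Illustrative grouping of GENCODE "gene_type" values into the four classes
# described in the text. Only fully spelled-out biotypes are included.
BIOTYPE_GROUPS = {
    "Unitary": [
        "IG_V_pseudogene", "IG_C_pseudogene", "IG_J_pseudogene",
        "IG_pseudogene", "TR_J_pseudogene", "TR_V_pseudogene",
        "transcribed_unitary_pseudogene", "unitary_pseudogene",
    ],
    "Other": ["rRNA_pseudogene", "pseudogene"],
    "Unprocessed": [
        "transcribed_unprocessed_pseudogene", "unprocessed_pseudogene",
        "translated_unprocessed_pseudogene",
    ],
    "Processed": [
        "processed_pseudogene", "transcribed_processed_pseudogene",
        "translated_processed_pseudogene",
    ],
}

# Invert to a flat gene_type -> group lookup.
GENE_TYPE_TO_GROUP = {
    biotype: group
    for group, biotypes in BIOTYPE_GROUPS.items()
    for biotype in biotypes
}


def classify_pseudogene(gene_type):
    """Return the group for a gene_type, or None if it is not listed
    (for example, polymorphic pseudogenes, which were excluded)."""
    return GENE_TYPE_TO_GROUP.get(gene_type)
```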
Expression analysis from GTEx
Pseudogene and parent gene expression was assessed using median transcript per million (TPM) expression per tissue generated by the GTEx Consortium (v8, accessed on 10 November 2021). As GTEx only uses uniquely mapped reads for expression and multimapping was a concern, expression was assessed as a binary variable. That is, a gene with a median TPM > 0 was considered to be expressed.
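The binary expression call can be sketched as follows in Python. The data layout (a dictionary of median TPM per tissue for each gene), the function names, and the example values are assumptions made for illustration only.

```python
def expressed_tissues(median_tpm_by_tissue):
    """Tissues in which a gene is called expressed (median TPM > 0)."""
    return [t for t, tpm in median_tpm_by_tissue.items() if tpm > 0]


def percent_expressed_in_any_tissue(genes):
    """Percentage of genes expressed in at least one tissue.

    genes: dict mapping gene ID -> dict of tissue -> median TPM.
    """
    n_expressed = sum(1 for tpms in genes.values() if expressed_tissues(tpms))
    return 100.0 * n_expressed / len(genes)


# Hypothetical median TPM values, not taken from GTEx.
example = {
    "pseudogene_1": {"brain": 0.0, "heart": 1.2},
    "pseudogene_2": {"brain": 0.0, "heart": 0.0},
    "pseudogene_3": {"brain": 3.4, "heart": 0.5},
    "pseudogene_4": {"brain": 0.0, "heart": 0.0},
}
pct = percent_expressed_in_any_tissue(example)
```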
For quantitative expression of GBA1 and GBAP1, we used RNA-seq data for 17,510 human samples originating from 54 different human tissues (GTEx, v8) that were downloaded using the R package recount (v1.4.6) (53). Cell lines, sex-specific tissues, and tissues with 10 or fewer samples were removed. Samples with large chromosomal deletions and duplications or large copy number variation previously associated with disease were filtered out (smafrze != “EXCLUDE”). For any log2 fold change calculations, GBA1 is the numerator and GBAP1 is the denominator.
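To make the direction of the fold change explicit, the short Python sketch below computes log2(GBA1/GBAP1) per sample and summarizes it as mean and standard deviation. The function names and example values are hypothetical, and rejecting non-positive expression values is a choice made here, not a step described in the text.

```python
import math
import statistics


def log2_fold_change(gba1_expr, gbap1_expr):
    """log2 of GBA1 (numerator) over GBAP1 (denominator) expression."""
    if gba1_expr <= 0 or gbap1_expr <= 0:
        # Assumption for this sketch: zero or negative values are rejected.
        raise ValueError("expression values must be positive")
    return math.log2(gba1_expr / gbap1_expr)


def summarize(pairs):
    """Mean and sample standard deviation of per-sample log2 fold changes.

    pairs: list of (GBA1 expression, GBAP1 expression), one per sample.
    """
    values = [log2_fold_change(a, b) for a, b in pairs]
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical per-sample expression values for illustration only.
mean_lfc, sd_lfc = summarize([(8.0, 2.0), (4.0, 2.0), (2.0, 2.0)])
```

A positive value means GBA1 is more highly expressed than GBAP1 in that sample.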
Alternative splicing analysis using long-read RNA-seq
To identify alternative splicing of pseudogenes, we used publicly available long-read RNA-seq data from ENCODE (https://www.encodeproject.org/rna-seq/long-read-rna-seq/) (54). We included 29 samples from brain (n = 9), heart (n = 16), and lung (n = 6). A description of the samples included can be found in table S2. All samples were sequenced on the PacBio Sequel II platform and processed with the ENCODE DCC deployment of the TALON pipeline (v2.0.0; https://github.com/ENCODE-DCC/long-read-rna-pipeline) (55).
OMIM data
Phenotype relationships and clinical synopses of all OMIM genes were downloaded using the application programming interface through https://omim.org/api (accessed 14 April 2022) (27). Parent genes were annotated as OMIM morbid if they were listed as causing a Mendelian phenotype.
Sequence similarity
Sequence similarity of parent genes and pseudogenes has previously been calculated by Pei et al. (2) and is available through The Pseudogene Decoration Resource (psiDr; http://www.pseudogene.org/psidr/similarity.dat; accessed 14 April 2022). We compared the sequence similarity of parent and pseudogenes considering the CDS of parent genes.
Multimapping from short-read RNA-seq
Multimapping rates of parent genes, including GBA1 and GBAP1, were investigated in human anterior cingulate cortex samples previously reported by Feleke et al. (35). Here, we used control individuals (n = 5) and individuals with PD with or without dementia (n = 13). Adapter trimming and read quality filtering were performed with default options using Fastp (v0.23.2; RRID:SCR_016962) (56), with QC metrics generated using both Fastp and FastQC (v0.11.9; RRID:SCR_014583). Alignment to the GRCh38 genome using GENCODE v38 was performed using STAR (v2.7.10; RRID:SCR_004463) (57). ENCODE standard options for long RNA-seq were used with STAR, except for alignSJDBoverhangMin, outSAMmultNmax, and outFilterMultimapNmax. outFilterMultimapNmax sets the maximum number of loci to which a read is allowed to map; as a conservative estimate, we set this to 10, half the ENCODE standard. outSAMmultNmax was set to −1, which allowed multimapped reads to be kept in the same output SAM/BAM file. The QC and alignment processes were performed using a nextflow (58) pipeline. BAM files were sorted and indexed using Samtools (v1.14; RRID:SCR_002105) (59) and filtered in R (v4.0.5; RRID:SCR_001905) for reads overlapping the GBA1 or GBAP1 locus, using GenomicRanges (v1.42.0; RRID:SCR_000025) (60) and Rsamtools (v2.6.0). Only paired first mate reads on the correct strand (minus for both GBA1 and GBAP1) were selected. The “NH” tag, which provides the number of alignments for a read, was also extracted from each alignment record. The CIGAR string of the read was used to provide a width of the reads relative to the reference by adding operations that consume the reference together. Reads were then filtered, using dplyr (v1.0.9; RRID:SCR_016708) (61) and tibble (v3.1.6) (61), with this new width to leave reads that aligned completely within the GBA1 and GBAP1 loci. Reads were then split between unique alignment and multimapping alignments based on the NH tag.
The percentage of reads [uniquely mapped/(uniquely mapped + multimapped)] that mapped uniquely to either the GBA1 or GBAP1 locus was then calculated. Additionally, for reads that multimapped to the GBA1 or GBAP1 locus, the read name was extracted and searched for within the reads that multimapped to the alternate locus (i.e., read names from reads that multimapped to the GBA1 locus were searched against read names for reads that multimapped to the GBAP1 locus). This provided the percentage of reads aligning to GBA1 that also aligned elsewhere and the percentage that also aligned to GBAP1. Code and commentary can be found here: https://github.com/Jbrenton191/GBA_multimapping_2022.
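The two percentages can be sketched in a few lines of Python. This is a simplified illustration rather than the published R code (available at the repository above): the function names are hypothetical, and the choice of the multimapped reads at the first locus as the denominator for the shared fraction is an assumption.

```python
def percent_unique(nh_values):
    """Percentage of reads with NH == 1 among all reads at a locus.

    nh_values: one NH tag value (number of reported alignments) per read.
    """
    unique = sum(1 for nh in nh_values if nh == 1)
    multimapped = sum(1 for nh in nh_values if nh > 1)
    total = unique + multimapped
    return 100.0 * unique / total if total else float("nan")


def percent_shared(multimapped_names_a, multimapped_names_b):
    """Percentage of reads multimapping at locus A whose names also appear
    among the reads multimapping at locus B.

    Assumption: the denominator is the set of multimapped reads at locus A.
    """
    names_a = set(multimapped_names_a)
    names_b = set(multimapped_names_b)
    if not names_a:
        return float("nan")
    return 100.0 * len(names_a & names_b) / len(names_a)
```

Matching on read names works because a multimapped read keeps the same name in every alignment record that STAR reports for it.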
Oxford Nanopore direct cDNA sequencing
Samples
Human poly A+ RNA derived from the frontal lobe and hippocampus of healthy individuals who died of sudden death/trauma was purchased commercially from Clontech (table S2).
Direct cDNA sequencing
A total of 100 ng of poly A+ RNA per sample was used for initial cDNA synthesis and subsequent library preparation according to the direct cDNA sequencing (SQK-DCS109) protocol described in detail at protocols.io (dx.doi.org/10.17504/protocols.io.yxmvmkpxng3p/v1). Sequencing was performed on the PromethION using one R9.4.1 flow cell per sample and base-called using Guppy (v4.0.11; ONT, Oxford, UK). Resulting fastq files were processed through a Snakemake pipeline “pipeline-isoforms-ONT-stringtie” [https://github.com/egustavsson/pipeline-ref-isoforms-ONT.git (DOI: https://doi.org/10.5281/zenodo.11091676)]. Gene abundances were calculated using the -A parameter in StringTie (v2.2.2; RRID:SCR_016323) (62). Data are available and deposited in the Gene Expression Omnibus under accession GSE215459.
Comparing short-read quantification versus long-read quantification
For each sample in GTEx, a log2 fold change was calculated with GBA1 as the numerator and GBAP1 as the denominator across frontal lobe and hippocampus. Shapiro-Wilk normality test in each tissue was used to confirm a normal distribution. To compare against ONT long-read quantification, we used Grubbs’ test (maximum normalized residual test) for a single outlier.
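The two quantities involved can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code: the function names are hypothetical, the sample standard deviation is used, and the comparison of the Grubbs statistic against its critical value (which requires the t distribution) is omitted.

```python
import math
from statistics import mean, stdev


def log2_fold_change(gba1_expression: float, gbap1_expression: float) -> float:
    """log2 ratio with GBA1 as the numerator and GBAP1 as the denominator."""
    return math.log2(gba1_expression / gbap1_expression)


def grubbs_statistic(values):
    """Grubbs' statistic G = max|x - mean| / SD for a single outlier.

    Returns (G, index of the most extreme value). The value at that index
    is the candidate outlier; whether it is significant depends on a
    critical value that is not computed in this sketch.
    """
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    deviations = [abs(v - m) for v in values]
    index = max(range(len(values)), key=deviations.__getitem__)
    return deviations[index] / s, index
```

One plausible use, assumed here, is to append the long-read log2 fold change to the GTEx values for the same tissue and check whether it is the value flagged as the most extreme.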
PacBio targeted Iso-Seq
Samples
Human brain samples: Human poly A+ RNA derived from the caudate nucleus, cerebellum, cerebral cortex, corpus callosum, dorsal root ganglion, frontal lobe, hippocampus, medulla oblongata, pons, spinal cord, temporal lobe, and thalamus of healthy individuals who died of sudden death/trauma was purchased commercially from Clontech (table S2).
iPSC, neuroepithelial, neural progenitor, cortical neuron, astrocyte, and microglia cells: Control iPSCs consisted of the previously characterized lines Ctrl1 (63), ND41866 (Coriell), RBi001 (EBiSC/Sigma-Aldrich), and SIGi1001 (EBiSC/Sigma-Aldrich) as well as the isogenic line previously generated (64). Reagents were purchased from Thermo Fisher Scientific unless otherwise stated. iPSC lines were grown in Essential 8 medium on Geltrex substrate and passaged using 0.5 M EDTA. Cortical neurons were differentiated using dual SMAD inhibition for 10 days (10 μM SB431542 and 1 μM dorsomorphin, Tocris) in N2B27 medium before maturation in N2B27 alone (65). Day 100 ± 5 days was taken as the final time point. Astrocytes were generated following a similar neural induction protocol until day 80 before repeatedly passaging cortical neuronal inductions in fibroblast growth factor 2 (FGF2) (10 ng/ml, PeproTech) to enrich for astrocyte precursors. At day 150, to generate mature astrocytes, a 2-week maturation consisted of bone morphogenetic protein 4 (BMP4) (10 ng/ml, Thermo Fisher Scientific) and leukemia inhibitory factor (LIF) (10 ng/ml, Sigma-Aldrich) (66). To induce inflammatory conditions, astrocytes were stimulated with tumor necrosis factor–α (TNF-α) (30 ng/ml, PeproTech), interleukin-1α (IL-1α) (3 ng/ml, PeproTech), and C1q (400 ng/ml, Merck) (67). iPSC-microglia were differentiated following the protocol of Xiang et al. (68). Embryoid bodies were generated using 10,000 iPSCs, and myeloid differentiation was initiated in Lonza X-VIVO 15 medium, IL-3 (25 ng/ml, PeproTech), and macrophage colony-stimulating factor (MCSF) (100 ng/ml, PeproTech).
Microglia released from embryoid bodies were harvested weekly from 4 weeks and matured in Dulbecco’s modified Eagle’s medium (DMEM)–F12 supplemented with 2% insulin/transferrin/selenium, 1% N2 supplement, 1× GlutaMAX, 1× nonessential amino acids, and insulin (5 ng/ml) supplemented with IL-34 (100 ng/ml, PeproTech), MCSF (25 ng/ml, PeproTech), and transforming growth factor–β1 (TGF-β1) (5 ng/ml, PeproTech). A final 2-day maturation consisted of CX3CL1 (100 ng/ml, PeproTech) and CD200 (100 ng/ml, 2B Scientific). Inflammation was stimulated with lipopolysaccharide (10 ng/ml, Sigma-Aldrich). Total RNA was extracted using the Qiagen RNeasy kit according to the manufacturer’s protocol, with β-mercaptoethanol added to buffer RLT and with a deoxyribonuclease (DNase) digestion step included.
cDNA synthesis
A total of 250 ng of RNA was used per sample for reverse transcription. Two different cDNA synthesis approaches were used: (i) human brain cDNA was generated by SMARTer PCR cDNA synthesis (Takara) and (ii) cDNA from iPSC-derived cell lines was generated using the NEBNext Single Cell/Low Input cDNA Synthesis & Amplification Module (New England Biolabs). For both reactions, sample-specific barcoded oligo dT (12 μM) with PacBio 16-mer barcode sequences was added (table S3).
SMARTer PCR cDNA synthesis: First-strand synthesis was performed as per manufacturer instructions, using sample-specific barcoded primers instead of the 3′ SMART CDS Primer II A. We used a 90-min incubation to generate full-length cDNAs. cDNA amplification was performed using a single primer (5′ PCR Primer II A from the SMARTer kit, 5′-AAGCAGTGGTATCAACGCAGAGTAC-3′), which was used for all PCR reactions after reverse transcription. We followed the manufacturer’s protocol with our determined optimal number of 18 cycles for amplification; this was used for all samples. We used a 6-min extension time to capture longer cDNA transcripts. PCR products were purified separately with 1× ProNex Beads.
NEBNext single-cell/low-input cDNA synthesis and amplification module: A reaction mix of 5.4 μl of total RNA (250 ng in total), 2 μl of barcoded primer, and 1.6 μl of deoxynucleotide triphosphate (25 mM) was held at 70°C for 5 min. This reaction mix was then combined with 5 μl of NEBNext Single Cell RT Buffer, 3 μl of nuclease-free H2O, and 2 μl of NEBNext Single Cell RT Enzyme Mix. The reverse transcription mix was then placed in a thermocycler at 42°C with the lid at 52°C for 75 min and then held at 4°C. On ice, we added 1 μl of Iso-Seq Express Template Switching Oligo and then placed the reaction mix in a thermocycler at 42°C with the lid at 52°C for 15 min. We then added 30 μl of elution buffer (EB) to the 20-μl Reverse Transcription and Template Switching reaction (for a total of 50 μl), which was then purified with 1× ProNex Beads and eluted in 46 μl of EB. cDNA amplification was performed by combining the eluted Reverse Transcription and Template Switching reaction with 50 μl of NEBNext Single Cell cDNA PCR Master Mix, 2 μl of NEBNext Single Cell cDNA PCR Primer, 2 μl of Iso-Seq Express cDNA PCR Primer, and 0.5 μl of NEBNext Cell Lysis Buffer.
cDNA capture using IDT xGen Lockdown probes
We used the xGen Hyb Panel Design Tool (https://eu.idtdna.com/site/order/designtool/index/XGENDESIGN) to design nonoverlapping 120-mer hybridization probes against GBA1 and GBAP1. We removed any probes overlapping repetitive sequences (RepeatMasker) and reduced the density of probes mapping to intronic regions to 0.2, which corresponds to 1 probe per 1.2 kb. In the end, our probe pool consisted of 119 probes, of which 54 were targeted toward GBA1 and 65 were targeted toward GBAP1.
We pooled an equal mass of barcoded cDNA for a total of 500 ng per capture reaction. Pooled cDNA was combined with 7.5 μl of Cot DNA in a 1.5-ml LoBind tube. We then added 1.8× of ProNex beads to the cDNA pool with Cot DNA, gently mixed the reaction mix 10 times (using a pipette), and incubated for 10 min at room temperature. After two washes with 200 μl of freshly prepared 80% ethanol, we removed any residual ethanol and immediately added 19 μl of hybridization mix consisting of 9.5 μl of 2× Hybridization Buffer, 3 μl of Hybridization Buffer Enhancer, 1 μl of xGen Asym TSO block (25 nmol), 1 μl of polyT block (25 nmol), and 4.5 μl of 1× xGen Lockdown Probe pool. The PacBio targeted Iso-Seq protocol is described in detail at protocols.io (dx.doi.org/10.17504/protocols.io.n92ld9wy9g5b/v1).
Automated analysis of Iso-Seq data using Snakemake
For the analysis of targeted PacBio Iso-Seq data, we created two Snakemake (69) (v5.32.2; RRID:SCR_003475) pipelines to analyze targeted long-read RNA-seq robustly and systematically:
APTARS (Analysis of PacBio TARgeted Sequencing; https://github.com/sid-sethi/APTARS): For each SMRT cell, two files were required for processing: (i) a subreads.bam and (ii) a FASTA file with primer sequences, including barcode sequences.
Each sequencing run was processed by ccs (v5.0.0; RRID:SCR_021174; https://ccs.how/), which combines multiple subreads of the same SMRTbell molecule to produce one highly accurate consensus sequence, also called a HiFi read (≥Q20). We used the following parameters: --minLength 10 --maxLength 50000 --minPasses 3 --minSnr 2.5 --maxPoaCoverage 0 --minPredictedAccuracy 0.99.
Identification of barcodes, demultiplexing, and removal of primers were then performed using lima (v2.0.0; https://lima.how/) invoking --isoseq --peek-guess.
Isoseq3 (v3.4.0; https://github.com/PacificBiosciences/IsoSeq) was then used to (i) remove polyA tails and (ii) identify and remove concatemers/chimeric reads, with the following parameters: refine --require-polya --log-level DEBUG. This was followed by clustering and polishing with the following parameters: cluster flnc.fofn clustered.bam --verbose --use-qvs.
Reads with predicted accuracy ≥0.99 were aligned to the GRCh38 reference genome using minimap2 (70) (v2.17; RRID:SCR_018550) with -ax splice:hq -uf --secondary=no. Samtools (59) (RRID:SCR_002105; http://www.htslib.org/) was then used to sort and filter the output SAM for the locus of the gene of interest, as defined in config.yml.
We used cDNA_Cupcake (v22.0.0; https://github.com/Magdoll/cDNA_Cupcake) to (i) collapse redundant transcripts, using collapse_isoforms_by_sam.py (--dun-merge-5-shorter) and (ii) obtain read counts per sample, using get_abundance_post_collapse.py followed by demux_isoseq_with_genome.py.
Isoforms detected were characterized and classified using SQANTI3 (71) (v4.2; https://github.com/ConesaLab/SQANTI3) in combination with GENCODE (v38) comprehensive gene annotation. An isoform was classified as full splice match (FSM) if it aligned with reference genome with the same splice junctions and contained the same number of exons, incomplete splice match (ISM) if it contained fewer 5′ exons than reference genome, novel in catalog (NIC) if it is a novel isoform containing a combination of known donor or acceptor sites, or novel not in catalog (NNC) if it is a novel isoform with at least one novel donor or acceptor site.
PSQAN (Post Sqanti QC Analysis; https://github.com/sid-sethi/PSQAN): Following transcript characterization from SQANTI3, we applied a set of filtering criteria to remove potential genomic contamination and rare PCR artifacts. We removed an isoform if (i) the percent of genomic “A’s” in the downstream 20-bp window was more than 80% (“perc_A_downstream_TTS” > 80), (ii) one of the junctions was predicted to be template switching artifact (“RTS_stage” = TRUE), or (iii) it was not associated with the gene of interest. Using SQANTI’s output of ORF prediction, NMD prediction, and structural categorization based on comparison with the reference annotation (GENCODE), we grouped the identified isoforms into the following categories: (i) noncoding novel—if predicted to be noncoding and not a full-splice match with the reference; (ii) noncoding known—if predicted to be noncoding and a full-splice match with the reference; (iii) NMD novel—if predicted to be coding and NMD, and not a full-splice match with the reference; (iv) NMD known—if predicted to be coding and NMD, and a full-splice match with the reference; (v) coding novel—if predicted to be coding and not NMD, and not a full-splice match with the reference; (vi) coding known (complete match)—if predicted to be coding and not NMD, and a full-splice and untranslated region match with the reference; and (vii) coding known (alternate 3′/5′ end)—if predicted to be coding and not NMD, and a full-splice match with the reference but with an alternate 3′ end, 5′ end, or both 3′ and 5′ end.
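The filtering and grouping rules above can be summarized in a short Python sketch. It is an illustration only and not the PSQAN implementation: the function names, argument names, and category strings are hypothetical, and the inputs stand in for the corresponding SQANTI3 output fields.

```python
def is_filtered_out(perc_a_downstream_tts, rts_stage, gene, genes_of_interest):
    """Apply the three removal criteria to one isoform.

    perc_a_downstream_tts: percent genomic A's in the downstream 20-bp window.
    rts_stage: True if a junction is a predicted template-switching artifact.
    """
    return (
        perc_a_downstream_tts > 80
        or bool(rts_stage)
        or gene not in genes_of_interest
    )


def categorize(coding, nmd, full_splice_match, utr_match=False):
    """Assign one of the seven transcript categories described in the text."""
    status = "known" if full_splice_match else "novel"
    if not coding:
        return f"noncoding {status}"
    if nmd:
        return f"NMD {status}"
    if not full_splice_match:
        return "coding novel"
    if utr_match:
        return "coding known (complete match)"
    return "coding known (alternate 3'/5' end)"
```

Ordering the checks as coding status, then NMD, then splice match mirrors the way the categories are nested in the description, so each isoform falls into exactly one group.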
Given a transcript T in sample i with FLR as the number of full-length reads mapped to the transcript T, we calculated the normalized full-length reads (NFLRTi) as the percentage of total transcription in the sample:
$$\mathrm{NFLR}_{Ti} = \frac{\mathrm{FLR}_{Ti}}{\sum_{t=1}^{M} \mathrm{FLR}_{ti}} \times 100$$
where NFLRTi represents the normalized full-length read count of transcript T in sample i, FLRTi is the full-length read count of transcript T in sample i, and M is the total number of transcripts identified to be associated with the gene after filtering. Finally, to summarize the expression of a transcript associated with a gene, we calculated the mean of normalized full-length reads (NFLRTi) across all the samples:
$$\mathrm{NFLR}_{T} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{NFLR}_{Ti}$$
where NFLRT represents the mean expression of transcript T across all samples and N is the total number of samples. To remove low-confidence isoforms arising from artifacts, we only selected isoforms fulfilling the following three criteria: (i) expression of minimum 0.1% of total transcription per sample, i.e., NFLRTi ≥ 0.1; (ii) a minimum of 80% of total samples passing the NFLRTi threshold; and (iii) expression of minimum 0.3% of total transcription across samples, i.e., NFLRT ≥ 0.3.
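The normalization and the three selection criteria can be sketched in Python as follows. This is a minimal illustration, not the PSQAN code: the function names and the dictionary-based data layout are assumptions, and a transcript absent from a sample is treated as having zero reads.

```python
def normalized_flr(flr_by_transcript):
    """Convert full-length read counts for one sample into percentages of
    the total full-length reads assigned to the gene in that sample."""
    total = sum(flr_by_transcript.values())
    return {t: 100.0 * count / total for t, count in flr_by_transcript.items()}


def mean_nflr(per_sample_nflr, transcript):
    """Mean normalized expression of one transcript across all samples."""
    values = [sample.get(transcript, 0.0) for sample in per_sample_nflr]
    return sum(values) / len(values)


def passes_filters(per_sample_nflr, transcript,
                   min_per_sample=0.1, min_sample_percent=80, min_mean=0.3):
    """Apply criteria (i) to (iii) to one transcript."""
    values = [sample.get(transcript, 0.0) for sample in per_sample_nflr]
    n_passing = sum(1 for v in values if v >= min_per_sample)
    # Integer comparison avoids floating-point issues at exactly 80%.
    enough_samples = n_passing * 100 >= min_sample_percent * len(values)
    return enough_samples and mean_nflr(per_sample_nflr, transcript) >= min_mean
```

Because each sample is normalized to its own total, the thresholds are comparable across samples with very different sequencing depths.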
Visualizations of transcripts
Transcript structures were visualized using ggtranscript (72) (v0.99.03; https://github.com/dzhang32/ggtranscript), an R package that we recently developed to extend ggplot2 (61) (v3.3.5; RRID:SCR_014601) for visualizing transcript structure and annotation.
CAGE-seq analysis
To assess whether predicted 5′ TSSs of novel transcripts were in proximity of CAGE peaks, we used data from the FANTOM5 dataset (42, 43). CAGE is based on “cap trapping”: capturing capped full-length RNAs and sequencing only the first 20 to 30 nucleotides from the 5′-end. CAGE peaks were downloaded from the FANTOM5 project (https://fantom.gsc.riken.jp/5/datafiles/reprocessed/hg38_latest/extra/CAGE_peaks/hg38_liftover+new_CAGE_peaks_phase1and2.bed.gz; accessed 20 May 2022).
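A proximity check of this kind could look like the sketch below. The window size is not stated in this section, so the 50-bp default is purely a placeholder, and the function name and the tuple representation of peaks are assumptions. Peaks are assumed to be on the same chromosome and strand as the TSS and to be given as closed intervals.

```python
def near_cage_peak(tss, peaks, window=50):
    """Return True if a TSS lies within `window` bp of any CAGE peak.

    tss: genomic position of the predicted transcription start site.
    peaks: iterable of (start, end) closed intervals on the same
           chromosome and strand as the TSS.
    window: hypothetical distance threshold in bp (placeholder value).
    """
    return any(start - window <= tss <= end + window for start, end in peaks)
```

A TSS inside a peak satisfies the condition trivially; the window only matters for start sites that fall just outside a peak boundary.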
Single-nucleus RNA-seq
Nuclei extraction of cortical postmortem tissue
Postmortem brain tissue from control individuals with no known history of neurological or neuropsychiatric symptoms was acquired from the Cambridge Brain Bank (ethical approval from the London-Bloomsbury Research Ethics Committee, REC reference: 16/LO/0508). Brains were bisected in the sagittal plane with one-half flash-frozen and stored at −80°C and the other half fixed in 10% neutral buffered formalin for 2 to 3 weeks. From the flash-frozen blocks, 50 to 100 mg were sampled from the DLPFC (Brodmann area 46) and stored at −80°C until use.
Nuclei were isolated as previously described (73), with minor modifications. Approximately 20 μg of −80°C-conserved tissue was thawed and dissociated in ice-cold lysis buffer [0.32 M sucrose, 5 mM CaCl2, 3 mM MgAc, 0.1 mM Na2EDTA, 10 mM tris-HCl (pH 8.0), 1 mM dithiothreitol (DTT)] using a 1-ml glass dounce tissue grinder (Wheaton). The homogenate was slowly and carefully layered on top of a sucrose layer [1.8 M sucrose, 3 mM MgAc, 10 mM tris-HCl (pH 8.0), 1 mM DTT] in centrifuge tubes to create a gradient and then centrifuged at 15,500 rpm for 2 hours and 15 min. After centrifugation, the supernatant was removed and the pellet was softened for 10 min in 100 μl of nuclear storage buffer [15% sucrose, 10 mM tris-HCl (pH 7.2), 70 mM KCl, 2 mM MgCl2] before resuspension in 300 μl of dilution buffer [10 mM tris-HCl (pH 7.2), 70 mM KCl, 2 mM MgCl2, Draq7 1:1000]. The suspension was then filtered (70-μm cell strainer) and sorted via fluorescence-activated cell sorting (FACS) (FACSAria III, BD Biosciences) at 4°C at a low flow rate, using a 100-μm nozzle [pipette tips and Eppendorf tubes for transferring nuclei were precoated with 1% bovine serum albumin (BSA)]. Nuclei (8500) were sorted for snRNA-seq and then loaded onto the Chromium Next GEM Single Cell 5′ Kit (10x Genomics, PN-1000263). Sequencing libraries were generated with unique dual indices (TT set A) and pooled for sequencing on NovaSeq 6000 (Illumina) using a 100-cycle kit and 28-10-10-90 reads.
snRNA-seq analysis
Raw base calls were demultiplexed to obtain sample-specific FASTQ files using Cell Ranger mkfastq and default parameters (v6; 10x Genomics; RRID:SCR_017344). Reads were aligned to the GRCh38 genome assembly using Cell Ranger count (v6; 10x Genomics; RRID:SCR_017344) with default parameters (--include-introns was used for nuclei mapping) (74). Nuclei were filtered based on the number of genes detected: nuclei with fewer genes than the mean minus one SD or more genes than the mean plus two SDs were discarded to exclude low-quality nuclei or possible doublets. The data were normalized using the centered log ratio (CLR) to reduce sequencing depth variability. Clusters were defined with the Seurat function FindClusters (v4.1.1; RRID:SCR_007322) using a resolution of 0.5. Obtained clusters were manually annotated using canonical marker gene expression (table S5).
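The nuclei filter based on the number of detected genes can be illustrated with a short Python sketch. This is not the Seurat-based code used in the analysis; the function name is hypothetical and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def filter_nuclei(genes_detected):
    """Keep nuclei whose gene count lies within [mean - 1 SD, mean + 2 SD].

    genes_detected: dict mapping nucleus barcode -> number of genes detected.
    Returns the set of retained barcodes.
    """
    counts = list(genes_detected.values())
    m = mean(counts)
    sd = stdev(counts)  # sample SD; an assumption of this sketch
    lower, upper = m - sd, m + 2 * sd
    return {
        barcode
        for barcode, n_genes in genes_detected.items()
        if lower <= n_genes <= upper
    }
```

The asymmetric bounds reflect the two failure modes named in the text: the tighter lower bound removes low-quality nuclei, and the looser upper bound removes likely doublets.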
Signal of GBA1/GBAP1 per cell type
Barcodes (grouped by sample and cell type) were used to create Cluster objects from the python package trusTEr (v0.1.1; https://github.com/raquelgarza/truster) and processed with the following functions:
1) tsv_to_bam()—extracts the given barcodes from a sample’s BAM file (outs/possorted_genome_bam.bam output from Cell Ranger count) using the subset-bam software from 10x Genomics (v1.0). Outputs one BAM file for each cell type per sample, which contains all alignments.
2) filter_UMIs()—filters BAM files to only keep unique combinations of cell barcodes, unique molecular identifier (UMI), and sequences.
3) bam_to_fastq()—uses bamtofastq from 10x Genomics (version 1.2.0) to output the filtered BAM files as fastQ files.
4) concatenate_lanes()—concatenates the different lanes (as output from bamtofastq) from one library and generates one FASTQ file per cluster.
5) merge_clusters()—concatenates the resulting FASTQ files (one for each cell type and sample) in defined groups of samples. Here, groups were set to PD or control depending on the diagnosis of the individual from which the sample was derived. Output is a FASTQ file per cell type per condition.
6) map_clusters()—the resulting FASTQ files were then mapped using STAR (v2.7.8a). Multimapping reads were allowed to map up to 100 loci (outFilterMultimapNmax 100, winAnchorMultimapNmax 200); the rest of the parameters were used as default.
The resulting BAM files were converted to bigwig files using bamCoverage (deeptools v2.5.4; RRID:SCR_016366) and normalized by the number of nuclei per group (expression was multiplied by a scale factor of 1 × 10^7 and divided by the number of nuclei in a particular cell type).
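Two of the steps above, the deduplication performed by filter_UMIs() and the per-nucleus normalization, can be sketched in plain Python. These functions are illustrations and are not part of trusTEr or deepTools; their names and the tuple-based record format are assumptions.

```python
def deduplicate(records):
    """Keep the first occurrence of each (cell barcode, UMI, sequence) triple.

    records: iterable of (cell_barcode, umi, sequence) tuples.
    """
    seen = set()
    kept = []
    for record in records:
        if record not in seen:
            seen.add(record)
            kept.append(record)
    return kept


def nuclei_scale_factor(n_nuclei, multiplier=1e7):
    """Scale factor for coverage: 1 x 10^7 divided by the number of nuclei
    in the cell type, as described in the text."""
    return multiplier / n_nuclei
```

A value computed this way could, for example, be supplied to bamCoverage through its --scaleFactor option, so that tracks for cell types with different numbers of nuclei are directly comparable.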
For more details, please refer to the scripts process_celltypes_control_PFCTX.py, celltypes_characterization_PFCTX_Ctl.Rmd, and Snakefile_celltypes_control_PFCTX at GitHub (https://github.com/raquelgarza/GBA_snRNAseq_cutnrun_Gustavsson2022.git).
CUT&RUN
Postmortem brain tissue from control individuals with no known history of neurological or neuropsychiatric symptoms was acquired from the Skåne University Hospital Tissue Bank (ethical approval from the Ethical Committee in Lund, 06582-2019 and 00080-2019). From the flash-frozen tissue, 50 to 100 mg were sampled from the DLPFC and stored at −80°C until use.
CUT&RUN was performed as previously described (75), with minor modifications. Concanavalin A (ConA)-coated magnetic beads (Epicypher) were activated by washing twice in bead binding buffer [20 mM Hepes (pH 7.5), 10 mM KCl, 1 mM CaCl2, 1 mM MnCl2] and placed on ice until use. For adult neuronal samples, nuclei were isolated from frozen tissue as described above (see the “Nuclei extraction of cortical postmortem tissue” section). Before FACS, nuclei were incubated with Recombinant Alexa Fluor 488 Anti-NeuN antibody [EPR12763] - Neuronal Marker (ab190195) at a concentration of 1:500 for 30 min on ice. The nuclei were run through the FACS at 4°C at a low flow rate using a 100-μm nozzle. Alexa Fluor 488–positive nuclei (300,000) were sorted. The sorted nuclei were pelleted at 1300g for 15 min and resuspended in 1 ml of ice-cold nuclear wash buffer (20 mM Hepes, 150 mM NaCl, 0.5 mM spermidine, 1× cOmplete protease inhibitors, 0.1% BSA). Thirty microliters (10 μl per antibody treatment) of ConA-coated magnetic beads (Epicypher) were added during gentle vortexing (pipette tips for transferring nuclei were precoated with 1% BSA). Binding of nuclei to beads proceeded for 10 min at room temperature with gentle rotation, and then bead-bound nuclei were split into equal volumes [corresponding to immunoglobulin G (IgG) control and H3K4me3 treatments]. After removal of the wash buffer, nuclei were then resuspended in 100 μl of cold nuclear antibody buffer [20 mM Hepes (pH 7.5), 0.15 M NaCl, 0.5 mM spermidine, 1× Roche cOmplete protease inhibitors, 0.02% (w/v) digitonin, 0.1% BSA, 2 mM EDTA] containing primary antibody (rabbit anti-H3K4me3 Active Motif 39159, RRID:AB_2615077, or goat anti-rabbit IgG, Abcam ab97047, RRID:AB_10681025) at 1:50 dilution and incubated at 4°C overnight with gentle shaking.
Nuclei were washed thoroughly with nuclear digitonin wash buffer [20 mM Hepes (pH 7.5), 150 mM NaCl, 0.5 mM spermidine, 1× Roche cOmplete protease inhibitors, 0.02% digitonin, 0.1% BSA] on the magnetic stand. After the final wash, Protein A and Micrococcal Nuclease (pA-MNase) (a gift from S. Henikoff) was added in nuclear digitonin wash buffer and incubated with the nuclei at 4°C for 1 hour. Nuclei were washed twice, resuspended in 100 μl of digitonin buffer, and chilled to 0° to 2°C in a metal block sitting in wet ice. Genome cleavage was stimulated by addition of 2 mM CaCl2 at 0°C for 30 min. The reaction was quenched by addition of 100 μl of 2× stop buffer [0.35 M NaCl, 20 mM EDTA, 4 mM EGTA, 0.02% digitonin, glycogen (50 ng/μl), ribonuclease A (50 ng/μl), yeast spike-in DNA (10 fg/μl) (a gift from S. Henikoff)] and vortexing. After 30-min incubation at 37°C to release genomic fragments, bead-bound nuclei were placed on the magnet stand and fragments from the supernatant were purified by a NucleoSpin clean-up kit (Macherey-Nagel). Illumina sequencing libraries were prepared using the Hyperprep kit (KAPA) with unique dual-indexed adapters (KAPA), pooled, and sequenced on a NextSeq 500 instrument (Illumina).
CUT&RUN analysis
Paired-end reads (2 × 150 bp) were aligned to the hg38 genome using bowtie2 (76) (v2.3.4.2; RRID:SCR_016368) (--local --very-sensitive-local --no-mixed --no-discordant --phred33 -I 10 -X 700), converted to bam files with samtools (59) (v1.4; RRID:SCR_002105), and indexed with samtools (59) (v1.9; RRID:SCR_002105). Normalized bigwig coverage tracks were made with bamCoverage [deepTools (77) v2.5.4; RRID:SCR_016366], with reads per kilobase of exon per million reads mapped normalization. For more details, please refer to the pipeline Snakefile_Neun_cutnrun in GitHub (https://github.com/raquelgarza/GBA_snRNAseq_cutnrun_Gustavsson2022.git).
Translation of novel transcripts
Structure predictions
Protein sequences of the different isoforms were aligned pairwise to MANE select with BioPython using a BLOSUM62 scoring matrix with gap open penalty of −3 and gap extend penalty of −0.1. pLDDT scores for residues from AlphaFold2 models were extracted and mapped onto the sequence of MANE select according to the alignment. While the structure of the predictions of newly detected isoforms follows mostly the known GBA1 structure, a noteworthy breakdown of the confidence score in regions with deletions is visible. This might indicate a conflict between coevolution information and structural templates from dominant isoforms versus the learned physicochemical properties of protein structures, which might be unfavorable in those regions.
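The step of mapping per-residue pLDDT scores onto the MANE select sequence can be sketched without any alignment library, given the two gapped strings of a pairwise alignment. This is an illustration only: the function name is hypothetical, the gap character is assumed to be "-", and the alignment itself (performed with BioPython in the analysis) is taken as an input.

```python
def map_scores_to_reference(aligned_ref, aligned_iso, iso_scores, gap="-"):
    """Project per-residue scores of an isoform onto reference positions.

    aligned_ref, aligned_iso: gapped sequences of equal length from a
        pairwise alignment (reference = MANE select, isoform = query).
    iso_scores: one score (e.g., pLDDT) per ungapped isoform residue.
    Returns a list with one entry per reference residue: the score of the
    aligned isoform residue, or None where the isoform has a deletion.
    """
    if len(aligned_ref) != len(aligned_iso):
        raise ValueError("aligned sequences must have the same length")
    mapped = []
    iso_index = 0
    for ref_char, iso_char in zip(aligned_ref, aligned_iso):
        score = None
        if iso_char != gap:
            score = iso_scores[iso_index]
            iso_index += 1
        if ref_char != gap:
            # Isoform insertions (reference gap) consume a score but are
            # not reported, since they have no reference coordinate.
            mapped.append(score)
    return mapped
```

Positions returned as None mark the deleted regions, which is where the breakdown in prediction confidence described above would be examined in the flanking residues.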
Cell culture
H4 cells (American Type Culture Collection HTB-148) with homozygous knockout of GBA1 (ENSG00000177628) were generated using indel-based CRISPR/Cas9 technology [gRNA 5′-TCCATTGGTCTTGAGCCAAG-3′ (reverse orientation) targeting exon 7] via Horizon Discovery Ltd. Cells were cultured in DMEM supplemented with 10% fetal bovine serum at 37°C, 5% CO2. Cells were subcultured every 3 to 4 days at a split ratio of 1:6.
Cell transfection
Cells were transfected using Lipofectamine 3000 reagent (Invitrogen L3000008) according to the manufacturer’s instructions. GBA1 or GBAP1 transcripts subcloned in the pcDNA3.1(+)-C-DYK vector were designed using the GenSmart design tool and acquired from GenScript.
Western blot
Protein was extracted from whole cells using MSD lysis buffer (MSD R60TX-3) containing 1× cOmplete Mini Protease Inhibitor Cocktail (Roche 11836153001) and 1× PhosSTOP Phosphatase Inhibitor Cocktail (Roche 4906845001). Protein concentration was determined by bicinchoninic acid (BCA) assay according to the manufacturer’s instructions (Pierce 23225). Protein (10 to 20 μg) was diluted in NuPAGE LDS Sample Buffer (Invitrogen NP0007) containing 200 mM DTT and loaded on NuPAGE 4 to 12% bis-tris mini protein gels. Gels were run in NuPAGE MES SDS Running Buffer (Invitrogen NP0002) at 150 V and transferred to 0.2-μm nitrocellulose membranes in tris-glycine transfer buffer containing 20% MeOH at 30 V for 1.5 hours. Subsequently, membranes were blocked in Intercept Blocking Buffer (LI-COR 927-60001) and incubated with primary antibodies overnight at 4°C and then IRdye-conjugated secondary antibodies before imaging on the LI-COR Biosciences Odyssey CLx imaging system. Primary antibodies used include mouse anti-FLAG (Sigma-Aldrich F3165), rabbit anti-GBA1 (C-terminal; Sigma-Aldrich G4171), and rabbit anti–glyceraldehyde-3-phosphate dehydrogenase (GAPDH) (Abcam ab9485).
GCase activity assay
Cells cultured on a 96-well plate were washed with phosphate-buffered saline (PBS) (no Ca2+, no Mg2+) and harvested in activity assay buffer containing 50 mM citric acid/potassium phosphate (pH 5.0 to 5.4), 0.25% (v/v) Triton X-100, 1% (w/v) sodium taurocholate, and 1 mM EDTA. After a cycle of freeze/thaw and 30-min incubation on ice, samples were centrifuged at 3500 rpm for 5 min at 4°C. Supernatant was collected and incubated in 1% BSA and 2 mM 4-methylumbelliferyl-β-d-glucopyranoside (4-MUG; Sigma-Aldrich M3633) for 90 min at 37°C. The reaction was stopped by addition of 1 M glycine (pH 12.5), and fluorescence (excitation, 365 nm; emission, 445 nm) was measured using a SpectraMax M2 microplate reader (Molecular Devices). Enzyme activity was normalized to untransfected controls.
Immunofluorescence
Cells cultured on a 96-well plate were fixed in 4% paraformaldehyde for 10 min and methanol for 10 min and permeabilized in 0.3% Triton X-100 for 10 min at room temperature. Cells were then blocked in BlockAce blocking reagent (Bio-Rad BUF029) for 60 min and then incubated with primary antibodies at 4°C overnight. Following washing with PBS with 0.1% Tween 20, cells were incubated with Alexa Fluor secondary antibodies and Hoechst nucleic acid stain. Imaging was performed on the Thunder imager (Leica) and Opera Phenix High-content Screening System (PerkinElmer). The proportion of FLAG-tag staining (representing overexpressed GBA1) that localized to lysosomes was quantified using Harmony High-Content Imaging and Analysis Software (PerkinElmer). For each condition, >100 cells were assessed across two individual wells with nine fields of images taken per well. Primary antibodies used include mouse anti-FLAG (Sigma-Aldrich F3165), mouse anti-GBA1 (Abcam ab55080), and rabbit anti–cathepsin D (Abcam ab75852).
Variant interpretation
We retrieved all genetic variants overlapping the GBA1 locus from ClinVar, using this script https://github.com/egustavsson/long-read_scripts/blob/main/scripts/getClinVarForLoci.sh and subsequently filtered for only pathogenic variants. Since GBA1 variants associated with risk of PD are not necessarily classified as pathogenic, we also included data from the GBA1-PD browser (https://pdgenetics.shinyapps.io/gba1browser/) (78), a manual curation of PD risk variants in GBA1.
Mass spectrometric analysis of prefrontal cortex proteomes
Public mass spectrometry datasets were retrieved from ProteomeXchange (PXD026370) and from MassIVE (MSV000085698). PXD026370 consists of human brain tissue collected postmortem from patients diagnosed with multiple system atrophy (n = 45) and from controls (n = 30) for comparative quantitative proteome profiling of the prefrontal cortex (Brodmann area 9) (49). MSV000085698 consists of label-free mass spectrometry analysis of hMGLs (50).
The data analysis was performed using MetaMorpheus (v0.0.320; https://github.com/smith-chem-wisc/MetaMorpheus) (79). The search was conducted against two GBAP1 isoforms (PB.845.1693 and PB.845.525) together with a list of 267 protein contaminants frequently found in mass spectrometry data, as provided by MetaMorpheus. An FDR (false discovery rate) of 1% was applied for presentation of PSMs (peptide spectrum matches), peptides, and proteins following review of decoy target sequences.
The following search settings were used: protease = trypsin; maximum missed cleavages = 2; minimum peptide length = 7; maximum peptide length = unspecified; initiator methionine behavior = Variable; fixed modifications = Carbamidomethyl on C, Carbamidomethyl on U; variable modifications = Oxidation on M; max mods per peptide = 2; max modification isoforms = 1024; precursor mass tolerance = ±5.0000 parts per million (PPM); product mass tolerance = ±20.0000 PPM; report PSM ambiguity = True.
Annotation of parent genes and protein-coding genes
To explore inaccuracies in annotation of parent genes and protein-coding genes, we applied three independent approaches.
Long-read RNA-seq
To identify full-length transcripts with at least one novel splice junction, we used the same long-read RNA-seq samples available from ENCODE (54) as previously described. Transcripts with novel splice junction resulting in novel ORF were those transcripts that had a predicted ORF that was not present in GENCODE v38 annotation.
Novel expressed regions
Novel unannotated expression (38) was downloaded from Visualisation of Expressed Regions (vizER; https://rytenlab.com/browser/app/vizER). The data originate from RNA-seq data in base-level coverage format for 7595 samples originating from 41 different GTEx tissues. Cell lines, sex-specific tissues, and tissues with 10 samples or below were removed. Samples with large chromosomal deletions and duplications or large copy number variation previously associated with disease were filtered out (smafrze = “USE ME”). Coverage for all remaining samples was normalized to a target library size of 40 million 100-bp reads using the area under coverage value provided by recount2 (53). For each tissue, base-level coverage was averaged across all samples to calculate the mean base-level coverage. GTEx junction reads, defined as reads with a noncontiguous gapped alignment to the genome, were downloaded using the recount2 resource and filtered to include only junction reads detected in at least 5% of samples for a given tissue and those that had available donor and acceptor splice sequences.
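The coverage normalization can be illustrated with the following Python sketch. It assumes the usual area-under-coverage scaling, in which each sample's base-level coverage is multiplied by the target area (40 million reads × 100 bp) divided by the sample's AUC; the function names are hypothetical and this is not the recount2 or vizER code.

```python
# Target area under coverage: 40 million reads of 100 bp each.
TARGET_AUC = 40_000_000 * 100


def normalize_coverage(coverage, auc, target=TARGET_AUC):
    """Scale one sample's base-level coverage to the target library size.

    coverage: per-base coverage values for a region.
    auc: the sample's total area under coverage (as provided by recount2).
    """
    factor = target / auc
    return [c * factor for c in coverage]


def mean_coverage(samples):
    """Average base-level coverage across samples of one tissue.

    samples: list of equal-length per-base coverage lists.
    """
    n = len(samples)
    return [sum(values) / n for values in zip(*samples)]
```

Scaling before averaging prevents deeply sequenced samples from dominating the tissue-level mean.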
Splice junctions
To identify novel junctions with potential evidence of incomplete annotation, we used data provided by IntroVerse (80).
IntroVerse is a relational database that comprises exon-exon split-read data on the splicing of human introns (Ensembl v105) across 17,510 human control RNA samples and 54 tissues originally made available by GTEx and processed by the recount3 project (34). RNA-seq reads provided by the GTEx v8 project were sequenced using the Illumina TruSeq library construction protocol (nonstranded 76-bp-long reads, polyA+ selection). Samples from GTEx v8 were processed by recount3 through Monorail [STAR (57)] to detect and summarize splice junctions and Megadepth (81) to analyze the BAM files produced by STAR. Additional QC criteria applied by IntroVerse included (i) exclusively analyzing samples passing the GTEx v8 minimum standards (smafrze != “EXCLUDE”), (ii) discarding any split reads overlapping any of the sequences included in the ENCODE Blacklist (82), and (iii) discarding split reads with an implied intron length shorter than 25 bp.
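The three QC criteria listed above can be sketched as simple filters. The snippet below is an illustrative Python sketch only, not IntroVerse code: the `SplitRead` class, the helper names, the 1-based inclusive coordinates, and the choice to test blacklist overlap against the implied intron interval are all assumptions.

```python
from dataclasses import dataclass

MIN_INTRON_LENGTH_BP = 25  # threshold stated in the text


@dataclass(frozen=True)
class SplitRead:
    """Hypothetical record for a split read; coordinates are 1-based and inclusive."""
    chrom: str
    intron_start: int
    intron_end: int

    @property
    def intron_length(self) -> int:
        return self.intron_end - self.intron_start + 1


def sample_passes(smafrze: str) -> bool:
    """Criterion (i): keep samples whose smafrze flag is not EXCLUDE."""
    return smafrze != "EXCLUDE"


def overlaps_blacklist(read: SplitRead, blacklist) -> bool:
    """Criterion (ii): True if the implied intron overlaps any blacklisted region."""
    return any(
        chrom == read.chrom and start <= read.intron_end and read.intron_start <= end
        for chrom, start, end in blacklist
    )


def passes_qc(read: SplitRead, blacklist) -> bool:
    """Combine criteria (ii) and (iii) for a single split read."""
    long_enough = read.intron_length >= MIN_INTRON_LENGTH_BP
    return long_enough and not overlaps_blacklist(read, blacklist)


# Toy data: one intron that is too short, one clean, one overlapping the blacklist.
blacklist = [("chr2", 800, 1000)]
reads = [
    SplitRead("chr1", 100, 110),
    SplitRead("chr1", 1000, 2000),
    SplitRead("chr2", 500, 900),
]
kept = [r for r in reads if passes_qc(r, blacklist)]
```

Only the second toy read survives, because the first implies an 11-bp intron and the third overlaps the blacklisted interval.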
From these data, we extracted all novel donor and acceptor junctions that had evidence of use in ≥5% of the samples of each tissue and grouped them by gene. We then classified those genes as either “parent” or “protein-coding.” Finally, we calculated, within each tissue, the proportion of genes in each category that contained at least one such junction. For the parent genes category, this can be described as follows:
Let T denote the current tissue. Let j denote the total number of parent genes containing at least one novel junction shared by ≥5% of the samples of T. Let x denote the total number of parent genes available for study. The proportion of parent genes for tissue T is then j/x.
We mirrored the formula above to calculate the proportion of protein-coding genes per tissue.
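As an illustration of the per-tissue proportion described above, the following Python sketch computes j/x for one gene category in one tissue. It is a sketch under stated assumptions rather than the authors' implementation: the function name, the input layout (gene mapped to a list of per-junction sample counts), and the gene labels are hypothetical.

```python
def proportion_with_novel_junction(junction_support, genes, n_samples, min_fraction=0.05):
    """Return j / x for one gene category in one tissue.

    junction_support: maps a gene to the number of samples supporting each of
        its novel junctions in the tissue (hypothetical input layout).
    genes: all genes of the category available for study (x = len(genes)).
    n_samples: number of samples in the tissue.
    """
    j = sum(
        1
        for gene in genes
        if any(n / n_samples >= min_fraction for n in junction_support.get(gene, []))
    )
    x = len(genes)
    return j / x


# Toy tissue with 100 samples and four hypothetical parent genes.
parent_genes = ["geneA", "geneB", "geneC", "geneD"]
support = {"geneA": [7, 1], "geneB": [4], "geneC": [5]}
parent_proportion = proportion_with_novel_junction(support, parent_genes, n_samples=100)
```

Here geneA and geneC each have a novel junction in at least 5% of samples, so j = 2 and x = 4. The same function can be called with the list of protein-coding genes to obtain the mirrored proportion.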
Acknowledgments
We would like to thank all funding agencies and the UCL Long Read Sequencing Service for assistance. We are grateful to all members of the Ryten laboratory. For the purpose of open access, the author has applied a CC BY public copyright license to all Author Accepted Manuscripts arising from this submission.
Funding: This research was funded in whole or in part by Aligning Science Across Parkinson’s [grant numbers: ASAP-000478 (E.K.G., R.H.R., F.P., M.R., and J.H.), ASAP-000509 (E.K.G., J.W.B., M.G.-P., S.G., N.W.W., and M.R.), and ASAP-000520 (R.G., A.Q., R.A.B., and J.J.)] through the Michael J. Fox Foundation for Parkinson’s Research (MJFF), BrightFocus Foundation A2021009F (E.K.G.), Swedish Society for Medical Research Starting Grant SSMF S19-0100 (C.H.D.), Tenure Track Clinician Scientist Fellowship N008324/1 (M.R.), Alzheimer’s Society AS-JF-18-008 (C.A.), Alzheimer’s Research UK Senior Research Fellowship ARUK-SRF2016B-2 (S.W.), and the NIHR UCL Hospitals Biomedical Research Centre (S.W., C.A., and J.H.).
Author contributions: Conceptualization: E.K.G. and M.R. Methodology: E.K.G., S.S., D.Z., and J.W.B. Investigation: E.K.G., S.S., Y.G., J.W.B., S.G.-R., D.Z., R.G., R.H.R., J.R.E., Z.C., M.G.-P., H.M., K.M., R.D., A.I.W., C.A., S.W., S.G., J.E., C.B., C.H.D., A.A., D.A.M.A., A.K., A.Q., R.A.B., E.E., F.P., J.J., H.S., and C.F.B. Visualization: E.K.G., S.S., Y.G., J.W.B., S.G.-R., D.Z., and R.G. Funding acquisition: S.G., J.J., N.W.W., H.H., J.H., and M.R. Project administration: E.K.G. and M.R. Supervision: E.K.G., S.G., H.S., C.F.B., J.H., and M.R. Writing—original draft: E.K.G. and M.R. Writing—review and editing: E.K.G., S.S., Y.G., J.W.B., S.G.-R., D.Z., R.G., R.H.R., J.R.E., Z.C., M.G.-P., H.M., K.M., R.D., A.I.W., C.A., S.W., S.G., J.E., C.B., C.H.D., A.A., D.A.M.A., A.K., A.Q., R.A.B., E.E., F.P., J.J., N.W.W., H.H., H.S., C.F.B., J.H., and M.R.
Competing interests: S.S., Y.G., J.E., H.S., and C.F.B. are employed by Astex Pharmaceuticals. The other authors declare no competing interests.
Data and materials availability: All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. Raw long-read RNA-seq data generated and used in this manuscript are publicly available in Synapse (Project SynID: syn53642785): https://www.synapse.org/#!Synapse:syn53642785/wiki/626685. Other data, code, and materials used in the analysis are available and described throughout the manuscript. The code for analysis and figure generation in this manuscript can be accessed through https://github.com/egustavsson/GBA_GBAP1_manuscript.git (DOI: https://doi.org/10.5281/zenodo.10514842).
REFERENCES AND NOTES
- 1. Ebbert M. T. W., Jensen T. D., Jansen-West K., Sens J. P., Reddy J. S., Ridge P. G., Kauwe J. S. K., Belzil V., Pregent L., Carrasquillo M. M., Keene D., Larson E., Crane P., Asmann Y. W., Ertekin-Taner N., Younkin S. G., Ross O. A., Rademakers R., Petrucelli L., Fryer J. D., Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biol. 20, 1–23 (2019).
- 2. Pei B., Sisu C., Frankish A., Howald C., Habegger L., Mu X. J., Harte R., Balasubramanian S., Tanzer A., Diekhans M., Reymond A., Hubbard T. J., Harrow J., Gerstein M. B., The GENCODE pseudogene resource. Genome Biol. 13, 1–26 (2012).
- 3. Toffoli M., Chen X., Sedlazeck F. J., Lee C. Y., Mullin S., Higgins A., Koletsi S., Garcia-Segura M. E., Sammler E., Scholz S. W., Schapira A. H. V., Eberle M. A., Proukakis C., Comprehensive short and long read sequencing analysis for the Gaucher and Parkinson’s disease-associated GBA gene. Commun. Biol. 5, 1–10 (2022).
- 4. Deschamps-Francoeur G., Simoneau J., Scott M. S., Handling multi-mapped reads in RNA-seq. Comput. Struct. Biotechnol. J. 18, 1569–1576 (2020).
- 5. Weinreb N. J., Brady R. O., Tappel A. L., The lysosomal localization of sphingolipid hydrolases. Biochim. Biophys. Acta 159, 141–146 (1968).
- 6. Brady R. O., Kanfer J., Shapiro D., The metabolism of glucocerebrosides: I. purification and properties of a glucocerebroside-cleaving enzyme from spleen tissue. J. Biol. Chem. 240, 39–43 (1965).
- 7. Hruska K. S., LaMarca M. E., Scott C. R., Sidransky E., Gaucher disease: Mutation and polymorphism spectrum in the glucocerebrosidase gene (GBA). Hum. Mutat. 29, 567–583 (2008).
- 8. Koprivica V., Stone D. L., Park J. K., Callahan M., Frisch A., Cohen I. J., Tayebi N., Sidransky E., Analysis and classification of 304 mutant alleles in patients with type 1 and type 3 Gaucher disease. Am. J. Hum. Genet. 66, 1777–1786 (2000).
- 9. Wigderson M., Firon N., Horowitz Z., Wilder S., Frishberg Y., Reiner O., Horowitz M., Characterization of mutations in Gaucher patients by cDNA cloning. Am. J. Hum. Genet. 44, 365 (1989).
- 10. Latham T., Grabowski G. A., Theophilus B. D. M., Smith F. I., Complex alleles of the acid beta-glucosidase gene in Gaucher disease. Am. J. Hum. Genet. 47, 79 (1990).
- 11. Tsuji S., Choudary P. V., Martin B. M., Stubblefield B. K., Mayor J. A., Barranger J. A., Ginns E. I., A mutation in the human glucocerebrosidase gene in neuronopathic Gaucher’s disease. N. Engl. J. Med. 316, 570–575 (1987).
- 12. Aflaki E., Westbroek W., Sidransky E., The complicated relationship between Gaucher disease and parkinsonism: Insights from a rare disease. Neuron 93, 737–746 (2017).
- 13. Sidransky E., Lopez G., The link between the GBA gene and parkinsonism. Lancet Neurol. 11, 986–998 (2012).
- 14.Sidransky E., Nalls M. A., Aasly J. O., Aharon-Peretz J., Annesi G., Barbosa E. R., Bar-Shira A., Berg D., Bras J., Brice A., Chen C.-M., Clark L. N., Condroyer C., de Marco E. V., Dürr A., Eblan M. J., Fahn S., Farrer M. J., Fung H.-C., Gan-Or Z., Gasser T., Gershoni-Baruch R., Giladi N., Griffith A., Gurevich T., Januario C., Kropp P., Lang A. E., Lee-Chen G.-J., Lesage S., Marder K., Mata I. F., Mirelman A., Mitsui J., Mizuta I., Nicoletti G., Oliveira C., Ottman R., Orr-Urtreger A., Pereira L. V., Quattrone A., Rogaeva E., Rolfs A., Rosenbaum H., Rozenberg R., Samii A., Samaddar T., Schulte C., Sharma M., Singleton A., Spitz M., Tan E.-K., Tayebi N., Toda T., Troiano A. R., Tsuji S., Wittstock M., Wolfsberg T. G., Wu Y.-R., Zabetian C. P., Zhao Y., Ziegler S. G., Multicenter analysis of glucocerebrosidase mutations in Parkinson’s disease. N. Engl. J. Med. 361, 1651–1661 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Lwin A., Orvisky E., Goker-Alpan O., LaMarca M. E., Sidransky E., Glucocerebrosidase mutations in subjects with parkinsonism. Mol. Genet. Metab. 81, 70–73 (2004).
- 16. Aharon-Peretz J., Rosenbaum H., Gershoni-Baruch R., Mutations in the glucocerebrosidase gene and Parkinson’s disease in Ashkenazi Jews. N. Engl. J. Med. 351, 1972–1977 (2004).
- 17. Winder-Rhodes S. E., Evans J. R., Ban M., Mason S. L., Williams-Gray C. H., Foltynie T., Duran R., Mencacci N. E., Sawcer S. J., Barker R. A., Glucocerebrosidase mutations influence the natural history of Parkinson’s disease in a community-based incident cohort. Brain 136, 392–399 (2013).
- 18. Lythe V., Athauda D., Foley J., Mencacci N. E., Jahanshahi M., Cipolotti L., Hyam J., Zrinzo L., Hariz M., Hardy J., Limousin P., Foltynie T., GBA-associated Parkinson’s disease: Progression in a deep brain stimulation cohort. J. Parkinsons Dis. 7, 635–644 (2017).
- 19. Davis M. Y., Johnson C. O., Leverenz J. B., Weintraub D., Trojanowski J. Q., Chen-Plotkin A., van Deerlin V. M., Quinn J. F., Chung K. A., Peterson-Hiller A. L., Rosenthal L. S., Dawson T. M., Albert M. S., Goldman J. G., Stebbins G. T., Bernard B., Wszolek Z. K., Ross O. A., Dickson D. W., Eidelberg D., Mattis P. J., Niethammer M., Yearout D., Hu S. C., Cholerton B. A., Smith M., Mata I. F., Montine T. J., Edwards K. L., Zabetian C. P., Association of GBA mutations and the E326K polymorphism with motor and cognitive progression in Parkinson disease. JAMA Neurol. 73, 1217–1224 (2016).
- 20. Brockmann K., Srulijes K., Pflederer S., Hauser A. K., Schulte C., Maetzler W., Gasser T., Berg D., GBA-associated Parkinson’s disease: Reduced survival and more rapid progression in a prospective longitudinal study. Mov. Disord. 30, 407–411 (2015).
- 21.Iwaki H., Blauwendraat C., Leonard H. L., Liu G., Maple-Grødem J., Corvol J. C., Pihlstrøm L., van Nimwegen M., Hutten S. J., Nguyen K. D. H., Rick J., Eberly S., Faghri F., Auinger P., Scott K. M., Wijeyekoon R., van Deerlin V. M., Hernandez D. G., Day-Williams A. G., Brice A., Alves G., Noyce A. J., Tysnes O. B., Evans J. R., Breen D. P., Estrada K., Wegel C. E., Danjou F., Simon D. K., Ravina B., Toft M., Heutink P., Bloem B. R., Weintraub D., Barker R. A., Williams-Gray C. H., van de Warrenburg B. P., van Hilten J. J., Scherzer C. R., Singleton A. B., Nalls M. A., Genetic risk of Parkinson disease and progression. Neurol. Genet. 5, e348 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Gustavsson E. K., Trinh J., McKenzie M., Bortnick S., Petersen M. S., Farrer M. J., Aasly J. O., Genetic identification in early onset parkinsonism among Norwegian patients. Mov. Disord. Clin. Pract. 4, 499–508 (2017).
- 23. Gan-Or Z., Amshalom I., Kilarski L. L., Bar-Shira A., Gana-Weisz M., Mirelman A., Marder K., Bressman S., Giladi N., Orr-Urtreger A., Differential effects of severe vs mild GBA mutations on Parkinson disease. Neurology 84, 880–887 (2015).
- 24. Conesa A., Madrigal P., Tarazona S., Gomez-Cabrero D., Cervera A., McPherson A., Szcześniak M. W., Gaffney D. J., Elo L. L., Zhang X., Mortazavi A., A survey of best practices for RNA-seq data analysis. Genome Biol. 17, 1–19 (2016).
- 25.Harrow J., Frankish A., Gonzalez J. M., Tapanari E., Diekhans M., Kokocinski F., Aken B. L., Barrell D., Zadissa A., Searle S., Barnes I., Bignell A., Boychenko V., Hunt T., Kay M., Mukherjee G., Rajan J., Despacio-Reyes G., Saunders G., Steward C., Harte R., Lin M., Howald C., Tanzer A., Derrien T., Chrast J., Walters N., Balasubramanian S., Pei B., Tress M., Rodriguez J. M., Ezkurdia I., van Baren J., Brent M., Haussler D., Kellis M., Valencia A., Reymond A., Gerstein M., Guigó R., Hubbard T. J., GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Sisu C., Pei B., Leng J., Frankish A., Zhang Y., Balasubramanian S., Harte R., Wang D., Rutenberg-Schoenberg M., Clark W., Diekhans M., Rozowsky J., Hubbard T., Harrow J., Gerstein M. B., Comparative analysis of pseudogenes across three phyla. Proc. Natl. Acad. Sci. U.S.A. 111, 13361–13366 (2014).
- 27. Amberger J. S., Bocchini C. A., Schiettecatte F., Scott A. F., Hamosh A., OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 43, D789–D798 (2015).
- 28.Lonsdale J., Thomas J., Salvatore M., Phillips R., Lo E., Shad S., Hasz R., Walters G., Garcia F., Young N., Foster B., Moser M., Karasik E., Gillard B., Ramsey K., Sullivan S., Bridge J., Magazine H., Syron J., Fleming J., Siminoff L., Traino H., Mosavel M., Barker L., Jewell S., Rohrer D., Maxim D., Filkins D., Harbach P., Cortadillo E., Berghuis B., Turner L., Hudson E., Feenstra K., Sobin L., Robb J., Branton P., Korzeniewski G., Shive C., Tabor D., Qi L., Groch K., Nampally S., Buia S., Zimmerman A., Smith A., Burges R., Robinson K., Valentino K., Bradbury D., Cosentino M., Diaz-Mayoral N., Kennedy M., Engel T., Williams P., Erickson K., Ardlie K., Winckler W., Getz G., DeLuca D., MacArthur D., Kellis M., Thomson A., Young T., Gelfand E., Donovan M., Meng Y., Grant G., Mash D., Marcus Y., Basile M., Liu J., Zhu J., Tu Z., Cox N. J., Nicolae D. L., Gamazon E. R., Im H. K., Konkashbaev A., Pritchard J., Stevens M., Flutre T., Wen X., Dermitzakis E. T., Lappalainen T., Guigo R., Monlong J., Sammeth M., Koller D., Battle A., Mostafavi S., McCarthy M., Rivas M., Maller J., Rusyn I., Nobel A., Wright F., Shabalin A., Feolo M., Sharopova N., Sturcke A., Paschal J., Anderson J. M., Wilder E. L., Derr L. K., Green E. D., Struewing J. P., Temple G., Volpi S., Boyer J. T., Thomson E. J., Guyer M. S., Ng C., Abdallah A., Colantuoni D., Insel T. R., Koester S. E., Little A. R., Bender P. K., Lehner T., Yao Y., Compton C. C., Vaught J. B., Sawyer S., Lockhart N. C., Demchok J., Moore H. F., The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Aguet F., Barbeira A. N., Bonazzola R., Brown A., Castel S. E., Jo B., Kasela S., Kim-Hellmuth S., Liang Y., Oliva M., Flynn E. D., Parsana P., Fresard L., Gamazon E. R., Hamel A. R., He Y., Hormozdiari F., Mohammadi P., Muñoz-Aguirre M., Park Y. S., Saha A., Segrè A. V., Strober B. J., Wen X., Wucher V., Ardlie K. G., Battle A., Brown C. D., Cox N., Das S., Dermitzakis E. T., Engelhardt B. E., Garrido-Martín D., Gay N. R., Getz G. A., Guigó R., Handsaker R. E., Hoffman P. J., Im H. K., Kashin S., Kwong A., Lappalainen T., Li X., MacArthur D. G., Montgomery S. B., Rouhana J. M., Stephens M., Stranger B. E., Todres E., Viñuela A., Wang G., Zou Y., Anand S., Gabriel S., Graubert A., Hadley K., Huang K. H., Meier S. R., Nedzel J. L., Nguyen D. T., Balliu B., Conrad D. F., Cotter D. J., de Goede O. M., Einson J., Eskin E., Eulalio T. Y., Ferraro N. M., Gloudemans M. J., Hou L., Kellis M., Li X., Mangul S., Nachun D. C., Nobel A. B., Park Y., Rao A. S., Reverter F., Sabatti C., Skol A. D., Teran N. A., Wright F., Ferreira P. G., Li G., Melé M., Yeger-Lotem E., Barcus M. E., Bradbury D., Krubit T., McLean J. A., Qi L., Robinson K., Roche N. V., Smith A. M., Sobin L., Tabor D. E., Undale A., Bridge J., Brigham L. E., Foster B. A., Gillard B. M., Hasz R., Hunter M., Johns C., Johnson M., Karasik E., Kopen G., Leinweber W. F., McDonald A., Moser M. T., Myer K., Ramsey K. D., Roe B., Shad S., Thomas J. A., Walters G., Washington M., Wheeler J., Jewell S. D., Rohrer D. C., Valley D. R., Davis D. A., Mash D. C., Branton P. A., Sobin L., Barker L. K., Gardiner H. M., Mosavel M., Siminoff L. A., Flicek P., Haeussler M., Juettemann T., Kent W. J., Lee C. M., Powell C. C., Rosenbloom K. R., Ruffier M., Sheppard D., Taylor K., Trevanion S. J., Zerbino D. R., Abell N. S., Akey J., Chen L., Demanelis K., Doherty J. A., Feinberg A. P., Hansen K. D., Hickey P. F., Hou L., Jasmine F., Jiang L., Kaul R., Kellis M., Kibriya M. G., Li J. B., Li Q., Lin S., Linder S. E., Pierce B. 
L., Rizzardi L. F., Smith K. S., Snyder M., Stamatoyannopoulos J., Tang H., Wang M., Branton P. A., Carithers L. J., Guan P., Koester S. E., Little A. R., Moore H. M., Nierras C. R., Rao A. K., Vaught J. B., Volpi S., The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Troskie R. L., Jafrani Y., Mercer T. R., Ewing A. D., Faulkner G. J., Cheetham S. W., Long-read cDNA sequencing identifies functional pseudogenes in the human transcriptome. Genome Biol. 22, 1–15 (2021).
- 31. Horowitz M., Wilder S., Horowitz Z., Reiner O., Gelbart T., Beutler E., The human glucocerebrosidase gene and pseudogene: Structure and evolution. Genomics 4, 87–96 (1989).
- 32. Straniero L., Rimoldi V., Samarani M., Goldwurm S., di Fonzo A., Krüger R., Deleidi M., Aureli M., Soldà G., Duga S., Asselta R., The GBAP1 pseudogene acts as a ceRNA for the glucocerebrosidase gene GBA by sponging miR-22-3p. Sci. Rep. 7, 1–13 (2017).
- 33.Akbarian S., Liu C., Knowles J. A., Vaccarino F. M., Farnham P. J., Crawford G. E., Jaffe A. E., Pinto D., Dracheva S., Geschwind D. H., Mill J., Nairn A. C., Abyzov A., Pochareddy S., Prabhakar S., Weissman S., Sullivan P. F., State M. W., Weng Z., Peters M. A., White K. P., Gerstein M. B., Amiri A., Armoskus C., Ashley-Koch A. E., Bae T., Beckel-Mitchener A., Berman B. P., Coetzee G. A., Coppola G., Francoeur N., Fromer M., Gao R., Grennan K., Herstein J., Kavanagh D. H., Ivanov N. A., Jiang Y., Kitchen R. R., Kozlenkov A., Kundakovic M., Li M., Li Z., Liu S., Mangravite L. M., Mattei E., Markenscoff-Papadimitriou E., Navarro F. C. P., North N., Omberg L., Panchision D., Parikshak N., Poschmann J., Price A. J., Purcaro M., Reddy T. E., Roussos P., Schreiner S., Scuderi S., Sebra R., Shibata M., Shieh A. W., Skarica M., Sun W., Swarup V., Thomas A., Tsuji J., van Bakel H., Wang D., Wang Y., Wang K., Werling D. M., Willsey A. J., Witt H., Won H., Wong C. C. Y., Wray G. A., Wu E. Y., Xu X., Yao L., Senthil G., Lehner T., Sklar P., Sestan N., The PsychENCODE project. Nat. Neurosci. 18, 1707–1712 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Wilks C., Zheng S. C., Chen F. Y., Charles R., Solomon B., Ling J. P., Imada E. L., Zhang D., Joseph L., Leek J. T., Jaffe A. E., Nellore A., Collado-Torres L., Hansen K. D., Langmead B., Recount3: Summaries and queries for large-scale RNA-seq expression and splicing. Genome Biol. 22, 1–40 (2021).
- 35. Feleke R., Reynolds R. H., Smith A. M., Tilley B., Taliun S. A. G., Hardy J., Matthews P. M., Gentleman S., Owen D. R., Johnson M. R., Srivastava P. K., Ryten M., Cross-platform transcriptional profiling identifies common and distinct molecular pathologies in Lewy body diseases. Acta Neuropathol. 142, 449–474 (2021).
- 36.Nalls M. A., Duran R., Lopez G., Kurzawa-Akanbi M., McKeith I. G., Chinnery P. F., Morris C. M., Theuns J., Crosiers D., Cras P., Engelborghs S., de Deyn P. P., van Broeckhoven C., Mann D. M. A., Snowden J., Pickering-Brown S., Halliwell N., Davidson Y., Gibbons L., Harris J., Sheerin U. M., Bras J., Hardy J., Clark L., Marder K., Honig L. S., Berg D., Maetzler W., Brockmann K., Gasser T., Novellino F., Quattrone A., Annesi G., de Marco E. V., Rogaeva E., Masellis M., Black S. E., Bilbao J. M., Foroud T., Ghetti B., Nichols W. C., Pankratz N., Halliday G., Lesage S., Klebe S., Durr A., Duyckaerts C., Brice A., Giasson B. I., Trojanowski J. Q., Hurtig H. I., Tayebi N., Landazabal C., Knight M. A., Keller M., Singleton A. B., Wolfsberg T. G., Sidransky E., A multicenter study of glucocerebrosidase mutations in dementia with Lewy bodies. JAMA Neurol. 70, 727–735 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Chia R., Sabir M. S., Bandres-Ciga S., Saez-Atienzar S., Reynolds R. H., Gustavsson E., Walton R. L., Ahmed S., Viollet C., Ding J., Makarious M. B., Diez-Fairen M., Portley M. K., Shah Z., Abramzon Y., Hernandez D. G., Blauwendraat C., Stone D. J., Eicher J., Parkkinen L., Ansorge O., Clark L., Honig L. S., Marder K., Lemstra A., St George-Hyslop P., Londos E., Morgan K., Lashley T., Warner T. T., Jaunmuktane Z., Galasko D., Santana I., Tienari P. J., Myllykangas L., Oinas M., Cairns N. J., Morris J. C., Halliday G. M., van Deerlin V. M., Trojanowski J. Q., Grassano M., Calvo A., Mora G., Canosa A., Floris G., Bohannan R. C., Brett F., Gan-Or Z., Geiger J. T., Moore A., May P., Krüger R., Goldstein D. S., Lopez G., Tayebi N., Sidransky E., Sotis A. R., Sukumar G., Alba C., Lott N., Martinez E. M. G., Tuck M., Singh J., Bacikova D., Zhang X., Hupalo D. N., Adeleye A., Wilkerson M. D., Pollard H. B., Norcliffe-Kaufmann L., Palma J. A., Kaufmann H., Shakkottai V. G., Perkins M., Newell K. L., Gasser T., Schulte C., Landi F., Salvi E., Cusi D., Masliah E., Kim R. C., Caraway C. A., Monuki E. S., Brunetti M., Dawson T. M., Rosenthal L. S., Albert M. S., Pletnikova O., Troncoso J. C., Flanagan M. E., Mao Q., Bigio E. H., Rodríguez-Rodríguez E., Infante J., Lage C., González-Aramburu I., Sanchez-Juan P., Ghetti B., Keith J., Black S. E., Masellis M., Rogaeva E., Duyckaerts C., Brice A., Lesage S., Xiromerisiou G., Barrett M. J., Tilley B. S., Gentleman S., Logroscino G., Serrano G. E., Beach T. G., McKeith I. G., Thomas A. J., Attems J., Morris C. M., Palmer L., Love S., Troakes C., Al-Sarraj S., Hodges A. K., Aarsland D., Klein G., Kaiser S. M., Woltjer R., Pastor P., Bekris L. M., Leverenz J. B., Besser L. M., Kuzma A., Renton A. E., Goate A., Bennett D. A., Scherzer C. R., Morris H. R., Ferrari R., Albani D., Pickering-Brown S., Faber K., Kukull W. A., Morenas-Rodriguez E., Lleó A., Fortea J., Alcolea D., Clarimon J., Nalls M. A., Ferrucci L., Resnick S. 
M., Tanaka T., Foroud T. M., Graff-Radford N. R., Wszolek Z. K., Ferman T., Boeve B. F., Hardy J. A., Topol E. J., Torkamani A., Singleton A. B., Ryten M., Dickson D. W., Chiò A., Ross O. A., Gibbs J. R., Dalgard C. L., Traynor B. J., Scholz S. W., Genome sequencing analysis identifies new loci associated with Lewy body dementia and provides insights into its genetic architecture. Nat. Genet. 53, 294–303 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Zhang D., Guelfi S., Garcia-Ruiz S., Costa B., Reynolds R. H., D’Sa K., Liu W., Courtin T., Peterson A., Jaffe A. E., Hardy J., Botía J. A., Collado-Torres L., Ryten M., Incomplete annotation has a disproportionate impact on our understanding of Mendelian and complex neurogenetic disorders. Sci. Adv. 6, 8299–8309 (2020).
- 39. Tian L., Jabbari J. S., Thijssen R., Gouil Q., Amarasinghe S. L., Voogd O., Kariyawasam H., Du M. R. M., Schuster J., Wang C., Su S., Dong X., Law C. W., Lucattini A., Prawer Y. D. J., Collar-Fernández C., Chung J. D., Naim T., Chan A., Ly C. H., Lynch G. S., Ryall J. G., Anttila C. J. A., Peng H., Anderson M. A., Flensburg C., Majewski I., Roberts A. W., Huang D. C. S., Clark M. B., Ritchie M. E., Comprehensive characterization of single-cell full-length isoforms in human and mouse with long-read sequencing. Genome Biol. 22, 1–24 (2021).
- 40. Gonzàlez-Porta M., Frankish A., Rung J., Harrow J., Brazma A., Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene. Genome Biol. 14, 1–11 (2013).
- 41.Djebali S., Davis C. A., Merkel A., Dobin A., Lassmann T., Mortazavi A., Tanzer A., Lagarde J., Lin W., Schlesinger F., Xue C., Marinov G. K., Khatun J., Williams B. A., Zaleski C., Rozowsky J., Röder M., Kokocinski F., Abdelhamid R. F., Alioto T., Antoshechkin I., Baer M. T., Bar N. S., Batut P., Bell K., Bell I., Chakrabortty S., Chen X., Chrast J., Curado J., Derrien T., Drenkow J., Dumais E., Dumais J., Duttagupta R., Falconnet E., Fastuca M., Fejes-Toth K., Ferreira P., Foissac S., Fullwood M. J., Gao H., Gonzalez D., Gordon A., Gunawardena H., Howald C., Jha S., Johnson R., Kapranov P., King B., Kingswood C., Luo O. J., Park E., Persaud K., Preall J. B., Ribeca P., Risk B., Robyr D., Sammeth M., Schaffer L., See L. H., Shahab A., Skancke J., Suzuki A. M., Takahashi H., Tilgner H., Trout D., Walters N., Wang H., Wrobel J., Yu Y., Ruan X., Hayashizaki Y., Harrow J., Gerstein M., Hubbard T., Reymond A., Antonarakis S. E., Hannon G., Giddings M. C., Ruan Y., Wold B., Carninci P., Guig R., Gingeras T. R., Landscape of transcription in human cells. Nature 489, 101–108 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Forrest A. R. R., Kawaji H., Rehli M., Baillie J. K., de Hoon M. J. L., Haberle V., Lassmann T., Kulakovskiy I. V., Lizio M., Itoh M., Andersson R., Mungall C. J., Meehan T. F., Schmeier S., Bertin N., Jørgensen M., Dimont E., Arner E., Schmidl C., Schaefer U., Medvedeva Y. A., Plessy C., Vitezic M., Severin J., Semple C. A., Ishizu Y., Young R. S., Francescatto M., Altschuler I. A., Albanese D., Altschule G. M., Arakawa T., Archer J. A. C., Arner P., Babina M., Rennie S., Balwierz P. J., Beckhouse A. G., Pradhan-Bhatt S., Blake J. A., Blumenthal A., Bodega B., Bonetti A., Briggs J., Brombacher F., Burroughs A. M., Califano A., Cannistraci C. V., Carbajo D., Chen Y., Chierici M., Ciani Y., Clevers H. C., Dalla E., Davis C. A., Detmar M., Diehl A. D., Dohi T., Drabløs F., Edge A. S. B., Edinger M., Ekwall K., Endoh M., Enomoto H., Fagiolini M., Fairbairn L., Fang H., Farach-Carson M. C., Faulkner G. J., Favorov A., Fisher M. E., Frith M. C., Fujita R., Fukuda S., Furlanello C., Furuno M., Furusawa J. I., Geijtenbeek T. B., Gibson A. P., Gingeras T., Goldowitz D., Gough J., Guhl S., Guler R., Gustincich S., Ha T. J., Hamaguchi M., Hara M., Harbers M., Harshbarger J., Hasegawa A., Hasegawa Y., Hashimoto T., Herlyn M., Hitchens K. J., Sui S. J. H., Hofmann O. M., Hoof I., Hori F., Huminiecki L., Iida K., Ikawa T., Jankovic B. R., Jia H., Joshi A., Jurman G., Kaczkowski B., Kai C., Kaida K., Kaiho A., Kajiyama K., Kanamori-Katayama M., Kasianov A. S., Kasukawa T., Katayama S., Kato S., Kawaguchi S., Kawamoto H., Kawamura Y. I., Kawashima T., Kempfle J. S., Kenna T. J., Kere J., Khachigian L. M., Kitamura T., Klinken S. P., Knox A. J., Kojima M., Kojima S., Kondo N., Koseki H., Koyasu S., Krampitz S., Kubosaki A., Kwon A. T., Laros J. F. J., Lee W., Lennartsson A., Li K., Lilje B., Lipovich L., Mackay-sim A., Manabe R. I., Mar J. C., Marchand B., Mathelier A., Mejhert N., Meynert A., Mizuno Y., de Morais D. A. 
L., Morikawa H., Morimoto M., Moro K., Motakis E., Motohashi H., Mummery C. L., Murata M., Nagao-Sato S., Nakachi Y., Nakahara F., Nakamura T., Nakamura Y., Nakazato K., van Nimwegen E., Ninomiya N., Nishiyori H., Noma S., Nozaki T., Ogishima S., Ohkura N., Ohmiya H., Ohno H., Ohshima M., Okada-Hatakeyama M., Okazaki Y., Orlando V., Ovchinnikov D. A., Pain A., Passier R., Patrikakis M., Persson H., Piazza S., Prendergast J. G. D., Rackham O. J. L., Ramilowski J. A., Rashid M., Ravasi T., Rizzu P., Roncador M., Roy S., Rye M. B., Saijyo E., Sajantila A., Saka A., Sakaguchi S., Sakai M., Sato H., Satoh H., Savvi S., Saxena A., Schneider C., Schultes E. A., Schulze-Tanzil G. G., Schwegmann A., Sengstag T., Sheng G., Shimoji H., Shimoni Y., Shin J. W., Simon C., Sugiyama D., Sugiyama T., Suzuki M., Suzuki N., Swoboda R. K., Hoen P. A. C. T., Tagami M., Tagami N. T., Takai J., Tanaka H., Tatsukawa H., Tatum Z., Thompson M., Toyoda H., Toyoda T., Valen E., van de Wetering M., van den Berg L. M., Verardo R., Vijayan D., Vorontsov I. E., Wasserman W. W., Watanabe S., Wells C. A., Winteringham L. N., Wolvetang E., Wood E. J., Yamaguchi Y., Yamamoto M., Yoneda M., Yonekura Y., Yoshida S., Zabierowski S. E., Zhang P. G., Zhao X., Zucchelli S., Summers K. M., Suzuki H., Daub C. O., Kawai J., Heutink P., Hide W., Freeman T. C., Lenhard B., Bajic L. V. B., Taylor M. S., Makeev V. J., Sandelin A., Hume D. A., Carninci P., Hayashizaki Y., A promoter-level mammalian expression atlas. Nature 507, 462–470 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Imada E. L., Sanchez D. F., Collado-Torres L., Wilks C., Matam T., Dinalankara W., Stupnikov A., Lobo-Pereira F., Yip C. W., Yasuzawa K., Kondo N., Itoh M., Suzuki H., Kasukawa T., Hon C. C., de Hoon M. J. L., Shin J. W., Carninci P., Jaffe A. E., Leek J. T., Favorov A., Franco G. R., Langmead B., Marchionni L., Recounting the FANTOM CAGE-associated transcriptome. Genome Res. 30, 1073–1081 (2020).
- 44. Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A., Bridgland A., Meyer C., Kohl S. A. A., Ballard A. J., Cowie A., Romera-Paredes B., Nikolov S., Jain R., Adler J., Back T., Petersen S., Reiman D., Clancy E., Zielinski M., Steinegger M., Pacholska M., Berghammer T., Bodenstein S., Silver D., Vinyals O., Senior A. W., Kavukcuoglu K., Kohli P., Hassabis D., Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
- 45. Zunke F., Andresen L., Wesseler S., Groth J., Arnold P., Rothaug M., Mazzulli J. R., Krainc D., Blanz J., Saftig P., Schwake M., Characterization of the complex formed by β-glucocerebrosidase and the lysosomal integral membrane protein type-2. Proc. Natl. Acad. Sci. U.S.A. 113, 3791–3796 (2016).
- 46. Liou B., Haffey W. D., Greis K. D., Grabowski G. A., The LIMP-2/SCARB2 binding motif on acid β-glucosidase. J. Biol. Chem. 289, 30063–30074 (2014).
- 47. Deen M. C., Gilormini P. A., Vocadlo D. J., Strategies for quantifying the enzymatic activities of glycoside hydrolases within cells and in vivo. Curr. Opin. Chem. Biol. 77, 102403 (2023).
- 48. Zhu S., Deen M. C., Zhu Y., Gilormini P. A., Chen X., Davis O. B., Chin M. Y., Henry A. G., Vocadlo D. J., A fixable fluorescence-quenched substrate for quantitation of lysosomal glucocerebrosidase activity in both live and fixed cells. Angew. Chem. Int. Ed. 62, e202309306 (2023).
- 49. Rydbirk R., Østergaard O., Folke J., Hempel C., DellaValle B., Andresen T. L., Løkkegaard A., Hejl A. M., Bode M., Blaabjerg M., Møller M., Danielsen E. H., Salvesen L., Starhof C. C., Bech S., Winge K., Rungby J., Pakkenberg B., Brudek T., Olsen J. V., Aznar S., Brain proteome profiling implicates the complement and coagulation cascade in multiple system atrophy brain pathology. Cell. Mol. Life Sci. 79, 1–22 (2022).
- 50. Liu T., Zhu B., Liu Y., Zhang X., Yin J., Li X., Jiang L. L., Hodges A. P., Rosenthal S. B., Zhou L., Yancey J., McQuade A., Blurton-Jones M., Tanzi R. E., Huang T. Y., Xu H., Multi-omic comparison of Alzheimer’s variants in human ESC-derived microglia reveals convergence at APOE. J. Exp. Med. 217, e20200474 (2020).
- 51. Skene P. J., Henikoff S., An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. eLife 6, e21856 (2017).
- 52. Corey D. R., Nusinersen, an antisense oligonucleotide drug for spinal muscular atrophy. Nat. Neurosci. 20, 497–499 (2017).
- 53. Collado-Torres L., Nellore A., Kammers K., Ellis S. E., Taub M. A., Hansen K. D., Jaffe A. E., Langmead B., Leek J. T., Reproducible RNA-seq analysis using recount2. Nat. Biotechnol. 35, 319–321 (2017).
- 54.Dunham I., Kundaje A., Aldred S. F., Collins P. J., Davis C. A., Doyle F., Epstein C. B., Frietze S., Harrow J., Kaul R., Khatun J., Lajoie B. R., Landt S. G., Lee B. K., Pauli F., Rosenbloom K. R., Sabo P., Safi A., Sanyal A., Shoresh N., Simon J. M., Song L., Trinklein N. D., Altshuler R. C., Birney E., Brown J. B., Cheng C., Djebali S., Dong X., Ernst J., Furey T. S., Gerstein M., Giardine B., Greven M., Hardison R. C., Harris R. S., Herrero J., Hoffman M. M., Iyer S., Kellis M., Kheradpour P., Lassmann T., Li Q., Lin X., Marinov G. K., Merkel A., Mortazavi A., Parker S. C. J., Reddy T. E., Rozowsky J., Schlesinger F., Thurman R. E., Wang J., Ward L. D., Whitfield T. W., Wilder S. P., Wu W., Xi H. S., Yip K. Y., Zhuang J., Bernstein B. E., Green E. D., Gunter C., Snyder M., Pazin M. J., Lowdon R. F., Dillon L. A. L., Adams L. B., Kelly C. J., Zhang J., Wexler J. R., Good P. J., Feingold E. A., Crawford G. E., Dekker J., Elnitski L., Farnham P. J., Giddings M. C., Gingeras T. R., Guigó R., Hubbard T. J., Kent W. J., Lieb J. D., Margulies E. H., Myers R. M., Stamatoyannopoulos J. A., Tenenbaum S. A., Weng Z., White K. P., Wold B., Yu Y., Wrobel J., Risk B. A., Gunawardena H. P., Kuiper H. C., Maier C. W., Xie L., Chen X., Mikkelsen T. S., Gillespie S., Goren A., Ram O., Zhang X., Wang L., Issner R., Coyne M. J., Durham T., Ku M., Truong T., Eaton M. L., Dobin A., Tanzer A., Lagarde J., Lin W., Xue C., Williams B. A., Zaleski C., Röder M., Kokocinski F., Abdelhamid R. F., Alioto T., Antoshechkin I., Baer M. T., Batut P., Bell I., Bell K., Chakrabortty S., Chrast J., Curado J., Derrien T., Drenkow J., Dumais E., Dumais J., Duttagupta R., Fastuca M., Fejes-Toth K., Ferreira P., Foissac S., Fullwood M. J., Gao H., Gonzalez D., Gordon A., Howald C., Jha S., Johnson R., Kapranov P., King B., Kingswood C., Li G., Luo O. J., Park E., Preall J. B., Presaud K., Ribeca P., Robyr D., Ruan X., Sammeth M., Sandhu K. S., Schaeffer L., See L. 
H., Shahab A., Skancke J., Suzuki A. M., Takahashi H., Tilgner H., Trout D., Walters N., Wang H., Hayashizaki Y., Reymond A., Antonarakis S. E., Hannon G. J., Ruan Y., Carninci P., Sloan C. A., Learned K., Malladi V. S., Wong M. C., Barber G. P., Cline M. S., Dreszer T. R., Heitner S. G., Karolchik D., Kirkup V. M., Meyer L. R., Long J. C., Maddren M., Raney B. J., Grasfeder L. L., Giresi P. G., Battenhouse A., Sheffield N. C., Showers K. A., London D., Bhinge A. A., Shestak C., Schaner M. R., Kim S. K., Zhang Z. Z., Mieczkowski P. A., Mieczkowska J. O., Liu Z., McDaniell R. M., Ni Y., Rashid N. U., Kim M. J., Adar S., Zhang Z., Wang T., Winter D., Keefe D., Iyer V. R., Zheng M., Wang P., Gertz J., Vielmetter J., Partridge E. C., Varley K. E., Gasper C., Bansal A., Pepke S., Jain P., Amrhein H., Bowling K. M., Anaya M., Cross M. K., Muratet M. A., Newberry K. M., McCue K., Nesmith A. S., Fisher-Aylor K. I., Pusey B., DeSalvo G., Parker S. L., Balasubramanian S., Davis N. S., Meadows S. K., Eggleston T., Newberry J. S., Levy S. E., Absher D. M., Wong W. H., Blow M. J., Visel A., Pennachio L. A., Petrykowska H. M., Abyzov A., Aken B., Barrell D., Barson G., Berry A., Bignell A., Boychenko V., Bussotti G., Davidson C., Despacio-Reyes G., Diekhans M., Ezkurdia I., Frankish A., Gilbert J., Gonzalez J. M., Griffiths E., Harte R., Hendrix D. A., Hunt T., Jungreis I., Kay M., Khurana E., Leng J., Lin M. F., Loveland J., Lu Z., Manthravadi D., Mariotti M., Mudge J., Mukherjee G., Notredame C., Pei B., Rodriguez J. M., Saunders G., Sboner A., Searle S., Sisu C., Snow C., Steward C., Tapanari E., Tress M. L., van Baren M. J., Washietl S., Wilming L., Zadissa A., Zhang Z., Brent M., Haussler D., Valencia A., Addleman N., Alexander R. P., Auerbach R. K., Balasubramanian S., Bettinger K., Bhardwaj N., Boyle A. P., Cao A. R., Cayting P., Charos A., Cheng Y., Eastman C., Euskirchen G., Fleming J. D., Grubert F., Habegger L., Hariharan M., Harmanci A., Iyengar S., Jin V. 
X., Karczewski K. J., Kasowski M., Lacroute P., Lam H., Lamarre-Vincent N., Lian J., Lindahl-Allen M., Min R., Miotto B., Monahan H., Moqtaderi Z., Mu X. J., O’Geen H., Ouyang Z., Patacsil D., Raha D., Ramirez L., Reed B., Shi M., Slifer T., Witt H., Wu L., Xu X., Yan K. K., Yang X., Struhl K., Weissman S. M., Penalva L. O., Karmakar S., Bhanvadia R. R., Choudhury A., Domanus M., Ma L., Moran J., Victorsen A., Auer T., Centanin L., Eichenlaub M., Gruhl F., Heermann S., Hoeckendorf B., Inoue D., Kellner T., Kirchmaier S., Mueller C., Reinhardt R., Schertel L., Schneider S., Sinn R., Wittbrodt B., Wittbrodt J., Jain G., Balasundaram G., Bates D. L., Byron R., Canfield T. K., Diegel M. J., Dunn D., Ebersol A. K., Frum T., Garg K., Gist E., Hansen R. S., Boatman L., Haugen E., Humbert R., Johnson A. K., Johnson E. M., Kutyavin T. V., Lee K., Lotakis D., Maurano M. T., Neph S. J., Neri F. V., Nguyen E. D., Qu H., Reynolds A. P., Roach V., Rynes E., Sanchez M. E., Sandstrom R. S., Shafer A. O., Stergachis A. B., Thomas S., Vernot B., Vierstra J., Vong S., Wang H., Weaver M. A., Yan Y., Zhang M., Akey J. M., Bender M., Dorschner M. O., Groudine M., MacCoss M. J., Navas P., Stamatoyannopoulos G., Beal K., Brazma A., Flicek P., Johnson N., Lukk M., Luscombe N. M., Sobral D., Vaquerizas J. M., Batzoglou S., Sidow A., Hussami N., Kyriazopoulou-Panagiotopoulou S., Libbrecht M. W., Schaub M. A., Miller W., Bickel P. J., Banfai B., Boley N. P., Huang H., Li J. J., Noble W. S., Bilmes J. A., Buske O. J., Sahu A. D., Kharchenko P. V., Park P. J., Baker D., Taylor J., Lochovsky L., An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.D. Wyman, G. Balderrama-Gutierrez, F. Reese, S. Jiang, S. Rahmanian, S. Forner, D. Matheos, W. Zeng, B. Williams, D. Trout, W. England, S.-H. Chu, R. C. Spitale, A. J. Tenner, B. J. Wold, A. Mortazavi, A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. bioRxiv 672931 [Preprint] (2020). 10.1101/672931. [DOI]
- 56.Chen S., Zhou Y., Chen Y., Gu J., fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Dobin A., Davis C. A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T. R., STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 29, 15 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.di Tommaso P., Chatzou M., Floden E. W., Barja P. P., Palumbo E., Notredame C., Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017). [DOI] [PubMed] [Google Scholar]
- 59.Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Lawrence M., Huber W., Pagès H., Aboyoun P., Carlson M., Gentleman R., Morgan M. T., Carey V. J., Software for computing and annotating genomic ranges. PLOS Comput. Biol. 9, e1003118 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Wickham H., Averick M., Bryan J., Chang W., Mcgowan L. D. A., François R., Grolemund G., Hayes A., Henry L., Hester J., Kuhn M., Pedersen T. L., Miller E., Bache S. M., Müller K., Ooms J., Robinson D., Seidel D. P., Spinu V., Takahashi K., Vaughan D., Wilke C., Woo K., Yutani H., Welcome to the Tidyverse. J. Open Source Softw. 4, 1686 (2019). [Google Scholar]
- 62.Pertea M., Pertea G. M., Antonescu C. M., Chang T. C., Mendell J. T., Salzberg S. L., StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Sposito T., Preza E., Mahoney C. J., Setó-Salvia N., Ryan N. S., Morris H. R., Arber C., Devine M. J., Houlden H., Warner T. T., Bushell T. J., Zagnoni M., Kunath T., Livesey F. J., Fox N. C., Rossor M. N., Hardy J., Wray S., Developmental regulation of tau splicing is disrupted in stem cell-derived neurons from frontotemporal dementia patients with the 10 + 16 splice-site mutation in MAPT. Hum. Mol. Genet. 24, 5260–5269 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Arber C., Toombs J., Lovejoy C., Ryan N. S., Paterson R. W., Willumsen N., Gkanatsiou E., Portelius E., Blennow K., Heslegrave A., Schott J. M., Hardy J., Lashley T., Fox N. C., Zetterberg H., Wray S., Familial Alzheimer’s disease patient-derived neurons reveal distinct mutation-specific effects on amyloid beta. Mol. Psychiatry 25, 2919–2931 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Shi Y., Kirwan P., Smith J., Robinson H. P. C., Livesey F. J., Human cerebral cortex development from pluripotent stem cells to functional excitatory synapses. Nat. Neurosci. 15, 477–486 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Hall C. E., Yao Z., Choi M., Tyzack G. E., Serio A., Luisier R., Harley J., Preza E., Arber C., Crisp S. J., Watson P. M. D., Kullmann D. M., Abramov A. Y., Wray S., Burley R., Loh S. H. Y., Martins L. M., Stevens M. M., Luscombe N. M., Sibley C. R., Lakatos A., Ule J., Gandhi S., Patani R., Progressive motor neuron pathology and the role of astrocytes in a human stem cell model of VCP-related ALS. Cell Rep. 19, 1739–1749 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Liddelow S. A., Guttenplan K. A., Clarke L. E., Bennett F. C., Bohlen C. J., Schirmer L., Bennett M. L., Münch A. E., Chung W. S., Peterson T. C., Wilton D. K., Frouin A., Napier B. A., Panicker N., Kumar M., Buckwalter M. S., Rowitch D. H., Dawson V. L., Dawson T. M., Stevens B., Barres B. A., Neurotoxic reactive astrocytes are induced by activated microglia. Nature 541, 481–487 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Xiang X., Piers T. M., Wefers B., Zhu K., Mallach A., Brunner B., Kleinberger G., Song W., Colonna M., Herms J., Wurst W., Pocock J. M., Haass C., The Trem2 R47H Alzheimer’s risk variant impairs splicing and reduces Trem2 mRNA and protein in mice but not in humans. Mol. Neurodegener. 13, 1–14 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Köster J., Mölder F., Jablonski K. P., Letcher B., Hall M. B., Tomkins-Tinch C. H., Sochat V., Forster J., Lee S., Twardziok S. O., Kanitz A., Wilm A., Holtgrewe M., Rahmann S., Nahnsen S., Sustainable data analysis with Snakemake. F1000Research 10, 33 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Li H., Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Tardaguila M., de La Fuente L., Marti C., Pereira C., Pardo-Palacios F. J., del Risco H., Ferrell M., Mellado M., Macchietto M., Verheggen K., Edelmann M., Ezkurdia I., Vazquez J., Tress M., Mortazavi A., Martens L., Rodriguez-Navarro S., Moreno-Manzano V., Conesa A., SQANTI: Extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res. 28, 396–411 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Gustavsson E. K., Zhang D., Reynolds R. H., Garcia-Ruiz S., Ryten M., ggtranscript: An R package for the visualization and interpretation of transcript isoforms using ggplot2. Bioinformatics 38, 3844–3846 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Södersten E., Toskas K., Rraklli V., Tiklova K., Björklund Å. K., Ringnér M., Perlmann T., Holmberg J., A comprehensive map coupling histone modifications with gene regulation in adult dopaminergic and serotonergic neurons. Nat. Commun. 9, 1–16 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Zheng G. X. Y., Terry J. M., Belgrader P., Ryvkin P., Bent Z. W., Wilson R., Ziraldo S. B., Wheeler T. D., McDermott G. P., Zhu J., Gregory M. T., Shuga J., Montesclaros L., Underwood J. G., Masquelier D. A., Nishimura S. Y., Schnall-Levin M., Wyatt P. W., Hindson C. M., Bharadwaj R., Wong A., Ness K. D., Beppu L. W., Deeg H. J., McFarland C., Loeb K. R., Valente W. J., Ericson N. G., Stevens E. A., Radich J. P., Mikkelsen T. S., Hindson B. J., Bielas J. H., Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8, 14049 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Skene P. J., Henikoff J. G., Henikoff S., Targeted in situ genome-wide profiling with high efficiency for low cell numbers. Nat. Protoc. 13, 1006–1019 (2018). [DOI] [PubMed] [Google Scholar]
- 76.Langmead B., Salzberg S. L., Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Ramírez F., Dündar F., Diehl S., Grüning B. A., Manke T., deepTools: A flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 42, W187–W191 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Parlar S. C., Grenn F. P., Kim J. J., Baluwendraat C., Gan-Or Z., Classification of GBA1 variants in Parkinson’s disease: The GBA1-PD browser. Mov. Disord. 38, 489–495 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Solntsev S. K., Shortreed M. R., Frey B. L., Smith L. M., Enhanced global post-translational modification discovery with MetaMorpheus. J. Proteome Res. 17, 1844–1851 (2018). [DOI] [PubMed] [Google Scholar]
- 80.García-Ruiz S., Gustavsson E. K., Zhang D., Reynolds R. H., Chen Z., Fairbrother-Browne A., Gil-Martínez A. L., Botia J. A., Collado-Torres L., Ryten M., IntroVerse: A comprehensive database of introns across human tissues. Nucleic Acids Res. 51, D167–D178 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Wilks C., Ahmed O., Baker D. N., Zhang D., Collado-Torres L., Langmead B., Megadepth: Efficient coverage quantification for BigWigs and BAMs. Bioinformatics 37, 3014–3016 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Amemiya H. M., Kundaje A., Boyle A. P., The ENCODE blacklist: Identification of problematic regions of the genome. Sci. Rep. 9, 1–5 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]