Abstract
The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3’ end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains.
To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3’ processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
Introduction
Most mammalian genes produce multiple distinct transcript isoforms1. This transcript structure diversity is governed by promoter selection, splicing, and polyA site selection, which respectively dictate the transcript start site (TSS), exon junction chain (the unique series of exon-exon junctions used in a transcript), and transcript end site (TES, and the resulting 3’ UTR) used in the final transcript. Each of these processes is highly regulated and is subject to a different set of evolutionary pressures2–5. In protein coding genes, missplicing can lead to nonfunctional transcripts by disrupting canonical reading frames or introducing premature stop codons that predispose the transcript to nonsense mediated decay (NMD). Conversely, the cellular machinery involved in promoter or polyA site selection for protein coding genes is only constrained by the need to include start and stop codons for the correct open reading frame (ORF) in the final mRNA product.
Transcript structure diversity poses challenges for both basic and preclinical biology. As computational gene prediction and manual curation efforts have identified ever more transcripts for many genes6,7, a common assumption in genomics and medical genetics is that we only need to consider one or at most a handful of representative transcripts per gene such as those from the MANE (Matched Annotation from NCBI and EMBL-EBI) project8. MANE transcripts are chosen with respect to their expression levels in biologically-relevant samples and sequence conservation of the coding regions, and are perfectly matched between NCBI and ENSEMBL with explicit attention to the 5’ and 3’ ends. This decision to focus on one transcript per gene was driven in part by the difficulties in transcript assembly using ESTs and short-read RNA-seq, which is the assay used for most bulk and single-cell RNA-seq experiments9,10. The advent of long-read platforms heralded the promise of full-length transcript sequencing to identify expressed transcript isoforms, thus potentially bypassing the error-prone transcript assembly step11,12. However, as long-read RNA-seq (LR-RNA-seq) produces more novel candidate transcripts, there is a need to find organizational principles that will allow us to cope with the diversity of transcripts observed at some gene loci in catalogs such as GENCODE7, while at the same time distinguishing the genes that do not seem to undergo any alternative splicing.
Short-read RNA-seq has been the core assay for measuring gene expression in the second and third phases of the ENCODE project for all RNA biotypes, regardless of their lengths, in both human and mouse samples13–15. Short-read RNA-seq has also been used by many groups to comprehensively characterize TSS usage16, splicing17, and TES usage18, but the challenges of transcript assembly given the combinatorial nature of the problem have precluded a definitive assessment of the transcripts present. In addition to continuing Illumina-based short-read sequencing of mRNA and microRNA, the fourth phase of ENCODE (ENCODE4) adds matching LR-RNA-seq using the Pacific Biosciences Sequel 1 and 2 platforms in a set of human and mouse primary tissues and cell lines in order to identify and quantify known and novel transcript isoforms expressed across a diverse set of samples. We report the resulting ENCODE4 human and mouse transcriptome datasets. We implement a novel triplet scheme that captures essential differences in 5’ end choice, splicing, and 3’ usage, which allows us to categorize genes based on features driving their transcript structure diversity using a new software package called Cerberus. We introduce the gene structure simplex as an intuitive coordinate system for comparing transcript usage between genes and across samples. We then compare transcript usage between orthologous genes in human and mouse and identify substantial differences in transcript diversity for over half the genes.
Results
The ENCODE4 RNA dataset.
This LR-RNA-seq study profiled 81 tissues or cell lines by using the PacBio sequencing platform on 264 human and mouse libraries that include replicate samples and multiple human tissue donors (Tables S1–2). Setting aside the seven postnatal timepoints in mouse, these samples represent 49 unique tissues or cell types across human and mouse (Fig. 1a, Fig. S1). In addition, we sequenced matching human short-read RNA-seq (Fig. S1c) and microRNA-seq (Fig. S2, Supplementary results) for most samples as well as for an additional 37 samples that were sequenced with short-read RNA-seq only. We detect the vast majority of polyA genes (those with biotype protein coding, pseudogene, or lncRNA) whether we restrict the analysis to short-read samples that have matching data in the LR-RNA-seq dataset (93.9% of GENCODE v40 polyA genes and 90.6% of protein coding genes) or use all of the short-read samples (Fig. 1b, Fig. S3a). Of all expressed genes, 31.1% are detected in most (>90%) of the samples, and 34.0% are detected more specifically (<10% of samples) (Fig. 1c, Fig. S3b).
Figure 1. Overview of the ENCODE4 RNA datasets.
a, Overview of the sampled tissues and number of libraries from each tissue in the ENCODE human LR-RNA-seq dataset. b, Percentage of GENCODE v40 polyA genes by gene biotype detected in at least one ENCODE short-read RNA-seq library from samples that match the LR-RNA-seq at > 0 TPM, >=1 TPM, and >= 100 TPM. c, Number of samples in which each GENCODE v40 gene is detected >= 1 TPM in the ENCODE short-read RNA-seq dataset from samples that match the LR-RNA-seq. d, Data processing pipeline for the LR-RNA-seq data. e, Percentage of GENCODE v40 polyA genes by gene biotype detected in at least one ENCODE human LR-RNA-seq library at > 0 TPM, >= 1 TPM, and >= 100 TPM. f, Number of samples in which each GENCODE v40 gene is detected >= 1 TPM in the ENCODE human LR-RNA-seq dataset. g, Boxplot of TPM of polyA genes at the indicated rank in each human LR-RNA-seq library. Not significant (no stars) P > 0.05; *P <= 0.05, **P <= 0.01, ***P <= 0.001, ****P <= 0.0001; Wilcoxon rank-sum test.
For each LR-RNA-seq dataset, we first mapped the reads using Minimap219 and corrected non-canonical splice junctions and small indels using TranscriptClean20, after which we ran TALON21 and LAPA22 to identify each transcript by its exon junction chain and assign each transcript a supported 5’ and 3’ end. Finally, to catalog transcript features and summarize transcript structure diversity in our datasets, we ran Cerberus, which is described below. It is important to emphasize that this pipeline (Fig. 1d) does not attempt to assemble reads, so that every reported known transcript is observed from 5’ to 3’ end in at least one read. We further required support from multiple reads for defining valid ends. Overall this is a conservative pipeline that was designed to detect and quantify robust novel and known transcripts (Materials and Methods).
Our LR-RNA-seq reads are oligo-dT primed and we therefore expect to see high detection of transcripts from polyA genes, which we define as belonging to the protein coding, lncRNA, or pseudogene GENCODE-annotated biotypes, across our datasets. Consistent with this expectation, we detect 75.9% of annotated GENCODE v40 polyA genes and 93.7% of protein coding genes at >= 1 TPM in at least one library in our human dataset (Fig. 1e). The overwhelming majority of undetected polyA genes are pseudogenes and lncRNAs, which are likely to be either lowly expressed or completely unexpressed in the tissues assayed. As expected, GO analysis of the undetected protein coding genes yielded biological processes such as smell and taste-related sensory processes that represent genes specifically expressed in tissues that we did not assay (Fig. S3c). We find that many genes are either expressed in a sample-specific manner (27.8% in <10% of samples) or are ubiquitously expressed across many samples (28.2% in >90% of samples), consistent with the short-read samples (Fig. 1f).
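The breadth-of-expression summary above can be illustrated with a short sketch. It assumes a hypothetical gene-by-sample TPM table stored as a plain dictionary; the function names and toy values are illustrative and are not taken from the study's pipeline.

```python
def detection_breadth(tpm_by_gene, min_tpm=1.0):
    """Return, per gene, the fraction of samples in which it is detected.

    tpm_by_gene: dict mapping gene name -> list of TPM values, one per sample
    (hypothetical layout). A gene counts as detected when TPM >= min_tpm.
    """
    breadth = {}
    for gene, tpms in tpm_by_gene.items():
        detected = sum(1 for t in tpms if t >= min_tpm)
        breadth[gene] = detected / len(tpms)
    return breadth


def classify_breadth(fraction):
    """Label a gene by how widely it is detected across samples."""
    if fraction > 0.9:
        return "ubiquitous"        # detected in >90% of samples
    if fraction < 0.1:
        return "sample-specific"   # detected in <10% of samples
    return "intermediate"


# Toy data: 20 samples per gene (values are illustrative only).
toy = {
    "GENE_A": [5.0] * 20,                # detected in every sample
    "GENE_B": [12.0] + [0.0] * 19,       # detected in 1 of 20 samples
    "GENE_C": [2.0] * 10 + [0.2] * 10,   # detected in half of the samples
}
breadth = detection_breadth(toy)
labels = {gene: classify_breadth(f) for gene, f in breadth.items()}
```

The 1 TPM detection threshold and the 10% and 90% cutoffs follow the text; how ties at exactly 10% or 90% are handled is an assumption of this sketch.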
Transcriptionally active regions that are absent from GENCODE are candidates for novel genes. Applying conservative thresholds that included a requirement for one or more reproducible splice junctions (Supplementary methods), we found 214 novel candidate genes with at least one spliced transcript isoform expressed >= 1 TPM in human and 96 in mouse at our existing sequencing depth. Applying the same criteria to annotated polyA genes, we find 20,716 and 18,971 genes for human and mouse respectively, meaning that plausible novel genes constitute approximately 1.0% in human and 0.5% in mouse. We subsequently focus analysis on transcripts from known polyA genes.
We then examined the distribution of gene expression values across our human LR-RNA-seq dataset to characterize the abundance of genes and to assess whether we would be able to measure differences in transcript abundance at our current sequencing depths. For each library, we ranked each gene by TPM and found that the most highly expressed genes have higher TPMs in primary tissue-derived libraries than cell line-derived libraries (Fig. 1g, Supplementary methods). In particular, the tissue-derived liver libraries have the most highly expressed genes at ranks 1, 5, and 10, which include ALB and FTL, as expected. We also observed that the top 1,000 genes expressed in all but one liver library are expressed >= 100 TPM and that the top 5,000 are expressed >= 10 TPM. We can therefore confidently measure major transcript expression usage with a conservative threshold of 10 TPM for at least a third of expressed genes in each sample.
From genes expressed >= 10 TPM, we are able to capture over half (54.0%) of MANE transcripts that are 9–12 kb long (Fig. S3d). Coupled with our read length profiles, we estimate that we can reliably sequence the 99.7% of annotated GENCODE v40 polyA transcripts that are <12 kb long from end-to-end if they are highly expressed (Fig. S3d–g). In mouse, we observe similar read length profiles, sample separation, and gene detection patterns (Fig. S3h–j), including detection of 84.9% of annotated GENCODE vM25 protein coding genes at >= 1 TPM (Fig. S3h). In summary, we are able to detect most of human and mouse protein coding genes in our ENCODE LR-RNA-seq datasets at similar rates to short-read RNA-seq, and our long reads are long enough to capture the vast majority of annotated polyA transcripts.
Different sources of transcript structure diversity.
We compared the transcript start sites (TSSs), exon junction chains (ECs), and transcript end sites (TESs) observed in the human LR-RNA-seq data with other prior assays of these features and with established catalogs of these features. Cerberus is designed to identify unique TSSs, ECs, and TESs from a wide variety of inputs that include LR-RNA-seq data, reference atlases, and external transcriptional assays such as CAGE, PAS-seq, and the GTEx LR-RNA-seq dataset18,23 (Supplementary methods). Cerberus numbers each TSS, EC, and TES (triplet features) based on the annotation status of the transcript that it came from (e.g. the most confidently annotated will be numbered first) as well as the order in which each source was provided (Supplementary methods). Cerberus outputs genomic regions for each unique TSS and TES and a list of coordinates for each unique EC. In all cases, the gene of origin is also annotated (Fig. S4a). Using the integrated series of cataloged triplet features, Cerberus assigns a TSS, EC, and TES to each unique transcript model to create a transcript identifier of the form Gene[X,Y,Z], which we call the transcript triplet (Fig. S4b, Fig. 2a). This strategy distinguishes the structure of two different transcripts from the same gene solely on the basis of their transcript triplets. Additionally, it gives us the ability to sum up the expression of TSSs, ECs, and TESs across the transcripts they come from to enable quantification of promoter usage, EC usage, and polyA site usage respectively.
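The triplet naming scheme and the feature-level quantification it enables can be sketched as follows. This is a minimal illustration, not Cerberus itself: the identifier pattern follows the Gene[X,Y,Z] form described above, while the function names, the dictionary layout, and the toy TPM values are assumptions made for the example.

```python
import re
from collections import defaultdict

# Pattern for identifiers of the form Gene[X,Y,Z] (TSS, EC, TES numbers).
TRIPLET_RE = re.compile(
    r"^(?P<gene>[^\[\]]+)\[(?P<tss>\d+),(?P<ec>\d+),(?P<tes>\d+)\]$"
)


def parse_triplet(transcript_id):
    """Split a transcript triplet identifier into (gene, tss, ec, tes)."""
    m = TRIPLET_RE.match(transcript_id)
    if m is None:
        raise ValueError(f"not a transcript triplet: {transcript_id!r}")
    return m["gene"], int(m["tss"]), int(m["ec"]), int(m["tes"])


def feature_expression(transcript_tpm):
    """Sum transcript TPMs over the TSS, EC, and TES each transcript uses."""
    totals = {
        "tss": defaultdict(float),
        "ec": defaultdict(float),
        "tes": defaultdict(float),
    }
    for transcript_id, tpm in transcript_tpm.items():
        gene, tss, ec, tes = parse_triplet(transcript_id)
        totals["tss"][(gene, tss)] += tpm
        totals["ec"][(gene, ec)] += tpm
        totals["tes"][(gene, tes)] += tpm
    return totals


# Toy expression values for three transcripts of one hypothetical gene.
toy_tpm = {"GeneA[1,1,1]": 10.0, "GeneA[1,2,1]": 5.0, "GeneA[2,3,1]": 2.5}
totals = feature_expression(toy_tpm)
```

In this toy case the first TSS accumulates expression from two transcripts, which is the sense in which promoter usage can be read off from transcript-level quantification.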
Figure 2. Triplet annotation of transcript structure maps diversity within and across samples.
a, Representation of structure and transcript triplet naming convention for 3 different transcripts from the same gene based on the transcript start site (TSS), exon junction chain (EC), and transcript end site (TES) used. b-d, Triplet features detected >= 1 TPM in human ENCODE LR-RNA-seq from GENCODE v40 polyA genes broken out by novelty and support. Known features are annotated in GENCODE v29 or v40. Novel supported features are supported by b, CAGE or RAMPAGE c, GTEx, d, PAS-seq or the PolyA Atlas. e-g, Triplet features detected >= 1 TPM in human ENCODE LR-RNA-seq per GENCODE v40 polyA gene split by gene biotype for e, TSSs, f, ECs, g, TESs. h, Number of transcripts from GENCODE v40 polyA genes detected >= 1 TPM from human ENCODE LR-RNA-seq that have a known EC split by gene biotype. i, Novelty characterization of triplet features in each transcript detected >= 1 TPM in the human ENCODE LR-RNA-seq. j, Number of transcripts detected >= 1 TPM in human ENCODE LR-RNA-seq per GENCODE v40 polyA gene split by gene biotype. k, COL1A1 (gene expressed at 548 TPM) transcripts expressed >= 1 TPM in the ovary sample from human ENCODE LR-RNA-seq. l, PKM (gene expressed at 506 TPM) transcripts expressed >= 1 TPM in the ovary sample from human ENCODE LR-RNA-seq colored by expression level (TPM). m, Expression level of gene (TPM) versus the percent isoform (pi) value of the predominant transcript for each gene expressed >= 1 TPM from human ENCODE LR-RNA-seq in the ovary sample. Points are colored by whether or not pi = 100. n, Number of unique predominant transcripts detected >= 1 TPM across samples per gene.
We applied Cerberus to the ENCODE human LR-RNA-seq data and to annotations from GENCODE v40 and v29 to obtain transcript triplets for the transcripts present in each transcriptome. Cerberus labels each triplet feature as known if it is detected in a reference set (here defined as transcripts derived from GENCODE) of transcripts or novel if not. Additionally, using the information from the GENCODE reference transcriptomes, we assign the triplet [1,1,1] to the MANE transcript isoform for the gene, if it has one.
Altogether, we detected 206,806 transcripts expressed >= 1 TPM from polyA genes, 76,469 of which have exon junction chains unannotated in GENCODE v29 or v40. From these transcripts, we first sought to characterize the observed triplet features (expressed >= 1 TPM in at least one library from polyA genes) in our dataset (Fig. 2b–d, Fig. S5a–f). We found that 18.0% of TSSs, 37.3% of ECs, and 22.1% of TESs are novel compared to both GENCODE v29 and v40 (Fig. 2b–d). We furthermore determined whether any novel triplet features were supported by sources outside of the GENCODE reference. We used CAGE and RAMPAGE data to support TSSs, GTEx transcripts to support ECs, and PAS-seq and the PolyA Atlas regions to support the TESs (Supplementary methods). Of the novel triplet features, 42.8% of TSSs, 17.9% of ECs, and 79.0% of TESs were supported by at least one external dataset (Fig. 2b–d, Fig. S5a–c). While the intermediate transcriptome (in general transfer format; GTF) for our LR-RNA-seq dataset from LAPA has single-base transcript ends, the majority of our Cerberus TSSs and TESs derived from the LR-RNA-seq data are 101 bp in length and 99.8% are shorter than 500 bp, which is consistent with how Cerberus extends TSSs and TESs derived from GTFs by n bp (here n=50) on either side (Fig. S4a, Fig. S5d–e, Supplementary methods).
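The 101 bp region length follows directly from padding a single-base end by n = 50 bp on each side. A minimal sketch of that arithmetic is given below; the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and any merging of overlapping regions is not modeled.

```python
def extend_end(chrom, position, n=50):
    """Pad a single-base transcript end by n bp on either side.

    Coordinates are treated as 1-based and inclusive (an assumption of this
    sketch); the start is clipped so it never falls below position 1.
    """
    start = max(1, position - n)
    end = position + n
    return chrom, start, end


def region_length(region):
    """Length in bp of an inclusive (chrom, start, end) region."""
    _, start, end = region
    return end - start + 1


region = extend_end("chr1", 1000)   # ("chr1", 950, 1050)
length = region_length(region)      # 50 + 1 + 50 = 101
```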
We further annotated the novelty of ECs compared to GENCODE using the nomenclature from SQANTI24. For detected ECs from polyA genes, we find that the majority (62.7%) of ECs were already annotated in either GENCODE v29 or v40. Novel ECs are primarily annotated as either NIC (16.1%; novel in catalog, defined as having a novel combination of known splice sites) or NNC (11.6%; novel not in catalog, defined as having at least one novel splice site) (Fig. S5f). Given the high external support for our triplet features, we were also able to predict CAGE and RAMPAGE support for our long-read derived TSSs both within and across cell types using logistic regression (Fig. S6, Supplementary results). In aggregate, a majority of the triplet features observed in our LR-RNA-seq data were present in prior annotations or have external support from additional assays.
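The distinction between known, NIC, and NNC exon junction chains described above can be sketched with a deliberately simplified classifier. It is not SQANTI: it only applies the two definitions quoted in the text, pools donor and acceptor positions into one set of known splice sites, and ignores every other category. All names and coordinates are hypothetical.

```python
def classify_ec(chain, known_chains, known_sites):
    """Classify an exon junction chain against a reference annotation.

    chain: tuple of (donor, acceptor) coordinate pairs, one per junction.
    known_chains: set of annotated chains in the same representation.
    known_sites: set of annotated splice-site coordinates (donors and
    acceptors pooled, a simplification made for this sketch).
    """
    if chain in known_chains:
        return "known"
    sites = {pos for junction in chain for pos in junction}
    if sites <= known_sites:
        return "NIC"   # novel combination of known splice sites
    return "NNC"       # at least one novel splice site


# Toy reference: two annotated chains for one hypothetical gene.
reference_chains = {
    ((100, 200), (300, 400)),
    ((100, 250), (300, 400)),
}
reference_sites = {pos for c in reference_chains for j in c for pos in j}

calls = {
    "annotated": classify_ec(((100, 200), (300, 400)), reference_chains, reference_sites),
    "exon_skip": classify_ec(((100, 400),), reference_chains, reference_sites),
    "new_site": classify_ec(((100, 220), (300, 400)), reference_chains, reference_sites),
}
```

The exon-skipping chain reuses only annotated splice sites and is therefore NIC, whereas the chain with position 220 introduces an unannotated site and is NNC.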
We then examined the number of observed triplet features per gene. We find that most protein coding genes (89.8%) express more than one triplet feature across our dataset (Fig. 2e–g). By contrast, only 33.7% of lncRNAs and 14.4% of pseudogenes express more than one transcript, and therefore more than one triplet feature, per gene. These biotypes exhibit far less transcript structure diversity than protein coding genes. Overall, we find that our observed triplet features are individually well-supported by external annotations and assays. We also show that protein coding genes are far more likely to have more than one triplet feature than lncRNAs and pseudogenes.
The ENCODE4 LR-RNA-seq transcriptome.
Following our characterization of individual triplet features, we moved on to examining our full-length transcripts. We first note that most observed transcripts with known ECs belong to the protein coding biotype (Fig. 2h). In contrast to the gene-level analysis, transcripts with known ECs are expressed in a more sample-specific manner, with 49.0% expressed in <10% of samples and 4.4% expressed in >90% of samples (Fig. S7a, Fig. 1f). Of the remaining protein coding transcripts with novel ECs, 53.0% are predicted to have complete ORFs that are not subject to nonsense mediated decay (Supplementary methods). Examining detected transcripts in our dataset based on the novelty of each constituent triplet feature, we find that 52.0% of observed transcripts have each of their triplet features annotated, and that more transcripts contain novel TSSs than novel TESs (Fig. 2i). Consistent with our observation that protein coding genes generally have more than one triplet feature per gene compared to the other polyA biotypes, we find that most protein coding genes also have more than one transcript per gene (Fig. 2j, e–g). We investigated the extent to which lower expression levels of lncRNAs contribute to their overall lower transcript diversity compared to protein coding genes. We found that the 137 lncRNAs expressed >100 TPM in one or more samples have the same median number of expressed transcripts per gene as the 8,436 protein coding genes at the same expression level (median = 7) (Fig. S7b, Supplementary methods). Therefore the lower reported overall diversity of lncRNAs is due to a combination of their lower expression levels and our sequencing depth.
We compared the number of TSSs and TESs per EC among the transcripts annotated in GENCODE v40 with the number among our observed transcripts. We found that in GENCODE v40, each multiexonic EC has a maximum of 3 TSSs or TESs across the polyA transcripts and that the overwhelming majority of ECs are only annotated with 1 TSS and TES (99.7% and 99.4% respectively) (Fig. S7c–d). In contrast, our strategy of transcriptome annotation yields a substantial increase in the number of distinct TSSs and TESs observed per EC, which more accurately reflects the biology of the coordination of promoter choice, polyA site selection, and splicing across the diverse samples in our dataset (Fig. S7e–f). The effect of this increase in annotated TSSs and TESs is also apparent when analyzing our transcripts using traditional alternative splicing event detection methods, which are not written to consider the more subtle differences in transcript structure at the 5’ and 3’ end (Fig. S8, Supplementary results).
Predominant transcript structure differs across tissues and cell types.
Different multiexonic genes with similar expression levels within the same sample can exhibit vastly different levels of transcript structure diversity. For instance, the genes COL1A1 and PKM have a high number of exons (60 and 47 exons, respectively across our entire human dataset) and are highly expressed in ovary (548 and 506 TPM respectively). Yet, we detect only one 6.9 kb long transcript for COL1A1 (Fig. 2k) whereas we detect 18 transcript isoforms that vary on the basis of their TSSs, ECs, and TESs for PKM (Fig. 2l).
We then asked what fraction of overall gene expression is accounted for by the predominant transcript, which is the most highly expressed transcript for a gene in a given sample. Comparing the TPM of genes expressed in ovary to the percent isoform (pi; the percentage of reads from a gene that come from a given transcript)25 of the predominant transcript, we find that 19.5% of protein coding genes expressed >100 TPM have a predominant transcript that accounts for less than 50% of the reads, and therefore are highly expressed with high transcript structure diversity. Conversely, 26.8% of protein coding genes are expressed >100 TPM and have a predominant transcript that accounts for more than 90% of the expression of the gene (Fig. 2m). Globally, we generated a catalog of predominant transcripts for each sample. The median number of predominant transcripts per protein coding gene across samples was 2, and 73.0% of protein coding genes have more than one predominant transcript across the samples surveyed (Fig. 2n). Thus, the majority of human protein coding genes use a different predominant transcript in at least one condition represented in our sample collection.
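The percent isoform calculation and the selection of a predominant transcript can be sketched as follows. The sketch uses transcript TPMs as a stand-in for read counts, which is an assumption, and the function names and toy values are hypothetical. Ties are broken by dictionary order here, a choice the text does not specify.

```python
def percent_isoform(transcript_tpm):
    """Percent of a gene's expression contributed by each transcript (pi)."""
    total = sum(transcript_tpm.values())
    if total == 0:
        return {}
    return {tid: 100.0 * tpm / total for tid, tpm in transcript_tpm.items()}


def predominant_transcript(transcript_tpm):
    """Return (transcript_id, pi) for the most highly expressed transcript."""
    pi = percent_isoform(transcript_tpm)
    if not pi:
        return None
    top = max(pi, key=pi.get)
    return top, pi[top]


def unique_predominant(samples):
    """Set of predominant transcripts for one gene across samples."""
    found = set()
    for transcript_tpm in samples.values():
        result = predominant_transcript(transcript_tpm)
        if result is not None:
            found.add(result[0])
    return found


# One hypothetical gene measured in two samples (toy TPM values).
gene_samples = {
    "sample_1": {"G[1,1,1]": 60.0, "G[1,2,1]": 30.0, "G[2,3,1]": 10.0},
    "sample_2": {"G[1,1,1]": 5.0, "G[1,2,1]": 15.0},
}
top_1 = predominant_transcript(gene_samples["sample_1"])
switches = unique_predominant(gene_samples)
```

A gene such as this one, whose predominant transcript differs between the two samples, would count toward the fraction of genes with more than one predominant transcript.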
Quantifying transcript structure diversity across samples using gene triplets and the gene structure simplex.
We developed a framework to systematically characterize and quantify the diversity between the detected transcripts from each gene by computing a summary gene triplet, which is related to but distinct from transcript triplets. For each set of transcripts from a given gene, we count the number of unique TSSs, ECs, and TESs (Fig. 3a, Fig. S9). As the number of exon junction chains is naturally linked to the number of alternative TSSs or TESs (for instance, a new TSS with a different splice donor will lead to a novel EC regardless of similarities in downstream splicing), we calculate the splicing ratio as N_EC / ((N_TSS + N_TES) / 2), where N_TSS, N_EC, and N_TES are the numbers of unique TSSs, ECs, and TESs, to more fairly assess the contribution of ECs to transcript diversity in each gene (Fig. 3a, Fig. S9). We then compute the proportion of transcript diversity that arises from each source of variation: alternative TSS usage, alternative TES usage, or internal splicing (Fig. 3a). Representing these numbers as proportions allows us to plot them as coordinates in a two-dimensional gene structure simplex (Fig. 3b, Fig. S9). This enables us to visualize how transcripts from a gene typically differ from one another and categorize genes based on their primary driver of transcript structure diversity. Genes with a high proportion of transcripts characterized by alternative TSS usage (>0.5) will fall into the TSS-high sector of the simplex, those with a high proportion of transcripts characterized by alternative TES usage (>0.5) will fall into the TES-high sector of the simplex, and those with a high proportion of transcripts characterized by internal splicing (>0.5) will fall into the splicing-high portion of the simplex. Genes with more than one transcript that do not display a strong preference for one mode over the other lie in the mixed sector, and genes with just one transcript are in the center of the simplex, henceforth the simple sector (Fig. 3a–b, Fig. S9, Supplementary methods).
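The gene triplet, splicing ratio, simplex coordinates, and sector assignment can be sketched in a few lines. The sketch assumes that the splicing ratio is the EC count divided by the mean of the TSS and TES counts, and that each simplex coordinate is the corresponding value divided by the sum of the three; both should be checked against the Supplementary methods. Function names are hypothetical.

```python
def gene_triplet(transcript_triplets):
    """Count unique TSSs, ECs, and TESs among a gene's transcript triplets."""
    triplets = list(transcript_triplets)
    n_tss = len({t[0] for t in triplets})
    n_ec = len({t[1] for t in triplets})
    n_tes = len({t[2] for t in triplets})
    return n_tss, n_ec, n_tes


def simplex_coords(n_tss, n_ec, n_tes):
    """Return (TSS, splicing, TES) proportions that sum to 1."""
    # Assumed normalization: EC count relative to the mean number of ends.
    splicing_ratio = n_ec / ((n_tss + n_tes) / 2)
    total = n_tss + splicing_ratio + n_tes
    return n_tss / total, splicing_ratio / total, n_tes / total


def sector(triplet):
    """Assign a gene triplet to a simplex sector using the >0.5 rule."""
    if triplet == (1, 1, 1):
        return "simple"   # a single transcript sits at the simplex center
    tss, splicing, tes = simplex_coords(*triplet)
    if tss > 0.5:
        return "TSS-high"
    if splicing > 0.5:
        return "splicing-high"
    if tes > 0.5:
        return "TES-high"
    return "mixed"


# Toy gene whose four transcripts differ mainly in their start sites.
toy_triplet = gene_triplet([(1, 1, 1), (2, 2, 1), (3, 3, 1), (4, 4, 1)])
toy_sector = sector(toy_triplet)

# A gene triplet of (1, 18, 1) gives coordinates close to (0.05, 0.9, 0.05).
splice_heavy = simplex_coords(1, 18, 1)
```

Under these assumptions a single-transcript gene has coordinates of one third on each axis, which matches its placement at the center of the simplex.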
Figure 3. The gene structure simplex represents distinct modes of transcript structure diversity across genes and samples.
a, Transcripts for 5 model genes; 1 of each sector (TSS-high, splicing-high, TES-high, mixed, and simple). Table shows the gene triplet, splicing ratio gene triplet, and simplex coordinates that correspond to each toy gene. b, Layout of the gene structure simplex with the genes from a, plotted based on their simplex coordinates. Proportion of TSS usage is the blue axis (left), proportion of TES is the orange axis (bottom), and proportion of splicing ratio is the pink axis (right). Regions of the simplex are colored and labeled based on their sector category (TSS-high, splicing-high, TES-high). Gene triplets that land in each sector are assigned the concordant sector category. c-e, Gene structure simplices for the transcripts from protein coding genes that are c, annotated in GENCODE v40 where the parent gene is also detected in our human LR-RNA-seq dataset, d, the observed set of transcripts, those detected >= 1 TPM in the human ENCODE LR-RNA-seq dataset, e, the observed major set of transcripts, the union of major transcripts from each sample detected >= 1 TPM in the human ENCODE LR-RNA-seq dataset. f-j, Proportion of genes from the GENCODE v40, observed, and observed major sets that fall into the f, TSS-high sector, g, splicing-high sector, h, TES-high sector, i, mixed sector, j, simple sector. k, Gene structure simplex for AKAP8L. Gene triplets with splicing ratio for H9 and H9-derived pancreatic progenitors labeled. Simplex coordinates for the GENCODE v40, observed set, and centroid of the samples also shown for AKAP8L. l-m, Transcripts of AKAP8L expressed >= 1 TPM in l, H9 m, H9-derived pancreatic progenitors colored by expression level in TPM. Alternative exons that differ between transcripts are colored pink.
We first used the gene structure simplex to compare different transcriptomes. We computed gene triplets for protein coding genes for the following transcriptomes: GENCODE v40 transcripts from genes we detect in our LR-RNA-seq dataset; observed transcripts in our LR-RNA-seq dataset (observed); and the union of detected major transcripts (observed major), which we define as the set of most highly expressed transcripts per gene in a sample that are cumulatively responsible for over 90% of that gene’s expression in any of our LR-RNA-seq samples (Fig. 3c–j, Supplementary methods). The observed and observed major gene triplets describe the diversity of transcription in each gene across all samples in the dataset. Unsurprisingly, GENCODE genes show less density in the TSS or TES sectors of the simplex, largely because the main focus of GENCODE is to annotate unique ECs rather than 5’ or 3’ ends. This causes a concomitant drop in diversity in these sectors in GENCODE compared to the observed and observed major transcripts (Fig. 3f, h). Interestingly, there is also a distinct enrichment of genes that occupy the splicing-high portion of the simplex in our observed set compared to GENCODE (Fig. 3g). When considering the observed major transcripts, we see an increase in the percentage of genes in the TSS and splicing-high sectors over the set of all transcripts detected in our entire LR-RNA-seq dataset, but a decrease in the TES-high sector (Fig. 3f–h). Overall, we compared gene triplets for transcripts as annotated by GENCODE and observed in our LR-RNA-seq dataset and found higher proportions of genes with high TSS and splicing diversity as compared to GENCODE.
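The definition of major transcripts given above can be sketched as a cumulative-expression cutoff applied per gene and per sample, followed by a union across samples. This is a minimal illustration with hypothetical function names and toy values; treating "over 90%" as a strict inequality is an assumption.

```python
def major_transcripts(transcript_tpm, threshold=0.9):
    """Smallest top-ranked set of transcripts exceeding the expression cutoff.

    Transcripts are ranked by TPM and added until their cumulative share of
    the gene's expression in this sample is over `threshold`.
    """
    total = sum(transcript_tpm.values())
    if total <= 0:
        return []
    ranked = sorted(transcript_tpm.items(), key=lambda kv: kv[1], reverse=True)
    major, cumulative = [], 0.0
    for transcript_id, tpm in ranked:
        major.append(transcript_id)
        cumulative += tpm
        if cumulative / total > threshold:
            break
    return major


def observed_major(samples):
    """Union of major transcripts for one gene across all samples."""
    union = set()
    for transcript_tpm in samples.values():
        union.update(major_transcripts(transcript_tpm))
    return union


# One hypothetical gene in two samples (toy TPM values).
toy_samples = {
    "sample_1": {"a": 70.0, "b": 25.0, "c": 4.0, "d": 1.0},
    "sample_2": {"a": 1.0, "c": 95.0, "d": 4.0},
}
major_1 = major_transcripts(toy_samples["sample_1"])
major_union = observed_major(toy_samples)
```

Gene triplets for the observed major set would then be computed from this union rather than from every detected transcript.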
Calculating sample-level gene triplets identifies genes that show distinct transcript structure diversity across samples.
The observed gene triplets represent the aggregate repertoire of triplet features for each gene globally across our entire LR-RNA-seq dataset. However, the overall transcript structure diversity of a gene does not necessarily reflect the transcript structure diversity of a gene within a given sample. Therefore, we computed gene triplets for each sample in our dataset using all detected transcripts in each sample (sample-level gene triplets) or just the major transcripts in each sample (sample-level major gene triplets). These gene triplets can also be visualized on the gene structure simplex where each point represents the gene triplet associated with a different sample (Fig. 3k).
In order to find genes that display heterogeneous transcript structure diversity across unique biological contexts, we computed the average coordinate (centroid) for each gene from the sample-level gene triplets and calculated the distance between it and each sample-level gene triplet (Supplementary methods). 2,892 unique genes had a distance z-score >3 in at least one sample and therefore demonstrate dissimilar transcript structure diversity from the average. One such example is AKAP8L in the H9-derived pancreatic progenitors (z-score: 5.23). AKAP8L can bind both DNA and RNA in the nucleus and has been shown to have functional differences on the protein level resulting from alternate transcript choice26,27. In our data, transcripts of this gene generally differ in terms of the EC or TES choice, but this behavior differs from sample to sample (Fig. 3k). For example, transcripts of AKAP8L differ only in their ECs in H9 embryonic stem cells, whereas transcripts differ in their ECs and TESs in the H9-derived pancreatic progenitors (Fig. 3l–m).
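The centroid and distance calculation described above can be sketched as follows. The distance metric and the population over which z-scores are computed are deferred to the Supplementary methods in the text, so this sketch makes two loudly labeled assumptions: it uses the Jensen-Shannon distance between simplex coordinates, and it standardizes distances within a single gene. All names and values are hypothetical.

```python
import math
import statistics


def centroid(points):
    """Average simplex coordinate across samples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two proportion vectors."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against rounding below 0


def distance_zscores(coords_by_sample):
    """Z-score of each sample's distance from the gene's centroid."""
    names = list(coords_by_sample)
    center = centroid([coords_by_sample[n] for n in names])
    dists = [js_distance(coords_by_sample[n], center) for n in names]
    mu = statistics.mean(dists)
    sd = statistics.pstdev(dists)
    if sd == 0:
        return {n: 0.0 for n in names}
    return {n: (d - mu) / sd for n, d in zip(names, dists)}


# Toy gene: eleven splicing-dominated samples and one that also varies its TES.
coords = {f"sample_{i}": (0.1, 0.8, 0.1) for i in range(11)}
coords["outlier"] = (0.1, 0.45, 0.45)
zscores = distance_zscores(coords)
flagged = [name for name, z in zscores.items() if z > 3]
```

Any other distance defined on the simplex, such as Euclidean distance, could be substituted for `js_distance` without changing the rest of the sketch.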
We also compared our sample-level gene triplets to the observed gene triplets to understand how transcript structure diversity differs globally versus within samples (Fig. 3d, Fig. 2j, Fig. S10a–h). First, we simply counted the number of triplet features or transcripts per gene and found that while most genes have more than one triplet feature or transcript globally (Fig. 2e–g, j), on the sample level, most genes have far fewer triplet features and transcripts, with a particularly pronounced difference for the TSS (Fig. S10a–d). We found that the distributions of triplet features overall and in each sample were significantly different from one another (two-sided KS test) (Fig. S10e–l, Supplementary methods).
To determine how transcript structure diversity for each gene changes from the global to sample level using the gene structure framework, we computed distances between the global observed gene triplets for non-simple genes and the sample-level gene triplet centroids (Supplementary methods). We find that 3.2% of tested genes have a distance z-score >2 between their observed and sample-level centroid gene triplets. In support of our analysis on the individual triplet feature level, we find that 94.8% of genes from the TSS-high sector in the observed set do not share this sector with their sample-level centroid, indicating that genes with a large number of promoters typically use them in a sample-specific manner. ACTA1, a gene that encodes an actin protein28, is the gene with the highest distance between observed and sample-level centroid. Its observed gene triplet is (1,18,1) and therefore splicing-high. However, in most samples where ACTA1 is expressed, it has only one transcript isoform (Fig. S10m). This drives the sample-level centroid behavior into the mixed sector (Fig. S10n). In contrast, in heart and muscle ACTA1 expresses 18 and 15 transcripts respectively, which all differ on the basis of their ECs (Fig. S10m–o). This illustrates how the gene structure framework can be used to highlight differences between sample-specific and global transcript structure diversity, and shows that individual genes can differ substantially in this respect.
Sample-specific and global changes in predominant and major transcript isoform usage.
Nevertheless, the transcript structure diversity pattern for the majority of genes is consistent across samples where they are expressed at substantial levels. Elastin (ELN), which is an important component of the extracellular matrix29, is the gene with the greatest number of detected transcripts in our dataset (283 in total). We find that in most samples, distinct transcripts of ELN are characterized by different ECs (Fig. 4a). For example, in lung, ELN has 32 major transcripts with 21 different ECs, but 31 of these major transcripts use only one TSS and two TESs (Fig. 4b). By contrast, the four transcripts from the transcription factor CTCF expressed in lung use three TSSs but only one TES (Fig. 4c–d).
Figure 4. Sample-specific and global changes in predominant and major transcript isoform usage.
a, Gene structure simplex for major transcripts of ELN. Gene triplets with splicing ratio for lung and H9-derived chondrocytes labeled. Simplex coordinates for the GENCODE v40 and observed major set are labeled. b, Major transcripts of ELN expressed >= 1 TPM in lung colored by expression level in TPM. Alternative exons that differ between transcripts are colored pink. c, Gene structure simplex for major transcripts of CTCF. Gene triplets with splicing ratio for lung labeled. Simplex coordinates for the GENCODE v40 and observed major set are labeled. d, From top to bottom: Major transcripts of CTCF expressed >= 1 TPM in lung, TSSs of CTCF major transcripts expressed >= 1 TPM in lung, ENCODE cCREs colored by type. e, Gene structure simplex for E4F1. Gene triplet with splicing ratio for observed E4F1 transcripts labeled. Simplex coordinates for the GENCODE v40 and observed set also shown for E4F1. f, Gene structure simplex for major transcripts of E4F1. Gene triplet with splicing ratio for observed major E4F1 transcripts labeled. Simplex coordinates for the GENCODE v40 and observed major set also shown for E4F1. g, Sector assignment change and conservation for protein coding genes in the human ENCODE LR-RNA-seq dataset between the observed set of gene triplets (left) and the observed major set of gene triplets (right). Percent of genes with the same sector between both sets labeled in the middle. h-k, Percentage of libraries where a gene with an annotated MANE transcript is expressed and the MANE h, transcript i, TSS j, EC k, TES is the predominant transcript or triplet feature.
While the observed gene triplets for a gene represent the overall transcript structure diversity, the observed major gene triplets capture diversity of the most highly expressed transcripts in each sample. We computed the distances between the observed and observed major simplex coordinates for protein coding genes. The transcription factor E4F130 has a high distance between the observed and observed major gene triplets, which corresponds to a change from the mixed to splicing-high sector (Fig. 4e–f). This sector change is driven by the use of fewer TSSs and TESs in major transcripts. Overall, 83.7% of protein coding genes retain their sectors between our observed and observed major triplets, while 4.8% of genes in the mixed sector move to one of the three corners of the simplex (TSS, splicing, or TES-high) (Fig. 4g). Thus, the differences between the observed and observed major gene triplets in a subset of genes can be substantial.
One criterion for the identification of MANE transcripts is how highly expressed the transcript is compared to others8. Therefore, we assessed how frequently the MANE transcript was the predominant one in each of our LR-RNA-seq libraries. Limiting ourselves to only the genes that have annotated MANE transcripts in GENCODE v40, we found that 64.1% of genes have a non-MANE predominant transcript in at least 80% of the libraries where the gene is expressed (Fig. 4h). At the individual triplet feature level, 30.8% of TSSs, 40.9% of ECs, and 45.2% of TESs have a non-MANE predominant feature in at least 80% of libraries (Fig. 4i–k). Therefore, though the MANE transcript typically is the most highly expressed transcript in a library, most genes with MANE transcripts have some libraries where this is not the case. For non-MANE predominant transcripts, only 17.0% were predicted to have the same ORF as the MANE transcript. Furthermore, 62.1% of non-MANE predominant transcripts are predicted to encode a full ORF that does not undergo NMD. These results indicate that in many cases, the alternative predominant transcript in a sample likely encodes a distinct, functional protein. The genes where the MANE transcript and triplet features are frequently not the predominant ones represent loci that would suffer most from restricting analyses to a single transcript isoform.
For a subset of gene / library combinations where the MANE transcript or feature was not the predominant one, the MANE transcript or feature was still expressed, albeit at a lower level. For these gene / library combinations, we compared the expression of the predominant transcript to the MANE one (Fig. S11a–d). We found that for predominant transcripts or triplet features expressed <30 TPM, the MANE counterpart was expressed at a comparable level. By contrast, for the opposite situation, where the MANE transcript or triplet feature was the predominant one, we found that the secondary transcript was not expressed at a similar level (Fig. S11e–h). Overall, for most gene / library combinations, the MANE transcript or triplet feature is the predominant one (Fig. S11i–l).
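The MANE predominance analysis above amounts to asking, for each library where a gene is expressed, whether the most highly expressed transcript is the MANE one. The following is a minimal sketch under stated assumptions, not the pipeline used in this study: transcript IDs, TPM values, and function names are invented, and applying the 1 TPM expression threshold per transcript is an assumption.

```python
def predominant(tpm_by_transcript, min_tpm=1.0):
    """Return the most highly expressed transcript, or None if none pass min_tpm."""
    expressed = {t: v for t, v in tpm_by_transcript.items() if v >= min_tpm}
    if not expressed:
        return None
    return max(expressed, key=expressed.get)


def mane_predominance(libraries, mane_id, min_tpm=1.0):
    """Percent of libraries expressing the gene where MANE is predominant."""
    calls = [predominant(lib, min_tpm) for lib in libraries]
    calls = [c for c in calls if c is not None]  # drop non-expressing libraries
    if not calls:
        return None
    return 100 * sum(c == mane_id for c in calls) / len(calls)


# Invented TPM values for one gene across five libraries.
libraries = [
    {"tx_mane": 12.0, "tx_alt": 3.0},
    {"tx_mane": 2.0, "tx_alt": 9.5},   # non-MANE transcript is predominant
    {"tx_mane": 20.0},
    {"tx_mane": 0.2, "tx_alt": 0.4},   # gene not expressed at >= 1 TPM
    {"tx_mane": 5.0, "tx_alt": 4.0},
]

print(mane_predominance(libraries, "tx_mane"))  # 75.0
```

The same function could be applied to TSS, EC, or TES features by summing TPM over transcripts that share a feature before calling it.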
Comparing transcript structure diversity between species.
We ran Cerberus on the ENCODE4 mouse LR-RNA-seq dataset to calculate transcript and gene triplets to enable comparison of transcript structure diversity between the two species. Compared to human GENCODE v40, GENCODE vM25 genes are less enriched in the TSS, splicing, and TES-high sectors (Fig. 3f–j, Fig. 5a–e, Fig. S12a–c). For the mouse observed and observed major gene triplets, we see relatively similar percentages of genes in each sector (Fig. 5a–e, Fig. S12a–c). We found fewer predominant transcripts across samples per protein coding gene in mouse than in human (a median of 2 per gene, with 57.7% of protein coding genes having more than one), which is expected given the smaller number of tissues in our mouse data (Fig. S12d). Furthermore, we observe that 54.5% of protein coding transcripts with novel ECs are predicted to encode full ORFs without NMD. Thus, the two transcriptomes have similar distributions of genes in our gene structure simplex.
Figure 5. Conservation of gene triplets from human and mouse.
a-e, Proportion of genes from the GENCODE vM25, observed, and observed major sets that fall into the a, TSS-high sector, b, splicing-high sector, c, TES-high sector, d, mixed sector, e, simple sector. f, Gene structure simplex for ARF4 in human. Gene triplet with splicing ratio for ARF4 transcripts in H1 labeled. Simplex coordinates for the GENCODE v40, sample-level centroid, and observed set also shown for ARF4. g, Gene structure simplex for Arf4 in mouse. Gene triplet with splicing ratio for Arf4 transcripts in F121–9 labeled. Simplex coordinates for the GENCODE vM25, sample-level centroid, and observed set also shown for Arf4. h, Transcripts of ARF4 expressed >= 1 TPM in human H1 sample colored by expression level in TPM. i, Transcripts of Arf4 expressed >= 1 TPM in mouse F121–9 sample colored by expression level in TPM. j, Sector assignment change and conservation for orthologous protein coding genes between the observed major human set of gene triplets (left) and the observed major mouse set of gene triplets (right). Percent of genes with the same sector between both sets labeled in the middle. k, Sector assignment change and conservation for orthologous protein coding genes between the sample-level H1 major human set of gene triplets (left) and the sample-level F121–9 major mouse set of gene triplets (right). Percent of genes with the same sector between both sets labeled in the middle.
To make gene-level comparisons for orthologous genes in both species, we subset the human samples to those most similar to the mouse samples and computed “mouse matched” observed, observed major, and sample-level gene triplets (Supplementary methods, Table S1–2). We computed the sample-level centroids for each gene in both species and computed the distance between each pair of 1:1 orthologs. Of the 13,536 orthologous genes, 4.3% have a distance z-score >2 between the species and therefore exhibit substantial changes in transcript structure diversity between the species. One of these is ADP-Ribosylation Factor 4 (ARF4), which is the most divergent member of the ARF family31. Human ARF4 sample-level gene triplets are nearly always splicing-high whereas mouse Arf4 sample-level gene triplets are mainly TES-high (Fig. 5f–g). We examined the ARF4 / Arf4 transcripts expressed in matching embryonic stem cell samples (H1 in human and F121–9 in mouse) and found that, despite the homologous samples, all 3 of the expressed human transcripts use the same TSS and TES but differ in the ECs whereas all 3 expressed mouse transcripts use the same TSS and EC but differ at the TES (Fig. 5h–i). We find globally that when comparing the observed major gene triplets between human and mouse, only 42.2% of genes have the same sector in human and mouse (Fig. 5j). This result holds even when restricting ourselves to comparing human tissues with adult mouse samples or just a comparison between human and mouse embryonic stem cells (Fig. 5k). Thus, we find substantial differences in splicing diversity for orthologous genes between human and mouse.
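The ortholog comparison above can be sketched in a few lines of Python. The sketch assumes a splicing ratio of 2·EC/(TSS+TES), normalization of (TSS, splicing ratio, TES) to sum to 1, a 0.5 sector threshold, and a z-score computed with the population standard deviation, none of which are restated in this section. The ARF4 / Arf4 triplets (1,3,1) and (1,1,3) follow from the H1 and F121–9 description in the text. All other triplets, distances, gene labels, and function names are invented for illustration.

```python
from statistics import mean, pstdev


def simplex_coords(n_tss, n_ec, n_tes):
    """Map a gene triplet to simplex coordinates (assumed definition)."""
    splicing_ratio = 2 * n_ec / (n_tss + n_tes)
    total = n_tss + splicing_ratio + n_tes
    return (n_tss / total, splicing_ratio / total, n_tes / total)


def sector(coords, threshold=0.5):
    """Assign a sector from simplex coordinates (assumed 0.5 threshold)."""
    labels = ("TSS-high", "splicing-high", "TES-high")
    for label, value in zip(labels, coords):
        if value > threshold:
            return label
    return "mixed"


def percent_same_sector(pairs):
    """Percent of (human, mouse) coordinate pairs that share a sector."""
    same = sum(sector(h) == sector(m) for h, m in pairs)
    return 100 * same / len(pairs)


def divergent_genes(distances, z_cutoff=2.0):
    """Genes whose human-mouse distance has a z-score above the cutoff."""
    values = list(distances.values())
    mu, sd = mean(values), pstdev(values)
    return sorted(g for g, d in distances.items() if (d - mu) / sd > z_cutoff)


human_arf4 = simplex_coords(1, 3, 1)  # same TSS and TES, three ECs
mouse_arf4 = simplex_coords(1, 1, 3)  # same TSS and EC, three TESs
print(sector(human_arf4), sector(mouse_arf4))  # splicing-high TES-high

pairs = [
    (human_arf4, mouse_arf4),
    (simplex_coords(3, 1, 1), simplex_coords(4, 1, 1)),  # invented pair
    (simplex_coords(2, 2, 2), simplex_coords(2, 3, 2)),  # invented pair
    (simplex_coords(1, 1, 4), simplex_coords(2, 2, 2)),  # invented pair
]
print(percent_same_sector(pairs))  # 50.0

# Invented distances: nine similar ortholog pairs and one outlier.
distances = {f"gene_{i}": 0.05 for i in range(9)}
distances["divergent_gene"] = 0.6
print(divergent_genes(distances))  # ['divergent_gene']
```

Sector agreement and the distance z-score answer different questions: a pair can change sector with a small distance if both points lie near a sector boundary, so the two summaries are best read together.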
Discussion
The ENCODE4 LR-RNA-seq dataset is the first large-scale, cross-species survey of transcript structure diversity using full-length cDNA sequencing on long-read platforms. We identify and quantify known and novel transcripts in a broad and diverse set of samples with uniformly processed data and annotations available at the ENCODE portal. A new framework was introduced for categorizing transcript structure diversity based on their exon junction chains and ends using gene and transcript triplets, which allowed us to use the gene structure simplex to visualize and compare gene triplets between samples and across species. The results showed a full range of transcript structure diversity across the transcriptome, based on promoter, internal splicing, and polyA site choice. As expected, the existing gene annotation catalogs such as GENCODE have successfully captured individual features such as TSSs and exons. However, GENCODE annotated full-length transcripts only represent a subset of the TSS, EC, and TES combinations that we observe using our conservative pipeline that requires full end-to-end support in a single read and support from multiple reads for defining ends. From the human LR-RNA-seq quantification, we found more than one predominant transcript across samples for 73.0% of genes, which is in contrast with prior reports32,33. We also found that for a substantial number of genes, transcript structure diversity and major transcript usage for the same gene differ between tissues, samples, and developmental timepoints. The majority of genes had at least one library where the MANE transcript is not the predominant transcript. This could confound analyses such as variant effect prediction in which it is common practice to consider only one transcript per gene. Finally, we found that transcript structure diversity behavior differs quite strikingly between human and mouse on a gene-by-gene basis. In matching samples, the dominant source of transcript structure diversity differed for more than half of orthologous protein coding genes.
Our data and framework provide a foundation for further analyses such as the functional impacts of alternative 5’ and 3’ ends, RNA modifications, RNA binding protein function, allele-specific expression, and transcript half-life. Together with the accompanying tissue and cell type annotations, this constitutes a transcript-level reference atlas that is structured appropriately for integration of future single-cell long-read analysis. When extended to new tissues, differentiation time courses, or disease samples, exploration using the gene structure simplex will yield additional genes showing sample-specific variance compared to their average behavior. The triplet annotation scheme for transcripts, based on mechanistically distinct transcript features, organizes and simplifies high-level analysis of transcripts from the same gene. We find it to be a useful and commonsense improvement over arbitrary transcript IDs and we expect it to be widely applicable to transcriptomes of any organism that uses regulated alternative splicing, promoter choice, or 3’ end selection.
Our annotations are consistent and extensive, yet they have several limitations. With the current sample preparation protocol and depth of sequencing, we reliably detect transcripts that are expressed above a minimum expression level of 1 TPM and are less than 10 kb long. While 99.3% of GENCODE v40 polyA transcripts are less than 10 kb long, we are undoubtedly underrepresenting transcripts that are at the long end of the distribution, especially when they are expressed at low levels or in rare cell types within a tissue. RNA integrity differences between the human cell lines and mouse tissues, both of which produce very high quality RNA compared to human postmortem tissues, are expected to affect our results, because we have focused on full-length transcript sequencing rather than read assembly. Our imposition of minimum expression and inherent length limitations could also lead to lower sensitivity for detecting splicing diversity in lncRNAs, which have thus far generated staggering transcript structure complexity when sequenced after enrichment capture34. We also had lower detection of pseudogenes, and we hypothesize that the PacBio platform’s accuracy and read length reduce the multimapping errors typical of short reads, especially for pseudogenes of highly expressed genes.
Within the boundaries outlined above, we were able to assess the sources and specifics of transcript structure diversity for major transcripts of most protein coding genes. Nearly all studies that have examined alternative splicing have emphasized transcript isoform multiplicity per gene. Also as expected, studies that have applied more permissive processing pipelines, used transcript assembly, or focused on nuclear RNA typically find evidence for far more RNA transcripts, especially at lower expression ranges24,35–38. Assigning biological functions to new transcripts from our collection or any other contemporary study is a major challenge for the field. Unlike DNA replication, which has elaborate mechanisms to ensure fidelity, the three major processes of RNA biogenesis mapped here are understood to operate with less stringent fidelity. Although this has long been debated, we consider that evidence for the existence of a new transcript isoform simply makes it a candidate of interest for a protein coding, precursor, or regulatory function.
The range of regulation used by different genes was illuminating. COL1A1, a complex gene in terms of number of exons, exhibited minimal transcript structure diversity in spite of high expression, in contrast to other genes with similar expression levels, such as PKM, whose many transcripts result from all three mechanisms. This implies that transcript structure diversity is a property of the gene that has been optimized in evolution. This has major implications for evaluating the functions of regulatory factors such as PAX6, which has 81 transcripts in GENCODE v40, and 33 transcripts in our dataset. Conventional gene-level short-read RNA-seq profiling is likely obscuring important distinctions in transcript usage. While not every one of these transcripts leads to a difference in the protein product, changes at the 5’ and 3’ end are likely to alter the regulation of those transcripts. The incorporation of transcript usage as well as its regulation within the framework of gene regulatory networks, where appropriate, is a major challenge going forward.
Considering transcript structure diversity as a fundamental, tunable property of gene function, the mouse-human comparative results were the most surprising to us. In genomics and in the wider biology community we often use orthology of mouse and human genes to predict and interpret gene function in vivo, including many uses of mice as mammalian models for both basic and preclinical purposes. The differences in transcript structure diversity that surfaced when we compared matching tissues from human and mouse suggest that this diversity is rapidly evolving on a per gene basis, even between primates and rodents. This is, however, consistent with prior observations of a large population of rapidly evolving candidate cis-regulatory elements39. The results presented here provide a roadmap for evaluating the evolution of transcript structure diversity across species and impetus to focus on it, especially for genes with substantial differences that would affect interpretation of existing animal models and expectations for humanized gene-locus mouse models. It is hard to overestimate the need for better methods to test the functional significance of different transcript isoforms.
Supplementary Material
ACKNOWLEDGEMENTS
We thank the UCI GRTH for sequencing PacBio LR-RNA-seq libraries and the Caltech Jacobs Genetics and Genomics Laboratory for sequencing the Illumina mRNA RNA-seq libraries. We thank Gloria Sheynkman for guidance on protein prediction analyses.
A.M., B.J.W., L.R., and D.B. were supported by UM1HG009443. B.J.W. was also supported by the Caltech Beckman Institute BIFGRC. D.B. was also supported by P30AG10161, P30AG72975, R01AG15819, R01AG17917, U01AG46152, U01AG61356. Z.W., M.B.G., R.G., and the members of the ENCODE DAC were supported by U24HG009446. M.L. was supported by UM1HG009382. M.B. was supported by R01HG012367 and U01HG009380. B.H. and the members of the ENCODE DCC were supported by U24HG009397.
Footnotes
CODE AVAILABILITY
• Data processing and figure generation code
• Cerberus
DATA AVAILABILITY
• Human LR-RNA-seq processed data / processing pipeline
• Human LR-RNA-seq datasets
• Mouse LR-RNA-seq processed data / processing pipeline
• Mouse LR-RNA-seq datasets
• Human short-read RNA-seq datasets
• Human microRNA-seq datasets
Bibliography
- 1.Park Eddie, Pan Zhicheng, Zhang Zijun, Lin Lan, and Xing Yi. The Expanding Landscape of Alternative Splicing Variation in Human Populations. The American Journal of Human Genetics, 102(1):11–26, 2018. ISSN 0002–9297. doi: 10.1016/j.ajhg.2017.11.002.
- 2.Di Giammartino Dafne Campigli, Nishida Kensei, and Manley James L. Mechanisms and Consequences of Alternative Polyadenylation. Molecular Cell, 43(6):853–866, 2011. ISSN 1097–2765. doi: 10.1016/j.molcel.2011.08.017.
- 3.Ara Takeshi, Lopez Fabrice, Ritchie William, Benech Philippe, and Gautheret Daniel. Conservation of alternative polyadenylation patterns in mammalian genes. BMC Genomics, 7(1):189, 2006. doi: 10.1186/1471-2164-7-189.
- 4.Xing Yi and Lee Christopher. Alternative splicing and RNA selection pressure — evolutionary consequences for eukaryotic genomes. Nature Reviews Genetics, 7(7):499–509, 2006. ISSN 1471–0056. doi: 10.1038/nrg1896.
- 5.Nagasaki Hideki, Arita Masanori, Nishizawa Tatsuya, Suwa Makiko, and Gotoh Osamu. Species-specific variation of alternative splicing and transcriptional initiation in six eukaryotes. Gene, 364:53–62, 2005. ISSN 0378–1119. doi: 10.1016/j.gene.2005.07.027.
- 6.O’Leary Nuala A., Wright Mathew W., Brister J. Rodney, Ciufo Stacy, Haddad Diana, McVeigh Rich, Rajput Bhanu, Robbertse Barbara, Smith-White Brian, Ako-Adjei Danso, Astashyn Alexander, Badretdin Azat, Bao Yiming, Blinkova Olga, Brover Vyacheslav, Chetvernin Vyacheslav, Choi Jinna, Cox Eric, Ermolaeva Olga, Farrell Catherine M., Goldfarb Tamara, Gupta Tripti, Haft Daniel, Hatcher Eneida, Hlavina Wratko, Joardar Vinita S., Kodali Vamsi K., Li Wenjun, Maglott Donna, Masterson Patrick, McGarvey Kelly M., Murphy Michael R., O’Neill Kathleen, Pujar Shashikant, Rangwala Sanjida H., Rausch Daniel, Riddick Lillian D., Schoch Conrad, Shkeda Andrei, Storz Susan S., Sun Hanzhen, Thibaud-Nissen Francoise, Tolstoy Igor, Tully Raymond E., Vatsan Anjana R., Wallin Craig, Webb David, Wu Wendy, Landrum Melissa J., Kimchi Avi, Tatusova Tatiana, DiCuccio Michael, Kitts Paul, Murphy Terence D., and Pruitt Kim D. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research, 44(D1):D733–D745, 2016. ISSN 0305–1048. doi: 10.1093/nar/gkv1189.
- 7.Frankish Adam, Diekhans Mark, Jungreis Irwin, Lagarde Julien, Loveland Jane E, Mudge Jonathan M, Sisu Cristina, Wright James C, Armstrong Joel, Barnes If, Berry Andrew, Bignell Alexandra, Boix Carles, Sala Silvia Carbonell, Cunningham Fiona, Di Domenico Tomás, Donaldson Sarah, Fiddes Ian T, Girón Carlos García, Gonzalez Jose Manuel, Grego Tiago, Hardy Matthew, Hourlier Thibaut, Howe Kevin L, Hunt Toby, Izuogu Osagie G, Johnson Rory, Martin Fergal J, Martínez Laura, Mohanan Shamika, Muir Paul, Navarro Fabio C P, Parker Anne, Pei Baikang, Pozo Fernando, Riera Ferriol Calvet, Ruffier Magali, Schmitt Bianca M, Stapleton Eloise, Suner Marie-Marthe, Sycheva Irina, Uszczynska-Ratajczak Barbara, Wolf Maxim Y, Xu Jinuri, Yang Yucheng T, Yates Andrew, Zerbino Daniel, Zhang Yan, Choudhary Jyoti S, Gerstein Mark, Guigó Roderic, Hubbard Tim J P, Kellis Manolis, Paten Benedict, Tress Michael L, and Flicek Paul. GENCODE 2021. Nucleic Acids Research, 49(D1):gkaa1087–, 2020. ISSN 0305–1048. doi: 10.1093/nar/gkaa1087.
- 8.Morales Joannella, Pujar Shashikant, Loveland Jane E., Astashyn Alex, Bennett Ruth, Berry Andrew, Cox Eric, Davidson Claire, Ermolaeva Olga, Farrell Catherine M., Fatima Reham, Gil Laurent, Goldfarb Tamara, Gonzalez Jose M., Haddad Diana, Hardy Matthew, Hunt Toby, Jackson John, Joardar Vinita S., Kay Michael, Kodali Vamsi K., McGarvey Kelly M., McMahon Aoife, Mudge Jonathan M., Murphy Daniel N., Murphy Michael R., Rajput Bhanu, Rangwala Sanjida H., Riddick Lillian D., Thibaud-Nissen Françoise, Threadgold Glen, Vatsan Anjana R., Wallin Craig, Webb David, Flicek Paul, Birney Ewan, Pruitt Kim D., Frankish Adam, Cunningham Fiona, and Murphy Terence D. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature, 604(7905):310–315, 2022. ISSN 0028–0836. doi: 10.1038/s41586-022-04558-8.
- 9.Modrek Barmak and Lee Christopher. A genomic view of alternative splicing. Nature Genetics, 30(1):13–19, 2002. ISSN 1061–4036. doi: 10.1038/ng0102-13.
- 10.Mortazavi Ali, Williams Brian A, McCue Kenneth, Schaeffer Lorian, and Wold Barbara. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7):621–628, 2008. ISSN 1548–7091. doi: 10.1038/nmeth.1226.
- 11.Rhoads Anthony and Au Kin Fai. PacBio Sequencing and Its Applications. Genomics, Proteomics & Bioinformatics, 13(5):278–289, 2015. doi: 10.1016/j.gpb.2015.08.002.
- 12.Garalde Daniel R, Snell Elizabeth A, Jachimowicz Daniel, Sipos Botond, Lloyd Joseph H, Bruce Mark, Pantic Nadia, Admassu Tigist, James Phillip, Warland Anthony, Jordan Michael, Ciccone Jonah, Serra Sabrina, Keenan Jemma, Martin Samuel, McNeill Luke, Wallace E Jayne, Jayasinghe Lakmal, Wright Chris, Blasco Javier, Young Stephen, Brocklebank Denise, Juul Sissel, Clarke James, Heron Andrew J, and Turner Daniel J. Highly parallel direct RNA sequencing on an array of nanopores. Nature Methods, 15(3):201–206, 2018. ISSN 1548–7091. doi: 10.1038/nmeth.4577.
- 13.Dunham Ian, Kundaje Anshul, Aldred Shelley F., Collins Patrick J., Davis Carrie A., Doyle Francis, Epstein Charles B., Frietze Seth, Harrow Jennifer, Kaul Rajinder, Khatun Jainab, Lajoie Bryan R., Landt Stephen G., Lee Bum-Kyu, Pauli Florencia, Rosenbloom Kate R., Sabo Peter, Safi Alexias, Sanyal Amartya, Shoresh Noam, Simon Jeremy M., Song Lingyun, Trinklein Nathan D., Altshuler Robert C., Birney Ewan, Brown James B., Cheng Chao, Djebali Sarah, Dong Xianjun, Dunham Ian, Ernst Jason, Furey Terrence S., Gerstein Mark, Giardine Belinda, Greven Melissa, Hardison Ross C., Harris Robert S., Herrero Javier, Hoffman Michael M., Iyer Sowmya, Kellis Manolis, Khatun Jainab, Kheradpour Pouya, Kundaje Anshul, Lassmann Timo, Li Qunhua, Lin Xinying, Marinov Georgi K., Merkel Angelika, Mortazavi Ali, Parker Stephen C. J., Reddy Timothy E., Rozowsky Joel, Schlesinger Felix, Thurman Robert E., Wang Jie, Ward Lucas D., Whitfield Troy W., Wilder Steven P., Wu Weisheng, Xi Hualin S., Yip Kevin Y., Zhuang Jiali, Bernstein Bradley E., Birney Ewan, Dunham Ian, Green Eric D., Gunter Chris, Snyder Michael, Pazin Michael J., Lowdon Rebecca F., Dillon Laura A. L., Adams Leslie B., Kelly Caroline J., Zhang Julia, Wexler Judith R., Green Eric D., Good Peter J., Feingold Elise A., Bernstein Bradley E., Birney Ewan, Crawford Gregory E., Dekker Job, Elnitski Laura, Farnham Peggy J., Gerstein Mark, Giddings Morgan C., Gingeras Thomas R., Green Eric D., Guigó Roderic, Hardison Ross C., Hubbard Timothy J., Kellis Manolis, Kent W. 
James, Lieb Jason D., Margulies Elliott H., Myers Richard M., Snyder Michael, Stamatoyannopoulos John A., Tenenbaum Scott A., Weng Zhiping, White Kevin P., Wold Barbara, Khatun Jainab, Yu Yanbao, Wrobel John, Risk Brian A., Gunawardena Harsha P., Kuiper Heather C., Maier Christopher W., Xie Ling, Chen Xian, Giddings Morgan C., Bernstein Bradley E., Epstein Charles B., Shoresh Noam, Ernst Jason, Kheradpour Pouya, Mikkelsen Tarjei S., Gillespie Shawn, Goren Alon, Ram Oren, Zhang Xiaolan, Wang Li, Issner Robbyn, Coyne Michael J., Durham Timothy, Ku Manching, Truong Thanh, Ward Lucas D., Altshuler Robert C., Eaton Matthew L., Kellis Manolis, Djebali Sarah, Davis Carrie A., Merkel Angelika, Dobin Alex, Lassmann Timo, Mortazavi Ali, Tanzer Andrea, Lagarde Julien, Lin Wei, Schlesinger Felix, Xue Chenghai, Marinov Georgi K., Khatun Jainab, Williams Brian A., Zaleski Chris, Rozowsky Joel, Maik Röder Felix Kokocinski, Abdelhamid Rehab F., Alioto Tyler, Antoshechkin Igor, Baer Michael T., Batut Philippe, Bell Ian, Bell Kimberly, Chakrabortty Sudipto, Chen Xian, Chrast Jacqueline, Curado Joao, Derrien Thomas, Drenkow Jorg, Dumais Erica, Dumais Jackie, Duttagupta Radha, Fastuca Megan, Fejes-Toth Kata, Ferreira Pedro, Foissac Sylvain, Fullwood Melissa J., Gao Hui, Gonzalez David, Gordon Assaf, Gunawardena Harsha P., Howald Cédric, Jha Sonali, Johnson Rory, Kapranov Philipp, King Brandon, Kingswood Colin, Li Guoliang, Luo Oscar J., Park Eddie, Preall Jonathan B., Presaud Kimberly, Ribeca Paolo, Risk Brian A., Robyr Daniel, Ruan Xiaoan, Sammeth Michael, Sandhu Kuljeet Singh, Schaeffer Lorain, See Lei-Hoon, Shahab Atif, Skancke Jorgen, Suzuki Ana Maria, Takahashi Hazuki, Tilgner Hagen, Trout Diane, Walters Nathalie, Wang Huaien, Wrobel John, Yu Yanbao, Hayashizaki Yoshihide, Harrow Jennifer, Gerstein Mark, Hubbard Timothy J., Reymond Alexandre, Antonarakis Stylianos E., Hannon Gregory J., Giddings Morgan C., Ruan Yijun, Wold Barbara, Carninci Piero, Guigó Roderic, Gingeras Thomas 
R., Rosenbloom Kate R., Sloan Cricket A., Learned Katrina, Malladi Venkat S., Wong Matthew C., Barber Galt P., Cline Melissa S., Dreszer Timothy R., Heitner Steven G., Karolchik Donna, Kent W. James, Kirkup Vanessa M., Meyer Laurence R., Long Jeffrey C., Maddren Morgan, Raney Brian J., Furey Terrence S., Song Lingyun, Grasfeder Linda L., Giresi Paul G., Lee Bum-Kyu, Battenhouse Anna, Sheffield Nathan C., Simon Jeremy M., Showers Kimberly A., Safi Alexias, London Darin, Bhinge Akshay A., Shestak Christopher, Schaner Matthew R., Kim Seul Ki, Zhang Zhuzhu Z., Mieczkowski Piotr A., Mieczkowska Joanna O., Liu Zheng, McDaniell Ryan M., Ni Yunyun, Rashid Naim U., Min Jae Kim Sheera Adar, Zhang Zhancheng, Wang Tianyuan, Winter Deborah, Keefe Damian, Birney Ewan, Iyer Vishwanath R., Lieb Jason D., Crawford Gregory E., Li Guoliang, Sandhu Kuljeet Singh, Zheng Meizhen, Wang Ping, Luo Oscar J., Shahab Atif, Fullwood Melissa J., Ruan Xiaoan, Ruan Yijun, Myers Richard M., Pauli Florencia, Williams Brian A., Gertz Jason, Marinov Georgi K., Reddy Timothy E., Vielmetter Jost, Partridge E., Trout Diane, Varley Katherine E., Gasper Clarke, Bansal Anita, Pepke Shirley, Jain Preti, Amrhein Henry, Bowling Kevin M., Anaya Michael, Cross Marie K., King Brandon, Muratet Michael A., Antoshechkin Igor, Newberry Kimberly M., McCue Kenneth, Nesmith Amy S., Fisher-Aylor Katherine I., Pusey Barbara, Gilberto DeSalvo Stephanie L. Parker, Balasubramanian Sreeram, Davis Nicholas S., Meadows Sarah K., Eggleston Tracy, Gunter Chris, Newberry J. Scott, Levy Shawn E., Absher Devin M., Mortazavi Ali, Wong Wing H., Wold Barbara, Blow Matthew J., Visel Axel, Pennachio Len A., Elnitski Laura, Margulies Elliott H., Parker Stephen C. 
J., Petrykowska Hanna M., Abyzov Alexej, Aken Bronwen, Barrell Daniel, Barson Gemma, Berry Andrew, Bignell Alexandra, Boychenko Veronika, Bussotti Giovanni, Chrast Jacqueline, Davidson Claire, Derrien Thomas, Despacio-Reyes Gloria, Diekhans Mark, Ezkurdia Iakes, Frankish Adam, Gilbert James, Gonzalez Jose Manuel, Griffiths Ed, Harte Rachel, Hendrix David A., Howald Cédric, Hunt Toby, Jungreis Irwin, Kay Mike, Khurana Ekta, Kokocinski Felix, Leng Jing, Lin Michael F., Loveland Jane, Lu Zhi, Manthravadi Deepa, Mariotti Marco, Mudge Jonathan, Mukherjee Gaurab, Notredame Cedric, Pei Baikang, Rodriguez Jose Manuel, Saunders Gary, Sboner Andrea, Searle Stephen, Sisu Cristina, Snow Catherine, Steward Charlie, Tanzer Andrea, Tapanari Electra, Tress Michael L., van Baren Marijke J., Walters Nathalie, Washietl Stefan, Wilming Laurens, Zadissa Amonida, Zhang Zhengdong, Brent Michael, Haussler David, Kellis Manolis, Valencia Alfonso, Gerstein Mark, Reymond Alexandre, Roderic Guigó Jennifer Harrow, Hubbard Timothy J., Landt Stephen G., Frietze Seth, Abyzov Alexej, Addleman Nick, Alexander Roger P., Auerbach Raymond K., Balasubramanian Suganthi, Bettinger Keith, Bhardwaj Nitin, Boyle Alan P., Cao Alina R., Cayting Philip, Charos Alexandra, Cheng Yong, Cheng Chao, Eastman Catharine, Euskirchen Ghia, Fleming Joseph D., Grubert Fabian, Habegger Lukas, Hariharan Manoj, Harmanci Arif, Iyengar Sushma, Jin Victor X., Karczewski Konrad J., Kasowski Maya, Lacroute Phil, Lam Hugo, Lamarre-Vincent Nathan, Leng Jing, Lian Jin, Lindahl-Allen Marianne, Min Renqiang, Miotto Benoit, Monahan Hannah, Moqtaderi Zarmik, Mu Xinmeng J., O’Geen Henriette, Ouyang Zhengqing, Patacsil Dorrelyn, Pei Baikang, Raha Debasish, Ramirez Lucia, Reed Brian, Rozowsky Joel, Sboner Andrea, Shi Minyi, Sisu Cristina, Slifer Teri, Witt Heather, Wu Linfeng, Xu Xiaoqin, Yan Koon-Kiu, Yang Xinqiong, Yip Kevin Y., Zhang Zhengdong, Struhl Kevin, Weissman Sherman M., Gerstein Mark, Farnham Peggy J., Snyder Michael, Tenenbaum 
Associated Data
Data Availability Statement
• Human LR-RNA-seq processed data / processing pipeline
• Human LR-RNA-seq datasets
• Mouse LR-RNA-seq processed data / processing pipeline
• Mouse LR-RNA-seq datasets
• Human short-read RNA-seq datasets
• Human microRNA-seq datasets





