Ambiguous splice sites distinguish circRNA and linear splicing in the human genome

Roozbeh Dehghannasiri; Linda Szabo; Julia Salzman

doi:10.1093/bioinformatics/bty785

Bioinformatics. 2018 Sep 5;35(8):1263–1268. doi: 10.1093/bioinformatics/bty785

Editor: Inanc Birol

PMCID: PMC6477988 PMID: 30192918

Abstract

Motivation

Identification of splice sites is critical to gene annotation and to determining which sequences control circRNA biogenesis. Full-length RNA transcripts could in principle complete annotations of introns and exons in genomes without external ontologies, i.e., ab initio. However, whether the genomic positions where splicing occurs can be reconstructed from full-length transcripts, even if sampled in the absence of noise, depends on the genome sequence composition. If they cannot, there exist provable limits on the use of RNA-Seq to define splice locations (linear or circular) in the genome.

Results

We provide a formal definition of splice site ambiguity due to the genomic sequence by introducing the equivalent junction: the set of local genomic positions that result in the same RNA sequence when joined through RNA splicing. We show that equivalent junctions are prevalent in diverse eukaryotic genomes and occur in 88.64% and 78.64% of annotated human splice sites in linear and circRNA junctions, respectively. The observed fractions of equivalent junctions and the frequency of many individual motifs are statistically significant when compared against the null distribution computed via simulation or in closed form. The frequency of equivalent junctions establishes a fundamental limit on the possibility of ab initio reconstruction of RNA transcripts without appealing to the ontology of “GT-AG” boundaries defining introns. Said differently, completely ab initio reconstruction is impossible at the vast majority of splice sites in annotated circRNAs and linear transcripts.

Availability and implementation

Two Python scripts that generate an equivalent junction sequence per junction are available at: https://github.com/salzmanlab/Equivalent-Junctions.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Splicing is well known to be fundamental in developmental regulation and causal for disease (Fackenthal and Godley, 2008; Gamazon and Stranger, 2014; Sveen et al., 2016), highlighting the importance of the problem of “discovering the splicing code”. A major open question in computational biology remains whether the sequence of DNA itself encodes sufficient information to determine the location of human exons, i.e., positions where pre-mRNAs from this sequence are spliced (Burge et al., 1999; Burge and Karlin, 1998; Mathé et al., 2002). Despite important early progress (Burge and Karlin, 1997; Lukashin and Borodovsky, 1998; Stephens and Schneider, 1992), it has now become clear that exons in the human genome have not been completely annotated, and that with current computational prediction using DNA sequence supplemented with RNA sequencing, new exons and splice sites are constantly discovered, even in well-studied genomes such as the human genome (Barrett et al., 2017; Costa et al., 2010).

The positions of exons in the human and other metazoan genomes cannot be computationally predicted from DNA sequence alone or even with advanced experimental and computational approaches as enriched motifs do not have sufficient information content to identify splice sites (Burge and Karlin, 1997; Rosenberg et al., 2015). Moreover, the recently discovered ubiquity of backsplicing that generates circRNA (Salzman et al., 2012; Wang et al., 2014) raises significant biological and bioinformatic questions such as: are splice signals and sequence used to generate circRNA shared by linear splice junctions? and basic database/annotation questions like: what are the genomic positions that define circRNA splicing?

In the absence of ab initio prediction from DNA sequence, RNA sequencing can be used as an empirical observation of mature spliced transcripts, and thus, in principle, can be used to discover splice sites in the genome (Pan et al., 2008; Trapnell et al., 2009). The explosion of higher-throughput long-read RNA sequencing technologies capable of full transcript sequencing has motivated large resources and computational efforts to resolve the position of exons and introns without appealing to ontology (Lu et al., 2016). Is this theoretically possible? If so, a perfectly known full-length RNA transcript, linear or circular, must have a unique spliced alignment to the genome. Said another way, with error-free, full-length sequencing, do nucleotide sequences from RNA-Seq provide enough information to recover the underlying DNA template for the transcriptome? To our knowledge, this assumption has not been systematically tested.

In addition to these basic mechanistic and functional questions regarding where splice sites are located and how they are regulated, the potential of using RNA sequence for uniquely determining its precursor pre-mRNA is critical for benchmarking ab-initio spliced alignment algorithms (Carrara et al., 2015; Grabherr et al., 2011; Liu et al., 2014; Rapaport et al., 2013; Szabo and Salzman, 2016; Teng et al., 2016).

To establish criteria for benchmarking spliced aligners and to potentially discover new motifs enriched at splice boundaries differing between circRNA “back-” and linear “forward-” RNA splicing, we perform a systematic analysis of fundamental limitations of defining the positions of exons in DNA from RNA sequence alone, using the most recent genome builds. Our code and methodology are general and can be applied to any reference genome and transcriptome. Our results show that based on the mRNA sequence alone, exon–intron boundaries cannot be precisely located for 88.64% and 78.64% of annotated human splice sites in linear and circular transcripts, respectively. Also, 96% of human genes have at least one annotated exon–exon junction whose splice sites cannot be precisely located using mRNA sequence alone, without appealing to annotations or assumptions about intronic sequence structure. These results have important implications for bioinformatics regarding intrinsic limitations of ontology-free ab initio transcript reconstruction. Beyond these significant implications for genomics and bioinformatics, the degeneracy we report may generate testable hypotheses for future studies on how human introns are defined and how they differ between “back” and “forward” splicing.

2 Materials and methods

We begin with an annotated list of human introns and test whether the exon–exon junctions induced by splicing them out in silico could be reconstructed de novo. Specifically, we define a junction as follows: if exon A is defined by its genomic start and end positions (x₁, y₁) and exon B by positions (x₂, y₂), we call the sequence $J := [(x_{1}, y_{1}) (x_{2}, y_{2})]$ a junction formed by exons A and B. Now considering the junction J, let k₁ and k₂ be the largest non-negative integers such that

$J = [(x_{1}, y_{1} - k_{1}) (x_{2} - k_{1}, y_{2})]$

and

$J = [(x_{1}, y_{1} + k_{2}) (x_{2} + k_{2}, y_{2})].$

In this case, we say that J is a k-equivalent junction, where k = k₁ + k₂. Here, k₁ and k₂ are the maximum upstream and downstream shifts, respectively, that can be applied to both splice sites y₁ and x₂ without changing the sequence J. Note that J can be uniquely reconstructed from RNA-Seq data without ontology only if k = 0. With this mathematical description, we define precise and equivalent junctions as follows:

Definition 1. Precise junctions are the subset of junctions whose genomic coordinates can be precisely identified using only RNA sequence.

Definition 2. Equivalent junctions are the subset of junctions whose genomic coordinates cannot be precisely identified by the RNA sequence.

The equivalence class of an equivalent junction is the set of junctions, obtained by shifting both splice sites by one or more base pairs either upstream or downstream, that yield the same junction sequence. We also define the equivalent junction sequence of a k-equivalent junction J as the longest sequence (of length k) that can be attributed to either the donor or the acceptor exon. For an equivalent junction J, k is always greater than 0. Figure 1 provides an example equivalent junction in which y₁ = 1 467 341 (donor coordinate) and x₂ = 1 467 480 (acceptor coordinate). For this junction, shifting the donor and acceptor coordinates further downstream by up to three bases leads to the same junction sequence. This is due to the repetition of CTG immediately after coordinates y₁ and x₂, which can be attributed to either the intron or the exon. Therefore, the equivalent junction sequence is CTG and there are four distinct junctions that result in the same junction sequence, i.e., a 3-equivalent junction.

Fig. 1. An example equivalence class consisting of four equivalent junctions in the MYO1C gene on chromosome 17. Coordinates are according to the GRCh38 human genome assembly. The nucleotides attributed to the donor exon are in blue and those attributed to the acceptor exon are in red. (Color version of this figure is available at Bioinformatics online.)
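The definition of k₁ and k₂ can be made concrete with a minimal Python sketch. It assumes a forward-strand junction, a chromosome held in memory as a string, and 0-based indices. The function names, the toy sequence, and the `max_shift` default (borrowed from the 30-nucleotide window mentioned in this section) are illustrative assumptions and are not taken from the authors' released scripts.

```python
def equivalent_shifts(genome, y1, x2, max_shift=30):
    """Return (k1, k2) for a forward-strand junction.

    genome: chromosome sequence as a string (0-based indexing, an assumption).
    y1: index of the last base of the donor exon.
    x2: index of the first base of the acceptor exon.
    """
    # Upstream shift: the base leaving the donor exon must equal the
    # base entering the acceptor exon, one position at a time.
    k1 = 0
    while (k1 < max_shift and y1 - k1 >= 0
           and genome[y1 - k1] == genome[x2 - 1 - k1]):
        k1 += 1
    # Downstream shift: the base entering the donor exon must equal the
    # base leaving the acceptor exon.
    k2 = 0
    while (k2 < max_shift and x2 + k2 < len(genome)
           and genome[y1 + 1 + k2] == genome[x2 + k2]):
        k2 += 1
    return k1, k2


def equivalent_junction_sequence(genome, y1, k1, k2):
    """Bases that can be attributed to either side of the junction."""
    return genome[y1 - k1 + 1 : y1 + k2 + 1]


# Toy sequence (hypothetical): donor exon, intron, acceptor exon.
toy_genome = "GATTACA" + "CTGTCCCCCCAT" + "CTGAAGG"
toy_y1, toy_x2 = 6, 19
k1, k2 = equivalent_shifts(toy_genome, toy_y1, toy_x2)
motif = equivalent_junction_sequence(toy_genome, toy_y1, k1, k2)
print(k1, k2, motif)  # prints: 0 3 CTG
```

As in the Figure 1 example, the toy junction has CTG immediately after both y₁ and x₂, so it is a 3-equivalent junction with k₁ = 0 and k₂ = 3.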

We define a junction by the chromosome, strand, the 3′ coordinate of the donor exon, and the 5′ coordinate of the acceptor exon. Starting with a gene annotation file (in either GTF or BED format) and a genome sequence file (typically a FASTA file), for each gene we use the transcriptome information in the annotation file to extract all pairs of neighboring annotated exons that can be found in an annotated transcript of that gene (Supplementary Fig. S1). While a junction (i.e., an exon–exon pair) might be present in more than one transcript, it is counted only once in our analysis. For each pair of neighboring exons, we examine the nucleotides in the DNA sequence both downstream and upstream of the annotated splice sites (for up to 30 nucleotides in each direction) and determine whether the same junction sequence would be observed in the mature RNA by extending the sequence attributed to the donor exon and shortening the acceptor exon sequence, and vice versa.
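The extraction of distinct junctions described above might be sketched as follows. This is a simplified illustration for linear transcripts only: it assumes the exons of each transcript have already been parsed from the annotation file into (start, end) tuples with inclusive coordinates, and the names `distinct_junctions` and `toy_transcripts` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]                # (start, end), inclusive coordinates
Junction = Tuple[str, str, int, int]  # (chrom, strand, donor 3' coord, acceptor 5' coord)


def distinct_junctions(chrom: str, strand: str,
                       transcripts: Dict[str, List[Exon]]) -> Set[Junction]:
    """Collect each neighboring exon-exon pair once across all transcripts."""
    junctions: Set[Junction] = set()
    for exons in transcripts.values():
        ordered = sorted(exons)  # ascending genomic order
        for (s1, e1), (s2, e2) in zip(ordered, ordered[1:]):
            if strand == "+":
                # Donor is the lower exon; its 3' end is its genomic end.
                junctions.add((chrom, strand, e1, s2))
            else:
                # On the minus strand the higher exon is the donor;
                # its 3' end is its genomic start.
                junctions.add((chrom, strand, s2, e1))
    return junctions


# Toy gene with three transcripts (hypothetical coordinates).
toy_transcripts = {
    "tx1": [(100, 200), (300, 400), (500, 600)],
    "tx2": [(100, 200), (500, 600)],   # skips the middle exon
    "tx3": [(100, 200), (300, 400)],   # repeats a junction from tx1
}
toy_junctions = distinct_junctions("chr1", "+", toy_transcripts)
print(sorted(toy_junctions))
```

Using a set keeps a junction shared by several transcripts from being counted more than once, which mirrors the counting rule stated in the paragraph above.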

3 Results

We profile equivalent junctions for six eukaryotic organisms: human, mouse, rat, fly, Salpingoeca rosetta, and yeast. For human, we further analyze equivalent junctions for circular junctions and exon-skipping alternative splicing events. The genome sequence and gene annotation files were downloaded as explained in the supplement. The annotation file for human circular transcripts was downloaded from the circBase database (Glažar et al., 2014). For exon-skipping events, the coordinates of the cassette exons were extracted from the Alt Events table in the UCSC Genome Browser. Using the annotation file (which provides the positions of all genes, transcripts, and exons) and the downloaded set of cassette exons, we created a BED file containing all exon-skipping events for the human genome and used it for the equivalent junction analysis.

Table 1 summarizes the prevalence of equivalent junctions for various organisms. In this table, total number of junctions is the total number of distinct annotated exon–exon junctions; total number of genes is the number of unique gene names in the annotation file; non-canonical equivalent junctions are the subset of equivalent junctions that cannot be uniquely mapped even by assuming the presence of the canonical splicing motif GT-AG; and genes with equivalent junction is the number of unique genes that have at least one equivalent junction in their annotated transcripts. Equivalent junctions are highly prevalent: they account for at least 85% of annotated linear junctions in every organism except fly (64.95%) and yeast (60.70%), and they occur in the vast majority of genes within each genome. Even assuming the canonical splicing motif GT-AG does not resolve the ambiguity for all equivalent junctions, as shown by the non-canonical equivalent junctions column. For example, 1.03% and 35.78% of junctions in human and yeast, respectively, still have ambiguous splice sites even when GT-AG is considered. Strikingly, the fraction of backsplice junctions in circRNA that can be precisely defined (21.36%) is nearly twice the fraction of precisely defined linear RNA junctions (11.36%). As will be shown later, the difference in the prevalence of precise junctions in circRNA and linear RNA is statistically significant. Also, between circRNA and linear RNA, there is a roughly three-fold difference in the fraction of junctions that remain ambiguous even when the splicing motif GT-AG is considered (3.37% versus 1.03% of all junctions).

Table 1.

The prevalence of equivalent junctions in different transcriptomes

Transcriptome	Total number of junctions	Precise junctions	Equivalent junctions	Non-canonical equivalent junctions	Total number of genes	Genes with equivalent junction
Human	350 863	39 841 (11.36%)	311 022 (88.64%)	3614 (1.03%)	35 259	33 866 (96.04%)
Mouse	267 204	29 847 (11.17%)	237 357 (88.82%)	1710 (0.64%)	30 361	29 290 (96.47%)
Rat	201 751	26 134 (12.95%)	175 617 (87.05%)	3632 (1.80%)	22 935	22 125 (96.47%)
Fly	52 482	18 395 (35.05%)	34 087 (64.95%)	2451 (4.67%)	18 617	16 290 (87.50%)
Yeast	341	134 (39.30%)	207 (60.70%)	122 (35.78%)	331	206 (62.23%)
Salpingoeca rosetta	87 813	12 381 (14.1%)	75 432 (85.9%)	75 (0.085%)	10 655	10 271 (96.39%)
Human (circRNA)	91 775	19 607 (21.36%)	72 168 (78.64%)	3092 (3.37%)	11 769	11 226 (95.75%)
Human (exon-skipping)	68 013	7234 (11.32%)	60 779 (88.68%)	417 (0.613%)	14 306	13 775 (96.28%)
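One possible reading of the non-canonical equivalent junctions column in Table 1 is sketched below: within an equivalence class, a junction is treated as resolved by the canonical motif only if exactly one shift leaves an intron that starts with GT and ends with AG. This interpretation, the forward-strand restriction, the 0-based indexing, and all names and toy sequences are assumptions made for illustration; the authors' scripts may implement the check differently.

```python
def canonical_shifts(genome, y1, x2, k1, k2):
    """Shifts j in [-k1, k2] whose implied intron starts with GT and ends with AG.

    y1 is the index of the last donor-exon base and x2 the index of the
    first acceptor-exon base (0-based, forward strand: assumptions).
    """
    shifts = []
    for j in range(-k1, k2 + 1):
        intron = genome[y1 + 1 + j : x2 + j]
        if intron.startswith("GT") and intron.endswith("AG"):
            shifts.append(j)
    return shifts


def resolved_by_gt_ag(genome, y1, x2, k1, k2):
    """True if the GT-AG assumption singles out exactly one member of the class."""
    return len(canonical_shifts(genome, y1, x2, k1, k2)) == 1


# Toy junction 1 (hypothetical): GT-AG intron, with k1 = 3 and k2 = 2
# worked out by hand for this sequence.
gtag_genome = "ATGCCAG" + "GTAAGTCTTTTCTTACAG" + "GTTCAA"
# Toy junction 2 (hypothetical): GC-AG intron, with k1 = 3 and k2 = 1.
gcag_genome = "ATGCCAG" + "GCAAGTCTTTTCTTACAG" + "GTTCAA"

print(canonical_shifts(gtag_genome, 6, 25, 3, 2))   # prints: [0]
print(resolved_by_gt_ag(gcag_genome, 6, 25, 3, 1))  # prints: False
```

In the first toy junction the motif removes the ambiguity, while in the second no member of the equivalence class has a GT-AG intron, so the junction would stay ambiguous under this reading.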

Figure 2 shows the fractions of prevalent equivalent junction sequences for different transcriptomes. The most frequent equivalent junction sequence is AG for human (linear RNA and exon-skipping events), mouse, and rat, and is G for other genomes, including human circRNA. This is expected given that the canonical splicing motif GT-AG is enriched at intron boundaries. Also, the prevalence of some equivalent junctions is remarkably different in human linear RNA and circRNA. For example, the fractions of C, T, and A (all later shown to be statistically significant in circRNA) are much larger in circRNA than in linear RNA (Supplementary Table S1).

Fig. 2. The most prevalent equivalent junctions in different transcriptomes. Junctions are sorted based on their prevalence in the human linear RNA. The proportion of each equivalent junction sequence is computed relative to the total number of junctions.

Figure 3 presents the proportion of each equivalent junction length k across various genomes. It can be seen that a large fraction of equivalent junctions are either of length k = 1 or k = 2, which is due to the fact that AG and G are the two most frequent equivalent junction sequences for all genomes, as shown in Figure 2. Our analysis shows that equivalent junction sequences of length 10 or less account for the vast majority of junctions, but much longer sequences are present as well in the transcriptomes.

Fig. 3. The length of equivalent junction sequences in linear RNA for various organisms.

For each transcriptome, which is characterized by its corresponding annotation file, we test whether the observed fraction of each equivalent junction sequence is higher than would be expected in a permuted annotation file, in which the coordinates of the donor and acceptor splice sites of a junction are conditionally independent given the true annotation (i.e., the annotated exons). Under this null, for each junction sequence in the list of most prevalent junctions (given in Supplementary Table S1), we compare its observed fraction (obtained from the true annotation file) against a null distribution of its fractions in 2000 permuted annotations and compute its empirical P-value, which we use to identify statistically significant equivalent junctions. Table 2 reports the statistically significant equivalent junctions and their corresponding empirical P-values. The junctions for human circRNA are listed separately because the list of most prevalent junctions for human circRNA is considerably different from those of the other transcriptomes. To compute the P-values for equivalent junctions in circRNA, we compare the rate of each equivalent junction sequence against the null distribution of the equivalent junctions that had been built for linear splicing. The motifs enriched at splice sites suggested by this table show a small but statistically significant effect size that deviates from the empirical null distribution (based on simulation). For human, precise junctions are significant in both linear RNA and circRNA. Also, the empirical P-values of the equivalent junction sequences ACAG, AAG, CCAGG, TAG, and GTA for human linear splicing, and C, AGGTA, T, A, TAG, and GTA for circRNA, are significant.

Table 2.

Empirical P-values for the significant equivalent junction sequences

Junction	Human	Mouse	Rat	Fly	Yeast	Salpingoeca	Human (exon-skip)	Junction (human circRNA)	Empirical P-value (human circRNA)
Precise	<0.0005	<0.0005	—	<0.0005	—	<0.0005	—	Precise	<0.0005
AG	—	—	—	—	—	—	<0.0005	G	—
G	—	—	—	—	—	—	—	AG	—
AGG	—	—	—	—	—	—	—	AGG	—
GG	—	—	<0.0005	—	—	—	—	GG	—
CAG	—	—	—	—	—	—	—	CAG	—
AGGT	—	—	—	—	—	—	—	AGGT	—
CAGG	—	—	0.0005	—	NA	—	0.0005	GGT	—
GGT	—	—	—	—	—	—	—	CAGG	—
GT	—	—	—	—	—	—	—	GT	—
CAGGT	—	0.001	<0.0005	—	NA	—	—	CAGGT	—
AGGTG	—	—	0.0005	—	NA	0.0015	—	C	<0.0005
AGGTA	—	—	—	—	NA	—	<0.0005	AGGTA	0.001
GGTG	—	—	—	—	NA	—	—	T	<0.0005
ACAG	0.001	—	—	—	NA	—	<0.0005	A	<0.0005
AAG	<0.0005	<0.0005	—	—	NA	—	<0.0005	AGGTG	—
CCAG	—	—	—	—	NA	—	—	ACAG	—
CCAGG	<0.0005	—	—	—	NA	—	0.001	AAG	—
TAG	<0.0005	<0.0005	—	—	—	—	<0.0005	GGTG	—
GGTA	—	—	0.001	—	—	—	0.0005	TAG	<0.0005
GTA	<0.0005	—	0.0005	—	—	—	0.001	GTA	<0.0005

Bonferroni correction was used in the statistical analysis. Non-significant P-values are shown by dashes and NA means that the junction sequence is not observed in the transcriptome.
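The empirical P-values in Table 2 can be illustrated with a short sketch. It assumes a one-sided test in which the P-value is the share of permuted annotations whose fraction is at least as large as the observed one; with 2000 permutations this gives values in steps of 1/2000 = 0.0005, which would be consistent with the entries in the table. The significance level `alpha`, the number of tests `n_tests`, and the toy null values are hypothetical, since the text does not state them.

```python
def empirical_p_value(observed, null_fractions):
    """One-sided empirical P-value: share of null fractions >= observed."""
    hits = sum(1 for f in null_fractions if f >= observed)
    return hits / len(null_fractions)


def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """Bonferroni rule: compare each P-value with alpha / n_tests.

    alpha = 0.05 and the choice of n_tests are assumptions for illustration.
    """
    return p_value <= alpha / n_tests


# Toy null distribution (hypothetical): fractions of one junction sequence
# in 2000 permuted annotations, three of which exceed the observed value.
toy_null = [0.10] * 1997 + [0.12] * 3
toy_observed = 0.115

p = empirical_p_value(toy_observed, toy_null)
print(p)                                      # prints: 0.0015
print(bonferroni_significant(p, n_tests=21))  # prints: True
```

When no permuted annotation reaches the observed fraction, this estimator returns 0, which a table would report as a bound such as <0.0005 for 2000 permutations.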

To show that the reported P-values based on permutation tests are in complete agreement with theory and also provide a flavor of how the closed-form values of the null probabilities can be computed in theory, for human linear transcripts, we compute the closed-form values of the null probabilities for the precise junction and the equivalent junction sequence G in the supplementary material. In Supplementary Figure S4, we provide the marginal probability distributions of the ten most 3′ nucleotides in the donor exon and the ten most 5′ nucleotides in the acceptor exon flanking the splice sites in the human linear transcripts. These probabilities have been obtained by recording the sequence of nucleotides for each annotated exon-exon pair. The distributions given in the figure fit with our expectation that AG and GT are the canonical dinucleotides on the 3′ end of donor exons and the 5′ end of acceptor exons, respectively. Using these marginal distributions, the theoretical probabilities of observing a precise junction and an equivalent junction sequence G based on the null hypothesis can be computed, which are 0.104 and 0.211, respectively (probabilistic calculations are given in the supplement). Comparing these theoretical null probabilities with the true probabilities obtained by the true annotation (shown in Fig. 2), one can predict a significant P-value for the fraction of precise junctions and a non-significant P-value for the equivalent junction sequence G, as obtained via simulations.

We use the Kullback–Leibler (KL) divergence and the total variation to measure the difference between the true distribution and the simulated null distributions of the equivalent junctions. The KL divergence is a non-negative measure of how different two probability distributions are: it equals 0 when the two distributions are identical and grows as they become more different. Letting P be the true distribution of the equivalent junctions and Q be a simulated null distribution obtained from a permuted annotation file, we compute the KL divergence from Q to P as $\mathrm{KL}(P \| Q) = -\sum_{i} P(i) \log \frac{Q(i)}{P(i)}$. A well-separated P and Q results in a larger KL divergence, which implies a larger effect size. Table 3 reports the averages of the KL divergence and the total variation taken across all generated permuted annotation files. Focusing on linear splicing, mouse and yeast have the lowest and highest average KL divergences, respectively. The KL divergence and total variation computed for human circRNA are much higher than those of human linear splicing.
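The two distances can be computed as in the following sketch, which takes P and Q as lists of probabilities over the same ordered set of junction sequences. The use of the natural logarithm and the definition of total variation as half the L1 distance are assumptions, since the text does not state either convention; the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = -sum_i P(i) * log(Q(i) / P(i)), natural log (an assumption).

    Terms with P(i) = 0 contribute nothing; Q(i) must be positive wherever
    P(i) is positive.
    """
    return -sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0)


def total_variation(p, q):
    """Total variation as half the L1 distance (one common convention)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Toy distributions (hypothetical) over two junction classes.
toy_p = [0.5, 0.5]
toy_q = [0.25, 0.75]

print(round(kl_divergence(toy_p, toy_q), 4))  # prints: 0.1438
print(total_variation(toy_p, toy_q))          # prints: 0.25
```

Averaging these two quantities over all permuted annotation files would give summary values of the kind reported in Table 3.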

Table 3.

Average KL divergence and total variation between the true distribution and the simulated null distributions of equivalent junctions

	Mouse	Rat	Fly	Yeast	Salpingoeca	Human (linear)	Human (circRNA)	Human (exon skip.)
KL divergence	0.0062	0.012	0.009	0.040	0.0085	0.0087	0.113	0.0091
Total variation	0.0098	0.016	0.025	0.037	0.0080	0.0084	0.107	0.0035

To test whether the distributions of the equivalent junctions (including the precise junctions) in linear and circRNA junctions are statistically different, we performed Pearson’s Chi-squared test and found that the two distributions are significantly different (P-value < 2.2 × 10⁻¹⁶). Also, the KL divergence and the total variation between the two distributions are 0.129 and 0.10, respectively, indicating a large distance between the two distributions.
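As a simplified illustration of this kind of test, the sketch below applies Pearson's Chi-squared test to a 2×2 table built only from the precise and equivalent junction counts for human linear RNA and circRNA in Table 1. This is not the authors' analysis, which compares the full distributions of equivalent junction sequences; the function name is hypothetical and no continuity correction is applied.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson Chi-squared statistic and P-value for the table [[a, b], [c, d]].

    With one degree of freedom, the upper-tail probability of the
    Chi-squared distribution at x equals erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Counts from Table 1: (precise, equivalent) for human linear RNA and circRNA.
stat, p_value = chi_squared_2x2(39841, 311022, 19607, 72168)
print(stat, p_value)
```

Even in this reduced comparison the statistic is in the thousands, and the P-value falls below the smallest positive value a double can represent, so it prints as 0.0.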

4 Conclusion

In this paper, we show that annotated human introns cannot be established from the sequence of mature mRNA at 89% of the annotated human splice sites, and at similarly large fractions of splice sites in the other genomes we profiled. In the human genome, we find a significant increase in the fraction of precise junctions for circular transcripts compared to linear transcripts. This work raises important bioinformatic and biological questions for the future. Most significantly, why are “synonymous” motifs at splice sites maintained across evolution, especially given that pre-RNA base-pairing with the U1 snRNA sequence cannot predict splice site choice (Roca et al., 2013)? Further, the observed effect size in the precise junction frequency suggests a yet-to-be-identified potential mechanistic difference between circRNA and linear RNA biogenesis in the human genome. While we do not have evidence for mechanisms underlying the apparent differences between linear and circular RNA biogenesis, it is tempting to speculate that they could be related to properties of the spliceosomal recognition of upstream 5′ and downstream 3′ splice sites.

The findings in this paper also have purely bioinformatic implications: any transcript reconstruction algorithm must make assumptions about sequence motifs flanking introns. This implies that, even with perfect measurement technology (long reads and zero error rate), completely ab initio assignment of splice sites is not possible, and a hard-coded, ontology-guided assignment of intron boundaries must be used in 89% of annotated junctions. Longer read lengths for RNA-Seq will not overcome this problem, because the splice sites are inherently non-identifiable with respect to the genome. Finally, considering that such a large fraction of human introns have degenerate boundaries, a provocative yet untested hypothesis is that this degeneracy may be related to mechanisms by which introns are recognized and spliced. Our findings support the idea that the field should turn its focus to the expression of RNA sequence itself rather than calling splice sites at particular positions, reporting the class of genomic positions in the context of equivalent junctions.

Acknowledgements

We thank members of the Salzman Lab for their critical comments on the manuscript.

Funding

This work was supported by the National Cancer Institute [R00 CA168987], National Institute of General Medical Sciences [R01 GM116847], National Science Foundation [MCB-1552196], The Joint Initiative for Metrology in Biology Seed grant, McCormick-Gabilan Fellowship, Baxter Family Fellowship, and Alfred P. Sloan Foundation. R.D. is supported by the Cancer Systems Biology Scholars program at Stanford [R25 CA180993].

Conflict of Interest: none declared.

References

Barrett S.P. et al. (2017) ciRS-7 exonic sequence is embedded in a long non-coding RNA locus. PLoS Genet., 13, e1007114.
Burge C., Karlin S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94.
Burge C. et al. (1999) Splicing of Precursors to mRNAs by the Spliceosomes. Laboratory Press, Cold Spring Harbor, NY, pp. 525–560.
Burge C.B., Karlin S. (1998) Finding the genes in genomic DNA. Current Opin. Struct. Biol., 8, 346–354.
Carrara M. et al. (2015) Alternative splicing detection workflow needs a careful combination of sample prep and bioinformatics analysis. BMC Bioinformatics, 16, S2.
Costa V. et al. (2010) Uncovering the complexity of transcriptomes with RNA-Seq. BioMed. Res. Int., 2010, 1.
Fackenthal J.D., Godley L.A. (2008) Aberrant RNA splicing and its functional consequences in cancer cells. Disease Models Mech., 1, 37–42.
Gamazon E.R., Stranger B.E. (2014) Genomics of alternative splicing: evolution, development and pathophysiology. Hum. Genet., 133, 679–687.
Glažar P. et al. (2014) circBase: a database for circular RNAs. RNA, 20, 1666–1670.
Grabherr M.G. et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol., 29, 644–652.
Liu R. et al. (2014) Comparisons of computational methods for differential alternative splicing detection using RNA-Seq in plant systems. BMC Bioinformatics, 15, 364.
Lu H. et al. (2016) Oxford Nanopore MinION sequencing and genome assembly. Genomics, Prot. Bioinformatics, 14, 265–279.
Lukashin A.V., Borodovsky M. (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res., 26, 1107–1115.
Mathé C. et al. (2002) Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res., 30, 4103–4117.
Pan Q. et al. (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet., 40, 1413–1415.
Rapaport F. et al. (2013) Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol., 14, R95.
Roca X. et al. (2013) Pick one, but be quick: 5′ splice sites and the problems of too many choices. Genes Dev., 27, 129–144.
Rosenberg A.B. et al. (2015) Learning the sequence determinants of alternative splicing from millions of random sequences. Cell, 163, 698–711.
Salzman J. et al. (2012) Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS One, 7, e30733.
Stephens R.M., Schneider T.D. (1992) Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J. Mol. Biol., 228, 1124–1136.
Sveen A. et al. (2016) Aberrant RNA splicing in cancer; expression changes and driver mutations of splicing factor genes. Oncogene, 35, 2413.
Szabo L., Salzman J. (2016) Detecting circular RNAs: bioinformatic and experimental challenges. Nat. Rev. Genet., 17, 679–692.
Teng M. et al. (2016) A benchmark for RNA-seq quantification pipelines. Genome Biol., 17, 74.
Trapnell C. et al. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25, 1105–1111.
Wang P.L. et al. (2014) Circular RNA is expressed across the eukaryotic tree of life. PLoS One, 9, e90859.
