Analysis of 948 RNA sequencing data sets produced a comprehensive map of intron branchpoints and lariat RNAs in Arabidopsis thaliana, tomato, rice, and maize.
Abstract
Lariats are formed by excised introns, when the 5′ splice site joins with the branchpoint (BP) during splicing. Although lariat RNAs are usually degraded by RNA debranching enzyme 1, recent findings in animals detected many lariat RNAs under physiological conditions. By contrast, the features of BPs and to what extent lariat RNAs accumulate naturally are largely unexplored in plants. Here, we analyzed 948 RNA sequencing data sets to document plant BPs and lariat RNAs on a genome-wide scale. In total, we identified 13,872, 5199, 29,582, and 13,487 BPs in Arabidopsis (Arabidopsis thaliana), tomato (Solanum lycopersicum), rice (Oryza sativa), and maize (Zea mays), respectively. Features of plant BPs are highly similar to those in yeast and human, in that BPs are adenine-preferred and flanked by uracil-enriched sequences. Intriguingly, ∼20% of introns harbor multiple BPs, and BP usage is tissue-specific. Furthermore, 10,580 lariat RNAs accumulate in wild-type Arabidopsis plants, and most of these lariat RNAs originate from longer or retroelement-depleted introns. Moreover, the expression of these lariat RNAs is accompanied by the incidence of back-splicing of parent exons. Collectively, our results provide a comprehensive map of intron BPs and lariat RNAs in four plant species and uncover a link between lariat turnover and splicing.
INTRODUCTION
In eukaryotes, splicing of mRNA precursors (pre-mRNAs), a highly conserved critical step for gene expression, comprises two catalytic steps (Ruskin et al., 1984). In the first step, the 5′ splice site (5′ss; usually a GU dinucleotide) is attacked and concurrently the 5′ end of the intron is joined to the branchpoint (BP) by forming a 2′-5′-phosphodiester bond. This results in the production of a 5′ exon and a lariat intermediate RNA that consists of a lariat-form intron and a 3′ exon. These intermediates are then subjected to the second step of the reaction, in which the 3′ splice site (3′ss; usually an AG dinucleotide) is attacked and the two exons are ligated to produce the mRNA (Ruskin et al., 1984). The excised lariat introns, termed lariat RNAs, are traditionally thought to be degraded quickly, when a dedicated debranching enzyme1 (DBR1) recognizes the BP and linearizes the lariat, promoting its rapid degradation (Ruskin and Green, 1985; Nam et al., 1997; Kim et al., 2000, 2001; Wang et al., 2004). As an obligate signal, the BP must be properly selected to ensure efficient splicing (Jacquier and Rosbash, 1986). Recent studies showed that features of the BP are highly conserved from yeast to human, that BP selection is indeed regulated, and that BP mutation occurs in various diseases (Taggart et al., 2012, 2017; Bitton et al., 2014; Mercer et al., 2015; Pineda and Bradley, 2018).
In general, lariat RNAs derived from excised introns are usually destined for intron recycling, although in animals some debranched lariat RNAs can be further processed into mirtron microRNAs (Okamura et al., 2007; Ruby et al., 2007), or into small interfering RNAs in yeast (Dumesic et al., 2013), or into small nucleolar RNAs (Ooi et al., 1998). However, some lariat RNAs accumulate under physiological conditions in animals (Zhang et al., 2013; Talhouarne and Gall, 2014, 2018; Tay and Pek, 2017). Moreover, two recent studies showed that intron RNAs promote cell survival in yeast (Morgan et al., 2019; Parenteau et al., 2019), further indicating that intronic RNAs are not useless by-products of splicing but rather that these intronic RNAs play essential roles in eukaryotes.
Loss-of-function mutants of DBR1 are embryonic lethal in both plants and animals and accompanied by overaccumulation of lariat RNAs (Wang et al., 2004; Zheng et al., 2015; Li et al., 2016), indicating that DBR1 is essential for viability in both plants and animals. We showed that lariat RNAs act as decoys to inhibit genome-wide microRNA biogenesis by sequestering the Dicer complex (Li et al., 2016). Together, the findings that lariat RNAs act as decoys to sequester the toxicity of TDP-43 in amyotrophic lateral sclerosis disease (Armakola et al., 2012) and that loss of DBR1 caused compromised retrovirus replication (Ye et al., 2005; Galvis et al., 2014, 2017; Zhang et al., 2018) indicate that a strategy to control lariat RNA abundance is a potential therapeutic approach.
In earlier studies, the detection of lariat RNAs was usually based on RT-PCR, which exploits the ability of the reverse transcriptase to read through the BP (Suzuki et al., 2006). With breakthroughs of high-throughput sequencing technologies and bioinformatics analyses, several studies from animals, such as Xenopus tropicalis, Drosophila melanogaster, mouse (Mus musculus), chicken (Gallus gallus domesticus), zebrafish (Danio rerio), and human, showed genome-wide accumulation of lariat RNAs in stable circular forms (Zhang et al., 2013; Tay and Pek, 2017; Talhouarne and Gall, 2018), implying that the phenomenon of lariat RNAs accumulating naturally is evolutionarily conserved in animals. Although the formation of lariat RNAs is highly conserved in eukaryotes, the features of lariat RNAs in plants were largely unexplored. Importantly, the features of BPs and/or whether flanking sequences of BPs play a role in lariat RNA turnover were also unclear in plants. In contrast to the increasing understanding of BPs and lariat RNAs in yeast (Bitton et al., 2014) and animals (Taggart et al., 2012; Mercer et al., 2015; Pineda and Bradley, 2018), a genome-wide analysis of plant intronic BPs and lariat RNAs had not yet been reported.
Here, we performed large-scale analyses to systematically identify BPs across four plant species, dicots and monocots, to provide a comprehensive map of intron BPs on a genome-wide scale in plants. Our results indicate that plant introns prefer adenines as their BPs, that many introns have multiple BPs, and that BP usage is tissue-specific. Furthermore, using circular RNA sequencing (RNA-seq) analyses from the wild type and a weak viable dbr1 mutant (dbr1-2), we showed that over 10,000 lariat RNAs accumulate with at least 5 FPKM (fragments per kilobase pair per million sequencing tags) in wild-type Arabidopsis (Arabidopsis thaliana). The expression of these lariat RNAs is anticorrelated with the insertion frequency of retroelements in the introns but is positively correlated with the incidence of back-splicing of flanking exons. Our data provide insights into the characteristics of plant lariat RNAs and intron BPs and reveal an unexpected complexity of BP selection, lariat RNA turnover, and splicing.
RESULTS
Transcriptomic Analyses of Col-0 and dbr1-2 by Circular RNA-Seq
Because recent studies showed that accumulation of lariat RNAs occurs widely in animals (Zhang et al., 2013; Tay and Pek, 2017; Talhouarne and Gall, 2018), we investigated whether this was also the case in plants. As it is known that lariat RNAs exist in a circular form by escaping linearization in vivo, we performed circular RNA-seq to globally identify lariat RNAs under physiological conditions. Briefly, by taking advantage of dbr1-2, a weak viable allele of dbr1 (Li et al., 2016), we enriched sequencing tags spanning the junction between the 5′ss and the BP from the transcriptomes of Col-0 and dbr1-2 for genome-wide identification of BPs and lariat RNAs (Supplemental Figure 1). The RNA-seq profiles were consistent in multiple samples for both Col-0 and dbr1-2, with and without RNase R treatments (Supplemental Figure 2A).
As most linear mRNAs were digested by RNase R, the RNase R-treated samples had lower gene expression levels than samples without RNase R treatment (Supplemental Figure 3A). The global gene expression patterns of Col-0 and dbr1-2 were very similar, as shown by the high correlation coefficients between the Col-0 and dbr1-2 samples without RNase R treatment (Supplemental Figure 3B). Hierarchical clustering showed that the Col-0 and dbr1-2 samples without RNase R treatment grouped together, and the differences within this cluster were much smaller than those among the RNase R-treated samples of Col-0 and dbr1-2 (Supplemental Figure 3B). Principal component analysis (PCA) also showed that samples without RNase R treatment clustered together (Supplemental Figure 3C). Consistent with this, only 60 genes were deregulated in dbr1-2 relative to Col-0 (multiple-test-corrected P < 0.05; Supplemental Data Set 1). These results suggest that the two genotypes have similar gene expression profiles.
However, the accumulation levels of introns were significantly increased in dbr1-2, in samples both with and without RNase R treatment (P < 10⁻¹⁰⁰ in both cases, Mann-Whitney U test; Figure 1A). Intron expression was consistent for the Col-0 and dbr1-2 profiles with and without RNase R treatment (Supplemental Figure 2B). Interestingly, samples were clustered based on the intron-level profiles by both hierarchical clustering (Figure 1B) and PCA (Figure 1C). Furthermore, the differences between Col-0 and dbr1-2 samples detected using intron levels were larger than those detected using gene levels (Supplemental Figure 3B). These results suggest that intronic expression underlies the main differences between the transcriptomes of Col-0 and dbr1-2. By selecting intronic transcripts that had average abundances of ≥5 FPKM in Col-0 samples with RNase R treatment, we identified 10,580 transcripts (619 detected only in Col-0 and 9961 detected in both Col-0 and dbr1-2) as lariat RNAs from 6585 genes (548 genes from Col-0 only and 6037 genes from both Col-0 and dbr1-2; Figure 1D; Supplemental Data Set 1), indicating that these lariat RNAs accumulated under physiological conditions. Of note, the Arabidopsis genome has 128,271 annotated introns from 22,524 genes, but more than 50% of these introns (64,213/128,271 = 50.1%) are ≤100 nucleotides (nt) and were underrepresented in our study, presumably due to the intentional depletion of short fragments during library preparation.
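The ≥5 FPKM selection described above can be illustrated with a minimal Python sketch. It assumes the standard FPKM definition (fragments × 10⁹ divided by feature length in bp and by library size); the function names, the input dictionaries, and the use of total mapped fragments as the library size are illustrative assumptions, not the code used in this study.

```python
from statistics import mean

def fpkm(fragments, length_bp, library_size):
    """Fragments per kilobase per million fragments (standard definition)."""
    return fragments * 1e9 / (length_bp * library_size)

def select_lariat_candidates(counts, lengths, library_sizes, min_fpkm=5.0):
    """Keep introns whose average FPKM across replicates is >= min_fpkm.

    counts        : {intron_id: [fragments in replicate 1, replicate 2, ...]}
    lengths       : {intron_id: intron length in bp}
    library_sizes : [total fragments in replicate 1, replicate 2, ...]
    All three inputs are hypothetical structures chosen for this sketch.
    """
    selected = {}
    for intron_id, per_replicate in counts.items():
        values = [
            fpkm(c, lengths[intron_id], n)
            for c, n in zip(per_replicate, library_sizes)
        ]
        average = mean(values)
        if average >= min_fpkm:
            selected[intron_id] = average
    return selected
```

Under these assumptions, an intron of 500 bp with 50 fragments in a library of 10 million fragments has an FPKM of 10 and would pass the threshold.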
Figure 1.
A Summary of the Results of Eight RNA-Seq Profiles.
(A) A global view of intron expression levels in the eight RNA-seq profiles. Introns of >100 bp were used. The average expression levels of Col-0 and dbr1-2, with and without RNase R treatments, were compared using the Mann-Whitney U test. r1 and r2 indicate replicate 1 and replicate 2, respectively.
(B) The correlation coefficient between intron expression levels and hierarchical clustering analysis.
(C) PCA based on the intron expression levels.
(D) Venn diagram showing the numbers of genes with accumulation of lariat RNA in Col-0 and dbr1-2. The numbers indicate gene numbers with average abundances of intronic transcripts ≥5 FPKM in Col-0 and dbr1-2.
(E) The numbers of downregulated (blue) and upregulated (red) introns in dbr1-2 RNase R (+) samples when compared with Col-0 RNase R (+) samples. FC, fold change; FDR, false discovery rate.
(F) The length of all introns >100 bp and introns deregulated in dbr1-2 samples with RNase R treatments.
In contrast to those in Col-0, 15,602 intronic transcripts (5641 detected only in dbr1-2 and 9961 detected in both Col-0 and dbr1-2) from 10,242 genes (4205 genes from dbr1-2 only and 6037 genes from both Col-0 and dbr1-2) showed average abundances of ≥5 FPKM in dbr1-2 samples with RNase R treatment (Figure 1D), and 6720 unique intronic transcripts from 4672 genes had significantly higher expression levels in dbr1-2 than in Col-0 (Figure 1E; Supplemental Data Set 1). Notably, this higher intronic expression was heavily biased toward long introns (Figure 1F). This bias might be due to the depletion of small introns by the commercial library construction protocols. Because linear RNAs were removed by the RNase R treatment, the increased intronic accumulation in dbr1-2 reflects the accumulation of lariat RNAs.
To exclude the possibility that increased intronic accumulation in dbr1-2 was caused by defective splicing efficiency, we compared the splicing efficiency in Col-0 and dbr1-2 using RNA-seq profiles without RNase R treatments. The overall splicing efficiency showed no significant difference between Col-0 and dbr1-2 (Supplemental Figure 3D). Therefore, because dbr1-2 showed minor effects on gene expression but major effects on intron expression, it was reasonable to use these transcriptomes to further investigate BPs and lariat RNAs in Arabidopsis.
BP Features Are Highly Conserved from Dicots to Monocots
Reverse transcriptase can traverse the BP to copy the intronic region upstream of the BP, and thus this product contains two juxtaposed intronic segments that align in an inverted order, defining the 5′ss and the BP (Suzuki et al., 2006). Considering that this same read-through phenomenon also occurs during the construction of RNA-seq libraries, we developed a computational pipeline using RNA-seq data sets to systematically map the BPs in plants (Figure 2A). In brief, we first aligned all sequenced reads to the genome with TopHat2 (Kim et al., 2013) or HISAT2 (Kim et al., 2015). Then, we aligned unmapped reads to introns with BLASTN or Bowtie 2 (Langmead and Salzberg, 2012). For those reads that could be partially mapped to introns, we examined whether the unmapped regions of the same reads could be aligned to the same introns (Figure 2A). Reads that spanned the 5′ss and another region close to the 3′ss, with at least 6 nt on each segment, were used to infer the BP. The last nucleotide of the mapped region close to the 3′ss was predicted as the BP, and reads that covered the same BP were grouped. By employing this approach to analyze 948 RNA-seq profiles in total (Supplemental Data Set 2), including 167 RNA-seq profiles for Arabidopsis, 264 for tomato (Solanum lycopersicum), 207 for rice (Oryza sativa), and 310 for maize (Zea mays), we obtained ∼300,000 informative sequenced reads in total and identified 13,872 BPs from 6414 introns in Arabidopsis, 5199 BPs from 2566 introns in tomato, 29,582 BPs from 11,026 introns in rice, and 13,487 BPs from 5986 introns in maize (Supplemental Data Set 3).
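The core inference step of this pipeline can be sketched in Python as follows. This is a simplified illustration, not the self-developed program used in the study: it assumes that the read is given in the sense orientation of the intron, that both segments match the intron exactly (the actual pipeline relied on BLASTN or Bowtie 2 alignments, which tolerate mismatches), and that the first valid split is accepted. The function name `infer_branchpoint` is hypothetical.

```python
MIN_SEGMENT = 6  # minimum nt on each side of the junction, as stated in the text

def infer_branchpoint(intron, read, min_segment=MIN_SEGMENT):
    """Return the 1-based intron position of the inferred BP, or None.

    A lariat read is expected to contain a segment ending at the BP,
    followed directly by the first nucleotides of the intron (the 5'ss),
    i.e. the two segments appear in inverted order relative to the genome.
    """
    for split in range(min_segment, len(read) - min_segment + 1):
        bp_segment, five_prime_segment = read[:split], read[split:]
        # The second segment must start exactly at the 5' end of the intron.
        if not intron.startswith(five_prime_segment):
            continue
        # The first segment must map downstream of the 5'ss segment.
        start = intron.find(bp_segment, len(five_prime_segment))
        if start != -1:
            # Last nucleotide of the BP-proximal segment = predicted BP.
            return start + len(bp_segment)
    return None
```

A read that aligns colinearly within the intron does not satisfy the inverted-order condition and yields `None`, so only reads crossing the 5′ss-BP junction contribute to BP calls.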
Figure 2.
Characterization of BPs in Arabidopsis, Tomato, Rice, and Maize.
(A) The computational pipeline to identify BPs. The sequencing reads were aligned to the genome with TopHat (v2). The unmapped reads were aligned to the database of introns with BLASTN. The partially mapped reads were examined to find whether the remaining parts of the same reads could be aligned to the same introns with a self-developed program. The reads supporting the same BP were collectively used to infer the corresponding BP. The results from different RNA-seq profiles were combined.
(B) The percentages of different nucleotides as the identified BPs in the four plant species (Arabidopsis, tomato, rice, and maize). Numbers indicate the numbers with the indicated nucleotide as the BP, and percentages indicate the ratio of the indicated BP numbers relative to the total numbers of identified BPs.
(C) The distribution of the distance from the BP to the 3′ss in the four plant species (Arabidopsis, tomato, rice, and maize). In the distribution of maize, the widths of bars from −1000 to −50 are 19 nt, and the widths of bars from −50 to 0 are 1 nt.
(D) The nucleotide preferences flanking the BP in the four plant species (Arabidopsis, tomato, rice, and maize).
In both dicot species (Arabidopsis and tomato) and monocot species (rice and maize), BPs within constitutive introns were most frequently adenines (>50%), followed by thymines/uracils (15–20%), guanines (∼8–20%), and cytosines (∼2–10%; Figure 2B), as reported in yeast and human (Taggart et al., 2012, 2017; Bitton et al., 2014; Mercer et al., 2015; Pineda and Bradley, 2018). By randomly selecting 16 lariat RNAs from Arabidopsis for Sanger sequencing, we confirmed that these BPs were adenines (Supplemental Figures 3E and 3F). In addition, previous studies showed that the distance from the BP to the 3′ss is tightly constrained (Taggart et al., 2012; Bitton et al., 2014; Mercer et al., 2015). We found that BPs were preferentially positioned within 50 nt upstream of the 3′ss in Arabidopsis, tomato, and rice (Figure 2C), highly similar to those in yeast and human (Taggart et al., 2012; Bitton et al., 2014; Mercer et al., 2015). However, only 51.2% (6903/13,487) of BPs were located within 50 nt upstream of the 3′ss in maize (Figure 2C), and around half of BPs were positioned between 100 and 1000 nt from the 3′ss (Figure 2C). The heterogeneity of the distance of BPs from the 3′ss in maize indicates that the mechanism of BP selection in maize is more complicated.
Although >50% of constitutive introns in plants use adenine as the BP, a substantial portion of introns used other nucleotides as their BPs (Figure 2B; Supplemental Figure 4). To exclude the possibility that non-adenine BPs were caused by lower fidelity during the conversion of BPs by reverse transcriptase, we analyzed the mutation events at BPs during conversion by calculating the number of BPs with the indicated nucleotide in the reference genome relative to the total number of identified BPs in each species. We found that adenine in the annotated sequences of pre-mRNAs was much more easily converted to uracil (Supplemental Figures 5A to 5D). In contrast, guanines retained high fidelity during library construction (Supplemental Figures 5A to 5D). This phenomenon has been reported in a previous study (Taggart et al., 2012). Therefore, the identified BPs with guanines are most likely noncanonical BPs in plants.
Since the flanking regions of BPs bind U2 small nuclear RNA, we surmised that nucleotides flanking the BPs might be important for BP recognition. To identify potential cis-elements, we analyzed nucleotides around the BP. We identified a consensus motif containing a 10-nt uracil-rich element downstream of the BP (Figure 2D), and the second position upstream of the BP exhibited a strong preference for the uracil nucleotide (4.0-fold enrichment) in all four plant species (Figure 2D), which is consistent with a recent finding in human cell lines (Mercer et al., 2015). Moreover, multiple nucleotides downstream of the BP showed preferences for uracils (Figure 2D). These observations indicate that BP selection is highly conserved from plants to animals.
Multiple BPs in Plants
In general, a default BP is set for each intron. However, by calculating the number of BPs in a single intron in all four plant species, we found that although most introns had only one identified BP, ∼20% of the introns used two or more BPs (Figure 3A). For example, At4g39260.1 I1 (the first intron of At4g39260.1) has two BPs identified from our RNA-seq analyses, and the interval between the two BPs is very short (Supplemental Figure 6A). We ranked the BPs for each intron by the number of mapped lariat reads and defined the one supported by the highest number of reads as the major BP. In At4g39260.1 I1, the major BP (the 258th nucleotide) is an adenine supported by 15 reads, whereas the second BP (the 256th nucleotide) is a uracil supported by 4 reads (Supplemental Figure 6B).
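The ranking rule can be written as a short Python sketch. The input format (one inferred BP position per lariat read) and the function name are assumptions made for illustration; ties are broken by position only to keep the output deterministic, which is not something the text specifies.

```python
from collections import Counter

def rank_branchpoints(read_bps):
    """Rank the BPs of one intron by their number of supporting lariat reads.

    read_bps: iterable with one inferred BP position per lariat read.
    Returns a list of (position, reads, fraction_of_intron_reads),
    with the major BP first.
    """
    counts = Counter(read_bps)
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(pos, n, n / total) for pos, n in ranked]

# Read support reported for At4g39260.1 I1: 15 reads at 258, 4 reads at 256.
ranking = rank_branchpoints([258] * 15 + [256] * 4)
major_bp = ranking[0][0]  # 258
```

With the read counts reported for At4g39260.1 I1, position 258 is ranked first with 15 of 19 lariat reads (about 79%).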
Figure 3.
Multiple BPs in the Four Plant Species.
(A) The intron numbers (y axis) with indicated BP numbers (x axis) in the four plant species (Arabidopsis, tomato, rice, and maize).
(B) The distance distributions of multiple BPs along the intron in the four plant species. Only introns with ≥5 lariat reads were counted. The percentages of lariat reads (y axis) were calculated by dividing the number of lariat reads for a specific BP by the total number of lariat reads for an intron.
We then used Sanger sequencing to further validate the two BPs in At4g39260.1 I1. The score peaks before the 258th nucleotide of this intron were distinct, which indicates a single possible nucleotide. However, on and after the 258th nucleotide, there were multiple peaks for each nucleotide. We carefully examined the sequence that corresponds to the major and minor peaks by Sanger sequencing. The two sequences (Supplemental Figure 6C) actually resulted from the two BPs at the 258th and 256th nt of At4g39260.1 I1, respectively.
Next, we investigated whether the distance of the BP from the 3′ss affected BP usage. Quantitatively plotting the numbers of lariat reads as a function of BP position showed that the majority of the most frequently used BPs resided within a narrow window in all four plant species, consistent with the restricted genome-wide distribution of BPs (Figure 3B). Together with the fact that multiple BPs occur widely in human cell lines (Pineda and Bradley, 2018), these findings indicate that the phenomenon of multiple BPs is conserved from plants to mammals.
Tissue-Specific BPs in Arabidopsis and Rice
We surmised that the existence of multiple BPs might play a regulatory role in pre-mRNA splicing. In other words, BP usage might be developmentally regulated, as recently reported in human cell lines (Pineda and Bradley, 2018). To test this hypothesis, we investigated whether certain introns exhibit tissue-specific BP usage in Arabidopsis and rice. By grouping 167 RNA-seq profiles in Arabidopsis according to the tissue used for RNA extraction, we selected five tissues (callus, roots, seedlings, leaves, and inflorescences) that had the largest numbers of supporting reads flanking the BP to identify tissue-specific BPs using multinomial proportion tests and estimated the false discovery rate according to a previous study (Benjamini and Hochberg, 1995). Due to the transient nature of lariat RNAs and the specific selection of poly(A)+ RNA in traditional RNA-seq library construction protocols, informative reads that traverse the lariat junction between the 5′ss and the BP are rare. However, we still detected 136 tissue-specific BPs in Arabidopsis (Supplemental Data Set 4). By using the same method, we grouped 207 rice RNA-seq profiles according to the tissue used for RNA extraction and selected five tissues (nematode-induced giant cells, panicle, roots, shoots, and vascular cells) that had the largest numbers of supporting reads flanking the BP to identify tissue-specific BPs using the multinomial proportion tests. Consequently, we identified 565 tissue-specific BPs in rice (Supplemental Data Set 4).
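The false discovery rate step can be sketched as below, assuming that one P value per candidate BP has already been obtained from the multinomial proportion tests (those tests are not reproduced here). The function implements the standard Benjamini-Hochberg step-up adjustment; its name and interface are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that adjusted
    # values never increase as the raw P value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A BP would then be called tissue-specific when its adjusted P value falls below the chosen cutoff; the cutoff itself is not stated in this section.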
Given the above-mentioned positional effects on BP usage, we expected to observe preferential usage of BPs that were within ∼50 bp proximal to the 3′ss. Unexpectedly, we found that BP usage was instead highly tissue-specific (Supplemental Data Set 4). For example, three BPs were identified for the ninth intron of At3g01500.1, in which the distal BP (the 154th nucleotide upstream of the 3′ss) had a significantly higher preference in leaves and seedlings but not in inflorescences (P = 1.7 × 10⁻⁶¹, multinomial proportion test; Figure 4A), while the proximal BPs (the 33rd and 34th nucleotides upstream of the 3′ss) were frequently used in inflorescences but not in leaves or seedlings (Figure 4A). Similarly, three BPs were identified for the first intron of Os04g16748, in which the most distal BP (the 700th nucleotide upstream of the 3′ss) was only detected in panicles and the distal BP (the 87th nucleotide upstream of the 3′ss) was mainly used in giant cells and vascular cells, while the closest BP (the 7th nucleotide upstream of the 3′ss) was frequently used in panicles, shoots, and roots (Figure 4B).
Figure 4.
Tissue-Specific BP Usage in Arabidopsis and Rice.
(A) and (B) The percentages of supporting lariat reads at the indicated BPs in different tissues for the ninth intron of At3g01500.1 (A) and for the first intron of Os04g16748.1 (B). The nucleotides in uppercase and red letters were tissue-specific BPs identified in this study. The P values indicate the multiple test corrected P values. The numbers below the sequences of the introns are the positions of the identified BPs in the introns from the 3′ss.
(C) The summarized information of identified BPs for the seventh intron of At3g23590.1 by RT-PCR followed by Sanger sequencing. The seventh intron length of At3g23590.1 is 311 bp. The labels 216th, 217th, 220th… indicate the distance downstream of the 5′ss, and F and R indicate the pair of primers used for RT-PCR to amplify lariat RNAs originated from this intron.
(D) The distributions of multiple BPs in different tissues by RT-PCR followed by Sanger sequencing. Seventeen, 14, 21, 24, and 28 individual clones were sequenced for roots, seedlings, leaves, inflorescences, and siliques, respectively. The numbers within the circles indicate the clones carrying the indicated BPs identified by Sanger sequencing.
To further validate this tissue-specific BP usage, we amplified the RT-PCR products of the seventh intron of At3g23590.1 from five different tissues (roots, seedlings, leaves, inflorescences, and siliques) using indicated divergent primers and performed Sanger sequencing. We obtained seven different isoforms of the seventh intron of At3g23590 by using the same pair of primers and showed that seven unique BPs existed in the lariat RNAs (Figure 4C). To examine whether the usage of these seven BPs exhibited a tissue-specific pattern, we sequenced more than 10 independent clones of RT-PCR products for each tissue and counted the frequency of different BPs in tested tissues (Supplemental Figure 7). We showed that different BPs exhibited significant preferences in specific tissues (Figure 4D). For example, the 216th BP was mainly selected in leaves, inflorescences, and siliques (Figure 4D). In contrast, the 224th BP was preferentially used in roots and seedlings (Figure 4D). Unexpectedly, the 285th and 287th BPs, two BPs within 50 bp upstream of the 3′ss, were seldom used in any tested tissues. Although the regulatory mechanism of specific BP selection remains unknown, these results suggest that the multiple BP usage is indeed regulated in a tissue-specific manner, which is consistent with results in human introns (Pineda and Bradley, 2018).
A Subset of Introns Self-Circularize with the 5′ss and the 3′ss in Plants
Several studies show that BP selection determines 3′ss recognition, in which the first AG downstream of the BP is usually used as the 3′ss (Smith et al., 1989; Gooding et al., 2006). Although other criteria, including secondary structure, context flanking the AG, distance to neighboring AGs, and an optimal distance between the BP and AG shown above (Figure 2C), have been used (Chen et al., 2000; Chua and Reed, 2001; Meyer et al., 2011), the “AG exclusion zone” has been widely accepted to predict the 3′ss. To test whether the AG exclusion zone applies to 3′ss recognition in plant introns, we scanned the context of the 3′ss of all introns with BPs identified in our study. As expected, 58.2% of the BPs in Arabidopsis, 54.2% in tomato, 54.8% in rice, and 48.2% in maize selected the first AG downstream of the BP as the 3′ss (P < 0.001, by permutation test; Figure 5A; Supplemental Data Set 3). However, a substantial portion (∼30–40% in the four species) of the 3′ss skipped the first AG downstream of the BP (Figure 5A; Supplemental Data Set 3). More interestingly, some introns appeared to avoid AG downstream of the BP and instead selected a non-AG as the 3′ss (Figure 5A; Supplemental Data Set 3). These observations suggest that the determination of the 3′ss in plants is not tightly regulated by the AG exclusion zone.
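The four 3′ss categories used in this analysis (see Figure 5A) can be expressed as a small Python classifier. This is a sketch under stated assumptions: the intron is given 5′ to 3′ in DNA letters with the 3′ss dinucleotide as its last two nucleotides, the BP is a 1-based intron position, and the search for AG starts strictly downstream of the BP. The function name and category labels are hypothetical.

```python
def classify_3ss(intron, bp_pos):
    """Classify the 3'ss of an intron relative to a BP (1-based position).

    Returns one of:
      "bp_within_3ss" - the BP is one of the two 3'ss nucleotides
      "non_AG_3ss"    - the intron does not end in AG
      "first_AG"      - the 3'ss is the first AG downstream of the BP
      "later_AG"      - an AG upstream of the 3'ss was skipped
    """
    n = len(intron)
    if bp_pos >= n - 1:
        return "bp_within_3ss"
    if intron[-2:] != "AG":
        return "non_AG_3ss"
    # 0-based index bp_pos is the first nucleotide downstream of the BP.
    first_ag = intron.find("AG", bp_pos)
    return "first_AG" if first_ag == n - 2 else "later_AG"
```

Counting these labels over all identified BPs of a species would give the proportions of each category; the permutation test used to assess significance is not covered by this sketch.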
Figure 5.
The Distributions of the First AGs Downstream of the BP in Plants.
(A) The nucleotide categories of the 3′ss based on the BP in four plant species. The four types of the 3′ss based on the BP were classified as follows: (1) the first AG downstream of the BP is the 3′ss (blue); (2) an AG other than the first AG is the 3′ss (orange); (3) the 3′ss is non-AG (gray); and (4) the BP is one of the nucleotides of the 3′ss (yellow).
(B) The BP of the second intron of At1g70830.1 is exactly located within the 3′ss and its supporting reads in two different RNA-seq profiles (SRR1190492 and SRR3234408). Numbers at right indicate the total number of supporting reads for each transcript.
(C) The BP of the eighth intron of Zm0001d013156.2 in maize is exactly located within the 3′ss and its supporting reads in two different RNA-seq profiles (SRR765414 and SRR765622). Numbers at right indicate the total number of supporting reads for each transcript.
Unexpectedly, we found that 107 introns in Arabidopsis, 82 introns in tomato, 429 introns in rice, and 269 introns in maize showed an overlap between the BP and the 3′ss (Figure 5A; Supplemental Data Set 5), indicating that these intronic RNAs self-circularized with the 5′ss and the 3′ss, as also reported in human cells (Gardner et al., 2012; Taggart et al., 2012; Tay and Pek, 2017; Talhouarne and Gall, 2018). Several examples showed that these intron transcripts were indeed circularized from the 5′ss to the 3′ss (Figures 5B and 5C; Supplemental Figure 8), and each circularized intronic RNA was detected in at least two independent RNA-seq profiles with more than 30 unique supporting reads (Figures 5B and 5C; Supplemental Figure 8). Moreover, the average lengths of these stably accumulated introns are longer than the average lengths of all introns in the four species, especially in rice and maize (Supplemental Figure 9). These observations indicate that some intronic RNAs escape the usual degradation pathway and instead accumulate in a circular form in vivo.
Identification of Lariat-Derived Circular RNAs in Arabidopsis
Lariat RNA formation during splicing is highly conserved in eukaryotes, and we previously showed that some lariat RNAs accumulate naturally in plants (Li et al., 2016), as also reported in animals (Gardner et al., 2012; Zhang et al., 2013; Tay and Pek, 2017; Talhouarne and Gall, 2018). To identify lariat RNAs in plants on a genome-wide scale, we performed circular RNA-seq using total RNAs from inflorescences of wild-type plants and focused on the reads mapped to intronic regions only. Since RNase R degrades most linear RNAs, introns with significant accumulation of sequencing reads in RNase R-treated Col-0 samples were regarded as lariat-derived circular RNAs. We identified 10,580 lariat-derived circular RNAs with ≥5 FPKM generated from 6585 genes in Col-0 (Figure 1D; Supplemental Data Set 1). Among these lariat-derived circular RNAs, 1489 lariat RNAs with ≥20 FPKM were detected in wild-type plants (Figure 6A). The average length of the 64,058 introns >100 bp in the Arabidopsis genome is 253 bp, whereas the average length of the 10,580 introns with lariat accumulation is 378 bp, which is significantly longer than that of all introns (P < 10⁻¹⁵, Welch’s t test; Figure 6B). However, among these 10,580 introns with lariat accumulation in Col-0, the intron length was anticorrelated with the expression of lariat-derived circular RNAs (Figure 6C), suggesting that the expression of larger introns is limited. Consistent with this point, we found that introns upregulated in dbr1-2 were significantly longer than introns with lariat accumulation in Col-0 (Figure 6B).
Figure 6.
The Identification and Validation of Lariat RNAs in Arabidopsis.
(A) The distribution of the total 10,580 lariat RNAs identified in Col-0 with different expression levels. FPKM was used to evaluate the expression level of each lariat RNA, and the numbers indicate total numbers of lariat RNAs with indicated expression levels.
(B) The length of all introns >100 bp in the Arabidopsis genome, introns with lariat RNA expression ≥5 FPKM in Col-0, and introns with upregulated lariat RNAs in dbr1-2. Asterisks indicate P < 10⁻¹⁵, Welch’s t test.
(C) The length of introns with indicated expression levels of lariat RNAs. Asterisks indicate P < 10−15, Welch’s t test, compared with the group of 5542 introns with lariat RNA expression at FPKM 5 to 10 in Col-0.
(D) The genome browser of the abundances of four lariat-derived circular RNAs in the eight RNA-seq profiles. The dashed regions highlight the region between the BP and the 3′ss without supporting reads. Labels such as “Ch4: 9714955-9715275” indicate the genetic locations of the regions shown in the chromosome. Numbers in brackets indicate the values of the normalized expression levels. r1 and r2 indicate replicate 1 and replicate 2, respectively.
(E) Schematic showing the information of the probes used in (F) and (G).
(F) RNA gel blotting detected the expression of the four lariat-derived circular RNAs shown in (D). The lower panels (28S and 18S rRNAs were stained by ethidium bromide) indicate the loading controls, and all total RNA samples were loaded onto the same denatured agarose gel. The size of each band is indicated in each blot.
(G) RNA gel blotting further detected the lariat-derived circular RNA of At4g17390.1 I2 and At3g52590.1 I3 on the denatured PAGE gel. Equal amounts of total RNA samples from Col-0 and dbr1-2 were loaded onto the same urea-PAGE gel. The right-most panel shows the sample of total RNAs treated by RNase R. Linear RNA standards of 380, 238, and 183 nt were used as size indicators. The red arrows indicate the detected bands.
By examining the frequency of a single gene harboring lariat RNAs, we showed that most genes only allowed one lariat RNA to accumulate (Supplemental Figure 10A) and that more than 2000 lariat RNAs originated from the first intron (Supplemental Figure 10B). The potential coding capacity analysis showed that most lariat-derived circular RNAs are noncoding transcripts (Supplemental Figure 10C). Moreover, the expression of the lariat-derived circular RNAs is moderately correlated with expression of the parent gene in both Col-0 and dbr1-2 (Supplemental Figures 10D and 10E), which is consistent with the finding that lariat-derived circular RNAs promote expression of the parent gene in human cell lines (Zhang et al., 2013). However, the correlation coefficient between the expression of genes and introns was significantly decreased in dbr1-2 (Supplemental Figures 10D and 10E), consistent with the disturbed processing of lariat RNAs in dbr1-2.
To validate identified lariat-derived circular RNAs in vivo, we randomly chose four loci (Figure 6D) for detection by RNA gel blotting. These four lariat-derived circular RNAs represent two types of loci. One type is present in wild-type (Col-0) plants (i.e., At4g17390.1 I2 and At3g52590.1 I3; Figure 6D), while the other only accumulates in the dbr1-2 mutant (i.e., At1g60995.1 I8 and At5g23050.1 I8; Figure 6D). We first analyzed these lariat-derived circular RNAs in denatured agarose gels by RNA gel blotting using the antisense transcript of the intron as the probe (Figure 6E). As shown in Figure 6F, these four previously unreported intronic RNAs were detected with expected sizes. At4g17390.1 I2 and At3g52590.1 I3 were detected in Col-0, and At1g60995.1 I8 and At5g23050.1 I8 were only detected in dbr1-2 (Figure 6F). Although the levels of mature mRNA of At4g17390.1 and At3g52590.1 were comparable between Col-0 and dbr1-2, the intronic RNA levels of At4g17390.1 I2 and At3g52590.1 I3 were significantly higher in dbr1-2 (Figure 6F). Notably, the sizes of intronic RNAs are much smaller than the mRNA of the parent genes (Figure 6F), indicating that these intronic RNAs are individual transcripts.
To further exclude the possibility that the bands detected in Figure 6F are alternative linear precursor mRNAs or linear individual intronic RNAs, we loaded all RNA samples with three different-sized RNA standards (Figure 6G, far left) on the same denatured PAGE gels and then performed RNA gel blotting using the same probes as in Figure 6E for At4g17390.1 I2 and At3g52590.1 I3. It is well known that circular RNAs usually migrate much more slowly than linear RNAs of equivalent size on denatured PAGE gels. Consistent with the nature of lariat-derived circular RNAs observed in RNA-seq, both individual RNAs from At4g17390.1 I2 (290 nucleotides) and At3g52590.1 I3 (343 nucleotides) migrated much more slowly than the linear RNA standards, although their predicted sizes are much smaller than those of the RNA standards (Figure 6G). Moreover, the RNA from At4g17390.1 I2 migrated slightly more slowly than the one from At3g52590.1 I3 (Figure 6G), consistent with their size difference.
To conclusively demonstrate that these intronic RNAs are circular RNAs, we first treated total RNA using RNase R to degrade linear RNAs (Supplemental Figure 10F) and then examined whether the transcripts of At4g17390.1 I2 were retained by RNA gel blotting. As expected, there was a distinct band of At4g17390.1 I2 at exactly the same position in the RNase R-treated sample as in the untreated samples (Figure 6G), suggesting that this intronic RNA is present in vivo in a circular form. Indeed, we observed that sequencing reads from At4g17390.1 I2, At3g52590.1 I3, At1g60995.1 I8, and At5g23050.1 I8 only covered their respective intronic regions between the 5′ss and the BP and that there were no sequencing reads corresponding to the region between the BP and the 3′ss (the dashed region in Figure 6D). We thus systematically identified hundreds of previously unidentified but highly abundant lariat-derived circular RNAs in Arabidopsis.
Lariat-Accumulated Introns Accompany Increased Incidences of Exonic Back-Splicing Events
Since lariat-derived circular RNAs are formed simultaneously with the maturation of pre-mRNAs, we tested whether the accumulation of lariat RNA affects linear mRNA maturation in plants. It is known that specific sequences in the introns, such as Alu elements in mammalian introns, promote back-splicing of adjacent exons, thus inhibiting the production of linear mature mRNA (Liang and Wilusz, 2014; Zhang et al., 2014; Kramer et al., 2015). However, this mechanism might not be applicable in species that lack noticeable flanking intronic secondary structure, and a subsequent study showed that the formation of double lariats contributes to the occurrence of exonic back-splicing events in yeast (Barrett et al., 2015), indicating that there might be a connection between lariat structure and exon circularization. By analyzing the ratio of back-splicing events of two flanking exons, we showed that the incidence of back-splicing events was significantly correlated to the accumulation of lariat-derived circular RNAs (Figure 7A). Moreover, the correlation between the exonic circularization and intronic accumulation was independent of the position of flanking exons (i.e., both upstream and downstream adjacent exons exhibited increased incidence of back-splicing; Figure 7A). In addition to the correlation with two adjacent exons, the incidence of back-splicing events of the parent gene was also significantly increased with the accumulation of lariat-derived circular RNAs (Supplemental Figure 11A). These results indicate that the rapid elimination of lariat RNAs favors the production of linear mature mRNAs.
Figure 7.
The Interplay Among Exonic Back-Splicing, Intronic TE Insertion, and Lariat RNA Accumulation.
(A) Fold change of back-splicing incidence at two adjacent exons. Asterisks indicate P < 0.001, χ2 test, compared with all introns in the genome.
(B) The percentage of different types of TEs in different groups of introns. All introns represents introns >100 nt in the genome, other introns represent introns with lariat RNA accumulation at the level of FPKM 5 to 10, 10 to 20, 20 to 30, 30 to 50, and >50, and introns with up-regulated lariat RNAs in dbr1-2 as indicated.
(C) The comparison of lengths of RE-introns and non-RE introns.
(D) The comparisons of intron expression between RE-introns and non-RE introns. Asterisks indicate P < 10−10, Mann-Whitney U test. r1 and r2 indicate replicate 1 and replicate 2, respectively.
(E) Three examples of introns with or without retroelements in the eight RNA-seq profiles. Labels such as “Ch2: 14713970-14715028” indicate the genetic locations of the regions shown in the chromosome. Numbers in brackets indicate the values of the normalized expression level.
Exclusion of Retroelements in Lariat RNA-Accumulated Introns
To understand sequence features of lariat RNA-accumulated introns, we investigated whether the insertion of transposable elements (TEs) in the intronic regions affected the turnover of lariat RNAs. We used RepeatMasker (Tempel, 2012) to analyze the distributions of TEs in three types of introns (i.e., all 64,058 introns >100 bp, 10,580 introns with lariat-derived circular RNA accumulation in Col-0, and 6720 introns with higher accumulation of lariat-derived circular RNAs in dbr1-2). As shown in Figure 7B, 6510 introns harbored various types of TEs or repeated sequences, mainly including retroelements (long terminal repeat elements, short interspersed nuclear elements [SINEs], and long interspersed nuclear elements [LINEs]; ∼1% of total length), DNA transposons (∼1.5% of total length), and simple repeats (∼1.2% of total length). In contrast, introns with accumulation of lariat-derived circular RNAs were significantly depleted of long terminal repeat retroelements and DNA transposons (Figure 7B) but retained the simple repeated sequences (Figure 7B). In particular, those introns with the most abundant lariat-derived circular RNAs (≥50 FPKM in Col-0) were specifically enriched in satellite sequences (Figure 7B). In total, there are retroelements in 486 of all introns of ≥100 bp in the Arabidopsis genome (Supplemental Figure 11B; Supplemental Data Set 6). The ratio of introns with retroelements was significantly reduced in introns with lariat RNA accumulated in Col-0 and in introns with higher expression in dbr1-2 (P = 3.8 × 10−16 and P = 1.6 × 10−6, respectively, Fisher’s exact test; Supplemental Figure 11B). Compared with naturally accumulated introns in Col-0, the ratio of introns with repeat elements was slightly increased in introns with higher expression in dbr1-2 (P = 0.04, Fisher’s exact test; Supplemental Figure 11B). These results indicate that the insertion of different classes of TEs might play a role in the turnover of lariat RNAs.
Given that introns are longer in more complex eukaryotes and the insertion of TEs into intronic regions most likely contributes to the increase of intron length, we wanted to know if these introns depleted of retroelements are longer than other introns. We named those 486 introns with >100 nt in length and retroelement sequences as RE-introns and other introns as non-RE introns. Indeed, RE-introns are significantly longer than other introns (P < 10−100, Student’s t test; Figure 7C). Moreover, because the insertion of TEs usually leads to the formation of heterochromatic status of parent genes, which generally inhibits gene expression, we compared the expression levels of RE-introns and non-RE introns. RE-introns themselves had significantly lower expression levels than non-RE introns (P < 10−10, Mann-Whitney U test) in all eight RNA-seq profiles (Figure 7D), further suggesting that the presence of retroelements is anticorrelated with lariat RNA accumulation. Furthermore, the expression levels of parent genes with RE-introns were also significantly lower than for non-RE parental genes (P < 10−10, Mann-Whitney U test) in all eight RNA-seq profiles (Supplemental Figure 11C). As shown in Figure 7E, there are three retroelements in At2g34880.1 I5, which might contribute to its extremely low expression levels. In contrast, At2g14080.1 I1 only consists of one Long Interspersed Nuclear Element (LINE), and the expression of both parent gene and intron were much higher than At2g34880 (Figure 7E). Furthermore, At4g39260.1 I1 contained no retroelements and had much higher expression levels than either At2g34880.1 I5 or At2g14080.1 I1 (Figure 7E), and the parent gene At4g39260.1 also had much higher expression levels than At2g34880.1 and At2g14080.1, suggesting that retroelements contribute to the expression levels of both parent genes and their introns.
In Col-0 and dbr1-2 RNase R-untreated transcriptomes, we found that the expression level of At2g14080.1 I1 was very limited (Figure 7E), but At2g14080.1 I1 expression was abundant in RNase R (+) libraries, especially in dbr1-2 RNase R (+) libraries (Figure 7E), further indicating that DBR1 is responsible for the degradation of At2g14080.1 I1. Therefore, we systematically examined the types of TEs in higher expressed introns in dbr1-2. We found that unlike the exclusion of retroelements and DNA transposons in naturally accumulated introns in Col-0 (Figure 7B), a substantial portion of higher expressed introns in dbr1-2 harbored retroelements and DNA repeats (Supplemental Figure 11D). Collectively, these analyses indicate that retroelement insertion is anticorrelated with the accumulation of lariat RNAs.
DISCUSSION
Although both BP selection and lariat RNA formation are essential during pre-mRNA splicing, the features of BPs and lariat RNA detection have been mostly reported case by case. The first large-scale analysis of BPs was performed in Fairbrother’s lab (Taggart et al., 2012, 2017), in which high-throughput RNA-seq data from human cell lines were used to find the BP location and to map the distribution of splicing factors around BPs. A subsequent study developed a data-driven algorithm, Lariat Sequence Site Origin, to map precisely the location of BPs on a genome-wide scale in yeast (Bitton et al., 2014). With the improvement of circular RNA-seq, Mercer and colleagues used RNase R digestion followed by RNA-seq to enrich sequences that traverse the lariat junction and provided a first comprehensive map for human BPs (Mercer et al., 2015; Taggart et al., 2017). All these studies provide comprehensive knowledge about BPs and lariat RNAs in yeast and human. However, BPs and lariat RNAs in plants remained largely unexplored. In this study, we utilized a large number of published RNA-seq data sets from four plant species to extract BPs, and we took advantage of the viability of a weak allele of dbr1 to enrich for lariat RNAs, thus providing a comprehensive view of BPs in both monocots and dicots. Moreover, the systematic identification of lariat-derived circular RNAs in wild-type plants opens a research avenue that will allow examination of the unexpected role of intron transcripts.
The basic principles of the BP selection are highly conserved from plants to human (Taggart et al., 2012, 2017; Bitton et al., 2014; Mercer et al., 2015; Pineda and Bradley, 2018). First, the BP nucleotide is strictly constrained in distance from the 3′ss. Second, the BP nucleotide exhibits a strong preference for adenine. Third, sequences flanking the BP exhibit uracil-rich nucleotides. Fourth, uracil is preferred as the second nucleotide upstream of the BP. One of the earliest steps in spliceosome assembly is the binding of SF1 to the BP (Pastuszak et al., 2011), a process for which SF1 requires only the UnA motif, providing a mechanistic explanation for the importance of the uracil in the second last position before the BP.
Although downstream sequences of the BP in plants are uracil-rich (this study), both uracil-rich and cytosine-rich downstream sequences in human (Mercer et al., 2015) indicate that heterogeneity of downstream sequences may enable the sequence-specific selection of multiple BPs by the spliceosome, resulting in more complicated regulation of splicing in larger genomes. Besides the common features in the BP nucleotide and flanking sequences, we also found that multiple BPs (Figure 3) and tissue-specific BP usage (Figure 4) might contribute to the complexity of gene expression in plants.
Interestingly, we observed that the accumulation of lariat-derived circular RNAs was correlated with the occurrence of back-splicing events of flanking exonic regions (Figure 7), indicating that quick turnover of lariat RNAs by DBR1 is beneficial for pre-mRNA splicing to favor the production of linear mRNA. Recent mechanistic studies show that intronic complementary sequences (Liang and Wilusz, 2014; Zhang et al., 2014; Kramer et al., 2015), the homodimerization of specific proteins binding to intronic regions (Conn et al., 2015), or potential intronic RNA-RNA interaction (Ivanov et al., 2015) promote exonic back-splicing events. Our finding that the rapid turnover of lariat RNAs prevents back-splicing uncovers a new perspective to understand the biological significance of intron metabolism in gene expression. Identification of lariat RNA binding proteins will provide further mechanistic evidence of the balance between linear mRNA production and intronic circular RNA formation.
In addition, we observed an anticorrelation between intronic retroelement insertion and lariat RNA accumulation in plants (Figure 7; Supplemental Figure 11). Two possibilities might explain this phenomenon. First, the insertion of TEs in intronic regions of coding genes usually leads to heterochromatization of the parent gene, and thus the transcription of the parent gene is limited, which finally leads to less production of lariat RNAs. Second, the transcription of parent genes is quite normal but their corresponding lariat RNAs with TE sequences are preferentially degraded by DBR1. The latter possibility is consistent with previous findings that DBR1 was initially identified as the regulator for TE transposition in yeast (Karst et al., 2000). Together with our finding that those TE-containing introns were more highly expressed in dbr1-2 (Supplemental Figure 11F), these results suggest that lariat RNAs formed from TE-containing introns might be much more sensitive to DBR1 activity.
In summary, our work provides a comprehensive map of BPs and lariat-derived circular RNAs in four plant species, uncovers features of BPs and lariat-derived circular RNAs, shows a potential link between intron metabolism and the evolution of TEs, and opens a novel perspective to understand the communication between intronic circular RNAs and exonic circular RNAs.
METHODS
Materials and RNA-Seq Libraries
Arabidopsis (Arabidopsis thaliana) Col-0 was used as the wild type. Seeds of dbr1-2 were generated previously (Li et al., 2016). Plants were grown in a 16-h-light (bulb type, Philips TLD 36W/865, with eight tubes)/8-h-dark growth room. Inflorescences were collected for total RNA extraction with Trizol (Ambion). Total RNA was treated with a Ribo Zero kit (Epicenter) to obtain rRNA-depleted RNA, then incubated with or without RNase R (Epicenter) and subjected to phenol:chloroform purification. Purified RNAs were used for library preparation with the Illumina TruSeq Stranded Total RNA High Throughput Sample Prep Kit (P/N15031048), and libraries were sequenced with the Illumina HiSeq 2500 sequencer at Genergy. Two replicates were performed for each sample.
RNA Gel Blotting
Total RNA was extracted from inflorescences using Trizol reagent (Invitrogen). Twenty micrograms of total RNA was loaded on 1.2% (w/v) denatured agarose gels or 5% urea-PAGE gels and transferred to a nylon membrane. [32P]α-UTP-labeled antisense RNAs as probes or linear standards were transcribed in vitro using T7 RNA polymerase. Hybridization was performed using hybridization buffer (Ambion), and the signals were detected using Typhoon FLA9500 (GE Healthcare). Primers used for in vitro transcription are listed in Supplemental Table.
Validation of Lariat RNAs by RT-PCR Followed by Sanger Sequencing
Lariat RNAs across the BP were detected by RT-PCR as described (Suzuki et al., 2006). Total RNA treated with RNase R was used as the template. cDNA synthesis was performed using SuperScript III (Invitrogen) with random hexamers. Reaction mixtures were incubated at 30°C for 10 min, at 42°C for 120 min, at 50°C for 30 min, at 60°C for 30 min, and at 99°C for 5 min. Then lariat RNAs were obtained by PCR and purified by gel purification for Sanger sequencing to identify the BP. Primer sequences used are listed in Supplemental Table.
Computational Analysis of the RNA-Seq Profiles
The RNA-seq libraries were mapped to the genome of Arabidopsis (version TAIR10) using Cufflinks v2.2.1 (Trapnell et al., 2010). Cuffquant and Cuffnorm were used to quantify and normalize the FPKM values of the genes, respectively. Correlation coefficients of gene expression levels were calculated for Col-0 and dbr1-2 with and without RNase R digestion. Normalized FPKM values of genes in the Col-0 and dbr1-2 samples without RNase R treatments were compared to find deregulated genes with edgeR (Robinson et al., 2010). Genes with average abundances of at least 5 FPKM in either dbr1-2 or Col-0 and multiple test corrected P < 0.05 were designated as deregulated genes. Genes with abundances of at least 10 FPKM in at least one of the eight samples and sd of at least 1 were used for further analyses. The normalized FPKM values plus one were log-scaled to calculate the correlation coefficient values between samples. The correlation coefficient values were applied to the pheatmap function in the pheatmap library in R to perform hierarchical clustering. These filtered genes were also used to perform PCA. Log-scaled normalized FPKM values plus one were applied to the prcomp function in the psych library in R to perform PCA.
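The gene filtering and log-scaled correlation steps described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study; the function names, the use of log base 2, and the choice to compute the standard deviation on untransformed FPKM values are assumptions, since the text does not specify them.

```python
import math
from statistics import stdev


def filter_genes(fpkm, min_fpkm=10.0, min_sd=1.0):
    """Keep genes with >= min_fpkm in at least one sample and sd >= min_sd.

    `fpkm` maps a gene ID to its list of normalized FPKM values, one per sample.
    Assumption: the sd is computed on untransformed FPKM values.
    """
    return {
        gene: values
        for gene, values in fpkm.items()
        if max(values) >= min_fpkm and stdev(values) >= min_sd
    }


def log_scale(values):
    """Log-scale FPKM values after adding one (log base 2 is an assumption)."""
    return [math.log2(v + 1.0) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_correlation(fpkm, i, j):
    """Correlation between samples i and j over the filtered genes."""
    kept = filter_genes(fpkm)
    genes = sorted(kept)
    xi = log_scale([kept[g][i] for g in genes])
    xj = log_scale([kept[g][j] for g in genes])
    return pearson(xi, xj)
```

The resulting pairwise coefficients form the matrix that would then be passed to a clustering routine.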
Estimation of the Expression Levels of Lariat RNAs
The bedtools genomecov command of bedtools (Quinlan and Hall, 2010) was used to calculate the genome coverage of RNA-seq libraries. A custom program was used to calculate FPKMs of introns of annotated genes in TAIR10, using the genome coverage results of RNA-seq libraries. To compare global changes of intron expression, the average intron expression levels were calculated for Col-0 and dbr1-2 with and without RNase R treatments. Those intronic transcripts with expression levels of FPKM ≥ 5 from the Col-0 samples with RNase R treatments were defined as lariat RNAs in wild-type plants. The differences of intron expression levels for Col-0 and dbr1-2, with and without treatments, were evaluated with the Mann-Whitney U test. The correlation coefficients of intron expression levels were calculated for the two samples of Col-0 and dbr1-2 with and without RNase R treatments. To find deregulated introns in dbr1-2, introns with at least 5 FPKM in dbr1-2 were kept. Then, the expression levels of introns in dbr1-2 and Col-0 with RNase R treatments were used to find deregulated introns using edgeR (Robinson et al., 2010). The introns with false discovery rate values smaller than 0.05 and log-scaled fold change larger than 1 were deemed higher expressed introns in dbr1-2. The introns were selected using the same criteria as genes, then used to perform hierarchical clustering and PCA using the same methods as the genes.
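The thresholds in this section can be illustrated with a short Python sketch. It is not the custom program mentioned above: the standard FPKM definition is applied to fragment counts rather than to genome coverage, the fold change is assumed to be on a log2 scale, and all function names are hypothetical. The fold changes and false discovery rates are taken as precomputed inputs (for example, from edgeR).

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Standard FPKM: fragments per kilobase of feature per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def call_lariats(intron_fpkm, threshold=5.0):
    """Introns with FPKM >= threshold in RNase R-treated Col-0 are called lariat RNAs."""
    return sorted(i for i, value in intron_fpkm.items() if value >= threshold)


def higher_in_mutant(results, max_fdr=0.05, min_log_fc=1.0):
    """Select introns with FDR < max_fdr and log fold change > min_log_fc.

    `results` maps an intron ID to a (log_fc, fdr) pair; log2 is an assumption.
    """
    return sorted(
        i for i, (log_fc, fdr) in results.items()
        if fdr < max_fdr and log_fc > min_log_fc
    )
```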
Correlation between the Expression Levels of Introns and Their Parent Genes
The average FPKM values of introns and the average FPKM values of their host genes were used to calculate the correlation coefficient values in the four Col-0 and four dbr1-2 samples without RNase R treatment. If a gene had more than one intron, the intron closest to the transcription start site was kept. The average FPKM values of introns and average FPKM values of genes should both be larger than 1 or equal to 1. Only introns with at least 200 bp were used for analysis.
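A minimal Python sketch of the pairing rules above is given here. The `Intron` record, its `rank` field (1 for the intron closest to the transcription start site), and the order in which the filters are applied (length filter before choosing the intron closest to the transcription start site) are assumptions made for illustration; the text does not state the order.

```python
from collections import namedtuple

# Hypothetical record: rank 1 is the intron closest to the transcription start site.
Intron = namedtuple("Intron", "gene rank length_bp fpkm")


def pair_introns_with_genes(introns, gene_fpkm, min_len=200, min_fpkm=1.0):
    """Return (gene, intron FPKM, gene FPKM) triples used for the correlation."""
    best = {}
    for intron in introns:
        if intron.length_bp < min_len:
            continue  # only introns of at least 200 bp are considered
        current = best.get(intron.gene)
        if current is None or intron.rank < current.rank:
            best[intron.gene] = intron  # keep the intron closest to the TSS
    pairs = []
    for gene in sorted(best):
        intron = best[gene]
        gene_value = gene_fpkm.get(gene, 0.0)
        # both average FPKM values must be at least 1
        if intron.fpkm >= min_fpkm and gene_value >= min_fpkm:
            pairs.append((gene, intron.fpkm, gene_value))
    return pairs
```

The two FPKM columns of the returned triples are the inputs to the correlation coefficient.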
The Computational Pipeline for Identifying BPs
Reverse aligned reads to 2′-5′-phosphodiester site regions were identified with a customized computational pipeline (Figure 2A). First, a database of all introns in Arabidopsis (for all annotated genes in the TAIR10 database) was generated using a self-written program. Second, RNA-seq profiles were aligned to the genome using TopHat2 for self-generated data sets (Kim et al., 2013) or HISAT2 for published data sets (Kim et al., 2015), by specifying the unmapped reads using the option “--un-conc”. For TopHat2, reads that could not be mapped to the genome were retrieved with bamToFastq in bedtools (Quinlan and Hall, 2010). Then, the unmapped reads were aligned to introns of TAIR10 annotated genes with BLASTN for self-generated RNA-seq data sets, using the options “-S 1 -e 1e-20”, or with Bowtie 2 (Langmead and Salzberg, 2012) for published RNA-seq data sets, using the options “--local -q --norc --no-unal -p 32 -a --no-hd --no-sq”. Finally, a self-written program was used to check whether the remaining regions of the partially matched reads could also be aligned to the same introns. We selected reads that spanned the 5′ss and the potential BP, requiring that each of the two matched segments in a matched read had at least 6 nt. The BP is then the last nucleotide of the matched segment near the 3′ end of the intron. The branch positions that were detected in at least one of the selected RNA-seq libraries were retained for counting the different branch nucleotides of lariat RNAs.
To identify BPs in rice (Oryza sativa), tomato (Solanum lycopersicum), and maize (Zea mays), the Michigan State University Rice Genome Annotation for Oryza sativa Nipponbare (release 7), ITAG3.20 annotation for the tomato genome (Tomato Genome Consortium), and annotation of maize cv B73 (version 4), respectively, were used to retrieve the intron sequences. Selected RNA-seq profiles of rice, tomato, and maize (as listed in Supplemental Data Set 2) were used to identify BPs in each species using the same method as for Arabidopsis. The regions from 10 bp upstream to 10 bp downstream of the detected BP were used to analyze nucleotide composition.
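The final step of the pipeline, locating the BP from a read that traverses the 2′-5′ junction, can be sketched in Python as follows. This is a simplified illustration and not the self-written program used in the study: it requires exact matches (real reads often carry mismatches at the branch nucleotide), works on a single read and a single intron sequence, and takes the upstream match closest to the 3′ end when several exist. The function name and these choices are assumptions.

```python
def find_branchpoint(read, intron, min_seg=6):
    """Return the 0-based BP index within `intron`, or None if the read does not fit.

    A lariat junction read is modeled as: (intron sequence ending at the BP)
    followed by (the first nucleotides of the intron, starting at the 5' splice site).
    Both segments must be at least `min_seg` nt long.
    """
    for k in range(min_seg, len(read) - min_seg + 1):
        upstream, downstream = read[:k], read[k:]
        # The 3' part of the read must match the very start of the intron (5'ss).
        if not intron.startswith(downstream):
            continue
        # The 5' part must match elsewhere in the intron; prefer the occurrence
        # closest to the 3' end of the intron (an assumption).
        pos = intron.rfind(upstream)
        if pos != -1:
            return pos + k - 1  # last nucleotide of the upstream segment
    return None


# Toy intron (hypothetical sequence) with a branch adenine at index 21.
toy_intron = "GTAAGTTCGATTCCGGTACTAACATTTTGCAG"
toy_read = toy_intron[14:22] + toy_intron[0:7]
bp_index = find_branchpoint(toy_read, toy_intron)
```

In practice, the same check would be applied to every partially aligned read, and the supporting reads would be counted per BP position.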
Identifying Tissue-Specific BPs in Arabidopsis and Rice
The numbers of supporting reads for identified BPs in Arabidopsis and rice were grouped into different tissues (Supplemental Data Set 4). The five tissues with the largest numbers of supporting reads of BPs were used to identify tissue-specific BPs using the multinomial proportion test (Pineda and Bradley, 2018). The obtained P values were corrected using the method proposed by Benjamini and Hochberg (1995). BPs with multiple-test-corrected P < 0.05 were deemed tissue-specific BPs.
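The Benjamini and Hochberg correction applied to the P values can be written in a few lines. The Python sketch below is a generic implementation of the step-up procedure, not the code used in the study, and the function name is illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

BPs whose adjusted value falls below 0.05 would then be reported as tissue-specific.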
Calculating the Splicing Efficiency
The bedtools genomecov command of bedtools (Quinlan and Hall, 2010) was used to calculate the genome coverage of the RNA-seq libraries. The maximal number of reads that cover the +10-bp regions of exon-to-intron sites, EI, was calculated. The number of junction reads, JR, was reported by TopHat in the Cufflinks pipeline. The splicing efficiency of a gene was calculated as the log2 value of EI/JR as proposed by Bitton et al. (2014).
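As a worked illustration of the formula, the splicing efficiency is the base-2 logarithm of the ratio EI/JR. The Python sketch below assumes that both counts are already available; returning `None` when either count is zero is an assumption, since the text does not say how such cases were handled.

```python
import math


def splicing_efficiency(ei, jr):
    """log2(EI / JR), where EI is the maximal exon-to-intron read coverage
    and JR is the number of junction reads.

    Returns None when either count is not positive (handling is an assumption).
    """
    if ei <= 0 or jr <= 0:
        return None
    return math.log2(ei / jr)
```

For example, 4 exon-to-intron reads against 16 junction reads give log2(4/16) = −2.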
Analysis of TEs in Introns
RepeatMasker (version open-4.0.6; Tempel, 2012) was used to analyze TEs in all introns longer than 100 bp, 10,580 introns with lariat accumulation (≥5 FPKM) in Col-0, and 6720 introns with higher expression in dbr1-2. The RepeatMasker edition of RepBase (Bao et al., 2015) was used in RepeatMasker.
Accession Numbers
Sequence data from this article can be found with the accession numbers listed in Supplemental Data Set 2. The RNA-seq data are deposited in the National Center for Biotechnology Information Gene Expression Omnibus database with series accession number GSE117416.
Supplemental Data
Supplemental Figure 1. Schematic view of the experimental design.
Supplemental Figure 2. Correlation of gene and intron expression levels for each group between two biological replicates.
Supplemental Figure 3. Gene expression patterns of the used samples and validation of selected BPs in Arabidopsis.
Supplemental Figure 4. Examples of introns using guanines as their BPs in plants.
Supplemental Figure 5. Fidelity of BPs during library construction of RNA-seq samples.
Supplemental Figure 6. Validation of two BPs in At4g39260.1 I1.
Supplemental Figure 7. Validation of the seven BPs of At3g23590.1 I7.
Supplemental Figure 8. Examples of self-circularized introns in tomato and rice.
Supplemental Figure 9. Length distribution of self-circularized introns in plants.
Supplemental Figure 10. Other characteristics of lariat RNAs in Arabidopsis.
Supplemental Figure 11. Back-splicing and expression of parent genes with lariat RNA accumulation.
Supplemental Table. Primers used in the study.
Supplemental Data Set 1. Transcriptome analysis of Col-0 and dbr1-2.
Supplemental Data Set 2. List of RNA-seq data sets used in this study.
Supplemental Data Set 3. BPs identified in four plant species.
Supplemental Data Set 4. Tissue-specific BPs identified in Arabidopsis and rice.
Supplemental Data Set 5. Self-circularized introns identified in four plant species.
Supplemental Data Set 6. Introns with retroelements in Arabidopsis.
AUTHOR CONTRIBUTIONS
Y.Zhe. and B.Z. conceived and designed the research; X.Z. performed most bioinformatic analyses; Y.Zha., T.W., and Z.L. performed biological experiments, including preparing samples for RNA-seq and validation of lariat RNAs; H.G., Q.T., and J.C. provided technical help and critical comments on this project; Y.Zhe. designed and implemented the computational methods; K.C., L.L., C.L., and J.G. helped to analyze RNA-seq data; Y.Zhe. and B.Z. wrote the article.
Acknowledgments
We thank Sheila McCormick for editing. This work was supported by the National Natural Science Foundation of China (grants 31830045, 31671261, and 31470281 to B.Z. and grant 31460295 to Y.Zhe.).
References
- Armakola M., Higgins M.J., Figley M.D., Barmada S.J., Scarborough E.A., Diaz Z., Fang X., Shorter J., Krogan N.J., Finkbeiner S., Farese R.V. Jr., Gitler A.D. (2012). Inhibition of RNA lariat debranching enzyme suppresses TDP-43 toxicity in ALS disease models. Nat. Genet. 44: 1302–1309. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bao W., Kojima K.K., Kohany O. (2015). Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6: 11.
- Barrett S.P., Wang P.L., Salzman J. (2015). Circular RNA biogenesis can proceed through an exon-containing lariat precursor. eLife 4: e07540.
- Benjamini Y., Hochberg Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Series B Stat. Methodol. 57: 289–300.
- Bitton D.A., Rallis C., Jeffares D.C., Smith G.C., Chen Y.Y., Codlin S., Marguerat S., Bähler J. (2014). LaSSO, a strategy for genome-wide mapping of intronic lariats and branch points using RNA-seq. Genome Res. 24: 1169–1179.
- Chen S., Anderson K., Moore M.J. (2000). Evidence for a linear search in bimolecular 3′ splice site AG selection. Proc. Natl. Acad. Sci. USA 97: 593–598.
- Chua K., Reed R. (2001). An upstream AG determines whether a downstream AG is selected during catalytic step II of splicing. Mol. Cell. Biol. 21: 1509–1514.
- Conn S.J., Pillman K.A., Toubia J., Conn V.M., Salmanidis M., Phillips C.A., Roslan S., Schreiber A.W., Gregory P.A., Goodall G.J. (2015). The RNA binding protein quaking regulates formation of circRNAs. Cell 160: 1125–1134.
- Dumesic P.A., Natarajan P., Chen C., Drinnenberg I.A., Schiller B.J., Thompson J., Moresco J.J., Yates J.R. III, Bartel D.P., Madhani H.D. (2013). Stalled spliceosomes are a signal for RNAi-mediated genome defense. Cell 152: 957–968.
- Galvis A.E., Fisher H.E., Nitta T., Fan H., Camerini D. (2014). Impairment of HIV-1 cDNA synthesis by DBR1 knockdown. J. Virol. 88: 7054–7069.
- Galvis A.E., Fisher H.E., Fan H., Camerini D. (2017). Conformational changes in the 5′ end of the HIV-1 genome dependent on the debranching enzyme DBR1 during early stages of infection. J. Virol. 91: e01377.
- Gardner E.J., Nizami Z.F., Talbot C.C. Jr., Gall J.G. (2012). Stable intronic sequence RNA (sisRNA), a new class of noncoding RNA from the oocyte nucleus of Xenopus tropicalis. Genes Dev. 26: 2550–2559.
- Gooding C., Clark F., Wollerton M.C., Grellscheid S.N., Groom H., Smith C.W. (2006). A class of human exons with predicted distant branch points revealed by analysis of AG dinucleotide exclusion zones. Genome Biol. 7: R1.
- Ivanov A., Memczak S., Wyler E., Torti F., Porath H.T., Orejuela M.R., Piechotta M., Levanon E.Y., Landthaler M., Dieterich C., Rajewsky N. (2015). Analysis of intron sequences reveals hallmarks of circular RNA biogenesis in animals. Cell Rep. 10: 170–177.
- Jacquier A., Rosbash M. (1986). RNA splicing and intron turnover are greatly diminished by a mutant yeast branch point. Proc. Natl. Acad. Sci. USA 83: 5835–5839.
- Karst S.M., Rütz M.L., Menees T.M. (2000). The yeast retrotransposons Ty1 and Ty3 require the RNA Lariat debranching enzyme, Dbr1p, for efficient accumulation of reverse transcripts. Biochem. Biophys. Res. Commun. 268: 112–117.
- Kim D., Pertea G., Trapnell C., Pimentel H., Kelley R., Salzberg S.L. (2013). TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14: R36.
- Kim D., Langmead B., Salzberg S.L. (2015). HISAT: A fast spliced aligner with low memory requirements. Nat. Methods 12: 357–360.
- Kim H.C., Kim G.M., Yang J.M., Ki J.W. (2001). Cloning, expression, and complementation test of the RNA lariat debranching enzyme cDNA from mouse. Mol. Cells 11: 198–203.
- Kim J.W., Kim H.C., Kim G.M., Yang J.M., Boeke J.D., Nam K. (2000). Human RNA lariat debranching enzyme cDNA complements the phenotypes of Saccharomyces cerevisiae dbr1 and Schizosaccharomyces pombe dbr1 mutants. Nucleic Acids Res. 28: 3666–3673.
- Kramer M.C., Liang D., Tatomer D.C., Gold B., March Z.M., Cherry S., Wilusz J.E. (2015). Combinatorial control of Drosophila circular RNA expression by intronic repeats, hnRNPs, and SR proteins. Genes Dev. 29: 2168–2182.
- Langmead B., Salzberg S.L. (2012). Fast gapped-read alignment with Bowtie 2. Nat. Methods 9: 357–359.
- Li Z., Wang S., Cheng J., Su C., Zhong S., Liu Q., Fang Y., Yu Y., Lv H., Zheng Y., Zheng B. (2016). Intron lariat RNA inhibits microRNA biogenesis by sequestering the dicing complex in Arabidopsis. PLoS Genet. 12: e1006422.
- Liang D., Wilusz J.E. (2014). Short intronic repeat sequences facilitate circular RNA production. Genes Dev. 28: 2233–2247.
- Mercer T.R., Clark M.B., Andersen S.B., Brunck M.E., Haerty W., Crawford J., Taft R.J., Nielsen L.K., Dinger M.E., Mattick J.S. (2015). Genome-wide discovery of human splicing branchpoints. Genome Res. 25: 290–303.
- Meyer M., Plass M., Pérez-Valle J., Eyras E., Vilardell J. (2011). Deciphering 3'ss selection in the yeast genome reveals an RNA thermosensor that mediates alternative splicing. Mol. Cell 43: 1033–1039.
- Morgan J.T., Fink G.R., Bartel D.P. (2019). Excised linear introns regulate growth in yeast. Nature 565: 606–611.
- Nam K., Lee G., Trambley J., Devine S.E., Boeke J.D. (1997). Severe growth defect in a Schizosaccharomyces pombe mutant defective in intron lariat degradation. Mol. Cell. Biol. 17: 809–818.
- Okamura K., Hagen J.W., Duan H., Tyler D.M., Lai E.C. (2007). The mirtron pathway generates microRNA-class regulatory RNAs in Drosophila. Cell 130: 89–100.
- Ooi S.L., Samarsky D.A., Fournier M.J., Boeke J.D. (1998). Intronic snoRNA biosynthesis in Saccharomyces cerevisiae depends on the lariat-debranching enzyme: Intron length effects and activity of a precursor snoRNA. RNA 4: 1096–1110.
- Parenteau J., Maignon L., Berthoumieux M., Catala M., Gagnon V., Abou Elela S. (2019). Introns are mediators of cell response to starvation. Nature 565: 612–617.
- Pastuszak A.W., Joachimiak M.P., Blanchette M., Rio D.C., Brenner S.E., Frankel A.D. (2011). An SF1 affinity model to identify branch point sequences in human introns. Nucleic Acids Res. 39: 2344–2356.
- Pineda J.M.B., Bradley R.K. (2018). Most human introns are recognized via multiple and tissue-specific branchpoints. Genes Dev. 32: 577–591.
- Quinlan A.R., Hall I.M. (2010). BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841–842.
- Robinson M.D., McCarthy D.J., Smyth G.K. (2010). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26: 139–140.
- Ruby J.G., Jan C.H., Bartel D.P. (2007). Intronic microRNA precursors that bypass Drosha processing. Nature 448: 83–86.
- Ruskin B., Green M.R. (1985). An RNA processing activity that debranches RNA lariats. Science 229: 135–140.
- Ruskin B., Krainer A.R., Maniatis T., Green M.R. (1984). Excision of an intact intron as a novel lariat structure during pre-mRNA splicing in vitro. Cell 38: 317–331.
- Smith C.W., Porro E.B., Patton J.G., Nadal-Ginard B. (1989). Scanning from an independently specified branch point defines the 3′ splice site of mammalian introns. Nature 342: 243–247.
- Suzuki H., Zuo Y., Wang J., Zhang M.Q., Malhotra A., Mayeda A. (2006). Characterization of RNase R-digested cellular RNA source that consists of lariat and circular RNAs from pre-mRNA splicing. Nucleic Acids Res. 34: e63.
- Taggart A.J., DeSimone A.M., Shih J.S., Filloux M.E., Fairbrother W.G. (2012). Large-scale mapping of branchpoints in human pre-mRNA transcripts in vivo. Nat. Struct. Mol. Biol. 19: 719–721.
- Taggart A.J., Lin C.L., Shrestha B., Heintzelman C., Kim S., Fairbrother W.G. (2017). Large-scale analysis of branchpoint usage across species and cell lines. Genome Res. 27: 639–649.
- Talhouarne G.J., Gall J.G. (2014). Lariat intronic RNAs in the cytoplasm of Xenopus tropicalis oocytes. RNA 20: 1476–1487.
- Talhouarne G.J.S., Gall J.G. (2018). Lariat intronic RNAs in the cytoplasm of vertebrate cells. Proc. Natl. Acad. Sci. USA 115: E7970–E7977.
- Tay M.L., Pek J.W. (2017). Maternally inherited stable intronic sequence RNA triggers a self-reinforcing feedback loop during development. Curr. Biol. 27: 1062–1067.
- Tempel S. (2012). Using and understanding RepeatMasker. Methods Mol. Biol. 859: 29–51.
- Trapnell C., Williams B.A., Pertea G., Mortazavi A., Kwan G., van Baren M.J., Salzberg S.L., Wold B.J., Pachter L. (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28: 511–515.
- Wang H., Hill K., Perry S.E. (2004). An Arabidopsis RNA lariat debranching enzyme is essential for embryogenesis. J. Biol. Chem. 279: 1468–1473.
- Ye Y., De Leon J., Yokoyama N., Naidu Y., Camerini D. (2005). DBR1 siRNA inhibition of HIV-1 replication. Retrovirology 2: 63.
- Zhang S.Y., et al. (2018). Inborn errors of RNA lariat metabolism in humans with brainstem viral infection. Cell 172: 952–965.e918.
- Zhang X.O., Wang H.B., Zhang Y., Lu X., Chen L.L., Yang L. (2014). Complementary sequence-mediated exon circularization. Cell 159: 134–147.
- Zhang Y., Zhang X.O., Chen T., Xiang J.F., Yin Q.F., Xing Y.H., Zhu S., Yang L., Chen L.L. (2013). Circular intronic long noncoding RNAs. Mol. Cell 51: 792–806.
- Zheng S., Vuong B.Q., Vaidyanathan B., Lin J.Y., Huang F.T., Chaudhuri J. (2015). Non-coding RNA generated following lariat debranching mediates targeting of AID to DNA. Cell 161: 762–773.








