Author manuscript; available in PMC: 2024 Dec 14.
Published in final edited form as: J Genet Genomics. 2024 Aug 31;51(11):1338–1341. doi: 10.1016/j.jgg.2024.08.007

COL: a method for identifying putatively functional circular RNAs

Zheng Li 1, Bandhan Sarker 2, Fengyu Zhao 3, Tianjiao Zhou 4, Jianzhi Zhang 5,*, Chuan Xu 6,*
PMCID: PMC11645182  NIHMSID: NIHMS2039744  PMID: 39218058

Circular RNAs (circRNAs) are a class of endogenous, single-stranded, covalently closed, mostly non-coding RNAs that are produced by back-splicing, which links a downstream splice-donor site with an upstream splice-acceptor site (Chen, 2016; Kristensen et al., 2019). CircRNAs have been reported to have important regulatory functions, such as acting as miRNA sponges (Hansen et al., 2013; Memczak et al., 2013) or protein sponges (Ashwal-Fluss et al., 2014) to regulate gene expression, acting as scaffolds to mediate the formation of complexes (Du et al., 2016), and being translated into small functional peptides (Pamudurti et al., 2017). Nevertheless, of the millions of known circRNAs, only a tiny fraction have demonstrated functions.

Identifying functional circRNAs is challenging because nonfunctional circRNAs are orders of magnitude more numerous. A frequently used approach is the differential expression method (DEM) (Reiner et al., 2003; Huang et al., 2016), but it needs at least two groups of samples (e.g., patients vs. controls) and therefore cannot be applied to a single sample or to non-comparable samples. Another commonly used approach may be named the expression level method (ELM) (Memczak et al., 2013; Huang et al., 2019), which treats highly expressed circRNAs as putatively functional. While functional circRNAs are likely to be highly expressed, highly expressed circRNAs may not be functional because their high expression may simply arise from the high expression of their parental genes. The conservation of circRNAs across species (CM) is another useful criterion for identifying functional circRNAs, because random splicing errors are unlikely to be conserved across species (Rybak-Wolf et al., 2015; Suenkel et al., 2020). However, CM requires not only sequenced genomes but also circRNA data from multiple species. Recently, a high-throughput experimental screening method based on CRISPR-Cas13 was developed (Li et al., 2021). This experimental method improves the discovery of functional circRNAs, but it is complicated and costly and cannot be used in species without a CRISPR-Cas13 screening system. Therefore, there is a need for a simple, inexpensive, and widely applicable method for identifying putatively functional circRNAs.

In this work, we propose and validate such a method. We hypothesize that, if a circRNA is functional, the back-splicing that generates the circRNA should be considered functional back-splicing. Therefore, to identify functional circRNAs is to discover functional back-splicing. Functional back-splicing is expected to have the following three features. First, the error hypothesis of the origin of circRNAs asserts that most back-splicing results from deleterious splicing error, predicting a negative correlation between the amount of splicing at a splicing junction and the back-splicing rate at the junction, which was indeed observed (Xu and Zhang, 2021). Functional back-splicing is expected to break this rule and show a higher back-splicing rate than predicted by the splicing amount. Second, functional back-splicing is expected to have conserved splicing motifs. Third, functional back-splicing should have a relatively high back-splicing level to allow the production of sufficient circRNA molecules.

To empirically verify the above suggested characteristics of functional back-splicing, we should ideally compare functional back-splicing with the rest of back-splicing. However, because only a few instances of back-splicing are demonstrably functional, we instead compared human back-splicing that is shared with macaque and mouse with the rest of human back-splicing, because shared back-splicing is more likely than unshared back-splicing to be functional. We combined 11 human tissues to represent human and analyzed all the back-splicing from a recently published dataset (Ji et al., 2019). Consistent with our previous finding (Xu and Zhang, 2021), a significant negative correlation exists between the back-splicing rate and the total splicing amount across splicing junctions (Fig. S1A). To identify back-splicing violating this general rule, we regressed the back-splicing rate against the splicing amount (Fig. S1A) and calculated Cook’s distance for each back-splicing event. A back-splicing event is considered an outlier if its Cook’s distance exceeds four times the mean Cook’s distance of all back-splicing events. We observed 3,587 outliers above and 236 outliers below the regression line (Fig. S1A). The error hypothesis suggests that the outliers above the regression line are beneficial (i.e., functional) whereas those below the regression line are particularly deleterious (Xu and Zhang, 2021). Indeed, shared back-splicing contains significantly more outliers above the regression line than does unshared back-splicing (P < 10⁻¹⁶, Fisher’s exact test; Fig. 1A). By contrast, shared back-splicing contains fewer outliers below the regression line than does unshared back-splicing (Fig. S1B). Both linear-splicing and back-splicing use GU/AG splicing motifs. We found that the splicing motifs of 97.4% of shared back-splicing are conserved across human, macaque, and mouse, significantly higher than the corresponding value (64.2%) for unshared back-splicing (P < 10⁻¹⁶, Fisher’s exact test; Fig. 1B). We then measured the relative back-splicing level at a splicing junction by the number of corresponding back-splicing reads per million total back-splicing reads in the sample (BSRPM) (Liu et al., 2019; Okholm et al., 2020). Indeed, BSRPM is higher for shared than unshared back-splicing (P < 10⁻¹⁶, Mann–Whitney U test; Fig. 1C). Therefore, all three hypothesized features of functional back-splicing are verified in the shared back-splicing of the human tissues combined.
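The outlier rule described above (Cook’s distance greater than four times the mean, split by the side of the regression line) can be sketched as follows. This is a minimal illustration for a simple least-squares fit, not the code of the COL R package; the function names, the scale of the variables, and the toy data are all assumptions made for the example.

```python
from statistics import mean

def cooks_distances(x, y):
    """Cook's distance and residual for each point of the fit y = a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    # Leverage of each point in simple linear regression.
    leverages = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    distances = [
        (e * e / (p * mse)) * (h / (1.0 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]
    return distances, residuals

def flag_outliers(x, y, multiplier=4.0):
    """Indices of outliers above and below the regression line."""
    distances, residuals = cooks_distances(x, y)
    cutoff = multiplier * mean(distances)
    above = [i for i, (d, e) in enumerate(zip(distances, residuals))
             if d > cutoff and e > 0]
    below = [i for i, (d, e) in enumerate(zip(distances, residuals))
             if d > cutoff and e < 0]
    return above, below

# Hypothetical toy data: a negative trend with small alternating noise,
# plus one junction whose back-splicing rate is far higher than predicted.
x = [float(i) for i in range(1, 21)]            # splicing amount (toy scale)
y = [-0.5 * xi + (0.1 if i % 2 else -0.1) for i, xi in enumerate(x)]
y[9] += 5.0                                     # planted outlier above the line
above, below = flag_outliers(x, y)
print(above, below)
```

In this toy example only the planted point is flagged, and it falls above the line, which is the category the error hypothesis treats as putatively functional.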

Fig. 1.

The development and evaluation of COL. (A) Outliers above the regression line are enriched in shared back-splicing relative to unshared back-splicing. (B) Shared back-splicing has a higher fraction of conserved splicing motifs than does unshared back-splicing. (C) The back-splicing level is higher for shared than unshared back-splicing. (D) The predictive power of each feature. The dotted line represents the baseline (i.e., the fraction of shared back-splicing events in all back-splicing events). (E) The predictive power of combined features through Venn diagram integration. P-value is from a chi-squared test. (F) Pairwise comparison between RRA ranking and four other rankings. Each dot represents a back-splicing event. Dots above and below the diagonal are respectively colored in red and blue, with their numbers and fractions indicated. P-values are from binomial tests. (G) Fraction of back-splicing events identified by COL (under different cutoffs of Cook’s distance) that are shared. (H) Fraction of back-splicing events identified by COL (using motifs with different levels of conservation) that are shared. P-value is from a chi-squared test. (I) Fraction of back-splicing events in different top proportions of candidates identified by COL or ELM that are shared. (J) Fraction of experimentally validated back-splicing events in the top 1% of the back-splicing candidates identified by COL. (K) Fraction of back-splicing events identified by COL or ELM that are shared. P-value is from a chi-squared test. (L) Ranks of shared back-splicing events. Shared back-splicing events identified by COL are ranked by RRA whereas those identified by ELM are ranked by the back-splicing level. Dots above and below the diagonal are respectively colored in red and blue, with their numbers and fractions indicated. The binomial P-value is presented. (M) Fraction of shared back-splicing events with smaller ranks in COL than in ELM in various top proportions of candidates identified by both methods.

To examine the utility of the above three features for predicting shared (as a proxy for functional) back-splicing, we measured the level of enrichment of shared back-splicing in the candidate back-splicing identified using these features. In human, 2,318 (2.8%) of 83,458 back-splicing events are shared among human, macaque, and mouse. Thus, 2.8% is the baseline in our enrichment calculation. Of the 3,587 outliers above the regression line in Fig. S1A, 723 (or 20.2%) belong to shared back-splicing, a 7.2-fold enrichment relative to the baseline (P < 10⁻¹⁶, Fisher’s exact test; Fig. 1D). Similarly, 4.2% of back-splicing with conserved splicing motifs belong to shared back-splicing, a 1.5-fold enrichment relative to the baseline (P < 10⁻¹⁶, Fisher’s exact test; Fig. 1D). To compare the utility of back-splicing levels with that of regression outliers, we examined the top 3,587 back-splicing events in terms of their back-splicing levels and found that 22.5% of them belong to shared back-splicing, an 8.0-fold enrichment over the baseline (P < 10⁻¹⁶, Fisher’s exact test; Fig. 1D). We next examined whether a combination of the three features has an enhanced power in predicting shared back-splicing. We first used a Venn diagram to integrate the back-splicing events identified by each feature. We found that any combination of two features has a higher predictive power than any single feature (maximum P < 0.06, chi-squared test; Fig. 1E). The combination of all three features has the highest power: 31.4% of the 1,960 back-splicing events identified are shared (Fig. 1E). To prioritize these candidates, we tested a few ordering methods such as Average, Stuart, and Robust Rank Aggregation (RRA) that can integrate different ranking results (Kolde et al., 2012). Because the feature of splicing motif conservation is binary (i.e., conserved or not), we only considered Cook’s distance and back-splicing level in the ordering. In the pairwise comparison between any two ranking methods, we found the RRA ranking method to assign smaller rankings to a greater number of shared back-splicing events than any other ranking method examined (Fig. 1F and S2). Therefore, based on the above results from the analysis of shared back-splicing as a proxy for functional back-splicing, we developed an integrative method to predict functional back-splicing (Fig. S3). Because this method integrates conservation of splicing motifs, outliers in the regression analysis, and high back-splicing level, we refer to it as the COL method hereinafter. To facilitate the utilization of COL, we have developed an R package, which is available at https://github.com/XuLabSJTU/COL.
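To make the ordering step concrete, the sketch below computes an RRA-style score from two rankings (Cook’s distance and back-splicing level), following our reading of the order-statistic score of Kolde et al. (2012). It is an illustration rather than the COL package’s implementation: the function names and toy values are hypothetical, and any subsequent multiple-testing correction of the score is omitted because it is not needed to order the candidates.

```python
from math import comb

def order_stat_pvalue(x, k, n):
    """P(the k-th smallest of n independent Uniform(0,1) values is <= x)."""
    return sum(comb(n, l) * x**l * (1 - x) ** (n - l) for l in range(k, n + 1))

def rho_score(normalized_ranks):
    """Smallest order-statistic p-value across the sorted normalized ranks."""
    r = sorted(normalized_ranks)
    n = len(r)
    return min(order_stat_pvalue(r[k - 1], k, n) for k in range(1, n + 1))

def rank_descending(values):
    """Rank 1 goes to the largest value (assumes no ties in this toy example)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

def rra_order(cooks, levels):
    """Order events by rho score; a smaller score means a higher priority."""
    n_events = len(cooks)
    rank_cook = rank_descending(cooks)
    rank_level = rank_descending(levels)
    scores = [
        rho_score([rank_cook[i] / n_events, rank_level[i] / n_events])
        for i in range(n_events)
    ]
    return sorted(range(n_events), key=lambda i: scores[i]), scores

# Hypothetical candidates: Cook's distance and back-splicing level (BSRPM).
cooks = [0.9, 0.1, 0.5, 0.3]
levels = [50.0, 5.0, 40.0, 80.0]
order, scores = rra_order(cooks, levels)
print(order)
```

Event 0, which ranks near the top of both lists, is placed first, while event 1, which is last in both, is placed last.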

While COL as developed above can enrich shared back-splicing, how its predictive power varies with the threshold of each feature considered is unclear. Therefore, we evaluated COL’s performance under different thresholds of the three features. As shown in Fig. 1E, the predictive power of COL (i.e., the fraction of identified back-splicing that is shared among human, macaque, and mouse) is 31.4% under the thresholds of (i) four times the mean Cook’s distance for the outliers, (ii) motif conservation between human and mouse, and (iii) a minimum back-splicing level determined according to the first two criteria. The rationale behind the use of the three features for predicting functional back-splicing (Fig. 1A–C) implies that applying stricter thresholds of these features will likely improve the predictive power of COL. When we set 4× to 20× the mean Cook’s distance as the cutoff, we indeed observed that the predictive power of COL rose with the threshold (Fig. 1G). Furthermore, we predict that considering motifs conserved between human and mouse, which diverged ~90 mya (Hedges et al., 2006), should make COL perform better than considering motifs conserved between human and macaque, which diverged 23 mya (Disotell and Tosi, 2007). This prediction is indeed correct (Fig. 1H). In addition, considering motifs shared by all three species in COL outperforms considering those shared by only two species (Fig. 1H). Because the back-splicing level cutoff is determined by the other two features in COL, we did not examine its effect separately. Together, the above analyses suggest that increasing the thresholds of outlier and motif conservation can improve the predictive power of COL.
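Predictive power as used here is simply the shared fraction among the selected events, and the fold enrichments reported earlier are that fraction divided by the baseline. A short worked example using the counts stated in the text (2,318 shared of 83,458 events; 723 shared of 3,587 outliers above the regression line) is given below; the function names are ours and only illustrative.

```python
def shared_fraction(n_shared, n_selected):
    """Fraction of selected back-splicing events that are shared."""
    return n_shared / n_selected

def fold_enrichment(fraction, baseline):
    """Enrichment of a selected set relative to the baseline fraction."""
    return fraction / baseline

# Counts reported in the text.
baseline = shared_fraction(2318, 83458)          # all human back-splicing events
outlier_fraction = shared_fraction(723, 3587)    # outliers above the regression line
fold = fold_enrichment(outlier_fraction, baseline)

print(f"baseline {baseline:.1%}, outliers {outlier_fraction:.1%}, fold {fold:.2f}")
```

Computed from the unrounded counts, the enrichment comes out slightly above the quoted 7.2-fold, which appears to correspond to dividing the rounded percentages (20.2% by 2.8%).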

To assess the performance of COL in predicting functional back-splicing, we selected different proportions (from top 100% to top 1%) of the ranked candidates and calculated the fraction of shared back-splicing. We found the fraction of shared back-splicing rises with the rank priority of candidates (Fig. 1I). Remarkably, this fraction reaches 84.2% (i.e., 16 of 19 candidates) for the top 1% of candidates. Additionally, we searched the literature for experimental evidence for the functionality of the top 1% of the back-splicing events identified by COL. We found such experimental evidence for 16 of the top 19 candidate back-splicing events (Fig. 1J; Data S1). We next compared COL with ELM, which is the only existing method that does not require more information than what COL uses. Because the back-splicing level is one of the three considerations of COL, COL should outperform ELM. We indeed found that the fraction of shared back-splicing events identified by COL is significantly higher than that identified by ELM (P < 0.04, chi-squared test; Fig. 1K). This trend remains regardless of the specific top proportion of candidates considered (Fig. 1I). In particular, for the top 1% of the candidates, the fraction of shared back-splicing identified by COL is 45.4% higher than that identified by ELM (0.842 vs. 0.579) (Fig. 1I). To further compare COL with ELM, we examined the rankings of the shared back-splicing candidates identified by the two methods (a smaller ranking indicates a higher probability of being functional). Of the 542 shared back-splicing events identified by both methods, 74% have smaller rankings in COL than in ELM, significantly more than the random expectation of 50% (P < 10⁻¹⁶, binomial test; Fig. 1L). Qualitatively similar results were observed in different top proportions of candidates (Fig. 1M), suggesting that COL performs better than ELM in ranking the putatively functional circRNAs identified. To test the robustness of COL’s performance, we further applied COL to other datasets and the results remained qualitatively unchanged (see the Supplementary Discussion; Fig. S4 and S5).
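The binomial comparison of rankings can be checked with a short exact calculation. The sketch below computes a one-sided tail probability under the 50% expectation; the count of 401 events is an assumption inferred from the rounded 74% of 542 and is not a number given in the text, and whether the original test was one- or two-sided is not stated.

```python
from math import comb

def binomial_upper_tail_half(successes, trials):
    """Exact P(X >= successes) for X ~ Binomial(trials, 0.5).

    Integer arithmetic is used for the sum, so no precision is lost
    before the final division.
    """
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favourable / 2**trials

n_shared_both = 542                          # shared events identified by both methods
n_smaller_in_col = round(0.74 * n_shared_both)   # assumed count, about 401
p_value = binomial_upper_tail_half(n_smaller_in_col, n_shared_both)
print(n_smaller_in_col, p_value < 1e-16)
```

Under this assumed count the tail probability falls well below 10⁻¹⁶, in line with the reported result; doubling it for a two-sided test would not change that conclusion.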

In summary, we hypothesize that functional back-splicing events that generate functional circRNAs (i) exhibit substantially higher back-splicing rates than expected from the total splicing amounts, (ii) have conserved splicing motifs, and (iii) show unusually high back-splicing levels. We confirm these features in back-splicing shared among human, macaque, and mouse, which should enrich functional back-splicing. Integrating the three features, we design a computational method named COL for identifying putatively functional back-splicing. Different from the methods that require multiple samples, COL can predict functional back-splicing using a single sample. Under the same data requirement, COL has a lower false positive rate than that of the commonly used method that is based on the back-splicing level alone. We conclude that COL is an efficient and versatile method for rapid identification of putatively functional back-splicing and circRNAs that can be experimentally validated.

Supplementary Material

Supplementary materials
Data S1

Acknowledgments

We thank Dr. Wei Xue for his help in processing the data from Li et al. (2021). This study was supported by the National Natural Science Foundation of China (32270704, 32100518 and 32472630), National Science and Technology Innovation 2030 Major Projects for ‘Brain Science and Brain-Inspired Research’ (2022ZD0214400), the Medical-Engineering Crossover Fund of Shanghai Jiao Tong University (YG2022QN084), and the U.S. National Institutes of Health (R35GM139484 to J.Z.).

Footnotes

Conflict of interest

The authors declare that they have no conflict of interest.

Contributor Information

Zheng Li, Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China.

Bandhan Sarker, Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China.

Fengyu Zhao, Department of Statistics, George Washington University, Washington, District of Columbia, DC 20052, USA.

Tianjiao Zhou, Department of Otorhinolaryngology Head and Neck Surgery, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, 200233, China.

Jianzhi Zhang, Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA.

Chuan Xu, Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200240, China.

References

  1. Ashwal-Fluss R, Meyer M, Pamudurti NR, Ivanov A, Bartok O, Hanan M, Evantal N, Memczak S, Rajewsky N, Kadener S, 2014. circRNA biogenesis competes with pre-mRNA splicing. Mol Cell 56, 55–66.
  2. Chen LL, 2016. The biogenesis and emerging roles of circular RNAs. Nat Rev Mol Cell Bio 17, 205–211.
  3. Disotell TR, Tosi AJ, 2007. The monkey’s perspective. Genome Biol 8, 226.
  4. Du WW, Yang W, Liu E, Yang Z, Dhaliwal P, Yang BB, 2016. Foxo3 circular RNA retards cell cycle progression via forming ternary complexes with p21 and CDK2. Nucleic Acids Res 44, 2846–2858.
  5. Hansen TB, Jensen TI, Clausen BH, Bramsen JB, Finsen B, Damgaard CK, Kjems J, 2013. Natural RNA circles function as efficient microRNA sponges. Nature 495, 384–388.
  6. Hedges SB, Dudley J, Kumar S, 2006. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics 22, 2971–2972.
  7. Huang MG, Zhong ZY, Lv MX, Shu J, Tian Q, Chen JX, 2016. Comprehensive analysis of differentially expressed profiles of lncRNAs and circRNAs with associated co-expression and ceRNA networks in bladder carcinoma. Oncotarget 7, 47186–47200.
  8. Huang X, He M, Huang S, Lin RR, Zhan M, Yang D, Shen H, Xu SW, Cheng W, Yu JX, et al., 2019. Circular RNA circERBB2 promotes gallbladder cancer progression by regulating PA2G4-dependent rDNA transcription. Mol Cancer 18.
  9. Ji P, Wu W, Chen S, Zheng Y, Zhou L, Zhang J, Cheng H, Yan J, Zhang S, Yang P, et al., 2019. Expanded expression landscape and prioritization of circular RNAs in mammals. Cell Rep 26, 3444–3460 e3445.
  10. Kolde R, Laur S, Adler P, Vilo J, 2012. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics 28, 573–580.
  11. Kristensen LS, Andersen MS, Stagsted LVW, Ebbesen KK, Hansen TB, Kjems J, 2019. The biogenesis, biology and characterization of circular RNAs. Nat Rev Genet 20, 675–691.
  12. Li S, Li X, Xue W, Zhang L, Yang LZ, Cao SM, Lei YN, Liu CX, Guo SK, Shan L, et al., 2021. Screening for functional circular RNAs using the CRISPR-Cas13 system. Nat Methods 18, 51–59.
  13. Liu Z, Ran Y, Tao C, Li S, Chen J, Yang E, 2019. Detection of circular RNA expression and related quantitative trait loci in the human dorsolateral prefrontal cortex. Genome Biol 20, 99.
  14. Memczak S, Jens M, Elefsinioti A, Torti F, Krueger J, Rybak A, Maier L, Mackowiak SD, Gregersen LH, Munschauer M, et al., 2013. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature 495, 333–338.
  15. Okholm TLH, Sathe S, Park SS, Kamstrup AB, Rasmussen AM, Shankar A, Chua ZM, Fristrup N, Nielsen MM, Vang S, et al., 2020. Transcriptome-wide profiles of circular RNA and RNA-binding protein interactions reveal effects on circular RNA biogenesis and cancer pathway expression. Genome Med 12, 112.
  16. Pamudurti NR, Bartok O, Jens M, Ashwal-Fluss R, Stottmeister C, Ruhe L, Hanan M, Wyler E, Perez-Hernandez D, Ramberger E, et al., 2017. Translation of circRNAs. Mol Cell 66, 9–21 e27.
  17. Reiner A, Yekutieli D, Benjamini Y, 2003. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19, 368–375.
  18. Rybak-Wolf A, Stottmeister C, Glazar P, Jens M, Pino N, Giusti S, Hanan M, Behm M, Bartok O, Ashwal-Fluss R, et al., 2015. Circular RNAs in the mammalian brain are highly abundant, conserved, and dynamically expressed. Mol Cell 58, 870–885.
  19. Suenkel C, Cavalli D, Massalini S, Calegari F, Rajewsky N, 2020. A highly conserved circular RNA is required to keep neural cells in a progenitor state in the mammalian brain. Cell Rep 30, 2170–2179.
  20. Xu C, Zhang J, 2021. Mammalian circular RNAs result largely from splicing errors. Cell Rep 36, 109439.
