ABSTRACT
Virus circular RNAs (circRNAs) have been reported to be extensively expressed and to play important roles in viral infections. We previously built the first database of virus circRNAs, named VirusCircBase, which has been widely used in the field. This study significantly improved the database in both data quantity and functionality: the numbers of virus circRNAs, virus species and host organisms increased from 46440, 23 and 9 to 60859, 43 and 22, respectively, and 1902 full-length virus circRNAs were newly added; new functions were added, such as visualization of the expression level of virus circRNAs and visualization of virus circRNAs in the Genome Browser. Analysis of the expression of virus circRNAs showed that they had low expression levels in most cells or tissues and showed strong expression heterogeneity. Analysis of the splicing of virus circRNAs showed that they used a much higher proportion of non-canonical back-splicing signals than circRNAs in animals and plants, and mainly used the A5SS (alternative 5’ splice site) type in alternative splicing. Most virus circRNAs had no more than two isoforms. Finally, human genes associated with virus circRNA production were investigated, and more than 1000 human genes exhibited moderate correlations with the expression of virus circRNAs. Most of them showed negative correlations, including 42 genes encoding RNA-binding proteins. They were significantly enriched in biological processes related to cell cycle and RNA processing. Overall, the study provides a valuable resource for further studies of virus circRNAs and new insights into their biogenesis mechanisms.
KEYWORDS: Virus, circular RNA, biogenesis mechanisms, database, next-generation-sequencing, third-generation-sequencing
Introduction
Circular RNAs (circRNAs) are covalently closed single-stranded transcripts that are widely expressed in eukaryotes, prokaryotes, archaea [1] and viruses [2]. The discovery of circRNAs can be traced back to 1976, when they were initially identified in plant viroids [3]. Later, Ashwal-Fluss et al. found that circRNAs were generated through co-transcriptional back-splicing, competing with canonical splicing and resulting in a reduction in mRNA synthesis at the same loci [4]. Unlike mRNAs, circRNAs in eukaryotes are primarily produced through a unique mechanism called back-splicing, in which a downstream 5′ splice (donor) site is covalently linked with an upstream 3′ splice (acceptor) site [5], distinguishing it from canonical splicing, while in prokaryotes and archaea, circRNAs have been reported to be formed from introns by an RNA endonuclease and ligase [6]. The closed circular structure grants resistance to exonucleases and enhances stability compared to linear RNAs [7]. There are three models for circRNA biogenesis: RNA binding protein (RBP)-mediated, intron pairing-driven and lariat-driven circularization [8]. CircRNAs have been reported to play extensive roles in the cell [9] and to be associated with many diseases such as cancers [10–12].
The rapid development of next-generation-sequencing (NGS) technology has greatly facilitated the identification and characterization of circRNAs [13]. Multiple methods such as CIRI2 [14], circRNA_finder [15], find_circ [16], CIRCexplorer [17] and Mapsplice [18] have been developed to identify circRNAs from NGS data based on the back-splicing signal of circRNAs. Numerous circRNAs have been identified and several databases for circRNAs have emerged, including circBase [19], one of the earliest circRNA databases, containing more than 200,000 circRNAs from human, Mus musculus, C. elegans and Latimeria menadoensis; CircAtlas [20], a high-precision circRNA database containing more than three million circRNAs identified from 10 vertebrates; circMine [21], a comprehensive database of circRNA transcriptomes associated with human diseases; and PlantcircBase [22], which includes 171,118 circRNAs from 21 plant species.
Recovery of the full-length sequence of circRNAs can facilitate further studies of the functions [23]. The short read lengths in NGS data hinder the recovery of full-length sequences of circRNAs, although computational methods such as CIRI-full [24] have been developed to recover the full-length circRNA from the NGS data. Compared to the NGS technology, the third-generation-sequencing (TGS) technology has the advantage of generating long reads [25], which can help the identification of full-length circRNAs. Several computational methods have been developed to recover full-length circRNAs from the TGS data such as CIRI-long [26] and CYCLER [27]. Zhang et al. [26] systematically described the profile of full-length circRNAs in the adult mouse brain and revealed the complexity and splicing events of circRNAs by the CIRI-long method.
Virus circRNAs have long been overlooked due to the small size of virus genomes compared to eukaryotic genomes [28]. In recent years, the characterization of eukaryotic circRNAs [29] has prompted a renewed focus on viruses. Viral circRNAs have been reported to play important roles in regulating host gene expression [30], interacting with cellular components [31] and influencing host immune responses [32]. For instance, Sekiba et al. reported that a hepatitis B virus (HBV)-derived circRNA interacted with the human DExH-Box helicase 9 (DHX9) [33], an RNA helicase involved in innate immune responses against DNA viruses [32]; Liu et al. found that a Human gammaherpesvirus 4 (HHV4)-encoded circRNA named circRPMS1 could function as a miRNA sponge and facilitate the development of HHV4-associated epithelial tumours [31]. To facilitate the study of viral circRNAs, we built the first virus circRNA database, named VirusCircBase [2], which has been widely used in the community [34–38]. This study significantly improved the database in both data quantity and functionality and re-constructed it as VirusCircBase 2.0 (available at http://computationalbiology.cn/VirusCircBase2/#/home). Based on the data in VirusCircBase 2.0, we investigated the back-splicing signals and alternative splicing types of virus circRNAs, and further identified human genes associated with virus circRNA production. This study may provide new insights into the biogenesis mechanism of virus circRNAs.
Materials and methods
Overview of the study
Figure 1 shows the workflow of the study which included three sections. The first section is Data collection during which the viral infection-related RNA-Seq, Oxford Nanopore and PacBio SMRT datasets were obtained from the NCBI SRA and GEO databases. The second section is Data analysis, during which viral circRNAs were identified by five software tools, including CIRI2 [14], find_circ [16], circRNA_finder [15], CIRI-full [24] and CIRI-long [26]. The third section is Database construction, during which the VirusCircBase was significantly enhanced and re-constructed as VirusCircBase 2.0.
Figure 1.
The workflow of the study. It contains three sections: Data collection, Data analysis and Database construction. Please see the main text for details about these sections.
Data collection
The viral infection-related RNA-Seq datasets were collected from the NCBI SRA and GEO databases in March 2022. They were filtered according to our previous study [2]. The datasets which have been analysed in our previous study [2] were excluded. For the datasets for which no virus genome was specified, VIGA, a software tool for virus identification and reference-based genome assembly (available at https://github.com/viralInformatics/VIGA) [39], was used to select the best reference genome from the NCBI RefSeq database. The viral infection-related TGS datasets were obtained by searching for “virus[All Fields] not ‘amplicon’[All Fields]” in the NCBI SRA and GEO databases in November 2022. They were filtered by the library construction method, with only rRNA-depleted and RNase R-treated data being retained. All datasets used in the study were provided in Table S1.
Virus circRNA identification
Three de novo computational methods, including CIRI2, circRNA_finder, and find_circ, were used to identify virus circRNAs from the viral infection-related RNA-Seq datasets according to our previous study [2]. CIRI-full (version 2.0, parameter: “-l reads length”) [24] was used to identify full-length circRNAs from paired-end RNA-Seq data. All methods mentioned above rely on GT/AG signals for accurately detecting back-splicing junctions from the NGS data [14–16,24]. However, the short length of reads in NGS data and the fixed GT/AG splicing signal used by these methods may hinder the identification of full-length circRNAs. Thus, the TGS data generated by both Oxford Nanopore and PacBio SMRT technologies were further used for identifying full-length viral circRNAs. They were first pre-processed and filtered: for the Oxford Nanopore data, adapter trimming was performed using Porechop (version 0.2.4, parameter: “--discard_middle”, https://github.com/rrwick/Porechop); for the PacBio SMRT data, the quality of subreads in the fastq format was checked using FastQC (version 0.11.8, https://github.com/s-andrews/FastQC) [40] and low-quality subreads were filtered using fastp (version 0.20.0, parameter: “-q 7”) [41]. Then, CIRI-long (version v1.1.0, default parameter settings) [26] was used to identify full-length viral circRNAs and collapse isoforms from the TGS data. As viruses infect a wide range of hosts which may use diverse splicing signals [42], all splicing signals provided in CIRI-long were used, including the donor/acceptor sequences GT/AG, GC/AG, AT/AC, GT/AC, AT/AG, GT/TG, GA/AG, GG/AG, GT/GG, GT/AT and GT/AA. In addition, the AT/TA signal, which was reported in the alternative splicing of the virus Human orthopneumovirus (RSV) [43], was also used in the analysis. The alternative splicing types of viral circRNAs were identified using CIRI-AS [44].
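The set of accepted donor/acceptor signals listed above can be illustrated with a small lookup. The following is a minimal sketch, not the CIRI-long implementation; the function name `classify_signal` and the three output labels are hypothetical and introduced only for illustration.

```python
# Donor/acceptor dinucleotide pairs named in the text
# (the CIRI-long signals plus the AT/TA signal reported for RSV).
ALLOWED_SIGNALS = {
    "GT/AG", "GC/AG", "AT/AC", "GT/AC", "AT/AG", "GT/TG",
    "GA/AG", "GG/AG", "GT/GG", "GT/AT", "GT/AA", "AT/TA",
}
CANONICAL = "GT/AG"


def classify_signal(donor: str, acceptor: str) -> str:
    """Label a dinucleotide pair as 'canonical', 'non-canonical' or 'unsupported'.

    The labels are illustrative; 'unsupported' means the pair is not in the
    list of signals considered in the analysis.
    """
    signal = f"{donor.upper()}/{acceptor.upper()}"
    if signal == CANONICAL:
        return "canonical"
    if signal in ALLOWED_SIGNALS:
        return "non-canonical"
    return "unsupported"
```

For example, `classify_signal("GC", "AG")` falls into the non-canonical class, while a pair outside the list is reported as unsupported.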
Quantification of viral circRNAs
RPM (Reads Per Million) was used to quantify viral circRNAs according to Zhang’s study [17]. It was calculated as follows:
$$\mathrm{RPM} = \frac{N_{\mathrm{BSJ}}}{M} \times 10^{6}$$
where $N_{\mathrm{BSJ}}$ is the number of back-splicing junction reads for the viral circRNA and $M$ is the total number of raw reads in the sample. The abundance of viral circRNAs identified by different software tools was observed to vary (Figure S1A). Thus, the abundances of viral circRNAs identified by each tool (except CIRI-full) were normalized by multiplying by a factor, calculated as the median abundance of virus circRNAs identified by CIRI-full divided by that identified by the tool:
$$F_i = \frac{\mathrm{median}\left(\mathrm{RPM}_{\text{CIRI-full}}\right)}{\mathrm{median}\left(\mathrm{RPM}_i\right)}$$
where $i$ refers to one of the tools CIRI-long, CIRI2, circRNA_finder and find_circ.
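The quantification and between-tool normalization described in this section can be sketched as follows. This is an illustrative sketch only; the function names `rpm`, `tool_factor` and `normalize` are hypothetical, and the input values in the usage note are invented.

```python
from statistics import median


def rpm(n_bsj: int, total_raw_reads: int) -> float:
    """Reads Per Million: back-splicing junction reads scaled by library size."""
    return n_bsj * 1_000_000 / total_raw_reads


def tool_factor(ciri_full_abundances, tool_abundances) -> float:
    """Scaling factor for one tool: median CIRI-full abundance over the tool's median."""
    return median(ciri_full_abundances) / median(tool_abundances)


def normalize(abundances, factor: float):
    """Multiply every abundance reported by a tool by that tool's factor."""
    return [a * factor for a in abundances]
```

With hypothetical abundances `[2.0, 4.0, 6.0]` for CIRI-full and `[1.0, 2.0, 3.0]` for another tool, `tool_factor` gives 2.0, so the second tool's values are doubled and its median matches that of CIRI-full.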
Identification of human genes associated with the viral circRNA production
Since most virus circRNAs were identified in human cells or tissues, we focused on the identification of human genes associated with the production of viral circRNAs. Because the read alignment strategy was different for NGS and TGS data [25] and considering that most virus circRNAs were identified from NGS data, we only focused on the high-confidence viral circRNAs identified from the NGS data, which were defined as having five or more back-splicing junction reads and having a length of greater than 200 bp [2], resulting in a total of 10,681 high-confidence viral circRNAs.
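The high-confidence criteria stated above (five or more back-splicing junction reads and a length greater than 200 bp) amount to a simple filter. The sketch below assumes a hypothetical `CircRNA` record; the field names are not taken from the database schema.

```python
from dataclasses import dataclass


@dataclass
class CircRNA:
    """Hypothetical record for one NGS-derived viral circRNA."""
    circ_id: str
    bsj_reads: int   # number of back-splicing junction reads
    length_bp: int   # circRNA length in base pairs


def is_high_confidence(c: CircRNA, min_bsj_reads: int = 5, min_length_bp: int = 200) -> bool:
    """Five or more BSJ reads (inclusive) and length strictly greater than 200 bp."""
    return c.bsj_reads >= min_bsj_reads and c.length_bp > min_length_bp


def filter_high_confidence(circs):
    """Keep only the records that satisfy both thresholds."""
    return [c for c in circs if is_high_confidence(c)]
```

Note the asymmetry in the thresholds: the read-count cutoff is inclusive, whereas the length cutoff is strict, following the wording in the text.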
The workflow for identifying human genes associated with viral circRNA production is shown in Figure S2. To quantify the expression level of high-confidence virus circRNAs in a cell or tissue, the abundance of high-confidence virus circRNAs in each sample of the given cell or tissue was first summed and normalized by the length of the virus genome (kb); then, the median abundance across the samples of the same cell or tissue was taken as the expression level of virus circRNAs in that cell or tissue.
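The two-step aggregation described above (sum per sample, divide by genome length, then take the median across samples) can be sketched as below. The record layout `(tissue, sample, abundance)` and the function name are assumptions made for illustration; samples without any detected circRNA would need to be added explicitly with zero abundance.

```python
from collections import defaultdict
from statistics import median


def tissue_expression(records, genome_length_kb: float) -> dict:
    """Median per-sample circRNA abundance (per kb of viral genome) for each tissue.

    records: iterable of (tissue, sample, abundance) tuples, one per
    high-confidence circRNA detected in a sample.
    """
    per_sample = defaultdict(float)
    for tissue, sample, abundance in records:
        per_sample[(tissue, sample)] += abundance  # step 1: sum within sample

    per_tissue = defaultdict(list)
    for (tissue, _sample), total in per_sample.items():
        per_tissue[tissue].append(total / genome_length_kb)  # step 2: per-kb scaling

    # step 3: median across samples of the same cell or tissue
    return {tissue: median(values) for tissue, values in per_tissue.items()}
```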
To quantify the expression level of genes in a cell or tissue, for each sample of the given cell or tissue, the clean reads were first aligned to the human reference genome hg38 (downloaded from Ensembl on May 8th, 2022) with STAR (version 2.7.1a, parameter: “--readFilesCommand zcat --limitBAMsortRAM 1020169226 --twopassMode Basic --outSAMtype BAM SortedByCoordinate”) [45]; then, featureCounts (version 2.0.1, parameter: “-t exon -g gene_id -Q 10 --primary -p”) [46] was used to generate counts of reads uniquely mapped to genes based on the annotation file provided in Ensembl (release 107, https://ftp.ensembl.org/pub/release-107/gtf/homo_sapiens/); finally, the Z-score method was used to normalize the read counts of each gene in a sample. The median of the normalized read counts of each gene across the samples of the same cell or tissue was taken as the expression level of the gene in that cell or tissue.
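The normalization step can be sketched as follows. The text does not specify the exact Z-score variant, so this sketch assumes that counts are standardized across genes within each sample using the population standard deviation; both the assumption and the function names are illustrative.

```python
from statistics import mean, median, pstdev


def zscore_sample(counts: dict) -> dict:
    """Standardize the read counts of all genes within one sample (assumed variant)."""
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # All genes have the same count; define every Z-score as zero.
        return {gene: 0.0 for gene in counts}
    return {gene: (count - mu) / sigma for gene, count in counts.items()}


def gene_expression(samples: list) -> dict:
    """Median Z-score of each gene across samples of the same cell or tissue.

    samples: list of {gene: read_count} dictionaries, one per sample.
    """
    normalized = [zscore_sample(s) for s in samples]
    genes = set().union(*normalized)
    return {g: median(z[g] for z in normalized if g in z) for g in genes}
```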
A total of 31 human cells or tissues which included 197 samples were used to analyse the correlation between the human gene expression and the expression level of virus circRNAs. The Spearman Correlation Coefficient (SCC) was used to assess the extent of correlations.
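For readers unfamiliar with the Spearman Correlation Coefficient, it is the Pearson correlation computed on ranks, with tied values receiving their average rank. The study used `stats.spearmanr` in Python; the dependency-free sketch below only illustrates the computation, and the helper names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation coefficient between two equally long sequences."""
    return pearson(average_ranks(x), average_ranks(y))
```

In the setting described above, `x` would hold the expression level of one human gene across the 31 cells or tissues and `y` the corresponding virus circRNA expression levels.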
Functional enrichment analysis of human genes
The KEGG pathway and Gene Ontology (GO) enrichment analyses were conducted with the functions enrichKEGG and enrichGO, respectively, in the package clusterProfiler (version 4.6.2) [47]. All KEGG pathways and GO terms with FDR-adjusted p-values less than 0.05 were considered significantly enriched.
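The text does not name the FDR procedure; the sketch below assumes the Benjamini-Hochberg method, which is a common choice for this purpose. It is an illustration of the adjustment and the 0.05 cutoff, not the clusterProfiler code, and the p-values in the usage note are invented.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_terms(p_values, alpha: float = 0.05):
    """Indices of terms whose adjusted p-value is below alpha."""
    adjusted = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(adjusted) if q < alpha]
```

For the hypothetical raw p-values `[0.01, 0.04, 0.03, 0.5]`, only the first term remains below 0.05 after adjustment, although three of the four raw values are below 0.05.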
Statistical analysis
All statistical analyses were conducted in Python (version 3.8) and R (version 4.2.2). The SCCs and relevant p-values were calculated using the function of stats.spearmanr in Python. The Wilcoxon rank-sum test was conducted using the function of wilcox.test in R. A p-value less than 0.05 was considered statistically significant.
Results
Overall comparison of the updated and original VirusCircBase
VirusCircBase 2.0 was updated in terms of data quantity and database functionality compared to its previous version (Table 1). In terms of data, the number of virus circRNAs increased from 46440 to 60859 in VirusCircBase 2.0. Specifically, the numbers of virus species, host organisms, tissue or cell types, RNA-Seq biosamples used in the identification of virus circRNAs, and interactions between virus circRNAs and host miRNAs increased from 23, 9, 46, 337 and 30818 to 43, 22, 83, 566 and 72158, respectively. In addition, 1902 full-length virus circRNAs were newly added in the updated database. They were identified from either 21 TGS datasets or 90 NGS datasets. In terms of database functionality, three novel functions were provided in the updated VirusCircBase: (i) visualization of the expression level of virus circRNAs in samples in the form of heatmaps; (ii) visualization of virus circRNAs on viral genomes in the form of a Genome Browser; (iii) browsing of virus circRNAs by host (Figure 1).
Table 1.
Comparison between VirusCircBase 1.0 and VirusCircBase 2.0.
| Item | VirusCircBase 1.0 | VirusCircBase 2.0 |
|---|---|---|
| Statistics | | |
| Number of identified viral circRNAs | 46,440 | 60,859 |
| Number of host organisms | 9 | 22 |
| Number of tissue/cell types | 46 | 83 |
| Number of virus species | 23 | 43 |
| Number of RNA-seq data bioprojects | 48 | 90 |
| Number of RNA-seq data biosamples | 337 | 566 |
| Number of PacBio & Nanopore data bioprojects | / | 21 |
| Number of PacBio & Nanopore data biosamples | / | 195 |
| Number of full-length viral circRNAs | / | 1,902 |
| Function annotation | | |
| Number of interactions between virus circRNAs and host miRNAs | 30,818 | 72,157 |
| Database | | |
| Web design | Limited functions | Well designed with integrated resources |
| Genome Browser | No | Yes |
| Expression | No | Yes |
| Web link | http://www.computationalbiology.cn/ViruscircBase/home.html | http://computationalbiology.cn/VirusCircBase2/#/ |
Data summary in VirusCircBase 2.0
The viral circRNAs in the updated VirusCircBase were detected by one or multiple of the following five methods: CIRI2, find_circ, circRNA_finder, CIRI-full and CIRI-long (Figure 1). Among them, 59643 viral circRNAs were detected by CIRI2, find_circ, and circRNA_finder from the NGS data; 1902 full-length viral circRNAs were detected by CIRI-full from the NGS data and by CIRI-long from the TGS data; 42 experimentally-validated viral circRNAs were collected from the literature (Table S3).
The viral circRNAs were detected from 43 viral species of 25 viral families across six Baltimore groups (Figure 2). Most viral circRNAs were identified from positive-sense single-stranded RNA (ssRNA(+)) viruses (67.3%) and double-stranded DNA (dsDNA) viruses (30.6%). Among ssRNA(+) viruses, the Middle East respiratory syndrome-related coronavirus (MERS-CoV), Severe acute respiratory syndrome-related coronavirus (SARS-CoV) and Sindbis virus (SINV) had the most circRNAs identified, with 28754, 9324 and 2065 circRNAs, respectively. In the dsDNA group, Human gammaherpesvirus 8 (HHV8), Human alphaherpesvirus 1 (HHV1) and Vaccinia virus (VACV) had the most circRNAs identified, with 5364, 3100 and 3017 circRNAs, respectively. When analysing virus circRNAs by viral family, we found that most virus circRNAs were encoded by viruses of the Coronaviridae (62.7%) and Herpesviridae (24.0%).
Figure 2.
The number of viral circRNAs detected in each virus in the VirusCircBase 2.0. The bar above the x-axis represents the number of circRNAs identified for each virus, with the red portion indicating the newly added viral circRNAs of VirusCircBase 2.0. The first bar below the x-axis represents the Baltimore group to which the virus belongs, while the second bar represents the host kingdom of the virus. For clarity, the abbreviation of virus names was used. Supplementary Table S2 displays the full names of these viruses. ssRNA(+), positive-sense single-stranded RNA virus; dsDNA, double-stranded DNA virus; ssRNA(-), negative-sense single-stranded RNA virus; dsDNA-RT, double-stranded DNA reverse transcribing virus; ssDNA, single-stranded DNA virus; dsRNA, double-stranded RNA virus.
The newly added virus circRNAs were encoded by 35 virus species, 20 of which were new to the database. SARS-CoV and HHV1 showed the largest increases in the number of circRNAs (from 4157 to 9324 and from 379 to 2721, respectively). Among the newly added virus species, VACV had the largest number of circRNAs. Interestingly, Mammalian orthoreovirus (MRV1), which encoded 125 circRNAs, was the only dsRNA virus reported so far to encode circRNAs. In addition, two phages, i.e. Pseudomonas virus PhiKZ (PhiKZ) and Escherichia virus P1 (PhageP1), were found to encode 249 and 178 circRNAs, respectively.
Length of virus circRNAs
The length distributions of full-length and other virus circRNAs were analysed and compared (Figure S3). Overall, both kinds of virus circRNAs had similar length distributions. Most viral circRNAs had lengths ranging from 200 to 1000 bp. However, compared to other virus circRNAs, the full-length virus circRNAs included a larger proportion of circRNAs longer than 2000 bp.
Low expression and expression heterogeneity of virus circRNAs
The expression levels of virus circRNAs identified by different software tools differed significantly, as the tools used different alignment methods (Figure S1). Thus, the abundances of virus circRNAs were first normalized to remove the bias introduced by different tools (see Materials and Methods and Figure S1). Then, the expression level of virus circRNAs in different cells or tissues was analysed and visualized (Figure 3). For most viruses, circRNA expression was detected in only a few cells or tissues due to limited data, and the expression levels in these cells or tissues were low. For viruses with circRNA expression in multiple cells or tissues, such as HHV1, HHV4, HHV8, the influenza A virus (IAV) and SARS-CoV, expression differed greatly between cells or tissues. For example, HHV8 circRNAs had low abundances in most cells (log10-transformed normalized RPM: 2.16–3.79), but high abundances in TREx-BCBL1 RTA and iSLK-BAC16 cells (log10-transformed normalized RPM of 4.59 and 4.97, respectively). Interestingly, some viruses such as the Mus musculus papillomavirus type 1 (MmuPV1) had high circRNA expression only in the tail tissue of Mus musculus (log10-transformed normalized RPM: 6.0), suggesting an important role of viral circRNAs in the chronic infection of MmuPV1 in the tail tissue.
Figure 3.
The expression of viral circRNAs across different tissues or cells in different hosts. The x-axis represents the abbreviations of different viral species, and the bar above represents the Baltimore classification of the virus. The y-axis represents the host tissues or cells, with the first bar on the left representing the host kingdom and the second bar representing the host species. Each coloured block represents the median expression of viral circRNAs in that tissue, with blue and red colours representing low and high expression, respectively, according to the figure legend.
Analysis of the mechanism of viral circRNA formation
The diversity of back-splicing signals of virus circRNAs was analysed based on the full-length circRNAs detected from the TGS data. As shown in Figure 4(A), 70% of viral circRNAs used the GT/AG splicing signal, which is also the predominant signal used in animals and plants. However, viral circRNAs exhibited a higher proportion of non-canonical splicing signals compared to animals and plants, in which non-canonical splicing signals accounted for less than 1% [48,49]. For example, the GC/AG signal was the most abundant non-canonical one for viral circRNAs, accounting for 7.1% of the total, which was much higher than that for mammals (0.7%) [13]. In addition, the U12-type (AT/AC, GT/AC and AT/AG) signals were used in only 0.3% of human introns [49] and 0.15% of Arabidopsis thaliana introns [48], while they were used in 10.7% of virus circRNAs (Figure 4(A)). Interestingly, when analysing the back-splicing signals of virus circRNAs by virus group (Baltimore group), although the predominant signal was still GT/AG, the most abundant non-canonical signals differed between virus groups (Figure S4A). For dsDNA viruses, GC/AG was the most abundant non-canonical signal, while for ssRNA(+) and ssRNA(-) viruses, AT/AC and AT/AG were the most abundant ones. Given the dramatic difference in the number of circRNAs encoded by different viruses, we investigated whether viruses with different circRNA-encoding abilities differed in their back-splicing signals. Viruses were classified into three groups (strong, middle and weak) according to the number of virus circRNAs they encoded (Figure S5A). The three groups differed markedly in their composition of non-canonical signals. For example, the top three non-canonical signals in the group with strong circRNA-encoding ability were GC/AG, AT/AC and AT/AG, while those in the group with weak circRNA-encoding ability were GA/AG, GG/AG and GT/GG.
Figure 4.
Analysis of the mechanism of viral circRNA formation. (A) Composition of different types of back-splicing signals of viral circRNAs. (B) Composition of different alternative splicing types of viral circRNAs. A5SS, Alternative 5’ splice site; ES, Exon skipping; A3SS, Alternative 3’ splice site; IR, Retained intron. (C) The distribution of the number of isoforms in virus circRNAs.
To investigate the alternative splicing pattern of viral circRNAs, the splice types of viral circRNAs were analysed based on the full-length circRNAs detected from the NGS data. The A5SS (alternative 5’ splice site) type was the most prevalent type in viral circRNAs (55.9%), followed by ES (exon skipping) (22.0%) and A3SS (alternative 3’ splice site) (15.7%) (Figure 4(B)). This differed from the pattern observed in humans, in which ES was the most common splicing type in circRNAs, followed by A3SS and A5SS [50]. When analysing the alternative splicing pattern of virus circRNAs by viral group (Baltimore group), dsDNA viruses showed a pattern similar to that observed using all data, while only the A5SS and A3SS types were observed in ssRNA(-) viruses (Figure S4B). When analysing the alternative splicing pattern of virus circRNAs by circRNA-encoding ability (Figure S5B), large differences were observed between the three groups. For example, the group with strong circRNA-encoding ability had three types, A3SS (47.1%), A5SS (41.2%) and ES (11.7%), while the group with middle circRNA-encoding ability had multiple types, with A5SS (57.9%) and ES (28.9%) being the top two.
The number of isoforms derived from virus circRNAs was further analysed based on full-length viral circRNAs. According to Ye’s study [51], any two partially overlapping circRNAs derived from the same locus were defined here as circRNA isoforms. Most viral circRNAs had only one or two isoforms (69.90%) (Figure 4(C)). Only a very small proportion of viral circRNAs had five or more isoforms. This differs markedly from the pattern observed in mammals [23] and plants [52], in which circRNAs generally have two or more isoforms.
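One possible reading of the isoform definition above is that circRNAs on the same viral genome belong to one isoform group when their genomic intervals overlap without being identical. The sketch below implements that reading as connected components of partial overlap; the interpretation, the inclusive `(start, end)` coordinates and the function names are assumptions, not the procedure used in the study.

```python
def partially_overlap(a, b) -> bool:
    """True if two inclusive (start, end) intervals share bases but are not identical."""
    return a != b and a[0] <= b[1] and b[0] <= a[1]


def isoform_groups(intervals):
    """Group circRNA intervals from one genome into sets linked by partial overlap."""
    groups = []
    for interval in sorted(set(intervals)):
        # Find every existing group that contains an interval overlapping this one.
        linked = [g for g in groups if any(partially_overlap(interval, m) for m in g)]
        merged = [interval]
        for g in linked:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return groups


def isoform_count_distribution(intervals):
    """Sorted list of group sizes, i.e. the number of isoforms per group."""
    return sorted(len(g) for g in isoform_groups(intervals))
```

For the hypothetical intervals `(100, 500)`, `(300, 800)` and `(900, 1200)`, the first two form a group of two isoforms and the third stands alone.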
Analysis of host genes affecting viral circRNA production
We then investigated host genes which may affect the production of virus circRNAs. We focused on human genes associated with virus circRNA production since most virus circRNAs were identified in human cells or tissues. The correlations between the expression levels of virus circRNAs and human genes were calculated (see Materials and Methods and Figure S2 for the workflow of the correlation analysis). A total of 411 and 755 human genes were found to have moderate positive and negative correlations, respectively (defined as an absolute SCC greater than or equal to 0.3) (Table S4), with viral circRNA production (Figure S6). Previous studies have shown that RBPs are frequently involved in circRNA production [53]. They can either promote or inhibit the biogenesis of virus circRNAs [54]. Interestingly, 43 RBP-encoding genes had moderate correlations with virus circRNA production and most of them (42/43) had significant negative correlations. For example, the LSM6 gene, which encodes an RBP and plays a role in the decapping and decay of mRNA, had an SCC of −0.47 with virus circRNA production (Figure S7); the RBM38 gene, which encodes the RNA binding motif protein 38 and was reported to negatively regulate virus circRNAs, had an SCC of −0.36 with virus circRNA production (Figure S7). When considering all RBP-encoding genes together, the RBP-encoding genes had stronger negative correlations with virus circRNA production than other human genes did (p-value < 2.2E-16 in the Wilcoxon rank-sum test) (Figure S8).
Functional analysis showed that the positively correlated genes were enriched in a limited set of functions (Figure S9), such as the biological processes of gland development and wound healing, and molecular functions of receptor activity and extracellular matrix. Thus, we focused on the functions of the 755 genes with negative correlations to viral circRNA production. A total of 85, 264 and 26 GO terms in the domains of Cellular Component (CC), Biological Process (BP) and Molecular Function (MF), respectively, and 14 KEGG pathways were enriched in these genes. The enriched cellular components were mainly related to spindle, chromosome regions, spliceosome complexes and ribosomes (Figure 5(A)); the enriched biological processes were mainly related to cell cycle and RNA processing (Figure 5(B)); the enriched molecular functions were mainly related to catalytic activity, nuclease activity, and tubulin and microtubule binding (Figure 5(C)); the enriched KEGG pathways were mainly related to cell cycle, ribosome, COVID-19, spliceosome, oocyte meiosis and RNA degradation (Figure 5(D)).
Figure 5.
The top 10 enriched GO terms in the domain of Cellular Component (A), Biological Process (B), Molecular Function (C) and the top 10 enriched KEGG pathways (D) for the human genes which had moderate negative correlations with virus circRNA production. The size of dots represented the number of genes in each domain, and the colour indicated the significance of enrichment (adjusted p-values less than 0.05).
Discussion
Virus circRNAs have been reported to play important roles in virus infections [30,31,55,56]. Our previous study built the first virus circRNA database, named VirusCircBase [2], which has been widely used in the field. This study significantly improved the database in both data quantity and functionality, resulting in its reconstruction as VirusCircBase 2.0, a valuable resource for further studies of virus circRNAs. Based on the updated database, the back-splicing signals and alternative splicing types of virus circRNAs were analysed, and human genes associated with the generation of virus circRNAs were identified and analysed, which provides new insights into the biogenesis mechanisms of virus circRNAs.
In eukaryotes, circRNAs are primarily produced through back-splicing whereby a downstream 5’ splice (donor) site is covalently linked with an upstream 3’ splice (acceptor) site [5]. Viruses rely on host cells for circRNA production and may have similar mechanisms to their hosts [57]. However, compared to host organisms, the biogenesis of viral circRNAs may have some distinct features and regulatory mechanisms. For example, the virus circRNAs predominantly used the GT/AG signal in back splicing which is similar to that observed in animals and plants, possibly due to the major reliance on the U2-dependent spliceosome. However, unlike animals and plants which use a very small proportion (<1%) of non-canonical splicing signals in circRNA generation, viral circRNAs showed a significantly higher frequency of such non-canonical splicing signals, which was similar to that reported in Chasseur’s study [56]. Besides, the predominance of the A5SS type in the alternative splicing of virus circRNAs suggests that the virus may produce diverse virus circRNAs by selecting different 5′ splice sites, which is different from that observed in animals [50] and plants [58]. These results suggest that the virus may employ diverse strategies in circRNA generation [56].
Previous studies have identified numerous human proteins, especially RBPs, involved in circRNA production [53]. However, it is still unknown which kinds of human proteins are involved in the generation of virus circRNAs. A previous study has shown that DHX9 can suppress the production of HBV-derived circRNAs [33]. This study, for the first time, systematically investigated human genes associated with virus circRNA production. Nearly twice as many human genes were found to have moderate negative correlations with the generation of virus circRNAs as moderate positive correlations. RBPs have been reported to be frequently involved in the formation of circRNAs in multiple organisms including humans [5]. They were also reported to promote or suppress the production of virus circRNAs. Interestingly, our study showed that the RBP-encoding genes had stronger negative correlations with virus circRNA generation than other genes, and most of them had negative correlations with virus circRNA production, suggesting that most human RBPs may inhibit the biogenesis of virus circRNAs. This was consistent with previous studies which reported several RBPs that can suppress or negatively regulate virus circRNA production, such as DHX9 [59], QKI [54], RBM38 [54] and NF90/NF110 [60]. The negatively correlated human genes were involved in biological processes related to cell cycle and RNA processing, suggesting that viral circRNA production may compete with host RNA processing, resulting in reduced mRNA synthesis at the same site and enhanced resistance to RNA degradation, thereby increasing viral circRNA production. This provides new insights into the biogenesis mechanism of virus circRNAs and potentially reveals new targets for therapeutic intervention in virus infections by interfering with virus circRNA production.
There are some limitations to the study. Firstly, although more than 40 virus species in the database have been found to encode circRNAs, this number is far from complete, as circRNAs may be common molecules among viruses. Secondly, the number of circRNAs identified in each virus may be seriously underestimated because of the large heterogeneity in the expression of virus circRNAs. Thirdly, the virus circRNAs in the database are biased towards some viruses such as coronaviruses and herpesviruses, as these viruses are the most heavily sampled in public databases. Thus, the database will be routinely updated to capture the diversity of virus circRNAs. Fourthly, the viral circRNAs were predicted using computational methods, and more experimental efforts are needed to validate the expression of these virus circRNAs. Fifthly, although many human genes were found to be associated with virus circRNA production, they may be direct or indirect targets of circRNAs, and it is still unclear which of these genes are associated with the production of a given virus circRNA. In addition, the mechanisms underlying such associations and the mechanism of virus circRNA biogenesis remain unanswered. Much effort is needed to clarify these mechanisms.
Nevertheless, the updated VirusCircBase can serve as a valuable resource for studying virus circRNAs. Further analysis of the back-splicing signals and alternative splicing types of virus circRNAs, together with the identification of human genes associated with virus circRNA production, provides new insights into the biogenesis mechanisms of virus circRNAs.
Key points
VirusCircBase 2.0 introduces an updated and enhanced database of virus circRNAs, improving data quantity and functionality.
VirusCircBase 2.0 includes 60,859 circRNAs from 43 viral species, providing a comprehensive resource for researchers.
Virus circRNAs exhibit low expression levels and strong expression heterogeneity, highlighting their unique regulatory properties.
Virus circRNAs demonstrate distinct and independent splicing features compared to those in animals and plants.
Human genes associated with virus circRNA formation are revealed, shedding light on the biogenesis mechanism of virus circRNAs.
Authors’ contributions
Ping Fu analysed the data, built the VirusCircBase 2.0 and wrote the draft; Zena Cai provided advice and guidance on the pipeline; Ping Fu, Zena Cai and Zhiyuan Zhang analysed the data; Xiangxian Meng contributed to the manuscript text; Yousong Peng designed and supervised the project, wrote and revised the manuscript.
Acknowledgements
We thank the members of PengLab for helpful discussions on the manuscript.
Funding Statement
This work was supported by the National Key Plan for Scientific Research and Development of China (2022YFC2303802) and National Natural Science Foundation of China (32170651 & 32370700).
Disclosure statement
No potential conflict of interest was reported by the author(s).
Data availability statement
All data used in the study are available in Supplementary materials and in VirusCircBase 2.0 which is accessible at http://computationalbiology.cn/VirusCircBase2/#/home.
References
1. Lai X, Bazin J, Webb S, et al. Advances in experimental medicine and biology. Adv Exp Med Biol. 2018;1087:329–343. doi: 10.1007/978-981-13-1426-1_26.
2. Cai Z, Fan Y, Zhang Z, et al. VirusCircBase: a database of virus circular RNAs. Brief Bioinform. 2021;22:2182–2190. doi: 10.1093/bib/bbaa052.
3. Sanger HL, Klotz G, Riesner D, et al. Viroids are single-stranded covalently closed circular RNA molecules existing as highly base-paired rod-like structures. Proc Natl Acad Sci USA. 1976;73:3852–3856. doi: 10.1073/pnas.73.11.3852.
4. Ashwal-Fluss R, Meyer M, Pamudurti NR, et al. circRNA biogenesis competes with pre-mRNA splicing. Mol Cell. 2014;56:55–66. doi: 10.1016/j.molcel.2014.08.019.
5. Quan G, Li J. Circular RNAs: biogenesis, expression and their potential roles in reproduction. J Ovarian Res. 2018;11:1–12. doi: 10.1186/s13048-018-0381-4.
6. Lasda E, Parker R. Circular RNAs: diversity of form and function. RNA. 2014;20:1829–1842. doi: 10.1261/rna.047126.114.
7. Liu X, Zhang Y, Zhou S, et al. Circular RNA: an emerging frontier in RNA therapeutic targets, RNA therapeutics, and mRNA vaccines. J Controlled Release. 2022;348:84–94. doi: 10.1016/j.jconrel.2022.05.043.
8. Lasda E, Parker R. Circular RNAs: diversity of form and function. RNA. 2014;20:1829–1842. doi: 10.1261/rna.047126.114.
9. Wang PL, Bao Y, Yee M-C, et al. Circular RNA is expressed across the eukaryotic tree of life. PLoS One. 2014;9:e90859. doi: 10.1371/journal.pone.0090859.
10. Burd CE, Jeck WR, Liu Y, et al. Expression of linear and novel circular forms of an INK4/ARF-associated non-coding RNA correlates with atherosclerosis risk. PLoS Genet. 2010;6:e1001233. doi: 10.1371/journal.pgen.1001233.
11. Najafi S. The emerging roles and potential applications of circular RNAs in ovarian cancer: a comprehensive review. J Cancer Res Clin Oncol. 2023;149:2211–2234. doi: 10.1007/s00432-022-04328-z.
12. Zhang Z-h, Wang Y, Zhang Y, et al. The function and mechanisms of action of circular RNAs in Urologic Cancer. Mol Cancer. 2023;22:61. doi: 10.1186/s12943-023-01766-2.
13. Vo JN, Cieslik M, Zhang Y, et al. The landscape of circular RNA in cancer. Cell. 2019;176:869–881.e13. doi: 10.1016/j.cell.2018.12.021.
14. Gao Y, Zhang J, Zhao F. Circular RNA identification based on multiple seed matching. Brief Bioinform. 2018;19:803–810. doi: 10.1093/bib/bbx014.
15. Westholm JO, Miura P, Olson S, et al. Genome-wide analysis of Drosophila circular RNAs reveals their structural and sequence properties and age-dependent neural accumulation. Cell Rep. 2014;9:1966–1980. doi: 10.1016/j.celrep.2014.10.062.
16. Memczak S, Jens M, Elefsinioti A, et al. Circular RNAs are a large class of animal RNAs with regulatory potency. Nature. 2013;495:333–338. doi: 10.1038/nature11928.
17. Zhang X-O, Wang H-B, Zhang Y, et al. Complementary sequence-mediated exon circularization. Cell. 2014;159:134–147. doi: 10.1016/j.cell.2014.09.001.
18. Wang K, Singh D, Zeng Z, et al. MapSplice: accurate mapping of RNA-Seq reads for splice junction discovery. Nucleic Acids Res. 2010;38:e178. doi: 10.1093/nar/gkq622.
19. Glažar P, Papavasileiou P, Rajewsky N. circBase: a database for circular RNAs. RNA. 2014;20:1666–1670. doi: 10.1261/rna.043687.113.
20. Wu W, Ji P, Zhao F. scRNA-Seq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation. Genome Biol. 2020;21:1–14. doi: 10.1186/s13059-019-1906-x.
21. Zhang W, Liu Y, Min Z, et al. circMine: a comprehensive database to integrate, analyze and visualize human disease–related circRNA transcriptome. Nucleic Acids Res. 2022;50:D83–D92. doi: 10.1093/nar/gkab809.
22. Xu X, Du T, Mao W, et al. PlantcircBase 7.0: full-length transcripts and conservation of plant circRNAs. Plant Commun. 2022;3:100343. doi: 10.1016/j.xplc.2022.100343.
23. Xin R, Gao Y, Gao Y, et al. isoCirc catalogs full-length circular RNA isoforms in human transcriptomes. Nat Commun. 2021;12:266. doi: 10.1038/s41467-020-20459-8.
24. Zheng Y, Ji P, Chen S, et al. Reconstruction of full-length circular RNAs enables isoform-level quantification. Genome Med. 2019;11:1–20. doi: 10.1186/s13073-019-0614-1.
25. Amarasinghe SL, Su S, Dong X, et al. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 2020;21:1–16. doi: 10.1186/s13059-020-1935-5.
26. Hou L, Zhang J, Zhao F. Full-length circular RNA profiling by nanopore sequencing with CIRI-long. Nat Protoc. 2023;18:1795–1813. doi: 10.1038/s41596-023-00815-w.
27. Stefanov SR, Meyer IM. CYCLeR-a novel tool for the full isoform assembly and quantification of circRNAs. Nucleic Acids Res. 2023;51:e10. doi: 10.1093/nar/gkac1100.
28. Chaitanya K. Structure and organization of virus genomes. Genome Genomics: From Archaea Eukaryotes. Springer Nature Singapore Pte Ltd; 2019.
29. Tan KE, Lim YY. Viruses join the circular RNA world. FEBS J. 2021;288:4488–4502. doi: 10.1111/febs.15639.
30. Huang J-t, Chen J-n, Gong L-p, et al. Identification of virus-encoded circular RNA. Virology. 2019;529:144–151. doi: 10.1016/j.virol.2019.01.014.
31. Liu Q, Shuai M, Xia Y. Knockdown of EBV-encoded circRNA circRPMS1 suppresses nasopharyngeal carcinoma cell proliferation and metastasis through sponging multiple miRNAs. Cancer Manag Res. 2019;11:8023–8031. doi: 10.2147/CMAR.S218967.
32. Ng YC, Chung W-C, Kang H-R, et al. A DNA-sensing–independent role of a nuclear RNA helicase, DHX9, in stimulation of NF-κB–mediated innate immunity against DNA virus infection. Nucleic Acids Res. 2018;46:9011–9026. doi: 10.1093/nar/gky742.
33. Sekiba K, Otsuka M, Ohno M, et al. DHX9 regulates production of hepatitis B virus-derived circular RNA and viral protein levels. Oncotarget. 2018;9:20953–20964. doi: 10.18632/oncotarget.25104.
34. Zhu M, Liang Z, Pan J, et al. Hepatocellular carcinoma progression mediated by hepatitis B virus-encoded circRNA HBV_circ_1 through interaction with CDK1. Mol Ther-Nucleic Acids. 2021;25:668–682. doi: 10.1016/j.omtn.2021.08.011.
35. Niu M, Ju Y, Lin C, et al. Characterizing viral circRNAs and their application in identifying circRNAs in viruses. Brief Bioinform. 2022;23:bbab404. doi: 10.1093/bib/bbab404.
36. Barbagallo D, Palermo CI, Barbagallo C, et al. Competing endogenous RNA network mediated by circ_3205 in SARS-CoV-2 infected cells. Cell Mol Life Sci. 2022;79:75. doi: 10.1007/s00018-021-04119-8.
37. Chen S, Zheng J, Zhang B, et al. Identification and characterization of virus-encoded circular RNAs in host cells. Microb Genom. 2022;8. doi: 10.1099/mgen.0.000848.
38. Li I, Chen YG. Emerging roles of circular RNAs in innate immunity. Curr Opin Immunol. 2021;68:107–115. doi: 10.1016/j.coi.2020.10.010.
39. Fu P, Wu Y, Zhang Z, et al. VIGA: an one-stop tool for eukaryotic Virus Identification and Genome Assembly from next-generation-sequencing data. bioRxiv. 2023;2023.06.14.545025. doi: 10.1101/2023.06.14.545025.
40. Andrews S. FastQC: a quality control tool for high throughput sequence data. Babraham bioinformatics. Cambridge: Babraham Institute; 2010.
41. Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–i890. doi: 10.1093/bioinformatics/bty560.
42. Zhou C, Liu S, Song W, et al. Characterization of viral RNA splicing using whole-transcriptome datasets from host species. Sci Rep. 2018;8:1–14.
43. Yao W, Pan J, Liu Z, et al. The cellular and viral circRNAome induced by respiratory syncytial virus infection. mBio. 2021;12:e0307521. doi: 10.1128/mBio.03075-21.
44. Gao Y, Wang J, Zheng Y, et al. Comprehensive identification of internal structure and alternative splicing events in circular RNAs. Nat Commun. 2016;7:12060. doi: 10.1038/ncomms12060.
45. Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-Seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
46. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30:923–930. doi: 10.1093/bioinformatics/btt656.
47. Wu T, Hu E, Xu S, et al. clusterProfiler 4.0: a universal enrichment tool for interpreting omics data. Innovation (Cambridge (Mass.)). 2021;2:100141. doi: 10.1016/j.xinn.2021.100141.
48. Zhu W, Brendel V. Identification, characterization and molecular phylogeny of U12-dependent introns in the Arabidopsis thaliana genome. Nucleic Acids Res. 2003;31:4561–4572. doi: 10.1093/nar/gkg492.
49. Levine A, Durbin R. A computational scan for U12-dependent introns in the human genome sequence. Nucleic Acids Res. 2001;29:4006–4013. doi: 10.1093/nar/29.19.4006.
50. Gao Y, Wang J, Zheng Y, et al. Comprehensive identification of internal structure and alternative splicing events in circular RNAs. Nat Commun. 2016;7:12060. doi: 10.1038/ncomms12060.
51. Ye C-Y, Zhang X, Chu Q, et al. Full-length sequence assembly reveals circular RNAs with diverse non-GT/AG splicing signals in rice. RNA Biol. 2017;14:1055–1063. doi: 10.1080/15476286.2016.1245268.
52. Chu Q, Bai P, Zhu X, et al. Characteristics of plant circular RNAs. Brief Bioinform. 2018;21:135–143. doi: 10.1093/bib/bby111.
53. Ji P, Wu W, Chen S, et al. Expanded expression landscape and prioritization of circular RNAs in mammals. Cell Rep. 2019;26:3444–3460.e5. doi: 10.1016/j.celrep.2019.02.078.
54. Tagawa T, Oh D, Santos J, et al. Characterizing expression and regulation of gamma-herpesviral circular RNAs. Front Microbiol. 2021;12:670542. doi: 10.3389/fmicb.2021.670542.
55. Zhao J, Lee EE, Kim J, et al. Transforming activity of an oncoprotein-encoding circular RNA from human papillomavirus. Nat Commun. 2019;10:2300. doi: 10.1038/s41467-019-10246-5.
56. Chasseur AS, Trozzi G, Istasse C, et al. Marek’s disease virus virulence genes encode circular RNAs. J Virol. 2022;96:e00321-22. doi: 10.1128/jvi.00321-22.
57. Eger N, Schoppe L, Schuster S, et al. Circular RNA splicing. Adv Exp Med Biol. 2018;1087:41–52. doi: 10.1007/978-981-13-1426-1_4.
58. Cao X, Xu X, Dong J, et al. Genome-wide identification and functional analysis of circRNAs in Trichophyton rubrum conidial and mycelial stages. BMC Genom. 2022;23:1–18.
59. Aktaş T, Avşar Ilık İ, Maticzka D, et al. DHX9 suppresses RNA processing defects originating from the Alu invasion of the human genome. Nature. 2017;544:115–119. doi: 10.1038/nature21715.
60. Li X, Liu C-X, Xue W, et al. Coordinated circRNA biogenesis and function with NF90/NF110 in viral infection. Mol Cell. 2017;67:214–227.e7. doi: 10.1016/j.molcel.2017.05.023.
 