Genomics, Proteomics & Bioinformatics. 2020 Jul 16;18(2):161–172. doi: 10.1016/j.gpb.2018.12.011

IC4R-2.0: Rice Genome Reannotation Using Massive RNA-seq Data

Jian Sang 1,2,3,#, Dong Zou 1,2,#, Zhennan Wang 3,4,#, Fan Wang 1,2, Yuansheng Zhang 1,2,3, Lin Xia 1,2,3, Zhaohua Li 1,2,3, Lina Ma 1,2, Mengwei Li 1,2,3, Bingxiang Xu 1,3, Xiaonan Liu 1,2,3, Shuangyang Wu 1,3, Lin Liu 1,2,3, Guangyi Niu 1,2,3, Man Li 1,2,3, Yingfeng Luo 1,3, Songnian Hu 1,3,††, Lili Hao 1,2,⁎,#, Zhang Zhang 1,2,3
PMCID: PMC7646092  PMID: 32683045

Abstract

Genome reannotation aims for complete and accurate characterization of gene models and thus is of critical significance for in-depth exploration of gene function. Although the availability of massive RNA-seq data provides great opportunities for gene model refinement, few efforts have been made to adopt these precious data in rice genome reannotation. Here we reannotate the rice (Oryza sativa L. ssp. japonica) genome based on integration of large-scale RNA-seq data and release a new annotation system IC4R-2.0. In general, IC4R-2.0 significantly improves the completeness of gene structure, identifies a number of novel genes, and integrates a variety of functional annotations. Furthermore, long non-coding RNAs (lncRNAs) and circular RNAs (circRNAs) are systematically characterized in the rice genome. Performance evaluation shows that compared to previous annotation systems, IC4R-2.0 achieves higher integrity and quality, primarily attributable to massive RNA-seq data applied in genome annotation. Consequently, we incorporate the improved annotations into the Information Commons for Rice (IC4R), a database integrating multiple omics data of rice, and accordingly update IC4R by providing more user-friendly web interfaces and implementing a series of practical online tools. Together, the updated IC4R, which is equipped with the improved annotations, bears great promise for comparative and functional genomic studies in rice and other monocotyledonous species. The IC4R-2.0 annotation system and related resources are freely accessible at http://ic4r.org/.

Keywords: Genome reannotation, IC4R, Rice, RNA-seq, Gene model

Introduction

As a major crop for more than 7000 years, rice is one of the most important staple foods, feeding a large number of people throughout the world, and is of vital significance for global food security. Possessing a relatively small genome and high genetic transformation efficiency, rice is also an excellent model system for studying monocotyledonous biology [1]. Since 1997, great efforts have been devoted to deciphering the rice (Oryza sativa L. ssp. japonica and indica) genomes [2], [3], [4], and finally in 2005, representative genomes were assembled at chromosome scale [5], [6]. It should be noted, however, that any rice genome can be fully utilized only when a high-quality annotation is available; incomplete, incorrect, or ambiguous annotation can pose considerable obstacles to comprehensive characterization of gene function and in-depth exploration of the molecular mechanisms underlying complex agronomic traits. Therefore, complete and accurate genome annotation is of fundamental importance for rice studies [7], [8], [9], [10].

Genome reannotation holds the potential not only to improve structural and functional information but also to discover novel protein-coding and non-coding genes. Next-generation sequencing (NGS) technologies have triggered an explosion of RNA-seq data, providing a great opportunity for genome reannotation. As RNA-seq analysis enables identification of splice junction sites and novel exons with higher confidence [11], rice genome annotation can be expected to improve significantly based on these data, especially considering the efforts that have already been made in other species [12], [13], [14]. Currently, there are two widely used annotation systems for the rice (O. sativa L. ssp. japonica) genome, namely, MSU-7.0 and RAP-DB [10]. However, they were generated mainly from expressed sequence tags (ESTs) and cDNA sequences, with only a limited amount of high-throughput NGS data integrated [10]. Although RNA-seq libraries from various rice tissues and diverse experimental conditions are growing at an unprecedented pace, so far, no attempt has been made to apply all these valuable resources to rice gene model refinement. Therefore, it is highly desirable to reannotate the rice genome based on large-scale integration of high-throughput transcriptomic data.

The Information Commons for Rice (IC4R, http://ic4r.org) [15], [16], [17], one of the core resources of the National Genomics Data Center (NGDC, http://bigd.big.ac.cn) [18], [19], [20], is a public database integrating multiple omics data for rice and providing high-quality annotations. Here, we perform rice genome reannotation based on integration of large-scale RNA-seq data and consequently release a new annotation system, IC4R-2.0, for O. sativa L. ssp. japonica. IC4R-2.0 presents considerable improvements by enhancing structural completeness of protein-coding genes, incorporating an abundance of functional annotations, and systematically identifying long non-coding RNAs (lncRNAs) and circular RNAs (circRNAs) in the rice genome. Accordingly, we upgrade the IC4R database by providing more user-friendly web interfaces and implementing a series of practical online tools. Collectively, the improved annotation system IC4R-2.0 and the updated database remarkably increase the utility of the rice genome, thereby bearing great promise for comparative and functional genomic studies in rice and other monocotyledonous species.

Method

RNA-seq data collection

More than 1800 RNA-seq datasets of O. sativa L. ssp. japonica released before May 1st, 2017 were downloaded from NCBI Sequence Read Archive (SRA) [21] and NGDC Genome Sequence Archive (GSA) [22]. These datasets were generated from a diversity of rice tissues across various developmental stages and experimental conditions. After removal of libraries with short sequencing reads (average length < 36 bp) and unclear meta-information, a total of 1503 RNA-seq datasets (http://ic4r.org/statistics/RNA-Seq-dataset) with approximately 5.32 terabytes in file size (FASTQ format) were used for rice genome reannotation.

Genome reannotation process

RNA-seq datasets in SRA format were converted into FASTQ format using the SRA Toolkit (v.2.4.2). Raw reads were adapter-trimmed and quality-filtered (Phred score ≥ 33; read length ≥ 36 bp) using Trimmomatic (v.0.36) [23] with parameters (LEADING: 15, TRAILING: 15, SLIDINGWINDOW: 4:15). Processed reads were then aligned against the reference genome (Os-Nipponbare-Reference-IRGSP-1.0), obtained from the Rice Genome Annotation Project (http://rice.plantbiology.msu.edu), using the HISAT aligner (v.2.1.0) [24] with default parameters.
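The trimming and alignment steps above can be sketched as command lines assembled in Python. This is only an illustration under stated assumptions: paired-end input, the file names, the jar path, the index prefix, the `hisat2` executable name, and the use of `MINLEN:36` to express the 36 bp read-length cutoff are not given in the text, and the adapter-clipping step is left out because the adapter file is not specified.

```python
from typing import List


def trimmomatic_cmd(r1: str, r2: str, prefix: str,
                    jar: str = "trimmomatic-0.36.jar") -> List[str]:
    """Build a paired-end Trimmomatic call using the trimming steps quoted in the text."""
    return [
        "java", "-jar", jar, "PE",
        r1, r2,
        f"{prefix}_1P.fq.gz", f"{prefix}_1U.fq.gz",  # forward paired / unpaired
        f"{prefix}_2P.fq.gz", f"{prefix}_2U.fq.gz",  # reverse paired / unpaired
        "LEADING:15", "TRAILING:15", "SLIDINGWINDOW:4:15",
        "MINLEN:36",  # assumption: length cutoff expressed as a MINLEN step
    ]


def hisat2_cmd(index_prefix: str, r1: str, r2: str, out_sam: str) -> List[str]:
    """Build a default-parameter alignment call (index prefix is hypothetical)."""
    return ["hisat2", "-x", index_prefix, "-1", r1, "-2", r2, "-S", out_sam]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    trim = trimmomatic_cmd("run_1.fastq.gz", "run_2.fastq.gz", "run_clean")
    align = hisat2_cmd("IRGSP-1.0", "run_clean_1P.fq.gz", "run_clean_2P.fq.gz", "run.sam")
    print(" ".join(trim))
    print(" ".join(align))
```

The commands are returned as argument lists so that they could be passed to `subprocess.run` without shell quoting issues.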

The alignment files in SAM format were converted into BAM format and sorted by SAMtools (v.0.1.19) with default parameters. StringTie (v.1.1.2) [24] was then used to assemble the sorted BAM files into transcripts under the guidance of MSU-7.0, and also to estimate their expression levels in transcripts per million (TPM). The junction reads spanning exon–exon sites of transcripts were annotated and counted using regtools (v.0.5.0). To reduce the noise caused by lowly expressed and fragmented sequences, reconstructed transcripts were filtered using relatively strict thresholds (length ≥ 200 bp; TPM ≥ 2.0; minimum per-base read coverage ≥ 2.5). In addition, every exon–exon junction site of a retained transcript was required to be spanned by valid supporting junction reads. After that, GMAP (v.2015) [25] and BLAT (v.35) [26] were used together to align the resulting non-redundant transcripts back to their genomic loci and further merge them into more complete and coordinated transcripts. Then, PASA (v.2.1.0) [27] was used to update MSU-7.0 gene models according to the new transcripts.
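The transcript filter described above can be illustrated with a small sketch. The thresholds come from the text; the `Transcript` record, its field names, and the minimum of one supporting read per junction are assumptions made for illustration, since the text does not state how many junction reads count as valid support.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Transcript:
    transcript_id: str
    length_bp: int
    tpm: float
    coverage: float  # mean reads per base
    junction_reads: List[int] = field(default_factory=list)  # one count per exon-exon junction


def passes_filters(t: Transcript, min_len: int = 200, min_tpm: float = 2.0,
                   min_cov: float = 2.5, min_junction_reads: int = 1) -> bool:
    """Apply the length, TPM, coverage, and junction-support criteria."""
    if t.length_bp < min_len or t.tpm < min_tpm or t.coverage < min_cov:
        return False
    # Every junction needs support; a transcript without junctions passes this check trivially.
    return all(n >= min_junction_reads for n in t.junction_reads)


def filter_transcripts(transcripts: Iterable[Transcript]) -> List[Transcript]:
    return [t for t in transcripts if passes_filters(t)]


if __name__ == "__main__":
    demo = [
        Transcript("t1", 1500, 8.2, 12.0, [14, 9]),   # kept
        Transcript("t2", 180, 30.0, 40.0, [25]),      # too short
        Transcript("t3", 900, 5.0, 6.0, [7, 0]),      # one unsupported junction
    ]
    print([t.transcript_id for t in filter_transcripts(demo)])
```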

Regarding lncRNAs, transcripts with a single exon, length < 200 bp, or ORF size > 100 bp were excluded. Afterwards, the remaining transcripts were compared to the updated protein-coding annotation by Cuffcompare (v.2.2.1). Transcripts with class codes ‘u’ (unknown intergenic transcript), ‘o’ (generic exonic overlap with a reference), and ‘x’ (natural antisense transcript) were selected for further analysis. All retained transcripts were searched against the plant protein sequences from UniRef90 [28] using BLASTX (v.2.2.31+) (E-value cutoff: 1E–05) to remove potential protein-coding transcripts. After that, Coding Potential Calculator (CPC) (v.0.9-r2) [29] and LGC (v.1.0) [30] were both applied for lncRNA identification with default parameters, and only the consensus transcripts identified by both tools were incorporated into IC4R-2.0.
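The successive lncRNA filters can be summarized as a single decision function. This is a sketch of the logic only: the `Candidate` record and its field names are hypothetical, and the BLASTX, CPC, and LGC results are assumed to have been computed beforehand and reduced to booleans.

```python
from dataclasses import dataclass

KEPT_CLASS_CODES = {"u", "o", "x"}  # Cuffcompare class codes retained in the text


@dataclass
class Candidate:
    transcript_id: str
    exon_count: int
    length_bp: int
    orf_size: int          # same unit as the ORF cutoff quoted in the text
    class_code: str        # Cuffcompare class code against the updated annotation
    has_blastx_hit: bool   # hit against plant UniRef90 proteins at E-value <= 1E-05
    cpc_noncoding: bool    # CPC calls the transcript non-coding
    lgc_noncoding: bool    # LGC calls the transcript non-coding


def is_lncrna(c: Candidate) -> bool:
    """Return True only if the candidate survives every filter step."""
    if c.exon_count < 2 or c.length_bp < 200 or c.orf_size > 100:
        return False
    if c.class_code not in KEPT_CLASS_CODES:
        return False
    if c.has_blastx_hit:
        return False
    # Consensus rule: both coding-potential tools must agree.
    return c.cpc_noncoding and c.lgc_noncoding


if __name__ == "__main__":
    print(is_lncrna(Candidate("cand1", 3, 850, 60, "u", False, True, True)))
```

Keeping the steps in the order used in the text makes it easy to count how many candidates are lost at each stage if the function is split into separate checks.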

Among the 1503 RNA-seq datasets, only those generated from RiboMinus or RiboZero sequencing libraries were selected for circRNA identification. After quality control, the remaining clean reads were aligned against the reference genome (Os-Nipponbare-Reference-IRGSP-1.0) by BWA (v.0.7.10-r789). Then, circRNAs were detected and characterized by CIRI (v.2.0.6) [31] through a two-step process: (1) detecting junction reads with paired chiastic clipping (GT-AC) signals; (2) detecting additional junction reads and further filtering to remove false positives caused by incorrectly mapped reads.

Characterization of tissue specificity

The expression breadth, coefficient of variation (CV), and tissue specificity index (τ-value) [32] were used to evaluate expression variability for both protein-coding genes and lncRNAs. Specifically, τ-values range between 0 and 1, where lower τ-values represent less variable expression profiles across tissues, and vice versa. The criteria for selecting housekeeping (HK) and tissue-specific (TS) genes were as follows: (1) HK genes are defined as genes with τ-value < 0.5 and CV < 0.5 that are expressed in > 80% of tissues; (2) TS genes are defined as genes with τ-value ≥ 0.95 that are expressed in < 15% of tissues; (3) expressed invariable genes (EIGs) are defined as a stricter subset of HK genes with relatively constant expression levels, which have τ-value < 0.45 and CV < 0.5 and are expressed in > 85% of tissues [33].
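A minimal sketch of these measures follows. It assumes the commonly used form of the tissue specificity index, τ = Σ(1 − x_i / x_max) / (n − 1), applied to untransformed per-tissue TPM values, a population standard deviation for the CV, and a TPM cutoff of 1.0 for calling a gene expressed in a tissue; none of these three choices is stated in the text.

```python
from statistics import mean, pstdev
from typing import Sequence


def tau(expr: Sequence[float]) -> float:
    """Tissue specificity index over per-tissue expression values."""
    if len(expr) < 2:
        raise ValueError("tau needs at least two tissues")
    x_max = max(expr)
    if x_max <= 0:
        raise ValueError("tau is undefined when the gene is not expressed anywhere")
    return sum(1.0 - x / x_max for x in expr) / (len(expr) - 1)


def cv(expr: Sequence[float]) -> float:
    """Coefficient of variation (population standard deviation over mean)."""
    return pstdev(expr) / mean(expr)


def breadth(expr: Sequence[float], min_tpm: float = 1.0) -> float:
    """Fraction of tissues in which the gene is expressed (cutoff is an assumption)."""
    return sum(x >= min_tpm for x in expr) / len(expr)


def classify(expr: Sequence[float]) -> str:
    """Label a gene as EIG, HK, TS, or other using the thresholds in the text."""
    t, c, b = tau(expr), cv(expr), breadth(expr)
    if t < 0.45 and c < 0.5 and b > 0.85:
        return "EIG"  # strictest housekeeping subset is checked first
    if t < 0.5 and c < 0.5 and b > 0.80:
        return "HK"
    if t >= 0.95 and b < 0.15:
        return "TS"
    return "other"


if __name__ == "__main__":
    print(classify([10.0] * 18))            # constant expression in all 18 tissues
    print(classify([100.0] + [0.0] * 17))   # expressed in a single tissue
```

Because EIGs are a subset of HK genes, the function reports the most specific label; a caller that needs the HK set would count both "EIG" and "HK".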

Database implementation

The updated IC4R was implemented using Java Platform Enterprise Edition (J2EE) for the back-end components and deployed on Apache Tomcat Server (an open-source Java Servlet Container) on a CentOS release 6.5 Linux system. Hypertext Markup Language 5 (HTML5), Cascading Style Sheets 3 (CSS3), Asynchronous JavaScript and XML (AJAX), Data-Driven Documents (D3), Bootstrap, and jQuery were used together to provide user-friendly and interactive front-end web interfaces. All annotation data in the updated IC4R are stored and managed in the open-source MySQL relational database system.

Genome reannotation

In this study, we set up the IC4R reannotation pipeline based upon a reference-guided transcript assembly and reconstruction strategy (Figure S1). As a result, a total of 9,826,047 non-redundant transcripts (17.64 GB in FASTA format) are obtained from 1503 public RNA-seq datasets, representing a great abundance of rice transcriptomes from various tissues and diverse experimental conditions. The reconstructed transcripts are mapped back to the reference genome and subsequently merged into coordinated transcripts to generate the new annotation system, IC4R-2.0.

Structural improvements

In total, IC4R-2.0 comprises 56,221 protein-coding gene loci corresponding to 80,039 mRNAs. Compared to the previous two annotation systems (MSU-7.0 and RAP-DB), the completeness of gene structure is improved in IC4R-2.0, as the mean lengths of mRNAs, coding sequences (CDS), and exons, as well as the average number of exons per transcript, are all increased (Figure 1A–D). Another improvement is an increase in the number of mRNAs, attributed to the inclusion of both 5′ and 3′ untranslated regions (UTRs) (Table 1). Meanwhile, 16.36% of rice gene models are identified as alternatively spliced, corresponding to 1.42 isoforms per gene on average, which is higher than the values for the previous annotation systems (1.18 for MSU-7.0 and 1.15 for RAP-DB). Notably, more alternative splicing events with intron retention are identified in IC4R-2.0 (Figure S2).

Figure 1.

Comparison of structural features among different rice genome annotation systems

Structural features of rice genes in terms of mRNA length (A), CDS length (B), 5′-UTR length (C), and 3′-UTR length (D) were compared among IC4R-2.0, developed in the current study, and two previous annotation systems, MSU-7.0 and RAP-DB. P values were calculated using Student’s t-tests. *, P < 0.05; **, P < 0.01; ***, P < 0.001.

Table 1.

Statistics of three different annotation systems for rice genome


Note: The RAP-DB (V2018-03-29) annotation system was obtained from https://rapdb.dna.affrc.go.jp/; the MSU-7.0 annotation system was obtained from http://rice.plantbiology.msu.edu on April 12, 2018. BUSCO, Benchmarking Universal Single-Copy Orthologs.

Based on the large-scale integration of RNA-seq data, more than 27,000 gene loci are improved in IC4R-2.0 through structural modification (including gene extension and gene merging) and novel gene identification. For example, LOC_Os12g32950, previously annotated in MSU-7.0, is extended into a more complete gene model, IC4R-OSJ12G289800, in IC4R-2.0 by adding a 3′ boundary exon, which is well supported by RAP-DB (Figure 2A). Another case of gene extension is observed in IC4R-OSJ12G211900, which is improved not only by addition of an internal exon to the MSU-7.0 locus, but also by inclusion of the 5′-UTR and 3′-UTR (Figure 2B). Furthermore, IC4R-2.0 updates 218 loci by gene merging. For example, two neighbouring loci (LOC_Os07g47280 and LOC_Os07g47284) that were originally annotated as separate genes in MSU-7.0 are merged to form a single gene (IC4R-OSJ07G424200) in IC4R-2.0, which consistently contains complete domains of DNA polymerase zeta catalytic (Figure 2C and S3). Strikingly, IC4R-2.0, based on massive RNA-seq data, identifies a total of 456 novel genes, which possess sufficient RNA-seq evidence but have not been reported in any previous annotation system (Figure 2D). These novel loci are further verified via multiple protein sequence alignments; taking IC4R-OSJ01G191000 as an example, its reliability is well supported by protein homologs in four other Oryza species (Figure 3).

Figure 2.


Structural improvements of IC4R-2.0

A. Gene structural update by adding 3′-exon. B. Gene extension by adjunction of internal exon as well as inclusion of both 5′-UTR and 3′-UTR. C. Gene fusion. D. Novel gene identification. RNA-seq evidence including read coverage and read alignment is displayed.

Figure 3.

Sequence alignment and phylogenetic tree of a protein encoded by a newly identified gene

The protein sequence of IC4R-OSJ01G191000 was aligned with its homologs from Oryza barthii, Oryza glaberrima, Oryza nivara, and Oryza sativa indica by ClustalX (v2.1). Sequence alignment color code was determined according to the ClustalX color scheme by Jalview software. Jnetpred is used for secondary structure prediction (red ribbon for α-helix and green ribbon for ß-sheet) and the prediction confidence is estimated using JNETCONF, with higher values for higher confidence. A phylogenetic tree was constructed using the protein sequences for multiple alignment based on neighbor-joining algorithm using MEGA v7.0 with 1000 bootstrap replications.

Characterization of gene expression patterns

To explore the expression patterns of all updated gene models in IC4R-2.0, we investigate their tissue specificity (estimated by τ-value) and expression level (estimated by TPM). When comparing IC4R-2.0 with MSU-7.0 or RAP-DB, all genes can be classified as updated genes (those with different gene structures or newly identified) or non-updated genes (those with no difference between the two compared annotation systems). In the comparison with MSU-7.0, we find that the updated gene group in IC4R-2.0 exhibits significantly higher τ-values and lower TPM values than the non-updated gene group. Similar trends are obtained when comparing IC4R-2.0 with RAP-DB (Figure 4A and B). These results indicate that genes in the updated group tend to be more tissue-specific and lowly expressed, as RNA-seq data provide higher-resolution transcriptomic evidence and accordingly enable more accurate identification of lowly expressed and/or tissue-specific genes.

Figure 4.

Comparison of tissue specificity index and expression abundance between updated and non-updated gene groups in IC4R-2.0

The newly identified gene models and those with updated structure in IC4R-2.0 in comparison with previous gene models in MSU-7.0 or RAP-DB are included in the updated group, while genes in IC4R-2.0 with the same structure as the previous gene models in MSU-7.0 or RAP-DB are included in the non-updated gene group. P values were calculated by Student’s t-tests. *, P < 0.05; **, P < 0.01; ***, P < 0.001.

Notably, we find that 3996 gene loci annotated in IC4R-2.0 were controversial in the previous annotation systems. For instance, IC4R-OSJ08G082200 and IC4R-OSJ06G014600 were each annotated in only one of the annotation systems (RAP-DB and MSU-7.0, respectively), without agreement between the two. IC4R-2.0 verifies the annotation reliability of these two genes with strong evidence from both RNA-seq data (Figure S4A and B) and multiple sequence alignments of the protein products (Figure S5A and B). These results indicate that IC4R-2.0, with the advantage of RNA-seq-based evidence, is capable of bridging the gaps between MSU-7.0 and RAP-DB, and thereby achieves more confident improvements for rice genome annotation.

Functional annotation

IC4R-2.0 additionally presents significant improvements by acquiring multi-level functional annotations. First, all protein sequences identified in IC4R-2.0 are blasted against plant sequences from UniRef90 database. Functional descriptions of the best BLASTP hit (E-value cut-off: 1E–05) are subsequently extracted, corresponding to a total of 47,693 (85.7%) protein-coding genes. Second, a comprehensive set of ontologies, including Gene Ontology (GO), Trait Ontology (TO), Environment Ontology (EO), and Plant Ontology (PO), are retrieved by Blast2GO [34] or via ID mapping to other plant ontology resources. As a result, 43,066 protein-coding genes gain ontology terms. Meanwhile, functional motifs and domains are identified in IC4R-2.0 using InterProScan [35], yielding a total of 155,618 functional entries assigned to 48,159 gene models. Taken together, 55,080 protein-coding genes in IC4R-2.0 are functionally annotated and the detailed statistics are summarized in Table S1.

Identification of ncRNAs

lncRNAs and circRNAs are important functional regulators involved in many aspects of plant biology [31], [36]. Taking advantage of abundant RNA-seq data, we systematically carry out a genome-wide identification of lncRNAs and circRNAs in the rice genome. As a result, 3215 lncRNA loci corresponding to 6259 transcripts are identified (Table 1). Further investigation shows that the size of lncRNA transcripts ranges from 201 to 14,159 bp, with a mean length of 1191 bp. Consistent with a previous report [13], lncRNAs, in contrast to protein-coding genes, have lower expression levels and fewer exons (Figure 5A and B). Moreover, narrower expression breadth and higher τ-values are consistently observed in lncRNAs, suggesting that they are likely to be more tissue-specific (Figure 5C and D). Meanwhile, a total of 4373 circRNAs are identified, among which 3342 (76.42%) are exonic, 762 (17.42%) are intergenic, and the remaining 269 (6.16%) are intronic. All these ncRNAs are publicly available at the IC4R website (http://ic4r.org/browse/lncRNA and http://ic4r.org/browse/circRNA).

Figure 5.

Characterization of lncRNAs in IC4R-2.0

Expression breadth refers to the number of tissues in which the lncRNAs or protein-coding genes are expressed. Eighteen tissue types of rice are included in the current study according to the corresponding meta-information of RNA-seq libraries. These include aleurone, callus, coleoptile, crown, embryo, endosperm, flower, leaf, meristem, node, panicle, root, seed, seedling, sheath, shoot, spikelet, and stem.

Evaluation of annotation

We evaluate IC4R-2.0 by comparison with MSU-7.0 and RAP-DB in terms of annotation completeness and quality. First, to assess the completeness of IC4R-2.0, we carry out routine analysis of Benchmarking Universal Single-Copy Orthologs (BUSCO) [37] based on the latest plant dataset (embryophyta odb9). As a result, more complete BUSCO genes are identified in IC4R-2.0 (1389) than in MSU-7.0 (1378) and RAP-DB (1190), and the number of missing BUSCO genes in IC4R-2.0 is reduced accordingly (Table 1 and Figure S6). Additionally, we use the MAKER-P software package [38] to evaluate the annotation quality as indicated by Annotation Edit Distance (AED), which ranges from 0 to 1 for each transcript. Specifically, lower AED values represent higher annotation quality, and vice versa. Consistently, IC4R-2.0 gives rise to a cumulative curve that shifts to the left, presenting lower AED values than MSU-7.0 or RAP-DB (Figure S7). Together, based on the results shown above, IC4R-2.0 represents a more complete annotation system with better quality, which is primarily attributable to the massive RNA-seq data applied in the genome reannotation.
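To make the AED scale concrete, the sketch below computes a nucleotide-level AED for one transcript using the usual definition AED = 1 − (SN + SP) / 2, where SN is the fraction of evidence bases covered by the gene model and SP is the fraction of gene-model bases covered by the evidence. It is an illustration of the measure, not the MAKER-P implementation; the interval representation and the coordinates in the example are assumptions.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # exon as (start, end), 1-based and inclusive


def covered_bases(exons: Iterable[Interval]) -> Set[int]:
    """Expand exon intervals into the set of genomic positions they cover."""
    bases: Set[int] = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def aed(model: Iterable[Interval], evidence: Iterable[Interval]) -> float:
    """Annotation Edit Distance between a gene model and its aligned evidence."""
    m, e = covered_bases(model), covered_bases(evidence)
    if not m or not e:
        return 1.0  # no model or no evidence: treated as no support
    overlap = len(m & e)
    sensitivity = overlap / len(e)   # evidence bases recovered by the model
    specificity = overlap / len(m)   # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    model = [(1, 100), (201, 300)]
    print(aed(model, model))                    # perfect agreement
    print(aed([(1, 100)], [(51, 150)]))         # half of each side overlaps
    print(aed([(1, 100)], [(500, 600)]))        # no overlap at all
```

A value of 0 therefore means the model and its evidence agree base for base, and 1 means they share nothing, which is why a cumulative curve shifted to the left indicates better-supported models.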

Database update

Data organization and presentation

The IC4R database is significantly updated by incorporating the improved genome annotations and providing user-friendly interfaces for data organization and presentation of protein-coding genes, lncRNAs, and circRNAs. For protein-coding genes, IC4R houses a wealth of fundamental information, including gene summary (e.g., symbols, genomic context, and external hyperlinks), transcripts, associated functional entries, and ontologies (Figure 6A–G). Most importantly, the updated IC4R incorporates abundant information on gene expression, including expression profiles, expression breadth, τ-value, and associated RNA-seq libraries (Figure 6C). Meanwhile, it features community annotation [39], [40], allowing users to contribute their knowledge and expertise to further improve gene annotation (Figure 6H). For lncRNAs, IC4R provides additional information on coding potential scores estimated by both CPC and LGC (Figure S8A–E). For circRNAs, IC4R presents not only basic information, but also Compact Idiosyncratic Gapped Alignment Report (CIGAR) types [31] and the supporting back-spliced junction reads (Figure S9A–D).

Figure 6.

Screenshots of protein-coding gene page in IC4R

Functionality improvement

IC4R is also considerably upgraded by improving multiple functionalities. First, its information retrieval/search functionality is optimized to be more user-friendly and straightforward; it accepts a variety of keywords (including gene, lncRNA, circRNA, and domain) as queries and also supports fuzzy search. Second, IC4R incorporates a built-in BLAST module for sequence similarity search (http://ic4r.org/blast). Third, a web tool named HK-TS Gene Finder (http://ic4r.org/hk-ts) is provided to identify HK and TS genes with customized criteria. Additionally, a lightweight ID mapping tool (http://ic4r.org/idmapper) is deployed in IC4R to convert gene IDs among IC4R-2.0, MSU-7.0, and RAP-DB. Finally, an interactive genome browser, JBrowse, is implemented in IC4R, enabling users to flexibly investigate any given gene, lncRNA, or circRNA in a visualized manner.

Enhanced data accessibility

To facilitate access to the new annotation system, IC4R provides a series of flat files for public downloading (http://ic4r.org/download), including gene structural annotation (GFF format), nucleotide and protein sequences (FASTA format), correspondence between the IC4R-2.0, MSU-7.0, and RAP-DB ID systems (CSV format), predicted CpG islands (TSV format), as well as exon–exon junction information (BED format). Furthermore, to make these data accessible more efficiently, an open application programming interface (API) (http://ic4r.org/api) is provided for automatic retrieval.
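As an example of working with the downloadable structural annotation, the sketch below reads GFF-formatted text and computes the mean number of mRNA isoforms per gene, the kind of summary statistic reported for the three annotation systems. It assumes standard nine-column GFF3 with `mRNA` features that carry a `Parent` attribute; whether the IC4R file uses exactly these feature types is not stated in the text, and the identifiers in the inline sample are made up.

```python
import io
from collections import Counter
from typing import Dict, TextIO


def parse_attributes(field: str) -> Dict[str, str]:
    """Parse the ninth GFF3 column (key=value pairs separated by ';')."""
    attrs: Dict[str, str] = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def isoforms_per_gene(handle: TextIO) -> float:
    """Mean number of mRNA features per parent gene; 0.0 if no mRNA is found."""
    per_gene: Counter = Counter()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        parent = parse_attributes(cols[8]).get("Parent")
        if parent:
            per_gene[parent] += 1
    if not per_gene:
        return 0.0
    return sum(per_gene.values()) / len(per_gene)


# Hypothetical records for demonstration only.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["chr01", "demo", "gene", "100", "900", ".", "+", ".", "ID=geneA"],
    ["chr01", "demo", "mRNA", "100", "900", ".", "+", ".", "ID=geneA.1;Parent=geneA"],
    ["chr01", "demo", "mRNA", "150", "900", ".", "+", ".", "ID=geneA.2;Parent=geneA"],
    ["chr01", "demo", "gene", "2000", "2600", ".", "-", ".", "ID=geneB"],
    ["chr01", "demo", "mRNA", "2000", "2600", ".", "-", ".", "ID=geneB.1;Parent=geneB"],
])

if __name__ == "__main__":
    print(isoforms_per_gene(io.StringIO(SAMPLE)))
```

For a real file, the same function can be called on an open file handle in place of the `io.StringIO` wrapper.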

Conclusions and future directions

Here we have reannotated the rice genome based on integration of large-scale RNA-seq data and accordingly released the new annotation system IC4R-2.0. It significantly updates rice gene models by not only enhancing structural completeness of protein-coding genes, but also identifying novel genes, lncRNAs, and circRNAs. Meanwhile, considerable upgrades are made to the IC4R database by implementing more user-friendly interfaces and new functionalities, which together should be of broad utility for functional genomic studies in rice. However, we cannot rule out the possibility that the new annotation system may contain flawed mapping in duplicated genes, presumably derived from non-Nipponbare RNA-seq data. Thus, future directions include regular updates of rice reference gene models by integrating more high-quality Nipponbare-specific RNA-seq datasets (especially those with long read length) as well as other types of data (e.g., proteomics data). In addition, more efforts will be devoted to genome-wide reannotation of ncRNAs, such as microRNAs (miRNAs), PIWI-interacting RNAs (piRNAs), small interfering RNAs (siRNAs), and small nucleolar RNAs (snoRNAs). Furthermore, genome reannotation will be conducted not only for O. sativa L. ssp. japonica, but also for other cultivated and wild rice species.

Availability

The IC4R-2.0 annotation system and related resources are freely accessible at http://ic4r.org/.

Credit author statement

Jian Sang: Investigation, Formal analysis, Software, Visualization, Writing - Original Draft. Dong Zou: Software. Zhennan Wang: Investigation, Formal analysis. Fan Wang: Software. Yuansheng Zhang: Investigation, Data Curation. Lin Xia: Investigation. Zhaohua Li: Investigation. Lina Ma: Formal analysis. Mengwei Li: Formal analysis. Bingxiang Xu: Formal analysis. Xiaonan Liu: Data Curation. Shuangyang Wu: Data Curation. Lin Liu: Data Curation. Guangyi Niu: Data Curation. Man Li: Data Curation. Yingfeng Luo: Data Curation. Songnian Hu: Supervision. Lili Hao: Supervision, Software, Writing - Review & Editing. Zhang Zhang: Supervision, Writing - Review & Editing. All authors read and approved the final manuscript.

Competing interests

The authors have declared that no competing interests exist.

Acknowledgments

This work was supported by grants from the Strategic Priority Research Program of Chinese Academy of Sciences (Grant No. XDA08020102 to ZZ and SH), the Youth Innovation Promotion Association of Chinese Academy of Science (Grant No. 2018134 to LH), National Programs for High Technology Research and Development (Grant Nos. 2015AA020108 and 2012AA020409 to ZZ), the 100-Talent Program of Chinese Academy of Sciences (to YB and ZZ), and the National Natural Science Foundation of China (Grant No. 31100915 to LH).

Handled by Long Mao

Footnotes

Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences and Genetics Society of China.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2018.12.011.

Contributor Information

Songnian Hu, Email: husn@im.ac.cn.

Lili Hao, Email: haolili@big.ac.cn.

Zhang Zhang, Email: zhangzhang@big.ac.cn.

Supplementary material

The following are the Supplementary data to this article:

Supplementary Figure S1

Flowchart of rice genome reannotation procedure in IC4R

mmc1.pdf (200.8KB, pdf)
Supplementary Figure S2

Comparison of alternative splicing events annotated using IC4R-2.0, MSU-7.0, and RAP-DB

mmc2.pdf (98.8KB, pdf)
Supplementary Figure S3

Multiple protein sequence alignment of a merged gene locus and its homologs in other Oryza species. The protein sequence of IC4R-OSJ07G424200 was aligned with its homologs from Oryza sativa indica, Oryza glaberrima, Oryza rufipogon, and Oryza barthii by ClustalX (v2.1). Sequence alignment color code was determined according to the ClustalX color scheme by Jalview software. Jnetpred is used for secondary structure prediction (red ribbon for α-helix and green ribbon for ß-sheet) and the prediction confidence is estimated using JNETCONF, with higher values for higher confidence.

mmc3.pdf (15MB, pdf)
Supplementary Figure S4

Illustration of gap-filling gene loci in IC4R-2.0. A. The gene locus IC4R-OSJ08G082200 is identified in both IC4R-2.0 and RAP-DB but not in MSU-7.0. Compared to the same locus in RAP-DB, IC4R-OSJ08G082200 is extended to be a more complete gene model by adding both 5'-UTR and 3'-UTR. B. The gene locus IC4R-OSJ06G014600 is identified in both IC4R-2.0 and MSU-7.0 but not in RAP-DB.

mmc4.pdf (187.8KB, pdf)
Supplementary Figure S5

Multiple protein sequence alignments of gap-filling gene loci in IC4R-2.0 and their homologs in other Oryza species. A. The protein sequence of IC4R-OSJ08G082200 was aligned with its homologs from Oryza sativa indica, Oryza meridionalis, Oryza nivara, and Oryza rufipogon by ClustalX (v2.1). B. The protein sequence of IC4R-OSJ06G014600 was aligned with its homologs from Oryza sativa indica, Oryza glumaepatula, Oryza nivara, and Oryza rufipogon by ClustalX (v2.1). Sequence alignment color code was determined according to the ClustalX color scheme by Jalview software. Jnetpred is used for secondary structure prediction (red ribbon for α-helix and green ribbon for ß-sheet) and the prediction confidence is estimated using JNETCONF, with higher values for higher confidence.

mmc5.pdf (4.4MB, pdf)
Supplementary Figure S6

Comparison of genome annotation completeness of IC4R-2.0, MSU-7.0, and RAP-DB based on BUSCO analysis. BUSCO plot presents the relative proportion of missing (red), fragmented (yellow), complete and duplicated (dark blue), and complete and single copy (light blue) BUSCO genes identified for IC4R-2.0, MSU-7.0, and RAP-DB, respectively. BUSCO, Benchmarking Universal Single-Copy Orthologs.

mmc6.pdf (119KB, pdf)
Supplementary Figure S7

Comparison of genome annotation quality of IC4R-2.0, MSU-7.0, and RAP-DB based on AED analysis. Cumulative fraction of AED is used to evaluate the quality of genome annotation according to the nucleotide/protein evidence. Lower AED scores indicate that the gene models are better annotated with the underlying evidence. AED support for gene models in IC4R-2.0 (red line) is improved over MSU-7.0 (grey line) and RAP-DB (blue line). AED, Annotation Edit Distance.

mmc7.pdf (1.3MB, pdf)
Supplementary Figure S8

Screenshots of lncRNA page in IC4R

mmc8.pdf (1.2MB, pdf)
Supplementary Figure S9

Screenshots of circRNA page in IC4R

mmc9.pdf (92.8KB, pdf)
Supplementary Table S1
mmc10.docx (15.4KB, docx)

Data Availability Statement

To facilitate access to the new annotation system, IC4R provides a series of flat files for public download (http://ic4r.org/download), including gene structural annotations (GFF format), nucleotide and protein sequences (FASTA format), ID correspondence among IC4R-2.0, MSU-7.0, and RAP-DB (CSV format), predicted CpG islands (TSV format), and exon–exon junction information (BED format). In addition, an open application programming interface (API; http://ic4r.org/api) is provided for automated retrieval of these data.
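
As a rough illustration of how a downloaded GFF annotation file might be inspected, the sketch below parses GFF3-style lines (nine tab-separated columns) and counts features by type. The sample records, the identifiers in the attributes column, and the function names are hypothetical and were made up for this example; they are not taken from the IC4R download files, whose exact layout may differ. The API at http://ic4r.org/api is not modeled here.

```python
import io
from collections import Counter

# Hypothetical GFF3-style records, invented for illustration only;
# the real IC4R-2.0 files may use different sources, IDs, and attributes.
SAMPLE_GFF = """\
##gff-version 3
chr08\tIC4R\tgene\t1000\t5000\t.\t+\t.\tID=gene0001
chr08\tIC4R\tmRNA\t1000\t5000\t.\t+\t.\tID=gene0001.1;Parent=gene0001
chr08\tIC4R\texon\t1000\t1500\t.\t+\t.\tParent=gene0001.1
chr08\tIC4R\texon\t2000\t5000\t.\t+\t.\tParent=gene0001.1
"""


def parse_attributes(field):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def parse_gff(handle):
    """Yield one dict per feature line, skipping comments and malformed rows."""
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a standard nine-column GFF record
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        yield {
            "seqid": seqid,
            "type": ftype,
            "start": int(start),
            "end": int(end),
            "strand": strand,
            "attributes": parse_attributes(attrs),
        }


def count_feature_types(handle):
    """Count how many records of each feature type the file contains."""
    return Counter(record["type"] for record in parse_gff(handle))


if __name__ == "__main__":
    counts = count_feature_types(io.StringIO(SAMPLE_GFF))
    for ftype, n in sorted(counts.items()):
        print(ftype, n)
```

For a real file, the `io.StringIO` object would be replaced by an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to the downloaded GFF file.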


Articles from Genomics, Proteomics & Bioinformatics are provided here courtesy of Oxford University Press
