Abstract
RiceGeneThresher is a public online resource for mining genes underlying genome regions of interest or quantitative trait loci (QTL) in the rice genome. It is a compendium of rice genomic resources consisting of genetic markers, genome annotation, expressed sequence tags (ESTs), protein domains, gene ontology, plant stress-responsive genes, metabolic pathways and predicted protein–protein interactions. The RiceGeneThresher system integrates these diverse data sources and provides web-based applications and flexible tools for delivering customized sets of biological data on rice. The system supports whole-genome gene mining for QTL through queries based on DNA marker intervals or genomic loci. RiceGeneThresher provides biologically supported evidence that is essential for targeting groups or networks of genes involved in controlling traits underlying QTL. Users can use it to discover and assign the most promising candidate genes in preparation for further validation of gene function. The web-based application is freely available at http://rice.kps.ku.ac.th.
INTRODUCTION
Genetic studies of many traits in rice over the past decade have generated data suggesting that specific regions of rice chromosomes contain sites that influence the expression of quantitative traits, known as quantitative trait loci (QTL). In recent years, many quantitative traits in rice have been characterized by QTL mapping. More than 8000 QTL controlling various complex traits have been located on different chromosome regions in rice (http://www.gramene.org/qtl/index.html). The goal of QTL mapping is to identify the genes underlying polygenic traits and to gain a better understanding of their physiology and biochemistry (1). To identify such genes, the candidate gene approach has been applied in plant genetics over the past decade for the characterization and cloning of QTL. It constitutes a complementary strategy to map-based cloning for identifying the genes located within QTL intervals (2). The process of selecting candidate genes relies on a wealth of information gained through traditional genetics and molecular approaches (3). Presently, high-throughput technologies are significantly increasing the volume of publicly available biological information that can assist gene function identification. Most of these biological data are published in public-domain databases. Scientists use these databases, together with bioinformatics approaches and data integration systems, to find the most promising candidate genes and to elucidate the functions of rice genes. This study presents the newly improved RiceGeneThresher. It has been designed and developed to integrate catalogs from public-domain databases on rice that cover genetic information, genome annotation, expressed sequence tags (ESTs), protein information such as protein domains, gene ontology (GO), metabolic pathway information, predicted protein–protein interactions and stress-responsive genes.
The RiceGeneThresher system provides a generic data warehousing and integration solution for fast and flexible querying of rice biological data sets. It generates evidence from each type of omics information that is essential for analyzing and targeting groups or networks of loci involved in controlling gene expression for the specific traits underlying QTL regions. Users, ranging from breeders and laboratory researchers to experienced molecular biologists, can use it in a wide variety of applications and scenarios to find the most promising candidate genes for improving rice cultivars. Among relevant existing databases, GRAMENE (4) is a large resource for major crop genomes, but it does not provide an easy way to mine the biological data underlying QTL. RiceGeneThresher, by contrast, was specifically designed to make QTL mining intuitive and easy. GRAMENE contains various biological data, which are displayed across diverse software such as Ensembl Genome Browser, CMap, BioMart and BioCyc. RiceGeneThresher instead consolidates both the diverse biological data and third-party software into one system, which makes it convenient for users to discover and display candidate genes.
DATA SOURCES AND ANALYSIS
Rice genome and gene annotations were collected from Michigan State University (http://rice.plantbiology.msu.edu/data_download.shtml). Rice ESTs were retrieved from the National Center for Biotechnology Information (NCBI) EST database (http://www.ncbi.nlm.nih.gov/projects/dbEST/). Contaminating DNA sequences were trimmed from the ESTs with SeqClean (http://compbio.dfci.harvard.edu/tgi/software/), and the ESTs were aligned to the TIGR pseudomolecules using BLAT (5) with cut-offs of percent identity ≥95%, E-value ≤1e-10 and similarity score ≥200. Rice DNA markers were collected from the GRAMENE database. They were aligned to the TIGR pseudomolecules by BLAT for markers with DNA sequences and by electronic polymerase chain reaction (e-PCR) (6) for markers with DNA primer sequences. All rice protein domains were analyzed by InterProScan (7). GO terms were downloaded from http://www.geneontology.org/GO.downloads.shtml and each rice protein was assigned GO terms by InterPro2GO. KEGG metabolic pathway information for rice genes was downloaded from ftp://ftp.genome.jp/pub/kegg/genes/organisms/osa/. Rice proteins from KEGG were mapped to rice proteins from TIGR by using BLASTp (8), with the best BLAST hit as the cut-off criterion, in order to assign metabolic pathways to the TIGR proteins. To predict protein–protein interactions across the whole rice proteome, evidence of protein–protein interactions in Arabidopsis thaliana was obtained from The Arabidopsis Information Resource (TAIR) database (9) (http://www.arabidopsis.org). The Interolog Mapping method (10) and the phylogenomic approach (11) were combined to infer protein–protein interaction evidence in rice from Arabidopsis. In addition, through a connection with the Generation Challenge Programme (GCP), the GCP Comparative Plant Stress-responsive Gene Catalogue (12) has been incorporated into the RiceGeneThresher database for identifying candidate stress-responsive genes underlying QTL.
The GCP Comparative Plant Stress-responsive Gene Catalogue is an online resource documenting stress-responsive genes comparatively across plant species.
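The core idea of interolog mapping can be illustrated with a minimal sketch: if two Arabidopsis proteins interact, and each has an ortholog in rice, the rice pair is predicted to interact. The sketch below assumes the ortholog assignments are already available as a lookup table; it does not reproduce the phylogenomic ortholog inference used in this study, and all identifiers are hypothetical placeholders.

```python
from itertools import product


def infer_interologs(interactions, orthologs):
    """Transfer interactions from a source species to a target species.

    interactions: iterable of (protein_a, protein_b) pairs in the source species.
    orthologs: dict mapping a source protein to a set of target-species orthologs.
    Returns a set of sorted (target_a, target_b) tuples.
    """
    predicted = set()
    for a, b in interactions:
        # Every combination of orthologs of a and b is a predicted interaction.
        for ta, tb in product(orthologs.get(a, ()), orthologs.get(b, ())):
            predicted.add(tuple(sorted((ta, tb))))
    return predicted


# Hypothetical identifiers, for illustration only.
arabidopsis_ppi = [("AT_A", "AT_B"), ("AT_B", "AT_C")]
rice_orthologs = {
    "AT_A": {"OS_A1", "OS_A2"},
    "AT_B": {"OS_B1"},
    # AT_C has no rice ortholog, so AT_B-AT_C cannot be transferred.
}

predicted_rice_ppi = infer_interologs(arabidopsis_ppi, rice_orthologs)
for pair in sorted(predicted_rice_ppi):
    print(pair)
```

Storing each predicted pair as a sorted tuple treats interactions as undirected, so the same pair is never counted twice.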
RICEGENETHRESHER DATABASE AND IMPLEMENTATION
RiceGeneThresher is a MySQL database based mainly on the Chado schema of the Generic Model Organism Database project (http://www.gmod.org) (13), with local enhancements where necessary. It was designed and developed to store biological data such as genome sequences, genome annotation, protein sequences, protein domains and metabolic pathways. RiceGeneThresher requires access to many sources of biological data, many of which are distributed across the internet. To supply the RiceGeneThresher system with data, the RiceGeneThresherDataSource API was developed as middleware for querying data and transferring it to the front-end. RiceGeneThresherDataSource incorporates not only the static data from the RiceGeneThresher database (MySQL) but also dynamic data from across the internet, obtained through GCP-compliant BioMOBY (14). The front-end is a web-based interface built with Java-based software technology running on top of the Apache Tomcat web server. The RiceGeneThresher web interface was implemented with Asynchronous JavaScript and XML (AJAX) technology, using DWR (http://getahead.org/dwr/), Velocity (http://velocity.apache.org/) and EXT2.0 (http://extjs.com/). Third-party bioinformatics software such as Cytoscape (15), ATV (16), Jalview (17) and BLAST, an NCBI search tool, was combined with RiceGeneThresher for analyzing and displaying the query results. In addition, European Bioinformatics Institute (EBI) web services (18), for instance WUBLAST, NCBIBLAST and InterProScan, are incorporated into the RiceGeneThresher web-based software for on-the-fly similarity searching of DNA or protein sequences.
USER INTERFACE
Presently, there are two main options for using RiceGeneThresher: (i) mining a genome region of interest by submitting DNA marker names or DNA sequences and (ii) querying genes by gene locus names or gene annotation keywords.
Mining a genome region
Users can select the genome region of interest by submitting DNA marker names or DNA sequences (Figure 1A and B). After submission, RiceGeneThresher displays the positions of those DNA markers or DNA sequences on the rice physical map and automatically selects the widest flanking positions on the physical map (Figure 1C).
Figure 1.
The figure shows the RiceGeneThresher main menu (A), a query using flanking DNA marker names (B), the genome region of interest for mining rice biological data (C), an example of genome feature information (D) and the information for a particular rice gene (E).
To override the automatic selection, users can specify the genome region of interest by selecting the flanking DNA marker positions or DNA sequence positions. RiceGeneThresher then explores the various kinds of rice biological information found in that genome region. The main features are the genome, transcriptome, proteome, metabolome, phylogenomic and interactome features.
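The automatic selection of the widest flanking positions can be sketched as taking the smallest start and the largest end among all mapped markers or sequences. This is a minimal illustration rather than the system's actual implementation: the tuple layout, the marker names and the coordinates are hypothetical, and handling each chromosome separately is an assumption.

```python
def widest_flanking_region(hits):
    """Return the widest interval covered by mapped markers, per chromosome.

    hits: iterable of (name, chromosome, start, end) physical map positions in bp.
    Returns a dict mapping chromosome -> (leftmost_start, rightmost_end).
    """
    regions = {}
    for _name, chrom, start, end in hits:
        # Tolerate hits reported on the reverse strand (start > end).
        lo, hi = min(start, end), max(start, end)
        if chrom in regions:
            cur_lo, cur_hi = regions[chrom]
            regions[chrom] = (min(cur_lo, lo), max(cur_hi, hi))
        else:
            regions[chrom] = (lo, hi)
    return regions


# Hypothetical marker positions, for illustration only.
marker_hits = [
    ("marker_1", "Chr1", 1_200_000, 1_200_250),
    ("marker_2", "Chr1", 3_450_500, 3_450_300),  # reverse-strand hit
    ("marker_3", "Chr1", 2_000_000, 2_000_180),
]

regions = widest_flanking_region(marker_hits)
print(regions)
```

A user-specified override would then simply replace the computed interval with the positions of the two chosen flanking markers.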
Genome feature
The genome feature (Figure 1D) describes genome structure information. For the selected genome region, it displays, for example, the total number of contigs in the tiling path, the total size of the DNA sequences, the total numbers of transposable element (TE)-related and non-TE-related genes, the gene density, the average gene length and the total number of DNA markers. It also categorizes genes by annotation type: hypothetical genes, genes of known function, genes of putative function, genes annotated as similar to genes of known function, expressed genes and TE-related genes.
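Several of these summary statistics follow directly from the gene coordinates in the region. The sketch below shows one way to compute them; the `Gene` record, the locus names and the choice to report density as genes per 100 kb are assumptions made for illustration, not details taken from the RiceGeneThresher implementation.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    locus: str        # hypothetical locus identifier
    start: int        # 1-based inclusive start, bp
    end: int          # 1-based inclusive end, bp
    te_related: bool  # True for transposable element-related genes


def summarize_region(genes, region_start, region_end):
    """Compute basic genome-structure statistics for a region."""
    region_len = region_end - region_start + 1
    n = len(genes)
    n_te = sum(g.te_related for g in genes)
    total_gene_len = sum(g.end - g.start + 1 for g in genes)
    return {
        "region_size_bp": region_len,
        "total_genes": n,
        "te_related_genes": n_te,
        "non_te_genes": n - n_te,
        # Density unit (genes per 100 kb) is an assumption.
        "genes_per_100kb": n * 100_000 / region_len,
        "average_gene_length_bp": total_gene_len / n if n else 0.0,
    }


# Hypothetical genes in a 200 kb region, for illustration only.
genes = [
    Gene("gene_1", 1_001, 3_000, False),
    Gene("gene_2", 50_001, 54_000, False),
    Gene("gene_3", 120_001, 123_000, True),
]

summary = summarize_region(genes, 1, 200_000)
print(summary)
```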
Transcriptome feature
The transcriptome feature shows the collection of EST libraries whose ESTs match genes in the genome region of interest with high similarity scores, significant E-values and high percentage identities. EST libraries are grouped by EST experiment name and by EST tissue type. Users can select candidate stress-responsive genes, for example drought stress-responsive genes, by choosing genes represented in EST libraries with names such as ‘IRRI Drought Stress’.
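As a rough illustration of this kind of selection, the sketch below filters EST alignments with the cut-offs given under Data Sources and Analysis (identity ≥95%, E-value ≤1e-10, score ≥200) and groups the matched genes by EST library name. The record fields, gene names and the 'Leaf library' entry are hypothetical; only the thresholds and the 'IRRI Drought Stress' library name come from the text.

```python
from collections import defaultdict

# Cut-offs stated in the Data Sources and Analysis section.
MIN_IDENTITY = 95.0  # percent identity
MAX_EVALUE = 1e-10
MIN_SCORE = 200


def passes_cutoffs(hit):
    """Return True if an EST alignment meets all three cut-offs."""
    return (
        hit["identity"] >= MIN_IDENTITY
        and hit["evalue"] <= MAX_EVALUE
        and hit["score"] >= MIN_SCORE
    )


def genes_by_library(hits):
    """Group genes supported by passing EST alignments by EST library name."""
    grouped = defaultdict(set)
    for hit in hits:
        if passes_cutoffs(hit):
            grouped[hit["library"]].add(hit["gene"])
    return dict(grouped)


# Hypothetical alignment records, for illustration only.
est_hits = [
    {"gene": "gene_1", "library": "IRRI Drought Stress",
     "identity": 98.5, "evalue": 1e-50, "score": 450},
    {"gene": "gene_2", "library": "IRRI Drought Stress",
     "identity": 92.0, "evalue": 1e-30, "score": 300},  # identity too low
    {"gene": "gene_3", "library": "Leaf library",
     "identity": 99.0, "evalue": 1e-80, "score": 600},
    {"gene": "gene_1", "library": "Leaf library",
     "identity": 97.0, "evalue": 1e-12, "score": 180},  # score too low
]

libraries = genes_by_library(est_hits)
drought_candidates = libraries.get("IRRI Drought Stress", set())
print(sorted(drought_candidates))
```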
Proteome feature
The proteome feature displays the GO terms of each gene product, grouped into molecular function, biological process and cellular component, and it also displays the protein domains identified by InterProScan.
Metabolome feature
The metabolome feature displays the mapping of rice genes found in the selected genome region onto KEGG metabolic pathways. This feature presents three kinds of pathway information. The first is the number of protein-coding genes in the genome region of interest that are predicted to encode enzymes. The second is the number of predicted metabolic pathways found in that genome region. The third lists, for each metabolic pathway, the protein-coding genes involved in that pathway.
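These three summaries can all be derived from a single gene-to-pathway mapping, as the minimal sketch below shows. It assumes that a gene counts as a predicted enzyme whenever it has at least one pathway assignment, which is a simplification; the gene and pathway names are hypothetical.

```python
def summarize_pathways(gene_to_pathways):
    """Summarize pathway assignments for genes in a region.

    gene_to_pathways: dict mapping gene -> set of pathway names
                      (empty set when the gene has no assignment).
    Returns (number of enzyme genes, number of pathways, pathway -> genes).
    """
    pathway_to_genes = {}
    for gene, pathways in gene_to_pathways.items():
        for pathway in pathways:
            pathway_to_genes.setdefault(pathway, set()).add(gene)
    # Assumption: a gene with any pathway assignment is a predicted enzyme.
    enzyme_genes = {g for g, p in gene_to_pathways.items() if p}
    return len(enzyme_genes), len(pathway_to_genes), pathway_to_genes


# Hypothetical assignments, for illustration only.
region_genes = {
    "gene_1": {"pathway_X"},
    "gene_2": {"pathway_X", "pathway_Y"},
    "gene_3": set(),  # no pathway assignment
}

n_enzymes, n_pathways, pathway_to_genes = summarize_pathways(region_genes)
print(n_enzymes, n_pathways)
for pathway in sorted(pathway_to_genes):
    print(pathway, sorted(pathway_to_genes[pathway]))
```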
Phylogenomic feature
The phylogenomic feature shows data curated by the GCP Comparative Plant Stress-responsive Gene Catalogue for genes located in the genome region of interest. Users can easily select candidate stress-responsive genes with this feature, which also provides further information about protein families, phylogenetic trees, multiple sequence alignments (MSA) and associated experimental evidence.
Protein–protein interaction feature
The protein–protein interaction feature shows the protein–protein interactions predicted for genes in the genome region of interest. RiceGeneThresher displays the predicted interactions in a table. In addition, third-party software such as Cytoscape is incorporated to display all protein–protein interactions graphically, using Java Web Start technology.
Querying genes
Users can query genes in the rice genome by using gene annotation keywords, for example ‘protein kinase’, or a gene locus id. When users select a gene of interest, RiceGeneThresher displays the individual gene information in both a graphical user interface and standard web pages. Gene information (Figure 1E) consists of a picture of the gene structure at its location on the rice chromosome, coding DNA sequences, whole-gene DNA sequences, upstream DNA sequences and translated protein sequences. The results of protein analyses are also included: protein domains, protein–protein interaction predictions, protein similarity search results obtained through EBI's web services, and protein phylogenomics obtained by querying the GreenPhyl database (http://greenphyl.cines.fr/cgi-bin/greenphyl.cgi).
FUTURE DIRECTIONS
Currently, the authors of this study are generating expression data from many experiments using the rice Affymetrix GeneChip (http://www.affymetrix.com), together with microarray data from public databases and the literature. The authors also plan to expand the RiceGeneThresher system to support those microarray data for finding candidate genes underlying a genome region of interest. As draft whole-genome sequences generated by new-generation sequencing technologies become more available, the authors plan to develop new features in RiceGeneThresher to integrate these short shotgun sequences into its gene mining platform. There are also plans to update the rice genome database to the TIGR rice genome release 6.0.
FUNDING
GCP Fellowship Award (to RiceGeneThresher project); the National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand (to RiceGeneThresher project). Funding for open access charge: BIOTEC.
Conflict of interest statement. None declared.
REFERENCES
- 1. Korstanje R, Paigen B. From QTL to gene: the harvest begins. Nat. Genet. 2002;31:235–236. doi: 10.1038/ng0702-235.
- 2. Pflieger S, Lefebvre V, Causse M. The candidate gene approach in plant genetics: a review. Mol. Breeding. 2001;7:275–291.
- 3. Borevitz JO, Chory J. Genomics tools for QTL analysis and gene discovery. Curr. Opin. Plant Biol. 2004;7:132–136. doi: 10.1016/j.pbi.2004.01.011.
- 4. Liang C, Jaiswal P, Hebbard C, Avraham S, Buckler ES, Casstevens T, Hurwitz B, McCouch S, Ni J, Pujar A, et al. Gramene: a growing plant comparative genomics resource. Nucleic Acids Res. 2008;36:D947–D953. doi: 10.1093/nar/gkm968.
- 5. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202.
- 6. Schuler GD. Sequence mapping by electronic PCR. Genome Res. 1997;7:541–550. doi: 10.1101/gr.7.5.541.
- 7. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L, et al. InterPro, progress and status in 2005. Nucleic Acids Res. 2005;33:D201–D205. doi: 10.1093/nar/gki106.
- 8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- 9. Garcia-Hernandez M, Berardini TZ, Chen G, Crist D, Doyle A, Huala E, Knee E, Lambrecht M, Miller N, Mueller LA, et al. TAIR: a resource for integrated Arabidopsis data. Funct. Integr. Genomics. 2002;2:239–253. doi: 10.1007/s10142-002-0077-z.
- 10. Yu H, Luscombe NM, Lu HX, Zhu X, Xia Y, Han J.-DJ, Bertin N, Chung S, Vidal M, Gerstein M. Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res. 2004;14:1107–1118. doi: 10.1101/gr.1774904.
- 11. Zmasek C, Eddy S. RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics. 2002;3:14. doi: 10.1186/1471-2105-3-14.
- 12. Wanchana S, Thongjuea S, Ulat VJ, Anacleto M, Mauleon R, Conte M, Rouard M, Ruiz M, Krishnamurthy N, Sjolander K, et al. The generation challenge programme comparative plant stress-responsive gene catalogue. Nucleic Acids Res. 2008;36:D943–D946. doi: 10.1093/nar/gkm798.
- 13. Mungall CJ, Emmert DB. A Chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics. 2007;23:i337–i346. doi: 10.1093/bioinformatics/btm189.
- 14. Wilkinson M, Schoof H, Ernst R, Haase D. BioMOBY successfully integrates distributed heterogeneous bioinformatics web services. The PlaNet exemplar case. Plant Physiol. 2005;138:5–17. doi: 10.1104/pp.104.059170.
- 15. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–2504. doi: 10.1101/gr.1239303.
- 16. Zmasek CM, Eddy SR. ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics. 2001;17:383–384. doi: 10.1093/bioinformatics/17.4.383.
- 17. Clamp M, Cuff J, Searle SM, Barton GJ. The Jalview Java alignment editor. Bioinformatics. 2004;20:426–427. doi: 10.1093/bioinformatics/btg430.
- 18. Labarga A, Valentin F, Anderson M, Lopez R. Web services at the European Bioinformatics Institute. Nucleic Acids Res. 2007;35:W6–W11. doi: 10.1093/nar/gkm291.

