A database of annotated tentative orthologs from crop abiotic stress transcripts

Jayashree Balaji; Jonathan H Crouch; Prasad VNS Petite; David A Hoisington

. 2006 Oct 7;1(6):225–227.

A database of annotated tentative orthologs from crop abiotic stress transcripts

Jayashree Balaji ^1,^*, Jonathan H Crouch ², Prasad VNS Petite ¹, David A Hoisington ³

PMCID: PMC1891691 PMID: 17597893

Abstract

A minimal requirement to initiate a comparative genomics study on plant responses to abiotic stresses is a dataset of orthologous sequences. The availability of a large amount of sequence information, including those derived from stress cDNA libraries allow for the identification of stress related genes and orthologs associated with the stress response. Orthologous sequences serve as tools to explore genes and their relationships across species. For this purpose, ESTs from stress cDNA libraries across 16 crop species including 6 important cereal crops and 10 dicots were systematically collated and subjected to bioinformatics analysis such as clustering, grouping of tentative orthologous sets, identification of protein motifs/patterns in the predicted protein sequence, and annotation with stress conditions, tissue/library source and putative function. All data are available to the scientific community at http://intranet.icrisat.org/gt1/tog/homepage.htm. We believe that the availability of annotated plant abiotic stress ortholog sets will be a valuable resource for researchers studying the biology of environmental stresses in plant systems, molecular evolution and genomics.

Keywords: database, orthologs, comparative genomics, abiotic stress transcripts

Background

Integrated approaches to the study of abiotic stress response in plants are important especially since drought and salinity stress are primary reasons for crop losses worldwide. The study of stress response pathways includes analysis of information from stress related metabolic and physiological changes, comparative genomics, gene expression studies and structural, and functional data of stress proteins. Plants have stress specific adaptive responses as well as responses which protect the plants from more than one environmental stress. Multiple stress perception and signaling pathways exist - some specific; others may cross talk at various steps. [1,2] Identification of genes related to stress is an important aspect in the study of plant response to abiotic stress. A minimal requirement to initiate a comparative genomics study across abiotic stress conditions is a dataset of orthologs. The availability of a large amount of sequence information, especially that derived from cDNA libraries in response to abiotic stress allows for the generation of a putative list of candidate genes using the orthologs approach. Orthologs are genes in different species that have evolved from a common ancestral gene by speciation and generally retain an equivalent or similar function in the course of evolution.

A high degree of sequence conservation across species and the availability of partial gene sequence data led to the development of comprehensive orthologous gene alignment such as the TOGA (tentative orthologous gene alignments from EST datasets) [3] and the COG (clusters of orthologous groups of proteins) databases. [4,5] The TOGA database currently contains 25 plant species while fewer plant species are represented in the COG database. We report here the generation and availability of tentative orthologous annotated datasets for 16 economically important crop species that are vulnerable to the abiotic stresses of heat, dehydration, cold and salt; and for which ESTs generated from stress cDNA libraries are available in the public domain. The aim in building the dataset is to provide users with a catalogue of annotated sequences associated with abiotic stress, identify elements common to all conditions from those that differ, identify categories of functions that are affected under stress conditions and provide users with a list of genes that have the highest representation across tentative orthologous sets.

Methodology

Dataset

Sequences derived from cDNA libraries generated from tissues subject to heat, dehydration, salt and cold stress from sixteen crop species were used to construct the database. The sequences were downloaded from TIGR [6], NCBI [7] in 2003 and updated in June 2005.

Bioinformatics analysis

The sequences were assembled into contigs and singletons crop-wise using a parallelized version of cap3 [8 ] on a paracel HPC. To construct tentative ortholog sets, each species-specific dataset consisting of contigs and singletons was Blast searched against every other dataset using Blastn (standalone BLAST version 2.2.6). If a reciprocal best-hit (RBH) relationship between these sequences was revealed, then the reciprocal best hits formed a tentative ortholog set. An additional constraint was that each set must comprise sequences from at least three crop species. Scripts were written in Visual Basic to search and assemble tentative ortholog sets after the Blast searches. Sequences were searched for microsatellite markers using the tool SSRIT. [9] Sequences in each dataset were translated and searched for protein motifs/patterns against the Prosite database of protein families and domains. All datasets were searched against the species specific plant repeats database [10] and hits with an e-value < 1e-5 and an alignment of over 30% of length of query sequence were annotated as repeats. Tentative functional descriptions for the remainder of the sequences were retrieved from each of the databases. These annotations were classified under the 28 functional categories described in the MIPS Functional catalogue Funcat. [ 11] Scripts written in Java were used to carry out this classification. Multiple sequence alignments have been built using ClustalW (version 1.83).

Database and GUI

The data is housed in a relational database on the MSSQL server 2000. The database GUI has been developed using Active Server Pages (ASP).

Utility

The database provides a collection of annotated tentative orthologous sequences from sixteen crop species (Table 1) across four abiotic stress conditions (Table 2). The suite of user interfaces (Figure 1) allow the user to browse the database and query for: (a) annotated transcripts that are expressed across stress conditions, (b) transcripts with microsatellites that could be used as conserved functional markers, (c) conserved hypothetical genes that have orthologs in many other species but for which no function has been determined, and (d) ortholog sets with sequence alignment based on annotation, stress conditions or cluster size. The availability of this dataset is a useful resource for researchers studying the biology and genomics of stress response in plants and in the molecular evolution of genes involved in the stress response.

Table 1. Coverage of monocot and dicot stress related sequences.

Species	Number of stress libraries	ESTs	Number of clusters (singletons + contigs)	ESTs in orthologous sets	Clusters in orthologous sets
Wheat	28	20130	11037	8394	2806
Maize	19	21439	10194	9292	3032
Rice	10	13784	8128	4890	1939
Barley	8	12414	7315	5976	2403
Sorghum	5	37590	13815	16828	3321
Pearl millet	3	1945	1443	824	464
Rye	2	1351	945	938	594
Arabidopsis	37	18637	10362	3675	984
Common bean	11	412	206	259	97
Tomato	6	901	637	419	275
Soybean	4	18236	10363	5103	1571
Cowpea	3	38	37	14	14
Groundnut	2	860	679	356	266
Potato	2	17	12	7	4
Chickpea	1	358	56	55	19
Medicago	1	8294	5140	2444	976
Total	142	156406	80369	59474	18765

Open in a new tab

Table 2. Number of ortholog sets sharing sequences across stress conditions.

Stress Condition	Number of tentative ortholog sets
Heat + Cold	91
Heat + Dehydration	1171
Heat + Salt	69
Cold + Dehydration	6851
Cold + Salt	348
Dehydration + Salt	3304
Heat + Cold + Dehydration	2105
Heat + Dehydration + Salt	2323
Cold + Dehydration + Salt	10416
Heat + Cold + Salt	371
Heat + Cold + Salt + Dehydration	8390

Open in a new tab

Screen captures of the database GUI. (A) Home page, (B) plant species covered in the current version of the database, (C - H) query pages

Future development

We routinely update and expand the database and analyses as additional sequence data becomes available; annotate sequence data with experimental information on candidate genes; and provide users with a reliability score for the ortholog sets constructed along with an analysis of orthologs developed using alternative algorithms.

Footnotes

Citation:Balaji et al., Bioinformation 1(6): 225-227 (2006)

References

1.Chinnusamy V, et al. J Exp Bot. 2004;55:225. doi: 10.1093/jxb/erh005. [DOI] [PubMed] [Google Scholar]
2.Rabbani MA, et al. Plant Physiol. 2003;133:1755. doi: 10.1104/pp.103.025742. [DOI] [PMC free article] [PubMed] [Google Scholar]
3.Lee Y, et al. Genome Res. 2002;12:493. doi: 10.1101/gr.212002. [DOI] [PMC free article] [PubMed] [Google Scholar]
4.Tatusov RL, et al. BMC Bioinformatics. 2003;4:41. doi: 10.1186/1471-2105-4-41. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Tatusov RL, et al. Nucleic Acids Res. 2001;28:33. doi: 10.1093/nar/28.1.33. [DOI] [PMC free article] [PubMed] [Google Scholar]
6. http://www.tigr.org/tdb/tgi/plant.shtml.
7. http://www.ncbi.nlm.nih.gov/
8.Huang X, et al. Genome Res. 2003;13:2164. doi: 10.1101/gr.1390403. [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Temnykh S, et al. Genome Res. 2001;11:1441. doi: 10.1101/gr.184001. [DOI] [PMC free article] [PubMed] [Google Scholar]
10. http://www.tigr.org/tdb/e2k1/plant.repeats/
11. http://mips.gsf.de/projects/funcat.

[R01] 1.Chinnusamy V, et al. J Exp Bot. 2004;55:225. doi: 10.1093/jxb/erh005. [DOI] [PubMed] [Google Scholar]

[R02] 2.Rabbani MA, et al. Plant Physiol. 2003;133:1755. doi: 10.1104/pp.103.025742. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R03] 3.Lee Y, et al. Genome Res. 2002;12:493. doi: 10.1101/gr.212002. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R04] 4.Tatusov RL, et al. BMC Bioinformatics. 2003;4:41. doi: 10.1186/1471-2105-4-41. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R05] 5.Tatusov RL, et al. Nucleic Acids Res. 2001;28:33. doi: 10.1093/nar/28.1.33. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R06] 6. http://www.tigr.org/tdb/tgi/plant.shtml.

[R07] 7. http://www.ncbi.nlm.nih.gov/

[R08] 8.Huang X, et al. Genome Res. 2003;13:2164. doi: 10.1101/gr.1390403. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R09] 9.Temnykh S, et al. Genome Res. 2001;11:1441. doi: 10.1101/gr.184001. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10. http://www.tigr.org/tdb/e2k1/plant.repeats/

[R11] 11. http://mips.gsf.de/projects/funcat.

PERMALINK

A database of annotated tentative orthologs from crop abiotic stress transcripts

Jayashree Balaji

Jonathan H Crouch

Prasad VNS Petite

David A Hoisington

Abstract

Background

Methodology

Dataset

Bioinformatics analysis

Database and GUI

Utility

Table 1. Coverage of monocot and dicot stress related sequences.

Table 2. Number of ortholog sets sharing sequences across stress conditions.

Figure 1.

Future development

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

A database of annotated tentative orthologs from crop abiotic stress transcripts

Jayashree Balaji

Jonathan H Crouch

Prasad VNS Petite

David A Hoisington

Abstract

Background

Methodology

Dataset

Bioinformatics analysis

Database and GUI

Utility

Table 1. Coverage of monocot and dicot stress related sequences.

Table 2. Number of ortholog sets sharing sequences across stress conditions.

Figure 1.

Future development

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases