Plant Physiology. 2004 Mar;134(3):951–959. doi: 10.1104/pp.103.033878

Evaluation of Monocot and Eudicot Divergence Using the Sugarcane Transcriptome1,[w]

Michel Vincentz 1, Frank AA Cara 1, Vagner K Okura 1, Felipe R da Silva 1, Guilherme L Pedrosa 1, Adriana S Hemerly 1, Adriana N Capella 1, Mozart Marins 1, Paulo C Ferreira 1, Suzelei C França 1, Laurent Grivet 1, Andre L Vettore 1, Edson L Kemper 1, Willian L Burnquist 1, Maria LP Targon 1, Walter J Siqueira 1, Eiko E Kuramae 1, Celso L Marino 1, Luis EA Camargo 1, Helaine Carrer 1, Luis L Coutinho 1, Luiz R Furlan 1, Manoel VF Lemos 1, Luiz R Nunes 1, Suely L Gomes 1, Roberto V Santelli 1,2, Maria H Goldman 1, Maurício Bacci Jr 1, Eder A Giglioti 1, Otávio H Thiemann 1, Flávio H Silva 1, Marie-Anne Van Sluys 1, Francisco G Nobrega 1, Paulo Arruda 1, Carlos FM Menck 1,*
PMCID: PMC389918  PMID: 15020759

Abstract

Over 40,000 sugarcane (Saccharum officinarum) consensus sequences assembled from 237,954 expressed sequence tags were compared with the protein and DNA sequences from other angiosperms, including the genomes of Arabidopsis and rice (Oryza sativa). Approximately two-thirds of the sugarcane transcriptome has similar sequences in Arabidopsis. These sequences may represent a core set of proteins or protein domains that are conserved among monocots and eudicots and probably encode essential angiosperm functions. The remaining sequences represent putative monocot-specific genetic material, one-half of which was found only in sugarcane. These monocot-specific cDNAs represent either novelties or, in many cases, fast-evolving sequences that diverged substantially from their eudicot homologs. The wide comparative genome analysis presented here provides information on the evolutionary changes that underlie the divergence of monocots and eudicots. Our comparative analysis also led to the identification of several not yet annotated putative genes and possible gene loss events in Arabidopsis.


Flowering plants (angiosperms) originated approximately 200 million years ago (MYA; Wolfe et al., 1989; Wikstrom et al., 2001) and subsequently diverged into several lineages that further diversified to form the approximately 250,000 angiosperm species known today (Wikstrom et al., 2001). A wide range of morphological, developmental, and metabolic diversity has allowed angiosperms to adapt to contrasting environmental conditions. This variability also contributed to their domestication. Eudicot species of the families Fabaceae (lentil [Lens culinaris], soybean [Glycine max], pea [Pisum sativum], and common bean [Phaseolus vulgaris]), Solanaceae (tomato [Lycopersicon esculentum], potato [Solanum tuberosum], and eggplant [Solanum melongena]), Brassicaceae (oil-seed rape [Brassica napus], cabbage [Brassica capitata], and broccoli [Brassica oleracea]), and Euphorbiaceae (cassava [Manihot esculenta Crantz.]) are just a few examples of plants that contribute significantly to the human (Homo sapiens) diet. The Solanaceae and Brassicaceae diverged around 112 to 156 MYA, early in the radiation of eudicot plants (Yang et al., 1999). Within the monocotyledons, the grass family (Poaceae), which arose about 60 MYA (Kellog, 2001), includes important staple crops such as rice (Oryza sativa), maize (Zea mays), wheat, sugarcane (Saccharum officinarum), barley (Hordeum vulgare), and sorghum (Sorghum bicolor).

Comparative genomics provides a starting point for understanding the genetic basis of the biological diversity among plant species. In this regard, the genome sequences of two model plants, the eudicotyledon Arabidopsis (family Brassicaceae; Arabidopsis Genome Initiative, 2000) and the monocotyledon rice (family Poaceae; Feng et al., 2002; Goff et al., 2002; Sasaki et al., 2002; Yu et al., 2002) are expected to have a profound impact on plant genetics and our understanding of plant evolution (Somerville and Somerville, 1999; Poethig, 2001). Comparative genome analysis has already shown that a major source of functional innovation is related to gene family amplification by genome-wide, local chromosome and tandem gene duplication (Lynch and Conery, 2000; Vision et al., 2000; Copley et al., 2001; Poethig, 2001; Sankoff, 2001; Kondrashov et al., 2002). In addition, new combinations of functional protein domains created by exon shuffling (Henikoff et al., 1997; Rubin et al., 2000), the invention of new protein domains by rapid sequence divergence from preexisting motifs (Koonin et al., 2000; Rubin et al., 2000), gene loss (Aravind et al., 2000; Braun et al., 2000; Ku et al., 2000; Allen, 2002), and horizontal gene transfer (Lander et al., 2001; Salzberg et al., 2001; Stanhope et al., 2001) have been implicated in the generation of the genetic diversity among eukaryotes.

One approach to obtain information about genome diversity among angiosperms is through comparative analysis of the available angiosperm sequences, which include the Arabidopsis and rice genomes and the large number of expressed sequence tags (ESTs) that have been produced from several monocots and eudicots. Within this framework, we have generated approximately 240,000 ESTs from sugarcane, an economically important monocot crop belonging to the grass family (Grivet and Arruda, 2002). These ESTs were assembled into 42,982 contigs and single sequences, hereafter referred to as sugarcane assembled sequences (SASs), which may represent unique sugarcane transcripts (Telles and da Silva, 2001). The SASs were analyzed in a bioinformatic pipeline focused on the comparison of the sugarcane transcriptome with eudicot and monocot sequences. The rationale for this approach was to thoroughly search the sugarcane sequences against the reference genome of Arabidopsis and all eudicot ESTs. This strategy identified approximately 30% of the SASs as possibly monocot-specific sequences.

RESULTS

A total of 237,954 sugarcane ESTs were assembled into 42,982 SASs, which were estimated to represent over 30,000 unique genes (Vettore et al., 2003). These SASs were compared with DNA and protein sequences from a set of angiosperms according to the pipeline shown in Figure 1, using BLAST tools (cutoff E value of e-5). Because the SAS consensus sequences did not necessarily represent full-length cDNAs, the estimated sequence conservation may have been limited to protein portions that may represent functional domains.

Figure 1.

Scheme for the comparative analysis of the sugarcane transcriptome with other angiosperms. Sugarcane consensus sequences were initially compared with Arabidopsis sequences. Those that did not match were successively compared with EST sequences from other eudicots or monocots. A final comparison with the rice genome sequence indicated those that were related to monocots or specific to sugarcane.

The first step of the analysis was to estimate the set of SASs that was similar to sequences of the model eudicot organism Arabidopsis. This step revealed that 70.5% of the SASs matched the Arabidopsis sequences (denoted “Arabidopsis matches” in Table I and Fig. 1). Interestingly, 99.1% of the SASs belonging to the Arabidopsis matches class also had a significant match with the rice genome (Table I), so that these SASs may define a core set of conserved eudicot-monocot sequences. The remaining 29.5% of the SASs were further compared with the genome sequence of rice and with ESTs from other eudicots and monocots (Fig. 1). Approximately 2.0% of the SASs matched sequences from other eudicots (“other eudicots” in Table I and Fig. 1) and may represent genes that have been lost in the Arabidopsis genome. Fourteen percent of the SASs matched rice or other monocots, but not eudicot sequences, and, thus, were assumed to be monocot-specific genes (“monocot class” in Table I and Fig. 1). The remaining SASs (13.5%, “no matches” in Table I and Fig. 1) were only found in the sugarcane transcriptome (see also Supplemental Table I).
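The successive filtering described above can be sketched in code. This is a minimal illustration, not the authors' pipeline: the function name `classify_sas`, the database labels, and the input layout (best E value per database, `None` when no hit was reported) are assumptions made for this example. Only the class names and the e-5 cutoff come from the text, and treating an E value exactly equal to the cutoff as a match is also an assumption.

```python
from typing import Dict, Optional

E_CUTOFF = 1e-5  # significance threshold stated in the text


def is_hit(evalue: Optional[float]) -> bool:
    """Treat a comparison as a match when an E value exists and is at or below the cutoff."""
    return evalue is not None and evalue <= E_CUTOFF


def classify_sas(best_evalues: Dict[str, Optional[float]]) -> str:
    """Assign one SAS to a class of Figure 1 / Table I.

    `best_evalues` maps hypothetical database labels ('arabidopsis',
    'eudicot_ests', 'monocot_ests', 'rice_genome') to the best E value found.
    """
    # Step 1: comparison with Arabidopsis sequences.
    if is_hit(best_evalues.get("arabidopsis")):
        return "Arabidopsis matches"
    # Step 2: SASs without an Arabidopsis match are compared with other eudicot ESTs.
    if is_hit(best_evalues.get("eudicot_ests")):
        return "Other eudicots"
    # Step 3: matches restricted to monocot ESTs or the rice genome.
    if is_hit(best_evalues.get("monocot_ests")) or is_hit(best_evalues.get("rice_genome")):
        return "Monocots"
    # Step 4: nothing matched in any comparison.
    return "No matches"


# Toy input: this SAS has a weak Arabidopsis hit but a significant rice genome hit.
print(classify_sas({"arabidopsis": 0.2, "rice_genome": 1e-20}))
```

Ordering the checks from Arabidopsis outward mirrors the description in Figure 1, so each SAS ends up in exactly one class.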

Table I.

Comparative analysis of SAS and angiosperm sequences. Classes are described in Figure 1.

Class | % of Total | % of SASs in Each Class with Match with Rice Genome (% of total)a
Arabidopsis matches | 70.5 | 99.1 (70.0)
Other eudicots | 2.0 | 86.4 (1.6)
Monocots | 14.0 | 67.0 (10.0)
No matches | 13.5 | 0.0
Total (no. of SASs) | 100.0 (42,982) | 81.6

a Nos. in parentheses represent the percentage of total SASs. See also Supplemental Table I.
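As a quick consistency check on Table I, the class shares and the parenthesized rice-matching shares can be summed. The sketch below only re-enters the printed values; the variable names are illustrative, and the 0.0 entered as the rice-matching share of the "No matches" class is implied by the table rather than printed in parentheses.

```python
# Per class: (share of all SASs in %, share of the class matching rice in %,
#             share of all SASs matching rice in %, as printed in parentheses)
table_i = {
    "Arabidopsis matches": (70.5, 99.1, 70.0),
    "Other eudicots": (2.0, 86.4, 1.6),
    "Monocots": (14.0, 67.0, 10.0),
    "No matches": (13.5, 0.0, 0.0),  # 0.0 of total is implied, not printed
}

# The class shares should add up to the whole SAS set.
total_share = round(sum(row[0] for row in table_i.values()), 1)

# The parenthesized values should add up to the overall rice match rate.
rice_total = round(sum(row[2] for row in table_i.values()), 1)

# Recompute each parenthesized value from the two rounded percentages.
recomputed = {
    name: round(share * rice / 100, 1)
    for name, (share, rice, _) in table_i.items()
}

print(total_share, rice_total)
print(recomputed)
```

The recomputed per-class values need not equal the printed ones exactly, because the table reports rounded percentages rather than SAS counts.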

Potential Gene Loss in the Arabidopsis Genome

Of the 2% of SASs not present in Arabidopsis but with significant similarity with ESTs from other eudicots (“other eudicots” in Table I and Figs. 1 and 2A), 90% matched ESTs from monocots. The simplest interpretation for this is that the genes corresponding to these conserved angiosperm sequences were lost in Arabidopsis. To evaluate this possibility, the set of SASs in the “other eudicots” group that had a significant similarity (E values lower than e-10) with proteins in GenBank was investigated further. We identified 16 SASs, which could represent 13 gene loss events in Arabidopsis (Table II, see also Supplemental Table II). Of these, three encoded proteins involved in stress- and pathogen-induced responses in plants, one was similar to a bacterial protein from the family of atrazine and melamine chlorohydrolases, and another one was homologous to the human tRNA-guanine transglycosylase. The remaining eight genes encoded proteins with unknown functions (Table II). Interestingly, three of the hypothetical proteins were similar to proteins from cyanobacteria (Synechocystis sp.). Therefore, these SASs may represent chloroplast proteins encoded by nuclear genes acquired from the ancestral cyanobacterial symbiont (Rujan and Martin, 2001). This hypothesis is supported by the presence of putative chloroplast-targeting signals at the N-terminal part of the sugarcane polypeptides (data not shown).

Figure 2.

Distribution of the main classes for the SASs after comparative analysis. The results of the comparative analysis of sugarcane with Arabidopsis (A) and sugarcane to rice (B) are shown. The percentages are relative to the total number of SASs. The different classes of the sugarcane versus Arabidopsis comparison are those described in Figure 1 and Table I.

Table II.

Putative gene loss events in Arabidopsis

No. of SASs SAS BLASTX Best Match in GenBank (Accession No./E Value) Evolutionary Conservationa
6 1 Abscisic acid- and stress-induced protein, rice (T02663/e-13) Angiosperm
1 Pathogenesis-related protein, sorghum (T14817/e-61)
1 ASR3 abscisic stress ripening protein 3, tomato (P37220/e-10)
1 Hypothetical rice (AAG13540/e-133)
1 Hypothetical rice (BAB90560/4e-75)
1 Hypothetical rice (BAB89788/3e-23)
5 3 Hypothetical proteins, Synechocystis sp. (S76951/e-23; S75952/2e-32; S75174 / e-60) Cyanobacteria/angiosperm
1 APAG protein, Escherichia coli (P05636/3e-20) Bacteria/angiosperm
2 N-ethylammeline chlorohydrolase, Bacillus halodurans (BAB04465/2e-23) Bacteria/Archaea/Angiosperm/Schizosaccharomyces pombe
1 1 Putative glycoprotein, S. pombe (CAC19762/e-28) Eukaryote
1 tRNA-guanine transglycosylase, human (AAG60033/3e-52) Bacteria/Archaea/eukaryote
a Taxa where putative protein homologs are found. See also Supplemental Table II.

Phylogenetic analyses for alignments generated for each of these 13 genes and their homologs, which were retrieved from GenBank, were consistent with known species phylogeny, thus supporting the view of gene loss in the Arabidopsis lineage (data not shown). An example of the phylogeny analysis is shown in Figure 3 for the SASs encoding a polypeptide similar to the E. coli apaG (Table II). Two groups of sequences homologous to the bacterial apaG gene were identified in some angiosperms. One group included sequences from several eudicots and monocots but not from Arabidopsis and metazoans (group A, Fig. 3). The other group (group B, Fig. 3) suggested that in the ancestral lineage of plants and metazoans, an apaG homologous sequence was recruited to a protein containing an F box to form a new protein, which has been conserved in Arabidopsis. This evolutionary pattern of apaG homologous sequences can be explained most simply by differential gene loss events.

Figure 3.

Phylogeny of bacterial apaG-related proteins. Unrooted tree inferred by the neighbor-joining analysis of the apaG motifs (73 amino acids, positions 31-103 of E. coli apaG protein, accession no. P05636). Bootstrap values for 1,000 replicates are indicated as percentages along the branches. Sequences are identified by their accession numbers. All group A polypeptides and all but the Arabidopsis group B polypeptides were deduced from EST sequences. The rice A and B polypeptides were obtained from the rice subsp. indica genomic sequence available at the National Center for Biotechnology Information BLAST server. Species abbreviations are as follows: for bacterial proteins, At, Agrobacterium tumefaciens; Bm, Brucella melitensis; Ec, E. coli; Ml, Mesorhizobium loti; Pa, Pseudomonas aeruginosa; Re, Rhizobium etli; Rs, Ralstonia solanacearum; St, Salmonella typhimurium; Vc, Vibrio cholerae; Xf, Xylella fastidiosa; and Yp, Yersinia pestis; for mammals, Hs, human; and Mm, Mus musculus; and for angiosperms, At, Arabidopsis; Gm, soybean (Glycine max); Le, tomato; Os, rice; Sb, sorghum; Ssp, Saccharum sp.; St, potato; and Zm, maize. Angio., Angiosperm. The scale bar corresponds to 0.1 estimated amino acid substitutions per site.

Comparison of the Sugarcane Transcriptome with the Rice Genome

Two draft sequences of the full rice genome have been published and were predicted to encode 32,000 to 55,000 genes (Goff et al., 2002; Yu et al., 2002). Comparative analysis showed that 81.6% of the SASs had a significant match with the rice genome (Fig. 2B). This higher proportion of sugarcane-rice matches compared with the sugarcane-Arabidopsis matches (70.5%) was expected considering the evolutionary distance between the latter two species and may correspond to the divergence of eudicots and monocots. Of the remaining SASs without a significant match with the rice genome, 4.9% had matches with sequences from other plants. These SASs may correspond to gene losses in rice or to portions of the rice genome that have not been sequenced yet. The latter possibility is supported by the observation that 572 SASs had a match with rice ESTs but not with the rice genomic sequence.

Monocotyledon-Specific Sequences

A fraction of 27.5% of the SASs showed no significant similarity with any eudicot sequences. Approximately half of these SASs (5,996) had significant matches with monocot sequences, including the rice genome, and were designated “monocots” (Table I; Figs. 1 and 2A) because these SASs may correspond to sequences restricted to monocot species. Alternatively, some of them may correspond to fast-evolving portions of proteins belonging to families conserved among angiosperms. Such cases could be detected through anchor monocot homologous sequences that would link the portion evolving at high rates to the more conserved sequences (i.e. important functional domains) that characterize these conserved protein families (Fig. 4). According to this rationale, the monocot EST corresponding to the best match of each SAS was used as an anchor query sequence in a further comparison with eudicot EST sequences. When significant matches were found, the SAS contained in the anchor sequence was interpreted to be a fast-evolving sequence. Using this approach, 21% (1,368) of the SASs in the “monocots” class were found to be sequences possibly evolving at high rates. The remaining SASs (79%, 5,028) may represent sequences that can be defined more strictly as monocot specific and could, theoretically, represent evolutionary novelties, loss events in eudicots, or horizontal gene transfers.

Figure 4.

Schematic representation of the strategy used to detect possible fast-evolving sequences among SASs of the monocots class. Stretches of sequences that evolve at a high rate and belong to gene families conserved among angiosperms may be detected through a closely related homologous sequence called anchor sequence. The anchor sequence would link the fast-evolving sequence to a conserved domain that characterizes the angiosperm gene family. In this approach, the monocot EST corresponding to the best match of each SAS was used as an anchor sequence to search for possible eudicot homologous sequences using TBLASTX and a cutoff E-value limit of e-5.
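The anchor strategy of Figure 4 can be illustrated with a short sketch. It is not the authors' implementation: the function `find_fast_evolving`, the lookup tables, and the toy identifiers are hypothetical, and the TBLASTX search of each anchor against eudicot ESTs is represented here by a precomputed lookup of best E values.

```python
from typing import Callable, Dict, Iterable, Optional, Set

E_CUTOFF = 1e-5  # cutoff quoted in the Figure 4 legend


def find_fast_evolving(
    monocot_sas_ids: Iterable[str],
    best_monocot_est: Dict[str, str],
    eudicot_evalue: Callable[[str], Optional[float]],
) -> Set[str]:
    """Return the SASs whose anchor sequence has a significant eudicot match.

    best_monocot_est: SAS id -> id of its best-matching monocot EST (the anchor).
    eudicot_evalue:   anchor id -> best E value against eudicot ESTs, or None.
    """
    fast_evolving = set()
    for sas_id in monocot_sas_ids:
        anchor = best_monocot_est.get(sas_id)
        if anchor is None:
            continue  # no EST anchor available for this SAS
        evalue = eudicot_evalue(anchor)
        if evalue is not None and evalue <= E_CUTOFF:
            # The anchor links the SAS to a conserved eudicot homolog.
            fast_evolving.add(sas_id)
    return fast_evolving


# Toy data (hypothetical identifiers and E values).
toy_anchors = {"SAS_A": "EST_1", "SAS_B": "EST_2"}
toy_eudicot_hits = {"EST_1": 1e-12, "EST_2": 0.3}
fast = find_fast_evolving(["SAS_A", "SAS_B", "SAS_C"], toy_anchors, toy_eudicot_hits.get)
print(sorted(fast))
```

SASs left out of the returned set correspond to the sequences the text describes as more strictly monocot specific.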

To further evaluate the participation of these different evolutionary processes in the production of monocot-specific sequences, a manual validation of SASs that had a significant similarity with proteins in GenBank was undertaken. Most of these (164 of 215) could be organized into the four following categories: mobile genetic elements, transcription factors, stress and defense responses, and putative hypothetical rice proteins (Table III; see also Supplemental Table III). This pattern is likely to partly reflect the bias introduced by the more representative monocot proteins present in the databases. A large proportion (approximately 76%) of the 215 SASs were found to correspond to fast-evolving sequences, according to the rationale described above and using the GenBank best matches as anchor query sequences (Table III).

Table III.

Evaluation of the monocot groups of SASs. These SASs had a significant match with sequences in GenBank.

Categories | Fast-Evolving Sequencesa | Monocot-Specific Motifs/Domainsb | Monocot-Specific Proteinsc
Transposons | 29 | 0 | 0
Transcription factors | 17 | 2 | 0
Stress and defense responses | 28 | 1 | 2
Hypothetical rice proteins | 48 | 9 | 28
Others | 42 | 2 | 7
Total | 164 | 14 | 37

a Fast-evolving sequences were identified according to the anchor scheme described in Figure 4.

b Monocot-specific motifs/domains refer to sequences encoding approximately 50 or more amino acids, which were not detected as fast-evolving sequences.

c Monocot-specific proteins refer to proteins lacking any similar sequences in eudicots based on BLAST comparison. See also Supplemental Table III.

These results indicate that accelerated evolution of specific sequences of conserved eudicot-monocot protein families is an important aspect of angiosperm evolution. Consistent with this idea, and considering the mobile genetic elements within the “monocots,” the anchor analysis revealed eudicot relationships for the transposon and retrotransposon sequences. Thus, these mobile genetic elements represent highly divergent sequences of eudicot-monocot conserved transposons. Also of interest were the rapidly evolving sequences in the category of transcription factors (Table III), which may promote changes in transcriptional regulatory networks (Van der Hoeven et al., 2002). In addition, most of the fast-evolving sequences in the stress and defense responses category (Table III) were from resistance genes (R genes). This observation was consistent with previous results indicating adaptive evolution among R genes (Bergelson et al., 2001). The remaining 51 SASs represented monocot-specific sequences that may correspond to evolutionary novelties. These SASs were divided into two subsets. One of these (14 SASs) represented monocot-specific motifs or domains that are integrated in conserved eudicot-monocot proteins (Table III). Such cases may represent monocot-specific protein architectures. The second subset (37 SASs) identified monocot-specific proteins, most of which corresponded to rice putative or hypothetical proteins of unknown function (Table III). A minor group of monocot-specific proteins of known function was also identified and included seed storage prolamines, an antifungal peptide, and a phytase required for phytic acid (phosphorus storage) degradation.

Sugarcane-Specific Sequences

The proportion of SASs that did not match sequences from any other angiosperm was still high (5,812, corresponding to 13.5% of the SASs). This “no match” group (Table I; Fig. 2A) may correspond to highly variable sequences that either diverged significantly (evolving at high rates) from their homologs in other monocots or that are specific for sugarcane. Alternatively, they may represent 5′- or 3′-untranslated sequences that are likely to be under low selective pressure and, therefore, would not be detected by this comparison, which relied on open reading frames.

The 68 SASs with significant matches to proteins of other organisms (non-plant) were analyzed further. Of these, 34 were similar to fungal proteins, 11 to bacterial proteins, 16 to proteins from other organisms (including human), and three to plant virus proteins. Some of these examples are most likely contaminants from endophytic organisms or other sources, but horizontal gene transfer cannot be excluded. The remaining four SASs matched DNA sequences from plants that were not included in the pipeline. The extent to which these 5,812 SASs of the “no match” class represent novel genes restricted to sugarcane remains to be determined.

Non-Mapped EST Homologs in Arabidopsis Chromosomes

ESTs are useful for locating and annotating potential genes in chromosomal sequences. This search is normally done using ESTs from the same organism because direct nucleotide sequence alignment allows the identification of the chromosomal region that is potentially transcribed and processed to produce the original mRNA (Seki et al., 2002). However, when ESTs from the same organism are not available, the use of ESTs from closely related organisms may be valuable (Liu et al., 2001; Quackenbush et al., 2001; Fulton et al., 2002). The comparison of sugarcane sequences with Arabidopsis chromosomal DNA and annotated protein data resulted in the identification of several potential new Arabidopsis genes. In all, 871 SASs matched chromosomal DNA of Arabidopsis but showed no significant match with annotated proteins from this plant (see also Supplemental Table I).

The fact that these SASs matched the Arabidopsis genome does not mean that they correspond to nonannotated genes. Limitations are expected with this approach because false hits and pseudogenes can be traced. However, some features were used as criteria to validate the matches as candidate genes in Arabidopsis. First, 759 of 871 SASs (87%) had significant matches with DNA from other plants, implying that they encode conserved proteins, also expected to be found in Arabidopsis. Another criterion was the fact that the similarity of several SASs along the Arabidopsis DNA was discontinuous, indicating the presence of several exons. We found that 294 (34%) of these SASs fulfilled this last criterion. However, single matches (one exon) could also locate potential genes.
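The second criterion, a discontinuous match along the genomic DNA, can be illustrated by counting separate aligned blocks. This is a sketch under stated assumptions: the function `count_match_blocks`, the `max_gap` parameter, and the interval representation (1-based inclusive start and end coordinates of each aligned segment on the Arabidopsis sequence) are hypothetical and not taken from the paper. The last two lines simply recompute the percentages quoted above.

```python
from typing import List, Tuple


def count_match_blocks(segments: List[Tuple[int, int]], max_gap: int = 0) -> int:
    """Count separate aligned blocks after merging overlapping or adjacent segments.

    Two segments are merged when the number of unaligned bases between them
    is at most `max_gap`; otherwise they count as distinct blocks.
    """
    blocks = 0
    current_end = None
    for start, end in sorted(segments):
        if current_end is None or start - current_end - 1 > max_gap:
            blocks += 1  # a new block begins after a gap in the alignment
            current_end = end
        else:
            current_end = max(current_end, end)  # extend the current block
    return blocks


# Toy example: three aligned stretches separated by unaligned DNA,
# a pattern compatible with three exons separated by introns.
toy_segments = [(100, 250), (400, 520), (900, 1010)]
suggests_multiple_exons = count_match_blocks(toy_segments) >= 2

# Percentages quoted in the text, recomputed from the counts.
pct_conserved = round(100 * 759 / 871)
pct_discontinuous = round(100 * 294 / 871)
print(suggests_multiple_exons, pct_conserved, pct_discontinuous)
```

A single block does not rule out a gene, in line with the remark that one-exon matches could also locate potential genes.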

Two examples of potentially new genes in Arabidopsis are illustrated in Figure 5. One of these (Fig. 5A) identified five exons of a putative protein involved in cell wall biogenesis. Two of the exons did not match the GeneMark prediction (Lukashin and Borodovsky, 1998), but there was no protein annotated at that locus. The other example is a putative gene in Arabidopsis that has probably been only partially annotated (Fig. 5B). The SAS identified one exon that includes two exons predicted by the GENSCAN program (Burge and Karlin, 1997). This exon and a nearby annotated gene share strong similarity with different, contiguous positions of the alpha-1,3/4-fucosidase-precursor gene from bacteria and are probably two exons of the same gene.

Figure 5.

Examples of putative nonannotated genes in the Arabidopsis genome, based on sugarcane cDNA sequences. The first line in A and B indicates the Arabidopsis sequence contig, with its prediction. A, Putative protein involved in cell wall biogenesis. B, Putative alpha-1,3/4-fucosidase precursor.

DISCUSSION

A comprehensive genome analysis using the consensus sequences assembled from approximately 240,000 sugarcane ESTs has revealed a core set of conserved eudicot-monocot sequences corresponding to 70.5% of the SASs. This result agrees with those of a maize versus Arabidopsis comparison (Brendel et al., 2002) and a rice full-length cDNA versus Arabidopsis comparison (Kikuchi et al., 2003) and suggests that approximately two-thirds of the genes expressed in monocots encode conserved proteins that fulfill similar angiosperm functions.

A set of sugarcane sequences was found to be conserved among angiosperms but was missing in Arabidopsis (Table II; Fig. 2). This finding suggests that the corresponding genes were present in the ancestor of monocots and eudicots and were subsequently lost in Arabidopsis. This conclusion agrees with and complements recent reports (Allen, 2002; Van der Hoeven et al., 2002) that compared eudicot ESTs with Arabidopsis sequences. These genes, which are missing in Arabidopsis, are expected to be nonessential to angiosperms but may confer some selective advantage because they have been retained in distantly related plant species. Functionally related proteins may compensate for these losses. Alternatively, because some of these sequences are conserved among several eukaryote groups, they are likely to represent essential genes, and their absence in Arabidopsis raises the possibility that they still remain to be sequenced in this organism. These data are consistent with the idea that differential gene loss is an active process in the evolution of angiosperm genomes (Lynch and Conery, 2000; Allen, 2002).

Our comparative analysis revealed a set of sequences that appears to be specific to monocots. This monocot class of SAS is of particular interest because it may include sequences that could be related to monocot-specific traits and may represent true evolutionary novelties or gene losses in eudicots. A detailed analysis of these sequences indicated that a significant proportion corresponded to fast-evolving sequences found in members of conserved angiosperm gene families. A high rate of evolution can be related to low functional constraints and/or functional diversification. This latter possibility is more likely to be responsible for the production of new protein functions that may be involved in the differentiation of a specific evolutionary lineage. Gene duplication followed by sequence divergence is the main means for functional diversification (Lynch and Conery, 2000; Wendel, 2000). The finding that transcription regulatory factors are well represented as a functional category in the monocot class suggests that alterations in the transcriptional regulatory network are an essential feature of angiosperm evolution, in agreement with previous observations (Doebley and Lukens, 1998; Cronk, 2001; Van der Hoeven et al., 2002).

Some SASs of the monocot class appeared to define new protein architectures that resulted from the association of monocot-specific sequences with conserved eudicot-monocot protein domains. These cases may represent exon shuffling events that could possibly lead to new functions. Together, these data indicate that the evolutionary events underlying the differentiation of eudicots and monocots rely on functional diversification (generating new protein functions) from duplicated copies of conserved gene families and on the acquisition of novel protein sequences.

Recently, a rice versus Arabidopsis comparison indicated that almost one-half of the 53,398 genes predicted in the rice genome (Yu et al., 2002) did not have a match in Arabidopsis. These results conflict with our findings that 70.5% of the sugarcane sequences matched with Arabidopsis protein sequences. This discrepancy could result from an over-estimation of the number of rice genes. If one assumes that the sugarcane data set is representative of a monocot transcriptome, a correction factor of 1.4 (70.5%/49.4%) would result in an estimation of approximately 38,000 rice genes (approximately 53,400/1.4). This is significantly greater than what has been found for Arabidopsis but close to the estimate of 35,000 genes for the tomato genome (Van der Hoeven et al., 2002). If this hypothesis is correct, then one would expect the number of genes in angiosperms to be 35,000 to 38,000 per genome and would indicate that plants have a very plastic and dynamic genome (for example, because of polyploidization or segmental duplications) capable of maintaining a similar number of genes. Our data also indicate that during the 200 MY since separation of the monocots and eudicots, approximately two-thirds of their genes have been kept very similar, whereas the remaining one-third consists of fast-evolving sequences or specific genes that could account for the differences between these two types of plants.
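The correction-factor estimate in the preceding paragraph can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the variable names are illustrative. It also shows the result obtained when the factor is not rounded to 1.4 before dividing, which is slightly lower.

```python
sas_match_pct = 70.5           # SASs matching Arabidopsis (this study)
rice_match_pct = 49.4          # rice predicted genes matching Arabidopsis, as used in the text
rice_predicted_genes = 53_398  # genes predicted in the rice genome (Yu et al., 2002)

# Ratio of the two match rates, and the same ratio rounded as quoted in the text.
factor = sas_match_pct / rice_match_pct
factor_rounded = round(factor, 1)

# Corrected rice gene number with the rounded and the unrounded factor.
corrected_rounded = rice_predicted_genes / factor_rounded
corrected_unrounded = rice_predicted_genes / factor

print(f"factor = {factor:.3f} (quoted as {factor_rounded})")
print(f"estimate with rounded factor:   {corrected_rounded:,.0f}")
print(f"estimate with unrounded factor: {corrected_unrounded:,.0f}")
```

Either way the estimate stays close to the 35,000 to 38,000 range discussed in the text, so the rounding does not change the argument.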

The sugarcane EST collection used in this study represents one of the largest and most representative transcriptome data sets for a monocotyledon species. The evolutionary proximity of rice and sugarcane means that a large number of homologous sequences are to be expected from these two species. Hence, the sugarcane ESTs provide an important contribution to studies of the published draft genome of rice (Goff et al., 2002; Yu et al., 2002). Direct BLAST comparison can generate genetic information such as the confirmation of gene loci. The distribution of ESTs in different libraries (representing the mRNA expressed in different tissues or plant culture conditions) may also provide a quantitative means of generating gene expression patterns for specific genes (Souza et al., 2001). For this, the SASs provide an important fraction of putative full-length genes (Vettore et al., 2003). These features may be used to locate sequences upstream of genes with known patterns of expression in different tissues. This, in turn, may help in the identification of putative promoters, which could be useful for crop improvement through plant genetic engineering. The fact that 12,208 SASs probably represent information specific to monocots makes these sugarcane genes important starting points for the identification of monocot-specific traits.

MATERIALS AND METHODS

The sugarcane (Saccharum officinarum) EST sequences were from the SUCEST project, which has been described previously (Telles and da Silva, 2001) and had low redundancy because the cDNA libraries were from different tissues and physiological conditions (Vettore et al., 2001). The set of 237,954 sequences was trimmed and assembled into 42,982 contigs and singletons (SASs) using the CAP3 program (Telles and da Silva, 2001). General information for these sequences was previously described (Vettore et al., 2003) and can be accessed (http://sucest.lad.ic.unicamp.br/public). Investigation of sequence similarities between SASs and other plants was conducted using the BLASTX and TBLASTX algorithms (Altschul et al., 1997). A cutoff E value of e-5, a standard threshold, was used to define significant similarity. The Arabidopsis sequences were from the Munich Information Center for Protein Sequences (http://www.mips.biochem.mpg.de/proj/thal; Schoof et al., 2002). The rice (Oryza sativa) genome sequences were from rice subsp. indica and the Rice Genome Database of Chinese Super Hybrid Rice (http://btn.genomics.org.cn/rice), and the other sequences were from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov). The data were last downloaded in April 2002.
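To make the e-5 criterion concrete, the following sketch filters BLAST hits by E value. It assumes, purely for illustration, that the search results were saved in the standard 12-column tab-separated BLAST format, in which the query identifier is the first field and the E value the eleventh; the paper does not state which output format was used. The function name and the toy records are hypothetical.

```python
from typing import Dict, Iterable

E_CUTOFF = 1e-5  # significance threshold stated in the text


def best_significant_hits(lines: Iterable[str], cutoff: float = E_CUTOFF) -> Dict[str, float]:
    """Return, for each query, the lowest E value that passes the cutoff.

    Each line is assumed to be a 12-column tab-separated BLAST record:
    query, subject, % identity, length, mismatches, gap opens,
    query start, query end, subject start, subject end, E value, bit score.
    """
    best: Dict[str, float] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, evalue = fields[0], float(fields[10])
        if evalue <= cutoff and evalue < best.get(query, float("inf")):
            best[query] = evalue
    return best


# Toy records (hypothetical identifiers and values).
toy_records = [
    "SAS_1\tAt_prot_1\t62.1\t140\t50\t2\t10\t429\t5\t144\t3e-40\t160",
    "SAS_1\tAt_prot_2\t35.0\t80\t48\t3\t30\t269\t12\t91\t0.002\t38.1",
    "SAS_2\tAt_prot_9\t28.4\t60\t40\t1\t1\t180\t7\t66\t1.4\t27.3",
]
hits = best_significant_hits(toy_records)
print(hits)
```

Queries absent from the returned dictionary, such as `SAS_2` in the toy data, would be carried forward to the next comparison in the pipeline.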

SASs with no similarity to Arabidopsis sequences were compared against eudicotyledon (tomato [Lycopersicon esculentum], soybean [Glycine max], and Lotus japonicus) and monocotyledon (barley [Hordeum vulgare], rice, sorghum [Sorghum bicolor], and maize [Zea mays]) EST sequences. All SASs were compared with the complete GenBank database for categorization (Telles et al., 2001).

The presence of chloroplast target peptide was predicted with the ChloroP program (Emanuelsson et al., 1999; http://www.cbs.dtu.dk/services/ChloroP/). Protein phylogeny distance analyses were done with the NEIGHBOR program (PHYLIP program, Phylogeny Inference Package, version 3.57c, Department of Genetics, University of Washington, Seattle; Felsenstein, 1993; 1997) using a distance matrix generated by the PROTDIST program utilizing the PAM001 matrix.

Supplementary Material

Supplemental Data

Acknowledgments

The authors thank the technicians and researchers who contributed to the sequencing effort and whose names are listed at the Web site http://sucest.lad.ic.unicamp.br/public. The authors also acknowledge the contribution of Dr. Nicolas Carels for critical reading of the manuscript.

1 This work was jointly supported by the Fundação de Amparo à Pesquisa do Estado de São Paulo (São Paulo, Brazil), by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq, Brasília, Brazil), and by COPERSUCAR (Piracicaba, Brazil).

[w] The online version of this article contains Web-only data.

References

  1. Allen KD (2002) Assaying gene content in Arabidopsis. Proc Natl Acad Sci USA 99: 9568-9572
  2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402
  3. Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815
  4. Aravind L, Watanabe H, Lipman DJ, Koonin EV (2000) Lineage-specific loss and divergence of functionally linked genes in eukaryotes. Proc Natl Acad Sci USA 97: 11319-11324
  5. Bergelson J, Kreitman M, Stahl EA, Tian D (2001) Evolutionary dynamics of plant R-genes. Science 292: 2281-2285
  6. Braun EL, Halpern AL, Nelson MA, Natvig DO (2000) Large-scale comparison of fungal sequence information: mechanisms of innovation in Neurospora crassa and gene loss in Saccharomyces cerevisiae. Genome Res 10: 416-430
  7. Brendel V, Kurtz S, Walbot V (2002) Comparative genomics of Arabidopsis and maize: prospects and limitations. Genome Biol 3: REVIEWS1005
  8. Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268: 78-94
  9. Copley R, Letunic I, Bork P (2001) Genome and protein evolution in eukaryotes. Curr Opin Chem Biol 6: 39-45
  10. Cronk QC (2001) Plant evolution and development in a post-genomic context. Nat Rev Genet 2: 607-619
  11. Doebley J, Lukens L (1998) Transcriptional regulators and the evolution of plant form. Plant Cell 10: 1075-1082
  12. Emanuelsson O, Nielsen H, von Heijne G (1999) ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci 8: 978-984
  13. Felsenstein J (1993) PHYLIP (Phylogeny Inference Package) Version 3.5c. Department of Genetics, University of Washington, Seattle
  14. Felsenstein J (1997) An alternating least squares approach to inferring phylogenies from pairwise distances. Syst Biol 46: 101-111
  15. Feng Q, Zhang Y, Hao P, Wang S, Fu G, Huang Y, Li Y, Zhu J, Liu Y, Hu X et al. (2002) Sequence and analysis of rice chromosome 4. Nature 420: 316-320
  16. Fulton TM, Van der Hoeven R, Eannetta NT, Tanksley SD (2002) Identification, analysis, and utilization of conserved ortholog set markers for comparative genomics in higher plants. Plant Cell 14: 1457-1467
  17. Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92-100
  18. Grivet L, Arruda P (2002) Sugarcane genomics: depicting the complex genome of an important tropical crop. Curr Opin Plant Biol 5: 122-127
  19. Henikoff S, Greene E, Pietrokovski S, Bork P, Attwood T, Hood L (1997) Gene families: the taxonomy of protein paralogs and chimeras. Science 278: 609-614
  20. Kellog E (2001) Evolutionary history of the grasses. Plant Physiol 125: 1198-1205
  21. Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, Kishimoto N, Yazaki J, Ishikawa M, Yamada H, Ooka H et al. (2003) Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 301: 376-379
  22. Kondrashov F, Rogozin I, Wolf Y, Koonin E (2002) Selection in the evolution of gene duplication. Genome Biol 3: 1-9
  23. Koonin EV, Aravind L, Kondrashov AS (2000) The impact of comparative genomics on our understanding of evolution. Cell 101: 573-576
  24. Ku HM, Vision T, Liu J, Tanksley SD (2000) Comparing sequenced segments of the tomato and Arabidopsis genomes: large-scale duplication followed by selective gene loss creates a network of synteny. Proc Natl Acad Sci USA 97: 9121-9126
  25. Lander E, Linton L, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, Fitzhugh W et al. (2001) Initial sequencing and analysis of the human genome. Nature 409: 860-921
  26. Liu H, Sachidanandam R, Stein L (2001) Comparative genomics between rice and Arabidopsis shows scant collinearity in gene order. Genome Res 11: 2020-2026
  27. Lukashin AV, Borodovsky M (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 26: 1107-1115
  28. Lynch M, Conery JS (2000) The evolutionary fate and consequences of duplicate genes. Science 290: 1151-1155
  29. Poethig RS (2001) Life with 25,000 genes. Genome Res 11: 313-316
  30. Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J (2001) The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res 29: 159-164
  31. Rubin G, Yandell M, Wortman J (2000) Comparative genomics of the eukaryotes. Science 287: 2204-2215
  32. Rujan T, Martin W (2001) How many genes in Arabidopsis come from cyanobacteria? An estimate from 386 protein phylogenies. Trends Genet 17: 113-120
  33. Salzberg SL, White O, Peterson J, Eisen JA (2001) Microbial genes in the human genome: lateral transfer or gene loss? Science 292: 1903-1906
  34. Sankoff D (2001) Gene and genome duplication. Curr Opin Genet Dev 11: 681-684
  35. Sasaki T, Matsumoto T, Yamamoto K, Sakata K, Baba T, Katayose Y, Wu J, Niimura Y, Cheng Z, Nagamura Y et al. (2002) The genome sequence and structure of rice chromosome 1. Nature 420: 312-316
  36. Schoof H, Zaccaria P, Gundlach H, Lemcke K, Rudd S, Kolesov G, Arnold R, Mewes HW, Mayer KF (2002) MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource based on the first complete plant genome. Nucleic Acids Res 30: 91-93
  37. Seki M, Narusaka M, Kamiya A, Ishida J, Satou M, Sakurai T, Nakajima M, Enju A, Akiyama K, Oono Y et al. (2002) Functional annotation of a full-length Arabidopsis cDNA collection. Science 296: 141-145
  38. Somerville C, Somerville S (1999) Plant functional genomics. Science 285: 380-383
  39. Souza GM, Simoes ACQ, Oliveira KC, Garay HM, Fiorini LC, Gomes FdS, Nishiyama-Junior MY, Silva AM (2001) The sugarcane signal transduction (SUCAST) catalogue: prospecting signal transduction in sugarcane. Genet Mol Biol 24: 25-34
  40. Stanhope MJ, Lupas A, Italia MJ, Koretke KK, Volker C, Brown JR (2001) Phylogenetic analyses do not support horizontal gene transfers from bacteria to vertebrates. Nature 411: 940-944
  41. Telles GP, Braga MVD, Dias Z, Quitzau JAA, da Silva FR, Meidanis J (2001) Bioinformatics of the sugarcane EST project. Genet Mol Biol 24: 8-15
  42. Telles GP, da Silva FR (2001) Trimming and clustering sugarcane ESTs. Genet Mol Biol 24: 17-23
  43. Van der Hoeven R, Ronning C, Giovannoni J, Martin G, Tanksley S (2002) Deductions about the number, organization, and evolution of genes in the tomato genome based on analysis of a large expressed sequence tag collection and selective genomic sequencing. Plant Cell 14: 1441-1456
  44. Vettore AL, da Silva FR, Kemper EL, Arruda P (2001) The libraries that made SUCEST. Genet Mol Biol 24: 1-7
  45. Vettore AL, da Silva FR, Kemper EL, Souza GM, Silva AM, Ferro MIT, Henrique-Silva F, Giglioti EA, Lemos MV, Coutinho LL et al. (2003) Analysis and functional annotation of an expressed sequence tag collection for tropical crop sugarcane. Genome Res 13: 2725-2735
  46. Vision T, Brown D, Tanksley S (2000) The origins of genomic duplications in Arabidopsis. Science 290: 2114-2117
  47. Wendel J (2000) Genome evolution in polyploids. Plant Mol Biol 42: 225-249
  48. Wikstrom N, Savolainen V, Chase MW (2001) Evolution of the angiosperms: calibrating the family tree. Proc R Soc Lond B Biol Sci 268: 2211-2220
  49. Wolfe KH, Gouy M, Yang YW, Sharp PM, Li WH (1989) Date of the monocot-dicot divergence estimated from chloroplast DNA sequence data. Proc Natl Acad Sci USA 86: 6201-6205
  50. Yang Y, Lai K, Tai P, Li W (1999) Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. J Mol Evol 48: 597-604
  51. Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X et al. (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79-92

Articles from Plant Physiology are provided here courtesy of Oxford University Press