Abstract
Arabidopsis has been popular as a model plant system for decades. Completion of the Arabidopsis genome and the availability of large expressed sequence-tag collections from other dicot species provide an opportunity to assess gene content in Arabidopsis, specifically by identifying genes from dicot test species that are absent from Arabidopsis. I report here results from these sorts of comparisons, carried out in part to assess the extent to which Arabidopsis is representative of dicot genomes and also the degree to which gene loss and novel gene acquisition have accompanied angiosperm speciation. More than 10% of the contigs from each of three dicot test species have no detectable homolog in Arabidopsis. By means of cross comparison among the test species, 154 specific cases of gene loss in the lineage leading to Arabidopsis were identified, including several well characterized enzymes and a group of proteins with strong homologs in the photosynthetic bacterium Synechocystis. These results show that although Arabidopsis is broadly representative of the other dicot genomes, there seems to be substantial variation even among relatively closely related genera. Further, although a causative link cannot yet be drawn, variation in actual gene content appears to be a feature of angiosperm speciation.
The tremendous variety of morphology, growth habit, and biochemical makeup of flowering plants, with economically important species scattered among groups as diverse as oak trees and orchids, has fascinated scientists for generations. Central to understanding this group is determining whether this diversity results from modifications of essentially the same set of genes, by means of significant differences in gene content, or both. Variation in gene content can occur either by novel gene acquisition or specific gene loss, and the relative importance of these two processes is unknown. The completion of the Arabidopsis thaliana genome (1), in conjunction with several large expressed sequence tag (EST) projects in higher plant species, provides an opportunity to examine gene content in higher plants and the role of gene loss or acquisition in speciation. I have approached these questions by using Arabidopsis as a reference genome and comparing clustered and assembled EST sets (gene indices) from different species to Arabidopsis. In this comparison, a subset of each test genome, representing some unknown fraction of the total expressed genes, is compared with the complete reference genome, which limits the types of inference that may be derived. Specifically, gene absence may be inferred in the reference genome but not in the incomplete test genomes. Because the evolutionary distance between the genomes used here is significant enough to limit severely the utility of nucleotide level comparison, I used a rigorous frame-independent Smith Waterman implementation that does the comparison at the level of conceptual protein translation.
Three dicot species were selected for this study on the basis of phylogenetic position relative to Arabidopsis and the availability of good EST collections. A simplified phylogeny of these species (based on ref. 2) is shown in Fig. 1. Glycine max (soybean) and Medicago truncatula (both in the Fabales) and Arabidopsis (in the Brassicales) are in the Rosids, one of the main divisions of the dicots. Lycopersicon esculentum (tomato), in the Solanaceae, is in the Asterids, another major division of the dicots. The Asterids and the Rosids diverged early in the diversification of the dicots, 100–150 million years ago (3, 4). If a close homolog of some gene is found in L. esculentum, G. max, and M. truncatula, it seems certain to have been in the common ancestor to the Rosids and the Asterids. If such a gene is missing in Arabidopsis, this is direct evidence for gene loss in the lineage leading to Arabidopsis or for such highly accelerated divergence of this gene that it is no longer detectable by sequence similarity. This paper concerns a set of examples that provides direct evidence for gene loss.
Figure 1.

Simplified phylogeny of the species used in this study. Estimates of the age of divergence of key lineages are shown. The estimate for the divergence of the crucifers and the legumes is from ref. 3. The Asterid/Rosid divergence estimate is from refs. 3 and 4, and the monocot/dicot divergence estimate is from ref. 5. Mya, million years ago.
Materials and Methods
Unigene Set Assembly.
The three dicot test species were selected on the basis of phylogenetic position and sequence availability. For each test species, EST sequences were extracted from a local installation of GenBank. As a general rule, multiple labs or centers were involved in generating each sequence data set (consult the homepages of the M. truncatula Consortium, www.medicago.org, the Solanaceae Genomics Network, www.sgn.cornell.edu, and the Soybean EST Project at 129.186.26.94/soybeanest.html). Sequences were assembled into gene indices by using the PARACEL CLUSTERING PACKAGE 2.2.4 run with default parameters. Assembly in this package is based on the CAP4 program, which is an update of the publicly available CAP3 (6). The assembly process begins with an all-against-all comparison of the input sequences, which forms the basis for placing sequences into clusters that are subsequently assembled into contigs. A subset of the input sequences may not be placed into clusters (and thus not assembled into contigs), because no overlaps were detected with other input sequences at the comparison stage. These “true singletons” were not included in the comparisons to the Arabidopsis genome, partly to reduce computational load but mainly because this set is enriched for short, low-quality, or artifactual reads.
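The clustering step described above can be illustrated with a minimal sketch. It is not the CAP4 or PARACEL implementation, whose internals are not described here; it simply assumes that the all-against-all comparison has already produced a list of overlapping read pairs, and groups reads by single linkage with a union-find structure. The function name `cluster_reads` and the read identifiers are hypothetical.

```python
from collections import defaultdict


def cluster_reads(read_ids, overlaps):
    """Group reads into clusters by single linkage over pairwise overlaps.

    read_ids: iterable of read identifiers (hypothetical names).
    overlaps: iterable of (read_a, read_b) pairs that were judged to overlap
              in an all-against-all comparison (assumed to be precomputed).
    Returns (clusters, singletons): clusters have two or more reads;
    singletons are reads with no detected overlap.
    """
    read_ids = list(read_ids)
    parent = {r: r for r in read_ids}

    def find(r):
        # Follow parent links to the cluster representative (path halving).
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in overlaps:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = defaultdict(list)
    for r in read_ids:
        groups[find(r)].append(r)

    clusters = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singletons = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, singletons


# Toy input: est1-est3 chain together; est4 and est5 overlap nothing.
reads = ["est1", "est2", "est3", "est4", "est5"]
pairs = [("est1", "est2"), ("est2", "est3")]
clusters, singletons = cluster_reads(reads, pairs)
```

In this toy input the reads without any overlap end up in `singletons`, which corresponds to the set that was left out of the comparisons to Arabidopsis.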
Arabidopsis Databases.
The Arabidopsis reference genome comprised three separate databases. First, a local copy of the publication release minimally redundant AGI genome sequence was downloaded from ftp.arabidopsis.org, which was supplemented with bacterial artificial chromosome sequence deposited in GenBank subsequent to the publication to maximally cover the small number of gaps present in the genomic sequence. The second database consisted of all Arabidopsis protein sequences in GenBank, concatenated with the 25,470 annotated proteins from the AGI publication genome (1). This protein database includes the complete genomes of the chloroplast and mitochondrion. The last component was a clustered EST database or Arabidopsis gene index. Arabidopsis ESTs (124,528) were extracted from GenBank and combined with 53,269 ESTs sequenced as part of the Paradigm Genetics Arabidopsis genomics effort. This resulting set of 177,817 ESTs was assembled by using PARACEL CLUSTERING PACKAGE 2.2.4 run with default parameters. The resulting gene index contained 17,759 contigs and 25,521 singletons, all of which were used in subsequent analysis. This combination of databases gives very good coverage of the genome. In order for a gene to be missing from this database, it must be present in one of the few remaining gaps in the genome, previously undescribed, and expressed at a low enough level not to be represented in the EST sets used in this study.
Sequence Comparison.
Sequence comparisons between the gene indices and the Arabidopsis genomic database or between gene indices were carried out with the PARACEL TSWX algorithm with hardware acceleration on a PARACEL GENEMATCHER 2. TSWX is an implementation of the Smith Waterman algorithm that compares six-frame conceptual translations of DNA with double-affine gaps and frameshift tolerance. This offers substantial advantages over NCBI TBLASTX, which does not allow gaps and carries out the comparison as a set of static frames and hence cannot accommodate frameshifts. By tolerating the frameshift errors that are common in EST data and the introns present in genomic sequence, the PARACEL TSWX algorithm gives a substantial increase in the sensitivity of the comparison. Hardware acceleration made it possible to complete the searches in an acceptable amount of time (≈200 h for the comparisons used in this study). Searches against the Arabidopsis databases were carried out with an expect value cutoff of 1 × 10⁻³. Selection of a cutoff in this sort of study is never entirely clear cut, especially because much of the relevant literature involves the BLAST algorithms. In my experience with TSWX, the vast majority of matches that fail this cutoff (expect values greater than 1 × 10⁻³) do not represent true homology. To be certain that true homologs were not missed, the cases examined in this study that had no TSWX matches were examined further with more sensitive hidden Markov-model searches.
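As a rough illustration of what "six-frame conceptual translation" means, the following sketch translates a DNA sequence in the three forward and three reverse-complement frames using the standard genetic code. It shows only the static-frame translation; it does not model the gapped alignment or the frameshift tolerance that distinguish TSWX, and the function names are hypothetical rather than part of any PARACEL software.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(seq):
    """Return the six conceptual translations keyed '+1'..'+3' and '-1'..'-3'."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame("ATGGCC")
```

A single-base insertion or deletion in an EST shifts the true reading frame partway through the read, so a comparison restricted to these six static frames sees the protein split across two of them; this is the situation that frameshift tolerance is meant to handle.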
Results
Gene indices were constructed for each of the three test species by clustering and assembling the available ESTs. Table 1 summarizes these gene indices and the results of their comparison to Arabidopsis. Although the Arabidopsis genome is essentially complete, there are still about a dozen gaps (in addition to a small number of bacterial artificial chromosomes that still exist as unordered contigs). About half of these are in repeat-rich centromere regions that, although not gene-free, are very gene-poor. Nevertheless, it remains possible that a few genes still will be found in these gaps (updated information on the remaining gaps in the genome is available at www.arabidopsis.org/info/agicomplete.html). To maximize coverage of the genome after comparison to a genomic DNA database, including sequence made available since the publication of the genome, test EST gene indices were compared with a clustered EST set comprising 177,817 ESTs and a set of all described proteins from Arabidopsis. The probability of an expressed gene not being represented in one or all of these databases is very small. Between 10 and 15% of the contigs from three test species (G. max, M. truncatula, and L. esculentum) have no detectable homolog in any of the Arabidopsis databases. EST contigs with no detectable homolog in Arabidopsis either represent genes that are present in the test species and absent from Arabidopsis or represent various sorts of artifacts such as untranslated sequence, low-quality reads, or viral or other contaminants. Thus, it is difficult to estimate accurately the total percentage of genes in each test species that is absent from Arabidopsis. However, those contigs with no Arabidopsis homologs that can be verified either by annotation against the GenBank database or cross comparison to other gene indices represent a subset of these missing genes. For example, 545 of the 2,077 G. max contigs that did not have a detectable homolog in Arabidopsis did have homologs in the M. truncatula gene index. Some of these presumably represent novel gene-acquisition events in the lineage leading to the legumes, and some represent gene loss in the lineage leading to Arabidopsis.
Table 1.
Summary of gene indices and comparisons to the Arabidopsis databases
| Species | Total ESTs | Contigs | No hits in Arabidopsis | Percentage |
|---|---|---|---|---|
| Tomato | 94,523 | 10,576 | 1,002 | 9.5 |
| Soy | 137,952 | 14,343 | 2,077 | 14.5 |
| Medicago | 115,717 | 15,701 | 2,076 | 13.3 |
For each species the total number of EST sequences that were assembled is shown. Contigs, the number of contigs following assembly with the PARACEL CLUSTERING PACKAGE. Singletons that were not assigned to a cluster were not included in the genomic comparisons. No hits in Arabidopsis, the number of contigs that failed to get a hit with an expect value of 1 × 10⁻³; percentage, the percentage of total contigs that failed to get a hit at this cutoff.
To assay for gene-loss events in Arabidopsis, I started with the 1,002 (of 10,567 total) contigs for the L. esculentum gene index that did not have a detectable homolog in any of the Arabidopsis databases and then selected the 154 of those that had a significant match in either the M. truncatula or G. max gene indices. These contigs, representing probable gene-loss events in Arabidopsis, then were compared with the GenBank protein database by using a Smith Waterman algorithm. Forty-five contigs had significant similarities in GenBank (Table 2, which is published as supporting information on the PNAS web site, www.pnas.org). Sixteen contigs are similar to unknown or hypothetical proteins in various species including four with similarities to hypothetical proteins from the cyanobacterium Synechocystis. There are four possible kinase similarities and four DNA- or RNA-binding proteins (although only two of these, both transcription factors, are strong enough to be considered clear homologs). There are 10 similarities to known enzymes including five polyphenol oxidase (PPO) and two ornithine decarboxylase (ODC) sequences. I will discuss in more detail PPO and ODC and the unknown Synechocystis proteins.
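The selection logic used to identify candidate gene-loss events can be sketched as a simple filter. This is only an illustration under stated assumptions: each comparison is reduced to a mapping from contig identifier to its best expect value, the 1 × 10⁻³ cutoff reported for the Arabidopsis searches is also applied to the cross-species comparisons (the cutoff for those is not stated), and all names and values below are hypothetical.

```python
E_VALUE_CUTOFF = 1e-3  # cutoff reported for the Arabidopsis searches


def has_hit(best_evalues, contig, cutoff=E_VALUE_CUTOFF):
    """True if the contig has a best match at or below the expect-value cutoff.

    best_evalues maps contig id -> best expect value; contigs with no match
    at all are simply absent from the mapping.
    """
    evalue = best_evalues.get(contig)
    return evalue is not None and evalue <= cutoff


def candidate_losses(test_contigs, vs_reference, vs_other_indices):
    """Contigs with no reference hit but a hit in at least one other index."""
    without_reference_hit = [
        c for c in test_contigs if not has_hit(vs_reference, c)
    ]
    return [
        c
        for c in without_reference_hit
        if any(has_hit(index, c) for index in vs_other_indices)
    ]


# Hypothetical tomato contigs and best expect values from each comparison.
tomato_contigs = ["t1", "t2", "t3", "t4"]
vs_arabidopsis = {"t1": 1e-40, "t3": 0.5}  # t3 matches only weakly
vs_medicago = {"t1": 1e-35, "t2": 1e-20}
vs_soybean = {"t3": 1e-8}

losses = candidate_losses(
    tomato_contigs, vs_arabidopsis, [vs_medicago, vs_soybean]
)
```

In this toy data, `t1` is excluded because it has a strong Arabidopsis match, and `t4` is excluded because it matches nothing and so cannot be distinguished from an artifact; `t2` and `t3` remain as candidates.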
Two of the EST contigs with no Arabidopsis homologs are essentially identical to previously described members of the L. esculentum PPO family, which has seven described members (7), and three EST contigs represent potential new family members (Fig. 4, which is published as supporting information on the PNAS web site). PPO is a fairly loosely defined term and used here to refer specifically to catechol oxidase (EC 1.10.3.1) catalyzing the conversion of mono- and o-diphenols to o-diquinones. The highly reactive quinones autopolymerize to form brown polyphenolic melanins. The exact role of PPO is not understood fully, but a role in plant disease or pest resistance seems well established (8). Enzymatic browning is a very important concern in postharvest physiology, and a PubMed search shows that PPO has been studied in at least 20 plant species. I have identified one representative of this enzyme in M. truncatula (with six constituent ESTs) and at least four representatives in G. max, all of which also are well represented in the EST sets. No homolog of this enzyme was detected in any of the Arabidopsis databases constructed for this study, which seems unusual given the very broad distribution of this enzyme.
Another surprising gene that was not found in Arabidopsis is ODC catalyzing conversion of ornithine to putrescine with the release of CO2 (EC 4.1.1.17). In many organisms this is the rate-limiting step in the biosynthesis of polyamines including spermine and spermidine and is highly regulated at several levels (9). In plants, polyamines have been implicated in a number of developmental processes, and because of interest in manipulating the levels of various polyamine products including the highly toxic insecticide nicotine, there has been considerable interest in the control of this pathway. There is a second pathway to putrescine in bacteria, plants, and perhaps animal systems (10) involving the conversion of arginine to agmatine by arginine decarboxylase (ADC, EC 4.1.1.19), which subsequently is converted to putrescine. In tobacco these two pathways are differentially regulated (11).
As of this writing, GenBank contains ODC sequences from three plant species, L. esculentum, Nicotiana tabacum (tobacco), and Datura stramonium (jimsonweed). In the gene indices constructed for this study, I identified two L. esculentum ODC homologs (both with four constituent ESTs), one of them identical to GenBank accession no. 3668354, and the second a potential new family member (Fig. 5, which is published as supporting information on the PNAS web site). In addition I have identified a single G. max homolog (with 11 constituent reads). I was unable to identify a homolog in M. truncatula, which has EST coverage comparable to L. esculentum and G. max (Table 1). No homolog is found in any of the Arabidopsis databases. The available ODC sequences, including those identified in this study, were aligned with CLUSTALW (12), and a hidden Markov model profile was constructed from this alignment (13). This profile was used to search the Arabidopsis databases in an attempt to find an ODC homolog that had diverged too far to be detected by pairwise sequence similarity. No homologs were found, although ODC shows limited similarity to an Arabidopsis diaminopimelate decarboxylase-like protein. This similarity does not include the conserved residues in the ODC alignments.
Malmberg et al. (10) note that ODC and ADC are not present universally in all taxa, suggesting a pattern of polyphyletic loss in which most organisms have both activities, but some have lost one or the other. Although ADC and ODC may have different functions in many organisms, one activity or the other seems to be dispensable in some organisms, particularly those under selection for reduced genome size. Assays exist for ODC activity (14), and Watson et al. (15) attempted to isolate ADC and ODC mutants in Arabidopsis. They were unable to assay ODC activity in Arabidopsis, which is consistent with my results.
As mentioned above, four of the L. esculentum contigs that represent genes missing from Arabidopsis are similar to hypothetical proteins from the single-celled cyanobacterium Synechocystis PCC 6803 (Table 2). The absence of these genes is interesting in part because comparison of the complete Synechocystis genome to sequenced chloroplast genomes has established that the chloroplast probably originated from a single endosymbiosis event of a cyanobacterium that was quite similar to Synechocystis (16). Thus, these L. esculentum contigs represent ancient genes that originated in the endosymbiont genome, were stably transferred to the nuclear genome of land plants before the divergence of the Asterids and Rosids, and more recently have been lost from the lineage leading to Arabidopsis.
The first of these, sll0564, has a described Oryza sativa (Rice) homolog (GenBank accession no. 14589368, annotated as an unknown protein), and I have identified homologs in the nuclear genomes of G. max, L. esculentum, M. truncatula, Hordeum vulgare (Barley), and Zea mays (Maize) (Figs. 6 and 7, which are published as supporting information on the PNAS web site). There are no detectable homologs in any of the sequenced chloroplast genomes in GenBank. sll0564 is similar to bacterial methyltransferase genes and a hypothetical protein from the recently completed Nostoc genome. The second gene in this group, slr0730, has homologs in L. esculentum, G. max, M. truncatula, maize, and barley (Fig. 8, which is published as supporting information on the PNAS web site). There were no informative features in the multiple sequence alignment of these sequences, and when a profile hidden Markov model was constructed and used to search GenBank, no hits were found. The third example, sll0031, has strong homologs in L. esculentum, G. max, and barley. sll0031 is similar to a variety of archaebacterial and bacterial ferredoxins, dehydrogenases, and reductases, all sharing the Fer4 4Fe–4S binding domain (Pfam accession no. PF00037, Pfam version 6.6, pfam.wustl.edu). This domain is highly conserved in the L. esculentum contig (Figs. 9 and 10, which are published as supporting information on the PNAS web site). The final example is slr2032. In the gene indices constructed for this study, there are slr2032 homologs in L. esculentum, G. max, M. truncatula, maize, rice, and barley (in which two family members were found). slr2032 is highly similar to four hypothetical chloroplast proteins from lower plant species (Fig. 2). The degree of sequence conservation, considering the enormous evolutionary distance between the species shown, suggests that this protein has experienced substantial conservation pressure after its transfer from the endosymbiont to the nuclear genome.
Figure 2.
Multiple sequence alignment of slr2032 with four homologous proteins from primitive plant chloroplast genomes and the seven higher plant nuclear genome sequences identified in this study.
When a hidden Markov model profile was built from the alignment in Fig. 2 and searched against the Arabidopsis databases, no significant similarities were found in the Arabidopsis EST unigene set, the complete genome database, or the protein set. Thus, this ancient conserved protein, although present in at least three dicots and three monocots, is missing from Arabidopsis. As was the case with ODC and PPO, the presence of highly similar homologs in species flanking the lineage leading to Arabidopsis in the Asterids and Rosids shows that this is a specific gene-loss event in the lineage leading to Arabidopsis.
The pattern of presence or absence of slr2032 homologs can be placed into an evolutionary context based on plastid phylogeny inferred from comparison of sequenced chloroplast genomes and the completed Synechocystis genome (16). Current organelle genomes contain a very small fraction of the genes that were present in the original endosymbiont. Chloroplast evolution has involved massive flow of genes out of the endosymbiont genome with only a subset of genes becoming fixed in the nuclear genome (17). As shown in Fig. 3, the pattern of presence or absence of the slr2032 gene mirrors this process. In Cyanidium, a primitive alga in which the cyanobacterial endosymbiont still retains a peptidoglycan cell wall, a close slr2032 homolog is found in the plastid genome. In three members of the Rhodophyta, or red algae, slr2032 homologs are also found in the plastid genome, but no representative is found in the plastid genomes of Odontella, a diatom, or Euglena, a secondary endosymbiont. In the angiosperms, I have identified homologs of this gene in the nuclear genome of at least six species. Although the high level of sequence conservation and good representation in the EST sets suggests that this is an important gene, analysis of the multiple sequence alignment yielded no strong clues as to the function of this protein. There are no targeting or transmembrane domains, no significant motifs are found in the Prosite database (www.expasy.ch/prosite), and no domains are recognized in Pfam.
Figure 3.
Phylogenetic relationship of the genomes containing slr2032 homologs (based on ref. 16). For each species the number of proteins in the chloroplast genome is shown if known. Species in which an slr2032 homolog is found in the chloroplast genome are shown in blue. Those in which it was found in the nuclear genome are shown in red, and those in which it was not found in the chloroplast genome but in which presence or absence in the nuclear genome is unclear are shown in black. Arabidopsis, in which slr2032 is absent from both the chloroplast and nuclear genomes, is shown in yellow. Note that branches are not drawn to scale, and the position of Antithamnion is based only on its position in the Rhodophyta and not on the chloroplast genome comparisons in ref. 16.
slr2032 is an example of a gene that was transferred from the plastid to the nuclear genome some time after the separation of the Rhodophyta and Chlorophyta/Metaphyta and experienced substantial conservation pressure. There are documented examples of relatively recent transfers of genes from the plastid to nuclear genomes, indicating that this still is an ongoing process (18). The example of slr2032 shows that loss of endosymbiont genes from the nuclear genome also is still an ongoing process, at least in lineages that are under pressure for small genome size.
A recent genomic comparison study (19) concluded that the dominating factors in the divergence of the dicots have been repeated rounds of large-scale genome duplication followed by selective gene loss, and that the rate of gene loss seems to have been greater in the Arabidopsis lineage than in the L. esculentum lineage. The results presented here support this conclusion, and the completion of the Arabidopsis genome allows us to begin to catalog specific gene-loss events. This paper concentrates on clear cases of gene loss, but one hypothesis that would be consistent with the large number of contigs in the M. truncatula, G. max, and L. esculentum gene indices that lack detectable homologs in any of the Arabidopsis databases is that novel gene acquisition also has been an important force in angiosperm speciation. Testing this hypothesis will require further comparative genomic work. It is likely true that the high tolerance of plants for large-scale genomic duplication has been a powerful factor in novel gene evolution. The central question motivating this study is the degree to which genomic diversity accounts for the observable diversity of angiosperms. This paper provides a first step in answering this question by demonstrating specific examples of diversity of gene content. This sort of analysis will be important in understanding plant evolution, as well as in evaluating the choice of any one species as a model system. The results presented here show that Arabidopsis, as a model system, is broadly representative of the dicots, with the caveat that significant numbers of genes from any given dicot species (perhaps as much as 15%, Table 1) may be missing from Arabidopsis.
Acknowledgments
I thank Marie Coffin, Keith Davis, Lori Dircks, Carol Hamilton, and Ted Slater for critical reading of the manuscript. I also thank the Paracel Corporation for generously allowing use of GENEMATCHER 2 as part of a very much extended beta-testing program. The work reported here would not have been possible without this server.
Abbreviations
- EST: expressed sequence tag
- PPO: polyphenol oxidase
- ODC: ornithine decarboxylase
- ADC: arginine decarboxylase
Footnotes
This paper was submitted directly (Track II) to the PNAS office.
References
- 1. The Arabidopsis Genome Initiative. Nature (London) 2000;408:796–815. doi: 10.1038/35048692.
- 2. Soltis P S, Soltis D E, Chase M W. Nature (London) 1999;402:402–404. doi: 10.1038/46528.
- 3. Gandolfo M A, Nixon K C, Crepet W L. Am J Bot. 1998;85:964–974.
- 4. Yang Y-W, Lai K-N, Tai P-Y, Li W-H. J Mol Evol. 1999;48:597–604. doi: 10.1007/pl00006502.
- 5. Wolfe K H, Gouy M, Yang Y-W, Sharp P M, Li W-H. Proc Natl Acad Sci USA. 1989;86:6201–6205. doi: 10.1073/pnas.86.16.6201.
- 6. Huang X, Madan A. Genome Res. 1999;9:868–877. doi: 10.1101/gr.9.9.868.
- 7. Newman S M, Eannetta N T, Yu H, Prince J P, de Vicente M C, Tanksley S D, Steffens J C. Plant Mol Biol. 1993;21:1035–1051. doi: 10.1007/BF00023601.
- 8. Walker J R L, Ferrar P H. Biotechnol Genet Eng Rev. 1998;15:457–498. doi: 10.1080/02648725.1998.10647966.
- 9. Davis R H, Morris D R, Coffino P. Microbiol Rev. 1992;56:280–290. doi: 10.1128/mr.56.2.280-290.1992.
- 10. Malmberg R L, Watson M B, Galloway G L, Yu W. Crit Rev Plant Sci. 1998;17:199–224.
- 11. Imanishi S, Hashizume K, Nakakita M, Kojima H, Matsubayashi Y, Hashimoto T, Sakagami Y, Yamada Y, Nakamura K. Plant Mol Biol. 1998;38:1101–1111. doi: 10.1023/a:1006058700949.
- 12. Thompson J D, Higgins D G, Gibson T J. Nucleic Acids Res. 1994;22:4673–4680. doi: 10.1093/nar/22.22.4673.
- 13. Eddy S R. Curr Opin Struct Biol. 1996;6:361–365. doi: 10.1016/s0959-440x(96)80056-x.
- 14. Birecka H, Bitonti A J, McCann P P. Plant Physiol. 1985;79:509–514. doi: 10.1104/pp.79.2.509.
- 15. Watson M B, Emory K K, Piatak R M, Malmberg R L. Plant J. 1998;13:231–239. doi: 10.1046/j.1365-313x.1998.00027.x.
- 16. Martin W, Stoebe B, Goremykin V, Hansmann S, Hasegawa M, Kowallik K V. Nature (London) 1998;393:162–165. doi: 10.1038/30234.
- 17. Martin W, Herrmann R G. Plant Physiol. 1998;118:9–17. doi: 10.1104/pp.118.1.9.
- 18. Wakasugi T, Tsudzuki J, Ito S, Nakashima K, Tsudzuki T, Sugiura M. Proc Natl Acad Sci USA. 1994;91:9794–9798. doi: 10.1073/pnas.91.21.9794.
- 19. Ku H M, Vision T, Liu J, Tanksley S D. Proc Natl Acad Sci USA. 2000;97:9121–9126. doi: 10.1073/pnas.160271297.