Abstract
About 25% of the genes in the fully sequenced and annotated Arabidopsis genome have structures that are predicted solely by computer algorithms with no support from either nucleic acid or protein homologs from other species or expressed sequence matches from Arabidopsis. These are referred to as “hypothetical genes.” On chromosome 2, sequenced by The Institute for Genomic Research, there are approximately 800 hypothetical genes among a total of approximately 4,100 genes. To test their expression under various growth conditions and in specific tissues, we used six cDNA populations prepared from cold-treated, heat-treated, and pathogen (Xanthomonas campestris pv campestris)-infected plants, callus, roots, and young seedlings. To date, 169 hypothetical genes have been tested, and 138 of them were found to be expressed in one or more of the six cDNA populations. By sequencing multiple clones from each 5′- and 3′-rapid amplification of cDNA ends (RACE) product and assembling the sequences, we generated full-length sequences for 16 of these genes. For 14 genes, there was one full-length assembly that precisely supported the intron-exon boundaries of their gene predictions, adding only 5′- and 3′-untranslated region sequences. However, for three of these genes, the other assemblies represent additional exons and alternatively spliced or unspliced introns. For the remaining two genes, the cDNA sequences reveal major differences from the predicted gene structures. In addition, a total of six genes displayed more than one polyadenylation site. These data will be used to update gene models in The Institute for Genomic Research annotation database ATH1.
With the combined efforts of scientists from Europe, Japan, and the United States, the first higher plant genome-sequencing project, the whole-genome sequencing of Arabidopsis, has been completed. The sequences of chromosomes 2 and 4 were first released in 1999 (Lin et al., 1999; Mayer et al., 1999), and the remaining three chromosomes were sequenced by the end of 2000 (Salanoubat et al., 2000; Tabata et al., 2000; Theologis et al., 2000). This provides scientists with a wealth of information and knowledge with which to understand plant biology from a genomic perspective. The whole Arabidopsis genome encodes approximately 25,000 genes (The Arabidopsis Genome Initiative, 2000) and the functional analysis of these genes is a major challenge in this post-sequencing era. One approach to this, taken by several groups (Ceres, Stanford Genome Center, Salk Institute, Plant Gene Expression Center [in collaboration with RIKEN Genomic Sciences Center], and Institut National de la Recherche Agronomique/Genoplante) is to produce full-length cDNAs for all of the 25,000+ genes in the Arabidopsis genome, because these complete sequences are essential to fully understand their structure and function (Seki et al., 2001a, 2001b). Recent comparison of full-length cDNAs from Ceres and SSP with previous genomic annotation revealed that the structures of about one-third of the genes could be improved based on cDNA sequence (Haas et al., 2002). To complement these large-scale, undirected cloning and sequencing efforts, we are focusing our attention on a special group of genes that are not represented in current expressed sequence tag (EST) collections and are therefore least likely to be sequenced by the large-scale public efforts, thus requiring a targeted approach.
This group of “hypothetical genes” is predicted only by ab initio computer algorithms such as Genscan (Burge and Karlin, 1997; now updated to Genscan+), Genemark.hmm (Lukashin and Borodovsky, 1998), and various splice site prediction programs (Uberbacher and Mural, 1991; Hebsgaard et al., 1996; Brendel and Kleffe, 1998) and have no database support.
Our goal is cloning, sequencing, and functional analysis of full-length cDNAs from these hypothetical genes. These genes represent the most obscure and challenging set, because there is no experimental evidence for the validity of the gene models, for the functions of the encoded proteins, or for the transcription of that particular region of the genome as evidenced by EST sequences in public databases. Possible reasons for the apparent lack of expression of these hypothetical genes include: (a) the limited variety of tissues from which current EST libraries have been made; (b) the low level of expression of the genes and/or their restriction to a limited number of cells or tissues or to specific environmental/experimental conditions that have not yet been explored; and (c) an invalid gene prediction. Therefore, a full-length cDNA will provide the most robust method to validate a gene prediction, both by demonstrating its expression as a cDNA and by providing sequence information through which the details of the gene model can be confirmed, appropriately modified, or refuted. After this, their function in vivo can be analyzed by examination of their expression pattern and the plant phenotypes created by overexpression and expression inhibition.
Our target list is all of the hypothetical genes on chromosome 2, which, at the time of completion and annotation, numbered 1,094 (Lin et al., 1999). However, continued EST sequencing and the ongoing full-length cDNA efforts, which are using a more comprehensive range of tissues and experimental treatments, have produced cognate sequences for some genes that were originally annotated as hypothetical. Thus the number of genes that should be considered as hypothetical based on the above criteria is currently approximately 800 and will probably continue to decrease slightly as full-length cDNA sequencing progresses.
Expression analysis using microarrays is also of value in determining the possible expression of hypothetical genes. Using Arabidopsis chromosome 2 microarrays, H. Kim, E. Snesrud, and J. Quackenbush (unpublished data) at The Institute for Genomic Research (TIGR) have examined gene expression from various tissues and under a wide range of treatments. Their results show that hybridization can be detected to approximately 81% of the spots representing hypothetical genes for which no cDNA or EST sequence exists, indicating that either these genes or their paralogs are expressed in one or more of the conditions used. We intend to use this expression information to target first genes with the lowest (or zero) levels of expression, because these are least likely to be captured in the large-scale sequencing efforts.
To test the expression of hypothetical genes, we are currently using six cDNA populations prepared from cold-treated, heat-treated, and pathogen (Xanthomonas campestris pv campestris)-infected plants, callus, roots, and young seedlings. The results to date indicate about 82% of the hypothetical genes tested are expressed in one or more of our cDNA populations. Sixteen full-length cDNA sequences were obtained by sequencing multiple clones of the 5′- and 3′-RACE products from each hypothetical gene and assembling their sequences. Comparison of the assemblies for each gene with their in silico predictions showed that 14 genes have at least one full-length cDNA assembly consistent with the predictions, only adding some 5′- and 3′-untranslated region (UTR) sequences, whereas the remaining two genes had major differences from the predicted gene structure. However, among the 14 genes consistent with their predictions, three of them also display different forms of the cDNA for the same gene including instances of alternative splice sites and unspliced introns, and in two others, introns were discovered in the UTRs. Six genes showed evidence for more than one polyadenylation site. Therefore, cloning and sequencing the full-length cDNAs of hypothetical genes not only provides evidence of expression, sequence information, and a foundation for functional studies of the hypothetical genes of Arabidopsis, but also could have some bearing on the regulation of their activities.
RESULTS
The original gene annotations for Arabidopsis chromosome 2 (Lin et al., 1999) served as the starting point of this study and can be found in their original form in the Plant division of GenBank. The TIGR version (http://www.tigr.org/tdb/e2k1/ath1/ath1.shtml) incorporates minor updates based on new evidence and was used for checking the status of selected genes before primer design. The 1999 annotation contained 1,094 hypothetical genes for which there was no experimental evidence for their existence and expression. Our goal is to test their expression, to obtain full-length sequence for all of them, and to analyze their function in vivo. In this pioneer study, a total of 169 hypothetical genes were analyzed for expression in six cDNA populations. Sixteen full-length cDNA sequences were obtained, which are discussed in detail below.
Selection of Hypothetical Genes
At the outset of the project, we grouped the then incomplete Arabidopsis proteome into gene families using the Geanfammer/Divclus approach (Park and Teichmann, 1998). From these single-linkage clusters, we identified families of genes that contained primarily or entirely hypothetical genes and selected one or two members of each family. We focused this analysis on chromosome 2 because this had been annotated at TIGR and because we were most familiar with the criteria used for the “hypothetical” assignment. Before primer design and experimentation, we checked each hypothetical assignment by searching the target gene against GenBank, because in some cases, sequences released since the original annotation (which took place over 2 to 3 years) may provide experimental support for the gene's expression and move it from the hypothetical to the “unknown” category. Only genes for which there was still no experimental support were pursued. As a broad class, hypothetical genes show a similar range of size and intron/exon composition to the remainder of the genome. Average transcript size is around 900 bases, with outliers more than 5 kb in length.
Expression Pattern of the Tested Hypothetical Genes
For each hypothetical gene, gene-specific primers for both 5′- and 3′-RACE (GSP1 and GSP2) were designed based on the predicted coding region. If the hypothetical gene was present in the cDNA population, a 200- to 500-bp PCR product would be amplified when the gene-specific primers were used in combination (Fig. 1). Six cDNA populations were constructed from cold-treated, heat-treated, and pathogen (X. campestris pv campestris)-treated plants, tissue culture (induced callus), roots, and young seedlings. The GSP primers were used in combination to examine expression of the hypothetical gene in the six cDNA populations (Fig. 2). Among the 12 hypothetical gene tests shown in Figure 2, PCR products for seven genes (At2g39830, At2g40520, At2g41150, At2g41050, At2g41350, At2g42330, and At2g42370) were amplified from all six cDNA populations, whereas the remaining five genes were shown to be expressed in a variety of different tissues (summarized in Table I).
Table I.
Columns 2 through 7 are the cDNA populations.
Hypothetical Gene | Young seedling | Tissue culture | Roots | Pathogen-infected | Heat-treated | Cold-treated |
---|---|---|---|---|---|---|
At2g39440 | − | − | + | − | + | + |
At2g39790 | − | − | + | − | − | − |
At2g39830 | + | + | + | + | + | + |
At2g39920 | + | − | + | + | + | + |
At2g40390 | + | − | + | + | + | + |
At2g40520 | + | + | + | + | + | + |
At2g41050 | + | + | + | + | + | + |
At2g41150 | + | + | + | + | + | + |
At2g41350 | + | + | + | + | + | + |
At2g42140 | − | + | + | + | + | + |
At2g42330 | + | + | + | + | + | + |
At2g42370 | + | + | + | + | + | + |
For eight of the genes in Figure 2 (At2g39440, At2g39790, At2g39830, At2g39920, At2g40520, At2g41150, At2g41050, and At2g41350), the amplification product sizes from the cDNA populations were smaller than those amplified from genomic DNA, indicating that introns had been spliced from the transcripts of these hypothetical genes. To date, 169 hypothetical genes have been tested for gene expression in the six cDNA populations. One hundred and thirty-eight of them showed expression in one or more cDNA populations, with 70 genes being expressed in all six cDNA populations and 31 showing no expression in any of the cDNA populations (Table II). Thus, most of the hypothetical genes (82% of genes tested) are expressed in Arabidopsis. All of the expression results of these 169 hypothetical genes in our six cDNA populations are shown in Supplemental Data Table I, which can be viewed at www.plantphysiol.org.
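The size comparison between cDNA-derived and genomic amplification products comes down to coordinate arithmetic. The following is a minimal sketch under stated assumptions: the function name, the primer coordinates, and the intron positions are hypothetical illustrations and do not correspond to any gene in this study. It also shows why a primer placed inside an intron yields no product from spliced cDNA.

```python
def amplicon_sizes(fwd, rev, introns):
    """Return (genomic_bp, cdna_bp) for one primer pair.

    fwd, rev : (start, end) 1-based inclusive genomic coordinates of the
               forward and reverse primer binding sites (hypothetical values).
    introns  : list of (start, end) 1-based inclusive intron coordinates.

    cdna_bp is None when an intron overlaps a primer binding site, because
    that primer cannot anneal to the spliced transcript.
    """
    genomic = rev[1] - fwd[0] + 1
    removed = 0
    for start, end in introns:
        # Does this intron overlap either primer binding site?
        if any(start <= p_end and end >= p_start for p_start, p_end in (fwd, rev)):
            return genomic, None
        # Introns lying wholly between the primers are absent from the cDNA product.
        if fwd[1] < start and end < rev[0]:
            removed += end - start + 1
    return genomic, genomic - removed


# Hypothetical example: a 500-bp genomic amplicon containing one 100-bp intron.
print(amplicon_sizes((101, 120), (581, 600), [(201, 300)]))
# Hypothetical example: the forward primer falls inside a mispredicted intron.
print(amplicon_sizes((101, 120), (581, 600), [(110, 250)]))
```

A cDNA product shorter than the genomic product is therefore the expected signature of a spliced transcript, whereas equal sizes alone cannot distinguish an intronless transcript from genomic DNA contamination.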
Table II.
Tissue/Treatment | Young Plants | Tissue Culture | Roots | Pathogen-Infected | Heat-Treated | Cold-Treated | No Expression |
---|---|---|---|---|---|---|---|
Number of genes expressed | 108 | 83 | 106 | 98 | 113 | 112 | 31 |
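Per-population tallies of the kind reported in Table II can be derived from a presence/absence matrix. As an illustration only, the sketch below applies this to the 12 genes of Table I (not the full set of 169, whose calls are in the supplemental data); the dictionary layout and function name are assumptions made for this example.

```python
# Column order follows Table I.
POPULATIONS = [
    "young seedling", "tissue culture", "roots",
    "pathogen-infected", "heat-treated", "cold-treated",
]

# Presence (+) / absence (-) calls transcribed from Table I.
TABLE_I = {
    "At2g39440": "--+-++",
    "At2g39790": "--+---",
    "At2g39830": "++++++",
    "At2g39920": "+-++++",
    "At2g40390": "+-++++",
    "At2g40520": "++++++",
    "At2g41050": "++++++",
    "At2g41150": "++++++",
    "At2g41350": "++++++",
    "At2g42140": "-+++++",
    "At2g42330": "++++++",
    "At2g42370": "++++++",
}


def summarize(calls):
    """Count expressed genes per population and list genes seen in all or none."""
    per_population = {
        name: sum(row[i] == "+" for row in calls.values())
        for i, name in enumerate(POPULATIONS)
    }
    in_all = sorted(g for g, row in calls.items() if set(row) == {"+"})
    in_none = sorted(g for g, row in calls.items() if "+" not in row)
    return per_population, in_all, in_none


per_population, in_all, in_none = summarize(TABLE_I)
print(per_population)
print(len(in_all), "genes in all six populations;", len(in_none), "in none")

# Overall fraction reported in the text: 138 of 169 genes tested.
print(round(100 * 138 / 169), "% of tested genes expressed")
```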
Generation of Full-Length cDNA Sequences
For each target gene, a cDNA population showing strong expression was used as template for 5′- and 3′-RACE (Frohman et al., 1988). Five clones each from the 5′ and 3′ reactions were sequenced from both ends with approximately 75% success rate yielding approximately 15 sequences representing the cDNAs from that gene, which were then run through TIGR assembler (Sutton et al., 1995). For clarity, assemblies arising from 5′ and 3′ clones are represented separately in Figure 3. In some cases, 5′ sequences formed a single assembly and 3′ sequences formed a single assembly with a common overlapping region, which represents the full-length transcript of that gene. However, in many cases, the output of TIGR assembler was several 5′ or 3′ assemblies (Fig. 3). These arise because TIGR assembler will not assemble individual sequence reads if there are either internal mismatches of more than 2.5% or mismatched overhangs of more than 10 bases at the end. For each gene, all of the individual assemblies were aligned, along with the predicted cDNA, against the genomic sequence using dds/gap2 (Huang et al., 1997) for inspection. The results revealed that multiple assemblies arose from a single target gene because of the presence of alternate splice and/or polyadenylation isoforms in the collection of sequences from a single gene. Thus the RACE products from more than one distinct transcript from the same gene had been cloned and sequenced from a single cDNA preparation. In this study, we obtained full-length cDNA sequences of 16 hypothetical genes from different cDNA populations (Table III). They displayed some interesting features, such as differences with their predictions, alternately spliced variants, and multiple polyadenylation sites. The GenBank accession numbers of the full-length cDNA sequences and the predicted transcripts are shown in Table III.
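The read count and the merge thresholds described above can be expressed as simple arithmetic. The sketch below is not TIGR assembler code and does not reproduce its algorithm; it only encodes the two stated thresholds (more than 2.5% internal mismatches, or more than 10 bases of mismatched overhang) as a hypothetical predicate, with function and parameter names invented for illustration.

```python
MAX_MISMATCH_FRACTION = 0.025  # more than 2.5% internal mismatches blocks a merge
MAX_OVERHANG_BASES = 10        # more than 10 mismatched overhanging bases blocks a merge


def would_merge(aligned_length, mismatches, mismatched_overhang):
    """Hypothetical check of whether two reads satisfy both stated thresholds."""
    if aligned_length <= 0:
        return False
    within_identity = mismatches <= MAX_MISMATCH_FRACTION * aligned_length
    within_overhang = mismatched_overhang <= MAX_OVERHANG_BASES
    return within_identity and within_overhang


def expected_reads(clones_per_reaction=5, reactions=2, ends_per_clone=2, success_rate=0.75):
    """Expected number of usable reads per gene from the 5' and 3' RACE clones."""
    return clones_per_reaction * reactions * ends_per_clone * success_rate


print(expected_reads())          # 5 clones x 2 reactions x 2 ends x 0.75
print(would_merge(400, 8, 0))    # 2% mismatches, no overhang
print(would_merge(400, 12, 0))   # 3% mismatches
print(would_merge(400, 0, 11))   # overhang just over the limit
```

Under these thresholds, reads from transcripts that differ by a retained intron or a shifted splice site would be expected to fall into separate assemblies, which is consistent with the multiple assemblies observed for single genes.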
Table III.
| Pub_locus | Used cDNA Population | Assemblies | Gene Structure Highlights | ORF Length (a) | Poly(A) Site (b) | GenBank Accession Nos. of cDNAs | GenBank Accession Nos. of Predicted Genes |
|---|---|---|---|---|---|---|---|
| At2g02540 | Heat-treated | Tmp5-1 | 5′-UTR with an intron | 331 | | AY091515 | NM_126310 |
| | | Tmp3-1 | Poly(A) site 1 | | 47 nt | | |
| | | Tmp3-2 | Poly(A) site 2 | | 75 | | |
| At2g03620 | Roots | Tmp5-1 | 5′-UTR | 422 | | AF499434 | NM_126412 |
| | | Tmp3-1 | Poly(A) site 1 | | 126 | | |
| | | Tmp3-2 | Poly(A) site 2 | | 150 | | |
| | | Tmp3-3 | Poly(A) site 3 | | 208 | | |
| At2g05590 | Tissue culture | Tmp5-1 | 5′-UTR | | | AF345342 | NM_126582 |
| | | Tmp3-1 | Two additional exons, predicted stop codon within an intron, poly(A) site 1 | 303 | 211 | | |
| | | Tmp3-2 | As Tmp3-1, alternative splice site for intron 1, poly(A) site 2 | 164 | 249 | | |
| At2g15220 | Tissue culture | Tmp5-1 | 5′-UTR | 226 | | AF345341 | NM_127083 |
| At2g15760 | Tissue culture | Tmp5-1 | 5′-UTR | 316 | | AF345343 | NM_127138 |
| At2g17570 | Tissue culture | Tmp5-1 | 5′-UTR with an intron | 296 | | AF499435 | NM_127311 |
| At2g19180 | Cold-treated | Tmp3-1 | 3′-UTR with an intron | 180 | | AY102540 | NM_127475 |
| At2g19870 | Young seedling | Tmp5-1 | 5′-UTR | 578 | | AF499436 | NM_127545 |
| At2g23050 | Tissue culture | Tmp5-1 | 5′-UTR | 481 | | AY102541 | NM_127869 |
| | | Tmp3-1 | Poly(A) site 1 | | 164 | | |
| | | Tmp3-2 | Poly(A) site 2 | | 187 | | |
| | | Tmp3-3 | Unspliced intron 3 | 438 | ? | | |
| At2g23370 | Cold-treated | Tmp5-1 | Extra introns and exons, predicted start codon within an intron | 340 | | AY102542 | NM_127901 |
| | | Tmp3-1 | Poly(A) site 1 | | 132 | | |
| | | Tmp3-2 | Poly(A) site 2 | | 163 | | |
| At2g23790 | Cold-treated | Tmp5-1 | 5′-UTR | 337 | | AY145900 | NM_127942 |
| At2g23940 | Cold-treated | Tmp5-1 | 5′-UTR | 174 | | AY102543 | NM_127955 |
| At2g24440 | Cold-treated | Tmp5-1 | 5′-UTR | 183 | | AY102544 | NM_128005 |
| | | Tmp5-2 | Alternative splice site for intron 1 | 105 | | | |
| At2g41660 | Young seedling | Tmp5-1 | 5′-UTR | 298 | | AF345340 | NM_129729 |
| At2g42430 | Tissue culture | Tmp5-1 | 5′-UTR | 246 | | AF345339 | NM_129804 |
| At2g44220 | Roots | Tmp5-1 | 5′-UTR | | | AY102564 | NM_129986 |
| | | Tmp3-1 | Poly(A) site 1 | 393 | 117 | | |
| | | Tmp3-2 | Unspliced intron 6, poly(A) site 2 | 270 | 144 | | |

(a) ORF length is from the sequence merged by 5′ assembly and 3′ assembly.
(b) Distance of poly(A) sites from stop codon.
Gene Structures with Major Differences from the Gene Predictions
At2g05590 and At2g23370 have major differences from their gene predictions (Fig. 3). At2g05590 is represented by one assembly (Tmp5-1) from the 5′ end and two assemblies (Tmp3-1 and Tmp3-2) from the 3′ end. The 5′ assembly agrees with the predicted intron-exon boundaries, adding only 5′-UTR sequence. The two 3′ assemblies (Tmp3-1 and Tmp3-2) contain two additional exons compared with the predicted structure and also shorten the predicted exon 6 so that the originally predicted stop codon now falls in an intron. Tmp3-1 and Tmp3-2 show different poly(A) sites (Figs. 3 and 4). In addition, Tmp3-2 disagrees with both the gene prediction and Tmp3-1 at the 3′ splice site of intron 2, in that the splice acceptor position is 5 bp upstream and gives rise to a smaller intron 2 (Figs. 3 and 5). The full-length cDNA constructed from Tmp5-1 and Tmp3-1 encodes an open reading frame that terminates in the last exon, resulting in a 303-amino acid peptide compared with a 263-amino acid peptide encoded by the predicted gene (Fig. 3). However, the cDNA constructed from Tmp5-1 and Tmp3-2 contains a stop codon just 3 amino acids downstream of the 3′ splice site of intron 2, giving rise to a truncated 164-amino acid peptide (Fig. 5). At2g23370 is represented by one 5′ assembly (Tmp5-1) and two 3′ assemblies (Tmp3-1 and Tmp3-2) because of different poly(A) sites. These assemblies reveal that the actual gene structure is very different from that predicted, showing 11 exons compared with the four predicted. Furthermore, only four of the six predicted splice sites are supported by the experimental cDNA sequence. Six of the additional exons lie upstream of the predicted ATG, which itself lies in intron 6 of Tmp5-1 (Fig. 3). The largest open reading frame of this cDNA assembly encodes a protein of 340 amino acids in contrast to the 175 amino acids encoded by the computationally predicted gene model.
Examples of Alternately Spliced or Unspliced Introns
Among the genes for which full-length cDNA assemblies were obtained, At2g05590, At2g23050, At2g24440, and At2g44220 display assemblies with alternative splice sites and unspliced introns (Table III). As mentioned previously, At2g05590 shows major disagreements with its predicted structure, including an alternative 3′ splice site for intron 2 in the Tmp3-2 assembly, which gives rise to a truncated peptide. At2g23050 is represented by three 3′ end assemblies and in one of them (Tmp3-3), intron 3 is unspliced (Figs. 3 and 5). The two cDNA sequences formed by combining the 5′ (Tmp5-1) and 3′ (Tmp3-1 or Tmp3-2) assemblies match the prediction, but with different poly(A) sites, and encode a 481 amino acid peptide, whereas the full-length assembly formed by Tmp5-1 and Tmp3-3 with the unspliced intron 3 encodes a peptide of 438 amino acids. At2g24440 produced two 5′ assemblies (Tmp5-1 and Tmp5-2) and one 3′ assembly (Tmp3-1). Merging of Tmp5-1 and Tmp3-1 produces a cDNA entirely consistent with the prediction, encoding a 183-amino acid peptide and adding 5′- and 3′-UTR sequences. In the second 5′ assembly (Tmp5-2), the first intron uses a different splice donor site 11 bp downstream of the predicted site, which results in a truncated peptide of 105 amino acids when merged with Tmp3-1 (Figs. 3 and 5). There is also a single-base mismatch between the genomic sequence and the 11-bp extended 5′ exon in Tmp5-2 (Fig. 5). At2g44220 has one 5′ assembly (Tmp5-1) and two 3′ assemblies (Tmp3-1 and Tmp3-2; Fig. 3). When Tmp5-1 and Tmp3-1 are merged together, a full-length cDNA is generated that matches the predicted intron-exon boundaries and encodes a 393-amino acid protein. However, the other 3′ assembly (Tmp3-2) contains an unspliced intron 6 (Figs. 3 and 5). If Tmp5-1 is merged with Tmp3-2, a stop codon is created within that unspliced intron 6 resulting in a truncated 270-amino acid peptide (Figs. 3 and 5). 
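The way a retained intron can shorten the encoded peptide, as described for At2g44220, can be illustrated with a toy calculation. The sequences below are invented for illustration and are not taken from any gene in this study; the helper name is likewise hypothetical. The sketch counts codons from the first ATG to the first in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_length_aa(cdna):
    """Number of codons from the first ATG up to, but not including, the first
    in-frame stop codon. Returns None if no ATG or no in-frame stop is found."""
    start = cdna.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(cdna) - 2, 3):
        if cdna[i:i + 3] in STOP_CODONS:
            return (i - start) // 3
    return None


# Toy sequences (hypothetical): the intron begins with GT, ends with AG,
# and happens to contain an in-frame TGA when it is left unspliced.
EXON_1 = "ATGGCAGCA"
INTRON = "GTATGATTTCAG"
EXON_2 = "GGTGGTTAA"

spliced = EXON_1 + EXON_2
retained = EXON_1 + INTRON + EXON_2

print(orf_length_aa(spliced))   # ORF of the fully spliced transcript
print(orf_length_aa(retained))  # ORF when the intron is retained
```

In this toy case the spliced transcript encodes five residues and the intron-retaining transcript four, mirroring on a small scale the reduction from 393 to 270 amino acids reported for At2g44220.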
All of the intron structures for At2g05590 and At2g24440 that differ from the intron-exon borders in the corresponding predicted gene models nonetheless comply with the conserved splice site dinucleotides 5′-GT and AG-3′ (Fig. 5).
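Compliance with the conserved GT and AG dinucleotides is straightforward to check once intron coordinates are known. The following minimal sketch uses an invented toy sequence with two candidate donor sites 11 bp apart; the sequence, coordinates, and function name are all assumptions made for illustration and do not represent At2g24440 or any other gene.

```python
def has_canonical_splice_sites(genomic, intron_start, intron_end):
    """True if the intron begins with GT and ends with AG.

    Coordinates are 0-based and end-exclusive on the sense strand.
    """
    intron = genomic[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy locus (hypothetical): exon 1, an 11-bp stretch starting at the upstream
# donor, the rest of the intron starting at the downstream donor, then exon 2.
EXON_1 = "ATGGCAGCA"          # positions 0-8
UPSTREAM_PART = "GTAAGCTTCAA"  # positions 9-19, begins with donor GT
DOWNSTREAM_PART = "GTTTTTGCAG"  # positions 20-29, begins with GT, ends with AG
EXON_2 = "GGTGGTTAA"
locus = EXON_1 + UPSTREAM_PART + DOWNSTREAM_PART + EXON_2

print(has_canonical_splice_sites(locus, 9, 30))   # upstream donor
print(has_canonical_splice_sites(locus, 20, 30))  # donor 11 bp downstream
print(has_canonical_splice_sites(locus, 10, 30))  # off-by-one boundary
```

Both donor choices in the toy locus satisfy the rule, whereas a boundary shifted by a single base does not, which is why agreement with GT and AG is a useful sanity check on cDNA-derived intron boundaries.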
Multiple Polyadenylation Sites
At2g02540, At2g05590, At2g23050, At2g23370, and At2g44220 are all represented by two 3′ assemblies because of the presence of two different polyadenylation sites. There are three 3′ assemblies for At2g03620, and each has a different polyadenylation site (Figs. 3 and 4).
Other Variations
The cDNA assemblies from the remaining genes (At2g15220, At2g15760, At2g17570, At2g19180, At2g19870, At2g23790, At2g23940, At2g41660, and At2g42430) precisely match the predicted gene structures, merely adding 5′- and 3′-UTRs. At2g17570 and At2g19180 have previously unannotated introns in their 5′- and 3′-UTRs, respectively.
The results of all of these comparisons are summarized in Table III. Disregarding simple 5′ or 3′ sequence extension and the multiple poly(A) sites, for five of the 16 genes examined, the existing prediction needed either to be modified with different splice sites and/or exons or to be augmented with evidence for alternative splicing. These results extend our experience in using the Ceres and SSP cDNAs to validate gene models, where approximately 35% of the models required some modification following comparison with full-length cDNAs (Haas et al., 2002). However, with only 16 genes examined, it may be premature to conclude that current models for hypothetical genes are more likely to contain errors than those for more highly expressed genes, where the gene model/annotation most likely either incorporated or was supported by database matches.
DISCUSSION
Expression Analysis of the Hypothetical Genes
Although hypothetical genes are the group of genes with no experimental evidence for either their structure or expression, among the 169 hypothetical genes on chromosome 2 examined to date, about 82% (138 of 169) were expressed in one or more of the six cDNA populations tested. The range of Arabidopsis tissues used to date in EST and cDNA sequencing projects is shown in Table IV. There are some overlaps between the tissues listed in Table IV and those selected for this study (e.g. roots and young plant). The absence of transcripts for hypothetical genes from the EST and cDNA sequencing projects to date could be attributable to low levels of expression so that transcripts are missed by these undirected, large-scale efforts.
Table IV.
Tissue Types | EST Nos. |
---|---|
Various treatments (RAFL) (a) | 51,639 |
Mixed tissues | 35,201 |
Root | 21,466 |
Aboveground organs | 15,793 |
Green siliques | 13,434 |
Seed | 11,247 |
Flower buds | 6,939 |
Rosette | 2,351 |
Seedling hypocotyl, 3 d | 2,045 |
Liquid-cultured seedlings | 1,868 |
Seedlings | 1,639 |

(a) Dehydration, cold, high salt, abscisic acid, heat, and UV, dark-grown plants.
There are several possible reasons for the inability to generate amplification products from any of the cDNA populations for 31 of 169 (18%) hypothetical genes tested. (a) The gene prediction is incorrect; there is actually no coding sequence present at this region in the genome. (b) There is an expressed gene at the predicted region, but one or more exon predictions are incorrect, so that one or both of the GSP primers lie within an intron and thus cannot amplify the spliced hypothetical gene transcript. Approximately one-third of the predicted gene models compared with Ceres cDNAs required some modification (Haas et al., 2002). Therefore, primers for these 31 genes that were not amplified will be placed at other locations and further attempts will be made to amplify products from cDNA. (c) The hypothetical genes are not expressed at a sufficient level in any of the six cDNA populations selected in this study to permit amplification or may represent nonexpressed pseudogenes.
The range of tissues used for cDNA preparations is now being expanded to increase the chances of capturing cDNAs for most, if not all, of the hypothetical genes. Tissues that are not well represented in current EST and cDNA sequencing projects will be selected, including plants that have been infected with Pseudomonas syringae pv tomato (which, in contrast to X. campestris infection, induces a hypersensitive response), hormone-treated, or exposed to drought, salt, UV, and H2O2 stress.
An alternative strategy to examine the expression of hypothetical genes is to use microarray data. TIGR has constructed a 9,000+-element microarray that represents, in duplicate, all of the predicted genes on chromosome 2 as 3′-biased amplicons approximately 1 kb in size derived from genomic DNA. RNA isolated from seedlings, seedling roots, young and mature leaves, various aerial tissues, flowers, callus tissue, heat-, cold-, salt-, and hydrogen peroxide-stressed plants, and P. syringae- and X. campestris-infected leaves was used in the hybridizations. There are 200 hypothetical genes showing no evidence of expression by microarray analysis in any of the tissues so far examined (H. Kim, E. Snesrud, and J. Quackenbush, personal communication). Like our PCR data, these results also indicate that most of the hypothetical genes (approximately 82%) are expressed in different tissues or treatments. Fifty-five of the hypothetical genes with no microarray evidence of expression have already been examined by our PCR method on the six cDNA populations. Thirty-one show expression in one or more cDNA populations, and 12 of these genes show expression in all six cDNA populations. Thus, it is possible to detect expression of and capture the full-length cDNAs for genes that are expressed at levels below microarray detection. Twenty-two of the 31 hypothetical genes that showed no expression in the cDNA populations tested in this study conversely showed expression on microarrays using probes prepared from tissues that were also represented in our cDNA populations. Because our results described above demonstrate that genes expressed at undetectable levels on microarrays can be amplified, it seems likely that failure to amplify these genes is attributable to inaccurate gene predictions that resulted in the placement of primer(s) in regions of the predicted gene that are actually spliced out of the final transcript.
Analysis of the Full-Length cDNA Sequences
Assembling the 5′- and 3′-RACE product nucleotide sequences revealed a number of different forms of full-length cDNA for the same gene, including alternative intron donor or acceptor sites (At2g24440 and At2g05590), unspliced introns (At2g23050 and At2g44220) and multiple polyadenylation sites (At2g02540, At2g03620, At2g05590, At2g23050, At2g23370, and At2g44220) (Fig. 3, and summarized in Table III). It is unknown at this point whether these are genuine alternatively spliced transcripts with biological functions or just some mis-spliced product that will be degraded. There have been several previous reports of alternative splicing in Arabidopsis and some alternatively spliced transcripts do have different biological functions. Alternative splicing was found in the COP1 gene, resulting in the deletion of exon 11 and the generation of a truncated COP1b protein, which functions as a dominant negative regulator of wild-type COP1 function (Zhou et al., 1998). Interestingly, the splicing factor SR1 gene in Arabidopsis is also regulated by alternative splicing, and temperature determines the alternative-splicing ratio. It was proposed that one isoform of SR1 could play a role in cellular adaptation to a high-temperature environment (Lazar and Goodman, 2000). For the Arabidopsis U1 snRNA 70K gene, two distinct transcripts are produced by alternative splicing that give rise to two proteins, only the smaller of which can bind specifically to Arabidopsis U1 snRNA (Golovkin and Reddy, 1996). As with several of the cDNAs identified in this study, multiple polyadenylation sites were also observed in the 3′-UTRs of the U1 snRNA 70K gene (Golovkin and Reddy, 1996). When Kato et al. (1999) analyzed cDNAs on chromosome 1 of Arabidopsis, they found that alternative splicing produced two very similar cDNAs (ZCW32 and CW7). One of the transcripts has an intron donor site 13 bp downstream from that in the other, which generates an in-frame stop codon just after the splice site. 
This alternative splicing pattern is similar to that at intron 1 of At2g24440 (Fig. 3). Another example of alternative splicing occurs in the Arabidopsis Spo11 gene, the yeast homolog of which plays an important role in double-strand break formation at meiosis in yeast (Keeney et al., 1997). In Arabidopsis, there are two SPO11 homologs (AtSPO11-1 and AtSPO11-2), and each has three different polyadenylation sites. RT-PCR demonstrated at least 10 different splicing products from AtSPO11-1, whereas there is only one alternative splicing product from AtSPO11-2 (Hartung and Puchta, 2000). Because the alternately spliced isoforms were originally identified by RACE-PCR of cDNA from a single tissue or treatment, it will be interesting to determine whether the proportion of the different isoforms varies among the collection of cDNA populations available.
Variations in transcript structure by alternative splicing are also known to exist in other plant species. In pumpkin (Cucurbita pepo), two cDNAs are produced by alternative splicing from a single hydroxypyruvate reductase gene (Mano et al., 1999). The two hydroxypyruvate reductase proteins were localized in leaf peroxisomes and the cytosol, respectively, indicating that alternative splicing controls their subcellular localization. This alternative splicing is regulated by light, and the alternative splice site is 17 bp downstream of the predicted intron donor site. In spinach (Spinacia oleracea), there are two cDNA clones encoding stromal and thylakoid-bound ascorbate peroxidase isoenzymes (Ishikawa et al., 1996), which are produced by alternative splicing of two 3′-terminal exons (Ishikawa et al., 1997). In cauliflower (Brassica oleracea), a truncated SRK protein is specifically expressed in stigmata and translated from one of several transcripts, which are generated by a combination of alternative splicing and the use of alternative polyadenylation signals (Giranton et al., 1995). In maize (Zea mays), the ZEMa gene encodes a MADS box-type transcription factor, for which transcripts are present in almost all maize tissues, but specific differentially spliced forms accumulate preferentially in maturing endosperm and leaf (Montag et al., 1995). Alternative splicing also occurs at the untranslated leading exons of the maize Zmhox1a homeobox gene, in that one transcript gives a normal Zmhox1a open reading frame and the other gives an unrelated open reading frame. The alternative gene product, transposon-associated protein, has significant homology to the C terminus of the Mutator transposase (Comelli et al., 1999). The alternate unspliced, intron-containing transcript from the maize Bronze-2 locus was increased 50-fold by cadmium stress on maize seedlings and was proposed to have a role during response to heavy metals (Marrs and Walbot, 1997).
However, the unspliced introns observed in cDNAs from At2g23050 and At2g44220 may either be alternative splicing products or arise from immature nuclear transcripts present in the cDNA populations used for RACE.
For At2g24440, two cDNA isoforms were recovered, which differ at the splice donor site of intron 1 (Fig. 5). The 5′ splice site of intron 1 of Tmp5-2 cDNA is 11 bp downstream from that in Tmp5-1 and the prediction, but the 3′ splice sites of all three are identical. Both isoforms of intron 1 use the conserved GT-AG splice sites. Most surprisingly, within the extra 11 bp in Tmp5-2, there is a single-base pair mismatch with the genomic sequence. The mismatch could be attributable to PCR error in our RACE reactions, although these do include a proofreading polymerase. It also could be attributable to a mutation in the sequenced BAC from which the genomic sequence was derived. Another possibility is that this alternatively spliced cDNA actually arose from the transcript of a mutant At2g24440 allele that existed in the pool of plants from which the RNA was isolated. Previous reports have demonstrated that mutations in genomic sequence can affect a gene's splicing patterns. The Arabidopsis floral homeotic mutant apetala3-1 allele is temperature sensitive and carries a mutation (from A to T) in exon 5 near the 5′ splice site, which causes a temperature-dependent splicing defect and the mutant phenotype (Sablowski and Meyerowitz, 1998; Yi and Jack, 1998). The Arabidopsis det3-1 mutation is attributable to a T to A mutation 32 bp upstream of a putative 3′ splice site, which causes a reduction of the transcript to approximately 50% of the wild-type level (Schumacher et al., 1999). The Arabidopsis cop1-1 allele carries a single-nucleotide change (from G to A) 4 bases upstream from the 3′ splice site of intron 5, which results in exon skipping (Simpson et al., 1998). At2g24440 is expressed in all six cDNA populations and was amplified from cold-treated cDNA.
Therefore, if the Tmp5-2 cDNA of At2g24440 is really from a mutant allele and carries a mutation near the 5′ splice site of intron 1 (from G to A), the single-nucleotide change and the temperature treatment could cause the different splicing pattern of At2g24440. However, more experiments are needed for verification.
Multiple polyadenylation sites have been found previously in different transcripts in plants (Giranton et al., 1995; Golovkin and Reddy, 1996; Hartung and Puchta, 2000). In our results, six of 16 cDNA sequences of hypothetical genes display two or more polyadenylation sites (Fig. 4). All of the different poly(A) sites from each gene are close to each other, the spacing between any two ranging from 23 to 82 bp (Table III). A survey of experimentally validated poly(A) sites reveals that the conservation and use of the canonical AAUAAA element varies widely among yeast, rice (Oryza sativa), Arabidopsis, fruitfly (Drosophila melanogaster), mouse, and human, and is especially weak in plants and yeast (Graber et al., 1999). Only five (At2g15220, At2g23050, At2g23370, At2g23790, and At2g41660) of 16 genes in our study contained the poly(A) signal (AAUAAA) between the predicted stop codon and the farthest poly(A) site, suggesting that the polyadenylation mechanism in plants may be more subtle or variable than in animals.
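The two checks described above (looking for the canonical AAUAAA signal between the stop codon and the farthest poly(A) site, and measuring the spacing between poly(A) sites) can be illustrated with a minimal Python sketch. This is not the procedure used in the study; the function names, the toy sequence, and the site coordinates are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def find_polya_signal(region: str, signal: str = "AATAAA") -> list[int]:
    """Return 0-based start positions of the signal in a 3' region.

    The region is assumed to run from the stop codon to the farthest
    poly(A) site. RNA input (U) is converted to DNA letters (T) first.
    """
    seq = region.upper().replace("U", "T")
    hits = []
    start = seq.find(signal)
    while start != -1:
        hits.append(start)
        start = seq.find(signal, start + 1)  # allow overlapping matches
    return hits


def pairwise_spacings(sites: list[int]) -> list[int]:
    """Return the sorted distances (bp) between every pair of poly(A) sites."""
    return sorted(abs(a - b) for a, b in combinations(sites, 2))


# Toy 3' region (hypothetical, not from any gene in this study).
toy_region = "UGAGCUUUCAAUAAAGUUUGCAUCUUGA"
signal_hits = find_polya_signal(toy_region)

# Hypothetical poly(A) site coordinates for one gene.
toy_spacings = pairwise_spacings([100, 130, 175])
```

An empty list from `find_polya_signal` corresponds to a gene lacking the canonical signal in the region examined.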
Overall, our study of the hypothetical genes on chromosome 2 indicates that more than 80% of the genes predicted purely by computer algorithms (138 of the 169 tested) are actually expressed in one or more of the six cDNA populations tested. In addition, alternative splicing and multiple polyadenylation events are frequently observed for the same hypothetical gene. These observations are valuable not only for validating the predicted gene structures but also for understanding the expression and regulation of these genes.
MATERIALS AND METHODS
Plant Material
Arabidopsis ecotype Columbia-0 plants were subjected to a variety of treatments as described below. After harvesting, the plant tissue was frozen immediately in liquid nitrogen before RNA isolation. For young plant tissue, seeds were sown on Redimix, transferred to 4°C for 4 d, and then grown at 25°C and 24-h photoperiod for 3 weeks. The aerial parts were harvested for RNA isolation. For heat and cold shock, plants were grown as above and then incubated at either 4°C for 4 h (cold shock) or 37°C for 2 h (heat shock) before harvest. Before infection with Xanthomonas campestris pv campestris, seeds were cold-treated (4°C for 4 d) and then grown at 25°C and 8-h photoperiod for 21 d. The leaves were inoculated with a fresh culture of X. campestris, and the aerial plant parts were harvested 24 h later. For root tissue, sterile seeds were imbibed at 4°C for 24 h, inoculated into Gamborg's B5 liquid medium, and grown at 25°C and 24-h photoperiod with shaking for 15 d, after which the entire tissue mass (predominantly roots) was harvested. For cultured tissue, sterile seeds were germinated on Murashige and Skoog medium (Murashige and Skoog+) containing 2,4-dichlorophenoxyacetic acid (0.1 mg L−1) and isopentenyl adenine (0.5 mg L−1). Callus tissue was subcultured into liquid Murashige and Skoog+ and harvested after 10 d for RNA isolation.
Construction of cDNA Populations
Total RNA was isolated from each of the tissues and treatments described above using TRIzol reagent (Invitrogen, Carlsbad, CA) and then treated with DNA-free (Ambion, Austin, TX) to remove residual genomic DNA. mRNA was isolated using the Oligotex mRNA isolation kit (Qiagen USA, Valencia, CA). cDNA was synthesized from mRNA using the Marathon cDNA amplification kit (BD Biosciences Clontech, Palo Alto, CA).
Primer Design and PCR for Detection of Expression of Hypothetical Gene
For each hypothetical gene, two gene-specific primers (GSP1 and GSP2) for 3′- and 5′-RACE were designed based on the predicted coding region, using the Primer 3 program (http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi). The criteria for primer design were a length of 23 to 28 nucleotides (optimum 25 nucleotides), a GC content of 50% to 70%, and a melting temperature ≥ 70°C, which enables touchdown PCR. The primers were designed to give a 200- to 500-bp overlap between the 5′- and 3′-RACE products, so that used together, they could produce a 200- to 500-bp product from any cDNA population in which the cognate gene is expressed. PCR conditions for detection of gene expression in the different cDNA populations were as follows: 94°C for 4 min, 35 cycles of 94°C for 30 s, 55°C for 30 s, and 72°C for 2 min, followed by 72°C for 10 min. The GSP1 and GSP2 primer sequences of the 16 cloned hypothetical genes are shown in Supplemental Data Table II, which can be viewed at www.plantphysiol.org.
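The length and GC-content criteria above can be illustrated with a short Python sketch. It is only an illustration, not the Primer 3 procedure used here: the function names and the example primer are hypothetical, and the melting temperature criterion (≥ 70°C) is deliberately not checked because the result depends on the thermodynamic model chosen.

```python
def gc_fraction(primer: str) -> float:
    """Return the fraction of G and C bases in a primer sequence."""
    seq = primer.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_length_and_gc(
    primer: str,
    min_len: int = 23,
    max_len: int = 28,
    min_gc: float = 0.50,
    max_gc: float = 0.70,
) -> bool:
    """Check a primer against the stated length (nt) and GC-content limits.

    Melting temperature is NOT evaluated here (assumption: it is checked
    separately by the primer design software).
    """
    return (
        min_len <= len(primer) <= max_len
        and min_gc <= gc_fraction(primer) <= max_gc
    )


# Hypothetical 25-nt primer, not one of the GSP1/GSP2 primers of this study.
example_primer = "GCTGACCGTGAAGCTCCATGGCAAC"
example_ok = passes_length_and_gc(example_primer)
```

A candidate that fails either test would be redesigned before the melting temperature is considered.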
Cloning of 5′ and 3′ Ends of Full-Length cDNAs Using RACE-PCR and Sequence Analysis
Having identified a cDNA population in which the hypothetical gene is expressed, 5′- and 3′-RACE for each gene was performed with the Marathon cDNA amplification kit (BD Biosciences Clontech) using that cDNA population and touchdown PCR with the following parameters: 94°C for 30 s; five cycles of 94°C for 5 s, 72°C for 4 min; five cycles of 94°C for 5 s, 70°C for 4 min; 25 cycles of 94°C for 5 s, 68°C for 4 min; and 68°C for 4 min. After examination by gel electrophoresis, RACE reaction products were cloned into pT-Adv (BD Biosciences Clontech) or pCR2.1-TOPO (Invitrogen). White colonies were inoculated into 96-well deep blocks and grown in a 37°C shaker (225 rpm) overnight. Verification of inserts was done by colony PCR using PCR Master Mix (Promega). PCR conditions were as follows: 94°C for 4 min, 25 cycles of 94°C for 30 s, 52°C or 68°C for 30 s, 72°C for 1 min, and 72°C for 10 min.
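For readers who wish to program the touchdown RACE profile above, it can be written as a simple data structure. The Python sketch below is an illustration only: the stage labels and function names are hypothetical, and the computed time covers the programmed holds only, excluding ramp times, which depend on the thermal cycler.

```python
# Each stage: (label, number of cycles, [(temperature in °C, seconds), ...]).
# Labels are hypothetical; temperatures and times follow the profile above.
TOUCHDOWN_PROGRAM = [
    ("initial denaturation", 1, [(94, 30)]),
    ("touchdown stage 1", 5, [(94, 5), (72, 240)]),
    ("touchdown stage 2", 5, [(94, 5), (70, 240)]),
    ("main amplification", 25, [(94, 5), (68, 240)]),
    ("final extension", 1, [(68, 240)]),
]


def amplification_cycles(program) -> int:
    """Count cycles in stages that have more than one step (denature + extend)."""
    return sum(cycles for _, cycles, steps in program if len(steps) > 1)


def hold_time_seconds(program) -> int:
    """Sum the programmed hold times over all stages, ignoring ramp times."""
    return sum(
        cycles * sum(seconds for _, seconds in steps)
        for _, cycles, steps in program
    )


total_cycles = amplification_cycles(TOUCHDOWN_PROGRAM)
total_seconds = hold_time_seconds(TOUCHDOWN_PROGRAM)
```

Under these assumptions the profile comprises 35 amplification cycles and 8,845 s (about 147 min) of programmed holds.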
Four to five independent clones for each of the 5′- and 3′-RACE products were sequenced from both ends using generic sequencing primers, and the sequences were assembled using TIGR Assembler (Sutton et al., 1995). The assembled cDNA sequences and the predicted gene structure for each gene were aligned with the corresponding genomic sequences using the dds/gap2 program (Huang et al., 1997).
ACKNOWLEDGMENTS
We thank all members of the Arabidopsis group at TIGR for their help and especially Nadeeza Ishmael for her technical input.
Footnotes
This work was supported by the National Science Foundation (grant no. DBI–9813586).
The online version of this article contains Web-only data. The supplemental material is available at www.plantphysiol.org.
Article, publication date, and citation information can be found at www.plantphysiol.org/cgi/doi/10.1104/pp.010207.
LITERATURE CITED
- Brendel V, Kleffe J. Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis thaliana genomic DNA. Nucleic Acids Res. 1998;26:4748–4757. doi: 10.1093/nar/26.20.4748.
- Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997;268:78–94. doi: 10.1006/jmbi.1997.0951.
- Comelli P, Konig J, Werr W. Alternative splicing of two leading exons partitions promoter activity between the coding regions of the maize homeobox gene Zmhox1a and Trap (transposon-associated protein). Plant Mol Biol. 1999;41:615–625. doi: 10.1023/a:1006382725952.
- Frohman MA, Dush MK, Martin GR. Rapid production of full-length cDNAs from rare transcripts: amplification using a single gene-specific oligonucleotide primer. Proc Natl Acad Sci USA. 1988;85:8998–9002. doi: 10.1073/pnas.85.23.8998.
- Giranton JL, Ariza MJ, Dumas C, Cock JM, Gaude T. The S locus receptor kinase gene encodes a soluble glycoprotein corresponding to the SRK extracellular domain in Brassica oleracea. Plant J. 1995;8:827–834. doi: 10.1046/j.1365-313x.1995.8060827.x.
- Golovkin M, Reddy AS. Structure and expression of a plant U1 snRNP 70K gene: alternative splicing of U1 snRNP 70K pre-mRNAs produces two different transcripts. Plant Cell. 1996;8:1421–1435. doi: 10.1105/tpc.8.8.1421.
- Graber JH, Cantor CR, Mohr SC, Smith TF. In silico detection of control signals: mRNA 3′-end-processing sequences in diverse species. Proc Natl Acad Sci USA. 1999;96:14055–14060. doi: 10.1073/pnas.96.24.14055.
- Haas BJ, Volfovsky N, Town CD, Troukhan M, Alexandrov N, Feldmann KA, Flavell RB, White O, Salzberg SL. Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol. 2002;3:RESEARCH0029. doi: 10.1186/gb-2002-3-6-research0029.
- Hartung F, Puchta H. Molecular characterisation of two paralogous SPO11 homologues in Arabidopsis thaliana. Nucleic Acids Res. 2000;28:1548–1554. doi: 10.1093/nar/28.7.1548.
- Hebsgaard SM, Korning PG, Tolstrup N, Engelbrecht J, Rouze P, Brunak S. Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Res. 1996;24:3439–3452. doi: 10.1093/nar/24.17.3439.
- Huang X, Adams MD, Zhou H, Kerlavage AR. A tool for analyzing and annotating genomic sequences. Genomics. 1997;46:37–45. doi: 10.1006/geno.1997.4984.
- Ishikawa T, Sakai K, Yoshimura K, Takeda T, Shigeoka S. cDNAs encoding spinach stromal and thylakoid-bound ascorbate peroxidase, differing in the presence or absence of their 3′-coding regions. FEBS Lett. 1996;384:289–293. doi: 10.1016/0014-5793(96)00332-8.
- Ishikawa T, Yoshimura K, Tamoi M, Takeda T, Shigeoka S. Alternative mRNA splicing of 3′-terminal exons generates ascorbate peroxidase isoenzymes in spinach (Spinacia oleracea) chloroplasts. Biochem J. 1997;328:795–800. doi: 10.1042/bj3280795.
- Kato A, Suzuki M, Kuwahara A, Ooe H, Higano-Inaba K, Komeda Y. Isolation and analysis of cDNA within a 300 kb Arabidopsis thaliana genomic region located around the 100 map unit of chromosome 1. Gene. 1999;239:309–316. doi: 10.1016/s0378-1119(99)00403-5.
- Keeney S, Giroux CN, Kleckner N. Meiosis-specific DNA double-strand breaks are catalyzed by Spo11, a member of a widely conserved protein family. Cell. 1997;88:375–384. doi: 10.1016/s0092-8674(00)81876-0.
- Lazar G, Goodman HM. The Arabidopsis splicing factor SR1 is regulated by alternative splicing. Plant Mol Biol. 2000;42:571–581. doi: 10.1023/a:1006394207479.
- Lin X, Kaul S, Rounsley S, Shea TP, Benito MI, Town CD, Fujii CY, Mason T, Bowman CL, Barnstead M et al. Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature. 1999;402:761–768. doi: 10.1038/45471.
- Lukashin AV, Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res. 1998;26:1107–1115. doi: 10.1093/nar/26.4.1107.
- Mano S, Hayashi M, Nishimura M. Light regulates alternative splicing of hydroxypyruvate reductase in pumpkin. Plant J. 1999;17:309–320. doi: 10.1046/j.1365-313x.1999.00378.x.
- Marrs KA, Walbot V. Expression and RNA splicing of the maize glutathione S-transferase Bronze2 gene is regulated by cadmium and other stresses. Plant Physiol. 1997;113:93–102. doi: 10.1104/pp.113.1.93.
- Mayer K, Schuller C, Wambutt R, Murphy G, Volckaert G, Pohl T, Dusterhoft A, Stiekema W, Entian KD, Terryn N et al. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature. 1999;402:769–777. doi: 10.1038/47134.
- Montag K, Salamini F, Thompson RD. ZEMa, a member of a novel group of MADS box genes, is alternatively spliced in maize endosperm. Nucleic Acids Res. 1995;23:2168–2177. doi: 10.1093/nar/23.12.2168.
- Park J, Teichmann SA. DIVCLUS: an automatic method in the GEANFAMMER package that finds homologous domains in single- and multi-domain proteins. Bioinformatics. 1998;14:144–150. doi: 10.1093/bioinformatics/14.2.144.
- Sablowski RWM, Meyerowitz EM. Temperature-sensitive splicing in the floral homeotic mutant apetala3-1. Plant Cell. 1998;10:1453–1463. doi: 10.1105/tpc.10.9.1453.
- Salanoubat M, Lemcke K, Rieger M, Ansorge W, Unseld M, Fartmann B, Valle G, Blocker H, Perez-Alonso M, Obermaier B et al. Sequence and analysis of chromosome 3 of the plant Arabidopsis thaliana. Nature. 2000;408:820–822. doi: 10.1038/35048706.
- Schumacher K, Vafeados D, McCarthy M, Sze H, Wilkins T, Chory J. The Arabidopsis det3 mutant reveals a central role for the vacuolar H(+)-ATPase in plant growth and development. Genes Dev. 1999;13:3259–3270. doi: 10.1101/gad.13.24.3259.
- Seki M, Narusaka M, Abe H, Kasuga M, Yamaguchi-Shinozaki K, Carninci P, Hayashizaki Y, Shinozaki K. Monitoring the expression pattern of 1300 Arabidopsis genes under drought and cold stresses by using a full-length cDNA microarray. Plant Cell. 2001a;13:61–72. doi: 10.1105/tpc.13.1.61.
- Seki M, Narusaka M, Yamaguchi-Shinozaki K, Carninci P, Kawai J, Hayashizaki Y, Shinozaki K. Arabidopsis encyclopedia using full-length cDNAs and its application. Plant Physiol Biochem. 2001b;39:211–220.
- Simpson CG, McQuade C, Lyon J, Brown JWS. Characterization of exon skipping mutants of the COP1 gene from Arabidopsis. Plant J. 1998;17:125–131. doi: 10.1046/j.1365-313x.1998.00184.x.
- Sutton G, White O, Adams MD, Kerlavage AR. TIGR Assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci Technol. 1995;1:9–19.
- Tabata S, Kaneko T, Nakamura Y, Kotani H, Kato T, Asamizu E, Miyajima N, Sasamoto S, Kimura T, Hosouchi T et al. Sequence and analysis of chromosome 5 of the plant Arabidopsis thaliana. Nature. 2000;408:823–826. doi: 10.1038/35048507.
- The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–813. doi: 10.1038/35048692.
- Theologis A, Ecker JR, Palm CJ, Federspiel NA, Kaul S, White O, Alonso J, Altafi H, Araugo, Bowman CL et al. Sequence and analysis of chromosome 1 of the plant Arabidopsis thaliana. Nature. 2000;408:816–820. doi: 10.1038/35048500.
- Uberbacher EC, Mural RJ. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc Natl Acad Sci USA. 1991;88:11261–11265. doi: 10.1073/pnas.88.24.11261.
- Yi Y, Jack T. An intragenic suppressor of the Arabidopsis floral organ identity mutant apetala3-1 functions by suppressing defects in splicing. Plant Cell. 1998;10:1465–1477. doi: 10.1105/tpc.10.9.1465.
- Zhou DX, Kim YJ, Li YF, Carol P, Mache R. COP1b, an isoform of COP1 generated by alternative splicing, has a negative effect on COP1 function in regulating light-dependent seedling development in Arabidopsis. Mol Gen Genet. 1998;257:387–391. doi: 10.1007/s004380050662.