Proceedings of the National Academy of Sciences of the United States of America. 2007 Jul 5;104(28):11844–11849. doi: 10.1073/pnas.0704258104

A GeneTrek analysis of the maize genome

Renyi Liu, Clémentine Vitte, Jianxin Ma, A Assibi Mahama, Thanda Dhliwayo, Michael Lee, Jeffrey L Bennetzen
PMCID: PMC1913904  PMID: 17615239

Abstract

Analysis of the sequences of 74 randomly selected BACs demonstrated that the maize nuclear genome contains ≈37,000 candidate genes with homologues in other plant species. An additional ≈5,500 predicted genes are severely truncated and probably pseudogenes. The distribution of genes is uneven, with ≈30% of BACs containing no genes. BAC gene density varies from 0 to 7.9 per 100 kb, whereas most gene islands contain only one gene. The average number of genes per gene island is 1.7. Only 72% of these genes show collinearity with the rice genome. Particular LTR retrotransposon families (e.g., Gyma) are enriched on gene-free BACs, most of which do not come from pericentromeres or other large heterochromatic regions. Gene-containing BACs are relatively enriched in different families of LTR retrotransposons (e.g., Ji). Two major bursts of LTR retrotransposon activity in the last 2 million years are responsible for the large size of the maize genome, but only the more recent of these is well represented in gene-containing BACs, suggesting that LTR retrotransposons are more efficiently removed in these domains. The results demonstrate that sample sequencing and careful annotation of a few randomly selected BACs can provide a robust description of a complex plant genome.

Keywords: gene distribution, gene number, genome annotation, repetitive DNA, sample sequencing


Whole-genome sequence analysis has revolutionized the field of plant genetics. A sequenced plant genome provides the full list of genetic elements and also the context in which these elements function. The near-complete sequence of Arabidopsis thaliana (1) enabled the Arabidopsis 2010 project, which proposes to characterize the function of all of the genes in the genome (2). In addition, comparative analysis of the genomes of two or more species with known evolutionary relatedness is a powerful way to identify functional elements, to transfer knowledge from well-studied model organisms to related plants, and to infer the mechanisms of genome evolution. For example, comparative analysis of orthologous regions from multiple cereals, including maize, rice, sorghum, and wheat, has provided abundant information about the timing, nature, and mechanisms of small rearrangements within those genomes (3–7).

Because of the high cost of whole-genome sequencing, it is best to choose the most cost-effective sequencing method, a choice influenced by genome size, gene content, and sequence organization. For small genomes or those with genes highly separated from most repetitive DNAs, BAC-by-BAC sequencing is the obvious approach. For the larger genomes characteristic of most flowering plants, where repeats are often intermixed with genes, gene-enrichment approaches have been suggested as a more efficient sequencing strategy (8, 9). Therefore, before deciding on an effective sequencing approach, it is appropriate to first understand the general composition and structure of the target genome.

The GeneTrek approach has been proposed as an efficient way to evaluate the general properties of any genome (10, 11). The principle is to sequence and annotate a small randomly selected portion of the genome, a strategy first used by Brenner and coworkers in the analysis of the Fugu genome (12). For maize, analysis of ≈475,000 BAC end sequences led to the estimation that the maize genome contains ≈59,000 genes and is ≈58% repetitive DNA (13). Further, sequence analysis of 100 randomly selected BAC clones from maize led to the prediction of 42,000–56,000 gene models and at least 66% repetitive DNA (14).

BAC sequences provide additional context information to a sample sequence analysis, thus allowing genome predictions that greatly enrich the GeneTrek approach. Here we describe a procedure for accurately predicting plant genome structure and composition with a relatively small data input and apply it in a comprehensive sequence annotation of randomly selected BACs that contain DNA from maize inbred B73. The results indicate that the maize genome contains many gene-free regions, many highly truncated gene fragments, and a nonrandom distribution of repetitive elements within different repeat-rich domains.

Results

Repeat and Mobile DNA Annotation.

Repetitive elements, especially LTR retrotransposons, are a major component of the maize genome (15–17). Exhaustive identification of repetitive elements is important not only for the accurate estimation of the amount of repetitive DNA but also for the estimation of gene content. Inadequate identification of repetitive elements may lead to consistent overestimation of gene number in plants because many repeats are scored as genes by gene prediction programs, and these repeats are well represented in EST databases (18, 19).

A total of 7.56 Mb of repeats were identified on the 74 randomly selected BACs by comparing BAC sequences to the TIGR maize repeat database (20) and an in-house maize retrotransposon database (P. SanMiguel, personal communication). Intact LTR retrotransposons were then identified by a structure-based search, leading to the characterization of an additional 710 kb of LTR retrotransposons. Hence, mobile and/or repetitive DNA accounted for at least 8.3 Mb or 68% of the BAC sequences (Table 1).

Table 1.

Summary of annotation results for 74 randomly selected maize BACs

Parameter                                                  Value
Total number of BACs analyzed                              74
Combined BAC lengths                                       12.2 Mb
Amount of identified repetitive DNA (percentage)           8.3 Mb (67%)
Number of genes with similarity or collinearity support    188
Number of severely truncated gene fragments                28
Overall gene density                                       One gene per 65 kb
Genes that show collinearity with rice (percentage)        124 (72%)
Number of hypothetical genes                               131
Number of estimated total maize genes                      37,038–62,847

Gene Annotation.

When the FGENESH gene prediction program was applied to the 74 BACs, 2,137 gene models were predicted. Most of these models [1,790 (84%)] were parts of identified repeats (mostly LTR retrotransposons) and were removed from further analysis. Based on stringent criteria [i.e., significant homology (e ≤ 10⁻¹⁰) to genes in other species and/or colinearity with rice genes], 216 gene models were classified as verified gene candidates. After comparison with rice or Arabidopsis genes, 28 of these gene models were identified as gene fragments. This left 188 annotated genes over 74 BACs, for an average gene density of 1 gene per 65 kb (Table 1). The remaining 131 gene models were classified as “hypothetical proteins,” many of which will eventually be shown to be parts of heretofore undiscovered transposons (18). If these gene numbers are extrapolated to the whole maize genome, with an estimated size of 2,400 Mb, then maize contains 37,000 (verified) to 63,000 (verified plus hypothetical) genes (Table 1). Recalculating gene number using these 74 BACs, randomly sampled with replacement, indicated that the 37,000 gene number is 82% accurate with 95% confidence (data not shown).
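The extrapolation in this paragraph is a proportional scaling from the sampled sequence to the whole genome, and the resampling check can be read as a bootstrap over BACs. The Python sketch below illustrates both under stated assumptions: the function names are hypothetical, and because per-BAC gene counts and lengths are not listed in the text, the `bacs` argument of `bootstrap_gene_number` is a placeholder input rather than the authors' data or procedure.

```python
import random

GENOME_MB = 2400.0  # estimated maize genome size quoted in the text


def extrapolate(genes, sampled_mb, genome_mb=GENOME_MB):
    """Scale a gene count from the sampled sequence to the whole genome."""
    return genes * genome_mb / sampled_mb


def bootstrap_gene_number(bacs, n_boot=1000, seed=0):
    """Resample BACs with replacement and return sorted genome-wide estimates.

    `bacs` is a list of (gene_count, length_mb) tuples, one per BAC
    (hypothetical input; the per-BAC values are not given in the text).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(bacs) for _ in bacs]
        genes = sum(g for g, _ in sample)
        length = sum(mb for _, mb in sample)
        estimates.append(extrapolate(genes, length))
    return sorted(estimates)


def percentile_interval(sorted_values, level=0.95):
    """Central percentile interval taken from sorted bootstrap estimates."""
    n = len(sorted_values)
    lo = sorted_values[int((1 - level) / 2 * n)]
    hi = sorted_values[min(n - 1, int((1 + level) / 2 * n))]
    return lo, hi


# Rounded totals from Table 1: 188 verified and 131 hypothetical genes in 12.2 Mb.
low = extrapolate(188, 12.2)
high = extrapolate(188 + 131, 12.2)
```

With the rounded 12.2 Mb total, `low` and `high` come out near 37,000 and 63,000. They differ slightly from the 37,038–62,847 in Table 1, presumably because the table was computed from the unrounded combined BAC length.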

Gene Fragments.

Because maize and rice are much more closely related than are maize and Arabidopsis, it is easier to find excellent maize homology to rice sequences. As a result, 27 of the 28 gene fragments were identified through comparisons with rice genes. Among 28 gene fragments, five are C-terminal fragments, 11 are N-terminal fragments, and 12 are fragments from the middle regions of the genes, the latter requiring at least two truncation events.

Colinearity with Rice.

Because it is not possible to infer colinearity if there is only one gene on a maize BAC, 15 genes were dropped from the colinearity analysis because they are on single-gene BACs. For the remaining 173 genes, 124 (72%) were colinear with genes in the sequenced rice genome (21).

Local Gene Density and Distribution.

As shown in Fig. 1, gene density, calculated as the number of genes on a BAC divided by the BAC length, varies from 0 to 7.9 per 100 kb on different BACs. This indicates that gene distribution in the maize genome is uneven, and that some regions have much higher gene density than others. Twenty-one of the 74 annotated BACs contain no verified genes at all, indicating that ≈30% of the maize genome consists of long stretches of exclusively nongenic sequence. This uneven distribution is supported by a formal χ² test (null hypothesis of uniform distribution rejected at the 1% significance level; χ² = 22.7, df = 9, P = 0.007).

Fig. 1.


Gene density variation among BACs. Gene density was calculated as the gene number on a BAC divided by the BAC length. The BACs are sorted by overall gene density. The gene density is shown for both hypothetical and verified genes.

Gene Islands.

The gene distribution in the gene-rich regions can be evaluated as the number of genes per gene island. Here two genes are counted as on one gene island if there is <5 kb of identifiable repetitive sequences in the intergenic region between them. Genes at either end of a BAC are discarded from this analysis because one boundary of the gene island is not clear. As shown in supporting information (SI) Fig. 3, islands were found to contain one to seven (verified genes only) or eight (hypothetical genes included) genes. If hypothetical genes are included, 61% (89 of the 145) of the gene islands contain only one gene and the average number of genes per gene island is 1.8. If they are not included, the numbers are 64% (57 of 89) and 1.7, respectively.
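The island rule above can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' code: the input format (a list of repeat amounts between consecutive genes on one BAC) and the function name are hypothetical, and the sketch takes one reading of the end-of-BAC rule, namely that any island containing the first or last gene on the BAC is dropped.

```python
def gene_islands(repeat_kb_between, threshold_kb=5.0, drop_terminal=True):
    """Group consecutive genes on one BAC into gene islands.

    repeat_kb_between[i] is the amount (kb) of identified repetitive sequence
    between gene i and gene i + 1, so a BAC with n genes has n - 1 entries.
    Returns the number of genes in each retained island.
    """
    n_genes = len(repeat_kb_between) + 1
    islands = [[0]]
    for i, repeat_kb in enumerate(repeat_kb_between):
        if repeat_kb < threshold_kb:
            islands[-1].append(i + 1)   # <5 kb of repeats: same island
        else:
            islands.append([i + 1])     # otherwise a new island starts
    if drop_terminal:
        # Islands touching either end of the BAC have an unclear boundary.
        last = n_genes - 1
        islands = [isl for isl in islands if 0 not in isl and last not in isl]
    return [len(isl) for isl in islands]


# Hypothetical BAC with six genes and five intergenic regions.
sizes = gene_islands([12.0, 1.5, 0.8, 20.0, 7.0])
mean_size = sum(sizes) / len(sizes)
```

In this made-up example the two internal islands hold three genes and one gene, so the mean island size is 2.0.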

Detailed Sequence Analysis of Six Gene-Free BACs.

From this analysis of 74 BACs, 21 did not contain any verified gene (that is, either no gene candidate or only hypothetical genes were predicted). To further investigate the content of such genomic regions, 6 of these 21 BACs with complete or almost complete sequence assembly were selected and manually annotated. The result of this annotation is presented in Fig. 2 and SI Figs. 4–8. As observed for other maize genomic regions (4, 6, 14, 22–24), these regions primarily contain LTR retrotransposons that are organized in nested structures. As shown in SI Table 2, LTR retrotransposons are the largest components of all six BACs, representing 87.5% of the total sequence analyzed. The remaining sequence includes MITEs (0.1%) and other DNA transposons (1.4%). In BAC AC148161, >5% of the sequence corresponds to a tandem repeat (minisatellite) that shares >83% identity with the 180-bp knob repeat (25). Only 9.4% of the total sequence remained undefined. These sequences probably correspond to pieces of ancient repeats that are no longer recognizable because of the accumulation of deletions and substitutions (26). Among LTR retrotransposons, the most represented are Zeon (17 copies), Cinful (14 copies), Prem-1 (13 copies), Gyma (nine copies), Xilon (eight copies), Huck (seven copies), Opie (seven copies), Ji (seven copies), Shadowspawn (five copies), and Tekay (five copies). Two new gypsy-like elements, Weju and Reme, were also discovered on these BACs (SI Fig. 4 and SI Table 3).

Fig. 2.


Graphic representation of BAC AC147789. Total size of the BAC is shown in parentheses. For LTR retrotransposons, family names are shown on top of the element, and the number after the dash represents each copy. LTR divergence is represented on top of the name. Locations of the primers used for RJMs are represented as small arrows.

Both parametric and nonparametric tests comparing the number of LTR retrotransposons in the gene-free vs. gene-containing regions revealed that the elements Zeon and Gyma are significantly more abundant in the gene-free than in the gene-containing regions when taken individually (Zeon, P = 0.029 and 0.019; Gyma, P = 0.003 and 0.003, for the parametric and nonparametric tests, respectively). After a Bonferroni correction, only Gyma remains significant (corrected P = 0.030). In contrast, the Ji and Opie elements are relatively more abundant in the gene-containing regions (Ji, P < 0.0001 and P = 0.0001; Opie, P = 0.006 and 0.014, for the two tests, respectively), with Ji still significant after Bonferroni correction (corrected P = 0.001).

As presented in Fig. 2, SI Figs. 4–8 and SI Table 4, most (66) of the 103 LTR retrotransposons analyzed contain two LTRs, whereas only 37 are truncated. For 13 of these truncated elements, the truncation is at the end of the BAC sequence and thus probably does not correspond to a biological feature. Only four solo LTRs (two from Gyma and two from Ji) could be described with confidence using the presence of the target site duplication (TSD), whereas an additional LTR was complete but did not exhibit a TSD. Among the 66 elements with two LTRs, insertion dates could be estimated for 63 (the other three did not have enough LTR sequence to be properly aligned). Results, presented in SI Table 5, show that most of these elements have inserted within the last 3 million years (My), with a first quartile, a median, a third quartile, and a maximum at 0.39, 1.35, 1.71, and 3.94 My, respectively. Histograms of the insertion dates were built, with a time scale of 1/2 My. These results, presented in SI Fig. 9, reveal two peaks of amplification, ≈1.5–2 Mya and within the last 500,000 years.

Of the 45 elements found inserted within another LTR retrotransposon, only five (11.1%) are inserted within a member of their own family, whereas 40 (88.9%) are within a member of another family (SI Table 6). The “top” elements are usually younger (based on comparison of the LTR sequences) than the elements in which they have inserted, although there are six exceptions (see Fig. 2 and SI Figs. 4–8).

Generation of Unique Markers from Repeated Sequences and Mapping of Gene-Free BACs.

Gene-free genomic fragments contain mainly repetitive sequences. Hence, they may be difficult to map genetically, because they do not harbor unique sequences that can be used as molecular markers. However, the insertion of a particular LTR retrotransposon within another can be used as a unique marker, because the combination of both the nature of the two elements and the site of insertion may be unique in the genome (11). Of 29 LTR repeat junction markers (RJMs) tested, 11 (37.9%) led to the amplification of a single band of expected size in B73 maize and seven (24.1%) to the amplification of a single band of expected size, plus one or two additional faint bands (SI Table 7; see SI Table 8 for details). Five of these 18 RJMs were polymorphic between B73 and Mo17 and were used to position these BACs on the maize genetic map. These five polymorphic markers are from BACs AC147789 (marker A6), AC148081 (markers B3 and B4), and AC147809 (markers F2 and F5).

Evidence for linkage was found for all markers, leading to the positioning of BAC AC147789 on chromosome 1, bin 1.06, between markers umc1972 (15.4 cM) and asg58 (43.1 cM); BAC AC148081 on chromosome 7, bin 7.02, between markers phi034/cyp6 (0 cM) and umc1983 (18.9 cM); and BAC AC147809 on chromosome 6, bins 6.01–6.02, between markers umc1006 (4.6 cM) and bnlg1867 (4.9 cM) (SI Table 9). In cases where two markers were available for the same BAC (i.e., markers B3 and B4 on AC148081 and markers F2 and F5 on AC147809), the two markers mapped within 0.5 cM of each other, confirming the correlation between the bands observed and the markers designed (data not shown).

BAC fingerprint data (www.genome.arizona.edu/fpc/WebAGCoL/maize/WebFPC) were used to physically locate all six BACs. This step also confirmed the RJM mapping results and helped estimate the position of the remaining three BACs for which no RJM marker could be found: AC148172 in bin 4.05, AC148161 in bin 5.03, and AC148159 in bin 5.05 (SI Table 9).

Positions of the centromeres and knobs located on maize chromosomes 1, 4, 5, 6, and 7 were compared with the position of the six BACs (SI Table 9). Results of this comparison reveal that AC147789 is located >5 μm from the centromere and knob 1S described in traditional maize varieties (27), AC148172 is near (<1 μm) a centromere, AC148161 is located far (≈9 μm) from any centromere, AC148159 is located ≈2 μm from knob 5L, and AC147809 appears to be located near knob 6L1, described in traditional maize varieties and Mexican teosintes (27) (the mapping interval spans the knob region). For BAC AC148081, the Intermated B73 × Mo17 (IBM) mapping interval spans a centromere, but the interval is quite large and marker phi034/cyp6 (chromosome 7S) is tightly linked to the BAC. Therefore, it appears that BAC AC148081 is located on chromosome 7S, at least 3 μm from the centromere.

Hence, of the six gene-free BACs analyzed, one (AC148172) is located in a pericentromeric region, one (AC148159) is located close to a B73 knob, and one (AC147809) is located close to a knob that is present in traditional maize varieties and Mexican teosintes (27) but not visible in B73 by in situ hybridization (28), whereas the three other BACs are from chromosomal locations that are distant from known heterochromatic regions.

Discussion

The Structure of the Maize Genome.

To sequence large plant genomes in a cost-effective manner, it is appropriate to first evaluate genome composition and structure, then choose the best sequencing strategy. Our results demonstrate that the annotation of a small set of randomly selected BACs is an effective way to evaluate the key properties of a large and complex plant genome, such as total gene number, amount of repetitive DNA, and gene distribution.

Because all large plant genomes contain numerous transposable elements (TEs) that may be difficult to identify and differentiate from genes, it is challenging to obtain an accurate estimation of gene number (18). Even in the relatively simple rice genome, the estimation of gene number from draft whole-genome sequence and finished individual chromosomes has varied from ≈32,000 to ≈70,000 (29–33). The estimation of gene number in maize is particularly challenging, because the complete genome sequence is not available, and the majority of the genome consists of nested LTR retrotransposons (15). The lower boundary of our estimation of maize gene number (37,000) is similar to the gene number estimated in the nearly completed rice genome sequence (≈32,000) (34). This is consistent with the fact that, even though maize has a fairly recent tetraploid history (35), it is approaching a diploid status because 50–90% of the duplicated copies of genes have been deleted at least partially in one of the homoeologous regions (4, 6, 36). In a recent study, Haberer et al. (14) predicted 42,000–56,000 genes in the maize genome and concluded that 22% of their 100 randomly chosen maize BACs were missing genes. The differences between the two analyses, for instance the ≈30% of maize BACs that we found to be lacking in verified genes, are an outcome of the more conservative gene annotation in the current study. We found that many of the sequences called genes by Haberer et al. (14) were actually truncated gene fragments and/or sequences within TEs.

One critical step to obtain an accurate gene number estimation in maize is to differentiate gene fragments from complete genes. In this set of BACs, 14% of the genes with homologues to other plant species are severely truncated when compared with rice or Arabidopsis genes. Gene fragments have been identified earlier in many maize regions. Some are the residual gene components retained from the loss of duplicated genes (from either segmental duplication or polyploidy) caused by the slow accumulation of small deletions associated with illegitimate recombination (4, 37). The majority, however, are associated with Helitrons, a new class of TEs that can acquire (and sometimes fuse/express) multiple gene fragments from different genes (38, 39). Although truncated genes can play a functional role because they might be expressed and translated, they are usually excluded from gene content estimation. Given our stringent criteria for prediction as a truncated gene (>30% sequence loss), it is likely that many more predicted genes in maize are actually truncated or inactivated by other types of mutations. For instance, Morgante et al. have predicted that ≈20% of annotated maize genes are actually gene fragments within Helitrons (39).

Another key step to evaluate gene content is to differentiate genes from TEs. Although a similarity search is effective in identifying known TEs, it cannot identify novel elements. It is thus important to identify TEs by structural features. In this study, 710 kb of novel LTR retrotransposon sequences were identified, and this decreased the predicted gene number by >20%.

Genomic colinearity with closely related species can be an effective way to identify real genes, because plant genomes are evolving very fast, and orthologous intergenic regions are rarely conserved (5). It has been previously demonstrated that the comparison of the homologous regions of maize and sorghum or of barley and rice was an effective way to identify genes and delineate gene structures (40, 41). However, this strategy has not been applied to large-scale genomic sequence annotation in plants. This is because the two species involved need to have appropriate evolutionary distance, which should be close enough to allow for extensive colinearity and far enough to allow for extensive divergence in the intergenic regions. For example, Arabidopsis is not an appropriate reference for rice genome analysis, because the colinearity is very limited between these two species (42–45). In this study, colinearity with rice was used to help confirm that maize genes were real, and not artifacts of annotation.

Overall gene distribution in a genome has important implications in selecting cost-effective strategies for whole-genome sequencing or map-based cloning. The annotation results show that ≈30% of maize BACs contain no verified genes. Given the gene distribution observed, the results predict that ≈50% of verified genes could be sequenced on ≈18% of maize BACs, whereas ≈90% of verified genes are present on 49% of maize BACs and ≈95% of maize genes on 60% of maize BACs. Given the general opinion that 95% gene discovery is an unacceptably low final product for a genome sequence and given the difficulties in finding all maize BACs with genes on them, these results suggest that selecting only gene-rich BACs for maize genome sequencing would be a poor genome-sequencing approach. It would be better to use gene enrichment or sequence all available BACs and accept that ≈30% of the sequenced clones will provide no gene discovery value.
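The trade-off described in this paragraph can be computed directly from a list of per-BAC gene counts: rank BACs from most to fewest genes and ask how many are needed to reach a given share of all genes. The sketch below assumes that ranking approach; the function name is hypothetical and the counts are invented for illustration, since the per-BAC values behind the 18%, 49%, and 60% figures are not listed in the text.

```python
def bac_fraction_for_gene_fraction(genes_per_bac, target):
    """Smallest fraction of BACs (richest first) holding `target` of all genes.

    genes_per_bac: verified gene count on each sampled BAC.
    target: desired fraction of genes, e.g. 0.5, 0.9, or 0.95.
    """
    counts = sorted(genes_per_bac, reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("no genes in the sample")
    running = 0
    for n_bacs, count in enumerate(counts, start=1):
        running += count
        # Small tolerance guards against floating-point round-off.
        if running >= target * total - 1e-9:
            return n_bacs / len(counts)
    return 1.0


# Hypothetical five-BAC sample: two BACs carry no genes at all.
example = [5, 3, 2, 0, 0]
half = bac_fraction_for_gene_fraction(example, 0.5)
everything = bac_fraction_for_gene_fraction(example, 1.0)
```

In this toy sample, one BAC in five already holds half of the genes, but three in five are needed to recover all of them.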

Nonrandom Genome-Component Distribution Across the Maize Genome.

The majority of the sequence of the six gene-free BACs corresponds to LTR retrotransposons, as expected in the LTR retrotransposon-rich maize genome. Neither LINEs nor Helitrons were found on these gene-free BACs. The absence of Helitrons in these regions was unexpected, because previous studies suggested that these elements insert mainly in repeated sequences (46).

Statistical analysis of LTR retrotransposon number in gene-free vs. -containing regions revealed that some elements are significantly more abundant in the gene-free regions (i.e., Gyma), whereas others are significantly more abundant in the gene-containing ones (i.e., Ji). Previous studies have indicated that some LTR retrotransposons are more likely to accumulate in pericentromeric regions than in euchromatic DNA in maize and other plants (for details, see SI Text). The data presented herein suggest this bias is not limited to large heterochromatic blocks like pericentromeres or knobs but is a general property of interstitial gene-poor regions intermixed with euchromatin.

Investigation of the insertion dynamics of LTR retrotransposons in the gene-free regions indicated two peaks of amplification, around 1.5–2 Mya and within the last 500,000 years. Calculating the insertion dates of LTR retrotransposons for six gene-containing regions revealed only one peak of amplification, within the last 1 My. This striking difference suggests that LTR retrotransposons have been more efficiently removed from the gene-containing regions (47). Removal of LTR retrotransposon sequences occurs primarily by illegitimate and unequal homologous recombination (26, 48). The recombinational suppression expected in gene-poor regions should lead to a lower frequency of solo LTR generation by unequal homologous recombination, as observed in rice pericentromeric heterochromatin (49).

Mapping of gene-free BACs revealed that, among the six gene-free BACs, only two (AC148159 and AC148172) are from known heterochromatic regions such as centromeres or knobs, and the four others are located in regions that are either distant from the heterochromatic regions or not highly compacted in B73. Thus blocks of gene-free repetitive DNA >100 kb in size appear to be common in the maize genome and are likely to be intermixed with genic blocks.

Although BAC AC148161 does not map in a knob region (it is located on bin 5.03, whereas knob 5L is located on bin 5.07, and physical estimates place it >16 μm from the knob), it harbors a tandem repeat that shares >83% identity with the 180-bp knob repeat (50). The repeat unit is repeated 71 times on this gene-free BAC. The repeats on this BAC are mostly complete, with only 12 truncated repeats that lack from 10 to 40 bp. The copy number of the 180-bp tandem repeat usually correlates with the size of the knob (25, 51), suggesting this BAC comes from a region with a “microknob” in B73. As suggested by Viotti et al. (52), the molecular characterization of this repeat confirms that some euchromatic regions have a low concentration of knob 180-bp repeats. The other maize knob repeat, the 350-bp repeat known as TR-1 (53), is not present in BAC AC148161, in agreement with the observation that the two types of repeats tend to be isolated from each other, either in different knobs or in separate domains of the same knob (54).

Conclusions

The general properties of the maize genome can be described by sequence analysis of a small number of randomly selected large-insert clones. The overall gene number in the genome, frequency of truncated genes, and gene distribution can be unambiguously assessed. The contributions and arrangements of repetitive DNAs are also clearly delineated. The results indicate that genes are found in small islands that are unevenly distributed around the genome, and that different families of TEs preferentially associate with gene-containing or -free regions. Mapping of these regions suggests most of these gene-free regions are not associated with known heterochromatic features of the genome.

Materials and Methods

Randomly Selected Maize BACs.

Ninety-one maize BAC sequences were downloaded from the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov) on March 4, 2004. These BACs were selected randomly from EcoRI, MboI, and HindIII BAC libraries with inserts of DNA from maize inbred B73 (55). They were sequenced by the Whitehead/Broad Institute, the University of Arizona, and Rutgers University (www.broad.mit.edu/annotation/plants/maize/randomclones.html) (14). We selected only those 80 BACs with sizes >80 kb for annotation. Six of these BACs were excluded later, because their predicted size changed dramatically upon subsequent assemblies recorded in GenBank, leaving 74 BACs covering 12.2 Mb (0.5%) of the maize genome for the analysis.

Annotation of LTR Retrotransposons and Other Repeats.

Known repeats were identified by comparing BAC sequences with the TIGR maize repeat database (20) and an in-house database of maize repeats with cross-match (www.phrap.org/phredphrapconsed.html). LTR retrotransposons were manually identified by structural features such as pairs of LTRs, TSDs, a primer binding site and a polypurine tract.
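One of the structural features listed above, the target site duplication, lends itself to a simple check: the few bases immediately upstream of the element should be repeated immediately downstream of it. The sketch below is a minimal illustration, not the manual procedure used in the study. The function name, the coordinate convention, and the default TSD length of 5 bp (typical of many LTR retrotransposons) are assumptions.

```python
def has_tsd(sequence, start, end, tsd_len=5):
    """Check for a perfect target site duplication flanking an element.

    The element occupies sequence[start:end] (0-based, end exclusive).
    Returns True if the tsd_len bases just before the element are identical
    to the tsd_len bases just after it.
    """
    if start < tsd_len or end + tsd_len > len(sequence):
        return False  # element too close to the sequence edge to tell
    upstream = sequence[start - tsd_len:start]
    downstream = sequence[end:end + tsd_len]
    return upstream.upper() == downstream.upper()


# Toy example: a 10-bp "element" flanked by the direct repeat ACGTA.
element = "TGTTTTTTCA"
toy = "GGG" + "ACGTA" + element + "ACGTA" + "TTT"
found = has_tsd(toy, 8, 18)
```

A real search would also need to tolerate mismatches in older insertions, which this exact-match sketch does not attempt.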

Gene Annotation and Distribution Analysis.

BAC sequences were first subject to ab initio gene prediction with FGENESH (www.softberry.com) using the matrix for monocot plants. Gene models that were integral parts of known repeats were removed from further analysis. The remaining gene models were used as query sequences to search the National Center for Biotechnology Information nonredundant protein database, the Arabidopsis protein database (ftp://ftp.tigr.org/pub/data/a_thaliana, version 5.0), all BAC sequences from the international rice genome sequencing group (21), and the TIGR plant gene indices (56). All BLAST hits were manually evaluated to determine whether a gene model is likely to be a real gene based on e value, alignment with the query, hit annotation, and number and evolutionary range of species with sequences that exhibited homology.

A χ² test was used to test the null hypothesis that genes are uniformly distributed in the maize genome. Because one requirement of a χ² test is that cell (group) frequency must be at least 5 (57), each BAC was randomly assigned to one of the 10 groups (a randomized block design) and then the expected and observed gene number in each group was used to calculate the χ² value.
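The grouping-and-testing procedure can be sketched as follows, using only the Python standard library. Several details are assumptions made for illustration: the text does not say how expected counts were derived, so the sketch takes them as proportional to the total sequence length in each group; BACs are shuffled and dealt into groups of near-equal size; and all function names are hypothetical. The tail probability uses the standard closed-form series for integer degrees of freedom.

```python
import math
import random


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square with integer df."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    if df % 2 == 0:
        half = x / 2.0
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    root = math.sqrt(x)
    normal_tail = 0.5 * math.erfc(root / math.sqrt(2.0))
    pdf = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return 2.0 * normal_tail + 2.0 * pdf * total


def uniformity_test(bacs, n_groups=10, seed=0):
    """Chi-square test of uniform gene density over randomly grouped BACs.

    bacs: list of (gene_count, length_kb) tuples (hypothetical input).
    Returns (chi2, df, p_value).
    """
    if len(bacs) < n_groups:
        raise ValueError("need at least one BAC per group")
    rng = random.Random(seed)
    order = list(range(len(bacs)))
    rng.shuffle(order)
    observed = [0] * n_groups
    length = [0.0] * n_groups
    for pos, idx in enumerate(order):
        group = pos % n_groups
        observed[group] += bacs[idx][0]
        length[group] += bacs[idx][1]
    total_genes = sum(observed)
    total_length = sum(length)
    chi2 = 0.0
    for obs, kb in zip(observed, length):
        expected = total_genes * kb / total_length  # assumed: proportional to length
        chi2 += (obs - expected) ** 2 / expected
    df = n_groups - 1
    return chi2, df, chi2_sf(chi2, df)
```

As a consistency check on the value reported in Results, `chi2_sf(22.7, 9)` evaluates to roughly 0.007.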

Gene Fragment Identification.

Each gene model was used as a query to search Arabidopsis proteins (TIGR annotation version 5.0) and rice proteins derived from full length cDNAs. The best match from either Arabidopsis or rice was used as query to search maize BAC DNA sequences to ensure that homologs were not shortened by annotation errors. Each alignment was manually evaluated. Maize DNA sequences upstream and downstream of the alignment were also checked to make sure that any fragment identified was not caused by sequence gaps or BAC boundaries. A gene fragment was defined by the subjective criteria that the maize gene model could cover only part (≤70%) of the homologs in Arabidopsis or rice, and the alignment between the query and hit would have a high identity that was maintained at both ends of the alignment. Hence, many gene fragments and other pseudogenes would be missed by these criteria.
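The coverage part of this definition can be expressed as a small function. The sketch below covers only that part: the identity requirement and the manual inspection described above are not modeled. The function name, the coordinate convention, and the `end_slack` tolerance for deciding whether an end of the homolog is missing are assumptions introduced for illustration. Fragments are labeled by the portion of the homolog that is retained.

```python
def classify_gene_model(aln_start, aln_end, homolog_len,
                        max_cover=0.70, end_slack=10):
    """Classify a maize gene model from its alignment to a homolog protein.

    aln_start, aln_end: 1-based inclusive alignment coordinates on the homolog.
    homolog_len: length of the Arabidopsis or rice homolog (amino acids).
    A model covering more than max_cover of the homolog is kept as a gene.
    """
    covered = (aln_end - aln_start + 1) / homolog_len
    if covered > max_cover:
        return "gene"
    missing_n = (aln_start - 1) > end_slack          # N-terminal part absent
    missing_c = (homolog_len - aln_end) > end_slack  # C-terminal part absent
    if missing_n and missing_c:
        return "middle fragment"
    if missing_c:
        return "N-terminal fragment"
    if missing_n:
        return "C-terminal fragment"
    return "fragment (unclassified)"


# Hypothetical alignments against a 500-residue homolog.
labels = [
    classify_gene_model(1, 450, 500),    # 90% covered
    classify_gene_model(1, 300, 500),    # C-terminal 200 residues missing
    classify_gene_model(201, 500, 500),  # N-terminal 200 residues missing
    classify_gene_model(150, 350, 500),  # both ends missing
]
```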

Colinearity with Rice.

Gene models from a single BAC were used as queries to search against rice BAC sequences. If two or more gene models had significant (BLAST e value ≤ 10⁻⁵) matches on the same rice BAC, they were judged as having colinearity with rice. Because the randomly selected maize BAC sequences were in draft stage and usually consisted of unordered pieces, it was not possible to compare the order and orientation of some genes.
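This rule amounts to grouping significant hits by maize BAC and rice BAC and keeping any group with at least two distinct gene models. A minimal Python sketch follows; the tuple layout of `hits`, the identifiers in the example, and the function name are hypothetical, and no BLAST output parsing is attempted.

```python
from collections import defaultdict


def colinear_genes(hits, max_evalue=1e-5):
    """Return the (maize_bac, gene_model) pairs judged colinear with rice.

    hits: iterable of (maize_bac, gene_model, rice_bac, evalue) tuples.
    Two or more gene models from the same maize BAC with significant hits
    on the same rice BAC are all counted as colinear.
    """
    by_pair = defaultdict(set)
    for maize_bac, gene, rice_bac, evalue in hits:
        if evalue <= max_evalue:
            by_pair[(maize_bac, rice_bac)].add(gene)
    colinear = set()
    for (maize_bac, _rice_bac), genes in by_pair.items():
        if len(genes) >= 2:
            colinear.update((maize_bac, gene) for gene in genes)
    return colinear


# Hypothetical hits: only g1 and g2 on maize BAC M1 share rice BAC R7.
example_hits = [
    ("M1", "g1", "R7", 1e-30),
    ("M1", "g2", "R7", 1e-8),
    ("M1", "g3", "R9", 1e-40),
    ("M2", "g1", "R7", 1e-3),   # not significant
    ("M2", "g2", "R7", 1e-20),
]
result = colinear_genes(example_hits)
```

Because gene order and orientation are ignored, this matches the relaxed definition forced by draft-stage, unordered BAC sequence.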

Choice and Annotation of Gene-Free BACs.

From the 21 BACs for which no verified gene was annotated, six were chosen for further characterization. These BAC sequences are present in GenBank under accession nos. AC147789, AC148081, AC148159, AC148172, AC148161, and AC147809 (SI Table 2). Sequence analysis was based on the sequences available in GenBank in February 2005. For all BACs except AC148161 (which contains three contigs), the sequence is represented in one contig.

Annotation of the six BAC sequences was performed in several steps, each step allowing the characterization of a certain type of repeat, such as known TEs, unknown LTR retrotransposons, tandem repeats, and previously undescribed repeats (see SI Text for details of these steps). After each step, the annotated regions were removed from the sequence to allow the characterization of other repeats in which the first annotated elements may have inserted. For each characterized TE, special care was given to detect the presence of the TSD.

Primer Design and PCR.

Primers were chosen to flank the junction between two LTR retrotransposons, based on annotation of the BAC sequences from maize line B73. The TSD sequence of the inserted LTR retrotransposon was chosen to be in the middle of the PCR product, with a product size ranging from 500 to 600 bp. Primers were designed by using the Primer3 software (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi), with a GC clamp of 2. Primer sequences are listed in SI Table 10.

DNA was extracted from 2-week-old plants of maize lines B73 and Mo17 using the CTAB method described by M. Frohlich (http://fgp.bio.psu.edu/fgp/methods/method_CTAB.html), omitting the CsCl purification steps. PCRs were performed in a 20-μl total volume containing 100 ng of DNA, 2 μl of 10× Taq polymerase buffer (Roche, Indianapolis, IN), 0.16 μl of 25 mM dNTPs (Roche), 3 μl of a 4 μM solution of each primer (Invitrogen), and 0.16 μl of 5 units/μl Taq polymerase (Roche), brought to volume with nuclease-free water (Fisher Scientific, Pittsburgh, PA). Reactions were carried out with the following steps: denaturation at 94°C for 3 min, followed by 35 touchdown cycles with denaturation at 94°C for 30 sec, annealing with temperature decreasing from 62°C to 55°C in 30 sec, and extension of 1 min at 72°C, followed by a final extension step of 3 min at 72°C. In each cycle, the increase of the temperature between the annealing and the extension steps was carried out with a ramp of 1°C per sec.
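As a check on the reaction mix, the final concentrations implied by the stated stock concentrations and volumes can be worked out with the dilution relation C_final = C_stock × V_stock / V_total. The helper name below is hypothetical; the numbers are those given in the text.

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul=20.0):
    """Dilution arithmetic: C_final = C_stock * V_stock / V_total."""
    return stock_conc * stock_volume_ul / total_volume_ul


# Values taken from the reaction described above (20-ul total volume).
buffer_x = final_concentration(10, 2)        # 10x buffer, 2 ul    -> 1x
dntp_mM = final_concentration(25, 0.16)      # 25 mM dNTPs, 0.16 ul -> about 0.2 mM
primer_uM = final_concentration(4, 3)        # 4 uM primer, 3 ul    -> about 0.6 uM each
taq_units = 5 * 0.16                         # 5 units/ul, 0.16 ul  -> about 0.8 units

print(buffer_x, round(dntp_mM, 3), round(primer_uM, 3), round(taq_units, 3))
```

The volume of water needed to reach 20 μl depends on the DNA concentration, which is not stated, so it is left out of this sketch.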

Estimation of LTR Retrotransposon Insertion Dates.

The insertion date of each LTR retrotransposon harboring two LTRs was estimated following the method described in ref. 22. The divergence between LTRs was computed with MEGA version 3.0 (58), using the Kimura two-parameter distance (59), which corrects for both homoplasy and differences in the rates of transition and transversion. The substitution rate used was 1.3 × 10⁻⁸ substitutions per site per year, as suggested by Ma and Bennetzen (60).
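The calculation can be sketched in Python under two stated assumptions: the two LTRs are already aligned to equal length, and the divergence K is converted to time with the usual relation T = K / (2r), on the reasoning that both LTRs were identical at insertion and have since accumulated substitutions independently. The study itself used MEGA for the distance; the functions below are an illustrative stand-in, not that software.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(ltr5, ltr3):
    """Kimura two-parameter distance between two aligned sequences.

    Columns with gaps or ambiguous bases are skipped.
    K = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)), where P and Q are the
    proportions of transitions and transversions.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_age_years(k, rate=1.3e-8):
    """Insertion date estimate, T = K / (2r), with r in substitutions/site/year."""
    return k / (2 * rate)
```

With the rate used here, a divergence of K = 0.026 between the two LTRs corresponds to an insertion about 1 million years ago.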

Comparison of Gene-Free and Gene-Containing Regions.

Six gene-containing BACs, GenBank accession nos. AF123535 (15), AF448416 (24), AY555142 (7), AY664413, AY664414, and AY664415 (23), were chosen for comparison with the six gene-free BACs in this study. LTR retrotransposon copy numbers and insertion dates were recalculated to avoid differences caused by different analytical procedures.

Variation in copy number between the gene-free and gene-containing BACs was determined for the 10 most-abundant LTR retrotransposons. For each element, two methods were used: (i) a P value was estimated by comparing the observed copy number in the gene-free regions to a binomial distribution with n trials (n being the total number of copies found in both gene-free and gene-rich regions for this element) and a probability of a copy being in a gene-free region estimated by P = N_1/(N_1 + N_2), with N_1 and N_2 the total number of copies found for all elements in the gene-free and gene-rich regions, respectively; (ii) a P value was estimated by using a nonparametric test. For each element, a theoretical distribution of the copy number in gene-free regions was built by using 999 simulations in which N_GF copies were randomly picked without replacement from the total number of copies observed in both gene-rich and gene-free regions (N_GF + N_GR), with N_GF and N_GR the observed number of copies in gene-free and gene-rich regions, respectively. The P value was estimated by comparing the observed number of copies in the gene-free regions to this distribution. A Bonferroni correction was used to correct for the 10 individual tests that were made (one for each of the 10 most-abundant elements) out of the same sample.
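Both tests can be sketched in Python with the standard library only. Several details are assumptions, because the text does not specify them: the P values are made two-sided by doubling the smaller tail, the simulation is read as drawing as many copies as were found in gene-free regions from the pooled copies of all elements and counting those that belong to the element of interest, and the Monte Carlo P value uses the common (count + 1)/(simulations + 1) form. All function names are hypothetical.

```python
import math
import random


def binomial_two_sided_p(k_obs, n, p):
    """Method (i): compare k_obs gene-free copies of one element with
    Binomial(n, p), where p = N1 / (N1 + N2). Two-sided by doubling the
    smaller tail (an assumption)."""
    pmf = [math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    lower = sum(pmf[: k_obs + 1])   # P(X <= k_obs)
    upper = sum(pmf[k_obs:])        # P(X >= k_obs)
    return min(1.0, 2 * min(lower, upper))


def simulated_p(k_obs, n_element, n_gf, n_gr, n_sim=999, seed=0):
    """Method (ii): draw n_gf copies without replacement from the pooled
    n_gf + n_gr copies and count how many belong to the element of interest
    (n_element copies in the pool). This reading of the text is an assumption."""
    rng = random.Random(seed)
    pool = [True] * n_element + [False] * (n_gf + n_gr - n_element)
    sims = [sum(rng.sample(pool, n_gf)) for _ in range(n_sim)]
    at_least = sum(s >= k_obs for s in sims)
    at_most = sum(s <= k_obs for s in sims)
    return min(1.0, 2 * (min(at_least, at_most) + 1) / (n_sim + 1))


def bonferroni(p_value, n_tests=10):
    """Bonferroni correction for the 10 per-element tests."""
    return min(1.0, p_value * n_tests)
```

Under this sketch, an element would be reported as enriched or depleted in gene-free regions when `bonferroni(p)` falls below the chosen significance level for either method.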

BAC Mapping and Comparison to Centromere and Knob Locations.

BACs with polymorphic markers were genetically mapped in a subset of 94 inbred lines from the IBM maize mapping population (61) using MapMaker 2.0. In addition, all BACs were mapped by using BAC fingerprint data retrieved from the Arizona Genomics Institute Web site (www.genome.arizona.edu/fpc/WebAGCoL/maize/WebFPC). These data were then used to locate the corresponding bin, using the location of these markers on the IBM2 2004 neighbors map.

To precisely position the six BACs on the genome, positions of the markers on the IBM2 2004 map were converted to physical locations, using the Morgan2McClintock translator (62) (available at www.lawrencelab.org/Morgan2McClintock). This allowed prediction of the physical position of the markers as a fraction of the distance on the arm from the centromere. Similarly, knob-containing bins were physically mapped on the chromosomes, allowing the estimation of the distance between the six BACs and these heterochromatic regions.

Acknowledgments

We thank P. SanMiguel for providing an unpublished maize retrotransposon data set, C. Lawrence for advice on locating maize knobs, L. Yang for Helitron analysis, and H. Zheng and J. Estill for help with statistical analyses. This research was supported by National Science Foundation Grant DBI-0501814 (to J.L.B.).

Abbreviations

TSD, target site duplication

My, million years

RJM, repeat junction marker

TE, transposable element

IBM, Intermated B73 × Mo17

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0704258104/DC1.

References

  • 1. The Arabidopsis Genome Initiative. Nature. 2000;408:796–815. doi: 10.1038/35048692.
  • 2. Ausubel FM. Plant Physiol. 2002;129:394–437.
  • 3. Ramakrishna W, Dubcovsky J, Park YJ, Busso C, Emberton J, SanMiguel P, Bennetzen JL. Genetics. 2002;162:1389–1400. doi: 10.1093/genetics/162.3.1389.
  • 4. Ilic K, SanMiguel PJ, Bennetzen JL. Proc Natl Acad Sci USA. 2003;100:12265–12270. doi: 10.1073/pnas.1434476100.
  • 5. Bennetzen JL, Ma J. Curr Opin Plant Biol. 2003;6:128–133. doi: 10.1016/s1369-5266(03)00015-3.
  • 6. Lai J, Ma J, Swigonova Z, Ramakrishna W, Linton E, Llaca V, Tanyolac B, Park YJ, Jeong Y, Bennetzen JL, et al. Genome Res. 2004;14:1924–1931. doi: 10.1101/gr.2701104.
  • 7. Ma J, SanMiguel P, Lai J, Messing J, Bennetzen JL. Genetics. 2005;170:1209–1220. doi: 10.1534/genetics.105.040915.
  • 8. Whitelaw CA, Barbazuk WB, Pertea G, Chan AP, Cheung F, Lee Y, Zheng L, van Heeringen S, Karamycheva S, Bennetzen JL, et al. Science. 2003;302:2118–2120. doi: 10.1126/science.1090047.
  • 9. Palmer LE, Rabinowicz PD, O'Shaughnessy AL, Balija VS, Nascimento LU, Dike S, de la Bastide M, Martienssen RA, McCombie WR. Science. 2003;302:2115–2117. doi: 10.1126/science.1091265.
  • 10. Bennetzen JL. Proc Tenth Int Wheat Genet Symp. 2003;1:215–220.
  • 11. Devos KM, Ma J, Pontaroli AC, Pratt LH, Bennetzen JL. Proc Natl Acad Sci USA. 2005;102:19243–19248. doi: 10.1073/pnas.0509473102.
  • 12. Brenner S, Elgar G, Sandford R, Macrae A, Venkatesh B, Aparicio S. Nature. 1993;366:265–268. doi: 10.1038/366265a0.
  • 13. Messing J, Bharti AK, Karlowski WM, Gundlach H, Kim HR, Yu Y, Wei F, Fuks G, Soderlund CA, Mayer KFX, et al. Proc Natl Acad Sci USA. 2004;101:14349–14354. doi: 10.1073/pnas.0406163101.
  • 14. Haberer G, Young S, Bharti AK, Gundlach H, Raymond C, Fuks G, Butler E, Wing RA, Rounsley S, Birren B, et al. Plant Physiol. 2005;139:1612–1624. doi: 10.1104/pp.105.068718.
  • 15. SanMiguel P, Tikhonov A, Jin YK, Motchoulskaia N, Zakharov D, MelakeBerhan A, Springer PS, Edwards KJ, Lee M, Avramova Z, et al. Science. 1996;274:765–768. doi: 10.1126/science.274.5288.765.
  • 16. Meyers BC, Tingley SV, Morgante M. Genome Res. 2001;11:1660–1676. doi: 10.1101/gr.188201.
  • 17. SanMiguel P, Bennetzen JL. Ann Bot. 1998;82:37–44.
  • 18. Bennetzen JL, Coleman C, Liu R, Ma J, Ramakrishna W. Curr Opin Plant Biol. 2004;7:732–736. doi: 10.1016/j.pbi.2004.09.003.
  • 19. Sabot F, Guyot R, Wicker T, Chantret N, Laubin B, Chalhoub B, Leroy P, Sourdille P, Bernard M. Mol Genet Genomics. 2005;274:119–130. doi: 10.1007/s00438-005-0012-9.
  • 20. Ouyang S, Buell CR. Nucleic Acids Res. 2004;32:D360–D363. doi: 10.1093/nar/gkh099.
  • 21. Matsumoto T, Wu JZ, Kanamori H, Katayose Y, Fujisawa M, Namiki N, Mizuno H, Yamamoto K, Antonio BA, Baba T, et al. Nature. 2005;436:793–800.
  • 22. SanMiguel P, Gaut BS, Tikhonov A, Nakajima Y, Bennetzen JL. Nat Genet. 1998;20:43–45. doi: 10.1038/1695.
  • 23. Brunner S, Fengler K, Morgante M, Tingey S, Rafalski A. Plant Cell. 2005;17:343–360. doi: 10.1105/tpc.104.025627.
  • 24. Fu H, Dooner HK. Proc Natl Acad Sci USA. 2002;99:9573–9578. doi: 10.1073/pnas.132259199.
  • 25. Ananiev EV, Phillips RL, Rines HW. Genetics. 1998;149:2025–2037. doi: 10.1093/genetics/149.4.2025.
  • 26. Ma J, Devos KM, Bennetzen JL. Genome Res. 2004;14:860–869. doi: 10.1101/gr.1466204.
  • 27. McClintock B, Kato Y, Blumenshein A. Chromosome Constitution of Races of Maize. Chapingo, Mexico: Colegio de Postgraduados; 1981.
  • 28. Kato A, Lamb JC, Birchler JA. Proc Natl Acad Sci USA. 2004;101:13554–13559. doi: 10.1073/pnas.0403659101.
  • 29. Goff SA, Ricke D, Lan TH, Presting G, Wang RL, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al. Science. 2002;296:92–100. doi: 10.1126/science.1068275.
  • 30. Yu J, Hu SN, Wang J, Wong GKS, Li SG, Liu B, Deng YJ, Dai L, Zhou Y, Zhang XQ, et al. Science. 2002;296:79–92. doi: 10.1126/science.1068037.
  • 31. Feng Q, Zhang YJ, Hao P, Wang SY, Fu G, Huang YC, Li Y, Zhu JJ, Liu YL, Hu X, et al. Nature. 2002;420:316–320. doi: 10.1038/nature01183.
  • 32. Sasaki T, Matsumoto T, Yamamoto K, Sakata K, Baba T, Katayose Y, Wu JZ, Niimura Y, Cheng ZK, Nagamura Y, et al. Nature. 2002;420:312–316. doi: 10.1038/nature01184.
  • 33. Yu Y, Rambo T, Currie J, Saski C, Kim HR, Collura K, Thompson S, Simmons J, Yang TJ, Nah G, et al. Science. 2003;300:1566–1569.
  • 34. Itoh T, Tanaka T, Barrero RA, Yamasaki C, Fujii Y, Hilton PB, Antonio BA, Aono H, Apweiler R, Bruskiewich R, et al. Genome Res. 2007;17:175–183. doi: 10.1101/gr.5509507.
  • 35. Swigonova Z, Lai J, Ma J, Ramakrishna W, Llaca V, Bennetzen JL, Messing J. Genome Res. 2004;14:1916–1923. doi: 10.1101/gr.2332504.
  • 36. Langham RJ, Walsh J, Dunn M, Ko C, Goff SA, Freeling M. Genetics. 2004;166:935–945. doi: 10.1534/genetics.166.2.935.
  • 37. Ramakrishna W, Emberton J, Ogden M, SanMiguel P, Bennetzen JL. Plant Cell. 2002;14:3213–3223. doi: 10.1105/tpc.006338.
  • 38. Lai J, Li Y, Messing J, Dooner HK. Proc Natl Acad Sci USA. 2005;102:9068–9073. doi: 10.1073/pnas.0502923102.
  • 39. Morgante M, Brunner S, Pea G, Fengler K, Zuccolo A, Rafalski A. Nat Genet. 2005;37:997–1002. doi: 10.1038/ng1615.
  • 40. Dubcovsky J, Ramakrishna W, SanMiguel PJ, Busso CS, Yan LL, Shiloff BA, Bennetzen JL. Plant Physiol. 2001;125:1342–1353. doi: 10.1104/pp.125.3.1342.
  • 41. Avramova Z, Tikhonov A, SanMiguel P, Jin YK, Liu CN, Woo SS, Wing RA, Bennetzen JL. Plant J. 1996;10:1163–1168. doi: 10.1046/j.1365-313x.1996.10061163.x.
  • 42. Devos KM, Beales J, Nagamura Y, Sasaki T. Genome Res. 1999;9:825–829. doi: 10.1101/gr.9.9.825.
  • 43. Vandepoele K, Saeys Y, Simillion C, Raes J, Van de Peer Y. Genome Res. 2002;12:1792–1801. doi: 10.1101/gr.400202.
  • 44. Liu H, Sachidanandam R, Stein L. Genome Res. 2001;11:2020–2026. doi: 10.1101/gr.194501.
  • 45. Bennetzen JL, SanMiguel P, Chen MS, Tikhonov A, Francki M, Avramova Z. Proc Natl Acad Sci USA. 1998;95:1975–1978. doi: 10.1073/pnas.95.5.1975.
  • 46. Brunner S, Pea G, Rafalski A. Plant J. 2005;43:799–810. doi: 10.1111/j.1365-313X.2005.02497.x.
  • 47. Vitte C, Bennetzen JL. Proc Natl Acad Sci USA. 2006;103:17638–17643. doi: 10.1073/pnas.0605618103.
  • 48. Devos KM, Brown JKM, Bennetzen JL. Genome Res. 2002;12:1075–1079. doi: 10.1101/gr.132102.
  • 49. Ma J, Bennetzen JL. Proc Natl Acad Sci USA. 2006;103:383–388. doi: 10.1073/pnas.0509810102.
  • 50. Peacock WJ, Dennis ES, Rhoades MM, Pryor AJ. Proc Natl Acad Sci USA. 1981;78:4490–4494. doi: 10.1073/pnas.78.7.4490.
  • 51. Dennis ES, Peacock WJ. J Mol Evol. 1984;20:341–350. doi: 10.1007/BF02104740.
  • 52. Viotti A, Privitera E, Sala E, Pogna N. Theor Appl Genet. 1985;70:234–239. doi: 10.1007/BF00304904.
  • 53. Ananiev EV, Phillips RL, Rines HW. Proc Natl Acad Sci USA. 1998;95:10785–10790. doi: 10.1073/pnas.95.18.10785.
  • 54. Dawe RK, Hiatt EN. Chromosome Res. 2004;12:655–669. doi: 10.1023/B:CHRO.0000036607.74671.db.
  • 55. Yim YS, Davis GL, Duru NA, Musket TA, Linton EW, Messing JW, McMullen MD, Soderlund CA, Polacco ML, Gardiner JM, et al. Plant Physiol. 2002;130:1686–1696. doi: 10.1104/pp.013474.
  • 56. Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J. Nucleic Acids Res. 2001;29:159–164. doi: 10.1093/nar/29.1.159.
  • 57. Bhattacharyya GK, Johnson RA. Statistical Concepts and Methods. New York: Wiley; 1977.
  • 58. Kumar S, Tamura K, Nei M. Brief Bioinform. 2004;5:150–163. doi: 10.1093/bib/5.2.150.
  • 59. Kimura M. J Mol Evol. 1980;16:111–120. doi: 10.1007/BF01731581.
  • 60. Ma J, Bennetzen JL. Proc Natl Acad Sci USA. 2004;101:12404–12410. doi: 10.1073/pnas.0403715101.
  • 61. Lee M, Sharopova N, Beavis WD, Grant D, Katt M, Blair D, Hallauer A. Plant Mol Biol. 2002;48:453–461. doi: 10.1023/a:1014893521186.
  • 62. Lawrence CJ, Seigfried TE, Bass HW, Anderson LK. Genetics. 2006;172:2007–2009. doi: 10.1534/genetics.105.054155.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
