Supporting information for Brouha et al. (2003) Proc. Natl. Acad. Sci. USA, 10.1073/pnas.0831042100
Supporting Text
Database Search Details.
Briefly, the FASTA alignment algorithm (1) was used to identify sequences >98% identical to L1.3. Candidate sequences were annotated with respect to typical L1 structural hallmarks. Full-length Ta, pre-Ta, and non-Ta elements were analyzed for similarity to the ORFs of L1.3 (L19092) by using the FRAMESEARCH program (GCG) at the Human Genome Mapping Project (www.hgmp.mrc.ac.uk), and intact elements were identified by manual inspection of the alignments generated. This was necessary because stop codons within the intergenic region and in-frame deletions may be tolerated, despite their disruption of the canonical L1 polypeptide sequence. Intact elements (full-length with two ORFs) were further annotated with respect to the GC content of the 10 kb of flanking genomic DNA 5' and 3' of the insertion site. Where target site duplications (TSDs) could be identified, they were used to reconstruct the empty site, thus excluding L1 transduced sequences in the composition analysis. When a TSD could not be identified, the element and its putative poly(A) tail were excised in silico to generate a putative empty site. Where necessary, flanking sequences were extended across accession boundaries, unless there was a known contig break. Empty sites were also assessed with respect to their recombinagenicity (in centimorgans per megabase) by identifying the physically closest marker typed in the two most recent human genetic map construction analyses (2, 3).
Allele Frequency Details.
Genomic DNA was purchased from Coriell Cell Repositories (Camden, NJ). Briefly, the presence of an L1 was tested by PCR using a 5' or 3' flanking primer and an internal primer. About 50% of polymorphic elements were tested for presence by PCR across the 3' flank, and the rest were tested for presence across the 5' flank. The absence of an L1 was tested by using a 5' flanking forward primer and a 3' flanking reverse primer. Primers were designed with MacVector (Accelrys, San Diego). Expand Long template PCR reactions were performed in a PTC-0220 Peltier Thermocycler (MJ Research, Incline Village, NV), with 50–200 ng of genomic DNA in a 20-µl reaction volume. The PCR was run under the following conditions: 10-min denaturation at 94°C, 30 cycles of (30-s denaturation at 94°C, 40-s annealing step at 55–70°C, 1.5-min elongation at 68°C), and 10-min final elongation at 68°C. Primer sequences are available in Table 2. Our data on allele frequencies of 86 L1s partially overlap previous studies and are consistent with all but 1 of 22 overlapping allele frequencies in Myers et al. (4) and with all 10 overlapping allele frequencies reported in Boissinot et al. (5).
Noncanonical L1 Description.
Preliminary sequence analysis of these 10 noncanonical L1s did not reveal an obvious mechanism for the generation of such L1s. Ac009269 and ac023480 appear to be mosaic. Ac009269 (element with ≈400-bp deletion in 5' UTR) contains nucleotides typical of both Ta-1 and Ta-0 interspersed throughout the length of the L1. Similarly, ac023480 has nucleotides typical of both Ta-1d and Ta-1nd throughout the L1. Such switching may be consistent with the "toggling" found in an L1 rescued from a cell culture assay (6). The other eight elements appear to belong to one subclass while having a single anomalous site near the 3' end. Using the numbering found in Boissinot et al. (5), ac093775 is a Ta-1d except for an older pre-Ta nucleotide at 6040. Ac021017 contains all Ta-1d diagnostic nucleotides except for the older Ta-0 GGAC at 5557–60. Ac015971 has pre-Ta nucleotides with one exception at 5954–56, where it is similar to an older L1Pa2. Finally, five pre-Ta elements (ac008496, ac117496, ac026113, al357559, al360272) have a single younger nucleotide at 5557. In contrast to the old/young chimeric L1s seen in Gilbert et al. (7) and the L1/PAI1b cDNA chimeras described by Wei et al. (8), there never appears to be a clear switching point between multiple diagnostic nucleotides of one subclass and those of another. The mutations appear independent. Therefore, the neighbor-joining tree is a valid method to evaluate the sequence divergence of these elements. Retrotransposition competence does not appear to be affected by the hybrid nature of these elements; four are weakly active and one is a hot L1.
Mutation Analysis.
To analyze the relationship between L1 activity and L1 nucleotide sequence, we constructed a consensus sequence with eight of the hot L1s (LRE3, L1RP, ac004200, ac002980, al356438, al512428, ac021017, al137845). We compared each intact L1 to the consensus of the hot elements. Active L1s have an average of 17 differences per element, whereas inactive L1s have an average of 21 differences per element. We did not find nucleotide positions in the L1 sequence where differences were associated solely with active or inactive L1s. Inactive L1s have more differences than active L1s in the 5' UTR, ORF1, the inter-ORF, ORF2, the endonuclease (EN), reverse transcriptase (RT), and cysteine-rich domains of ORF2, and the 3' UTR. However, both types of L1s show less variability in coding regions than in noncoding regions. The EN, RT, and the cysteine-rich regions in ORF2 are only slightly less variable than other ORF2 regions.
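The comparison described above (build a consensus from the hot elements, then count differences per element) can be sketched as follows. This is a minimal illustration, not the procedure actually used: it assumes the sequences have already been aligned to equal length, uses a simple majority rule per column, and breaks ties alphabetically. The function names and the toy sequences in the usage note are hypothetical.

```python
from collections import Counter


def consensus(aligned):
    """Majority-rule consensus of pre-aligned, equal-length sequences.

    Assumption: ties within a column are broken alphabetically; the
    source does not say how the consensus was resolved.
    """
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    out = []
    for column in zip(*aligned):
        counts = Counter(column)
        # max() returns the first maximal item, so sorting the symbols
        # first makes the tie-break deterministic.
        out.append(max(sorted(counts), key=counts.__getitem__))
    return "".join(out)


def count_differences(seq, cons):
    """Number of aligned positions at which seq differs from cons."""
    if len(seq) != len(cons):
        raise ValueError("sequence and consensus must be the same length")
    return sum(a != b for a, b in zip(seq, cons))


def mean_differences(elements, cons):
    """Average number of differences per element against the consensus."""
    return sum(count_differences(e, cons) for e in elements) / len(elements)
```

With toy input such as `consensus(["ACGTAC", "ACGTAC", "ACGAAC", "TCGTAC"])`, each element can then be scored with `count_differences`, and the active and inactive groups summarized separately with `mean_differences`.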
Nucleotide differences in the ORFs that result in amino acid changes (replacement changes, R) are more likely to be detrimental than differences that do not (synonymous or silent changes, S). Therefore, the ratio of replacement changes to synonymous changes (R/S) is reduced in functional coding regions (9). Pseudogenes have an R/S of 2.5–3.0 (10), whereas functional β-globin genes show an R/S ratio of ≈1.0 (11). In ORF1, the overall R/S ratio was 2.2 in inactive L1s and 1.6 in active L1s. In ORF2, the ratios were 1.7 and 1.4, respectively. The R/S ratios of the EN domain in ORF2 were 1.8 for inactive and 1.5 for active L1s. Finally, the RT domain had ratios of 1.6 and 1.2, respectively.
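The R/S tally can be illustrated with a short sketch. It assumes an in-frame, gap-free alignment of an element's ORF against a reference ORF (for example the hot-element consensus), uses the standard genetic code, and classifies each differing codon as a whole, so a codon with two changed nucleotides counts once. That is a simplification and may not match how the values above were obtained; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order at each position.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def count_r_s(ref_orf, orf):
    """Return (R, S): replacement and synonymous codon differences.

    Assumption: both sequences are in frame, the same length, and
    contain only A, C, G, and T.
    """
    if len(ref_orf) != len(orf) or len(orf) % 3 != 0:
        raise ValueError("ORFs must be equal-length multiples of three")
    replacement = synonymous = 0
    for i in range(0, len(orf), 3):
        ref_codon = ref_orf[i:i + 3].upper()
        codon = orf[i:i + 3].upper()
        if ref_codon == codon:
            continue
        if CODON_TABLE[ref_codon] == CODON_TABLE[codon]:
            synonymous += 1
        else:
            replacement += 1
    return replacement, synonymous


def rs_ratio(replacement, synonymous):
    """R/S ratio, or None when there are no synonymous changes."""
    if synonymous == 0:
        return None
    return replacement / synonymous
```

Pooling the R and S counts over all elements in a group before dividing avoids undefined ratios for individual elements that happen to carry no synonymous changes.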
Statistical Analysis of L1 Genomic Distribution.
There was no evidence that active elements were overrepresented near (within 50 kb) or within known genes [Fisher's exact test (one tail) P = 0.3647 and 0.1598, respectively]. When the GC content of empty sites was classified as "high" (greater than the genome average of 41%) or "low" (less than or equal to the genome average), there was no significant overrepresentation of active sequences in GC-rich regions [Fisher's exact test (one tail) P = 0.192]. Similarly, there was no association of active elements with elevated recombination rates under the assumption of a genomewide average of 1 cM/Mb [Fisher's exact test (one tail), P = 0.1083]. All the empty sites analyzed also showed the expected weakly positive association between recombination rate and GC content (R² = 0.1073, P < 0.01). Finally, the endonuclease cleavage sites of all analyzed elements, inferred from target site duplication sequences, showed strong similarity to the established consensus 5'-TTTTT/A-3' (T (52.4%), T (72.6%), T (63.1%), T (76.2%)).
1. Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444–2448.
2. Yu, A., Zhao, C., Fan, Y., Jang, W., Mungall, A. J., Deloukas, P., Olsen, A., Doggett, N. A., Ghebranious, N., Broman, K. W. & Weber, J. L. (2001) Nature 409, 951–953.
3. Kong, A., Gudbjartsson, D. F., Sainz, J., Jonsdottir, G. M., Gudjonsson, S. A., Richardsson, B., Sigurdardottir, S., Barnard, J., Hallbeck, B., Masson, G., et al. (2002) Nat. Genet. 31, 241–247.
4. Myers, J. S., Vincent, B. J., Udall, H., Watkins, W. S., Morrish, T. A., Kilroy, G. E., Swergold, G. D., Henke, J., Henke, L., Moran, J. V., et al. (2002) Am. J. Hum. Genet. 71, 312–326.
5. Boissinot, S., Chevret, P. & Furano, A. V. (2000) Mol. Biol. Evol. 17, 915–928.
6. Symer, D. E., Connelly, C., Szak, S. T., Caputo, E. M., Cost, G. J., Parmigiani, G. & Boeke, J. D. (2002) Cell 110, 327–338.
7. Gilbert, N., Lutz-Prigge, S. & Moran, J. V. (2002) Cell 110, 315–325.
8. Wei, W., Gilbert, N., Ooi, S. L., Lawler, J. F., Ostertag, E. M., Kazazian, H. H., Boeke, J. D. & Moran, J. V. (2001) Mol. Cell. Biol. 21, 1429–1439.
9. Martin, S. L., Voliva, C. F., Burton, F. H., Edgell, M. H. & Hutchison, C. A., III (1984) Proc. Natl. Acad. Sci. USA 81, 2308–2312.
10. Hardies, S. C., Martin, S. L., Voliva, C. F., Hutchison, C. A., III, & Edgell, M. H. (1986) Mol. Biol. Evol. 3, 109–125.
11. Czelusniak, J., Goodman, M., Hewett-Emmett, D., Weiss, M. L., Venta, P. J. & Tashian, R. E. (1982) Nature 298, 297–300.