Abstract
Numerous inbred mouse strains comprise models for human diseases and diversity, but the molecular differences between them are mostly unknown. Several mammalian genomes have been assembled, providing a framework for identifying structural variations. To identify variants between inbred mouse strains at a single nucleotide resolution, we aligned 26 million individual sequence traces from four laboratory mouse strains to the C57BL/6J reference genome. We discovered and analyzed over 10,000 intermediate-length genomic variants (from 100 nucleotides to 10 kilobases), distinguishing these strains from the C57BL/6J reference. Approximately 85% of such variants are due to recent mobilization of endogenous retrotransposons, predominantly L1 elements, greatly exceeding that reported in humans. Many genes’ structures and expression are altered directly by polymorphic L1 retrotransposons, including Drosha (also called Rnasen), Parp8, Scn1a, Arhgap15, and others, including novel genes. L1 polymorphisms are distributed nonrandomly across the genome, as they are excluded significantly from the X chromosome and from genes associated with the cell cycle, but are enriched in receptor genes. Thus, recent endogenous L1 retrotransposition has diversified genomic structures and transcripts extensively, distinguishing mouse lineages and driving a major portion of natural genetic variation.
Inbred mouse strains form a foundation for mammalian genetics research. Hundreds of distinct lineages including well-known laboratory strains were generated from limited founders by repetitive crosses of highly related animals within the past 100–300 yr. Individuals of a given strain are both virtually homozygous at all autosomal loci and isogenic (Beck et al. 2000). The power of mouse genetics research in part comes from naturally occurring genetic variation between different strains. Phenotypic differences between mouse lineages, such as disease susceptibility traits, behavioral differences, and many other characteristics, are widely used to model human developmental and metabolic disorders, cancers, and many other diseases and traits (Beck et al. 2000).
Genome sequence assemblies have been completed recently for the mouse and other mammalian species. Large-scale resequencing projects have focused upon identification of certain forms of sequence variation, especially short variants such as single nucleotide polymorphisms (SNPs) (International HapMap Consortium 2005), that might account for functional differences between mammalian individuals or lineages. Such work has helped map a small number of quantitative trait loci, tabulated common variants associated with cancers and other diseases, and facilitated analysis of mammalian evolution (Wade and Daly 2005; Conrad et al. 2006; Frazer et al. 2007). More recently, longer structural variants have been identified, distinguishing human individuals and mouse substrains (Mills et al. 2006; Egan et al. 2007; Korbel et al. 2007). Several recent studies on human structural variation revealed that nonhomologous end joining and endogenous transposition of retroelements have contributed mechanistically to most insertion or deletion (indel) changes between human genomes (Mills et al. 2006; Korbel et al. 2007).
Various classes of repetitive elements, mostly transposons, make up nearly half of the mammalian genomes assembled (Lander et al. 2001; Waterston et al. 2002). While some retrotransposon families are actively mobilized in mouse and human genomes (Kazazian 2004), occasionally resulting in disease-causing mutations (Chen et al. 2005) and various forms of genomic instability (Symer et al. 2002), their contributions to structural variation are largely unknown. Since transposons can introduce promoters, terminators, and alternative splice sites, and affect local chromatin structures (Whitelaw and Martin 2001; Roy-Engel et al. 2005; Wheelan et al. 2005; Belancio et al. 2006; Chen et al. 2006), their active mobilization in genomes is a likely determinant of transcriptional variation (Horie et al. 2007), and therefore at least some cases of phenotypic variation.
A comprehensive analysis of structural variation between classical inbred mouse strains has not been conducted to date, except for SNPs and certain copy number variants (CNVs). In this study, to identify intermediate-length structural variants between inbred mouse strains at single-nucleotide resolution, we aligned individual sequence traces to the reference mouse genome using a fast and accurate new method. Virtually all sampled predictions were validated by specific polymerase chain reaction (PCR) assays. Surprisingly, most of the identified genomic variants between mouse strains were caused by recent mobilization of endogenous transposable elements, of which L1 retrotransposons were most active. Additionally, as described here, we found that a substantial number of these polymorphic transposons directly altered transcript structures and expression levels in corresponding mouse strains.
Results
Most intermediate-size mouse structural variants are due to transposition
High-resolution data from whole-genome shotgun (WGS) sequencing of four inbred mouse strains, A/J, DBA/2J, 129S1/SvImJ (henceforth, 129S1), and 129X1/SvJ (129X1) (Mural et al. 2002), whose genomes remain unassembled, were deposited recently at the National Center for Biotechnology Information (NCBI) trace archive (Mural et al. 2002; Wade and Daly 2005). To identify genomic variants distinguishing these strains, we downloaded ∼26 million WGS sequence traces (cumulative length ∼18 billion nucleotides [nt]) and aligned them individually to the reference C57BL/6J (C57) genome assembly using GMAP. This software application was developed to map exons and therefore is well suited to align genomic fragments with intervening breaks. It appeared to speed alignments over other applications such as BLAT by 10- to 100-fold (R.M. Stephens and N. Volfovsky, unpubl.). We found that 73% of the individual sequence traces align unambiguously to the C57 reference genome with minimal or no variation (Fig. 1; Table 1; Supplemental Fig. 1). Many traces validate known SNPs and/or identify new ones, and show that significant portions of the compared strains’ genomes are nonpolymorphic in pairwise comparisons (Wade et al. 2002). In contrast, others align to multiple repetitive elements or to no unique locus, identify short tandem repeat (STR) polymorphisms (N. Volfovsky, J. Li, K. Akagi, R.M. Stephens, and D.E. Symer, in prep.), and/or identify indel variants.
Table 1.
(A) WGS traces from four unassembled mouse strain genomes were aligned to the C57 reference using GMAP and a genome variation discovery pipeline. Resulting categories of alignment were tabulated for each alternative strain (see Supplemental material). (B) A summary of the coverage by clustered sequence traces mapped to the reference assembly, whose size in mm8 release is 2,644,077,689 nt.
Upon merging overlapping individual WGS traces (Fig. 1A; Supplemental Fig. 1), more than 10,000 intermediate-sized variants, ranging from 100 nt to 10 kb, are predicted by this analysis to be present in the C57 reference, but absent from at least one of the other strain(s) (Figs. 1, 2; Supplemental Table 1). We call each such indel variant a “polymorphic insertion in C57” since it is present in the reference genome (Fig. 1) but absent from another strain. Even more variants were found present in at least one of the four unassembled strains, but absent from the reference (“polymorphic insertion in strain X”). These latter variants are difficult to characterize without full genome assemblies, precluding their detailed analysis here. We do not wish to imply by this nomenclature that the polymorphisms’ mechanism of formation is known in all cases; an indel variant that we call an insertion in a given strain could alternatively have been deleted from another strain. All polymorphisms identified here were determined from comparisons with the reference C57 mouse genome. Our alignment procedures, categorization of WGS traces, and resulting sequence coverage for each strain are described in Figure 1, Table 1, Supplemental Figure 1, Supplemental Tables 1 and 2, and Supplemental Methods. Comprehensive data about the genomic variants distinguishing mouse strains, as discovered in this study, are available using PolyBrowse, our new genomic polymorphism query and display website at http://polybrowse.abcc.ncifcrf.gov/ (R.M. Stephens, K. Akagi, J.R. Collins, B. Neelam, D. McCullough, N. Volfovsky, and D.E. Symer, in prep.).
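The merging and size-filtering step described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the tuple layout, the function names `merge_intervals` and `is_intermediate`, and the example coordinates are assumptions made for illustration. Only the 100 nt to 10 kb size window comes from the text.

```python
from typing import List, Tuple

# (start, end) on one chromosome, half-open; a hypothetical layout.
Interval = Tuple[int, int]

MIN_LEN = 100      # lower size bound from the text (100 nt)
MAX_LEN = 10_000   # upper size bound from the text (10 kb)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or abutting intervals into non-overlapping clusters."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous cluster: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_intermediate(interval: Interval) -> bool:
    """True if the interval length falls within the 100 nt to 10 kb window."""
    length = interval[1] - interval[0]
    return MIN_LEN <= length <= MAX_LEN


# Hypothetical reference segments reported as missing by three traces
# from one strain (invented coordinates, not data from the study).
gaps = [(1_000, 1_900), (1_850, 2_400), (50_000, 50_040)]
clusters = merge_intervals(gaps)
candidates = [c for c in clusters if is_intermediate(c)]
print(clusters)
print(candidates)
```

In this toy input the first two segments overlap and merge into one 1400-nt candidate, while the 40-nt segment falls below the size window and is dropped.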
Almost all such variants include at least 70% sequence content from various classes of repetitive elements (Fig. 2A), as identified by RepeatMasker (A.F.A. Smit, R. Hubley, and P. Green; http://www.repeatmasker.org). A large majority contains >90% transposon sequences per variant. Their length distribution is strikingly bimodal, matching transposons’ known structures in the mouse genome (Fig. 2). Of these transposon indels, L1 (LINE, long interspersed element) retrotransposons are the most numerous. L1 integrants are frequently truncated from the 5′ end, but many others are full length (Symer et al. 2002). L1 polymorphisms contributed the most variant nucleotides to the strains’ genomes overall; their mean ± standard deviation (SD) length is 1130 ± 590 nucleotides. Other classes of active transposable elements, including short interspersed elements (SINEs, mostly B2 elements) and long terminal repeat-containing retrotransposons (e.g., ERV-K and MaLR elements), are also very frequently polymorphic between strains (Fig. 2B).
L1 polymorphisms
We tabulated a total of 666,328 “reference L1s” (each >100 nt) in the haploid C57 reference genome using RepeatMasker (A.F.A. Smit, R. Hubley, and P. Green; http://www.repeatmasker.org), based on their evolutionary ages and structures (Fig. 1B; Supplemental Table 3). These counts are likely to be inexact because gaps remain in the reference genome assembly, currently 98.6% complete (Table 1). Remaining gaps frequently include highly repetitive sequences. Mouse Y chromosome sequences have not been assembled, and some transposons are “compound” elements, contiguous to one another, that cannot be counted unambiguously.
At least 127,803 L1 elements (19.2% of the total) are present in all four strains’ unassembled genomes and in C57, so we call them “nonpolymorphic” (Fig. 1B). Notably, some of these may be fixed in all mouse lineages, but their presence has been determined only for the five inbred strains here. In contrast, at least 6723 (1%) distinct elements are L1 polymorphisms in C57, i.e., present in the C57 reference and possibly other strains, but absent from at least one strain. We compared the absent or present status (A/P call) for all five inbred strains in 1861 fully predicted cases out of 6723 L1 polymorphisms. These pairwise comparisons confirmed that 129S1 and 129X1 strains are most similar, while A/J and DBA/2J are most divergent (Supplemental Table 4). These results corroborate both earlier phylogenetic analyses using SNPs and other genomic markers, and strains’ known breeding histories (Wade et al. 2002).
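A pairwise comparison of absent/present (A/P) calls of the kind described above might be sketched as follows. The A/P strings are invented toy data, chosen only to show the mechanics; they are not taken from Supplemental Table 4, and the `discordance` helper is a hypothetical name.

```python
from itertools import combinations

# Hypothetical A/P calls at four loci per strain (toy data, not study results).
calls = {
    "C57":    "PPPP",
    "129S1":  "PAPA",
    "129X1":  "PAPA",
    "A/J":    "APPA",
    "DBA/2J": "PAAP",
}


def discordance(a: str, b: str) -> float:
    """Fraction of loci at which two strains have different A/P calls."""
    if len(a) != len(b):
        raise ValueError("call strings must cover the same loci")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Discordance for every unordered pair of strains.
pairwise = {
    (s, t): discordance(calls[s], calls[t])
    for s, t in combinations(calls, 2)
}

most_similar = min(pairwise, key=pairwise.get)
most_divergent = max(pairwise, key=pairwise.get)
print(most_similar, most_divergent)
```

With real data, one string per strain over the fully predicted loci would replace the toy calls, and the resulting matrix could feed a standard clustering or tree-building method.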
If a similar proportion of all reference L1s were polymorphic, then up to ∼33,000 L1s would be absent from at least one of the four unassembled strains. Additionally, many thousands of other currently unknown L1 integrants, absent from the reference genome, are likely to be present in one or more of the unassembled mouse strains. Thus, the analysis presented here substantially underestimates structural variation including transposition-mediated variation between the strains.
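The extrapolation above can be checked with a few lines of arithmetic. The sketch assumes one reading of "a similar proportion": that the polymorphic fraction is taken among reference L1s whose status could be called in all strains (nonpolymorphic plus polymorphic). That reading is an assumption of this sketch rather than something stated in the text; the counts themselves are those given above.

```python
# Counts reported in the text.
reference_l1s = 666_328   # reference L1s (>100 nt) in the C57 assembly
nonpolymorphic = 127_803  # present in all four strains and in C57
polymorphic = 6_723       # present in C57, absent from at least one strain

# Assumption: the proportion is computed among L1s with a definite call.
classified = nonpolymorphic + polymorphic
fraction_polymorphic = polymorphic / classified

# Apply that fraction to all reference L1s.
extrapolated = fraction_polymorphic * reference_l1s

print(f"classified L1s: {classified}")
print(f"polymorphic fraction among classified: {fraction_polymorphic:.2%}")
print(f"extrapolated polymorphic reference L1s: {extrapolated:.0f}")
```

Under this assumption, about 5% of classified L1s are polymorphic, which scales to roughly 33,000 of the 666,328 reference elements.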
To validate predictions of L1s present or absent in the strains, we arbitrarily selected a set of 31 L1 integrants for PCR assays (Table 2). This collection is not a random sample of mouse L1s genome-wide, as we included 22 independent polymorphic L1s present in the C57 reference, but absent from at least one of the other strains. Of these, 11 were chosen from several regions of chromosome 10, and others were picked at a frequency of approximately one per chromosome. The remaining nine elements were chosen for validation based upon their activity in a screen for fusion transcripts (see below). PCR assays were run both across left and right junctions between L1s and flanking genomic sequences, and across empty and/or occupied genomic target sites. We required results from the three PCR tests to be self-consistent. Predictions from all but one of 78 individual WGS traces (99%) identifying empty target sites (where reference L1s are absent from a strain) were validated (Supplemental Table 5), suggesting very low error rates in trace sequencing and alignments, and minimal confounding by other forms of genomic variation such as copy number variants. A predicted integrant on chromosome 17 could not be assayed in any strain, probably because its target site lies within an ancient element repeated in many genomic locations (Table 2).
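The requirement that the three PCR tests agree can be expressed as a small decision rule. The sketch below is an illustrative assumption about how such a rule might be encoded, not the scoring procedure used in this study; the function name `call_integrant` and the boolean encoding of PCR products are hypothetical. It also recomputes the validation rate from the counts given above.

```python
from typing import Optional


def call_integrant(left_junction: bool,
                   right_junction: bool,
                   empty_site: bool) -> Optional[str]:
    """Combine three PCR results into a presence/absence call.

    Each argument is True when the corresponding PCR yields a product.
    Returns 'P' (present), 'A' (absent), or None when the assays disagree
    or all fail. In an inbred (homozygous) strain, an integrant should give
    both junction products and no empty-site product, or the reverse.
    """
    if left_junction and right_junction and not empty_site:
        return "P"
    if not left_junction and not right_junction and empty_site:
        return "A"
    return None


print(call_integrant(True, True, False))    # consistent with presence
print(call_integrant(False, False, True))   # consistent with absence
print(call_integrant(True, False, True))    # inconsistent: no call

# Validation rate from the text: all but one of 78 traces confirmed.
validated, total = 77, 78
validation_rate = validated / total
print(f"{validation_rate:.0%}")
```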
Table 2.
(Top) Candidate L1 polymorphisms from chromosomes 10 and others were arbitrarily selected as described in the text for validation by PCR. (Bottom) Nine putative L1 integrants from a screen of fusion transcripts were identified in unassembled strains by chromosome walking. PCR reactions across left and right genomic junctions and empty target sites validated presence (blue, P) or absence (yellow, A) of individual integrants, as predicted by WGS trace alignments for four unassembled strains. In a few cases, no PCR product was obtained (white), suggesting additional genetic variation in a strain, or suboptimal PCR design. Trace ID or cDNA clone names, chromosomal coordinates, spanning gene names, and L1 subtypes from RepeatMasker classification and Cross_Match reclassification are indicated (see Supplemental Methods).
We wanted to determine whether more extensive genomic variation distinguishes other lineages. Therefore, the same L1 integrants were assayed by PCR in 16 additional mouse strains and related species that have been studied in large-scale SNP discovery and analysis projects (Table 2) (Wade and Daly 2005; Frazer et al. 2007; Yang et al. 2007). As expected, none of the 31 L1s assayed (0%) is present in SPRET/EiJ, since Mus spretus diverged from ancestors of the classical inbred strains approximately one million years ago, and our collection emphasized integrants known to be polymorphic among those laboratory strains. If we had assayed mostly nonpolymorphic L1s, presumably some would be present at conserved loci in Mus spretus. Only 2/28 (7%) each are present in CAST/EiJ (Mus castaneus) and MOLF/EiJ (Mus molossinus), respectively, and 1/30 (3%) is in PWD/PhJ. For comparison, the overall contribution from the genomes of these ancestral strains to classical inbred mouse strains has been estimated to be 3% from CAST/EiJ, 10% from MOLF/EiJ, and 6% from PWD/PhJ, suggesting that our collection approximates the genome-wide contributions of these ancestors estimated by SNP analysis (Frazer et al. 2007). However, in WSB/EiJ, a strain most closely related to Mus musculus domesticus (the common ancestor for a majority of classical mouse strain genomes) (Wade et al. 2002), only a minority (10 out of 29; 34%) of the assayed L1 integrants is present. This value deviates substantially from the expected contribution (68%) from Mus musculus domesticus to the classical inbred mouse strains (Frazer et al. 2007), but might be explained by the small sample size and nonrandom distribution of L1s assayed here (Table 2).
Although most of the integrants chosen for validation are polymorphic, three of the 31 validated integrants are nonpolymorphic in the five strains. Of these, none are fixed in all 21 lineages (Table 2). Several integrants are present only in a few strains, suggesting that they integrated very recently in evolutionary time, quite possibly within the past few hundred years or less. This relatively rapid rate of genomic change is comparable to that reported for copy number variants, which have emerged within several hundred generations of inbreeding of C57BL/6 substrains (Egan et al. 2007). While >19% of reference L1 elements are nonpolymorphic in the five strains, a substantially smaller fraction likely will be nonpolymorphic in all strains. These results are consistent with a recent analysis of SNPs in classical inbred mice, supporting their intrasubspecific origin (Yang et al. 2007). Additional WGS sequencing of divergent mouse species such as Mus spretus and Mus castaneus likely would identify fundamentally different patterns of transposon integrants and resulting differences in chromosome structures.
The chromosomal distributions of reference and polymorphic L1 retrotransposons were compared with genes and G/C-rich regions (Fig. 3). As expected, L1s are not uniformly distributed genome-wide, but tend to be located in gene-poor regions (Ostertag and Kazazian 2001). Strikingly, the mouse genome contains many more reference L1 elements than exons. Polymorphic L1s and exons contribute to similar extents (Fig. 3A). L1s are also enriched in A/T-rich genomic regions (Gasior et al. 2007). Variation in L1 polymorphism densities along chromosomes is not due simply to differences in WGS trace coverage (Supplemental Fig. 2; Supplemental Tables 2, 3). We cannot analyze the Y chromosome, since its coverage is minimal due to its composition of arrayed huge repeats.
Compared with autosomes, the X chromosome has a significantly higher density of reference L1s (Fig. 3A; Table 3) (P = 0), as expected (Ostertag and Kazazian 2001). Less purifying selection on the sex chromosomes would allow accumulation of deleterious L1s on chromosome X (Boissinot et al. 2001). Chromosome 11 contains a substantially lower density of reference L1s (Table 3A; P = 0).
Table 3.
(A) Chromosomal distribution. A total of 3,361,500 simulated “insertion events” were distributed randomly genome-wide, proportionally matching the relative lengths of chromosomes as expected. A total of 666,328 reference L1s were identified, of which 600,486 are on autosomes. A total of 6723 polymorphic L1s were found, of which 6484 are autosomal. Particularly significant enrichments or exclusions of reference or polymorphic L1 elements are highlighted (light gray). Numbers indicated for the Y chromosome are not reliable, due to its poor sequence coverage (dark gray). (B) Distribution within annotated genes. Simulation again had 3,361,500 “insertion events” distributed randomly genome-wide. The P-value for the comparison of reference L1s inside genes, vs. simulated events inside genes, is <1 × 10⁻¹⁰⁰. The P-value for polymorphic L1s inside genes vs. simulated events inside genes is 2.13 × 10⁻⁸³ (highlighted gray).
In contrast, there are many fewer L1 polymorphisms on the X chromosome and chromosome 10, and increased numbers of L1 polymorphisms on chromosomes 1 and 3. Out of 600,486 autosomal L1s, 6484 (1.08%) are polymorphic, while only 237 out of 65,038 L1s on the X chromosome (0.36%) are polymorphic (P = 1.47 × 10⁻²²) (Fig. 3A; Table 3A). The high density of L1s on the X chromosome, together with its paradoxical lack of L1 polymorphisms, could be due to prevention of or strong selection against new insertions, or selection for older ones. This apparent contradiction suggests that nonpolymorphic L1s may play an important biological role there, perhaps in X inactivation (Lyon 1998).
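The autosome versus X comparison above can be recomputed directly from the reported counts. The sketch below derives the two polymorphic fractions and their ratio, and adds a generic Pearson chi-square statistic for the 2 × 2 table. The statistical test behind the reported P-value is not specified in this section, so the chi-square here is an assumed stand-in and is not expected to reproduce that value.

```python
# Counts reported in the text.
autosomal_total, autosomal_poly = 600_486, 6_484
x_total, x_poly = 65_038, 237

auto_rate = autosomal_poly / autosomal_total
x_rate = x_poly / x_total
fold = auto_rate / x_rate  # how much more often autosomal L1s are polymorphic


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Rows: autosomes, X chromosome. Columns: polymorphic, not polymorphic.
stat = chi_square_2x2(
    autosomal_poly, autosomal_total - autosomal_poly,
    x_poly, x_total - x_poly,
)

print(f"autosomes: {auto_rate:.2%} polymorphic")
print(f"X chromosome: {x_rate:.2%} polymorphic")
print(f"fold difference: {fold:.2f}")
print(f"chi-square statistic (1 d.f.): {stat:.1f}")
```

The fractions match the 1.08% and 0.36% quoted above, a roughly threefold depletion of polymorphic L1s on the X chromosome relative to autosomes.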
We compared L1 variants and SNPs pairwise between the reference genome and A/J or DBA/2J, respectively. Such pairwise comparisons revealed that most polymorphic L1 integration sites coincide with SNP-dense regions (P < 1 × 10⁻¹⁰) (Fig. 3B; Supplemental Methods). A plausible explanation for this concordance between a large majority of L1 variants and SNP-dense regions is that most polymorphic transposon integration sites and flanking genomic sequences were coinherited from distant ancestors and then diverged with a subsequent accumulation of SNPs. Alternatively, these two forms of genomic variation might be expected to coincide in those chromosomal regions where such changes can be tolerated. While independent polymorphic L1s are substantially less numerous than SNPs (Frazer et al. 2007), they contain at least 1000-fold more nucleotides per variant (Fig. 3B).
Importantly, occasional L1 variants integrated into genomic regions without apparent SNPs, so-called “identical by descent” (IBD) regions (insets, Fig. 3B). However, such transposon integrants clearly have caused substantial local variation, despite lack of SNPs. Screening for polymorphic transposons might provide a powerful new way to genotype mouse strains and other mammalian species, particularly in IBD regions with few or no SNPs available (International HapMap Consortium 2005; Yang et al. 2007).
Several structural features of polymorphic L1s are consistent with their young evolutionary ages. In contrast with both reference and nonpolymorphic elements, polymorphic L1s have a bimodal length distribution with a significantly increased number of long, full-length elements (Fig. 4). They also more frequently have target-site duplications (TSDs) and poly(A) tails, and when present, their TSDs and poly(A) tails are significantly longer than those of reference or nonpolymorphic L1s (Supplemental Fig. 3). Polymorphic L1s also have a canonical target-site preference, lower nucleotide substitution rate, and more frequently are classified as young, active L1 subfamily members (Supplemental Table 6). These results strongly suggest that such genomic integrants are bona fide products of recent retrotransposition (Symer et al. 2002).
Three young L1 subfamilies are currently active in mouse; some members of these active subfamilies have caused murine diseases by insertional mutagenesis. Ranked by their occurrence in the reference genome, these are TF, A, and GF (Naas et al. 1998; Saxton and Martin 1998; Goodier et al. 2001; Ostertag and Kazazian 2001). Similarly, a majority (59%) of polymorphic L1s are products of retrotransposition by young, active donors, i.e., TF (28%), A (23%), and GF (8%) subfamily members (Supplemental Table 3).
These results collectively show that polymorphic L1s are substantially younger than other L1s in the mouse genome. However, L1 polymorphisms typically are localized in high-density SNP regions (Fig. 3B), suggesting their localization and coinheritance within divergent ancestral blocks (Wade et al. 2002). Clearly, determination of the ages and evolutionary relationships of individual transposon integrants and other genomic variants along chromosomes in different strains will require further investigation.
Transcriptional variation from L1 retrotransposition
Multiple forms of transcriptional variation have been linked previously with transposons, which may contribute cryptic or alternative promoters, terminators, and/or splice sites, affect RNA polymerase processivity, trigger altered chromatin conformations, mediate homologous recombination, and/or template small RNA expression (Ostertag and Kazazian 2001; Speek 2001; Wheelan et al. 2005; Belancio et al. 2006; Yang and Kazazian 2006). However, the extent of transcriptional variation due to endogenous transposition is not known.
Just over half (53%) of both nonpolymorphic and polymorphic L1s are located within 100 kb of annotated RefSeq genes. Approximately 20% of both reference L1s and L1 variants occur inside transcription units, representing a significant bias against L1 integrants within genes, since 28%–30% of the mouse genome consists of annotated RefSeq genes including introns (An et al. 2006) (Table 3B). Presumably, this relative exclusion of L1 elements from genes reflects selection against them or, less likely, their nonrandom integration into intergenic regions.
Of the nonpolymorphic L1s within introns, ∼68% are oriented antisense to the ORF (Supplemental Table 7). A smaller majority (58%) of polymorphic L1s are antisense within genes. An antisense orientation bias also was observed for de novo L1 integrants within genes in cultured human cells (Symer et al. 2002). In contrast, both nonpolymorphic and polymorphic L1s within an interval of 100 kb upstream or downstream of genes occur in both orientations (Supplemental Table 7), suggesting a neutral orientation preference during retrotransposon integration per se, as expected (Gilbert et al. 2005). Presumably the observed orientation bias within genes is due to positive selection upon antisense elements or negative selection upon sense integrants (Boissinot et al. 2001). The smaller majority of antisense polymorphic L1s within genes may reflect selection over a shorter period of time upon these evolutionarily younger integrants.
To find L1s associated with transcriptional variation in mouse strains, we screened pooled testis cDNA libraries for fragments of L1 TF sequences. This approach allowed us to discover a new antisense promoter active within many full-length, young L1s (J. Li, M. Kannan, and D.E. Symer, in prep.). In an initial survey, a diverse collection of spliced, polyadenylated L1-gene fusion cDNAs, initiated by L1 elements in various gene introns or in intergenic regions, was identified (Supplemental Table 8). Their corresponding antisense L1 templates are polymorphic, but nonpolymorphic elements also can be expressed (J. Li, M. Kannan, K. Akagi, and D.E. Symer, in prep.). Approximately half are present in the C57 genome, while others are absent (Table 2; Supplemental Table 8). The latter putative L1 integrants were identified in other strains’ genomic DNA by chromosome walking from expressed exons into adjacent introns. Each unknown L1 integrant’s genomic flanks were sequenced, revealing canonical TSDs and a poly(A) tail. Once identified, the presence or absence of each L1 template was determined by PCR in all 21 lineages. In one case, a polymorphic L1 is present exclusively in the A/J lineage and in none of the others, suggesting that it integrated very recently (Table 2).
To verify that fusion transcripts are present exclusively in strains containing a putative genomic L1 template, we analyzed total RNAs isolated from adult male testes from the five strains. For example, fusion transcripts of L1-Drosha, L1-Parp8, and an L1-novel gene were identified only in strains with relevant antisense L1 polymorphisms present (Fig. 5). Similarly, other fusion transcripts were detected only in strains with corresponding L1 templates, including a chimeric transcript from the L1-Arhgap15 locus (Table 2; Supplemental Fig. 2; Supplemental Table 8).
Fusion L1 transcripts are exemplified by the L1-Drosha fusion transcript, which is expressed at ∼30% of the level of native Drosha (also called Rnasen) in testis (Fig. 5A). This transcript contains both translation start and splice donor sites from L1, and is spliced in-frame with downstream exons encoding catalytic domains of Drosha, an RNaseIII gene centrally involved in microRNA biosynthesis (Murchison and Hannon 2004). Similarly, an L1-Parp8 fusion transcript also is predicted to be in-frame, and its ORF contains most functional domains of Parp8 (Fig. 5B). As a control, an assay for read-through transcripts for the canonical genes, from which L1 polymorphisms are spliced out with usual introns, showed comparable expression levels. Remarkably, a novel, spliced transcript 1ASII-1 is promoted by a polymorphic L1 (Fig. 5C) in a genomic region where no cDNA or expressed sequence tag (EST) had been reported previously.
No appreciable fusion L1-Drosha transcript was identified by reverse transcriptase-mediated (RT–) PCR in nongonadal tissues (Fig. 5A). In contrast, the novel fusion transcript 1ASII-1 was detected both in testis and 11 d-embryo tissues (Fig. 5C). We speculate that mechanisms such as transcriptional or post-transcriptional gene silencing, position effects, and/or availability of tissue-specific transcription factors may contribute to variable expression and control of particular transposon integrants in different developmental states (Whitelaw and Martin 2001). These and other fusion transcripts may encode protein variants or noncoding RNAs with regulatory or other functions.
We asked what proportion of endogenous L1 variants might contribute to transcriptional variation in the strains. Therefore, we screened adult testis total RNA samples for more L1 fusion transcripts. Out of 205 full-length, antisense L1 polymorphisms predicted inside RefSeq genes in the C57 genome, an arbitrary sample of 68 was screened. Of these, 13 (19%) drive fusion L1-gene transcripts, including 40% of the TF polymorphisms tested (Supplemental Table 9) (J. Li, M. Kannan, K. Akagi, and D.E. Symer, in prep.). Additionally, fusion L1-Arhgap15 transcription was identified in another screen (Table 2; Supplemental Table 9; Supplemental Fig. 2b). Notably, two distinct intronic L1 polymorphisms occur in Grid2 in different strains, but only one drives expression of a fusion L1-Grid2 transcript, while the other does not (Supplemental Table 9). Thus, we speculate that both polymorphic and nonpolymorphic L1s may initiate additional transcripts in testes or other tissues, developmental stages, and/or disease states such as cancers.
Another way by which L1 variants can affect tissue-specific gene structure and expression (Fig. 6) is illustrated by the rd7 mouse model of retinal degeneration (Chen et al. 2006). A de novo insertion of a full-length antisense L1 into exon 5 of Nr2e3 disrupts that gene’s normal transcription and splicing. Its donor itself is polymorphic, present only in C57, NZB/BinJ, and AKR/J out of the 21 strains tested (Table 2), thereby providing the first example of a “hot” endogenous mouse L1 that actively retrotransposed from its chromosomal location (Brouha et al. 2003). Thus, other full-length, polymorphic L1s also may be highly active donors in vivo.
Ontology analysis (Mi et al. 2005) of annotated genes containing L1 polymorphisms showed a significant exclusion from certain categories of genes, including genes associated with cell cycle, nucleic acid metabolism, and oncogenesis (Table 4; Supplemental Table 10). In contrast, L1 polymorphisms are significantly enriched in the receptors category of molecular functions, suggesting that these genes generally may tolerate added structural or transcriptional variability mediated by transposon integration events. Nonpolymorphic L1s and reference L1s were enriched significantly in brain-associated genes along with other ontological categories (Supplemental Table 10). A recent high-resolution analysis of copy number variation between mouse strains revealed that these structural variants also are excluded from similar groups of mouse genes required in fundamental cellular processes, e.g., those involved in cell cycle and nucleic acid metabolism (Cutler et al. 2007).
Table 4.
Annotated genes containing 1327 distinct intronic L1 polymorphisms were identified. They were assigned to top-level ontological categories (including biological processes and molecular functions) using Gene Ontology (GO) Panther software. Because many genes are included in more than one ontological category, a total of 2184 assignments were made for these L1 polymorphisms. Only significant differences in ontological categories are listed. Additional information about nonpolymorphic and reference L1 elements is presented in Supplemental Table 10. (A,B) In silico simulations resulted in 2,045,793 “integrants” that are distributed randomly across the reference mouse genome, within annotated genes. As expected, they are distributed proportionally according to gene and chromosome lengths. Their annotated biological processes (A) and molecular functions (B) were determined. The frequency of integrants within each category was calculated as the ratio of the count of integrants divided by the total number of integrants (1327 polymorphic L1s or 2,045,793 simulated integrants, respectively). Because more than one ontological category can be assigned to a given gene, the sum of these frequencies for all top-level ontological categories exceeds 100%. P-values were calculated using the binomial statistic and are adjusted based upon the Bonferroni correction. Only statistically significant differences in ontological categories (corrected P-values < 0.01) are listed here. (C,D) Since reference L1s are nonrandomly distributed in the genome, and as they comprised the basis for identification of most polymorphic L1s described here, we compared the ontological categories of polymorphic L1 genes against reference L1 genes. Their annotated biological processes (C) and molecular functions (D) were determined.
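The binomial test with Bonferroni correction described in this legend can be sketched in plain Python. This is a generic illustration under stated assumptions, not the Panther implementation: the two-sided rule (summing all outcomes no more probable than the observed one), the category names, the observed counts, and the expected frequencies are all hypothetical. Only the total of 1327 polymorphic L1s and the corrected threshold of 0.01 come from the text.

```python
import math
from typing import Dict, List, Tuple


def binom_logpmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability mass function, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binom_two_sided_p(k: int, n: int, p: float) -> float:
    """Exact two-sided P-value: total probability of outcomes no more likely than k."""
    log_obs = binom_logpmf(k, n, p)
    total = 0.0
    for i in range(n + 1):
        log_i = binom_logpmf(i, n, p)
        if log_i <= log_obs + 1e-9:  # small tolerance for floating-point ties
            total += math.exp(log_i)
    return min(1.0, total)


def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply each P-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


N_POLYMORPHIC = 1327  # intronic L1 polymorphisms, from the legend

# Hypothetical categories: (observed count, expected frequency from simulation).
categories: Dict[str, Tuple[int, float]] = {
    "category A": (40, 0.06),
    "category B": (80, 0.06),
    "category C": (130, 0.06),
}

raw = [binom_two_sided_p(k, N_POLYMORPHIC, f) for k, f in categories.values()]
adjusted = dict(zip(categories, bonferroni(raw)))
significant = [name for name, p in adjusted.items() if p < 0.01]
print(adjusted)
print(significant)
```

In this toy input, the depleted and enriched categories pass the corrected threshold while the category observed near its expected frequency does not.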
Discussion
In this comprehensive study of intermediate-length structural variants that distinguish different inbred mouse strains, we found that the large majority were caused by endogenous retrotransposition, predominantly by L1 retrotransposons. Other classes of active retrotransposons, including LTR elements and SINEs, also have caused substantial variation between the strains (Fig. 2). These variants, which could become a useful adjunct to SNPs and STRs in genotyping studies, can be accessed in detail by using the mouse PolyBrowse website (R.M. Stephens, K. Akagi, J.R. Collins, B. Neelam, D. McCullough, N. Volfovsky, and D.E. Symer, in prep.). While we identified over 10,000 independent variants (Fig. 2), their total number does not remotely approach the 8.3 million SNPs identified to date in 16 classical and wild strains (Frazer et al. 2007). Nevertheless, summation of their cumulative lengths (Fig. 2) strongly suggests that these variants have altered millions of nucleotides genome-wide, affecting the structures of perhaps hundreds of genes. Recently, a similar scope of structural variation has been attributed to copy-number variation between mouse strains (Cutler et al. 2007).
The extent of recent endogenous transposition in causing structural variation between mouse strains also appears to be substantially larger than that in humans, where nonhomologous end joining appears to have been a predominant mechanism for generating variation (Mills et al. 2006; Korbel et al. 2007; Levy et al. 2007). The reasons for this striking difference are unclear, since human L1 retrotransposons (which mobilize LINEs, SINEs, and SVA elements) paradoxically are more active than mouse L1s in tissue culture assays (Han and Boeke 2004). Moreover, their overall content in the human genome exceeds that in mouse (Lander et al. 2001; Waterston et al. 2002). Determination and comparison of the rates of structural variation by endogenous retrotransposition and by other mechanisms (Egan et al. 2007; Korbel et al. 2007) in mouse, man, and other species will require additional study.
In this study, we used GMAP (Wu and Watanabe 2005) in a new way to align individual sequence traces to the C57 reference genome assembly (Fig. 1; Supplemental Fig. 1). It is important to note that this alignment procedure, while fast and accurate, is also very stringent, so many additional polymorphisms are likely to remain uncounted. For example, variants in genomic regions with low sequence trace coverage were not counted here. If by chance no single sequence trace spanned a variant substantially on both sides, that variant would not be counted. Moreover, polymorphisms that are present in an unassembled genome but absent from the C57 reference genome were not fully identified here. In an effort to describe the complete extent of variants existing between strains, we currently are comparing classes of variants that can be identified by different methods including mate pair alignments (Dew et al. 2005), and documenting many more novel variants present in strains with unassembled genomes.
The genomes of more distantly related mouse species such as Mus spretus are likely to be even more distinct from the classical strains analyzed here, due in large part to consequences of active endogenous transposition. As shown in Table 2, not a single one of the arbitrary, polymorphic L1 retrotransposons that we assayed is present in the Mus spretus genomic DNA, suggesting that a major component of its genomic architecture (likely corresponding to many thousands of elements, on average ∼1 kb long) is fundamentally different from that in its relatives. It is possible that such noncoding genomic compartments, outside of conserved exons, have been shaped differentially by endogenous transposition, but might contribute nevertheless to important biological differences between species, since their coding exons are expected to be extremely similar.
A substantial fraction of L1 variants directly affect neighboring gene expression and structures in a range of tissues, possibly contributing to functional differences between strains (Muotri et al. 2005). However, we presume that a majority of both polymorphic and nonpolymorphic L1s still do not significantly affect expression of overlapping or nearby genes in most tissues (Supplemental Table 9), as we do not anticipate large differences between strains in the structure or expression of most genes. We cannot exclude the possibility that polymorphic transposons, in many cases, may cause subtle differences in the expression and structures of many genes (Han et al. 2004). It will be of great interest to compare transcriptomes in various mouse species with very distinctive genome structures, for example, using gene expression microarrays or ultra-high-throughput sequencing to elucidate the relationship between structural variation and transcriptional variation more fully (Stranger et al. 2007).
Many of the novel fusion L1 transcripts that we identified reflect altered gene structures. For example, the L1-Drosha and L1-Parp8 fusion transcripts (Fig. 5A,B; Supplemental Table 8) are predicted to encode many of the catalytic domains of the native gene products together with short domains from the antisense L1 elements. Others, such as the novel spliced transcript 1ASII-1 (Fig. 5C), also demonstrate that transcription levels can be altered dramatically at a genomic locus previously thought to be devoid of exons. As the biological significance of such fusion transcripts remains unclear, we currently are evaluating whether such transcripts, initiated by certain polymorphic transposons, could rescue upstream promoter traps or affect tissue-specific gene expression levels. At least some of the variant fusion transcripts resulting directly from L1 retrotransposon polymorphisms may be noncoding RNAs with possible regulatory roles.
It is entirely possible that other structural variants, including those caused by other classes of retrotransposon polymorphisms (Fig. 2), may exert even larger effects upon transcriptional variation. For example, LTR retrotransposons may contain stronger promoters active in additional tissues and in other genomic contexts (Horie et al. 2007). Thus, the functional consequences of transposon-mediated genomic variation upon transcripts may be variable themselves (Han et al. 2004). Variable transcription or added regulation mediated by polymorphic transposon promoters could provide a selective advantage that helps explain how mammalian hosts tolerate huge numbers of transposons in their genomes, despite the negative burden that their dispersal and maintenance engender (Yoder et al. 1997; Boissinot et al. 2001; Bestor 2003; Han et al. 2004).
The generation of diversity between and within very recently separated mouse lineages by active mobilization of L1 retrotransposons emphasizes in detail that these elements are a built-in, active, dynamic engine for evolutionary changes—driving genetic variation and providing a substrate for natural selection—that operates even now (Kazazian 2004). As we documented here, the resulting changes caused by endogenous transposons are not merely structural, genomic variants: They can bring about direct changes in expressed transcripts, and quite likely, phenotypic variation as well.
Methods
Identification of mouse genomic sequence variants
Approximately 26 million sequence traces (∼18 billion nucleotides) from four inbred mouse strains (A/J, DBA/2J, 129S1/SvImJ, and 129X1/SvJ) were downloaded from the tracedb archive, National Center for Biotechnology Information (NCBI, NIH). Only high-quality (>300 nt with phred score >Q20) sequence traces were included, thereby excluding a very small percentage of traces. GMAP was used to align each individual trace to the C57 genome assembly (Wu and Watanabe 2005; R.M. Stephens, K. Akagi, J.R. Collins, B. Neelam, D. McCullough, N. Volfovsky, and D.E. Symer, in prep.). Possible alignment categories included no best alignment, polymorphism in C57, polymorphism in strain X, almost perfect alignment, and others (Fig. 1; Supplemental Fig. 1). Candidate indels’ boundaries were determined by merging traces.
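The quality filter described above can be illustrated with a short sketch. This is not the authors' pipeline: it assumes one plausible reading of the criterion, namely that a trace is kept when it has more than 300 bases whose phred score exceeds 20. The function names and the representation of a trace as a list of per-base phred scores are hypothetical.

```python
from typing import List


def count_high_quality_bases(phred_scores: List[int], min_q: int = 20) -> int:
    """Count the bases whose phred score is strictly greater than min_q."""
    return sum(1 for q in phred_scores if q > min_q)


def passes_quality_filter(phred_scores: List[int],
                          min_len: int = 300,
                          min_q: int = 20) -> bool:
    """Keep a trace only if it has more than min_len bases above min_q.

    Assumption: the '>300 nt with phred score >Q20' criterion is read as a
    count of high-quality bases, not as a contiguous high-quality window.
    """
    return count_high_quality_bases(phred_scores, min_q) > min_len


if __name__ == "__main__":
    good_trace = [35] * 450 + [12] * 50    # 450 high-quality bases
    poor_trace = [35] * 200 + [12] * 300   # only 200 high-quality bases
    print(passes_quality_filter(good_trace))
    print(passes_quality_filter(poor_trace))
```

A contiguous-window interpretation would need a different counting function; the thresholds are exposed as parameters so either reading can be tested.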
Databases/public graphical display browser
PolyBrowse, a query tool and graphical browser at http://polybrowse.abcc.ncifcrf.gov/ based on GBrowse (Stein et al. 2002), was developed to display all indels described here together with other available genomic variants and annotated features (R.M. Stephens, K. Akagi, J.R. Collins, B. Neelam, D. McCullough, N. Volfovsky, and D.E. Symer, in prep.). C57 reference genomic data (Feb. 2006 release) were downloaded from the UCSC website, http://genome.ucsc.edu/. Protein domains were predicted using the SMART database, http://smart.embl-heidelberg.de/ (Letunic et al. 2006).
Bioinformatic identification of polymorphic transposons
Procedures are described in the Supplementary Methods.
Mouse tissues’ total RNA isolation
Total RNA was isolated from grossly dissected adult testes (fasted, 72–75-d-old males, harvested at the same time of day), frozen in RNALater (Ambion), and homogenized in TRIzol (Invitrogen) following standard protocols.
Validation of genomic polymorphisms
Genomic DNA from C57, 129S1, 129X1, A/J, and DBA/2J mice was purchased from The Jackson Laboratory. A locus-specific PCR amplicon was designed across the empty target site of each polymorphic repetitive element (Table 2; Supplemental Table 5). Occasionally, the same PCR reaction detected smaller integrants (<500 nt), while the left and/or right junctions of larger integrants were assayed using unique locus-specific primers in flanking genomic sequences paired with primers within the repetitive element (sequences available upon request). PCR products were assessed by agarose gel electrophoresis using standard methods.
Identification of L1 fusion transcripts
Screens of commercial phage libraries and online EST libraries were performed as described in the Supplementary Methods.
cDNA sequencing
Synthesis of cDNAs was performed using SuperScript II (Invitrogen) with oligo-dT and gene-specific primers. Sequencing was performed as described in the Supplementary Methods.
Correlation with SNPs
SNP reference genome coordinates were downloaded from NIEHS Perlegen and Celera databases (stored at tracedb, NCBI website) and compared with polymorphic transposon coordinates as described (R.M. Stephens, K. Akagi, J.R. Collins, B. Neelam, D. McCullough, N. Volfovsky, and D.E. Symer, in prep.).
Simulations and ontology analysis
To test various hypotheses about the genome-wide distribution of the 6723 independent polymorphic L1s identified here, we generated lists of simulated integrants using a random number generator to assign chromosomal coordinates. To approximate genomic or intragenic distributions, 6723 integrant locations were simulated 500 times, resulting in 3,361,500 simulated L1 insertions. Intronic integrants were identified by comparison with a database of RefSeq genes (NCBI). P-values were calculated using the binomial statistic and were adjusted by applying the Bonferroni correction (SPSS software) (Slonim 2002).
To sample gene categories randomly for ontology analysis, based on their relative lengths, the simulation was performed 1000 times, resulting in 6,723,000 simulated integrants.
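The simulation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `simulate_integrants`, the toy chromosome lengths, and the choice to draw one uniform genome-wide offset per integrant (so that each chromosome receives integrants in proportion to its length) are all hypothetical.

```python
import random
from bisect import bisect_right
from typing import Dict, List, Tuple


def simulate_integrants(chrom_lengths: Dict[str, int],
                        n_integrants: int,
                        n_rounds: int,
                        seed: int = 0) -> List[Tuple[str, int]]:
    """Draw n_integrants * n_rounds random (chromosome, position) pairs.

    Each draw picks a uniform 0-based offset across the concatenated genome,
    then maps it back to a chromosome and a position on that chromosome.
    """
    rng = random.Random(seed)
    names = list(chrom_lengths)
    cumulative_ends = []          # running total of chromosome lengths
    total = 0
    for name in names:
        total += chrom_lengths[name]
        cumulative_ends.append(total)

    integrants = []
    for _ in range(n_integrants * n_rounds):
        offset = rng.randrange(total)
        idx = bisect_right(cumulative_ends, offset)
        chrom_start = cumulative_ends[idx - 1] if idx > 0 else 0
        integrants.append((names[idx], offset - chrom_start))
    return integrants


if __name__ == "__main__":
    # Toy lengths for illustration only; real chromosome sizes would be used.
    toy_genome = {"chrA": 1_000, "chrB": 3_000}
    sims = simulate_integrants(toy_genome, n_integrants=50, n_rounds=4)
    print(len(sims))
    # Totals stated in the text follow from simple multiplication.
    print(6723 * 500, 6723 * 1000)
```

Intronic integrants would then be identified by intersecting these coordinates with gene annotations, a step omitted here.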
To investigate whether genes are involved in a biological process affected by polymorphisms, we used the GeneID associated with each accession to query the PANTHER database (Mi et al. 2005) at http://www.pantherdb.org. Simulated integrants or reference L1s were used alternatively as reference groups, as indicated. Biological process or molecular function categories were deemed significant if, upon applying the Bonferroni correction, their P-values were <0.01 as determined by the binomial statistic (Mi et al. 2005).
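The category test described above can be illustrated with a short sketch. It is not the PANTHER implementation: it assumes a one-sided binomial test in the direction of the observed deviation, with the expected frequency taken from the reference group and a Bonferroni adjustment by the number of categories tested. The function names and the example counts are hypothetical.

```python
import math
from typing import Tuple


def binom_log_pmf(k: int, n: int, p: float) -> float:
    """Log of P(X = k) for X ~ Binomial(n, p), computed in log space."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def binom_tail(k: int, n: int, p: float, upper: bool) -> float:
    """P(X >= k) if upper is True, otherwise P(X <= k)."""
    ks = range(k, n + 1) if upper else range(0, k + 1)
    return min(1.0, sum(math.exp(binom_log_pmf(i, n, p)) for i in ks))


def category_test(observed_count: int, n_observed: int,
                  ref_count: int, n_ref: int,
                  n_categories: int) -> Tuple[float, float]:
    """Return (raw P, Bonferroni-adjusted P) for one ontological category.

    The expected frequency is ref_count / n_ref. The upper tail is used when
    the category is over-represented, the lower tail otherwise (assumption).
    """
    p_expected = ref_count / n_ref
    over_represented = observed_count >= p_expected * n_observed
    p_raw = binom_tail(observed_count, n_observed, p_expected, over_represented)
    p_adjusted = min(1.0, p_raw * n_categories)
    return p_raw, p_adjusted


if __name__ == "__main__":
    # Hypothetical counts: 0 of 10 observed vs. 1 of 2 in the reference group.
    raw, adjusted = category_test(0, 10, 1, 2, n_categories=5)
    print(raw, adjusted, adjusted < 0.01)
```

Multiplying by the number of categories and capping at 1 is the standard form of the Bonferroni correction; a category would be reported only if its adjusted P-value fell below 0.01.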
Acknowledgments
We thank Drs. Maxine Singer, Michael Kuehn, Beverly Mock, Maura Gillison, and Berton Zbar for helpful comments on drafts of this manuscript, and members of the Symer lab for constructive discussions. This research was supported by the Intramural Research Program of the Center for Cancer Research, National Cancer Institute, NIH, and in part was funded by NCI contract N01-CO-12400. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. NCI-Frederick is accredited by Association for Assessment and Accreditation of Laboratory Animal Care International and follows the U.S. Public Health Service Policy for the Care and Use of Laboratory Animals. Animal care was provided in accordance with the procedures outlined in the “Guide for Care and Use of Laboratory Animals” (National Research Council, 1996, National Academy Press, Washington, D.C.). Mouse studies were performed following a protocol approved by the Animal Care and Use Committee, NCI-Frederick.
Footnotes
[Supplemental material is available online at www.genome.org. Novel L1 fusion transcript and genomic integrant sequences have been submitted to GenBank under accession nos. EF591871–EF591883.]
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.075770.107.
References
- An W., Han J.S., Wheelan S.J., Davis E.S., Coombes C.E., Ye P., Triplett C., Boeke J.D. Active retrotransposition by a synthetic L1 element in mice. Proc. Natl. Acad. Sci. 2006;103:18662–18667. doi: 10.1073/pnas.0605300103.
- Beck J.A., Lloyd S., Hafezparast M., Lennon-Pierce M., Eppig J.T., Festing M.F., Fisher E.M. Genealogies of mouse inbred strains. Nat. Genet. 2000;24:23–25. doi: 10.1038/71641.
- Belancio V.P., Hedges D.J., Deininger P. LINE-1 RNA splicing and influences on mammalian gene expression. Nucleic Acids Res. 2006;34:1512–1521. doi: 10.1093/nar/gkl027.
- Bestor T.H. Cytosine methylation mediates sexual conflict. Trends Genet. 2003;19:185–190. doi: 10.1016/S0168-9525(03)00049-0.
- Boissinot S., Entezam A., Furano A.V. Selection against deleterious LINE-1-containing loci in the human lineage. Mol. Biol. Evol. 2001;18:926–935. doi: 10.1093/oxfordjournals.molbev.a003893.
- Brouha B., Schustak J., Badge R.M., Lutz-Prigge S., Farley A.H., Moran J.V., Kazazian H.H. Hot L1s account for the bulk of retrotransposition in the human population. Proc. Natl. Acad. Sci. 2003;100:5280–5285. doi: 10.1073/pnas.0831042100.
- Chen J.M., Stenson P.D., Cooper D.N., Ferec C. A systematic analysis of LINE-1 endonuclease-dependent retrotranspositional events causing human genetic disease. Hum. Genet. 2005;117:411–427. doi: 10.1007/s00439-005-1321-0.
- Chen J., Rattner A., Nathans J. Effects of L1 retrotransposon insertion on transcript processing, localization and accumulation: Lessons from the retinal degeneration 7 mouse and implications for the genomic ecology of L1 elements. Hum. Mol. Genet. 2006;15:2146–2156. doi: 10.1093/hmg/ddl138.
- Conrad D.F., Jakobsson M., Coop G., Wen X., Wall J.D., Rosenberg N.A., Pritchard J.K. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat. Genet. 2006;38:1251–1260. doi: 10.1038/ng1911.
- Cutler G., Marshall L.A., Chin N., Baribault H., Kassner P.D. Significant gene content variation characterizes the genomes of inbred mouse strains. Genome Res. 2007;17:1743–1754. doi: 10.1101/gr.6754607.
- Dew I.M., Walenz B., Sutton G. A tool for analyzing mate pairs in assemblies (TAMPA). J. Comput. Biol. 2005;12:497–513. doi: 10.1089/cmb.2005.12.497.
- Egan C.M., Sridhar S., Wigler M., Hall I.M. Recurrent DNA copy number variation in the laboratory mouse. Nat. Genet. 2007;39:1384–1389. doi: 10.1038/ng.2007.19.
- Frazer K.A., Eskin E., Kang H.M., Bogue M.A., Hinds D.A., Beilharz E.J., Gupta R.V., Montgomery J., Morenzoni M.M., Nilsen G.B., et al. A sequence-based variation map of 8.27 million SNPs in inbred mouse strains. Nature. 2007;448:1050–1053. doi: 10.1038/nature06067.
- Gasior S.L., Preston G., Hedges D.J., Gilbert N., Moran J.V., Deininger P.L. Characterization of pre-insertion loci of de novo L1 insertions. Gene. 2007;390:190–198. doi: 10.1016/j.gene.2006.08.024.
- Gilbert N., Lutz S., Morrish T.A., Moran J.V. Multiple fates of L1 retrotransposition intermediates in cultured human cells. Mol. Cell. Biol. 2005;25:7780–7795. doi: 10.1128/MCB.25.17.7780-7795.2005.
- Goodier J.L., Ostertag E.M., Du K., Kazazian H.H. A novel active L1 retrotransposon subfamily in the mouse. Genome Res. 2001;11:1677–1685. doi: 10.1101/gr.198301.
- Han J.S., Boeke J.D. A highly active synthetic mammalian retrotransposon. Nature. 2004;429:314–318. doi: 10.1038/nature02535.
- Han J.S., Szak S.T., Boeke J.D. Transcriptional disruption by the L1 retrotransposon and implications for mammalian transcriptomes. Nature. 2004;429:268–274. doi: 10.1038/nature02536.
- Horie K., Saito E.S., Keng V.W., Ikeda R., Ishihara H., Takeda J. Retrotransposons influence the mouse transcriptome: Implication for the divergence of genetic traits. Genetics. 2007;176:815–827. doi: 10.1534/genetics.107.071647.
- International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–1320. doi: 10.1038/nature04226.
- Kazazian H.H. Mobile elements: Drivers of genome evolution. Science. 2004;303:1626–1632. doi: 10.1126/science.1089670.
- Korbel J.O., Urban A.E., Affourtit J.P., Godwin B., Grubert F., Simons J.F., Kim P.M., Palejev D., Carriero N.J., Du L., et al. Paired-end mapping reveals extensive structural variation in the human genome. Science. 2007;318:420–426. doi: 10.1126/science.1149504.
- Lander E.S., Linton L.M., Birren B., Nusbaum C., Zody M.C., Baldwin J., Devon K., Dewar K., Doyle M., FitzHugh W., et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
- Letunic I., Copley R.R., Pils B., Pinkert S., Schultz J., Bork P. SMART 5: Domains in the context of genomes and networks. Nucleic Acids Res. 2006;34:D257–D260. doi: 10.1093/nar/gkj079.
- Levy S., Sutton G., Ng P.C., Feuk L., Halpern A.L., Walenz B.P., Axelrod N., Huang J., Kirkness E.F., Denisov G., et al. The diploid genome sequence of an individual human. PLoS Biol. 2007;5:e254. doi: 10.1371/journal.pbio.0050254.
- Lyon M.F. X-chromosome inactivation: A repeat hypothesis. Cytogenet. Cell Genet. 1998;80:133–137. doi: 10.1159/000014969.
- Mi H., Lazareva-Ulitsky B., Loo R., Kejariwal A., Vandergriff J., Rabkin S., Guo N., Muruganujan A., Doremieux O., Campbell M.J., et al. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 2005;33:D284–D288. doi: 10.1093/nar/gki078.
- Mills R.E., Luttig C.T., Larkins C.E., Beauchamp A., Tsui C., Pittard W.S., Devine S.E. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res. 2006;16:1182–1190. doi: 10.1101/gr.4565806.
- Muotri A.R., Chu V.T., Marchetto M.C., Deng W., Moran J.V., Gage F.H. Somatic mosaicism in neuronal precursor cells mediated by L1 retrotransposition. Nature. 2005;435:903–910. doi: 10.1038/nature03663.
- Mural R.J., Adams M.D., Myers E.W., Smith H.O., Miklos G.L., Wides R., Halpern A., Li P.W., Sutton G.G., Nadeau J., et al. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science. 2002;296:1661–1671. doi: 10.1126/science.1069193.
- Murchison E.P., Hannon G.J. miRNAs on the move: miRNA biogenesis and the RNAi machinery. Curr. Opin. Cell Biol. 2004;16:223–229. doi: 10.1016/j.ceb.2004.04.003.
- Naas T.P., DeBerardinis R.J., Moran J.V., Ostertag E.M., Kingsmore S.F., Seldin M.F., Hayashizaki Y., Martin S.L., Kazazian H.H. An actively retrotransposing, novel subfamily of mouse L1 elements. EMBO J. 1998;17:590–597. doi: 10.1093/emboj/17.2.590.
- Ostertag E.M., Kazazian H.H. Biology of mammalian L1 retrotransposons. Annu. Rev. Genet. 2001;35:501–538. doi: 10.1146/annurev.genet.35.102401.091032.
- Roy-Engel A.M., El-Sawy M., Farooq L., Odom G.L., Perepelitsa-Belancio V., Bruch H., Oyeniran O.O., Deininger P.L. Human retroelements may introduce intragenic polyadenylation signals. Cytogenet. Genome Res. 2005;110:365–371. doi: 10.1159/000084968.
- Saxton J.A., Martin S.L. Recombination between subtypes creates a mosaic lineage of LINE-1 that is expressed and actively retrotransposing in the mouse genome. J. Mol. Biol. 1998;280:611–622. doi: 10.1006/jmbi.1998.1899.
- Slonim D.K. From patterns to pathways: Gene expression data analysis comes of age. Nat. Genet. 2002;32(Suppl):502–508. doi: 10.1038/ng1033.
- Speek M. Antisense promoter of human L1 retrotransposon drives transcription of adjacent cellular genes. Mol. Cell. Biol. 2001;21:1973–1985. doi: 10.1128/MCB.21.6.1973-1985.2001.
- Stein L.D., Mungall C., Shu S., Caudy M., Mangone M., Day A., Nickerson E., Stajich J.E., Harris T.W., Arva A., et al. The generic genome browser: A building block for a model organism system database. Genome Res. 2002;12:1599–1610. doi: 10.1101/gr.403602.
- Stranger B.E., Forrest M.S., Dunning M., Ingle C.E., Beazley C., Thorne N., Redon R., Bird C.P., de Grassi A., Lee C., et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007;315:848–853. doi: 10.1126/science.1136678.
- Symer D.E., Connelly C., Szak S.T., Caputo E.M., Cost G.J., Parmigiani G., Boeke J.D. Human L1 retrotransposition is associated with genetic instability in vivo. Cell. 2002;110:327–338. doi: 10.1016/s0092-8674(02)00839-5.
- Wade C.M., Daly M.J., Daly M.J. Genetic variation in laboratory mice. Nat. Genet. 2005;37:1175–1180. doi: 10.1038/ng1666. [DOI] [PubMed] [Google Scholar]
- Wade C.M., Kulbokas E.J., Kirby A.W., Zody M.C., Mullikin J.C., Lander E.S., Lindblad-Toh K., Daly M.J., Kulbokas E.J., Kirby A.W., Zody M.C., Mullikin J.C., Lander E.S., Lindblad-Toh K., Daly M.J., Kirby A.W., Zody M.C., Mullikin J.C., Lander E.S., Lindblad-Toh K., Daly M.J., Zody M.C., Mullikin J.C., Lander E.S., Lindblad-Toh K., Daly M.J., Mullikin J.C., Lander E.S., Lindblad-Toh K., Daly M.J., Lander E.S., Lindblad-Toh K., Daly M.J., Lindblad-Toh K., Daly M.J., Daly M.J. The mosaic structure of variation in the laboratory mouse genome. Nature. 2002;420:574–578. doi: 10.1038/nature01252. [DOI] [PubMed] [Google Scholar]
- Waterston R.H., Lindblad-Toh K., Birney E., Rogers J., Abril J.F., Agarwal P., Agarwala R., Ainscough R., Alexandersson M., An P., et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562. doi: 10.1038/nature01262.
- Wheelan S.J., Aizawa Y., Han J.S., Boeke J.D. Gene-breaking: A new paradigm for human retrotransposon-mediated gene evolution. Genome Res. 2005;15:1073–1078. doi: 10.1101/gr.3688905.
- Whitelaw E., Martin D.I. Retrotransposons as epigenetic mediators of phenotypic variation in mammals. Nat. Genet. 2001;27:361–365. doi: 10.1038/86850.
- Wu T.D., Watanabe C.K. GMAP: A genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics. 2005;21:1859–1875. doi: 10.1093/bioinformatics/bti310.
- Yang N., Kazazian H.H. L1 retrotransposition is suppressed by endogenously encoded small interfering RNAs in human cultured cells. Nat. Struct. Mol. Biol. 2006;13:763–771. doi: 10.1038/nsmb1141.
- Yang H., Bell T.A., Churchill G.A., de Pardo-Manuel Villena F. On the subspecific origin of the laboratory mouse. Nat. Genet. 2007;39:1100–1107. doi: 10.1038/ng2087.
- Yoder J.A., Walsh C.P., Bestor T.H. Cytosine methylation and the ecology of intragenomic parasites. Trends Genet. 1997;13:335–340. doi: 10.1016/s0168-9525(97)01181-5.