Abstract
We discovered two new complex elements while studying large genomic rearrangements and segmental duplications in the human genome. Both resemble bacterial composite DNA transposon Tn9, consisting of a core flanked by mobile elements, except that the flanking element is not a DNA transposon but instead is long terminal repeat retrotransposon-like with human endogenous retrovirus and satellite sequences. Based on the core size, we named them Xiao (~30 kb) and DA (~280 kb), meaning small and big, respectively, in Chinese. Xiao originated from a 19p region encoding olfactory receptor 7E members after the human/ape divergence from Old World monkeys, while DA likely evolved from a Xiao by inserting ~200 kb of chimeric sequence from 16p and 21q into the Xiao core, resulting in a target site duplication of 3.4 kb. DA/Xiao was identified in 30 loci on 12 chromosomes, and only DAs mediated intrachromosomal rearrangements, based on our reconstructed human–mouse–rat ancestral genome and the rhesus macaque genome.
Keywords: Composite LTR–retrotransposon, Retrotransposition, Target site duplication, Rearrangement, Segmental duplication, Recombination, DNA transposition, Genome evolution, Comparative genomics
Transposable elements (TEs) [1–6] are grouped into retrotransposons and DNA transposons, depending upon whether RNA intermediates are required for transposition. Simple DNA transposons such as insertion sequences (IS) encode a transposase and have an inverted repeat of usually 9–41 bp at each end [1,2]. Composite DNA transposons such as Tn10, Tn5, and Tn9, found in bacteria, contain a middle region (often encoding a drug resistance gene) flanked by two IS elements. Owing to the flanking IS elements, the middle region becomes mobile as well. DNA transposons jump around the genome through either “cut and paste” (nonreplicative) or “copy and paste” (replicative) mechanisms [1,2], causing breakages at both donor and acceptor sites. Retrotransposons require RNA intermediates for transposition and do not induce DNA breakage at the donor sites. Retrotransposons can be divided into two groups: one with a long terminal repeat (LTR) of a few hundred bases to over 1 kb in length at each end (e.g., retroviruses) and the other without LTRs (e.g., long interspersed nuclear elements or LINEs) [2]. Retroviruses and LINEs encode reverse transcriptases that are essential for their amplification in the genome. Insertion of TEs in most cases results in target site duplication (TSD) of usually 2–20 bp [1,2].
The human genome contains significantly more recognizable retrotransposons (nearly 50%) than DNA transposons (only 3%) [6–8]. The already identified retrotransposons include LTR elements such as human endogenous retroviruses (HERVs), as well as non-LTR elements such as LINEs (e.g., L1s) and short interspersed nuclear elements (SINEs) (e.g., Alus). Although hundreds of thousands of copies of individual TEs have been identified [7,8], only a small number of composite elements are reported for the human genome, including SVA, which contains SINE, VNTR (variable numbers of tandem repeats), and Alu elements [9]; the composite DNA transposon Ricksha; and Harlequin and HERV39, which have mostly simple repeats inserted into the relevant HERV elements.
Segmental duplications (SDs), another type of repeating DNA different from traditional TEs described above, make up ~5% of the human genome [10]. SDs are low-copy-number repeats (typically below 100 copies in a haploid genome), with a size of at least 1 kb and ≥90% sequence identity among the copies. SDs have been extensively studied in recent years; however, neither the origins of SDs nor their duplication mechanisms are known, except for models that propose replicative DNA transposition and Alu-mediated recombination [11,12]. No linkages between SDs and retrotransposons have ever been reported.
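The working definition of an SD given above can be written as a simple threshold test. The following is a minimal sketch only: the `AlignedPair` record and the function name are hypothetical, and only the two thresholds (at least 1 kb and at least 90% identity) come from the text.

```python
from dataclasses import dataclass


@dataclass
class AlignedPair:
    """Hypothetical record for one pairwise alignment between two genomic copies."""
    length_bp: int    # aligned length in base pairs
    identity: float   # fraction of identical aligned bases, 0.0 to 1.0


def is_segmental_duplication(pair, min_length_bp=1000, min_identity=0.90):
    """Return True if the aligned pair meets the SD thresholds quoted in the text."""
    return pair.length_bp >= min_length_bp and pair.identity >= min_identity


# Toy values chosen for illustration; they are not data from this study.
print(is_segmental_duplication(AlignedPair(30_000, 0.95)))  # long and similar
print(is_segmental_duplication(AlignedPair(800, 0.99)))     # too short
```

The copy-number criterion (typically below 100 copies per haploid genome) is not modeled here, since it is a property of the whole repeat family rather than of a single alignment.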
TEs and SDs are associated with genomic rearrangements (i.e., translocations, inversions, insertions, duplications, or deletions), facilitating the evolution of the genome but sometimes also contributing to disease development [5,13,14]. To understand TEs, SDs, and genomic rearrangements better, we reconstructed an ancestral genome for human, mouse, and rat by using dog as the outgroup, which allowed the identification of rearrangements occurring in each lineage since its divergence from the ancestor. We found that inversions are the dominant event during human genome evolution, and SDs were identified in 90% of the inversion breakpoints. We characterized a genome-wide SD and discovered two new composite LTR retrotransposon-like elements in the human genome, named DA and Xiao, which have an architecture resembling the bacterial composite DNA transposon Tn9, except that the flanking element is a modified HERV. We explored the role that Xiao/DA has played in reshaping the genome. We also discussed observations that argue for and against the possibilities of Xiao and DA being composite LTR–retrotransposons and of retrotransposition being the mechanism by which a genome-wide SD might have arisen and dispersed throughout the genome.
Results
Human–mouse–rat ancestral genome
The human–mouse–rat ancestral genome was reconstructed by comparing the human, mouse, and rat genomes with the dog genome as the outgroup (see Materials and methods). The architecture of the ancestral genome is shown in Fig. 1, using the current human genomic sequence coordinates, and consists of a total of 154 human sequence blocks of which 132 have their ancestral genomic location unambiguously assigned. This reconstruction largely agrees with a previously reconstructed ancestral genome [15] except for a few discrepancies (Supplementary Table s1) that are likely due to the use of different outgroups (i.e., dog vs chicken). Many of the p-arm inversions and interchromosomal changes shown in Fig. 1 have been confirmed by other studies ([16,17] for instance), indicating the accuracy of our ancestral genome.
Fig. 1.
The human–mouse–rat ancestral genome (excluding the Y chromosome) presented using the current human genomic sequence blocks. The ancestral genome contains 26 chromosomes, grouped based on the rearrangements occurring while the ancestor evolved to the current human. Group A has undergone intrachromosomal changes including p-arm inversion (A1) and other inversions (A2). Group B has undergone interchromosomal changes, including fissions (B1), fusions (B2), and likely a translocation (B3). Group C has no large (≥100 kb) rearrangements found in it. Each box represents a maximum unrearranged human sequence block as labeled on top (e.g., 9q:1 stands for the first block of the current 9q), with the current sequence orientation indicated by an interior diagonal line (see Supplementary Fig. s2 for actual sequence coordinates). Empty and shaded boxes distinguish different current chromosomes. Solid lines between boxes indicate that the joining of the two fragments is confirmed by using the dog outgroup, whereas dashed lines indicate that additional outgroups are needed to confirm the joining.
Intrachromosomal rearrangements predominate in the human lineage
Using the most parsimonious rule, approximately 8 interchromosomal events (1 translocation, 2 fissions, and 5 fusions) and 82 intrachromosomal events (i.e., inversions) are required to transform the ancestral genome into the current human genome (see Supplementary Materials I for detailed transformation), indicating that intrachromosomal changes are the major driving force for the evolution of the human genome. The mouse and rat genomes are different, as interchromosomal changes are the dominant events, demonstrated by significantly more interchromosomal breakpoints (Table 1) and rearrangements.
Table 1.
Species-specificᵃ breakpoints in human, mouse, and rat since each diverged from their common ancestor
Species | Interchromosomal | Intrachromosomal | Total |
---|---|---|---|
Human | 10 | 89 | 99 |
Mouse | 129 | 83 | 212 |
Rat | 113 | 83 | 196 |
ᵃ A total of 35 breakpoints are shared among human, mouse, and rat, compared to dog.
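The contrast between lineages in Table 1 is easier to see as proportions. The short sketch below recomputes the totals and the interchromosomal share of breakpoints from the counts in Table 1; the dictionary layout and function name are illustrative assumptions.

```python
# Breakpoint counts copied from Table 1: (interchromosomal, intrachromosomal).
breakpoints = {
    "Human": (10, 89),
    "Mouse": (129, 83),
    "Rat": (113, 83),
}


def interchromosomal_fraction(inter, intra):
    """Fraction of species-specific breakpoints that are interchromosomal."""
    return inter / (inter + intra)


for species, (inter, intra) in breakpoints.items():
    share = interchromosomal_fraction(inter, intra)
    print(f"{species}: total={inter + intra}, interchromosomal={share:.0%}")
```

By this arithmetic roughly 10% of human-specific breakpoints are interchromosomal, compared with roughly 61% in mouse and 58% in rat.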
Segmental duplications are enriched in human-specific inversion breakpoint regions
Over 90% of human-specific inversion breakpoint regions were found to harbor SDs, consistent with a previous study reporting that SDs populate many primate-specific breakpoints [18]. Chromosome-specific SDs were found in extensively rearranged chromosomes, such as 9, 17, 10, 15, 22, 7, and 16 (Fig. 1), some of which have been studied ([19,20] for example). Here we focus on a large genome-wide SD and report the discovery of two new complex genetic elements.
DA and Xiao have a giant and composite LTR–retrotransposon-like architecture
We discovered that a genome-wide SD has generated complete and partial duplicon copies scattered over 30 loci on half of the human chromosomes, with size ranging from 3 kb to over 400 kb (Fig. 2). Complete duplicons were found to have an architecture resembling bacterial composite transposon Tn9, albeit with a much larger size, consisting of a middle region (the core) flanked by two elements that are direct repeats of each other (thus LTR-like). Based on the size of the core, we grouped the duplicons into two types: one with a core of ~10 kb, named Xiao (meaning small in Chinese), and the other with a core of >200 kb, named DA (meaning big in Chinese).
Fig. 2.
The structure and distribution of Xiao and DA and DA-mediated inversions. (A) DA and Xiao resemble Tn9 (top; IS1 is a DNA transposon). Xiao (the sound of the Chinese character shown, meaning small) has a core with repeats and OR7E pseudogenes (enclosed area) flanked by a modified HERV-E. DA (the sound of the Chinese character meaning big) appears to have evolved from a Xiao through insertion of directly fused sequences (overlapping bases indicated, the vertical line representing TEs inserted after the fusion, see text) from 16p (red bars, the inside arrow indicating the original sequence orientation) and 21q (black bar) into the target site (green-shaded boxes). Black vertical bars represent HERVs subsequently inserted. (B) Xiao loci (green vertical lines) do not colocalize with human breakpoints (black vertical lines inside chromosomes). However, except for the 11q 67-Mb DA, all DA loci (red vertical lines) colocalize with human-specific inversion breakpoints. Pink lines represent the donor loci for the DA and Xiao cores. See Supplementary Table s2 for the exact sequence coordinates. (C) DA-mediated inversions occur via: (1) fission of individual DA (inducing 6.9/97-Mb fission of Chromosome 7 and 131.4/15/131.1-Mb fissions of Chromosome 3, blue regions in B) and (2) homologous recombination between two DAs (purple regions in B).
A complete Xiao (Fig. 2A) is ~30 kb, with a 10-kb core flanked by a second 10-kb sequence on each side. The core harbors two olfactory receptor subfamily 7E (OR7E) pseudogenes [21] and terminates at both ends with satellite DNA SATR2, which is up to a few kilobases long and in the same orientation. Thus, the core itself has an LTR-element-like architecture. The 10-kb sequences flanking the core contain a partial HERV-E of 4–4.5 kb (having no or up to 200 bp of left LTR and partial gag–pol sequences) surrounded by satellites SATR1 (and sometimes SATR2 as well) of a few kilobases and in the same orientation, thus resembling a HERV-E that has satellites replacing the regular LTR (i.e., with a “SATR–partial (LTR?) HERV-E core–SATR” architecture).
We do not know whether this SATR–partial (LTR?) HERV-E core–SATR is a mobile element. In addition to DA/Xiao-containing genomic loci, we also found nine other loci (see Supplementary Table s3 for their exact genomic locations) where internal HERV-E sequences are closely associated with SATR1. For the rest of the HERVs, however, only two loci were identified where HERV-L and HERVIP10FH, respectively, are associated with SATR1. Thus, it seems unlikely that this close relationship between SATR1 and HERV-E is preserved in the genome randomly; however, we do not understand the significance of such a relationship.
A total of 5 complete and 40 partial copies of Xiao, existing as individuals or clusters ranging from 3 to 50 kb that total to 0.9 Mb, were identified on 12 chromosomes (Fig. 2). Partial Xiaos display 5′ truncations, 3′ truncations, or internal deletions. The internal deletions are likely due to internal rearrangements or recombination between adjacent copies.
The Xiao core is derived from a 19p region that encodes the OR7E gene and pseudogenes. To determine the origin of the Xiao core, we searched its sequence (Fig. 2A) against the entire human genome. With a cutoff of at least 80% identity and 500-bp match length, we identified only a region on 19p (9162–9231 kb) that matched the Xiao core among all the sequences that belong to neither DAs nor Xiaos. This indicates that the Xiao core originated from this 19p site, which is further supported by the distribution of the OR7E subfamily in the genome, as its 86 members are all located in either DA and Xiao duplicons (which harbor 81 members) or this 19p region and 30 kb nearby (which harbor 5 members) (Fig. 4 and Supplemental Fig. s3). Although this 19p region is currently 69 kb, its ancestral site was likely smaller at the time when the Xiao core first emerged, because at least 25 kb of sequence are missing in the homologous loci in the rhesus macaque and dog genomes.
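The search logic described above (keep hits with at least 80% identity and at least 500 bp of match, then discard anything inside known DA or Xiao loci) can be sketched as a filter over alignment hits. This is an illustration under stated assumptions, not the pipeline used in the study: the `Hit` record, the coordinate convention, and all toy values except the 19p interval are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    """Hypothetical alignment hit; coordinates are 0-based and half-open."""
    chrom: str
    start: int
    end: int
    identity: float  # percent identity over the aligned region


Region = Tuple[str, int, int]  # (chrom, start, end), same convention as Hit


def overlaps(hit: Hit, region: Region) -> bool:
    """True if the hit shares at least one base with the region."""
    chrom, start, end = region
    return hit.chrom == chrom and hit.start < end and start < hit.end


def candidate_donor_hits(hits: List[Hit], duplicon_loci: List[Region],
                         min_identity: float = 80.0,
                         min_length: int = 500) -> List[Hit]:
    """Keep hits passing both cutoffs that lie outside all DA/Xiao loci."""
    kept = []
    for hit in hits:
        long_enough = (hit.end - hit.start) >= min_length
        similar_enough = hit.identity >= min_identity
        outside = not any(overlaps(hit, region) for region in duplicon_loci)
        if long_enough and similar_enough and outside:
            kept.append(hit)
    return kept


# Toy input: only the chr19 interval reflects coordinates given in the text.
hits = [
    Hit("chr19", 9_162_000, 9_231_000, 85.0),     # passes both cutoffs
    Hit("chr3", 127_000_000, 127_010_000, 95.0),  # inside a duplicon locus
    Hit("chr5", 1_000, 1_300, 99.0),              # shorter than 500 bp
    Hit("chr7", 50_000, 52_000, 70.0),            # below 80% identity
]
duplicon_loci = [("chr3", 126_900_000, 127_400_000)]
print(candidate_donor_hits(hits, duplicon_loci))
```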
Fig. 4.
Proposed retrotransposition model for DA/Xiao formation and amplification. Xiao formation (top): The Xiao core evolved from a 19p region (top line, showing only those with high homology to the Xiao core) and then was inserted into satellites of the flanking element. Tandem duplications (three tandem repeats were found at 3q-127Mb, shown in Supplemental Fig. s8, where the duplicons were ancestral) led to the complete Xiao (enclosed area). DA core formation (middle): The donor loci (bars on the right) contain retrogenes (blue vertical lines across the bars) that encode genes (purple lines below the bars with vertical lines indicating exons, see Supplemental Fig. s7 for the gene names) from which partial spliced transcripts of ~39, 89, and 66 kb (black, red, and blue lines) were generated and then fused during reverse transcription (via Alu elements as indicated) through the overlapping bases. Bottom: DAs were amplified following the arrows (solid arrows represent the resolved order, while dashed arrows represent the likely order) (through HERV-like mechanism? see text) and induced inversions (blue- and purple-shaded areas, see Fig. 2).
DA likely evolved from Xiao by inserting ~200 kb of chimeric sequences from 16p and 21q into the Xiao core, resulting in a TSD of 3.4 kb
We found that, except for a much larger core, DA is the same as Xiao. Compared to the Xiao core, the DA core harbors an additional ~200 kb of chimeric sequences derived from 16p (5067–5156 and 5196–5286-kb regions) and 21q (32,722–32,806-kb region), into which 56 kb of TEs (e.g., HERVs, Alus, and L1s) and retrogenes were later inserted (Fig. 2A). The ~200-kb chimeric sequence was found to be a direct fusion product of three sequence fragments. While the 16p and 21q fragments were fused directly with a 3-bp overlap (AAA) (see Supplemental Materials II for the actual sequence coordinates of each fusion site), the fusion site between the two 16p fragments is more complex because of TE insertion. However, for a majority of DAs, we found that the two 16p fragments were also fused directly with a 4-bp overlap (AGGC), followed by insertion of ~1 kb of sequence derived from L1, LTR2, and sometimes Alu into the overlapping “AGGC” bases between G and C (Fig. 2A).
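One way to read a "direct fusion with an overlap" is that the last few bases of the upstream donor fragment are identical to the first few bases of the downstream fragment, so the fused product carries them only once. The sketch below finds such shared bases under that assumed interpretation; the function name and the toy sequences are hypothetical, and only the overlap motifs AAA and AGGC are taken from the text.

```python
def junction_overlap(left: str, right: str, max_len: int = 20) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`.

    An empty string means the two fragments share no bases at the junction.
    """
    limit = min(len(left), len(right), max_len)
    for k in range(limit, 0, -1):
        if left[-k:] == right[:k]:
            return left[-k:]
    return ""


def fuse(left: str, right: str) -> str:
    """Join two fragments, writing the shared junction bases only once."""
    shared = junction_overlap(left, right)
    return left + right[len(shared):]


# Toy fragments built around the overlap motifs reported in the text.
print(junction_overlap("TTGCAGGC", "AGGCTTAA"))  # shared bases at a 16p/16p-like junction
print(fuse("CCGTAAA", "AAAGGT"))                 # fused product with AAA written once
```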
We identified a 3.4-kb sequence duplication flanking the 16p/21q chimeric sequence (Fig. 2) inside the DA core, compared to the Xiao core. This could be explained by two scenarios: (1) Xiao is ancestral and DA evolved from a Xiao through the insertion of the 16p/21q sequences into the Xiao core, resulting in a TSD of 3.4 kb, or (2) DA is ancestral and Xiao evolved from a DA by internal deletions of the 16p/21q chimeric sequence mediated by this 3.4-kb duplicated sequence, similar to how solo LTRs formed from HERVs. Two observations indicate that Xiao is most likely the ancestor of DA and thus that the first scenario is the more likely one. First, the majority of Xiao copies were found to be more ancient than the DA copies (see below). Second, compared to the Xiao copies that are not associated with DA, the corresponding Xiao core portion of each DA copy contains an extra LTR5B element (Fig. 2A). Because the donor sequences at 19p have neither LTR5B nor HERV-K (LTR5B is the sole LTR of HERV-K), it is likely that the ancestral Xiao lacks LTR5B and DA evolved from a Xiao copy that is a descendant of the ancestral Xiao through a HERV-K insertion (which later degenerated to LTR5B) (Fig. 3), which is also supported by a likely duplication order (shown in Supplemental Fig. s4) among the Xiao copies derived by sequence alignment. Thus, the 3.4-kb duplication is likely a TSD, even though it is substantially larger than those reported so far for TEs.
Fig. 3.
Age determination of DA and Xiao and TSD identification. Left: Proposed evolution sequence of Xiao, type I DA (located in 3q-127Mb and 3q-75Mb), type II DA (located in 7–6.6/97Mb, 3q-131Mb, 4p-4Mb, 11p-3Mb, 11q-67Mb, and 8p-6.9Mb), and type III DA (located in 12p-8Mb, 8p-7Mb, 8p-11.8Mb, 4–9Mb, and 11q-71Mb). Right top: Age of DA and Xiao. Double arrows indicate their amplification peak. Right bottom: TSDs of DA/Xiao loci were determined by using the rhesus and/or orangutan homologous regions as the empty sites (see Table 2 for the results).
A total of 20 DA copies (only 4 are complete) were found on six chromosomes, ranging from 26.5 to 400 kb (clusters or additional sequence duplication, see Figs. 2 and 3), totaling 3.9 Mb.
Three types of DA core exhibit a hierarchical order. By examining the sequence composition (e.g., additional sequence duplication, the insertion patterns of HERVs and retrogenes) of each DA core as well as the sequence alignment among the DA copies, we found three types of DA core existing in the human genome (Fig. 3). With its sequences closest to those at the 21q and 16p donor loci, type I is the most ancestral and was found in both the orangutan and the chimp genomes (Supplemental Tables s4 and s5 indicate which DA/Xiao copies have the chimp and orangutan homologues, respectively). Type II harbors an extra HERV-H-HERV-E-HERV-H element (a HERV-E element was inserted into an existing HERV-H), from which type III evolved by having an extra 6-kb doublet repeat and additional sequence duplication of 100 kb originating from 8p (based on the rhesus macaque genome). Both types II and III are found in the chimp genome but not in the orangutan genome. Each type has several DA copies (Fig. 3), and a likely hierarchical order among the copies can be found in Supplemental Fig. s5. The analyses identified the DA located at 3q-127Mb as the most ancestral and revealed that duplicons of the same chromosome are more closely related than duplicons from different chromosomes, indicating interchromosomal duplication followed by intrachromosomal duplication.
Only DAs (and not Xiaos) mediated human intrachromosomal rearrangements
Xiaos do not coincide with breaks of synteny among the human, mouse, and rat genomes and are not associated with large-scale interspecies genomic rearrangements (Fig. 2). On the other hand, 19 of 20 total DAs overlap with human-specific breakpoints (Fig. 2B) and are likely to have mediated intrachromosomal rearrangements of the human genome via either fission of an individual DA or homologous recombination between DAs (Fig. 2C), based on our reconstructed ancestral genome (Fig. 1). All these inversions (ranging from 4 to 70 Mb) are absent from the rhesus macaque genome but present in the chimp genome (the rhesus and chimp homologous regions can be viewed in Supplemental Materials II), consistent with later analyses indicating that DA and Xiao emerged after humans/apes diverged from Old World monkeys.
SDs were found to be enriched at the breakpoint region of large genomic rearrangements previously [18], but the precise mechanisms through which SDs facilitate rearrangements are not understood. Our study indicates that in addition to homologous recombination between two copies of an SD in opposite orientations and on the same chromosome, internal breakages within SDs could result in large inversions (Fig. 2). In addition, the analysis implies that only large SDs (e.g., only DA, not Xiao) can mediate large inversions. The study also answers the question why some OR7E-associated SDs coincide with breaks of synteny with the mouse, rat, and/or chicken genomes, whereas others do not, as previously reported [22].
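The first mechanism mentioned above, recombination between two copies in opposite orientations on the same chromosome, can be illustrated with a toy model in which a chromosome is an ordered list of oriented blocks. The block names, the representation, and the function below are hypothetical and serve only to show why such an exchange flips the intervening segment.

```python
def invert_segment(blocks, start, end):
    """Return a new block list with blocks[start:end] reversed and strand-flipped.

    Each block is a (name, strand) pair, with strand +1 or -1.
    """
    inverted = [(name, -strand) for name, strand in reversed(blocks[start:end])]
    return blocks[:start] + inverted + blocks[end:]


# Toy chromosome: two DA copies in opposite orientations flank blocks B and C.
chromosome = [("A", +1), ("DA1", +1), ("B", +1), ("C", +1), ("DA2", -1), ("D", +1)]

# Recombination between DA1 and DA2 inverts the segment lying between them.
rearranged = invert_segment(chromosome, 2, 4)
print(rearranged)
```

In this simplified model the DA copies themselves are left in place; applying the same operation a second time restores the original order.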
Xiao/DA emerged after humans/apes diverged from Old World monkeys, and most Xiao copies appeared before the human–orangutan divergence, whereas most DA copies arose after the human–orangutan divergence
To determine the ages of DA and Xiao, we searched their sequences against the published chimp and rhesus macaque genomes [23,24] and identified the corresponding chimp homologues (shown in Supplemental Table s4) for a majority of DA and Xiao copies, but did not find any rhesus macaque homologues. This indicates that Xiao and DA emerged after humans/apes diverged from Old World monkeys about 25 Myr ago [25], which is further supported by HERV insertion analyses. Compared to the sequences at the 16p and 21q donor loci, the DA core harbors additional HERVs such as HERV-K, S71, H, and E (Fig. 2A). HERV-L, on the other hand, was found at the same positions as at the donor locus (Supplemental Table s6 indicates the order of all LTR elements found in the donor loci and each DA core). Therefore, HERV-L was inserted into the donor locus prior to the duplication, whereas the other HERVs were inserted into the duplicated copy after the DA core formed. Thus, DA first emerged before 10–15 Myr ago, when HERV-H was still amplified in the genome [26], and after 40 Myr ago, when HERV-L ceased mobility [27] and HERV-K and HERV-E first entered the genome [28,29].
To determine whether DA emerged first or Xiao emerged first, we searched the two insertion junction sequences and internal sequences of each DA and Xiao copy against the recently released orangutan contig sequences and found that the orangutan genome has nearly all the Xiao copies but only the type I DAs (Fig. 3). This indicates that most Xiao copies appeared in the genome before the human–orangutan divergence, whereas the majority of DA copies (types II and III) emerged after the human–orangutan divergence ~14 Myr ago [25], consistent with a study finding HERV-H/HERV-E hybrids in the human and chimp genomes but not in the orangutan genome [30]. We searched the published chimeric sequences against the human genome and found that they all reside in the type II DA core, where a HERV-E inserted into a HERV-H (Figs. 2 and 3), but nowhere else.
These bioinformatics analysis results were confirmed by our polymerase chain reaction experiments (unpublished data) that examined the insertion junctions of each DA/Xiao copy as well as the internal chimeric fusion junctions of the DA core (Fig. 2) in the genomes of various primates, which will be reported in a separate manuscript in detail.
Are DA and Xiao TEs? Target site duplication identification
Satellites as target insertion sites
Because many DA and Xiao copies start and end with either SATR1 or SATR2 (often in the range of a few kilobases) (Fig. 2A), these satellite sequences seem to be the preferred insertion sites for DA and Xiao. Insertion into a satellite sequence has advantages, including a lower chance of disrupting gene-coding regions compared to other sites such as Alus. In addition, satellites can better tolerate length variation and thus better withstand insertion outcomes such as TSDs. Unlike other satellites, SATR1 and SATR2 are not concentrated near or within heterochromatic regions but rather are scattered over euchromatic areas. We identified a total of 182 SATR1, 40 SATR2, and 123 SATR1–SATR2 sites in the genome, of which 60, 12, and 83 are associated with DA/Xiao, respectively. For the remaining sites, we found that 93% of SATR1, 88% of SATR1–SATR2, and 82% of SATR2 are closely associated with TEs (within a distance of < 100 bp) (Supplemental Table s3 provides more detailed information regarding the distribution of SATR1 and SATR2 sequences in the genome); however, we do not understand the significance of such a relationship and do not know if transposition of these TEs has brought SATR1 and SATR2 sequences to these many genomic sites.
TSD identification for DA/Xiao
TSD is an almost universal hallmark of TEs. To determine whether DA and Xiao are TEs, we searched for TSDs at each DA/Xiao insertion site. TSDs could be identified by comparing the sequences at the two insertion junctions, which, however, requires the junctions to be precisely determined. As described before, many DA and Xiao copies terminate with SATR1 or SATR2 sequences that are much more rapidly evolving compared to other sequences in the genome [31]; the junctions determined based on sequence alignment with other copies may not be the actual insertion sites due to sequence mutations.
To resolve these issues and to obtain an unambiguous TSD identification, we would need to compare the sequence of the insertion site to that of the preinsertion site (or empty site). Toward this goal, we used the homologous regions of the released rhesus macaque and orangutan genomes as the empty sites (Fig. 3) and, indeed, identified TSDs of 4–8 bp for at least 8 of 30 total DA/Xiao loci (Table 2). It is possible that more DA/Xiao insertions have created TSDs but we failed to detect them for the following reasons. First, the rhesus genome is in a draft state and may contain gaps, misassemblies, and other inaccuracies; the orangutan genome is even less complete and only the contig sequences (ranging from 288 bp to 10 Mb with an average of 41 kb) are released. This may have made the TSD identification impossible for some DA/Xiao insertion sites. We expect more TSDs to be found when these genomic sequences become more complete. Second, sequence mutations and genomic rearrangements complicate TSD identification. As described before, many DA/Xiao copies inserted into SATR1 and SATR2 sequences, which are evolutionarily more dynamic and contain more mutations. In addition, nearly all DA copies are involved in homologous recombination and rearrangements (Fig. 2), leading to gains/losses of sequences. Despite these issues (see Supplemental Table s8 for specific issues found with specific DA/Xiao loci), the TSD results (Table 2) indicate that DA and Xiao are likely TEs; however, more analyses are clearly needed before we can confidently conclude this.
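The empty-site comparison described above can be sketched as follows. The sketch assumes an idealized case in which the outgroup empty site carries no additional insertions (an empty site length of 0 bp) and the junctions are known exactly; the function name, the context length, and the flanking sequences are hypothetical, and only the motif CCATCA is taken from Table 2.

```python
def find_tsd(left_flank, right_flank, empty_site, max_len=20, context=5):
    """Return a candidate TSD supported by the empty site, or "" if none.

    left_flank  : host sequence ending at the element's left junction
    right_flank : host sequence starting at the element's right junction
    empty_site  : homologous preinsertion sequence from an outgroup
    A candidate is a suffix of left_flank equal to a prefix of right_flank;
    it is accepted only if the empty site contains upstream context, one
    copy of the candidate, and downstream context as a contiguous run.
    """
    limit = min(len(left_flank), len(right_flank), max_len)
    for k in range(limit, 0, -1):
        candidate = left_flank[-k:]
        if right_flank[:k] != candidate:
            continue
        # Upstream context plus one TSD copy, then the downstream context.
        expected = left_flank[-(k + context):] + right_flank[k:k + context]
        if expected in empty_site:
            return candidate
    return ""


# Toy example: invented flanks around the TSD reported for the 2-71 Xiao locus.
upstream, tsd, downstream = "ACGTAC", "CCATCA", "GGTTAG"
left_flank = upstream + tsd
right_flank = tsd + downstream
empty_site = "TT" + upstream + tsd + downstream + "AA"
print(find_tsd(left_flank, right_flank, empty_site))
```

Real loci would require tolerance for mismatches (such as the C/A change noted in Table 2) and for sequence inserted into the empty site after divergence, neither of which this exact-match sketch handles.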
Table 2.
DA/Xiao TSD identificationᵃ
Dupliconᵇ | Empty site | Empty site length (bp)ᶜ | TSD |
---|---|---|---|
14–51 Xiao | Orangutan | 0 | GGCC(A)CCCCᵈ |
2–71 Xiao | Rhesus | 0 | CCATCA |
2–159 Xiao | Rhesus | 9 | AATATC |
9–90 Xiao | Rhesus | 0 | GAGATTGG |
13–40 Xiao | Rhesus | 299 | AAAACT |
21–32 Xiao | Rhesus | 182 | GGGAGT |
11–67 DA | Orangutan | 933 | ATGG |
4–4/9 DAᵉ | Orangutan | 837 | CCTGCTA |
ᵃ The detailed analyses can be found in Supplemental Table s7.
ᵇ Duplicons are represented by their location in the human genome, e.g., “14–51 Xiao” stands for the Xiao copy located in 51 Mb of chromosome 14.
ᶜ The empty site length is the length of the orangutan/rhesus site that corresponds to the left and right insertion junctions of DA/Xiao in the human genome. Ideally, it should be 0 bp; however, sequence insertions (e.g., TEs) could have happened as the orangutan/rhesus genome evolved.
ᵈ A base change (C/A) occurred in the TSD.
ᵉ This is a DA-mediated inversion in the human genome compared to the rhesus and orangutan genomes; consequently the insertion junctions of both DA copies were compared to identify the TSD.
Discussion
We discovered two new composite LTR-retrotransposon-like elements, DA and Xiao, while characterizing a genome-wide SD and large inversions in the human genome. Although this OR7E-associated SD was investigated previously [32], the duplicon structure was not determined and many questions remained to be answered. For instance, why does this SD involve 16p and 21q regions that have no OR7E members at all [32]? Why do some duplicons coincide with breaks of synteny with the mouse, rat, and/or chicken genomes, whereas others do not [22]? The determination of the DA/Xiao structure (Fig. 2) and the reconstructed human–mouse–rat ancestral genome (Fig. 1) answer many such questions and shed light on the role that SDs have played in facilitating rearrangements and reshaping the genome (Fig. 2). The discovery of DA/Xiao also provides an opportunity to explore the mechanisms by which some SDs might have arisen and spread throughout the genome. Below we discuss observations that argue for and against the possibilities that DA and Xiao may be composite LTR–retrotransposons and that retrotransposition may be the more likely mechanism for the origin and propagation of this genome-wide SD, compared to the current working models of DNA transposition [11,32] and recombination [12].
What is the mechanism of 16p/21q chimeric sequence insertion to form the DA core (Figs. 2 and 3): replicative DNA transposition, recombination, or retrotransposition?
The DA core appears to have originated from a Xiao core by inserting sequences from the 16p and 21q donor loci (Figs. 2 and 3). While replicative DNA transposition has been proposed to explain chimeric sequence formation within SDs [11,32], the exact mechanism by which DNA has transposed is unclear, as these sequence fragments do not encode DNA transposases like DNA transposons. In addition, the model proposes that sequences from different donor loci are being independently transposed to the acceptor locus [11,32], predicting a higher possibility of acceptor locus sequences intervening between inserted sequences from two different donor loci, which, however, is not the case here as the three sequence fragments from the 16p and 21q donor sites were found to be directly fused (Fig. 2).
A previous study found that AluS and AluY were enriched at SD junctions and proposed that Alu-mediated nonallelic homologous recombination (NAHR) played a role in the origin and spread of SDs [12]. Although AluS was identified at 16p/21q chimeric sequence insertion junctions, two observations argue against the NAHR model. First, NAHR would not result in the 3.4-kb TSD at the insertion site shown in Fig. 2. Second, NAHR would likely induce genomic rearrangements [12], which were not observed, as the 16p and 21q donor loci were found to be ancestral and no rearrangements have been identified since human diverged from mouse/rat ~75 Myr ago (Fig. 1).
Is retrotransposition possibly the mechanism? Although retrotransposition has not been proposed to explain SD origin and propagation before, several findings indicate that it is a possible mechanism for the 16p/21q sequence insertion. First, a TSD was found at the insertion site (Fig. 2), albeit with a size (3.4 kb) substantially larger than TSDs previously reported. Second, the 16p/21q donor loci encode genes (Fig. 4) and are likely to be transcribed. We found that not all sequences from the donor loci were retained in the DA core, and the exon- or retrogene-containing regions are better preserved than intron sequences (Fig. 4), indicating that the sequences might have undergone transcription and partial processing. Based on these findings, we propose a retrotransposition model as shown in Fig. 4 that uses AluS elements as the priming sites. The model also hypothesizes that 16p/21q sequence fusions took place during reverse transcription, similar to template switching during trans-mobilization of non-L1 mRNAs by L1-retrotransposition machinery to form chimeric U6-L1 pseudogenes [33]. Although this process (Fig. 4) resembles L1 insertion [2,34], we realize that because of their large size (~200 kb, Fig. 2), even if these sequences were inserted indeed through retrotransposition, the mechanism must be different from that of L1s in some aspects.
Are DA and Xiao composite LTR–retrotransposons that have proliferated in the genome through retrotransposition?
Alu-mediated NAHR has been proposed as a possible mechanism through which SDs spread throughout the genome [12]. Because many DA and Xiao copies start and end with SATR1 or SATR2 sequences (Fig. 2A), one could argue that DA and Xiao were amplified via NAHR mediated by these satellites. However, several findings are inconsistent with the NAHR model. First, for those complete DA and Xiao copies that have the two flanking “satellite–HERV-E–satellite” elements (Fig. 2), sequence homology was significantly higher among the corresponding satellites between the left and the right elements (76 and 91%) than among those within the same element (64 and 65%, the same as background) (Table 3). This is also clearly demonstrated by the phylogenetic trees, in which these satellite sequences separate into two major groups at the first branch point (Supplemental Fig. s6). The NAHR model, in contrast, would predict the same sequence divergence among all four satellite sequences. Second, NAHR would likely induce genomic rearrangements [12], resulting in deletions, translocations, and/or inversions of sequences that belong to neither DAs nor Xiaos. However, based on our reconstructed ancestral genome (Fig. 1) and the rhesus macaque genome, we did not find such rearrangements associated with any of the 45 Xiaos or the 20 DAs, except for the DA-mediated inversions shown in Fig. 2 (which likely occurred after the DAs were duplicated).
Table 3.
Average sequence identities between the left and the right flanking SATR–HERV-E–SATR elements of complete DA and Xiao copies (a)
Comparison | Average identity (%) |
---|---|
L-l-SATR (b) vs R-l-SATR | 76 |
L-r-SATR vs R-r-SATR | 91 |
L-HERV-E vs R-HERV-E | 91 |
L-l-SATR vs L-r-SATR | 66 (c) |
R-l-SATR vs R-r-SATR | 65 (c) |
(a) SATR represents SATR2-SATR1 (see Fig. 2). Separate analyses were performed for SATR2 and SATR1, and the results were nearly identical.
(b) “R” and “L” stand for the right and left flanking elements, respectively, whereas lowercase “l” and “r” stand for the left and right SATR within the same flanking element, respectively. Thus, L-l-SATR stands for the left SATR of the left flanking element, R-r-SATR stands for the right SATR of the right flanking element, and so on.
(c) These identities are similar to those of background SATR, for which a total of 23,186 SATR sequences were analyzed.
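The comparison summarized in Table 3 can be illustrated with a short sketch. It assumes the four satellite sequences of one copy have already been aligned to equal length with "-" as the gap character; the function names, the dictionary keys, and the toy sequences are hypothetical and are not the data or software used in this study.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def between_vs_within(copy: dict) -> tuple:
    """Mean identity of corresponding satellites between the two flanking
    elements versus mean identity of the two satellites within one element."""
    between = (percent_identity(copy["L-l"], copy["R-l"])
               + percent_identity(copy["L-r"], copy["R-r"])) / 2
    within = (percent_identity(copy["L-l"], copy["L-r"])
              + percent_identity(copy["R-l"], copy["R-r"])) / 2
    return between, within


# Toy aligned satellites for one hypothetical copy (keys follow the Table 3 naming).
toy_copy = {
    "L-l": "ACGTACGT", "L-r": "TTGTACCA",
    "R-l": "ACGTACGA", "R-r": "TTGTACCA",
}
between, within = between_vs_within(toy_copy)
print(between, within)
```

Under an LTR-like duplication of one flanking element into the other, the between-element value is expected to exceed the within-element value, whereas recombination among satellites would not favor either comparison.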
Because DA and Xiao are structured like composite LTR–retrotransposons (Fig. 2), could they have propagated through retrotransposition? Several observations argue for this model. First, TSDs were found for a number of DA/Xiao copies (Table 2), indicating that they might have proliferated through transposition, especially considering that recombination is an unlikely mechanism, as discussed above. Second, DA and Xiao copies were found to be spanned by genes, mRNAs, and/or ESTs (Supplemental Materials), and we found that many intron regions were transcribed, including the 16p/21q junction (an 800-bp EST, CX164970, was found to match the junction), the OR7E pseudogenes (60% were found to be transcribed, which is significantly higher than in other OR subfamilies [35]), and the HERV-E–HERV-H junctions [30]. This indicates the possibility that the entire DA/Xiao copy might have been transcribed. Third, as described above, when the two flanking satellite–HERV-E–satellite elements are compared, the corresponding satellites between the two elements are significantly more similar than those within the same element (see Table 3). This is consistent with the LTR–retrotransposon amplification mechanism, in which the LTR is duplicated from one end to the other during reverse transcription [2,3]. Last, we found that the primer binding site (PBS) was retained only in the left HERV-E, whereas the central polypurine tract (cPPT) [36] was preserved much better in the right HERV-E (Fig. 4, Supplemental Table s9), implying the possibility that reverse transcription might have proceeded via the two flanking HERV-Es (Fig. 2), with the left one providing the PBS and the right one providing the cPPT.
A number of observations, however, also argue against the HERV-like retrotransposition model. First, DA and Xiao are too big. HERV reverse transcription takes place inside a virus-like particle that can accommodate only molecules smaller than 20 kb, a capacity too small for Xiao and DA copies (ranging from 3 to 50 and from 26.5 to 400 kb, respectively, Fig. 2). Second, each HERV integrase has strict substrate specificity. For instance, the integrase of HERV-E is specific for double-stranded DNA that begins with “TA” and ends with “AC” [37]. For most DA/Xiao copies, we did not find such bases at the ends, although the junctions might not have been precisely determined. Because of these issues, we realize that even if DA/Xiao indeed propagated through retrotransposition, the mechanism must be new in many respects and cannot be the same as HERV retrotransposition [2].
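The two objections above amount to simple tests on a candidate copy, sketched below. The function names are hypothetical; the 20-kb limit and the “TA”/“AC” terminal bases are taken from the text, and the strict end match is a simplification given that the junctions may not be precisely determined.

```python
def fits_virus_like_particle(length_kb: float, limit_kb: float = 20.0) -> bool:
    """True if a copy is below the ~20-kb packaging limit cited in the text."""
    return length_kb < limit_kb


def has_herv_e_integrase_ends(seq: str) -> bool:
    """True if the copy begins with 'TA' and ends with 'AC', the terminal
    bases reported for the HERV-E integrase substrate [37]."""
    s = seq.upper()
    return s.startswith("TA") and s.endswith("AC")


# Size ranges from the text: Xiao 3 to 50 kb, DA 26.5 to 400 kb.
for name, size_kb in [("smallest Xiao", 3.0), ("largest Xiao", 50.0),
                      ("smallest DA", 26.5), ("largest DA", 400.0)]:
    print(name, size_kb, fits_virus_like_particle(size_kb))

# Toy end check on a hypothetical sequence.
print(has_herv_e_integrase_ends("TAGGCCAC"))
```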
Clearly, the question whether DA and Xiao originated and propagated through retrotransposition can be answered only by more analyses. We hope that our findings reported here will lead to further studies that will eventually allow the mechanisms by which some large low-copy-number repeats have arisen and dispersed throughout the genome to be understood.
Retrotransposition may be more successful than DNA transposition for sequence amplification in the human genome
Nearly 50% of the human genome consists of retroelements, whereas only 3% is DNA transposons. In addition, while some retrotransposons are still active (e.g., L1, Alus, HERV-K), no evidence indicates DNA transposon activity in the human genome for the past 37 Myr [38]. This implies that retrotransposition is a more efficient mechanism for DNA amplification within the human genome. One likely reason is that retrotransposition keeps the donor copy intact and simultaneously introduces changes into the duplicated copy by manipulating the transcript, whereas DNA transposition may damage the donor copy because DNA breakage occurs at the donor as well as at the acceptor sites.
The discovery of Xiao and DA indicates a possible link between a genome-wide SD and retrotransposition, which differs from the current working models of DNA transposition and recombination [11,12,32]. However, we do not know whether this finding represents the general situation for SDs. A recent study [20] that focused on intrachromosomal duplications within chromosome 16 (SDs different from the one studied here) proposed a core duplicon-flanking transposition model, in which duplication of a specific duplicon (the core duplicon) results in transposition of its adjacent sequences. This differs from Xiao/DA, which amplified additional sequences by inserting them into the core while keeping the flanking elements unchanged (Fig. 3). We studied the core duplicon but found that its architecture did not resemble that of DAs or Xiaos. We do not know whether interchromosomal and intrachromosomal SDs have different duplication mechanisms.
Materials and methods
The analysis used the human NCBI build 35, the dog genome canFam1 version, the chicken genome version 2, the mouse NCBI build 30, the rat genome version 3.1, the chimp genome panTro2 version, and the rhesus macaque draft assembly version 1.0, all downloaded from the University of California Santa Cruz (UCSC) genome site at www.genome.ucsc.edu and the Ensembl site at www.ensembl.org. The orangutan genomic contig sequence data were obtained from the Washington University Saint Louis Genome Sequencing Center at genome.wustl.edu.
About 1,200,000 end-sequence mate pairs from large clones (bacterial artificial chromosome clones and 50-kb shotgun clones), downloaded from GenBank and the NCBI Trace Archive database, were searched against the human, mouse, rat, and dog genomes as previously described [39]. As a result, a total of 340,000, 270,000, 170,000, and >200,000 large clones were mapped to the human, mouse, rat, and dog genomes, respectively, with a genome coverage of >10-fold. The synteny and rearrangement breakpoint maps among these four genomes were constructed by using these mapped clones as anchors as previously described [39]. The human–mouse–rat ancestral genome was reconstructed by using dog as the outgroup, and the mouse–rat ancestral genome was reconstructed using human as the outgroup, by using the most parsimonious rule (genomic fragments with the fewest rearrangements compared to the outgroup are considered to be ancestral, see Supplemental Fig. s1).
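The parsimony rule described above can be illustrated with a toy sketch. It assumes each genome region is encoded as an ordered list of signed, nonzero synteny-block identifiers and uses the number of breakpoints relative to the outgroup as a simple stand-in for the number of rearrangements; the encoding, function names, and example orders are hypothetical and do not reproduce the clone-anchor method of [39].

```python
def adjacencies(order):
    """Set of adjacent signed-block pairs, independent of reading strand.

    Block identifiers must be nonzero integers; a negative sign marks a
    block in reverse orientation.
    """
    adj = set()
    for a, b in zip(order, order[1:]):
        adj.add((a, b))
        adj.add((-b, -a))  # the same adjacency read from the opposite strand
    return adj


def breakpoints(candidate, outgroup):
    """Number of adjacencies in candidate that are absent from the outgroup."""
    out_adj = adjacencies(outgroup)
    return sum((a, b) not in out_adj for a, b in zip(candidate, candidate[1:]))


def pick_ancestral(candidates, outgroup):
    """Name of the candidate arrangement with the fewest breakpoints."""
    return min(candidates, key=lambda name: breakpoints(candidates[name], outgroup))


# Toy region: the rodent order carries an inversion of blocks 2 and 3.
dog = [1, 2, 3, 4, 5]                      # outgroup
candidates = {"human": [1, 2, 3, 4, 5], "mouse": [1, -3, -2, 4, 5]}
print({name: breakpoints(order, dog) for name, order in candidates.items()})
print(pick_ancestral(candidates, dog))
```

In this toy case the arrangement that matches the outgroup has no breakpoints and is therefore taken as ancestral, mirroring the rule that fragments with the fewest rearrangements relative to the outgroup are considered ancestral.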
SDs were identified by using databases obtained from the UCSC genome site and the human SD database at projects.tcag.ca, as well as by structural matches. Genes, retrogenes, and other annotation data were obtained from the UCSC and Ensembl sites. Repeats were identified by RepeatMasker (www.repeatmasker.org). Annotation of OR7E members was obtained from the HORDE database at http://bioportal.weizmann.ac.il/HORDE/. Detailed supporting analyses can be found at http://csbl.bmb.uga.edu/~jix/science/duplicon/supp/.
Acknowledgments
We thank Dr. Guojun Li for helping us develop algorithms to derive duplication order, Drs. J. David Puett and Claiborne V. C. Glover III, and Kevin Jamison for editing the manuscript. This work is supported by funds from the University of Georgia, the Georgia Cancer Coalition, and the American Cancer Society.
Footnotes
Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ygeno.2007.10.014.
References
- [1] Craig NL. Mobile DNA: an introduction. In: Craig NL, Craigie R, Gellert M, Lambowitz AL, editors. Mobile DNA II. ASM Press; Washington, DC: 2002. pp. 3–11.
- [2] Kazazian HH Jr. Mobile elements: drivers of genome evolution. Science. 2004;303:1626–1632. doi: 10.1126/science.1089670.
- [3] Voytas DF, Boeke JD. Ty1 and Ty5 of Saccharomyces cerevisiae. In: Craig NL, Craigie R, Gellert M, Lambowitz AL, editors. Mobile DNA II. ASM Press; Washington, DC: 2002. pp. 631–662.
- [4] Martin SL, Garfinkel DJ. Survival strategies for transposons and genomes. Genome Biol. 2003;4:313. doi: 10.1186/gb-2003-4-4-313.
- [5] Bennetzen JL. Transposable elements, gene creation and genome rearrangement in flowering plants. Curr. Opin. Genet. Dev. 2005;15:621–627. doi: 10.1016/j.gde.2005.09.010.
- [6] Deininger PL, Batzer MA. Mammalian retroelements. Genome Res. 2002;12:1455–1465. doi: 10.1101/gr.282402.
- [7] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
- [8] Venter JC, et al. The sequence of the human genome. Science. 2001;291:1304–1351. doi: 10.1126/science.1058040.
- [9] Wang H, et al. SVA elements: a hominid-specific retroposon family. J. Mol. Biol. 2005;354:994–1007. doi: 10.1016/j.jmb.2005.09.085.
- [10] Bailey JA, et al. Recent segmental duplications in the human genome. Science. 2002;297:1003–1007. doi: 10.1126/science.1072047.
- [11] Samonte RV, Eichler EE. Segmental duplications and the evolution of the primate genome. Nat. Rev. Genet. 2002;3:65–72. doi: 10.1038/nrg705.
- [12] Bailey JA, Liu G, Eichler EE. An Alu transposition model for the origin and expansion of human segmental duplications. Am. J. Hum. Genet. 2003;73:823–834. doi: 10.1086/378594.
- [13] Stankiewicz P, Lupski JR. Genome architecture, rearrangements and genomic disorders. Trends Genet. 2002;18:74–82. doi: 10.1016/s0168-9525(02)02592-1.
- [14] Menendez L, Benigno BB, McDonald JF. L1 and HERV-W retrotransposons are hypomethylated in human ovarian carcinomas. Mol. Cancer. 2004;3:12. doi: 10.1186/1476-4598-3-12.
- [15] Bourque G, Zdobnov EM, Bork P, Pevzner PA, Tesler G. Comparative architectures of mammalian and chicken genomes reveal highly variable rates of genomic rearrangements across different lineages. Genome Res. 2005;15:98–110. doi: 10.1101/gr.3002305.
- [16] Yunis JJ, Prakash O. The origin of man: a chromosomal pictorial legacy. Science. 1982;215:1525–1530. doi: 10.1126/science.7063861.
- [17] Murphy WJ, Stanyon R, O'Brien SJ. Evolution of mammalian genome organization inferred from comparative gene mapping. Genome Biol. 2001;2:REVIEWS0005. doi: 10.1186/gb-2001-2-6-reviews0005.
- [18] Murphy WJ, et al. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science. 2005;309:613–617. doi: 10.1126/science.1111387.
- [19] Zody MC, et al. Analysis of the DNA sequence and duplication history of human chromosome 15. Nature. 2006;440:671–675. doi: 10.1038/nature04601.
- [20] Johnson ME, et al. Recurrent duplication-driven transposition of DNA during hominoid evolution. Proc. Natl. Acad. Sci. U. S. A. 2006;103:17626–17631. doi: 10.1073/pnas.0605426103.
- [21] Glusman G, Yanai I, Rubin I, Lancet D. The complete human olfactory subgenome. Genome Res. 2001;11:685–702. doi: 10.1101/gr.171001.
- [22] Yue Y, Haaf T. 7E olfactory receptor gene clusters and evolutionary chromosome rearrangements. Cytogenet. Genome Res. 2006;112:6–10. doi: 10.1159/000087507.
- [23] Rhesus Macaque Genome Sequencing and Analysis Consortium. Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007;316:222–234. doi: 10.1126/science.1139247.
- [24] Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437:69–87. doi: 10.1038/nature04072.
- [25] Stewart CB, Disotell TR. Primate evolution—in and out of Africa. Curr. Biol. 1998;8:R582–R588. doi: 10.1016/s0960-9822(07)00367-3.
- [26] Goodchild NL, Wilkinson DA, Mager DL. Recent evolutionary expansion of a subfamily of RTVL-H human endogenous retrovirus-like elements. Virology. 1993;196:778–788. doi: 10.1006/viro.1993.1535.
- [27] Smit AF. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 1999;9:657–663. doi: 10.1016/s0959-437x(99)00031-3.
- [28] Medstrand P, Mager DL. Human-specific integrations of the HERV-K endogenous retrovirus family. J. Virol. 1998;72:9782–9787. doi: 10.1128/jvi.72.12.9782-9787.1998.
- [29] Christensen T. Association of human endogenous retroviruses with multiple sclerosis and possible interactions with herpes viruses. Rev. Med. Virol. 2005;15:179–211. doi: 10.1002/rmv.465.
- [30] Lindeskog M, Medstrand P, Cunningham AA, Blomberg J. Coamplification and dispersion of adjacent human endogenous retroviral HERV-H and HERV-E elements; presence of spliced hybrid transcripts in normal leukocytes. Virology. 1998;244:219–229. doi: 10.1006/viro.1998.9106.
- [31] Ugarkovic D, Plohl M. Variation in satellite DNA profiles—causes and effects. EMBO J. 2002;21:5955–5959. doi: 10.1093/emboj/cdf612.
- [32] Newman T, Trask BJ. Complex evolution of 7E olfactory receptor genes in segmental duplications. Genome Res. 2003;13:781–793. doi: 10.1101/gr.769003.
- [33] Garcia-Perez JL, Doucet AJ, Bucheton A, Moran JV, Gilbert N. Distinct mechanisms for trans-mediated mobilization of cellular RNAs by the LINE-1 reverse transcriptase. Genome Res. 2007;17:602–611. doi: 10.1101/gr.5870107.
- [34] Luan DD, Korman MH, Jakubczak JL, Eickbush TH. Reverse transcription of R2Bm RNA is primed by a nick at the chromosomal target site: a mechanism for non-LTR retrotransposition. Cell. 1993;72:595–605. doi: 10.1016/0092-8674(93)90078-5.
- [35] Feldmesser E, et al. Widespread ectopic expression of olfactory receptor genes. BMC Genomics. 2006;7:121. doi: 10.1186/1471-2164-7-121.
- [36] Rausch JW, Le Grice SF. 'Binding, bending and bonding': polypurine tract-primed initiation of plus-strand DNA synthesis in human immuno-deficiency virus. Int. J. Biochem. Cell Biol. 2004;36:1752–1766. doi: 10.1016/j.biocel.2004.02.016.
- [37] Leib-Mösch C, et al. Influence of human endogenous retrovirus on cellular gene expression. In: Sverdlov ED, editor. Retrovirus and Primate Genome Evolution, Chap. 7. Landes Bioscience; Austin, TX: 2004. pp. 1–22.
- [38] Pace JK II, Feschotte C. The evolutionary history of human DNA transposons: evidence for intense activity in the primate lineage. Genome Res. 2007;17:422–432. doi: 10.1101/gr.5826307.
- [39] Zhao S, et al. Human, mouse and rat genome large scale rearrangements: stability versus speciation. Genome Res. 2004;14:1851–1860. doi: 10.1101/gr.2663304.