Abstract
We report here the presence of numerous processed pseudogenes derived from the W family of endogenous retroviruses in the human genome. These pseudogenes are structurally colinear with the retroviral mRNA followed by a poly(A) tail. Our analysis of insertion sites of HERV-W processed pseudogenes shows a strong preference for the insertion motif of long interspersed nuclear element (LINE) retrotransposons. The genomic distribution, stability during evolution, and frequent truncations at the 5′ end resemble those of the pseudogenes generated by LINEs. We therefore suggest that HERV-W processed pseudogenes arose by multiple and independent LINE-mediated retrotransposition of retroviral mRNA. These data document that the majority of HERV-W copies are actually nontranscribed promoterless pseudogenes. The current search for HERV-Ws associated with several human diseases should concentrate on a small subset of transcriptionally competent elements.
[Online supplementary material available at http://www.genome.org]
The human genome contains two major classes of autonomous, retrotransposition-competent elements that encode reverse transcriptase and are mobilized via an RNA intermediate: long interspersed nuclear elements (LINEs or L1) and human endogenous retroviruses (HERVs). Long terminal repeat (LTR)-containing HERVs comprise ∼4%–5% of the genome (International Human Genome Sequencing Consortium 2001). LINEs, retrotransposons lacking LTRs and containing poly(A) tails, are the most active autonomous transposable elements in the human genome (Kazazian and Moran 1998; Prak and Kazazian 2000). They are estimated to be present in >500,000 copies, comprising 17% of the genome (Smit 1996a, 1999; International Human Genome Sequencing Consortium 2001). However, due to 5′ truncations and deleterious mutations, only 30–60 LINEs per haploid genome are active and able to transpose (Sassaman et al. 1997; Kazazian 1999; The database of Retrotransposon Insertion into the Human Genome http://www.med.upenn.edu/genetics/labs/retrotrans_table.html). LINE elements have a broad impact on genome structure; besides occupying a large portion of the genome themselves, LINE copies are most probably involved in the expansion of Alu elements, which comprise 10% of the genome (Smit 1996a, 1999; International Human Genome Sequencing Consortium 2001). LINE transduction of the 3′-flanking sequences is estimated to account for a further 0.5%–1% of the genome (Goodier et al. 2000; Pickeral et al. 2000; International Human Genome Sequencing Consortium 2001).
Another type of genomic element generated by the LINE machinery is called processed pseudogenes. They were first described as pseudogenes structurally colinear with gene mRNA, lacking promoters, introns, and, in general, without protein-coding capacity due to mutations and frequent stop codons. Their mRNA-derived structure, poly(A) tails at the 3′ end and the presence of direct repeats of variable (5–15 bp) length led to the hypothesis that their formation requires reverse transcriptase, and these pseudogenes were termed processed pseudogenes (Vanin 1985; Weiner et al. 1986). In search of this reverse transcriptase activity, Tchènio et al. (1993) showed the generation of processed pseudogenes from intron-containing proviral structures in murine cells. Elimination of sequences essential in cis for the retroviral life cycle indicated that endogenous retroviruses are not involved in this process. Later, this endogenous reverse transcriptase activity was shown to generate processed pseudogenes also from nonviral (non-LTR) constructs in somatic HeLa cells (Maestre et al. 1995). Along the same line, evidence available from other experiments disclosed that retroviral infection, as well as forced expression of retrovirus-like structures, resulted in all cases in cDNA genes without the typical structure of processed pseudogenes (Dornburg and Temin 1990; Levine et al. 1990; Derr et al. 1991). In direct in vitro tests, LINEs, and particularly LINE ORF2, were necessary for production of the typical cDNA structures similar to processed pseudogenes, whereas Moloney murine leukemia virus as well as human immunodeficiency virus type 1 (HIV-1) were unable to form the expected cDNAs, probably due to the lack of essential structures such as the primer-binding site for tRNA in the vector (Dhellin et al. 1997). 
An independent support in favor of LINEs in the generation of processed pseudogenes comes from the computational comparison of insertion sites of LINEs, Alus, and processed pseudogenes. They all share similar features including a common TT|AAAA insertion motif and a variable-length (5–15 bp) target site duplication (TSD) (Jurka 1997), whereas retroviruses are characterized by short 4–6-bp direct repeats and no preference for a motif similar to TT|AAAA. Moreover, poly(A) tails and frequent truncation at the 5′ end of processed pseudogenes are typical for LINEs (Voliva et al. 1983; Smit 1999), indicating again that LINEs are the master mobile elements in the human genome. Finally, an in vitro assay clearly showed creation of reporter gene copies by the LINE machinery with all hallmarks of the processed pseudogene (Esnault et al. 2000).
The estimated portion of processed pseudogenes in the human genome is ∼0.5% (Dunham et al. 1999), with copy number 23,000–33,000 (Goncalves et al. 2000). The LINE machinery influences not only the pseudogene structure, but also the pseudogene distribution. We reported recently that not only young LINEs and Alus, but also processed pseudogenes, preferentially reside in GC-poor parts of the genome (Pavlíček et al. 2001).
During our work on the Human Endogenous Retrovirus database (http://herv.img.cas.cz; Pačes et al. 2002), we found unusual retroviral structures lacking the common proviral organization. They are structurally similar to retroviral mRNA, followed by a poly(A) tail. These elements are frequently observed in the HERV-W family (Blond et al. 1999), described previously as LM7 or multiple sclerosis-associated virus (Perron et al. 1989, 1997), and as the endogenous HERV17 family in the Repbase Update database (http://www.girinst.org/; Jurka 1998, 2000; Smit 1999). The HERV-W family is of special interest because of suggestions of a role in several human diseases including multiple sclerosis (Perron et al. 1997; Komurian-Pradel et al. 1999), rheumatoid arthritis (Gaudin et al. 1997, 2000), and schizophrenia (Karlsson et al. 2001). Moreover, the env gene of a prototype HERV-W element on chromosome 7 was suggested as the human gene coding for the syncytin protein responsible for cell fusion during the differentiation of the syncytiotrophoblast in human placenta (Smit 1999; Blond et al. 2000; Mi et al. 2000). This is one of the best-documented examples of recruitment of retroviral genes by the host genome.
This work describes, for the first time, the presence of processed pseudogenes from mRNA of HERV-Ws in the human genome. These pseudogenes are remarkably numerous in the HERV-W family and are not found at comparable frequencies in related retroviral families of the class 1 HERVs. This finding represents not only an interesting connection between two main reverse transcriptase-encoding classes of retroelements, LINEs and HERVs, but, as we show, also a unique model for comparing LINE and retroviral transposition, mRNA preference, integration specificity, recombination, and genomic stability.
RESULTS
Determination and Genomic Structure of HERV-W Retroviral and Pseudogene Copies
Using the genome-wide RepeatMasker scan, 654 HERV-W family members were found. From their multiple alignment schematically depicted in Figure 1 (see also Supplementary Figure 1S available online at http://www.genome.org), it is clear that HERV-W sequences are divided into two major groups. The first group is defined by the complete or partial U3 region in 5′ LTR and/or U5 in 3′ LTR and will be referred to as the retro group (or retro copies) in the following text, because the mentioned parts of LTRs are completed during retroviral reverse transcription. This retroviral group contains 77 proviral copies with complete or partial internal (non-LTR) sequences, and 343 soloLTRs without any internal sequences (Table 1).
Table 1.
Group | Number | Complete | 5′ truncation^a | 3′ truncation^a | 5′3′ truncation^a |
---|---|---|---|---|---|
poly(A) | 176 | 46 | 107 | 23 | 0 |
retro (internal sequence–containing) | 77 | 29 | 15 | 24 | 9 |
retro (soloLTR) | 343 | 306 | 20 | 13 | 4 |
undef | 58 | 0 | 0 | 0 | 58 |
^a Truncations were defined according to alignments with a particular consensus sequence for each group. The terminal gap was counted if the start or the end of the element was more distant than 20 bp from the consensus termini.
The members of the second group are defined by incomplete LTRs that start close to the beginning of the R region of 5′ LTR (from nucleotide 256 of the consensus sequence, see Methods) and end at the 3′ part of the 3′ LTR R region (up to nucleotide 9732 of the consensus). The second characteristic feature is the frequent presence of the poly(A) tail of variable length and, therefore, we term these HERV-W sequences as the poly(A) group [or poly(A) copies] in the following text. The lengths of the tails vary from a few nucleotides to >50 bp. The poly(A) tail starts at position 9732, just 12 nucleotides downstream of the poly(A) signal ATTAAA, in the 3′ LTR (positions 9714–9719). In the poly(A) tails, we often detected A-rich microsatellites of up to 10 repeat units. The poly(A) tails end, in general, with the 3′ direct repeat (Fig. 2). We have found 176 members of the poly(A) group (Table 1).
Aside from these two groups, there are 58 truncated elements, which, due to the absence of diagnostic parts, cannot be unambiguously defined and are referred to here as the undef group (Table 1).
Analysis of terminal deletions shows further intergroup differences. Whereas the poly(A) group is, in the great majority of cases, truncated at the 5′ end, retro copies and soloLTRs are nearly equally truncated at both ends (Table 1). In addition, soloLTRs are mostly complete, but the poly(A) and the retro copies of HERV-Ws are generally shortened (Table 1).
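The truncation classes of Table 1 follow from a simple rule: a terminal gap is counted when an element starts or ends more than 20 bp away from the consensus terminus. The following Python fragment is a minimal sketch of that rule, not the authors' actual script. The function name, its argument order, and the use of 1-based consensus coordinates are assumptions made for illustration; the poly(A) termini 256 and 9732 are taken from the text.

```python
def classify_truncation(start, end, cons_start, cons_end, tolerance=20):
    """Classify an aligned element by the consensus termini it fails to reach.

    start, end           -- consensus coordinates covered by the element
    cons_start, cons_end -- termini of the group consensus
    tolerance            -- a terminus counts as missing only if the element
                            stops more than this many bp away from it
    """
    missing_5 = (start - cons_start) > tolerance
    missing_3 = (cons_end - end) > tolerance
    if missing_5 and missing_3:
        return "5'3' truncation"
    if missing_5:
        return "5' truncation"
    if missing_3:
        return "3' truncation"
    return "complete"


# Hypothetical poly(A) elements measured against consensus positions 256-9732.
examples = [(256, 9732), (300, 9732), (270, 9700), (1000, 9000)]
labels = [classify_truncation(s, e, 256, 9732) for s, e in examples]
```

An element that starts 14 bp inside the consensus (position 270) is still treated as complete at its 5′ end, whereas one starting 44 bp inside (position 300) is counted as 5′ truncated.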
Interestingly, in addition to the poly(A) tract, HERV-W genomic copies display further oligonucleotide expansion located in the leader region (positions 1484–1603 at the consensus sequence). Sometimes, arrays of several hundreds of basepairs are found, composed from mainly AG-rich di- or oligonucleotides.
Characterization of Insertion Sites of HERV-W Poly(A) and Retro Copies
To provide better insight into the transposition processes of HERV-W copies, we have analyzed their insertion sites. From the retro copies, we first extracted a subset of 172 complete elements and complete soloLTRs with untruncated 5′ and 3′ ends. For 82 of them, we found 4-bp direct repeats in 5′ and 3′-flanking sequences, just adjacent to the border of proviruses. On the other hand, in 116 of 176 elements in the poly(A) group, we detected long direct repeats of variable length up to 21 bp, with a mean of 9.51 bp.
Nucleotide frequencies in insertion sites of both groups are shown in Table 2. Three nucleotides upstream and five downstream of the start of each direct repeat were analyzed. For retro copies, there is a weak bias in the nucleotide frequencies toward an ATC|ATA(G/T) sequence, in which | denotes the start of the direct repeat (Table 2a). The fifth downstream nucleotide is frequently T; because the general length of the HERV-W insertion TSD is 4 bp, this position already corresponds to the first nucleotide of the HERV-W LTR, which is T in the consensus.
Table 2.
(a) HERV-W retro copies

Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
All | 82 | 82 | 82 | 82 | 82 | 82 | 82 | 82 |
A | *40* | 10 | 23 | *34* | 14 | *41* | 6 | 2 |
C | 16 | 27 | *31* | 20 | 13 | 1 | 17 | 9 |
G | 15 | 6 | 26 | 20 | 8 | 20 | *30* | 6 |
T | 11 | *39* | 2 | 8 | *47* | 20 | *29* | *65* |

(b) HERV-W pseudogene poly(A) copies

Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
All | 116 | 116 | 116 | 116 | 116 | 116 | 116 | 116 |
A | *54* | *49* | 30 | *72* | *73* | *77* | *71* | *67* |
C | 7 | 11 | 21 | 9 | 7 | 6 | 9 | 10 |
G | 11 | 14 | 16 | 16 | 15 | 14 | 19 | 12 |
T | *43* | *41* | *49* | 19 | 21 | 19 | 17 | 27 |
N | 1 | 1 | | | | | | |

Positions 1–3 lie upstream and positions 4–8 downstream of the start of the direct repeat.
The highest nucleotide frequencies defining the consensus insertion site motif are in italics.
In comparison with the retro group, poly(A) copies have a strong bias in the insertion site motif, namely toward the (A/T)(A/T)T|AAAAA octanucleotide (Table 2b).
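How a consensus motif is read off Table 2 can be illustrated with a short Python sketch. It is an illustration only, not the authors' procedure: the counts are copied from Table 2, whereas the function name and the rule of reporting just the single most frequent base per position are assumptions. The rule deliberately ignores near ties, such as G (30) versus T (29) at position 7 of the retro copies, which a motif written with ambiguity codes captures better.

```python
# Nucleotide counts from Table 2; positions 1-3 are upstream and
# positions 4-8 downstream of the start of the direct repeat.
RETRO_COUNTS = {
    "A": [40, 10, 23, 34, 14, 41, 6, 2],
    "C": [16, 27, 31, 20, 13, 1, 17, 9],
    "G": [15, 6, 26, 20, 8, 20, 30, 6],
    "T": [11, 39, 2, 8, 47, 20, 29, 65],
}
POLYA_COUNTS = {
    "A": [54, 49, 30, 72, 73, 77, 71, 67],
    "C": [7, 11, 21, 9, 7, 6, 9, 10],
    "G": [11, 14, 16, 16, 15, 14, 19, 12],
    "T": [43, 41, 49, 19, 21, 19, 17, 27],
}


def top_base_motif(counts, upstream=3):
    """Return the most frequent base per position, with '|' at the repeat start."""
    n_positions = len(counts["A"])
    bases = [
        max("ACGT", key=lambda base: counts[base][i])
        for i in range(n_positions)
    ]
    return "".join(bases[:upstream]) + "|" + "".join(bases[upstream:])


retro_motif = top_base_motif(RETRO_COUNTS)
polya_motif = top_base_motif(POLYA_COUNTS)

# Fraction of A at the five downstream positions of the poly(A) copies.
polya_a_fraction = [round(a / 116, 2) for a in POLYA_COUNTS["A"][3:]]
```

In the poly(A) group, A accounts for well over half of the nucleotides at each of the five downstream positions, whereas no downstream position of the retro copies is dominated by A to that extent.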
Phylogenetic Analysis
Two parts of the retroviral genome were selected for the phylogenetic analysis: the gag region (consensus sequence nucleotides 2718–3923) and a 5′ part of the pol region (consensus sequence nucleotides 4450–6789; see Methods). Both phylogenetic trees (Fig. 3A and Supplementary Figure 2Sa, available online at http://genome.org) show a complex pattern of HERV-W copies, in which poly(A) elements are dispersed within the retro copies. Due to a high similarity at non-CpG sites, there are few polymorphic sites, and thus the reliability of the obtained topology is weak (see low bootstrap values in Fig. 3A, Supplementary Figure 2Sa). When only a small subset of elements was selected, the resulting tree (Fig. 3B) shows a better supported branching pattern, in which, again, poly(A) elements are dispersed within retroviral copies.
Genomic Distribution
The genomic distribution of HERV-W elements was calculated for 100-kb nonoverlapping segments of the genome (see Methods). Figure 4A shows the HERV-W frequency in genomic segments classified according to four ranges of the GC content. Both retro and poly(A) copies show distribution biased toward the GC-poor part of the genome. Especially, soloLTRs have a strong tendency to localize in GC-poor regions.
The intrachromosomal distribution of HERV-W (Fig. 4B) shows several interesting features. First, the intrachromosomal distribution varies among all poly(A), retro, and soloLTR groups. Poly(A) elements are over-represented on chromosomes 3, 6, 19, and X and less abundant on chromosomes 16, 17, 21, 22, and Y. Retro copies are more frequent on chromosomes 4, 5, 7, 13, X, and especially chromosome Y, where both internal sequence-containing copies and soloLTRs are clearly over-represented. On the other hand, retroviral copies are under-represented on smaller chromosomes 16, 17, 19, 20, and 22. Generally, HERV-W elements show biased distribution toward chromosomes 3, 4, X, and particularly Y, and are less frequent on chromosomes 16, 17, 20, and 22, in agreement with experimental results (Voisset et al. 2000).
Comparison of the Length and the GC Level of HERV-W Elements
We selected 46 complete poly(A) copies and 29 complete retro copies for comparison of the length and the GC level. From all copies, only positions 256–9732 of the consensus, corresponding to the full-length poly(A) sequence without the poly(A) tail, were compared. To eliminate the influence of various insertions and satellite expansions, only sequences homologous to the consensus were used; sequences were aligned to the consensus and nonhomologous parts were neglected as in Figure 1. No striking differences were found in the GC content, poly(A) sequences being slightly GC poorer than the retro copies (45.8% compared with 47.7%). The length was expressed as the percentage of gap positions among all available positions of the consensus (100 × gap length/total length). These numbers were used to compare splicing variation. Because the majority of internal deletions correspond to several splice variants (visible in Supplementary Figure 1S, available online at http://www.genome.org), these numbers estimate the percentage of sequence missing due to splicing. On average, we obtained 41.4% of gaps in comparison with the consensus for the retro group and 37.6% for the poly(A) group.
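The two quantities compared here, the gap percentage and the GC level, can be computed from an element aligned to the consensus as sketched below in Python. This is a minimal sketch under stated assumptions: the element is given as a string of the same length as the compared consensus region, insertions relative to the consensus have already been removed, and gaps are written as "-". The function names are hypothetical.

```python
def gap_percentage(aligned_element, gap_char="-"):
    """Return 100 x gap positions / total consensus positions."""
    if not aligned_element:
        raise ValueError("empty alignment row")
    return 100.0 * aligned_element.count(gap_char) / len(aligned_element)


def gc_content(aligned_element):
    """Return the GC percentage over the A, C, G, and T positions only."""
    bases = [b for b in aligned_element.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("no nucleotides in alignment row")
    gc = sum(1 for b in bases if b in "GC")
    return 100.0 * gc / len(bases)


# Toy alignment row: 8 consensus positions, 4 of them missing in the element.
row = "AC--GT--"
row_gaps = gap_percentage(row)   # half of the consensus positions are gaps
row_gc = gc_content(row)         # C and G among A, C, G, T
```

Because gaps and ambiguous symbols are excluded from the GC calculation, an element with a large spliced-out region is compared only over the positions it actually contains.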
DISCUSSION
Processed Pseudogenes of HERV-Ws Display the Features of LINE-Mediated Transposition
Our analysis of HERV-W copies in the human genome revealed unusual endogenous retroviral structures. We found elements colinear with retroviral mRNAs followed by poly(A) tails, the poly(A) copies, resembling processed pseudogenes of cellular genes (Vanin 1985; Weiner et al. 1986). On the basis of the lack of complete LTRs, which regenerate during normal retroviral reverse transcription, we suggest that these poly(A) copies arose by virtue of heterologous reverse transcriptase, most probably through the LINE-mediated reverse transcription.
The majority of poly(A) copies is surrounded by direct repeats of variable length (6–21 bp, mean 9.51 bp), characteristic for insertions of LINEs, Alus, and processed pseudogenes, which are believed to have arisen through double enzymatic nicking of host DNA by the LINE endonuclease (Feng et al. 1996; Jurka 1997; Cost and Boeke 1998). In contrast, standard members of the HERV-W family, the retro copies, have canonically short 4-bp direct repeats as already described (Repbase Update; Blond et al. 1999). Analysis of insertion sites of poly(A) (Table 2) elements showed a strong preference for (A/T)(A/T)T|AAAAA sequences, strongly resembling the preferential TT|AAAA cutting motif of the LINE endonuclease found in LINEs, Alus, and processed pseudogenes (Feng et al. 1996; Jurka 1997; Cost and Boeke 1998; Wei et al. 2001). Poly(A) tails of poly(A) elements often contain microsatellites. Similar expansions were described for Alus, in which oligo-dA-rich tails have been shown to serve as nuclei for the genesis of simple repeats (Arcot et al. 1995).
Frequent 5′ truncations in the poly(A) group (Fig. 1; Table 1), also described in LINEs and processed pseudogenes (Voliva et al. 1983; Vanin 1985; Weiner et al. 1986; Smit 1999), the presence of spliced variants, the mRNA-derived structure, the poly(A) tail, and the insertion-site characteristics mentioned above altogether indicate that the LINE enzymatic machinery generated the pseudogene copies of the HERV-W family [poly(A) group]. The recent experimental demonstration of processed pseudogenes generated by LINE-encoded enzymes (Esnault et al. 2000; Wei et al. 2001) strengthens our conclusion that the poly(A) group is generated by nonretroviral, LINE-mediated mobilization. The complex pattern on phylogenetic trees, in which poly(A) copies are dispersed within retroviral copies (Fig. 3A,B), suggests multiple and independent origins of poly(A) copies.
Distribution and Stability of HERV-W Copies Depends on the Mechanism of Transposition
The genome-wide distribution of HERV-W poly(A) copies is biased toward a GC-poorer part of the genome than that of retro copies (Fig. 4A). This bias is typical for LINEs, young Alus, and processed pseudogenes, and is probably linked to the AT-rich insertion motif of these elements (International Human Genome Sequencing Consortium 2001; Pavlíček et al. 2001). Similarly, for the retro group, we also found a weak bias toward AT-rich insertion sites, and the genomic distribution is also biased toward the GC-poor isochores.
In contrast to LINEs, there is no clear general insertion motif for retroviruses. Both retrovirus and LINE integration are influenced by DNA bending (Pryciak and Varmus 1992; Muller and Varmus 1994; Jurka et al. 1998), the nucleosome structure at the site of integration (Pryciak and Varmus 1992; Cost et al. 2001), and the state of chromatin condensation (Leib-Mösch et al. 1993; Rynditch et al. 1998). The only sequence-specific insertion motif reported so far is the palindromic pentanucleotide GT(A/T)AC found in nonautonomous LTR retrotransposons MaLR (mammalian apparent LTR retrotransposon) (Smit 1996b) and HIV-1 integration (Stevens and Griffith 1996; Carteau et al. 1998). From the data in Table 2, we derived a weak preferential motif ATC|ATA(G/T) and a strong negative motif (not T)(not G)(not C)(not A). The described preferential motif is similar to, but not the same as, the CA(A/T)TG pentanucleotide, the complementary version of the MaLR and HIV-1 insertion motif. The studied MaLR and HIV-1 insertion sequences differed in the length of the TSD (5 bp instead of the 4 bp for HERV-W integration), suggesting that, in contrast to the high evolutionary conservation of the integrase region in retroviruses, insertions could differ both in TSD length and in the insertion motif. Although the insertion motif obviously is not the only factor determining retrotransposon integration, the AT-rich LINE insertion motif seems to be responsible for the GC-poor bias in the processed pseudogene distribution.
The length of homology at 5′ and 3′ ends of poly(A) and retroviral elements is the determining factor for the element stability. The majority of retroviral sequences in the human genome are soloLTRs (Jurka 1998), which arose by homologous recombination between 5′ and 3′ LTR with excision of the central part and one LTR, leaving only a single LTR (Mager and Goodchild 1989). In our data set, most retro copies are soloLTRs (343 of 420 elements, Table 1). On the other hand, in the poly(A) group, only two elements begin close to the 5′ start of the 3′ LTR and there is no abundance of elements starting near the 5′ end of the 3′ LTR (Supplementary Figure 1S and Supplementary Table 1S, available online at http://www.genome.org). Thus, apparently, there are no, or very few, soloLTRs that originated by homologous recombination in the poly(A) group. This observation is consistent with the requirement for the minimum length of homology needed for efficient intra- and extrachromosomal recombination between closely linked homologous sequences, which ranges from 163 to 295 bp (Rubnitz and Subramani 1984; Liskay et al. 1987; Waldman and Liskay 1988). Retro copies have two complete LTRs, providing them with 780 bp of perfect homology at both ends, which is sufficient for effective homologous recombination. Poly(A) copies, due to the absence of retroviral reverse transcription, lack complete LTRs, and their 5′ and 3′ ends share just the R region, yielding 71 bp of homology at both ends, clearly below the limit for efficient homologous recombination. In accordance with their different propensity to recombination, both groups differ in their pattern of chromosomal distribution (Fig. 4B). HERV-W retro copies are over-represented on chromosomes 3, 4, X, and particularly Y in comparison with the rest of the genome. In contrast, poly(A) copies show no bias toward Y. Because of the limited recombination, chromosome Y is known to possess a high concentration of HERVs (Kjellman et al. 1995).
The increased density of LTR-retrotransposons on chromosome X, and even higher on Y, is mostly due to the larger ratio of complete elements over solitary LTRs (Smit 1999). Retro copies with internal sequences represent 28.6% (4/14) of all HERV-W retroviral copies on chromosome Y, compared with only 18% (73/406) in the rest of the genome. The limited meiotic recombination seems to be the causative factor for both the concentration of retro copies and the increased proportion of retro copies containing the internal sequences on chromosome Y. It is noteworthy, however, that the frequent presence of soloLTRs on chromosome Y implies a possible role of some intrachromosomal (mitotic) recombination mechanisms such as intrachromatid single-strand annealing, reciprocal exchange, replication slippage, or abortive reciprocal recombination (for review, see Klein 1995; Lambert et al. 1999).
In conclusion, retroviral and LINE transposition differ in the integration, distribution, and stability of the inserted copies (Fig. 5). Retroviral reverse transcription produces potentially expression- and replication-competent copies, which are unstable due to homologous LTR–LTR recombination and are frequently excluded during evolution. On the other hand, LINEs often produce 5′ truncated copies, unable to transcribe and further mobilize due to the loss of promoter sequences in the U3 region of the 5′ LTR. These copies have, however, a better chance to keep their original structure during evolution. Retro copies represent 64.2% (420/654) of all HERV-W elements, whereas poly(A) copies are just 26.9% (176/654). On the other hand, poly(A) copies account for 52.5% (149/284) of elements containing internal sequences, whereas the proportion of retro copies is only 27.1% (77/284). This implies that the majority of HERV-W gag, pol, and env-coding regions experimentally detected on human chromosomes (Voisset et al. 2000) are actually pseudogenes, probably nontranscribed promoterless copies, and the current search for possible causative agents of several human diseases (Perron et al. 1989, 1997; Gaudin et al. 1997, 2000; Komurian-Pradel et al. 1999; Kim and Crow 1999) could concentrate on a small subset of transcriptionally competent HERV-W elements.
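The proportions quoted above can be rechecked from the counts in Table 1 and in the text. The short Python sketch below does only this arithmetic. The variable names are hypothetical, and the reading that the 284 internal sequence-containing elements consist of 149 poly(A), 77 retro, and 58 undef copies is an inference from the fact that these numbers sum to 284, not a statement made in the text.

```python
# Counts taken from Table 1 and from the text.
POLYA_ALL = 176        # all poly(A) copies
RETRO_INTERNAL = 77    # retro copies containing internal sequences
RETRO_SOLO = 343       # soloLTRs
UNDEF = 58             # elements lacking diagnostic parts
POLYA_INTERNAL = 149   # poly(A) copies containing internal sequences (from the text)


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


total = POLYA_ALL + RETRO_INTERNAL + RETRO_SOLO + UNDEF      # all HERV-W elements
retro_all = RETRO_INTERNAL + RETRO_SOLO                      # all retro copies
# Assumption: undef elements are counted among those with internal sequences.
internal_all = POLYA_INTERNAL + RETRO_INTERNAL + UNDEF

shares = {
    "retro / all": percent(retro_all, total),
    "poly(A) / all": percent(POLYA_ALL, total),
    "poly(A) / internal": percent(POLYA_INTERNAL, internal_all),
    "retro / internal": percent(RETRO_INTERNAL, internal_all),
}
```

The contrast between the two denominators carries the argument: poly(A) copies are a minority of all HERV-W elements but about half of those that still contain gag, pol, or env sequences.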
Remarks on the Processed Pseudogene Formation
The LINE machinery is highly effective in the transposition of its own copies (Dombroski et al. 1991; Moran et al. 1996). The estimated frequency of cellular mRNA mobilization is only 0.01%–0.05% per LINE transposition (Wei et al. 2001). We found numerous processed pseudogenes within the HERV-W family, but not in the closely related HERV9 family. In the HERV database (Pačes et al. 2002), we have made a preliminary analysis on the basis of the presence of poly(A) tails and TSDs and we estimate that the HERV9 family has more than seven times fewer pseudogenes than the HERV-W family. In the class 1 HERVs, we have not found any retroviral family with >5% of processed pseudogenes (data not shown). In a direct in vitro test, the frequency of Moloney murine leukemia virus-based vector transposition was just 10⁻⁸–10⁻⁶ per cell per generation (Tchènio et al. 1993). Similarly, genomic library screening for the HERV-H family indicated very few (<1%) copies with the structure of processed pseudogenes (Goodchild et al. 1995). Hence, LINE-mediated transposition of the retroviral mRNA seems to be rather rare, with the exception of HERV-W mRNA.
What drives the selectivity of capturing the non-LINE mRNA by LINE-encoded enzymes? Only transpositions in the germ line, primordial germ line, or early embryonal cells have a chance to proceed into the next generation and eventually to be fixed. Both LINEs and Alus are known to be expressed in testicular tissues and they could be potentially transposed into germ-line cells (Branciforte and Martin 1994; Schmid 1998). It is noteworthy that expression of transcripts homologous to the syncytin gene has been detected not only in placenta, but also in testis (Mi et al. 2000); it is therefore conceivable that coexpression of LINEs and HERV-W elements facilitates the generation of processed pseudogenes in the HERV-W family. On the other hand, transcription up-regulation in testis is not limited to LINEs, Alus, and HERV-Ws and could serve as an engine for generation of processed pseudogenes from various mRNAs (Schmidt 1996).
Virtually all types of mRNA are capable of retrotransposition (Brosius 1999), but the most efficient are genes connected with the translation machinery and ribosomes (Venter et al. 2001). A genome-wide analysis of processed pseudogenes suggested that the processed pseudogenes are preferentially generated from short, GC-poor RNAs (Goncalves et al. 2000). The complex splice pattern of the HERV-W family provides necessary GC and length variation and a good opportunity to test these predictions and compare the affinity of LINE and retroviral enzymes for various mRNAs. In our analysis, we have not found any difference between the retroviral and LINE replication. Also, in the analysis of the human genome sequence, no bias has been found in the GC content of genes generating processed pseudogenes (Venter et al. 2001).
METHODS
Identification and Localization of HERV-W Sequences in the Human Genome
The RepeatMasker program (Smit and Green RepeatMasker at http://repeatmasker.genome.washington.edu) was used to identify HERVs in the GoldenPath assembly of 87% of the human genome (Haussler et al. Human Genome Working Draft at http://genome.ucsc.edu/). The fragments of LTRs, as well as internal retroviral sequences of the HERV-W (HERV17) family and the related HERV9 family, were included in the following analysis. All selected retroviral fragments were compared against family profiles of HERV-W and HERV9 families using the HMMER package (HMMER v. 2.1.1; Eddy 1998; http://hmmer.wustl.edu/) to eliminate errors due to misidentifications of closely related families. Only elements with better matches to the HERV-W profile than to that of the HERV9 profile were used further as real HERV-W elements. Precise positions of starts and ends of elements were determined by recursive alignment with the Repbase Update consensus of the HERV family using the Dialign2 program (Dialign v. 2.0, Morgenstern et al. 1996).
Analysis of the Genomic Structure
All elements were aligned using the Dialign2 program and the multiple alignment was manually checked for errors in the Seaview editor (Galtier et al. 1996). Computing of a large alignment was approximated by aligning each element with a consensus sequence obtained from the Repbase Update, followed by assembly of pairwise alignments into multiple alignments. As a consensus, we used sequences of LTR17 and HERV17 consensus, joined in the order LTR17, HERV17, LTR17 into one sequence. In 3′ LTR, we have introduced a 50-bp insertion of a poly(A) stretch in the place where the poly(A) tail of retroviral RNAs is located (from position 327 of the LTR17 consensus).
Characterization of Insertion Sites of HERV-W Copies
We have identified insertion sites of all copies, including the TSD and the insertion motif. First, the sequences were scanned for an exact match of 6 bp or longer at both ends of HERV-W elements in a region from 30 bp outside to 5 bp inside. Complete retroviral copies as well as complete soloLTRs were scanned for direct repeats of 4 bp or longer at both ends of the element.
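The direct-repeat scan described above can be sketched in Python as a search for the longest exact match shared by the two windows around the element ends. This is an illustrative sketch, not the authors' program: the function name, the 0-based end-exclusive coordinates, and the choice to return the longest shared substring are assumptions, while the 6-bp minimum and the window from 30 bp outside to 5 bp inside come from the text.

```python
def find_tsd(sequence, start, end, min_len=6, outside=30, inside=5):
    """Return the longest exact repeat shared by both element ends, or None.

    sequence   -- genomic sequence containing the element and its flanks
    start, end -- element coordinates (0-based, end-exclusive)
    The 5' window spans `outside` bp upstream to `inside` bp into the element;
    the 3' window spans `inside` bp before the element end to `outside` bp after.
    """
    left = sequence[max(0, start - outside):start + inside]
    right = sequence[max(0, end - inside):end + outside]
    best = ""
    for i in range(len(left)):
        # Only try candidates that are longer than the best match so far.
        first_j = i + max(min_len, len(best) + 1)
        for j in range(first_j, len(left) + 1):
            candidate = left[i:j]
            if candidate in right:
                best = candidate
            else:
                break  # a longer candidate from i cannot match either
    return best or None


# Hypothetical insertion: a 10-bp target site duplicated on both sides
# of a 30-bp element that ends with an A-rich tail.
tsd = "TTAAAAGTCA"
element = "TGAGAGACAGGACTAGCTGG" + "A" * 10
genome = "CCGCCTCCGC" + tsd + element + tsd + "GGTGGAGGTG"
found = find_tsd(genome, start=20, end=50)
```

For complete retro copies and soloLTRs, the same function could be called with `min_len=4`, which corresponds to the 4-bp threshold used for that group.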
Phylogenetic Analysis
For the phylogenetic analysis, we have selected two regions in the alignment, the gag region from 2718 to 3923 bp and a proximal part of the pol region, until the major env splicing variant (nucleotides 4450–6789). Then, we excluded all incomplete and many gap-containing elements. We obtained 36 elements in the gag region and 46 sequences in the pol alignment. All CpG and gap-containing positions were excluded. The topology was obtained by neighbor joining (Tajima-Nei distance), maximum parsimony, and maximum likelihood methods implemented in the phylo_win program (Galtier et al. 1996).
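The exclusion of CpG and gap-containing positions can be illustrated with the following Python sketch. It is not the procedure implemented in phylo_win; it assumes that a column is dropped when any sequence has a gap there, and that both columns of a CG dinucleotide are dropped when any sequence carries C followed directly by G in adjacent alignment columns. Function names are hypothetical.

```python
def informative_columns(alignment, gap_char="-"):
    """Return indices of columns free of gaps and of CpG dinucleotides."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    seqs = [seq.upper() for seq in alignment]
    excluded = set()
    # Drop every column in which at least one sequence has a gap.
    for col in range(length):
        if any(seq[col] == gap_char for seq in seqs):
            excluded.add(col)
    # Drop both columns of every CG dinucleotide found in any sequence.
    for seq in seqs:
        for col in range(length - 1):
            if seq[col] == "C" and seq[col + 1] == "G":
                excluded.update((col, col + 1))
    return [col for col in range(length) if col not in excluded]


def filter_alignment(alignment):
    """Keep only the informative columns of each aligned sequence."""
    keep = informative_columns(alignment)
    return ["".join(seq[col] for col in keep) for seq in alignment]


# Toy alignment: columns 1-2 hold a CpG in the first sequence,
# column 5 holds a gap in the first sequence.
toy = ["ACGTA-T", "ATGTAAT", "GCATAAC"]
filtered = filter_alignment(toy)
```

Removing CpG positions in all sequences, rather than only in the sequence where the dinucleotide occurs, keeps the filtered alignment rectangular so that it can be passed directly to a distance or parsimony method.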
Genomic Distribution of HERV-W Copies
Retrotransposon densities were calculated in 100-kb long, nonoverlapping fragments from the GoldenPath assembly (http://genome.ucsc.edu). The densities were calculated for the isochore families as given by Zoubak et al. (1996). Isochore family intervals were as follows: <37%, 37%–41%, 41%–46%, and >46% content of GC for L1, L2, H1, and H2+H3 families, respectively.
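The window-based classification can be sketched in Python as follows. This is a minimal sketch rather than the pipeline actually used: function names are hypothetical, the interval boundaries are treated as half-open (a window with exactly 37%, 41%, or 46% GC falls into the higher family), and unsequenced positions written as N are ignored when computing the GC content. These choices are assumptions, since the text gives only the interval limits.

```python
def gc_percent(window):
    """GC percentage over A, C, G, T positions; None if the window has none."""
    w = window.upper()
    acgt = sum(w.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return 100.0 * (w.count("G") + w.count("C")) / acgt


def isochore_family(gc):
    """Assign a GC percentage to an isochore family (limits 37%, 41%, 46%)."""
    if gc < 37.0:
        return "L1"
    if gc < 41.0:
        return "L2"
    if gc < 46.0:
        return "H1"
    return "H2+H3"


def windows(sequence, size=100_000):
    """Yield consecutive nonoverlapping windows; a shorter tail is skipped."""
    for i in range(0, len(sequence) - size + 1, size):
        yield sequence[i:i + size]


# Toy example with 10-bp windows instead of 100 kb.
toy_sequence = "ATATATATAT" + "GGCCAATTNN" + "ATG"
families = [isochore_family(gc_percent(w)) for w in windows(toy_sequence, size=10)]
```

Element densities per isochore family then follow by counting the elements whose coordinates fall into the windows of each family and dividing by the total length of those windows.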
WEB SITE REFERENCES
http://genome.ucsc.edu/; Haussler et al. Human Genome Working Draft.
http://herv.img.cas.cz; the authors' work on the Human Endogenous Retrovirus database.
http://hmmer.wustl.edu/; HMMER v. 2.1.1.
http://repeatmasker.genome.washington.edu; Smit and Green RepeatMasker.
http://www.girinst.org/; Repbase Update database.
http://www.med.upenn.edu/genetics/labs/retrotrans_table.html; The database of Retrotransposon Insertion into the Human Genome.
Acknowledgments
We thank Oliver Clay, Jan Svoboda, and Giorgio Bernardi for critical reading of the manuscript. This work was supported by grant No. 204/01/0632 of the Grant Agency of the Czech Republic to J.H. A.P. is supported in part by a PhD fellowship of the French Government program Doctorat en cotutelle.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
Footnotes
E-mail hejnar@img.cas.cz; Fax 420-2-24-310-955.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.216902. Article published online before print in February 2002.
REFERENCES
- Arcot SS, Wang Z, Weber JL, Deininger PL, Batzer MA. Alu repeats: A source for the genesis of primate microsatellites. Genomics. 1995;29:136–144. doi: 10.1006/geno.1995.1224.
- Blond JL, Beseme F, Duret L, Bouton O, Bedin F, Perron H, Mandrand B, Mallet F. Molecular characterization and placental expression of HERV-W, a new human endogenous retrovirus family. J Virol. 1999;73:1175–1185. doi: 10.1128/jvi.73.2.1175-1185.1999.
- Blond JL, Lavillette D, Cheynet V, Bouton O, Oriol G, Chapel-Fernandes S, Mandrand B, Mallet F, Cosset FL. An envelope glycoprotein of the human endogenous retrovirus HERV-W is expressed in the human placenta and fuses cells expressing the type D mammalian retrovirus receptor. J Virol. 2000;74:3321–3329. doi: 10.1128/jvi.74.7.3321-3329.2000.
- Branciforte D, Martin SL. Developmental and cell type specificity of LINE-1 expression in mouse testis: Implications for transposition. Mol Cell Biol. 1994;14:2584–2592. doi: 10.1128/mcb.14.4.2584.
- Brosius J. RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements. Gene. 1999;238:115–134. doi: 10.1016/s0378-1119(99)00227-9.
- Carteau S, Hoffmann C, Bushman F. Chromosome structure and human immunodeficiency virus type 1 cDNA integration: Centromeric alphoid repeats are a disfavored target. J Virol. 1998;72:4005–4014. doi: 10.1128/jvi.72.5.4005-4014.1998.
- Cost GJ, Boeke JD. Targeting of human retrotransposon integration is directed by the specificity of the L1 endonuclease for regions of unusual DNA structure. Biochemistry. 1998;37:18081–18093. doi: 10.1021/bi981858s.
- Cost GJ, Golding A, Schlissel MS, Boeke JD. Target DNA chromatinization modulates nicking by L1 endonuclease. Nucleic Acids Res. 2001;29:573–577. doi: 10.1093/nar/29.2.573.
- Derr LK, Strathern JN, Garfinkel DJ. RNA-mediated recombination in S. cerevisiae. Cell. 1991;67:355–364. doi: 10.1016/0092-8674(91)90187-4.
- Dhellin O, Maestre J, Heidmann T. Functional differences between the human LINE retrotransposon and retroviral reverse transcriptases for in vivo mRNA reverse transcription. EMBO J. 1997;16:6590–6602. doi: 10.1093/emboj/16.21.6590.
- Dombroski BA, Mathias SL, Nanthakumar E, Scott AF, Kazazian HH., Jr Isolation of an active human transposable element. Science. 1991;254:1805–1808. doi: 10.1126/science.1662412.
- Dornburg R, Temin HM. Presence of a retroviral encapsidation sequence in nonretroviral RNA increases the efficiency of formation of cDNA genes. J Virol. 1990;64:886–889. doi: 10.1128/jvi.64.2.886-889.1990.
- Dunham I, Shimizu N, Roe BA, Chissoe S, Hunt AR, Collins JE, Bruskiewich R, Beare DM, Clamp M, Smink JL, et al. The DNA sequence of human chromosome 22. Nature. 1999;402:489–495. doi: 10.1038/990031.
- Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14:755–763. doi: 10.1093/bioinformatics/14.9.755.
- Esnault C, Maestre J, Heidmann T. Human LINE retrotransposons generate processed pseudogenes. Nature Genet. 2000;24:363–367. doi: 10.1038/74184.
- Feng Q, Moran JV, Kazazian HH, Jr, Boeke JD. Human L1 retrotransposon encodes a conserved endonuclease required for retrotransposition. Cell. 1996;87:905–916. doi: 10.1016/s0092-8674(00)81997-2.
- Galtier N, Gouy M, Gautier C. SeaView and Phylo_win, two graphic tools for sequence alignment and molecular phylogeny. Comput Applic Biosci. 1996;12:543–548. doi: 10.1093/bioinformatics/12.6.543.
- Gaudin P, Perron H, Favre G, Mandrand B, Juvin R, Marcel F, Beseme F, Bedin F, Mallet F, Mougin B, et al. Detection of retrovirus RNA in plasma from rheumatoid arthritis. Arthritis Rheum. 1997;40:S245.
- Gaudin P, Ijaz S, Tuke PW, Marcel F, Paraz A, Seigneurin JM, Mandrand B, Perron H, Garson JA. Infrequency of detection of particle-associated MSRV/HERV-W RNA in the synovial fluid of patients with rheumatoid arthritis. Rheumatology. 2000;39:950–954. doi: 10.1093/rheumatology/39.9.950.
- Goncalves I, Duret L, Mouchiroud D. Nature and structure of human genes that generate retropseudogenes. Genome Res. 2000;10:672–678. doi: 10.1101/gr.10.5.672.
- Goodchild NL, Freeman JD, Mager DL. Spliced HERV-H endogenous retroviral sequences in human genomic DNA: Evidence for amplification via retrotransposition. Virology. 1995;206:164–173. doi: 10.1016/s0042-6822(95)80031-x.
- Goodier JL, Ostertag EM, Kazazian HH., Jr Transduction of 3′-flanking sequences is common in L1 retrotransposition. Hum Mol Genet. 2000;9:653–657. doi: 10.1093/hmg/9.4.653.
- International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. doi: 10.1038/35057062.
- Jurka J. Sequence patterns indicate an enzymatic involvement in integration of mammalian retroposons. Proc Natl Acad Sci. 1997;94:1872–1877. doi: 10.1073/pnas.94.5.1872.
- ————— Repeats in genomic DNA: Mining and meaning. Curr Opin Struct Biol. 1998;8:333–337. doi: 10.1016/s0959-440x(98)80067-5.
- ————— Repbase update: A database and an electronic journal of repetitive elements. Trends Genet. 2000;16:418–420. doi: 10.1016/s0168-9525(00)02093-x.
- Jurka J, Klonowski P, Trifonov EN. Mammalian retroposons integrate at kinkable DNA sites. J Biomol Struct Dyn. 1998;15:717–721. doi: 10.1080/07391102.1998.10508987.
- Karlsson H, Bachmann S, Schroder J, McArthur J, Torrey EF, Yolken RH. Retroviral RNA identified in the cerebrospinal fluids and brains of individuals with schizophrenia. Proc Natl Acad Sci. 2001;98:4634–4639. doi: 10.1073/pnas.061021998.
- Kazazian HH., Jr An estimated frequency of endogenous insertional mutations in humans. Nature Genet. 1999;22:130. doi: 10.1038/9638.
- Kazazian HH, Jr, Moran JV. The impact of L1 retrotransposons on the human genome. Nature Genet. 1998;19:19–24. doi: 10.1038/ng0598-19.
- Kim H, Crow TJ. Identification and phylogeny of novel human endogenous retroviral sequences belonging to the HERV-W family on the human X chromosome. Arch Virol. 1999;144:2403–2413. doi: 10.1007/s007050050653.
- Kjellman C, Sjogren HO, Widegren B. The Y chromosome: A graveyard for endogenous retroviruses. Gene. 1995;161:163–170. doi: 10.1016/0378-1119(95)00248-5.
- Klein HL. Genetic control of intrachromosomal recombination. BioEssays. 1995;17:147–159. doi: 10.1002/bies.950170210.
- Komurian-Pradel F, Paranhos-Baccala G, Bedin F, Ounanian-Paraz A, Sodoyer M, Ott C, Rajoharison A, Garcia E, Mallet F, Mandrand B, et al. Molecular cloning and characterization of MSRV-related sequences associated with retrovirus-like particles. Virology. 1999;260:1–9. doi: 10.1006/viro.1999.9792.
- Lambert S, Saintigny Y, Delacote F, Amiot F, Chaput B, Lecomte M, Huck S, Bertrand P, Lopez BS. Analysis of intrachromosomal homologous recombination in mammalian cell, using tandem repeat sequences. Mutat Res. 1999;433:159–168. doi: 10.1016/s0921-8777(99)00004-x.
- Leib-Mösch C, Haltmeier M, Werner T, Geigl EM, Brack-Werner R, Francke U, Erfle V, Hehlmann R. Genomic distribution and transcription of solitary HERV-K LTRs. Genomics. 1993;18:261–269. doi: 10.1006/geno.1993.1464.
- Levine KL, Steiner B, Johnson K, Aronoff R, Quinton TJ, Linial ML. Unusual features of integrated cDNAs generated by infection with genome-free retroviruses. Mol Cell Biol. 1990;10:1891–1900. doi: 10.1128/mcb.10.5.1891.
- Liskay RM, Letsou A, Stachelek JL. Homology requirement for efficient gene conversion between duplicated chromosomal sequences in mammalian cells. Genetics. 1987;115:161–167. doi: 10.1093/genetics/115.1.161.
- Maestre J, Tchènio T, Dhellin O, Heidmann T. mRNA retroposition in human cells: Processed pseudogene formation. EMBO J. 1995;14:6333–6338. doi: 10.1002/j.1460-2075.1995.tb00324.x.
- Mager DL, Goodchild NL. Homologous recombination between the LTRs of a human retrovirus-like element causes a 5-kb deletion in two siblings. Am J Hum Genet. 1989;45:848–854.
- Mi S, Lee X, Li X, Veldman GM, Finnerty H, Racie L, LaVallie E, Tang XY, Edouard P, Howes S, et al. Syncytin is a captive retroviral envelope protein involved in human placental morphogenesis. Nature. 2000;403:785–789. doi: 10.1038/35001608.
- Moran JV, Holmes SE, Naas TP, DeBerardinis RJ, Boeke JD, Kazazian HH., Jr High frequency retrotransposition in cultured mammalian cells. Cell. 1996;87:917–927. doi: 10.1016/s0092-8674(00)81998-4.
- Morgenstern B, Werner T, Dress AWM. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci. 1996;93:12098–12103. doi: 10.1073/pnas.93.22.12098.
- Muller HP, Varmus HE. DNA bending creates favored sites for retroviral integration: An explanation for preferred insertion sites in nucleosomes. EMBO J. 1994;13:4704–4714. doi: 10.1002/j.1460-2075.1994.tb06794.x.
- Pačes J, Pavlíček A, Pačes V. HERVd: Database of human endogenous retroviruses. Nucleic Acids Res. 2002;30:205–206. doi: 10.1093/nar/30.1.205.
- Pavlíček A, Jabbari K, Pačes J, Pačes V, Hejnar J, Bernardi G. Similar integration but different stability of Alus and LINEs in the human genome. Gene. 2001;276:39–45. doi: 10.1016/s0378-1119(01)00645-x.
- Perron H, Geny C, Laurent A, Mouriquand C, Pellat J, Perret J, Seigneurin JM. Leptomeningeal cell line from multiple sclerosis with reverse transcriptase activity and viral particles. Res Virol. 1989;140:551–561. doi: 10.1016/s0923-2516(89)80141-4.
- Perron H, Garson JA, Bedin F, Beseme F, Paranhos-Baccala G, Komurian-Pradel F, Mallet F, Tuke PW, Voisset C, Blond JL, et al. Molecular identification of a novel retrovirus repeatedly isolated from patients with multiple sclerosis. Proc Natl Acad Sci. 1997;94:7583–7588. doi: 10.1073/pnas.94.14.7583.
- Pickeral OK, Makalowski W, Boguski MS, Boeke JD. Frequent human genomic DNA transduction driven by LINE-1 retrotransposition. Genome Res. 2000;10:411–415. doi: 10.1101/gr.10.4.411.
- Prak ET, Kazazian HH., Jr Mobile elements and the human genome. Nature Rev Genet. 2000;1:134–144. doi: 10.1038/35038572.
- Pryciak PM, Varmus HE. Nucleosomes, DNA-binding proteins, and DNA sequence modulate retroviral integration target site selection. Cell. 1992;69:769–780. doi: 10.1016/0092-8674(92)90289-o.
- Rubnitz J, Subramani S. The minimum amount of homology required for homologous recombination in mammalian cells. Mol Cell Biol. 1984;4:2253–2258. doi: 10.1128/mcb.4.11.2253.
- Rynditch AV, Zoubak S, Tsyba L, Tryapitsina-Guley N, Bernardi G. The regional integration of retroviral sequences into the mosaic genomes of mammals. Gene. 1998;222:1–16. doi: 10.1016/s0378-1119(98)00451-x.
- Sassaman DM, Dombroski BA, Moran JV, Kimberland ML, Naas TP, DeBerardinis RJ, Gabriel A, Swergold GD, Kazazian HH., Jr Many human L1 elements are capable of retrotransposition. Nature Genet. 1997;16:37–43. doi: 10.1038/ng0597-37.
- Schmid CW. Does SINE evolution preclude Alu function? Nucleic Acids Res. 1998;26:4541–4550. doi: 10.1093/nar/26.20.4541.
- Schmidt EE. Transcriptional promiscuity in testes. Curr Biol. 1996;6:768–769. doi: 10.1016/s0960-9822(02)00589-4.
- Smit AF. The origin of interspersed repeats in the human genome. Curr Opin Genet Dev. 1996a;6:743–748. doi: 10.1016/s0959-437x(96)80030-x.
- ————— “Structure and evolution of mammalian interspersed repeats.” PhD thesis. Los Angeles, CA: University of Southern California; 1996b.
- ————— Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Dev. 1999;9:657–663. doi: 10.1016/s0959-437x(99)00031-3.
- Stevens SW, Griffith JD. Sequence analysis of the human DNA flanking sites of human immunodeficiency virus type 1 integration. J Virol. 1996;70:6459–6462. doi: 10.1128/jvi.70.9.6459-6462.1996.
- Tchènio T, Segal-Bendirdjian E, Heidmann T. Generation of processed pseudogenes in murine cells. EMBO J. 1993;12:1487–1497. doi: 10.1002/j.1460-2075.1993.tb05792.x.
- Vanin EF. Processed pseudogenes: Characteristics and evolution. Annu Rev Genet. 1985;19:253–272. doi: 10.1146/annurev.ge.19.120185.001345.
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al. The sequence of the human genome. Science. 2001;291:1304–1351. doi: 10.1126/science.1058040.
- Voisset C, Bouton O, Bedin F, Duret L, Mandrand B, Mallet F, Paranhos-Baccala G. Chromosomal distribution and coding capacity of the human endogenous retrovirus HERV-W family. AIDS Res Hum Retroviruses. 2000;16:731–740. doi: 10.1089/088922200308738.
- Voliva CF, Jahn CL, Comer MB, Hutchison CA, Edgell MH. The L1Md long interspersed repeat family in the mouse: Almost all examples are truncated at one end. Nucleic Acids Res. 1983;11:8847–8850. doi: 10.1093/nar/11.24.8847.
- Waldman AS, Liskay RM. Dependence of intrachromosomal recombination in mammalian cells on uninterrupted homology. Mol Cell Biol. 1988;8:5350–5357. doi: 10.1128/mcb.8.12.5350.
- Wei W, Gilbert N, Ooi SL, Lawler JF, Ostertag EM, Kazazian HH, Jr, Boeke JD, Moran JV. Human L1 retrotransposition: cis preference versus trans complementation. Mol Cell Biol. 2001;21:1429–1439. doi: 10.1128/MCB.21.4.1429-1439.2001.
- Weiner AM, Deininger PL, Efstratiadis A. Nonviral retroposons: Genes, pseudogenes, and transposable elements generated by the reverse flow of genetic information. Annu Rev Biochem. 1986;55:631–661. doi: 10.1146/annurev.bi.55.070186.003215.
- Zoubak S, Clay O, Bernardi G. The gene distribution of the human genome. Gene. 1996;174:95–102. doi: 10.1016/0378-1119(96)00393-9.