Coding sequences of functioning human genes derived entirely from mobile element sequences

Roy J Britten

2004 Nov 16;101(48):16825–16830. doi: 10.1073/pnas.0406985101

PMCID: PMC534736 PMID: 15546984

Abstract

Among all of the many examples of mobile elements or “parasitic sequences” that affect the function of the human genome, this paper describes several examples of functioning genes whose sequences have been almost completely derived from mobile elements. There are many examples where the synthetic coding sequences of observed mRNA sequences are made up of mobile element sequences, to an extent of 80% or more of the length of the coding sequences. In the examples described here, the genes have named functions, and some of these functions have been studied. It appears that each of the functioning genes was originally formed from mobile elements and that in some process of molecular evolution a coding sequence was derived that could be translated into a protein that is of some importance to human biology. In one case (AD7C), the coding sequence is 99% made up of a cluster of Alu sequences. In another example, the gene BNIP3 coding sequence is 97% made up of sequences from an apparent human endogenous retrovirus. The Syncytin gene coding sequence appears to be made from an endogenous retrovirus envelope gene.

Mobile elements form the majority of the human genome, but that is unimportant compared to all of the functional effects these “parasites” have had on our evolution. Insertions have influenced the regulation of transcription of some genes and the termination of transcription. Hundreds of examples have been recognized where individual exons have sequences that are similar or identical to fragments of mobile element (ME) sequences (1, 2). In many of these cases a single exon is involved, and its transcription yields a variant mRNA (3). The suggestion is that MEs are a source of variation as a result of the insertion of fragments of sequence into functioning genes. Here, I am using MEs (sensu lato) to represent any repeated sequence present in many copies in the genome. Smit (4) has made a list of 19 examples of human genes “probably derived from transposable elements.”

Reported here are cases where almost the entire coding sequences (>89%) of functioning human genes are apparently derived from ME sequences. There are several examples of genes with named functions in which all or nearly all of coding sequences are quite similar to ME sequences as recognized by repeatmasker (www.repeatmasker.org). There are many other examples of observed mRNAs for which the coding sequences are defined by computer programs, and these sequences are identified by repeatmasker as MEs. However, in this subset of cases it is claimed that a functioning gene was derived entirely from ME sequences. There may be additional cases among a list of 49 unstudied examples derived by screening mRNA libraries to be described below.

These observations contribute an additional bit to the growing mass of evidence that indicates that mobile elements/repeats are not always junk and have made important contributions to the “host” (5–9). The MEs and DNA sequences derived from them have been a part of the eukaryotic “genomic environment” for a very long time. Thus, it is expected that they will have had important effects on gene function because it can be considered that living systems will sooner or later make use of whatever is available if it is at all possible, particularly in the genome. There have been theoretical proposals (10) of the evolutionary role of variety and change in these relationships, particularly in the control of gene expression. There is direct evidence for the evolutionarily significant role of mobile elements/repeats (11–13) and evidence for strong associations and functions including the regulation of transcription. The cases described in this paper add to this earlier evidence in that, in these cases, nearly the entire coding sequences of genes have apparently been derived from ME sequences.

A survey is in progress to determine the fraction of the coding sequences recognized at present in available genomes that are derived from ME sequences. The early results turned up the AD7C or neuronal thread protein gene, which sparked interest because it is apparently derived entirely from a cluster of Alu repeated sequences. The investigators pointed out that the coding sequence contained regions of sequence similarity to four Alu sequences (14). Table 1 describes this and several other cases.

Table 1. Selected genes derived from ME sequences.

Chr.	Name	cds, nt	% ME	% match	ME identifier	Accession no.
1?	AD7C	1,128	99.6	83-92	5 Alu segments	NM_014486
7	SYNCYTIN	1,617	100	97	HERV-W	AF072506
(7	GTF2IRD2^*	1,607	97.7	80-88	Charlie8, DNA/MER1	NM_001003795)
8	HHCM	1,404	89.9	68-71	L1MD2, LINE/L1	NM_006543
10	BNIP3	585	97.1	84	HERV70, LTR/ERV1	NM_004052
13	LG30	216	100	74-76	MLT1E, MLT1G, LTR	AY138548

The first column is the chromosome (Chr.) number, which is not certain for AD7C.

^* Exon 16 only, therefore in parentheses.

Methods

A collection of coding sequences was made from the NCBI file seq_gene.md. These were examined by repeatmasker, and those that were reported to be almost completely similar in sequence to mobile elements were set aside for further study. The examples examined in the first part of this paper were selected from this list on the basis of their known function. Some of the remainder of them are shown in Table 4.
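The screening step can be illustrated with a minimal Python sketch. It assumes that the repeatmasker matches for one cds are already available as 1-based, inclusive (start, end) intervals; the function name `covered_fraction` and the 0.80 cutoff variable are illustrative choices, not part of the original procedure. The example intervals are the AD7C matches listed in Table 2.

```python
def covered_fraction(cds_length, intervals):
    """Fraction of a cds covered by 1-based, inclusive (start, end) intervals.

    Overlapping intervals are counted only once.
    """
    covered = 0
    current_end = 0  # last cds position already counted
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / cds_length


# AD7C matches from Table 2 (start and end positions in the cds).
ad7c_hits = [(1, 281), (284, 411), (413, 580), (581, 884), (887, 1128)]
ad7c_fraction = covered_fraction(1128, ad7c_hits)
print(f"{100 * ad7c_fraction:.1f}% of the AD7C cds is annotated as ME")

# Illustrative cutoff, chosen to mirror the >80% criterion of Table 4.
THRESHOLD = 0.80
print("set aside for further study:", ad7c_fraction >= THRESHOLD)
```

With the Table 2 intervals this gives 99.6%, in agreement with the AD7C entry in Table 1.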

Table 4. Observed transcripts that match ME for >80% of their length.

Chr.	% length^*	Length of cds	Accession no.	Name	% match^†	Listing^‡	Record^§
1	98.25	171	NM_016646	LOC51336	82.7	L1M3
1	96.00	150	XM_352936	PRO2012	78.1	L1ME	Record removed
2	100.00	384	NM_175853	LOC150759	78.7	L1P
2	100.00	375	XM_208704	LOC283517	97.9	SVA	Record removed
2	100.00	723	XM_351431	LOC375197	95.8	L1PA4	Record removed
2	95.83	288	XM_173068	LOC253584	85.4	THE1C MLT1B
2	92.33	300	XM_351509	LOC375299	90.0	L1P Tigger2	Record removed
2	86.38	279	XM_291017	LOC339793	89.8	L1P MLT1E2
3	93.44	183	NM_018629	PRO2533	76.3	MLT1G3	Record removed
3	91.54	402	XM_353342	LOC375388	79.9	MLT2B4	Record removed
3	83.66	153	NM_014135	PRO0641	78.1	MLT1H	Record removed
4	99.95	1974	XM_209656	LOC285550	89.8	Charlie9
4	97.40	1692	NM_024534	FLJ12684	72.6	MER34-int
5	87.04	486	NM_173668	FLJ34836	89.1	LTR12C BaEV-int MER50
5	86.45	369	XM_353366	LOC375433	81.2	L1MC/D LTR5B	Record removed
6	99.74	387	XM_291181	LOC340211	97.4	L1P
6	99.60	249	NM_018572	PRO1051	87.9	L1P	Record removed
6	84.97	366	NM_178534	FLJ37940	85.7	HERVL18	Record removed
7	97.72	1491	NM_032203	GTF2IRD2	82.4	Charlie8
7	94.02	1722	NM_145111	DKFZp727G1	86.4	Charlie9
8	97.99	348	XM_351783	LOC375668	86.8	Tigger3 (Golem) FLAM_C	Record removed
8	85.35	273	XM_353456	LOC375664	70.3	L2	Record removed
9	100.00	375	XM_209180	LOC284397	97.1	SVA
9	100.00	342	XM_351803	LOC375700	87.4	L1P4	Record removed
9	100.00	363	XM_353472	LOC375692	82.8	LTR1B	Record removed
9	92.10	291	XM_353479	LOC375732	73.1	SST1	Record removed
9	92.10	291	XM_353476	LOC375716	72.8	SST1	Record removed
9	91.75	291	XM_353477	LOC375726	72.6	SST1	Record removed
9	90.14	426	NM_030898	FLJ21673	84.1	AluSg/x L2 FLAM_A	Record removed
9	90.11	354	XM_353493	LOC375772	93.1	MLT2A1	Record removed
9	88.52	270	XM_353481	LOC375740	77.4	REP522	Record removed
9	88.52	270	XM_353480	LOC375738	77.8	REP522	Record removed
9	88.52	270	XM_353478	LOC375727	77.4	REP522	Record removed
10	99.67	600	NM_178512	FLJ37201	73.2	Tigger4 (Zombi)
10	99.57	231	XM_352893	LOC374280	90.4	MER11B	Record removed
10	98.10	105	NM_173577	MGC45541	86.4	AluJo/FRAM
11	85.21	2082	NM_021211	LOC58486	68.3	Charlie1
12	85.48	303	XM_350891	LOC374483	93.8	HERVK22	Record removed
13	99.31	288	NM_138474	LOC144845	80.1	L1PA13	Record removed
13	97.80	501	NM_173604	FLJ25694	87.4	HERVE
13	89.24	381	NM_153251	FLJ25952	85.5	AluSg/x AluSx
13	80.13	297	XM_353050	LOC374511	77.0	MSTC MIR	Record removed
16	100.00	75	NM_030970	MGC3771	90.7	AluSp
19	100.00	144	NM_178523	MGC45556	91.0	L1PA10	Record removed
19	88.17	372	XM_294914	LOC339358	81.2	MER41-int MER41B MER77
19	84.00	225	NM_138781	LOC113386	77.8	HERVK3
20	100.00	375	XM_209370	LOC284806	89.1	SVA	Record removed
21	88.89	378	XM_211658	LOC284837	75.3	L1MB8	Record removed
X	100.00	372	NM_153016	FLJ30672	87.8	THE1-int

Chr., chromosome.

^* Percent of length of cds that is ME.

^† Percent sequence match of cds to ME.

^‡ repeatmasker listing.

^§ Added on March 2, 2004, when removals were found.

Results and Discussion

AD7C. AD7C is a neuronal thread protein gene. It encodes a 41-kDa membrane spanning phosphoprotein that is useful in the diagnosis of early Alzheimer's disease (14, 15). The coding sequence is 1,128 nt long and repeatmasker shows that it consists of fragments of five (or four, see below) Alu sequences. All of the matches are with the reverse complements of the Alu repeats. The alignment is summarized in Table 2. Listed are the percent similarity and length of each of the regions from the best matching Alu sequences, which differ inconsequentially from those published in ref. 14.

Table 2. Alignment summary of AD7C.

				Position in ME^§
%^*	Start	End^†	ME^‡	End	Start
92	1	281	AluSp#SINE/Alu	280	1
87	284	411	AluJo#SINE/Alu	143	2
83	413	580	AluJo#SINE/Alu	301	134
92	581	884	AluSc#SINE/Alu	302	1
88	887	1128	AluSx#SINE/Alu	300	61

^* Match between ME sequence and region of cds.

^† Start and end positions in cds.

^‡ repeatmasker description of ME.

^§ End and start positions in reverse-oriented ME.

First, an AluSp matches the first 281 nt of the coding sequence at 92% similarity. After a gap of 3 nt, 141 nt of AluJo matches at 87% similarity. Then, after 2 nt, an additional part of the AluJo sequence matches at 83% for 167 nt, including a sizeable part of the poly(A) tail, modified by two substitutions that affect the translation. These two short fragments seem to represent one Alu sequence homolog in the coding sequence, but rearrangement has apparently occurred because there are overlapping regions of the AluJo. Next is a 92% match for 302 nt to an AluSc, including a sizeable part of the poly(A) tail that is modified. Finally, there is an 88% match for 239 nt to an AluSx, also including a sizeable region of the poly(A) tail that is modified. In the genome, this match continues after the end of the coding sequence region and there is another match to an Alu sequence (data not shown).

It appears that the whole gene coding region has been made from a cluster of Alu sequences. The gaps of a few nucleotides between the individual Alu sequence matches are probably just details of the repeatmasker alignment process and can be ignored. A matter of interest is how much change has occurred in the sequences to form a useful gene from the ME sequences. The Alu sequences summarized in Table 2 are simply the best matches from the repeatmasker collection and are not necessarily the Alu sequences that were present in the original Alu cluster, so that it is not possible in general to identify the sequence changes that have occurred. A sample can be estimated by examining the three poly(A) chains that are included. They total to 60 Ts in the complementary Alu sequences. In these poly(T) regions, eight changes have occurred, all leading to translatable codons for amino acids other than phenylalanine. They consist of six A substitutions and two insertions of two As each. This ≈17% change in this small sample suggests positive selection. Of course, there is only one possible silent substitution in a row of Ts, the transition from T to C in the third base. In addition, there are four cases of internal T-rich sequences in the five Alu sequences involved, and in one of those, such a silent substitution has occurred. In two of these cases, length differences have occurred resulting from a six-base deletion and a four-base insertion, leading, of course, to translatable codons. This is a clear case in which a cluster of Alu repeats has been converted into an active human gene. We do not yet know how the 5′ control region is organized. With that information we will someday be able to say more about the evolutionary process that created the gene. It was pointed out that an identifiable full-length representation in the human genome (build 34) is only 97% similar to the AD7C mRNA sequence (A. F. Smit, personal communication) (14). The differences are such that the genomic sequence is not translatable for a significant length. No better genomic copy of the mRNA has been found, but the gene could contain introns and might be hard to identify because of the Alu sequences.
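The remark about silent substitutions in a run of Ts can be checked with a short Python sketch. It assumes the standard genetic code and enumerates every single-base change of the codon TTT (phenylalanine). The final lines show one way of reading the ≈17% figure, in which each inserted A is counted as one changed position; that reading is an assumption made here for illustration, not a statement of how the value was originally calculated.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_substitutions(codon):
    """Yield every codon that differs from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


# Which single-base changes of TTT still encode phenylalanine (F)?
silent = [
    mutant for mutant in single_substitutions("TTT")
    if CODON_TABLE[mutant] == CODON_TABLE["TTT"]
]
print(silent)  # ['TTC']

# Assumed reading of the ~17% figure: 6 substitutions plus 2 insertions
# of 2 As each, relative to the 60 Ts of the three poly(T) regions.
changed_positions = 6 + 2 * 2
print(f"{100 * changed_positions / 60:.0f}% of poly(T) positions changed")
```

Only TTC survives the filter, which is the T to C transition in the third base mentioned in the text.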

BNIP3. BNIP3 is the gene for a protein involved in controlling apoptosis through the interaction with other proteins (16–18). The heading for the entry in OMIM (Online Mendelian Inheritance in Man) is BCL2/ADENOVIRUS E1B 19KD PROTEIN-INTERACTING PROTEIN 3: BNIP3. Table 1 shows that 97% of the coding sequence is related closely to that of HERV70RM. HERV70RM is the name I am using for the version of HERV70 that is included in the repeatmasker library and it is named a human endogenous retrovirus, although it does not contain recognizable retroviral gene residues. It is more than 7 kb long, and the relationship to the BNIP3 coding sequence occurs after nucleotide 4641 of HERV70RM. The coding sequence of the BNIP3 mRNA aligns fully with the HERV70RM sequence even though the gene consists of 6 exons spread over almost 15 kb of DNA. To help resolve this relationship, repeatmasker was run against the whole gene, and the results are shown in Table 3. Most of these data are from repeatmasker output, and two columns are added to show the location of the exons in the gene. In most cases, the identification of an HERV70RM segment in the gene aligns closely with the exons. This agreement is so good that the history seems obvious. Likely, a part of the HERV70RM from about 4–7 kb was converted to a gene without introns, which must have evolved and become useful, and later the introns were inserted into it to lead to the modern BNIP3 gene. In fact, there is a BNIP3P sequence on chromosome 14 that is identified as a pseudogene because it lacks introns and gives a very good match in a search made with the BNIP3 mRNA by using blast against the human genome. It is possibly a fossil of the early stage in this event or it may be an actual pseudogene made from the mRNA at a later stage.

Table 3. MEs in the BNIP3 gene.

Divergence			Distance from start of gene							Location in ME
%	Del	Ins	Exon		Start	End		ME identification		Start	End
17.6	8.0	3.2	824	869	1	875	+	HERV70	LTR/ERV1	4641	5557
26.1	0.0	4.2			1241	1288	C	L2	LINE/L2	(86)	3227
28.3	16.3	0.0			1648	1739	C	MER5A	DNA/MER1_type	(48)	141
9.0	4.1	0.0			2208	2473	+	AluSq	SINE/Alu	1	277
23.7	12.6	2.1			2753	2847	+	L1ME3A	LINE/L1	6021	6125
18.0	0.0	0.0	2938	3087	2937	3086	+	HERV70	LTR/ERV1	6776	6925
16.2	0.0	3.7	3164	3270	3169	3277	+	HERV70	LTR/ERV1	6933	7037
15.8	11.4	0.0			4574	4687	+	FLAM_C	SINE/Alu	1	127
13.8	0.0	0.0	5334	5418	5335	5421	+	HERV70	LTR/ERV1	7032	7118
13.6	1.3	0.0	6093	6243	6094	6247	+	HERV70	LTR/ERV1	5901	6056
19.3	2.8	0.0			6691	6980	C	AluJo	SINE/Alu	(14)	298
32.0	8.0	0.0			6997	7146	C	L1ME	LINE/L1	(734)	5436
7.0	1.1	1.1	None		7147	7233	+	HERV70	LTR/ERV1	7172	7258
27.5	5.6	1.4			7241	7384	C	L1ME	LINE/L1	(873)	5273
23.0	18.6	2.3			8613	8870	C	MER21C	LTR/ERV1	(88)	847
17.9	0.0	11.8			8909	8984	+	(CCCCAA) n	Simple_repeat	2	68
16.7	0.0	1.6			9224	9284	+	MER41B	LTR/ERV1	481	540
8.3	1.4	0.7			9297	9586	+	AluSq	SINE/Alu	6	297
24.1	3.7	3.7			9594	9675	C	MER21C	LTR/ERV1	(853)	82
23.7	17.2	0.0			9747	10036	C	MLT1A0	LTR/MaLR	(17)	348
34.9	1.8	3.6			11487	11596	+	MIR3	SINE/MIR	101	208
21.7	4.7	3.5			11902	11987	+	FRAM	SINE/Alu	75	161
4.7	3.0	3.0			12762	12892	+	AluJo/FLAM	SINE/Alu	1	131
			14061	14106			+	HERV70	LTR/ERV1	6053	6100

Del, deletion; Ins, insertion.
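The correspondence between exons and HERV70 segments in Table 3 can be examined with a small Python sketch. The coordinates are copied from Table 3; the helper name `overlap` is an illustrative choice. The last HERV70 row of the table gives no start and end positions in the gene, so the final exon cannot be compared by coordinates here and is reported with an overlap of 0.

```python
# Exon positions (distance from start of gene), from the Exon columns of Table 3.
exons = [
    (824, 869), (2938, 3087), (3164, 3270),
    (5334, 5418), (6093, 6243), (14061, 14106),
]

# HERV70 matches that have gene coordinates in Table 3 (Start, End columns).
herv70_hits = [
    (1, 875), (2937, 3086), (3169, 3277),
    (5335, 5421), (6094, 6247), (7147, 7233),
]


def overlap(a, b):
    """Number of positions shared by two 1-based, inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


exon_lengths = [end - start + 1 for start, end in exons]
total_exon_length = sum(exon_lengths)
print("summed exon length:", total_exon_length)

# Best overlap of each exon with any HERV70 match that has gene coordinates.
overlaps = [max(overlap(exon, hit) for hit in herv70_hits) for exon in exons]
for exon, length, shared in zip(exons, exon_lengths, overlaps):
    print(f"exon {exon[0]}-{exon[1]}: {shared} of {length} nt within a HERV70 match")
```

The exon intervals sum to 585 nt, the cds length given for BNIP3 in Table 1, and each of the first five exons lies almost entirely within a HERV70 match.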

To further explore this interpretation, the coding sequence was aligned with the HERV70RM sequence by using blast2 sequences. The result showed two copies of the almost complete cds region at locations 5507–6073 and 6732–7289 in the HERV70RM sequence, matching ≈80%. Thus, the locations shown in Table 3 in HERV70RM are simply the best fits of repeatmasker and do not necessarily show the actual sequence origins of the BNIP3 coding sequence. It seems likely that it originated as a copy of one of the regions in HERV70RM. Table 3 shows one example of a sequence similarity between HERV70RM and a region of the gene that is not an exon in BNIP3. The history of this region is unclear. In any case, it is clear that most of the exons of the BNIP3 gene derived from a continuous stretch of HERV70RM. This seems to be a good case of “introns late” because there is no other explanation that comes to mind for the presence of a series of connected pieces of HERV70RM spread widely in the BNIP3 gene.

An important issue is the nature of HERV70RM. The copy used in these studies is the one included in the library of human repeated sequences used by repeatmasker. It is incomplete and not a classical endogenous retrovirus. The hervd database (http://herv.img.cas.cz) lists many regions in the human genome that are similar in sequence to what I call HERV70RM here, although none of them match a length of more than ≈1 kb. In fact, there is a set of 63 sequences in this database that match the BNIP3 cds, although most of them show only a short matching region. The situation needs clarification because there are many entries in the hervd database called HERV70 that show no sequence similarity to HERV70RM. There is no full-length copy of HERV70RM in the present version of the human genome, so its status as a human endogenous retrovirus sequence is doubtful. blast of the human genome (filter off) searching with HERV70RM finds many hits and graphs some examples as if they were full-length matches. Such full-length matches do not exist; the program has assembled them from groups of nearby fragmentary matches.

When repeatmasker is run against HERV70RM, two small fragments of Alu sequences are found, as well as other MEs within it. There are regions that repeatmasker identifies as HERV70 (HERV70RM), and these include the region of the copies of the BNIP3 coding sequences. A warning is required here because blast of the human genome (filter off, default) finds only 3 matching sequences for the BNIP3 coding sequence of the 63 that exist in the hervd database. I confirm the fact that there are many matching fragments to the coding sequence (cds), finding 120 in the human genome by using blast. This is an important point because these data, regardless of the interpretation of HERV70RM, show that the BNIP3 gene cds sequence is closely related in toto to sequences of a ME. We may not know exactly what this ME is, but there are many copies of this region of it in the human genome ranging from precise to quite divergent.

The BNIP3 gene occurs in the mouse genome [NM_009760], and the coding sequence matches the human with 89% accuracy. The protein sequences match to 90% accuracy except for a 5-aa gap and a 1-aa gap in the mouse protein. The gene arrangement is similar, with 6 exons extending over ≈15 kb. The exons are identical in length to the human exons except for the gaps of 15 and 3 nt corresponding to the protein differences. Because the cds match so closely in sequence, the mouse BNIP3 exons show the same relationship to the human HERV70RM as do the human BNIP3 exons. Interestingly, there is no sequence in the mouse genome, seen by blast of the mouse genome, that matches the human HERV70RM except for the BNIP3 exons. There is apparently no equivalent ERV in the mouse genome, although, of course, many other HERVs and MERVs share sequence. repeatmasker may be used with either the human repeats or mouse repeats to examine the mouse BNIP3 gene region. With the human repeats, the mouse BNIP3 exons are recognized as HERV70RM sequences, but with the mouse repeats, no sequences match. The exons in the two genes are nearly identical. The nucleotide sequences of the mouse and human BNIP3 cds match closely (90%). K_s between the coding sequences of mouse and human is 0.41 and K_a is 0.047 (K_s is the divergence due to synonymous substitutions, and K_a is the divergence due to changes that cause amino acid replacement) (19). This similarity suggests that whatever the events were, they occurred far in the past.

The BNIP3 gene has also been sequenced from rat, and the cds is 95% similar to that of mouse BNIP3, so the same arguments apply. The K_s between the coding sequences of the rat and human is 0.37 and K_a = 0.048 (20). blast of the rat genome finds a BNIP3 exon and two other rat sequences similar to parts of human HERV70RM, whereas blast of the mouse genome finds only a BNIP3 exon with similarity to human HERV70RM. Based on a blast search of GenBank, chicken (Gallus gallus) has a similar mRNA sequence to the human BNIP3. There is a match of 367 of 453 nt, or 81%, in one large region and evidence of other smaller regions of similarity. It seems that a full examination of the evolution and relationships of BNIP3 and HERV70RM would be worthwhile in a number of species.
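As a small worked example, the K_a and K_s values quoted above for the mouse and rat comparisons can be combined into the ratio K_a/K_s. The Python sketch below only does that arithmetic. The usual convention that a ratio well below 1 points to selection against amino acid replacements is a standard interpretation brought in here as an assumption; it is not a conclusion stated in the text.

```python
# K_s and K_a values as quoted in the text for BNIP3 coding sequences.
comparisons = {
    "mouse vs human": {"Ks": 0.41, "Ka": 0.047},
    "rat vs human": {"Ks": 0.37, "Ka": 0.048},
}

# Ratio of replacement to synonymous divergence for each comparison.
ratios = {
    name: values["Ka"] / values["Ks"]
    for name, values in comparisons.items()
}

for name, ratio in ratios.items():
    print(f"{name}: Ka/Ks = {ratio:.2f}")
```

Both ratios come out near 0.1, which under the assumed convention would be consistent with a coding sequence that has long been constrained at the protein level.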

Syncytin. This example is listed by Smit (4) and is included here because recent evidence shows that Syncytin is a functioning gene in human placenta (21, 22). The mRNA is derived in toto from the endogenous retrovirus HERV-W, which is present in many copies in the human genome. The authors (21) identify ERVWE1 as the gene region that is the source of the transcript, although this may not be certain. ERVWE1 is 10.2 kb long and consists of the usual LTR–gag–pol–env–LTR arrangement. The Syncytin mRNA is 2.8 kb long and consists of the 5′ LTR, some additional sequence, the env gene, and the 3′ LTR. The cds of 1,617 nt includes just the env gene of the endogenous retrovirus. Within it, regions can be identified that are functionally significant to Syncytin. It is not clear how much evolutionary change occurred in the env gene for it to assume its present function. Entrez Gene lists what are termed GeneRIFs (www.ncbi.nlm.nih.gov/projects/GeneRIF/GeneRIFhelp.html):

Env HERV-W glycoprotein mediates cell–cell fusion upon interaction with the type D mammalian retrovirus receptor. Env protein was detected in the placental syncytiotrophoblast, suggesting a physiological role during pregnancy and placenta formation.
Contributor to normal placental architecture, especially in the fusion processes of cytotrophoblasts to syncytiotrophoblasts. The gene expression of Syncytin may be altered in cases with placental dysfunction such as preeclampsia or HELLP syndrome.
mRNA abundance for Syncytin showed stimulation by forskolin in BeWo cells.
Syncytin-mediated trophoblastic fusion in human cells is regulated by GCMa.
Syncytin gene activation is highest in term placenta.
HERV-W Env glycoprotein is directly involved in the differentiation of primary cultures of human villous cytotrophoblasts.
Hypoxia alters expression and function of Syncytin and its receptor during trophoblast cell fusion of human placental BeWo cells: Implications for impaired trophoblast syncytialization in preeclampsia.
Syncytin gene expression is down-regulated by hypoxia, which strengthens the hypothesis that Syncytin is reduced in disturbed pregnancies in the course of placental hypoxia.

HHCM. HHCM is identified as a human hepatocellular carcinoma 3.0-kb DNA sequence that encodes (in a 1,404-nt cds) a 52-kDa protein. It transforms both rat liver cells and NIH 3T3 fibroblasts.† Table 1 shows that it is almost 90% made up of L1 MEs. The sequence match is only ≈70%, so much sequence change has occurred since its origin from a part of the L1 sequence. It matches the regions 18–331 nt and 437–1470 nt of L1MD2. This is not apparently a beneficial contribution that L1 has made to our genome, although MEs act in strange ways. The record NM_006543 was “temporarily removed by RefSeq staff for additional review” and Smit (personal communication) did not find a closely matching genomic sequence. Thus, this example must be considered a candidate for future study.

LG30. LG30 is a gene of unknown function in the region G72/G30 of chromosome 13. Mutations in the region are connected to bipolar disorder (23, 24), but it appears that the G72 is more likely to be responsible (25). The LG30 coding region is only 216 nt long, and 100% of its length is related to LTR class ME (MLT1E, MLT1G).

GTF2IRD2. GTF2IRD2 was initially described as a transcription factor gene (26, 27), and the NCBI entry consisted of the fragment listed in Table 1. That is why it is included here. It has recently been studied in detail (28, 29), and it turns out that this fragment is actually exon 16, the 3′ exon and the only long exon, more than half the length of the whole coding sequence. This exon consists entirely of ME sequence Charlie8. What follows is a quotation from ref. 29. “GTF2IRD2 is the third member of the novel TFII-I family of genes clustered on 7q11.23. The GTF2IRD2 protein contains two putative helix–loop–helix regions (I-repeats) and an unusual C-terminal CHARLIE8 transposon-like domain, thought to have arisen as a consequence of the random insertion of a transposable element generating a functional fusion gene. The retention of a number of conserved transposase-associated motifs within the protein suggests that the CHARLIE8-like region may still have some degree of transposase functionality that could influence the stability of the region in a mechanism similar to that proposed for Charcot–Marie–Tooth neuropathy type 1A. GTF2IRD2 is highly conserved in mammals and the mouse orthologue (Gtf2ird2) has also been isolated.”

Other Transcript Coding Sequences Apparently Derived from ME. Table 4 is a list of 49 examples of observed transcripts for which the coding sequences have been determined by computer programs, and these cds are made up from MEs at least to the extent of 80%. This collection was made by running repeatmasker against the NCBI collection of gene transcripts in February of 2004, but when checks were made in early March, all of the transcripts so marked had been removed from the collection. It seems likely that someone decided they were junk, which in a sense may be true, but from the point of view of this article they may be considered potentially useful and should be further examined. Some of them are likely to be examples of the transcription of fragments of ME, a process which occurs frequently. Regions of the ME LINE-1 (L1) are expressed in mouse, rat, and human RNA collections (unpublished data). Smit's table (4) has been extended (27) to include 47 potential genes derived at least in part from ME. However, the central issue for these two tables is whether these candidates are actually functioning genes. In fact, there is no evidence in the majority of cases that these mRNAs are produced by functioning genes. There are two examples in these tables where nearly the whole mRNA derives from an ME, and one of them is described above as Syncytin (21, 22). The other appears to be the transcription of a fragment of a sequence related fairly closely to HERV3, including the env gene and LTR, and the transcript is described as an env gene mRNA. The evidence of its function is transcription in placental trophoblast cells (28), reminiscent of intracisternal A-particles in mouse that are similar to ERVs and may be claimed to have an important role in placenta (29).

The cases described and possibly the example just mentioned (4, 27) show that parts of ME have been converted to form essentially complete gene coding sequences. There are probably more cases as indicated by Table 4. These observations add to the many known ways in which MEs have contributed to our evolution. This subject has been reviewed recently by Kazazian (30), who characterizes them as being in the driver's seat, rather than simply being useful to have around. Because of this review there is no reason for extensive discussion here.

Acknowledgments

I thank John Williams for assistance, Arian Smit and Mark Springer for criticism, and Eric H. Davidson's laboratory for support.

Abbreviations: cds, coding sequence; ME, mobile element.

Footnotes

^†

Yang, S. S., Zhang, K., Vieira, W., Taub, J. V., Zeilstra-Ryalls, J. H. & Somerville, R. L., 14th International Symposium for Comparative Leukemia and Related Diseases, October 8–12, 1989, Vail, CO.

References

1. Nekrutenko, A. & Li, W. H. (2001) Trends Genet. 17, 619–621.
2. Lorenc, A. & Makalowski, W. (2003) Genetica 118, 183–191.
3. Sorek, R. R., Ast, G. & Graur, D. (2002) Genome Res. 12, 1060–1067.
4. Smit, A. F. (1999) Curr. Opin. Genet. Dev. 9, 657–663.
5. Brosius, J. (1999) Gene 228, 115–134.
6. Makalowski, W. (2000) Gene 259, 61–67.
7. Lagemaat, L., Landry, J. R., Mager, D. L. & Medstrand, P. (2003) Trends Genet. 19, 530–536.
8. Jordan, K., Rogozin, I. B., Glazko, G. V. & Koonin, E. V. (2003) Trends Genet. 19, 68–72.
9. Liu, G., Zhao, S., Bailey, J. A., Sahinalp, S. C., Alkan, C., Tuzun, E., Green, E. D. & Eichler, E. E. (2003) Genome Res. 13, 358–368.
10. Britten, R. J. & Davidson, E. H. (1971) Q. Rev. Biol. 46, 111–138.
11. Britten, R. J. (1996) Mol. Phylogenet. Evol. 5, 13–17.
12. Britten, R. J. (1996) Proc. Natl. Acad. Sci. USA 93, 9374–9377.
13. Britten, R. J. (1997) Gene 205, 177–182.
14. De la Monte, S. M. & Wands, J. R. (2002) Front. Biosci. 7, 989–996.
15. De la Monte, S. M. & Wands, J. R. (2004) J. Alzheimer's Dis. 6, 231–242.
16. Boyd, J. M., Malstrom, S., Subramanian, T., Venkatesh, L. K., Schaeper, U., Elangovan, B., D'Sa-Eipper, C. & Chinnadurai, G. (1994) Cell 79, 341–351.
17. Kothari, S., Cizeau, J., Mcmillan-Ward, E., Israels, S. J., Bailes, M., Ens, K., Kirshenbaum, L. A. & Gibson, S. B. (2003) Oncogene 30, 4734–4744.
18. Giatromanolaki, A., Koukourakis, M. I., Sowter, H. M., Sivridis, E., Gibson, S., Gatter, K. C. & Harris, A. L. (2004) Clin. Cancer Res. 10, 5566–5571.
19. Comeron, J. M. (1995) J. Mol. Evol. 41, 1152–1159.
20. Graur, D. & Li, W.-H. (2000) Fundamentals of Molecular Evolution (Sinauer, Sunderland, MA), pp. 362–363.
21. Mallet, F., Bouton, O., Prudhomme, S., Cheynet, V., Oriol, G., Bonnaud, B., Lucotte, G., Duret, L. & Mandrand, B. (2004) Proc. Natl. Acad. Sci. USA 101, 1731–1736.
22. Potgens, A. J., Drewlo, S., Kokozidou, M. & Kaufmann, P. (2004) Hum. Reprod. Update, in press.
23. Chen, Y.-S., Akula, N., Detera-Wadleigh, S. D., Schulze, T. G., Thomas, J., Potash, J. B., DePaulo, J. R., McInnis, M. G., Cox, N. J. & McMahon, F. J. (2004) Mol. Psychiatry 9, 87–92.
24. Hattori, E., Liu, C., Badner, J. A., Bonner, T. I., Christian, S. L., Maheshwari, M., Detera-Wadleigh, S. D., Gibbs, R. A. & Gershon, E. S. (2003) Am. J. Hum. Genet. 72, 1131–1140.
25. Chumakov, I., Blumenfeld, M., Guerassimenko, O., Cavarec, L., Palicio, M., Abderrahim, H., Bougueleret, L., Barry, C., Tanaka, H., La Rosa, P., et al. (2002) Proc. Natl. Acad. Sci. USA 99, 13365–13367.
26. Strausberg, R. L., Feingold, E. A., Grouse, L. H., Derge, J. G., Klausner, R. D., Collins, F. S., Wagner, L., Shenmen, C. M., Schuler, G. D., Altschul, S. F., et al. (2002) Proc. Natl. Acad. Sci. USA 99, 16899–16903.
27. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Nature 409, 860–921.
28. Boyd, M. T., Bax, C. M., Bax, B. E., Bloxam, D. L. & Weiss, R. A. (1993) Virology 196, 905–909.
29. Ball, M., McLellan, A., Collins, B., Coadwell, J., Stewart, F. & Moore, T. (2004) Gene 325, 103–113.
30. Kazazian, H. H., Jr. (2004) Science 303, 1626–1632.

