Proceedings of the National Academy of Sciences of the United States of America. 2024 Oct 28;121(45):e2313581121. doi: 10.1073/pnas.2313581121

The reconstruction of evolutionary dynamics of processed pseudogenes indicates deep silencing of “retrobiome” in naked mole rat

Valeria Kogan a,1, Ivan Molodtsov b,1, Daria I Fleyshman b, Olga V Leontieva b, Igor E Koman a, Andrei V Gudkov b,2
PMCID: PMC11551321  PMID: 39467133

Significance

Retrotransposons, epigenetically silenced, highly repetitive virus-like elements, constitute nearly half of mammalian genomes. This “retrobiome” emerged through multiple explosive retrotransposon amplifications during mammalian evolution, and its derepression is linked to various pathologies, such as cancer and aging. To assess current and past retrobiome activity, we devised a computational genomic methodology using processed pseudogenes, intronless cDNA copies of mRNAs formed as a “side product” of retrobiome activity. Our analysis revealed that the retrobiome in naked mole rats, renowned for their exceptional longevity, has remained dormant for millions of years, contrasting with active retrobiomes in mice, rats, and other species studied. This finding underscores a potential connection between retrobiome activity and longevity.

Keywords: retrotransposons, evolution, longevity, mutations, reverse transcription

Abstract

Approximately half of mammalian genomes are occupied by retrotransposons, highly repetitive interspersed genetic elements expanded through the mechanism of reverse transcription. The evolution of this “retrobiome” involved a series of explosive amplifications, presumably associated with high mutation rates, interspersed with periods of silencing. A by-product of retrotransposon activity is the formation of processed pseudogenes (PPGs): intron-less, promoter-less DNA copies of messenger RNA (mRNA). We examined the proportion of PPGs with varying degrees of deviation from their ancestor mRNAs as an indicator of the intensity of retrotranspositions at different times in the past. Our analysis revealed a high proportion of “young” (recently acquired) PPGs in the DNA of mice and rats, indicating significant retrobiome activity during the recent evolution of these species. The ongoing process of new PPG entries in mouse germ line DNA was confirmed by identifying diversity in PPG content within the single strain of mice, C57BL/6. In contrast, the highly abundant PPGs of the naked mole rat (NMR) exhibited substantial deviation from their mRNAs, with a near-complete lack of PPGs without mutations, indicative of the silencing of the retrobiome in the most recent evolutionary past, preceded by a period of high activity. This distinctive feature of the NMR genome was confirmed through the analysis of a broad range of mammalian species. The peculiar evolutionary dynamics of PPGs in the NMR, an organism with exceptional longevity and resistance to cancer, may reflect the role played by the retrobiome in aging and cancer.


From hundreds of thousands to millions of copies of long and short interspersed nuclear elements (LINEs and SINEs) are present in the genome of every mammalian species, collectively occupying from a quarter to half of its entire length (1–4). Along with endogenous retroviruses (5), they belong to retrotransposons and altogether form the “retrobiome,” the entirety of genetic material originating from reverse transcription (6). LINE-1 (L1) is the most ancient family of autonomous retrotransposons and is the major source of reverse transcriptase (RT), encoded by the second of the two open reading frames of the L1 RNA transcript (7). Out of approximately 500,000 copies of L1, only about 150 copies in the human genome and about 3,000 copies in the mouse genome remain intact and technically capable of retrotransposition; the rest are deficient due to truncations or mutations accumulated during millions of years of their residence in the genome in the absence of the correcting pressure of stabilizing selection (8).

RT of L1 (RTL1) is the likely driver of amplification of SINEs, the most abundant retrobiome family (9). Every order of mammals has its own type of SINEs: Alu in primates, B1 to B4 in rodents, C1 in carnivores (10), etc., suggesting their role as “morphogenes” that divided mammals into their major archetypes (11). Unlike L1, SINEs do not encode proteins. They originated from genes that are transcribed by RNA polymerase III and encode different classes of small RNAs, i.e., tRNAs and 7SL RNA, and require RTL1 for their replication and integration (12–14).

On the evolutionary timescale, accumulation of L1 and SINEs occurred via multiple “explosions” when, during rather short evolutionary times, numerous insertions of retroelements entered the genomes of our ancestors, creating structurally distinct subfamilies of L1s and SINEs. These explosions were presumably associated with periods of high levels of genomic instability (15, 16) that, on the one hand, was associated with a high risk of inherited diseases (17), and, on the other hand, created broad phenotypic variations that allowed natural selection to promote adaptation (10, 18).

In somatic tissues, the activity of the retrobiome is frequently observed in tumor cells: multiple human tumor types are positive for L1-encoded proteins (19) and acquire multiple new integrated copies of L1 and SINEs (20). Activation of L1 expression and reverse transcription was linked to the processes of aging (21). Moreover, activation of L1 in senescent cells was shown to result in the cGAS–STING-mediated induction of the interferon type I response (22) and, if it occurs at the organismal level (e.g., in Sirt6-knockout mice), could lead to systemic inflammation and premature death (23).

Activation of retroelements promotes genomic instability in both germline and somatic cells despite multiple preventive mechanisms (24). The first line of defense is epigenetic: almost a third of the CpG sites in the human genome are located within Alu sequences (25, 26). De novo methylation of L1 is conducted through the PIWI–piRNA pathway (27). Besides heavy methylation, transcription of L1 is actively repressed by histone modifications regulated by the activity of Rb (24, 28–30) and sirtuins 6 (31) and 7 (32). A key role in epigenetic silencing of retroelements is played by p53 (33, 34). Multiple mechanisms act at the posttranscriptional level by causing selective degradation of L1 RNA (35–37). Activation of interferon results in either direct or indirect (via attraction of the immune system) eradication of cells with a desilenced retrobiome (33, 38).

Given the significance of the retrobiome’s impact on cancer, aging, and evolution, it would be important to develop tools enabling objective quantitative estimation of its activity. The extremely high copy number of the major retrobiome components complicates comparison of their contents among genomes, as well as the detection of newly integrated copies of L1 and SINEs and the assignment of an evolutionary age to such events. Therefore, we focused on less abundant events which, nevertheless, can also indicate the rate of retrobiome activity in terms of acquisition of new entries of products of reverse transcription in DNA. Specifically, we chose low-abundance members of the retrobiome that are incapable of replication, namely processed pseudogenes (PPGs). PPGs are cDNA copies of mRNAs presumably synthesized by RTL1 and integrated into the genome (6, 39). They can be clearly distinguished from their parental genes because they lack promoters and introns and carry a poly(A) sequence at the position corresponding to the 3′-end of the mRNA. They are generally incapable of transcription and, like any other nonfunctional elements, undergo genetic drift, passively acquiring mutations at the rate of neutral evolution (40). Thus, the relative mutational load of PPGs in the genome can be used as an indicator of their evolutionary age, and the relative abundance of PPGs with certain proportions of mutations could be indicative of the activity of retrotranspositions at a given evolutionary time.

We applied this analysis to the genomes of rodents with short (mouse, rat) and exceptionally long lifespan (naked mole rat, NMR) (41) and found dramatic differences in evolutionary dynamics of PPGs, presumably representing the rest of the retrobiome, among them. These differences seem to be consistent with the claimed role of retrobiome activity in aging.

Results

“PPG Finder”: A Tool for Mining PPG-Derived Genomic Regions.

PPGs originate from reverse transcription of mRNAs and lack the introns of their ancestral genes. Hence, genomic DNA fragments spanning exon–exon junctions are expected to belong to PPGs. We developed a software tool for systematic mining of such putative PPG-derived sequences, named PPG Finder (Fig. 1), based on identifying regions in nuclear DNA homologous to exon–exon junctions of spliced mRNAs. Then, 260 bp-long sequences representing exon–exon junctions in each known mRNA transcript were archived as a FASTA file to be further transformed into a BLAST database (see SI Appendix for detail). Once this database is completed for the mRNAs of each animal species of interest, the reference genome of the studied species is aligned to it. Identified BLAST hits are filtered by their length (75% of the junction sequence length or higher) and by the percentage of identical matches (85% or higher). All filtered hits are considered to represent elements of PPGs. The sequences of the PPGs can be restored from the junctions based on information on the junction localization in the reference genome.
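
The two filtering thresholds described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, its field names, and the function names are hypothetical, and only the 75% length and 85% identity cutoffs are taken from the text.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record; field names are illustrative)."""
    junction_id: str
    junction_length: int  # length of the junction sequence in the library
    align_length: int     # length of the reported alignment
    pident: float         # percentage of identical matches


# Thresholds quoted in the text.
MIN_LENGTH_FRACTION = 0.75
MIN_IDENTITY = 85.0


def passes_filter(hit: Hit) -> bool:
    """Keep hits covering >= 75% of the junction with >= 85% identity."""
    long_enough = hit.align_length >= MIN_LENGTH_FRACTION * hit.junction_length
    similar_enough = hit.pident >= MIN_IDENTITY
    return long_enough and similar_enough


def filter_hits(hits):
    """Return only the hits treated as putative PPG-derived junctions."""
    return [h for h in hits if passes_filter(h)]
```

For a 260-bp junction sequence, the length cutoff corresponds to an alignment of at least 195 bp.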

Fig. 1.

Schematic description of the main principles of the PPG Finder approach to the identification of PPGs. Explanations for the schemes are provided in the text. The Venn diagram shows the intersections among the PPG contents revealed by the indicated methodologies in the mouse genome and the degrees of intersection among them. The numbers and proportions of PPG-derived exon–exon junctions identified by PPG Finder that are either unique or common to the PPGs identified by PseudoPipe and GENCODE are indicated. More information is provided in SI Appendix, Fig. S1.

We tested the ability of PPG Finder to identify PPG-derived sequences in the mouse genome by comparing the set of sequences yielded by this tool with those defined by previously developed approaches (42), such as PseudoPipe (43–45), RetroFinder (46), models created by the Human and Vertebrate Analysis and Annotation (HAVANA) team and manually curated pseudogenes list in GENCODE (47). Different coding genes and corresponding transcripts (including alternatively spliced ones) were used to build BLAST database of exon–exon junctions. After alignment to the reference genome, PPG Finder identified 15,053 junctions corresponding to 2,191 parental genes for potential PPGs.

To compare the outcomes yielded by PPG Finder, PseudoPipe, and GENCODE annotation, the chromosomal coordinates determined for all entries in each of the three methods were intersected, resulting in three-way intersection data (Fig. 1). Since PPG Finder focuses only on the region surrounding the junction, while neither PseudoPipe nor GENCODE annotation is limited to it, we required only a partial, not a full, intersection. PseudoPipe output was filtered to include the “PSSD” and “FRAG” biotypes, indicative of PPGs and of pseudogene loci to which the authors could not assign a biotype (processed or duplicated) with certainty. GENCODE output was filtered to include the “processed_pseudogene” and “transcribed_processed_pseudogene” biotypes.
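
The partial-intersection rule can be sketched as follows. This is an illustrative Python sketch, not the pipeline used in the study: intervals are assumed to be `(chromosome, start, end)` tuples with half-open coordinates, and the function and category names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def classify_junction(junction, pseudopipe, gencode):
    """Assign a PPG Finder junction to one of four overlap categories.

    A partial overlap with any entry of an annotation set is sufficient,
    mirroring the rule that full containment is not required.
    """
    in_pseudopipe = any(overlaps(junction, p) for p in pseudopipe)
    in_gencode = any(overlaps(junction, g) for g in gencode)
    if in_pseudopipe and in_gencode:
        return "both"
    if in_pseudopipe:
        return "pseudopipe_only"
    if in_gencode:
        return "gencode_only"
    return "ppg_finder_only"
```

The linear scan over both annotation sets is adequate for a toy example; for genome-scale data a sorted sweep or an interval tree would be the more practical choice.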

The observed results (Fig. 1 and SI Appendix, Table S1) demonstrate that most of the junctions identified by PPG Finder (58%) intersect PPGs found by both PseudoPipe and GENCODE. Additionally, there are smaller subsets of PPG Finder hits which intersect with only PseudoPipe or only GENCODE entries (4% and 21%, respectively). A comparatively large group of PseudoPipe and GENCODE entries did not intersect with any of PPG Finder junctions, which was expected since not all pseudogenes in PseudoPipe and GENCODE necessarily include exon–exon junction sequences.

Then, 16% of junctions did not intersect with either PseudoPipe or GENCODE annotations. To check whether they represent bona fide PPG-derived sequences, we analyzed sequences belonging to this category and derived from genes represented by two to five junctions. From the resulting list, three random genes, Armcx3, Rian, and Ppcs, were picked, and all their junctions were manually analyzed.

For the Armcx3 gene, five junctions were found on chromosome 7 within the region spanning from chr7:128042409 to chr7:128042873, with a total length of 464 base pairs. It should be noted that this region lies directly downstream of PPG Gm9299 (chr7:128041251–128042390), which is part of the GENCODE annotation. Of these five junctions, one was generated uniquely for the transcript variant with NCBI Reference Sequence NM_001358520.1, two were present in transcripts NM_001358520.1, NM_001358521.1, NM_027870.4, and XM_017318626.2, and two were unique to NM_027870.4. Alignment of the NM_027870.4 sequence to the mouse genome using BLAST resulted in a nearly perfect hit (99.8% identity, chr7:128040959–128042873) including all predicted junctions as well as Gm9299.

Similarly, for Rian gene encoding lncRNA, three junctions were found on chromosome X within the region spanning from chrX:64497336 to chrX:64498052 of a total length of 716 base pairs. All three junctions were generated from transcript NR_028261.1. Using BLAST to compare NR_028261.1 to chrX resulted in a hit (chrX:64496585–64498895, 95.9% identity) covering all the junctions.

Finally, for gene Ppcs, two junctions were found on chromosome 17 within the region spanning from chr17:8261031 to chr17:8261597 of a total length of 566 base pairs generated from transcript XR_003954862.1. Using BLAST to compare XR_003954862.1 to chr17 resulted in a hit (chr17:8260656–8262231, 96.1% identity) covering both junctions.

These results lead us to believe that PPG Finder is able to identify a set of exon–exon junctions which are not part of any annotated PPGs.

We also confirmed that the overlap between PPG Finder and both GENCODE and PseudoPipe was highly statistically significant, rejecting the null hypothesis of random co-occurrence (Materials and Methods and SI Appendix, Table S2).

History of PPG Accumulation as a Tracker of Retrobiome Activity.

As they are generally nonfunctional, PPGs are not subject to stabilizing selection and accumulate mutations, akin to introns and other noncoding DNA sequences (48). This implies that the older a PPG is, the greater the degree of deviation it exhibits from its ancestral mRNA (Fig. 2A). Given that the formation of new PPGs occurs as a by-product of L1 RT activity, we hypothesized that the relative abundance of PPGs with specific mutation loads would serve as an indicator of the rate of retrobiome activity during related evolutionary periods. Then, the histogram showing the numbers of PPG junctions with specific proportions of mutations, from small to large, would provide a quantitative measure of the history of retrobiome activity in the genomes of ancestors of current species (Fig. 2B).
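
A histogram of this kind can be sketched in a few lines of Python. The binning below (whole-percent bins, proportions rather than counts, an 85% lower cutoff matching the hit filter) is an assumption made for illustration; the function name is hypothetical and the paper's exact binning is not specified here.

```python
from collections import Counter


def identity_histogram(identities, min_identity=85.0):
    """Proportion of junctions per whole-percent identity bin.

    Each value is floored to a whole percent, so the 100 bin holds only
    junctions identical to the parental mRNA (the youngest class) and
    lower bins hold progressively older, more diverged junctions.
    """
    kept = [p for p in identities if p >= min_identity]
    if not kept:
        return {}
    counts = Counter(int(p) for p in kept)
    total = len(kept)
    return {b: counts[b] / total for b in sorted(counts, reverse=True)}
```

Under the stated hypothesis, a genome with recent retrobiome activity would show a large share in the top bins, whereas a silenced retrobiome would leave those bins nearly empty.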

Fig. 2.

Comparison of PPG-derived exon–exon junctions in mouse and rat genomes. (A) Schematic illustration of the progressive accumulation of mutations in PPGs as a function of time during neutral evolution. (B) A histogram indicating the proportions of PPGs with different degrees of deviation from their ancestral mRNAs, presumably including PPGs of different ages, is expected to reflect the relative intensity of PPG formation at different evolutionary times. This feature can be used to determine their “evolutionary age” by aligning the PPG sequences to the parental mRNAs and identifying the percentage of sequence identity. (C) Schematic description of PPG integration in the genomes during the recent evolution of mouse and rat. PPGs integrated into the genome of their common ancestor (“old” PPGs) retain common chromosomal positions in both species and are expected to accumulate more mutations. PPG integrations that occurred after the separation of mouse and rat (“young” PPGs) have different chromosomal locations. (D) The analysis of the distribution of PPGs with different degrees of sequence identity to the parental mRNAs showed that old PPG junctions have less sequence identity to their parental mRNAs compared to the young PPGs.

One of the conditions for using PPGs for the reconstruction of the evolutionary past of retrobiome activity is their stable presence at the same spot within the genome for millions of years after their original integration. This can be verified by comparing the genomes of recently separated species, such as domestic mouse and rat [separated about 25 to 33 million years ago (49)], which are expected to have substantial numbers of PPGs in the same locations in their genomes. To compare PPG genomic localizations, we used the UCSC LiftOver tool (https://genome.ucsc.edu/cgi-bin/hgLiftOver) to align the reference genomes of mouse and rat and Bedtools Intersect (50) to identify junctions located in corresponding evolutionarily conserved regions. Among 15,053 and 12,090 total junctions detected in mouse and rat, respectively, 5,562 appeared to be in the same genomic regions, presumably representing PPGs in the germline of the common ancestor of these species.

Another condition for the feasibility of our approach is the correlation between the evolutionary age of PPGs and their deviation from the ancestral mRNA due to the acquisition of mutations. PPG junctions that are common between mouse and rat represent events that happened before the ancestors of these two rodent species separated, i.e., they are at least 25 to 33 million years of age (Fig. 2C). At the same time, junctions specific for either species would likely represent less ancient evolutionary events that happened after the rat and mouse lineages separated. To test these predictions, we compared the mutation load of PPG junctions common and different between rat and mouse genomes by performing BLAST alignment of each exon–exon junction sequence with its ancestral mRNA. The results of this analysis are shown in Fig. 2D.
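
The comparison described here can be sketched as follows, assuming the set of junctions located in conserved mouse and rat regions has already been determined upstream (for example, with the LiftOver and Bedtools steps mentioned above). The function names and the data layout are hypothetical.

```python
def split_by_conservation(identity_by_junction, shared_junctions):
    """Split percent identities into "old" and "young" junction groups.

    identity_by_junction: dict mapping junction id to its percent identity
        with the ancestral mRNA.
    shared_junctions: ids of junctions found at corresponding positions in
        both species (presumed to predate the mouse/rat split).
    """
    old, young = [], []
    for junction, pident in identity_by_junction.items():
        (old if junction in shared_junctions else young).append(pident)
    return old, young


def fraction_identical(identities):
    """Share of junctions with no mismatches to the parental mRNA."""
    if not identities:
        return 0.0
    return sum(1 for p in identities if p >= 100.0) / len(identities)
```

The prediction being tested is that `fraction_identical` is lower for the old group than for the young group.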

The proportions of PPGs with different degrees of deviation from their ancestral mRNAs have strikingly different patterns for the PPGs found at the same genetic loci in both species compared with those that are specific to either the mouse or the rat genome. While significant proportions of mouse- or rat-specific PPGs showed either complete or near-complete homology to their parental mRNAs, such a category was dramatically underrepresented among PPGs that are common between the two species (Fig. 2D), indicating their evolutionary deviation.

It is noteworthy that the accuracy of PPG detection drops as the degree of deviation of sequences from the ancestral mRNAs increases (see the estimation of this dependence by computer simulation in SI Appendix). Therefore, our analysis is limited to PPGs that deviate from their parental transcripts by no more than 15%.

Multiple Species Analysis and NMR Phenomena.

To assess the potential relevance of retrobiome activity to aging and longevity, we performed the analysis of evolutionary dynamics of PPGs in the reference genome of an extremely long-living rodent, NMR. Being the size of a mouse, NMR lives up to 25 to 30 y with no signs of physiological decline and aging (51, 52). Unlike fibroblasts of rats or mice, fibroblasts of NMR do not undergo senescence following DNA damage (53) and are resistant to oncogenic transformation (54).

Comparison of the distributions of deviations of PPGs from their parental genes showed a dramatic difference between mouse and NMR in the proportion of PPGs with 100% identity to the parental transcript (Fig. 3A): while a significant part of mouse PPGs are close to 100% identity to their parental transcripts, indicating recent integrations of PPGs in the germ line, the genome of NMR has an extremely low proportion of exon–exon junctions identical to their ancestral mRNAs.

Fig. 3.

The comparison of histograms reflects the evolutionary dynamic of acquisition of PPGs in the genomes of mammalian species. (A) Histogram representing the distribution of mismatches in identified PPGs for house mouse (Mus musculus) and NMR (Heterocephalus glaber). (B) Skewness vs. kurtosis for each indicated species indicates a unique position of NMR. (C) Fraction of exon–exon junctions with no mismatches compared to parental genes in the genomes of different mammalian species indicated in (D).

To further establish the position of the NMR among mammalian species, we performed the same analysis in 16 different mammal species (Fig. 3B; see also Materials and Methods). For all of these species, the distribution of mismatches was calculated, and its skewness and kurtosis were estimated to compare these distributions (Fig. 3C). The NMR is characterized by the lowest skewness and the highest kurtosis, which confirms the phenomenon of an exceptionally low number of evolutionarily young (recently acquired) PPGs in NMR compared to other mammals. This observation may help connect low retrobiome activity and the extreme longevity of NMR in further studies.
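
For reference, skewness and kurtosis of a mismatch distribution can be computed from central moments as in the sketch below. The estimator used in the study is not stated in this section, so the choice of population (biased) moments and of excess kurtosis is an assumption, and the function name is hypothetical.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments.

    skewness        = m3 / m2**1.5
    excess kurtosis = m4 / m2**2 - 3
    where m_k is the k-th central moment of the sample.
    """
    n = len(values)
    if n == 0:
        raise ValueError("at least one value is required")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    if m2 == 0:
        raise ValueError("skewness and kurtosis are undefined for constant data")
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0
```

With mismatch counts as input, a distribution dominated by junctions with few or no mismatches has a long right tail and therefore positive skewness.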

Diversity of the PPG Contents Among Inbred Mice.

The dynamics of PPG content in evolutionary history of C57BL/6 mice demonstrates a relatively high rate of new PPG acquisitions during the most recent times suggesting high activity of retrobiome in this laboratory strain (Figs. 2D and 3A). To check this possibility, we conducted WGS for 5 individual C57BL/6 males originating from Charles River colony and analyzed the presence of PPGs using PPG Finder technique. For this purpose, the junction library generated as part of PPG Finder pipeline was used for the alignment of WGS sequencing results and junctions which were present differentially across sequenced samples were further explored (see Materials and Methods and SI Appendix for more details).
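
The selection of junctions that are present differentially across samples can be sketched as a simple set comparison. This is an illustrative sketch only: the per-sample detection step (read alignment and calling a junction present) is assumed to have been done already, and the function name and data layout are hypothetical.

```python
def differential_junctions(presence):
    """Find junctions detected in some, but not all, sequenced samples.

    presence: dict mapping sample name to the set of junction ids
        detected in that sample.
    Returns a dict mapping each differential junction to the sorted list
    of samples in which it was detected.
    """
    if not presence:
        return {}
    detected_anywhere = set().union(*presence.values())
    detected_everywhere = set.intersection(*presence.values())
    return {
        junction: sorted(s for s, found in presence.items() if junction in found)
        for junction in sorted(detected_anywhere - detected_everywhere)
    }
```

Junctions shared by every animal are uninformative for this purpose; only those missing from at least one sample are candidates for recent, still-segregating insertions.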

This analysis revealed, in addition to the previously identified set of C57BL/6 PPGs, four additional PPGs that were not previously detected in mice, namely Sub1, Fbxo22, Rybp, and Zfp40 (see SI Appendix for details). Their presence in mouse genomes was confirmed by PCR using primers designed to distinguish spliced and intron-containing sequences in DNA samples (Fig. 4). Thus, the process of acquisition of new PPG copies indeed continues in current populations of C57BL/6 mice presumably indicative of the ongoing retrobiome expansion activity.

Fig. 4.

PCR detection of new PPGs identified by the analysis of new exon–exon junctions among whole genome sequencing data generated for individual C57BL/6 mice. PCR, using primers and conditions described in Materials and Methods and SI Appendix, Table S2, was performed in a panel of randomly picked C57BL/6 mice. The results indicate the diversity in PPG contents among the individual mice analyzed.

Discussion

Several approaches have been developed and employed for the reconstruction of the evolutionary dynamics of the retrobiome. The earliest methods involved projecting specific subfamilies of retroelements onto the phylogenetic tree, with major branching points determined by paleontological evidence (2, 40, 55). These approaches enabled the estimation of the moment of the primary massive expansion of SINEs back to 65 million years, aligning with times of significant species extinction on Earth. This expansion was followed by a series of explosive amplifications, each leaving behind a new subfamily with characteristic mutations indicative of their origin from a single variant of the same predecessor element (56). The high similarity of protein-coding sequences across all mammalian species starkly contrasts with the complete distinction of SINE families in every mammalian order (10, 13). This observation, coupled with the coincidence of SINE explosions with the appearance of all major mammalian archetypes (11, 12, 57), underscores the significance of the evolutionary dynamics of the retrobiome as a major driver of diversity and a contributor to the formation of the current landscape of mammals.

Despite the above-described advances in understanding the basic principles of retrobiome evolution, its accurate reconstruction remains a challenging task even today, when massive whole-genome sequencing has given us access to numerous genomes. Extremely high copy numbers of retroelements and the close similarity of even those elements that belong to different subfamilies of SINEs and LINEs create significant challenges in accurate genome assembly, definitive assignment of every retroelement to its integration site, and accurate estimation of copy number differences. Many of these obstacles become irrelevant when we switch from the most abundant classes of retroelements to the analysis of a minor component of the retrobiome, namely PPGs. As many as 8,908 and 7,811 PPGs were detected in human and mouse genomes, respectively (58). Distribution of these PPGs along the time axis would likely reflect the relative activity of RTL1 in the course of evolution of a given species if the appearance of new PPGs occurs with higher frequency during the periods of massive retrotranspositions. This is a reasonable presumption provided that every act of expansion of the retrobiome was driven by generally the same RT of L1. Moreover, new PPGs were found in the DNA of tumors (59, 60) that frequently derepress L1, enabling retrotranspositions (61, 62). Combining two age estimation criteria in our analysis (similarity of chromosomal localization between species and the degree of deviation from the ancestral mRNA) increases the accuracy of our conclusions.

We were able to confirm our theoretical prediction regarding the high activity of the retrobiome in the modern mouse genome. This was demonstrated by the recent germline acquisition of new pseudogenes, as evidenced by the variability in the PPG contents within the population of inbred C57BL/6 mice. The fact that laboratory rats and mice have shown an apparent increase in PPG generation rates in their most recent history is not entirely understood. It is possible that this increase is a result of the artificial inbreeding that these laboratory animals have undergone during recent decades, which may have weakened epigenetic control over retrobiome silencing and allowed for its expansion. Despite this uncertainty, it is crucial to note that retrobiome-driven insertional mutagenesis is actively ongoing in populations of laboratory mice that are considered inbred and genetically identical.

The time profile of PPG accumulation in the genome of the NMR differs significantly from that of the mouse or rat. The near complete absence of PPGs that are fully homologous to their ancestral mRNAs suggests a much stricter suppression of retrobiome activity in this species. Interestingly, this apparent retrobiome silencing was preceded by a period of much higher frequency of new PPG formation than that observed in the ancestors of the mouse or rat during the same evolutionary time, indicating a substantial expansion of retrotransposons that occurred before the nearly complete shutdown (Fig. 3). It is reasonable to assume (Fig. 5) that this activity was accompanied by a heavy load of mutations due to genome-destabilizing insertional mutagenesis caused by retrotranspositions and DNA damage caused by L1 endonuclease, potentially resulting in high mortality and malformations. Over time, natural selection favored the survival of organisms that developed strict mechanisms for controlling retrobiome silencing, as observed in the current NMR. It is noteworthy that retrotransposons are less abundant in the DNA of the NMR than in other mammalian species, occupying only 25% of its genome vs. 40% in human, 37% in mouse, and 35% in rat genomes (63). This observation may reflect the lack of retrobiome expansion in NMR during the last several million years.

Fig. 5.

Hypothetical reconstruction of evolutionary events that preceded and drove NMR origin. The comparison of histograms showing the distribution of PPGs with different degrees of deviation from their ancestral mRNAs among various mammalian species indicates a distinct distribution pattern observed for PPGs of NMR, suggesting a unique dynamic of retrobiome activity in the evolutionary history of this species. It involves an exceptionally high frequency of new PPGs with only 93 to 96% identity to their parental mRNAs (old PPGs) and an exceptionally low frequency of PPGs that are identical to their mRNA sequences, which are nearly completely lacking in the NMR genome. This pattern suggests an exceptionally high frequency of retrotranspositions that were ongoing in the ancestors of NMR several million years ago, followed by a complete shutdown of retrobiome activity at a later evolutionary time. One can hypothesize that the period of high activity of retrotransposons could be associated with catastrophic mutagenesis that created conditions for selecting a mechanism that ensured effective negative control of retrobiome activity. Animals that acquired this mechanism became ancestors of the current NMR.

While the current knowledge does not allow us to determine whether the exceptional longevity of the NMR is directly linked to the suppression of retrobiome activity, it presents an intriguing hypothesis that gains support from accumulating evidence about the involvement of activated retrotransposons in age-related inflammation and cellular senescence (32, 33, 64). This opens up the possibility of using pharmacological control of retrotransposons through RT inhibitors for geroprotective applications.

The potential utility of PPG Finder extends beyond tracing the evolutionary dynamics of PPG appearance and finding PPGs missed by other PPG identification tools. Although PseudoPipe and its subsequent modifications (43–45) allow the identification of entire sequences of both duplicated (unprocessed) pseudogenes and PPGs, they are hardly applicable to the identification of new PPGs in raw data obtained via Next Generation Sequencing. In contrast, PPG Finder, as shown in the present work, can detect new PPG formation and can be adapted for the detection of retrotransposition activity of derepressed L1, a phenomenon frequently observed in tumor cells (65) and associated with tumor progression and treatment resistance (66). Identifying exon–exon junctions absent from the germline genome (i.e., the genome of normal cells) could serve as an indicator of ongoing retrotransposition activities. This application is likely feasible for analyzing DNA from liquid biopsies, complementing the array of assays developed for the detection of L1 derepression (67, 68) and holding the potential for advancing cancer diagnostics and detection (69–71).

Materials and Methods

Exon–Exon Junctions Library formation.

To create an exon–exon junction library applicable to various genomes, we utilized information about known transcripts and their exon/intron structure as provided by the University of California Santa Cruz (UCSC) annotation database. This method can be used for any genome for which UCSC provides a compatible GTF file. The process involves the following steps, implemented as Python3 script:

  • 1) Junction Identification: For each transcript of each gene included in the GTF file, we identified the locations of all exon–exon junctions.

  • 2) Sequence Extraction: Around each identified junction, we extracted transcript subregions of a given number N of base pairs (bp) on each side. If a junction was closer than N bp to the start or the end of a transcript, the extraction extended only up to the transcript's start or end.

  • 3) Removing Redundant Sequences: In cases where identical junction subregions occurred in different transcripts, we included only one instance of each sequence in the final library.

  • 4) Unique Naming Convention: Each junction was assigned a unique identifier in the following format:

    Gene_Symbol|Transcript_ID_1|Junction_ID_1|Intron_Length_1|...|Transcript_ID_k|Junction_ID_k|Intron_Length_k,

    where:
    • a) Gene_Symbol: gene identifier.
    • b) Transcript_ID_i: identifier of the ith transcript that includes this particular junction.
    • c) Junction_ID_i: serial number of the junction within the ith transcript.
    • d) Intron_Length_i: the length of the intron associated with this junction before splicing.
    • e) k: the total number of transcripts that share this exact junction sequence.

  • 5) Library Compilation and Archiving: The resulting sequences, representing exon–exon junctions in each known mRNA transcript, are compiled into a library formatted as a FASTA file, making it suitable for various genomic analyses and for transformation into a BLAST database.
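Steps 1 to 5 can be illustrated with a minimal Python sketch. It is not the PPG Finder code: the function names (`junction_windows`, `build_library`, `to_fasta`) and the in-memory input format (gene symbol, transcript ID, exon sequences in transcript order, intron lengths) are assumptions made for illustration, and GTF parsing is omitted.

```python
from collections import OrderedDict

def junction_windows(exons, n):
    """Yield (junction_number, window) for each exon-exon junction.

    exons: exon sequences in transcript (5'->3') order (hypothetical input).
    n: number of bases kept on each side of the junction.
    """
    transcript = "".join(exons)
    pos = 0
    for j, exon in enumerate(exons[:-1], start=1):
        pos += len(exon)                       # junction sits between pos-1 and pos
        left = max(0, pos - n)                 # clip at the transcript start
        right = min(len(transcript), pos + n)  # clip at the transcript end
        yield j, transcript[left:right]

def build_library(transcripts, n=130):
    """Return {window_sequence: identifier}, keeping one copy of identical windows.

    transcripts: iterable of (gene, transcript_id, exons, intron_lengths).
    Assumption: identical windows come from transcripts of the same gene.
    """
    library = OrderedDict()
    for gene, tx_id, exons, intron_lengths in transcripts:
        for j, window in junction_windows(exons, n):
            tag = f"{tx_id}|{j}|{intron_lengths[j - 1]}"
            if window in library:
                library[window] += "|" + tag   # extend the shared identifier
            else:
                library[window] = f"{gene}|{tag}"
    return library

def to_fasta(library):
    """Serialize the library as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for seq, name in library.items())
```

With N = 3 and two toy transcripts of one gene that share their first junction, the shared window is stored once and its identifier lists both transcripts.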

To ensure that only junctions from expressed, active genes were used, the genome annotation file used for junction library construction was additionally filtered to include only reviewed genes with known gene structure. Inferred or modeled genes, which are also part of the UCSC-provided NCBI RefSeq GTF file, were excluded from the analysis.

BLAST Search and Output Analysis.

The FASTA file with the junction library is used to create the BLAST database required for running BLAST searches against the junction sequences. We then execute a BLAST search: the BLASTn command performs a nucleotide sequence search against the specified database and outputs the results in a tabular format. It is configured to report only the top alignment and its description for each query sequence. The output is saved to a designated text file.

The resulting list of BLAST hits is then used for downstream analysis with Python3 scripts. The analysis might include filtering by a number of BLAST hit output parameters (e.g., percentage of identical positions, alignment length, number of mismatches, etc.) or other parameters (e.g., gene name, transcript name, intron length for the junction, etc.) and grouping and/or averaging the results by any set of these parameters.

In all cases, we specifically filtered out all hits corresponding to junctions with an intron length of less than 10 bp, as such an intron length is not biologically plausible and likely does not reflect real splicing behavior in the species.

For any filtered list of hits in a given species, we apply gene-wise averaging of the percentage of identical positions to get the estimate of the average per-gene exon–exon junction quality. This procedure allows us to remove the variation related to the number of exon–exon junctions in transcripts from different genes.

The list of BLAST hits includes hits of varying length. In all cases, only hits with length of at least 85% of junction library sequence length are included in the analysis. This threshold is selected arbitrarily to guarantee that exon sequences on both sides of the junction are covered by the hit.
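The filtering and gene-wise averaging described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: it assumes the default 12-column BLAST tabular layout, that junction identifiers appear in the subject column (because the junction library is the database), and that a junction is dropped if any intron length listed in its identifier is below the cutoff. The function names and the `library_lengths` mapping are hypothetical.

```python
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def parse_hits(lines):
    """Turn tab-separated BLAST lines into dicts with numeric pident/length."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
        row["pident"] = float(row["pident"])
        row["length"] = int(row["length"])
        yield row

def intron_lengths(junction_id):
    """Extract intron lengths from Gene|Tx_1|Junction_1|Intron_1|... identifiers."""
    return [int(x) for x in junction_id.split("|")[3::3]]

def per_gene_identity(hits, library_lengths, junction_col="sseqid",
                      min_cov=0.85, min_intron=10):
    """Average percent identity per gene after the two filters in the text.

    library_lengths: {junction_id: length of the library sequence} (assumed input).
    """
    by_gene = defaultdict(list)
    for hit in hits:
        jid = hit[junction_col]
        if min(intron_lengths(jid)) < min_intron:
            continue                                  # implausibly short intron
        if hit["length"] < min_cov * library_lengths[jid]:
            continue                                  # hit covers <85% of the junction
        by_gene[jid.split("|")[0]].append(hit["pident"])
    return {gene: sum(v) / len(v) for gene, v in by_gene.items()}
```

Averaging within each gene before building histograms keeps genes with many junctions from dominating the distribution, which is the stated purpose of the gene-wise step.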

Selection of the Junction Region Length.

The accuracy of our analysis depends on the length of segments that include exon–exon junctions. On the one hand, they should be long enough to enable detection of mutations that cause their sequences to deviate from those of parental mRNAs. On the other hand, their length is limited by the size of DNA segments read by the NGS technique. To select the number N of base pairs on each side of the junction used for the generation of the junction library, a series of libraries with N ranging from 50 to 250 was generated for the murine genome mm10. The resulting lists of hits were then filtered by length and averaged over genes as described above. The averaged hit quality was used to build distribution histograms (SI Appendix, Fig. S1). This model experiment demonstrated that the distribution of exon–exon junctions among categories with different levels of deviation from the parental mRNAs depends on frame size: the use of shorter frames led, as expected, to a tendency to miss junctions with a higher degree of deviation and to exaggerate the proportion of junctions identical to the parental mRNAs. This bias decreased as the frame size increased and, for frames above 100 bp, had only a minor influence on the distribution of sequences among the categories with different degrees of deviation. For our analysis, we selected 130 bp on each side of the junction as the standard value for junction library generation; this frame enables the detection of exon–exon junctions within the 150 bp DNA reads produced by NGS.

Selection of the Threshold Error Using Computer Simulation.

We assessed the quality of novel PPG identification by PPG Finder using computer simulation of PPG integration. To this end, we simulated the integration of 2,000 junctions generated for 180 random mouse genes, with various fractions of errors (from 0% to 40% random single-nucleotide substitutions), into random locations across the whole murine genome. PPG Finder was then applied to the modified genomes, and the fraction of correctly identified PPGs among the simulated integrations was calculated. We observed a dramatic decrease in detection quality above a 15% error rate, with no integrations detected at error rates ≥30% (SI Appendix, Fig. S2). Thus, we selected a 15% error rate as the threshold for further analysis, which limits our ability to look back in time to about 30 mln years (the mutation rate in nonfunctional DNA sequences in mammals is estimated as ~0.5% per 1 mln years) (68).
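The two quantitative ingredients of this simulation, introducing a given fraction of random substitutions and converting a divergence threshold into a look-back time, can be sketched in a few lines. This is a simplified illustration, not the simulation code used in the study: the function names are hypothetical, each substituted position is assumed to receive a different base, and insertion into a genome and rerunning the search are omitted.

```python
import random

def substitute(seq, error_fraction, rng):
    """Return seq with round(error_fraction * len(seq)) positions substituted.

    Assumption: each chosen position is replaced by a base different from the
    original, so the realized divergence equals the requested fraction.
    """
    seq = list(seq)
    n_sub = round(error_fraction * len(seq))
    for i in rng.sample(range(len(seq)), n_sub):
        seq[i] = rng.choice([b for b in "ACGT" if b != seq[i]])
    return "".join(seq)

def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def lookback_myr(max_divergence_pct, rate_pct_per_myr=0.5):
    """Time horizon (million years) implied by a divergence threshold."""
    return max_divergence_pct / rate_pct_per_myr

rng = random.Random(0)                  # fixed seed for reproducibility
junction = "ACGT" * 65                  # toy 260-bp junction sequence
mutated = substitute(junction, 0.15, rng)
print(identity(junction, mutated))      # 39 of 260 positions changed
print(lookback_myr(15))                 # 15% / 0.5% per mln years
```

The second function makes the arithmetic behind the stated limit explicit: a 15% threshold divided by ~0.5% per million years gives about 30 million years.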

Testing Statistical Significance of Cross-Methods Overlap.

To test the statistical significance of the co-occurrence of PPG locations obtained by different methods, we performed the following test. The whole genome was split into successive 520-bp segments. For each segment, we determined whether it overlapped processed pseudogene exon–exon junctions predicted by PPG Finder or PPGs predicted by PseudoPipe and GENCODE. This analysis resulted in two contingency tables, for PPG Finder predictions vs. GENCODE predictions and for PPG Finder predictions vs. PseudoPipe predictions, respectively. Each table was analyzed using the Fisher exact test under a null hypothesis of independence of the hits. The results of this analysis are provided in SI Appendix, Table S2.

Analysis of PPGs in Different Species.
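A minimal sketch of this segment-based test is shown below. It is an illustration only: the function names and the half-open `(chrom, start, end)` interval format are assumptions, and the one-sided (enrichment) form of the Fisher exact test is used here because the text does not state which alternative was tested.

```python
from math import comb

def segments_hit(intervals, seg=520):
    """Set of (chrom, segment_index) touched by half-open (chrom, start, end) intervals."""
    hit = set()
    for chrom, start, end in intervals:
        for idx in range(start // seg, (end - 1) // seg + 1):
            hit.add((chrom, idx))
    return hit

def contingency(a_segments, b_segments, total_segments):
    """2x2 table: rows = method A hit / no hit, columns = method B hit / no hit."""
    both = len(a_segments & b_segments)
    only_a = len(a_segments) - both
    only_b = len(b_segments) - both
    neither = total_segments - both - only_a - only_b
    return [[both, only_a], [only_b, neither]]

def fisher_greater(table):
    """One-sided Fisher exact p-value: P(overlap >= observed) under independence."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / comb(n, col1)
```

The exact integer arithmetic is convenient for small examples; for genome-scale tables with millions of segments, a library routine such as `scipy.stats.fisher_exact` would be the more practical choice.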

The set of NCBI RefSeq annotated genomes of multiple mammalian species was obtained from the UCSC Genome Browser: https://genome.ucsc.edu/index.html. For each of these species, a library of exon–exon junctions with a total length of 260 bp (130 bp on each side of the junction) was generated. The power of a library was calculated as the number of junctions it contained. Only junctions originating from exons separated by introns longer than 10 bp were taken into consideration. The resulting powers varied between 513 and 438,389 junctions, with a median of 217,314. We then removed from the analysis all species with a library power of less than half of the median value, thus reducing the list of species to 24. For these species, histograms showing the distributions of the per-gene averaged percent identity were generated (SI Appendix, Fig. S3). These distributions were then used to calculate skewness and kurtosis, statistical measures that quantify, respectively, the asymmetry of a distribution and its “tailedness,” i.e., the extent of outliers present in the distribution.
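The species filter and the two shape statistics can be sketched as follows. The text does not specify which estimator was used, so this sketch assumes the simple moment-based (population) definitions and reports excess kurtosis (a normal distribution gives 0). The function names and the toy inputs are hypothetical.

```python
from statistics import median

def filter_by_power(powers):
    """Keep species whose library power is at least half of the median power.

    powers: {species: number of junctions in its library}.
    """
    cutoff = median(powers.values()) / 2
    return {sp: p for sp, p in powers.items() if p >= cutoff}

def skewness_kurtosis(values):
    """Moment-based skewness and excess kurtosis of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance (population)
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Toy per-gene identity values: a long left tail gives negative skewness.
skew, kurt = skewness_kurtosis([100, 100, 99, 99, 98, 90, 86])
print(skew < 0)
```

In this setting, a distribution dominated by near-identical PPGs with a tail of older, diverged copies would be expected to show negative skewness, whereas a genome lacking young PPGs would shift the mass away from 100% identity.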

Identification of Variable PPGs from WGS.

We performed Whole Genome Sequencing of DNA from the lungs of five individual C57BL/6 males originating from the Charles River colony. DNA libraries were prepared using the TruSeq DNA PCR-free preparation kit (Illumina, Inc.) as per the manufacturer’s instructions. Then, 2 μg of each DNA sample was fragmented using Covaris shearing to a size range of 500 to 550 bp. Fragmented DNA was then processed with end repair, 3′ adenylation, and bead purification for fragment size selection. Indexing adapters were ligated to the fragment ends in preparation for flow cell hybridization. Following purification, each DNA library was quantified using qPCR (KAPA Biosystems) prior to normalization. Libraries were sequenced on an Illumina HiSeq2000 sequencer using a Rapid flow cell, and 2 × 250 cycle sequencing was performed according to the manufacturer's recommended protocol (Illumina Inc.). First, reads were aligned to the library of all possible murine 250 bp-long exon–exon junction sequences, representing exon–exon junctions in each known mRNA transcript, which was generated as part of PPG Finder. Most junctions were present either in all samples or in none of them. However, a minority of junctions were represented variably across the studied samples, demonstrating moderate to high coverage in some samples and low to no coverage in others. This could be interpreted as the appearance of unique germline PPGs.

Since our goal was not to identify all of the variable pseudogenes but to demonstrate the fact of recent germline PPG integrations, we selected 4 genes, namely Fbxo22, Rybp, Sub1, and Zfp40. SI Appendix, Fig. S2 shows read coverage for junction sequences normalized by the sample average genome coverage. As an additional illustration, we aligned sequencing reads to specific transcript sequences of the selected genes and calculated per-bp coverage using the bedtools genomecov tool (https://pubmed.ncbi.nlm.nih.gov/20110278/). Such coverage for Fbxo22 is shown in SI Appendix, Fig. S3. The presence of the above-listed PPGs in mouse DNA was confirmed using PCR with primers corresponding to sequences localized in different exons, making the resulting product indicative of the PPG. The primers used and their genomic locations are shown in SI Appendix, Table S3.
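The idea of normalizing junction coverage by each sample's average genome coverage and then looking for junctions present in only some animals can be sketched as below. The thresholds `present` and `absent`, the function names, and the sample and junction labels are hypothetical; the text gives no numeric criteria for "moderate to high" or "low to no" coverage.

```python
def normalize(junction_cov, genome_cov):
    """Divide each junction's read depth by the sample's average genome depth.

    junction_cov: {junction: {sample: mean read depth over the junction}}
    genome_cov:   {sample: average genome-wide read depth}
    """
    return {j: {s: depth / genome_cov[s] for s, depth in per_sample.items()}
            for j, per_sample in junction_cov.items()}

def variable_junctions(norm_cov, present=0.25, absent=0.05):
    """Junctions clearly covered in some samples and essentially uncovered in others.

    present/absent are illustrative cutoffs on normalized coverage, not values
    taken from the study.
    """
    out = []
    for junction, per_sample in norm_cov.items():
        vals = list(per_sample.values())
        if any(v >= present for v in vals) and any(v <= absent for v in vals):
            out.append(junction)
    return out

genome_cov = {"mouse1": 30.0, "mouse2": 30.0}
junction_cov = {
    "Sub1|junction": {"mouse1": 15.0, "mouse2": 0.0},     # present in one animal only
    "Shared|junction": {"mouse1": 15.0, "mouse2": 14.0},  # present in both
}
print(variable_junctions(normalize(junction_cov, genome_cov)))
```

Normalizing by genome-wide depth makes samples sequenced to different depths comparable before any presence or absence call is made.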

Supplementary Material

Appendix 01 (PDF)

pnas.2313581121.sapp.pdf (908.1KB, pdf)

Acknowledgments

We thank Marina Antoch for providing experimental materials, Yu Han and Alan Hutson for advice, and staff of Comparative Oncology and Genomics Shared Resources of Roswell Park Comprehensive Cancer Center, which are supported by the National Cancer Institute Cancer Center Support Grant (NCI P30CA16056). This work was funded by a grant from Roswell Park Alliance Foundation to A.V.G.

Author contributions

A.V.G. designed research; V.K., I.M., D.I.F., and O.V.L. performed research; V.K. and I.M. contributed new reagents/analytic tools; V.K., I.M., O.V.L., I.E.K., and A.V.G. analyzed data; D.I.F. coordinated research; and V.K., I.M., D.I.F., I.E.K., and A.V.G. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

Data, Materials, and Software Availability

PPG Finder is implemented as a suite of bash and Python scripts (72). The results obtained from PPG Finder (SI Appendix, Table S4) have been deposited to Figshare (73). The sequencing data have been deposited in the NCBI Sequence Read Archive (SRA) (74).

References

  • 1. Lander E. S., et al., Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
  • 2. Smit A. F., Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9, 657–663 (1999).
  • 3. Burton F. H., et al., Conservation throughout mammalia and extensive protein-encoding capacity of the highly repeated DNA long interspersed sequence one. J. Mol. Biol. 187, 291–304 (1986).
  • 4. Han J. S., Boeke J. D., LINE-1 retrotransposons: Modulators of quantity and quality of mammalian gene expression? Bioessays 27, 775–784 (2005).
  • 5. Johnson W. E., Origins and evolutionary consequences of ancient endogenous retroviruses. Nat. Rev. Microbiol. 17, 355–370 (2019).
  • 6. Boeke J. D., Stoye J. P., “Retrotransposons, endogenous retroviruses, and the evolution of retroelements” in Retroviruses, Coffin J. M., Hughes S. H., Varmus H. E., Eds. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, 1997).
  • 7. Richardson S. R., et al., The influence of LINE-1 and SINE retrotransposons on mammalian genomes. Microbiol. Spectr. 3, MDNA3-0061-2014 (2015).
  • 8. Brouha B., et al., Hot L1s account for the bulk of retrotransposition in the human population. Proc. Natl. Acad. Sci. U.S.A. 100, 5280–5285 (2003).
  • 9. Wallace N., Wagstaff B. J., Deininger P. L., Roy-Engel A. M., LINE-1 ORF1 protein enhances Alu SINE retrotransposition. Gene 419, 1–6 (2008).
  • 10. Walters-Conte K. B., Johnson D. L., Allard M. W., Pecon-Slattery J., Carnivore-specific SINEs (Can-SINEs): Distribution, evolution, and genomic impact. J. Hered 102, S2–10 (2011).
  • 11. Warren I. A., et al., Evolutionary impact of transposable elements on genomic diversity and lineage-specific innovation in vertebrates. Chromosome Res. 23, 505–531 (2015).
  • 12. Kramerov D. A., Vassetzky N. S., SINEs. Wiley Interdiscip. Rev. RNA 2, 772–786 (2011).
  • 13. Kramerov D. A., Vassetzky N. S., Origin and evolution of SINEs in eukaryotic genomes. Heredity 107, 487–495 (2011).
  • 14. Dewannieux M., Esnault C., Heidmann T., LINE-mediated retrotransposition of marked Alu sequences. Nat. Genet. 35, 41–48 (2003).
  • 15. Deininger P. L., Moran J. V., Batzer M. A., Kazazian H. H. Jr., Mobile elements and mammalian genome evolution. Curr. Opin. Genet. Dev. 13, 651–658 (2003).
  • 16. Jung Y. D., et al., Retroelements: Molecular features and implications for disease. Genes Genet. Syst. 88, 31–43 (2013).
  • 17. Hancks D. C., Kazazian H. H. Jr., Active human retrotransposons: Variation and disease. Curr. Opin. Genet. Dev. 22, 191–203 (2012).
  • 18. Konkel M. K., Walker J. A., Batzer M. A., LINEs and SINEs of primate evolution. Evol. Anthropol. 19, 236–249 (2010).
  • 19. Rodic N., et al., Long interspersed element-1 protein expression is a hallmark of many human cancers. Am. J. Pathol. 184, 1280–1286 (2014).
  • 20. Ardeljan D., Taylor M. S., Ting D. T., Burns K. H., The human long interspersed element-1 retrotransposon: An emerging biomarker of neoplasia. Clin. Chem. 63, 816–822 (2017).
  • 21. Cardelli M., The epigenetic alterations of endogenous retroelements in aging. Mech. Ageing Dev. 174, 30–46 (2018).
  • 22. De Cecco M., et al., L1 drives IFN in senescent cells and promotes age-associated inflammation. Nature 566, 73–78 (2019).
  • 23. Simon M., et al., LINE1 derepression in aged wild-type and SIRT6-deficient mice drives inflammation. Cell Metab. 29, 871–885.e875 (2019).
  • 24. Ferreira R., Naguibneva I., Pritchard L. L., Ait-Si-Ali S., Harel-Bellan A., The Rb/chromatin connection and epigenetic control: Opinion. Oncogene 20, 3128–3133 (2001).
  • 25. Xie H., et al., Epigenomic analysis of Alu repeats in human ependymomas. Proc. Natl. Acad. Sci. U.S.A. 107, 6952–6957 (2010).
  • 26. Ariumi Y., Guardian of the human genome: Host defense mechanisms against LINE-1 retrotransposition. Front. Chem. 4, 28 (2016).
  • 27. Kuramochi-Miyagawa S., et al., DNA methylation of retrotransposon genes is regulated by Piwi family members MILI and MIWI2 in murine fetal testes. Genes Dev. 22, 908–917 (2008).
  • 28. Montoya-Durango D. E., et al., Epigenetic control of mammalian LINE-1 retrotransposon by retinoblastoma proteins. Mutat. Res. 665, 20–28 (2009).
  • 29. Teneng I., Montoya-Durango D. E., Quertermous J. L., Lacy M. E., Ramos K. S., Reactivation of L1 retrotransposon by benzo(a)pyrene involves complex genetic and epigenetic regulation. Epigenetics 6, 355–367 (2011).
  • 30. Montoya-Durango D. E., et al., LINE-1 silencing by retinoblastoma proteins is effected through the nucleosomal and remodeling deacetylase multiprotein complex. BMC Cancer 16, 38 (2016).
  • 31. Van Meter M., et al., SIRT6 represses LINE1 retrotransposons by ribosylating KAP1 but this repression fails with stress and age. Nat. Commun. 5, 5011 (2014).
  • 32. Li L., et al., SIRT7 is a histone desuccinylase that functionally links to chromatin compaction and genome stability. Nat. Commun. 7, 12235 (2016).
  • 33. Leonova K. I., et al., p53 cooperates with DNA methylation and a suicidal interferon response to maintain epigenetic silencing of repeats and noncoding RNAs. Proc. Natl. Acad. Sci. U.S.A. 110, E89–98 (2013).
  • 34. Tiwari B., et al., p53 directly represses human LINE1 transposons. Genes Dev. 34, 1439–1451 (2020).
  • 35. Pizarro J. G., Cristofari G., Post-transcriptional control of LINE-1 retrotransposition by cellular host factors in somatic cells. Front. Cell Dev. Biol. 4, 14 (2016).
  • 36. Feng Y., Goubran M. H., Follack T. B., Chelico L., Deamination-independent restriction of LINE-1 retrotransposition by APOBEC3H. Sci. Rep. 7, 10881 (2017).
  • 37. Pezic D., Manakov S. A., Sachidanandam R., Aravin A. A., piRNA pathway targets active LINE1 elements to establish the repressive H3K9me3 mark in germ cells. Genes Dev. 28, 1410–1428 (2014).
  • 38. Goodier J. L., Restricting retrotransposons: A review. Mob. DNA 7, 16 (2016).
  • 39. Dewannieux M., Heidmann T., LINEs, SINEs and processed pseudogenes: Parasitic strategies for genome modeling. Cytogenet. Genome Res. 110, 35–48 (2005).
  • 40. Zhang Z., Harrison P. M., Liu Y., Gerstein M., Millions of years of evolution preserved: A comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13, 2541–2558 (2003).
  • 41. Lewis K. N., et al., Unraveling the message: Insights into comparative genomics of the naked mole-rat. Mamm. Genome 27, 259–278 (2016).
  • 42. Harrison P. M., Computational methods for pseudogene annotation based on sequence homology. Methods Mol. Biol. 2324, 35–48 (2021).
  • 43. Zhang Z., et al., PseudoPipe: An automated pseudogene identification pipeline. Bioinformatics 22, 1437–1439 (2006).
  • 44. Ortutay C., Vihinen M., PseudoGeneQuest—Service for identification of different pseudogene types in the human genome. BMC Bioinf. 9, 299 (2008).
  • 45. Zheng D., Gerstein M. B., A computational approach for identifying pseudogenes in the ENCODE regions. Genome Biol. 7, 10–11 (2006).
  • 46. Baertsch R., et al., Retrocopy contributions to the evolution of the human genome. BMC Genomics 9, 466 (2008), 10.1186/1471-2164-9-466.
  • 47. Pei B., et al., The GENCODE pseudogene resource. Genome Biol. 13, 1–26 (2012).
  • 48. Zheng D., et al., Pseudogenes in the ENCODE regions: Consensus annotation, analysis of transcription, and evolution. Genome Res. 17, 839–851 (2007).
  • 49. Nei M., Xu P., Glazko G., Estimation of divergence times from multiprotein sequences for a few mammalian species and several distantly related organisms. Proc. Natl. Acad. Sci. U.S.A. 98, 2497–2502 (2001).
  • 50. Quinlan A. R., BEDTools: The Swiss-army tool for genome feature analysis. Curr. Protoc. Bioinf. 47, 11.12.1-34 (2014).
  • 51. Buffenstein R., Negligible senescence in the longest living rodent, the naked mole-rat: Insights from a successfully aging species. J. Comp. Physiol. B 178, 439–445 (2008).
  • 52. Lagunas-Rangel F. A., Chavez-Valencia V., Learning of nature: The curious case of the naked mole rat. Mech. Ageing Dev. 164, 76–81 (2017).
  • 53. Zhao Y., et al., Naked mole rats can undergo developmental, oncogene-induced and DNA damage-induced cellular senescence. Proc. Natl. Acad. Sci. U.S.A. 115, 1801–1806 (2018).
  • 54. Liang S., et al., Resistance to experimental tumorigenesis in cells of a long-lived mammal, the naked mole-rat (Heterocephalus glaber). Aging Cell 9, 626–635 (2010).
  • 55. Eickbush T. M., “Origin and Evolution of retrotransposons” in Mobile DNA, Nancy R. C., Craig L., Gellert M., Lambowitz A. M., Eds. (ASM Press, Washington, DC, 2007), vol. II, chap. 49, 10.1128/9781555817954.ch49.
  • 56. Ullu E., Weiner A. M., Human genes and pseudogenes for the 7SL RNA component of signal recognition particle. EMBO J. 3, 3303–3310 (1984).
  • 57. Kim T. M., Hong S. J., Rhyu M. G., Periodic explosive expansion of human retroelements associated with the evolution of the hominoid primate. J. Korean Med. Sci. 19, 177–185 (2004).
  • 58. Sisu C., et al., Comparative analysis of pseudogenes across three phyla. Proc. Natl. Acad. Sci. U.S.A. 111, 13361–13366 (2014).
  • 59. Cooke S. L., et al., Processed pseudogenes acquired somatically during cancer development. Nat. Commun. 5, 3644 (2014).
  • 60. Kazazian H. H. Jr., Processed pseudogene insertions in somatic cells. Mob. DNA 5, 20 (2014).
  • 61. Solyom S., et al., Extensive somatic L1 retrotransposition in colorectal tumors. Genome Res. 22, 2328–2338 (2012).
  • 62. Tang Z., et al., Human transposon insertion profiling: Analysis, visualization and identification of somatic LINE-1 insertions in ovarian cancer. Proc. Natl. Acad. Sci. U.S.A. 114, E733–E740 (2017).
  • 63. Kim E. B., et al., Genome sequencing reveals insights into physiology and longevity of the naked mole rat. Nature 479, 223–227 (2011).
  • 64. Gorbunova V., et al., The role of retrotransposable elements in ageing and age-associated diseases. Nature 596, 43–53 (2021).
  • 65. Burns K. H., Transposable elements in cancer. Nat. Rev. Cancer 17, 415–424 (2017).
  • 66. McKerrow W., et al., LINE-1 expression in cancer correlates with p53 mutation, copy number alteration, and S phase checkpoint. Proc. Natl. Acad. Sci. U.S.A. 119, e2115999119 (2022).
  • 67. Taylor M. S., et al., Ultrasensitive detection of circulating LINE-1 ORF1p as a specific multicancer biomarker. Cancer Discov. 13, 2532–2547 (2023).
  • 68. Vylegzhanina A. V., et al., Cancer relevance of circulating antibodies against LINE-1 antigens in humans. Cancer Res. Commun. 3, 2256–2267 (2023).
  • 69. Kelsey M. M. G., Reconsidering LINE-1’s role in cancer: Does LINE-1 function as a reporter detecting early cancer-associated epigenetic signatures? Evol. Med. Public Health 9, 78–82 (2021).
  • 70. Ponomaryova A. A., et al., Aberrant methylation of LINE-1 transposable elements: A search for cancer biomarkers. Cells 9, 2017 (2020).
  • 71. Gezer U., Bronkhorst A. J., Holdenrieder S., The utility of repetitive cell-free DNA in cancer liquid biopsies. Diagnostics (Basel) 12, 1363 (2022).
  • 72. Kogan V., et al., PPG Finder. GitHub. https://github.com/gudkovlab/ppgfinder. Deposited 27 December 2023.
  • 73. Gudkov A., Sequences and chromosomal localization coordinates of exon-exon junctions found in mouse genome ‘mm10’ by PPG Finder. Figshare. 10.6084/m9.figshare.27116971.v1. Deposited 27 September 2024.
  • 74. Gudkov A., Diversity of the PPG contents among inbred mice. NCBI Sequence Read Archive (SRA). https://www.ncbi.nlm.nih.gov/sra/PRJNA1165451. Deposited 19 September 2024.


Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences
