Skip to main content
Nucleic Acids Research logoLink to Nucleic Acids Research
. 2014 Dec 15;43(4):e22. doi: 10.1093/nar/gku1250

T-lex2: genotyping, frequency estimation and re-annotation of transposable elements using single or pooled next-generation sequencing data

Anna-Sophie Fiston-Lavier 1,2,*, Maite G Barrón 3,4, Dmitri A Petrov 1, Josefa González 4
PMCID: PMC4344482  PMID: 25510498

Abstract

Transposable elements (TEs) constitute the most active, diverse and ancient component in a broad range of genomes. Complete understanding of genome function and evolution cannot be achieved without a thorough understanding of TE impact and biology. However, in-depth analysis of TEs still represents a challenge due to the repetitive nature of these genomic entities. In this work, we present a broadly applicable and flexible tool: T-lex2. T-lex2 is the only available software that allows routine, automatic and accurate genotyping of individual TE insertions and estimation of their population frequencies both using individual strain and pooled next-generation sequencing data. Furthermore, T-lex2 also assesses the quality of the calls allowing the identification of miss-annotated TEs and providing the necessary information to re-annotate them. The flexible and customizable design of T-lex2 allows running it in any genome and for any type of TE insertion. Here, we tested the fidelity of T-lex2 using the fly and human genomes. Overall, T-lex2 represents a significant improvement in our ability to analyze the contribution of TEs to genome function and evolution as well as learning about the biology of TEs. T-lex2 is freely available online at http://sourceforge.net/projects/tlex.

INTRODUCTION

The key question in genomics is how genomes vary and evolve at both large and fine scales. In order to answer such question, we need to be able to study genome variation, i.e. identifying and analyzing functionally relevant single nucleotide polymorphisms and structural variants, both within and between populations. Next-generation sequencing (NGS) technology has revolutionized this field by allowing one to study variation of a large number of individuals and even cells within individuals (13). Unfortunately, NGS technology generally provides fairly short, even if extremely abundant, sequencing reads and thus is not perfectly designed for the study of structural variants (4).

The study of one class of structural variants, called transposable elements (TEs), is particularly difficult to carry out using NGS data. TEs are repetitive, ubiquitous and dynamic components of genomes that often vary in genomic location among members of the same species. TEs are classified in three main orders: DNA transposons, long-terminal repeat (LTR) elements and non-LTR elements, and each TE order is represented by several TE families (5,6). This classification highlights the biological diversity in terms of sequence, dynamics and evolution of these repetitive elements that makes their analysis challenging.

TEs represent a large part of most of the eukaryotes genomes (7). A recent study suggests that more than two-thirds of the human genome is composed of TEs (8), and in plants, TEs may represent up to 90% of the genome (9). TEs are responsible for a large number of mutations both in populations and somatically within individuals (10,11). Although most of the TE-induced mutations are deleterious, evidence for an adaptive role of TE-induced mutations is starting to accumulate (1216). TEs can provide active promoter, splice site or terminator features that can affect the expression, structure and function of nearby genes (17). TEs are also involved in the creation of transcriptional regulatory networks, and in the generation of chromosomal rearrangements (12,18). Thus, knowing the role of TEs on genome dynamics and evolution, it is crucial to identify and quantify their impact (19,20). Given the increased number of sequenced genomes, a tool that allows routine and automatic genotyping of individual TE insertions using a large number of NGS samples is needed.

Recently, several accurate computational approaches have been developed to analyze TE insertions using NGS data (10,2131). Designed for specific projects, all these tools were built with a limited number of features, e.g. some tools were designed to specifically call particular families of TEs (10), while others were designed exclusively for pooled NGS data (26). Unfortunately, none of the currently available tools allows a complete and in-depth analysis of individual annotated TE insertions.

With the idea of providing a tool to automatically call TE insertions in multiple genomes and to estimate their population frequency, we recently designed an integrated, flexible pipeline called T-lex (31). While most of the available approaches only detect the presence of TE insertions, T-lex launches two complementary and distinct TE detection modules: one to detect the ‘presence’ and another to detect the ‘absence’ of a TE insertion. Both detection approaches are based on the analysis of the junction sequences of a TE insertion and its flanking sequences as defined by genome annotation. Thus, the accuracy of the TE calls mainly depends on the quality of the annotation and more specifically on the delimitation of the individual TE insertions. Unfortunately, the detection of the precise TE insertion sites is still one of the main challenges of the TE annotation process (32). The veracity of the TE calls also depends on the genomic environment of the TE insertion. For instance, a TE insertion located inside a duplication with one of the copies lacking the TE insertion can be miscalled as heterozygote (i.e., both present and absent). In addition, a TE insertion flanked by other repeats, such as low-complexity regions or other TE sequences, can also be miscalled.

Here, we present T-lex2, the version 2 of the original T-lex pipeline (31). Besides improving the accuracy of the TE frequency estimate, T-lex2 now assesses the quality of the TE calls. To achieve this, T-lex2 uses information from (i) a new module specifically designed to identify target site duplications (TSDs) and (ii) the genomic context of each insertion, to identify putatively miss-annotated TE insertions highly likely to produce wrong calls. TSDs are short duplications, from 2 to 20 base pairs (bp), flanking the TE sequences as a result of the transposition mechanism of most TE families. While current TSD detection approaches for reference TE insertions require knowing the biology of the TEs and are limited to LTR or non-LTR elements, T-lex2 TSD detection module works without a priori knowledge, and for all type of TEs. This new module allows precisely delimiting the location of each TE insertion. In this new version, frequency of TE insertions can be estimated using individual and pooled NGS data.

We provide evidence that the new features of T-lex2 help improving the genotyping and frequency estimate of TE insertions, and allow re-annotating TE locations using both individual and pooled Drosophila melanogaster genomic data. We also demonstrate that T-lex2 provides accurate TE calls in genomes other than that of D. melanogaster by analyzing TE insertions.

MATERIALS AND METHODS

T-lex2 pipeline overview

T-lex2 is composed of five modules that can be run with individual strain or pooled NGS data (Figure 1). (i) The TE-analysis module investigates the flanking regions of the known TE insertions (i.e., annotated in a reference genome) to identify those likely to return wrong calls. (ii) The TE-presence detection and (iii) the TE-absence detection modules combine mapping and read depth coverage information to identify reads providing evidence for the presence and for the absence, respectively, of the known TE insertions. (iv) Then the TE-combine module combines the results of the presence and absence detection modules to genotype TEs and/or to estimate their frequencies in populations. Finally, (v) the TE-TSD detection module annotates TSDs in an unbiased and accurate way. Each of these modules can be launched independently and provides a large number of options allowing the user to carry out flexible and customizable analyses. A detailed manual describing step-by-step how to run T-lex2 and listing all T-lex2 options is available online at http://petrov.stanford.edu/cgi-bin/Tlex2_manual.html.

Figure 1.

Figure 1.

T-lex2 pipeline. Schematic representation of the five different T-lex modules (dark gray boxes) and of the input files (white boxes) required to run T-lex2. SD, segmental duplication.

TE-analysis module

This module analyses the flanking sequences of each known TE insertion to identify features that might interfere with the presence and/or absence detection. The TE-analysis module (i) provides information about the presence of repetitive elements using RepeatMasker (http://www.repeatmasker.org), (ii) identifies miss-annotated poly(A) tails and (iii) flags the TE insertions that are part of segmental duplications.

To identify the presence of repetitive elements, the TE-analysis module extracts the flanking sequences of each TE insertion (125 bp by default; see option ‘-f’) and launches RepeatMasker to annotate the TEs, simple repeats and low-complexity regions (http://www.repeatmasker.org). The option ‘-s’ allows specification of the name of the model organism and thus specifies the repeat library that needs to be used by RepeatMasker (http://www.repeatmasker.org). If at least one flanking region shows a repeat density greater than the pre-specified value (50% by default; option ‘-d’), the TE insertion is flagged as flanked by repetitive elements. By default, these TE insertions are filtered out. This filtering step can be bypassed using the option ‘-noFilterTE’.

A new feature of the TE-analysis module identifies putatively miss-annotated poly(A) tails by searching for stretches of A's or T's (by default more than 5 bp) at the TE junctions, without a priori knowledge of the TE type. Non-LTR elements harbor poly(A) tails at their 3′-end that are known to be highly variable in length. Such sequences are therefore very difficult to annotate automatically. As RepeatMasker only annotates repeats longer than 20 bp, it cannot be used to identify short stretches of A's or T's that may correspond to miss-annotated poly(A) tail sequences (http://www.repeatmasker.org).

When a TE insertion is part of duplication, other copies of the same duplication may not contain the insertion. The analysis of the flanking sequences of such TE insertion may lead to call this TE as present and absent while it should only be called as present. We added one new feature in the analysis of the TE flanking regions to identify TE insertions fully or partially part of segmental duplications. For each TE insertion, the two extracted flanking sequences are BLATed against a reference genome (33) (BLAT v. 34 default parameters; http://www.soe.ucsc.edu/∼kent). When more than 50% of a flanking sequence matches at more than one location on the reference genome with more than 90% of sequence identity, the TE insertion is flagged as part of a duplication. TE insertions partially part of duplication (notified as ‘sd_left’ or ‘sd_right’) are distinguished from the TE fully part of a duplication (notified as ‘sd’). When a TE is fully part of a duplication while another copies do not contain the TE insertion, this TE insertion is notified as ‘sd_noTE’.

The analysis of the TE flanking sequences including the RepeatMasker output, whether the TE is flanked by a longer than annotated poly(A) tail, and the segmental duplication analysis results, are stored in a subdirectory called ‘Tanalyses’.

TE-presence detection module

The presence detection module detects sequencing reads overlapping the flanking regions of the TE insertion (Figure 2A). The junction sequences of each TE insertion are extracted and used as reference sequences. By default, each TE junction sequence encompasses the terminal TE sequence (60 bp from each end) and the flanking sequence (1 kb on each side). The lengths of the TE sequence and the flanking sequence can be changed using the options ‘-b’ and ‘-j’, respectively. The reads are mapped on these new extracted reference sequences (Figure 2A) and are used to build a contig using Phrap (version 1.090518; http://phrap.org). T-lex2 uses only the reads mapping with a minimum Phrep quality score (34 by default; this can be changed using the option ‘-minQ’). The pipeline also reports the number of reads mapped on the TE junction at the minimum Phrep quality score (Figure 2A). Additionally, T-lex2 also requires a minimum length of the match inside and outside the TE sequence, and a minimum read sequence identity (Figure 2A). These two parameters are set at 15 bp and 95% by default and can be modified using the options ‘-limp’ and ‘-id’, respectively. By default, multiple alignments of the contigs for each TE side are also returned. The analysis and TE calls from this module are stored in a subdirectory called ‘Tpresence’.

Figure 2.

Figure 2.

T-lex2 TE-presence detection and TE-absence detection modules. (A) TE-presence detection module is based on the mapping of the NGS reads on the TE insertion junctions. (B) TE-absence detection module is based on the mapping of NGS reads on the putative ancestral genomic sequence prior the TE insertion. Input files required to run each module (white boxes), the different steps of the pipeline (light gray boxes) and examples of reads providing evidence for the presence, absence and no call reads are depicted.

TE-absence detection module

The absence module detects reads overlapping both sides of the TE insertion, i.e. reads spanning across the TE insertion site (Figure 2B). It starts by extracting the two flanking sequences for each TE insertion (125 bp each by default; see option ‘–f’). The two sequences are then concatenated and the new constructed sequence is used as the reference sequence for the absence detection (Figure 2B). This new constructed sequence approximates the ancestral sequence prior to the TE insertion. However, note that traces of the TE insertion mechanism, called TSDs, may also be encompassed in the new constructed sequence and located at the TE insertion breakpoint. The length of the extracted sequences should be longer than the reads themselves and total length of the new constructed sequence should be similar to the insert size of the paired-end (PE) data in order to get reads and/or pairs spanning the TE insertion breakpoints. T-lex2 uses now by default a length of 125 bp matches for libraries of 100 bp read length and an insert size of 250 bp. The reads are then mapped on the new constructed sequence using SHRiMP2 (34) (version 2.2.1, October 2011), the new version of SHRiMP that can now handle PE read sequencing data. SHRiMP2, as SHRiMP, was specifically designed to handle long gaps and polymorphisms (34,35). Such features allow the mapping of reads despite the presence of the TSDs or despite miss-annotation. To handle long gaps, SHRiMP2 is launched with a Smith–Waterman gap open score of −40 and a gap extension score of −1 for both query and reference sequences. We also allow higher divergence with a Smith–Waterman mismatch score of −20. SHRiMP2 maps the reads in every pair together (34) (see option ‘-pairends’). If a pair does not map, SHRiMP2 attempts to map each read individually (34). If the two reads of a pair map to the new constructed reference sequence, only the pairs with reads mapping on opposite strands are selected (i.e., concordant pairs). When the two reads from the pair overlap in the mapping, the absence module merges them and considers the pair as a single read. The absence module specifically looks for ‘reads’ spanning the TE junction with at least 15 bp (parameters by default; see option ‘–v’) of overlap on each side. T-lex2 then runs RepeatMasker on the selected reads (see option ‘-noint’) in order to test whether the mapping is due primarily to simple repeats or low-complexity regions (http://www.repeatmasker.org). After this step, T-lex2 selects the reads that have at least five non-repetitive (low-complexity or satellites) base pairs at each end that both map to the flanking regions. Because of the stringency of our approach, even a single read mapping to both flanking sides and thereby spanning the insertion site of the TE is sufficient for the absence detection module to classify the TE insertion as ‘absent’ (Figure 2B). If no reads overlap the TE junction, two possibilities are considered: (i) the TE is present in the strain (or fixed in the pool sample) or (ii) the coverage of the data used is insufficient to detect reads providing evidence for the absence. In order to distinguish between these two possibilities, the absence detection module checks whether reads map on both flanking sides but fail to map over the junction. If this is the case, given that the flanking regions are similar in length to the reads themselves, the module concludes that the TE is present.

If reads do not map to both flanking sides and do not map over the junction site, the module concludes that the coverage is insufficient and returns a ‘no call’ as a result. By default, the multiple alignments of the contigs for each TE breakpoint are also returned. The analysis and TE calls from this module are stored in a subdirectory called ‘Tabsence’.

TE-combine module

The results from the presence and absence detection approaches can then be combined to return definitive TE detection calls. We defined five TE call categories based on the evidence for presence and absence. Briefly, a TE is called ‘present’ or ‘absent’ when the calls from both detection modules are congruent. A TE is called as ‘polymorphic’ when this TE is clearly detected as present and absent (i.e., the presence detection module detects it as present and the absence detection module detects it as absent). A TE is called as ‘present/polymorphic’ when the presence detection module detects it as present while the absence detection module fails to return a call. A TE is called as ‘absent/polymorphic’ if the TE is detected as absent by the absence detection approach, while the presence detection fails (see Table 1 in (31)).

T-lex2 can also combine the TE calls from several NGS data from the same population and can estimate the population frequency of each annotated TE insertion (see options ‘–combRes’ and ‘-combine’). To estimate the frequency, T-lex2 transforms the calls in frequencies as follows: a ‘present’ call is associated with a 100% frequency, and absent call is associated with a zero percent frequency, a ‘polymorphic’ call is associated with a 50% frequency (100% (present) + 0% (polymorphic)/2), a ‘present/polymorphic’ call is associated with a 75% frequency (100% + 50%/2) and an ‘absent/polymorphic’ call is associated with a 25% frequency (0% + 50%/2).

Finally, T-lex2 pipeline can also estimate the TE frequency using pooled NGS data by taking into account the number of reads supporting the presence and the number of reads supporting the absence. In this case, the user just needs to add in the command line the option ‘-pooled’. This approach is based on two assumptions: (i) individuals in the pooled sample are equally represented and (ii) the read-depth coverage is correlated with the occurrence of a TE sequence in a population. Based on such assumptions, we expect to observe a positive correlation between the frequency and the number of reads providing evidence for the presence. We also expect to observe a negative correlation between the frequency and the number of reads providing evidence for the absence. The number of copies of TEs from the same family may also have an impact on the number of reads mapping at the TE breakpoints suggesting that the number of reads supporting the presence can be more biased than the number of reads supporting the absence. Taking all these expectations into account, we designed and verified a metric based on the local read depth coverage in the vicinity of the TE insertion to estimate TE frequencies from pooled data: total number of reads supporting the presence (NP) divided by the total number of reads supporting the presence (NP) and supporting the absence (NA), i.e. TE frequency = NP/(NP + NA).

The file combining the TE calls from both TE detection modules is called ‘Tresults’. The TE frequency estimates are stored in the ‘Tfreq’ or Tfreq_pooled’ if the option ‘-pooled’ is specified.

TE-TSD detection module

To detect the TSD of each annotated TE insertion, this module analyzes the read alignments generated by the absence detection module (Figure 3). As all T-lex2 modules, this new module can be launched independently although it does require the read alignments from the absence detection module (see option ‘-tsd’). The TSD detection module looks for tandem duplications located on the absence reference sequence spanning the TE breakpoint. It starts by assembling all the selected reads supporting the absence for each TE using the Phrap program (http://phrap.org; Figure 3). Because Phrap requires a minimum of three sequences to build a contig, if fewer than three reads are selected to support the absence call, the reads are considered independently (Figure 3). Each contig (or read) is then re-aligned on the absence reference sequence using BLAT v.34 program (33) (default parameters; Figure 3). Only the absence calls with a gap larger than two base pairs and located close to the TE breakpoint are selected and analyzed. If no gaps are observed, the TE-TSD detection module returns ‘no Gap’. If a gap is observed, the motif present in front of the gap is called the putative target site or PTS (Figure 3). Using the FastaGrep program (available from bioinfo.ut.ee/download/dl.php?file=6 and executed with the default parameters), a tool that looks for short and conserved motifs, the TSD detection module looks for the copy of the PTS on the other side of the breakpoint. Based on the sensitivity of this program and the short length of the TSDs, we only trust the matches with more than 80% sequence identity (Figure 3). T-lex2 then sorts the FastaGrep hits to identify the sequence showing the closer and highly conserved motif. If FastaGrep fails to detect a match, the TSD detection returns ‘noTSD’ as a result, otherwise, the TSD detection module returns ‘detected’. In the latter case, the PTS and its closest copy sequences are also returned. Because several contigs can be generated for the same TE insertion, several TSDs can also be returned. The results of the TSD detection process are stored in a file called ‘TSDannot’.

Figure 3.

Figure 3.

T-lex2 TE-TSD detection module. Schematic representation of the procedure to identify TSDs. Input files required to run this module and the different steps followed by the pipeline are also depicted.

TE-TSD detection module flags the TE insertions for which the TSDs cannot be clearly defined. TSD detection may fail for old TE insertions for which the TSD is too divergent, for truncated TEs, and when the boundaries of the TE insertions are not well annotated (Figure 4). The latter TEs can be re-annotated by the user after manual inspection.

Figure 4.

Figure 4.

Detection of putatively miss-annotated TE insertions. (A) Schematic representation of a correctly annotated TE (I), a TE longer than annotated (II) and a TE shorter than annotated (III). (B) Effect of miss-annotation on the TE calls when the TE insertion is present and absent. When the TE insertion is absent, T-lex2 will give a correct call if the miss-annotation is short and an erroneous call in the miss-annotation is long (B.h, B.i, B.k and B.l). (1) If the miss-annotated region is longer than the maximum read length, selected reads will not overlap the real TE junction(s). These reads will be mapped inside the TE sequence. Consequently, the number of selected reads will depend on the number of TE copies from the same family. (2) If the miss-annotated region is longer than the maximum read length, selected reads will fully map on the TE flanking region. The number of selected reads will be a proxy of the local read depth coverage.

Re-annotation of TE insertions

Because both TE detection approaches are based on the mapping of NGS reads to the flanking TE sequences, the accuracy of the TE calls depends on having a correct TE annotation. TE-analysis and TE-TSD detection modules help identifying putatively miss-annotated TE insertions and thus putatively wrong T-lex2 calls.

A particular TE insertion can be miss-annotated because it is longer than annotated (Figure 4A.II) or shorter than annotated (Figure 4A.III), in both cases the TE-TSD detection will fail. If the TE insertion is present in a given genome, both the presence detection and the absence detection modules will provide the correct call even though the TE is longer or shorter than initially annotated (Figure 4B). However, if the TE insertion is absent, and the TE is miss-annotated the presence and the absence modules could return a wrong call (Figure 4B). For example, if the TE sequence is longer than the official annotation and the TE is absent, the gap in the reads providing the absence could be longer than the TSD size and the module could return present as a call (Figure 4B.k). On the other hand, when the TE is shorter than annotated and the TE is absent, the gap will be present in the reference sequence and not in the reads. In such situation, the module could return present as a call (Figure 4B.l).

The Tfreq_combine file of T-lex2 provides information from both the TE-analyses and TE-TSD detection modules, i.e. poly(A) tail longer than the official annotation and TSD detection failures, that allows classifying these TE insertions as putatively miss-annotated TEs. This information can be used to perform manual curation of these insertions and re-annotate them.

Validation data set

T-lex2 was validated in two data sets, one from D. melanogaster and one from human. All the NGS data used in this work was generated using the Illumina technology.

Veracity of T-lex2 calls was assessed by calling a total of 755 well-studied TE insertions in D. melanogaster North America strains from the Drosophila Genome Reference Panel (DGRP) project (36,37). Estimates of TE frequency based on pooled-polymerase chain reaction (PCR) using other North America strains are available for these 755 TEs in a different data set (37,38). TE frequency estimates based on pooled-PCR approach were categorized in four classes: ‘very rare’ (i.e., TE frequency <1.5%), ‘rare’ (i.e., TE frequency ∼2–15% but not absent), ‘common’ (i.e., TE frequency ∼10–98%) and ‘fixed’ (i.e., TE frequency >98%) (37). The D. melanogaster reference genome (release 5) was downloaded from Flybase website (http://flybase.org). The annotations of the TE insertions were extracted from the release 5.43 (http://flybase.org). The resequencing data from 86 D. melanogaster DGRP single strains from the freeze 2 were downloaded from the DGRP website (∼20×; http://dgrp.gnets.ncsu.edu/, Supplementary Table S1). We also called the same TE insertions in the DGRP population using a high-depth sequencing data from a pool of 92 D. melanogaster DGRP strains (∼60×) (39). Eighty-four strains are common in the two data sets (Supplementary Table S1). To investigate the effect of different read-depth coverage and different number of flies in pooled samples, we also used T-lex2 to analyzed TE frequencies in four other pools of lower depth coverage described in Zhu et al. (39).

Experimental validation of T-lex2 results was also done for 11 TE insertions known to be polymorphic (38) in 24 single D. melanogaster North American DGRP strains using a single PCR approach (Supplementary Table S1). The PCR conditions and primers used were as described in Gonzalez et al. (38).

In humans, the performance of T-lex2 was evaluated using high-depth NGS data for a CEU individual (NA12878) and the TE annotations from the Mobile Element Insertions (MEI) catalog from the 1000 Genomes Project (40). The human reference genome used was Mar 2006 NCBI36/hg18. Out of the 2010 reference MEI reported in Steward et al. (40), we analyzed 1549 insertions that (i) have genomic coordinates with confidence intervals smaller or equal to 1 according to Steward et al., (ii) could be unambiguously mapped to RepeatMasker annotations within a range of ±100 bp and (iii) have been analyzed in the CEU trio data set. RepeatMasker chromosome, strand and coordinates of these 1549 TEs were used to build the TE annotation file and TE list that was used as input for T-lex2. All the TE insertions analyzed have been previously validated by PCR and/or algorithms. However, all the TEs were called either as present or absent while T-lex2 classifies the insertions as present, present/polymorphic, polymorphic, absent/polymorphic and absent.

Statistical analysis

All the statistical analyses were performed using R (http://cran.r-project.org/). Sensitivity and specificity were estimated using individual PCR frequencies as we have the T-lex2 frequencies estimates using the same exact strains. Sensitivity and specificity were computed such as: (i) Sensitivity = number of correct presence calls/total number of presence calls in the validation data set and (ii) Specificity = number of correct absence calls/total number of absence calls in the validation data set. For the human data set, the number of correct absence calls is not known because both polymorphic and absence calls are detected as absent in the validation data set. Thus, we used the positive predictive value (i.e., number of correct presence calls based on T-lex2/total number of presence calls) as a proxy for specificity and the false discovery rate (i.e., number of false presence calls based on T-lex2/total number of presence calls) as proxy for the T-lex2 call precision.

RESULTS

T-lex2 is an improved and expanded version of T-lex

T-lex was originally designed to genotype and estimate frequencies of the known TE insertions in individual strain NGS data (31). The new version of T-lex, called T-lex2, incorporates numerous new features in order to improve genotyping and frequency estimation of known TE insertions (Table 1). T-lex2 assesses the quality of the calls by analyzing the genomic context of each TE insertion and detecting their TSDs that allows identifying miss-annotated TE insertions. T-lex2 takes into account the sequencing technology advances and can now handle PE sequencing and longer reads. Unlike the previous version, T-lex2 can be run for individual strains and for pooled NGS for the TE frequency estimates (Table 1).

Table 1. T-lex2 is an improved and expanded version of T-lex.

graphic file with name gku1250tbl1.jpg

Accurate TE frequency estimates using NGS data from D. melanogaster individual strains

To test the quality of the T-lex2 calls in D. melanogaster, we compared T-lex2 TE frequency estimates with (i) frequency estimates previously obtained using a pooled-PCR approach (3738,41) and (ii) frequency estimates obtained in this work by individual strain PCRs.

We first launched the pipeline to estimate the frequency of 755 TE insertions for which we have an estimate of their population frequencies based on a pooled-PCR approach (3738,41). We run T-lex2 on 86 single DGRP strains (Supplementary Table S1; see Materials and Methods). The TE call distribution, i.e. proportion of TEs called as present, absent, polymorphic, present/polymorphic, absent/polymorphic and no calls, is consistent among the strains, except for 14 strains that differ drastically (Supplementary Figure S1). For 11 out of the 14 strains, we observed more than 40% of no calls, and for the other three strains (RAL-379, RAL-362 and RAL-313), more than 70% of TE insertions were detected as present. As we suspect these 14 strains to be problematic (most likely a combination of poor sequence quality and extensive residual heterozygosity), we decided to remove them from our analysis and analyze the calls in the remaining 72 strains (Supplementary Figure S1).

Out of the 755 TE insertions, TE analysis module detected six that were flanked by other TE insertions, and three that were part of segmental duplications (Supplementary Table S2). We removed these nine insertions because they are likely to produce wrong calls (see Materials and Methods). We then filtered out seven TEs for which T-lex2 failed to return any calls, and 30 TE insertions for which T-lex2 fails to return a call for more than 75% of the strains (Supplementary Table S2). All of these removed TE insertions, 46 in total, are likely to be miss-annotated as we failed to call them in all or most of the strains. We thus estimated the population frequencies of 709 TE insertions using 72 strains. Note also that 65 out of the 709 TEs were identified in a first T-lex2 run as miss-annotated based on results of the TE-TSD detection module (see below). We manually re-annotated these 65 TE insertions, and we have re-run T-lex2 to obtain accurate frequency estimates of these TEs.

T-lex2 frequency estimates of the 709 TEs in 72 strains are significantly and positively correlated with previous estimates based on a pooled-PCR approach (Figure 5A: Spearman's ρ = 0.87, P-value << 0.001). The vast majority of the very rare and fixed TE insertions are correctly detected by T-lex2 with 92.5% (357/386) of the very rare TE insertions absent and 85% (91/107) of the fixed TE insertions fixed in the population (Figure 5A). As the pooled-PCR and T-lex2 frequencies categories were not build the same way, we expected to observe intermediate profiles for the rare and common TE insertions. As expected, most of the rare TE insertions (114/146) are detected with a zero or low frequency, while most of the common TE insertions (67/70) appear frequently or fixed (Figure 5A).

Figure 5.

Figure 5.

TE frequency estimates using D. melanogaster single strains. (A) Comparison of the frequency estimates of 709 TE insertions using T-lex2 and pooled PCR approach. (B) Frequency estimates of 11 TE insertions using T-lex2 and single PCR approach.

Overall, we identified only 1.5% (11/709) of clear discrepant estimates between pooled-PCR and T-lex2 frequency estimates. However, after manual curation, we suspected all these TE insertions to be miss-classified by the experimental approach or miss-annotated. For example, insertion FBti0020042 is classified as present at very low frequency by T-lex2 but is detected at high frequency using the pooled-PCR approach. However, a single-strain PCR approach confirms that this TE is present at a low frequency as estimated using T-lex2 (38). The reclassification of this TE insertion, allows to conclude that T-lex2 returns accurate TE frequencies with only ∼1% of wrong estimates.

To further test the accuracy of T-lex2 calls, we experimentally determined the frequency of 11 well-studied TE insertions known to be polymorphic and thus providing all types of calls (38). We then randomly selected 24 strains part of the bigger set of 86 DGRP strains (Supplementary Table S1) and performed single-strain PCR for each of the 11 analyzed TEs. We experimentally called the 11 TE insertions in the same 24 DGRP strains (Supplementary Table S3). We first compared the frequency estimates for the 11 TE insertions obtained by T-lex2 using all the DGRP strains (86) and the subset of 24 strains. This analysis confirms the robustness of the frequency estimates using the subset of 24 strains (Figure 5B: Pearson's ρ = 0.96, P-value << 0.001). Out of the 264 calls (11 detections in 24 strains), we removed 53 for which we failed to get a call experimentally or using T-lex2 and ended up analyzing 211 TE calls. We observed only 11 calls that were different between the experimental data and the T-lex2 calls (Supplementary Table S3). This represents 5.2% of putatively wrong calls that is close to the error rate previously estimated for the TE detection using PCR or the previous version of T-lex, around 5% in both cases (31,37). We finally conservatively estimated a high TE call accuracy with 99.14% of sensitivity and 89.58% of specificity (see Materials and Methods). Thus, we overall observed accurate TE frequency estimates comparing the T-lex2 estimates with the pooled-PCR and the individual PCR experimental estimates.

Accurate TE population frequency estimates using D. melanogaster pooled sequencing data

We called the presence/absence of the 709 selected TE insertions using data from a pool of 92 DGRP strains, containing the 86 single DGRP individual strains previously used (39) (Supplementary Table S1). For 29 out of the 709 selected TE insertions, T-lex2 failed to return a call using the pooled sample, most probably due to low read coverage in these particular regions of the genome. We thus ended up estimating the population frequency of 680 TE insertions (Supplementary Table S2). As expected, we found that the number of reads providing evidence for the absence is negatively correlated with the experimental TE frequency estimates (Figure 6A: Spearman's ρ = −0.60, P-value << 0.001), while the number of reads providing evidence for the presence is positively correlated (Figure 6B: Spearman's ρ = 0.87, P-value << 0.001). Interestingly, the number of reads supporting the presence does not seem to be strongly biased by the number of copies of TEs from the same family indicating the accuracy of our approach (see Materials and Methods).

Figure 6.

Figure 6.

TE frequency estimates using D. melanogaster pooled samples. (A) Number of reads supporting the absence and (B) number of reads supporting the presence for each of the TE frequency classes. (C) T-lex2 frequency estimates using pooled sample versus TE frequency classes experimentally determined using a pooled-PCR approach. (D) Comparison of T-lex2 frequency estimates using single and pooled NGS data.

We observed significant and strong positive correlations between T-lex2 TE frequency estimates using the pooled sequencing data and the pooled-PCR determined frequencies (Figure 6C: Spearman's ρ = 0.87, P-value << 0.001) and between the T-lex2 TE frequency estimates using the pooled sequencing data and the T-lex2 estimates using single strains (Figure 6D: Pearson's ρ = 0.98, P-value << 0.001). This finding supports that the TE population frequency can be estimated with accuracy using pooled NGS sample only based on the read depth coverage.

For 25% of the TEs, all of them polymorphic, we noted an overestimation of the TE frequency using the pooled sample compared to the estimates using single strains (Figure 6D). However, for the majority of TEs the difference between the two estimates is <10%, which can be explained by the sampling effect when the pool and the individual strain libraries are constructed. To get accurate frequency estimates, each selected individual should be equally represented in the pooled library. Unfortunately, those conditions are hard to set up or control during library construction. For 7% of the TEs, the discrepancy between the two estimates is >10% and the pool estimate is consistently higher than the individual strain estimate. However, we run T-lex2 in a different set of individual and pooled strains and although 5% of the TEs showed discrepant frequencies estimates between pooled and individual samples, these discrepancies were both over- and underestimations (Supplementary Figure S2). These results suggest that T-lex2 frequency estimation is not biased and that the observed overestimation for the DGRP strains is likely due to some particular feature of this data set, such as sequencing coverage or number of strains used in pooled versus individual strains.

To assess the effect of the sampling and coverage on the TE frequency estimates using pooled data, we compared the TE frequency estimates using five different pooled data sets (Table 2). The read depth coverage varies from 20× to ∼60× (39), the number of strains from 42 to 92 and the number of flies sampled varies from 42 to 134. With only 42 flies pooled and a minimum coverage of 20×, the correlation is strongly positive (Table 2). The increase in the number of flies only does not seem to significantly improve the TE frequency estimates (Table 2). However, the increase of the coverage but not the number of strains improves significantly the TE frequency estimates (Table 2). Thus, the coverage matters more than the number of flies or strains pooled in the TE population frequency estimation using pooled sequencing samples. Read-depth coverage of 20× appears sufficient to obtain good estimates although the increase of the coverage allows reducing the number of errors in the calling.

Table 2. Sampling effect on the population frequency estimates of 680 TE insertions in D. melanogaster.

Sample identifier Read depth coverage Nb. of strains pooled Nb. of flies pooled % (Nb.) of no calls Pearson's ρ [IC95]
B1 20× 42 42 3.82 (26) 0.968 [0.963–0.973]
B4 20× 50 50 4.70 (32) 0.977 [0.973–0.980]
B3 40× 42 84 0.88 (6) 0.980 [0.977–0.983]
B5 40× 92 92 0.88 (6) 0.981 [0.978–0.983]
B3–B4 60× 92 134 0 (0) 0.985 [0.982–0.987]

T-lex2 detection of TSDs is unbiased and accurate

The new TE-TSD detection module of T-lex2 allows the automatic annotation of TSDs without a priori knowledge of the biology of the TEs. While only one copy of the TSD is present in the genome without the TE insertion (i.e., the real TE absence), two TSD copies should be observed in the new constructed reference sequence used by T-lex2 for the absence detection (i.e., the TE absence reference sequence constructed by computational removal of the TE sequence from the genome). Based on this expectation, the new T-lex2 module post-processes the results from the absence module to analyze the alignments of the reads spanning the TE insertion site to identify the TSDs. This new module looks for the expected missing sequence in the alignment flanking the TE insertion site (Figure 3). If the missing sequence corresponds strictly to the TSD, another copy of this TSD should be detected in the vicinity of the missing sequence. Contrary to most of the TSD detection approaches (4244), no expected TSD length is required for T-lex2 TSD detection.

Out of the 587 TEs with at least one read providing evidence for the absence, T-lex2 detects TSDs for 390 of them. Comparing with previous TSD annotation (45,46), we confirmed the veracity of our approach (Table 3). TSD length ranges from two to 19 bp with an average and a median of five base pairs (Supplementary Figure S3). TSDs of the LTR retroelements are consistently short: four base pairs long on average. TSD lengths of DNA elements are also consistent but longer: seven base pairs on average. On the contrary, TSDs of the non-LTR retroelements (LINE) are not conserved within families and vary in length from four to 19 bp (Table 3). Overall, we identified for the first time TSDs for ten TE families: six DNA transposon families and four non-LTR families (Table 3).

Table 3. TSD detection for the 53 TE families analyzed in D. melanogaster.

T-lex2
TE order TE superfamily TE family Nb. of TEs with TSD detected TSD consensus motif(1) Min. length (bp) Max. length (bp) Known TSD length (bp)
DNA hAT H 11 ACCCACAG 8 8 8(2)
DNA Mariner/Tc1 FB 7 AAAACCGTC 9 9 ND
DNA Mariner/Tc1 S2 1 AT 2 2 ND
DNA P-element 1360 36 GTYYYG 6 7 7(2)
DNA Transib hopper 7 CAATG 5 5 5(2)
DNA Transib transib2 5 CAVTG 5 5 5(2)
LINE Jockey BS 12 x 4 12 ND
LINE Jockey Doc 14 x 4 12 ND
LINE Jockey F-element 6 x 9 13 ∼10(3)
LINE Jockey G2 4 x 5 12 ∼10(3)
LINE I-element I-element 2 x 15 19 ND
LINE Jockey Ivk 1 CACATCATCTTAT 13 13 ∼10(4)
LINE Jockey jockey 7 x 7 15 ND
LINE Jockey Juan 1 ACGAAACAACATTA 14 14 ND
LINE R1 Rt1a 2 x 6 14 ND
LINE R1 Rt1b 1 GCTATC 6 6 ND
LINE Jockey X-element 1 TTTTGAAA 8 8 ND
LTR copia 1731 1 TAAAT 5 5 4-6(5)
LTR copia copia 21 AAAAT 5 5 4-6(5)
LTR Gypsy 17.6 3 ATAT 4 4 4-6(5)
LTR Gypsy 297 20 ATAT 4 4 4(2)
LTR Gypsy 412 16 AAAC 4 4 4(2)
LTR Gypsy blood 19 AACC 4 4 4-6(5)
LTR Gypsy Burdock 9 TRYA 4 4 4(2)
LTR Gypsy diver 6 AAGGG 5 5 4-6(5)
LTR Gypsy flea 9 TXHA 4 4 4-6(5)
LTR Gypsy gtwin 1 TACA 4 4 4(2)
LTR Gypsy gypsy 1 TACA 4 4 4(2)
LTR Gypsy gypsy2 1 ATA 3 3 4(2)
LTR Gypsy gypsy5 2 CGCG 4 4 4(2)
LTR Gypsy HMS-Beagle 3 TGYA 4 4 4(2)
LTR Gypsy HMS-Beagle2 2 x 4 7 4-6(5)
LTR Gypsy invader1 2 ATG 3 4 4-6(5)
LTR Gypsy invader3 1 AAAT 4 4 4(2)
LTR Gypsy invader4 1 TATA 4 4 4-6(5)
LTR Gypsy invader6 1 TATA 4 4 4-6(5)
LTR Gypsy McClintock 2 YATAT 5 5 4-6(5)
LTR Gypsy mdg1 10 AAGG 4 4 4(2)
LTR Gypsy mdg3 9 AGTA 4 4 4(2)
LTR Gypsy opus 3 TAA 3 4 4(2)
LTR Gypsy Quasimodo 5 ATAT 4 4 4(2)
LTR Gypsy rover 1 ATAT 4 4 4-6(5)
LTR Gypsy springer 1 GTA 3 3 4-6(5)
LTR Gypsy Stalker 4 CAAT 4 4 4-6(5)
LTR Gypsy Stalker2 4 AAAT 4 4 4(2)
LTR Gypsy Stalker4 2 ATAC 4 4 4-6(5)
LTR Gypsy Tabor 2 CCTG 4 4 4(2)
LTR Gypsy Tirant 15 ACG 2 4 2(2)
LTR Gypsy Transpac 4 ATAT 4 4 4(2)
LTR Pao 3S18 4 GYRDT 5 5 4-6(5)
LTR Pao Max 2 CCTCA 5 5 5(2)
LTR Pao roo 84 AACAG 4 7 5(2)
LTR Pao rooA 1 ACAAT 5 5 4-6(5)

ND: Notdetermined. (1)TSD motifs represented in the standard IUB/IUPAC nucleic acid codes. The sign “x” indicates that no consensus can be reported due to a lack of motifs or too variable motifs. (2)(21), (3)(45), (4)(46), (5)(5).

To assess the quality of the TSD annotation using T-lex2, we compared 148 TSD results obtained by T-lex2 with the annotations obtained with LTRharvest software (42) (Supplementary Table S4). For 92 out of the 148 TE insertions, the TSD annotations are identical in length and motif. For 7 TEs, TSDs have the same length but differ in motif (Supplementary Table S4) and for the other 49 TEs TSD motifs are identical but differ in length: 7 TEs have shorter motifs and 42 TEs have longer motifs. All the longer TSDs correspond to shifts of only 1 bp except for one TSD that is 3 bp longer than previously annotated. For 13 of these TEs, corresponding to Copia elements, T-lex2 results are supported by experimental studies indicating that Copia TSD are 5 bp long (47,48) and not 4 bp as detected with LTRHarvest (Supplementary Table S4) (42).

TE-TSD detection module did not identify TSDs for 197 TEs out of the 587 with at least one read providing evidence for the absence. TSD detection may fail for old TE insertions for which the TSD is too divergent, for truncated TEs, and when the boundaries of the TE insertions are not well annotated.

T-lex2 allows re-annotation of TEs in D. melanogaster

Combining outputs from the TE-analysis module and the TE-TSD annotation, T-lex2 allows manual curation and re-annotation of TE insertions. To facilitate the identification of the putatively miss-annotated TEs, information on TEs for which the TSD detection failed and TEs with poly(A) tail longer than annotated, are also given in ‘Tfreq_combine’ file. TSD detection fails for TEs that are longer (Figure 4A.II) and shorter (Figure 4A.III) than annotated. We thus manually curated the 197 TEs for which TSD could not be detected. For 65 out of the 197 TE insertions, manual curation confirms the miss-annotation. All these 65 TEs are detected experimentally and by T-lex2 at very low frequency (Supplementary Table S5). Sixty-four of the 65 miss-annotated TE insertions are longer than annotated and they all correspond to non-LTR TE insertions: 33 have a longer poly(A) tail than annotated, and the other 32 TEs have one or the two junctions miss-annotated. The increase in length of these TEs ranges from 5 to 40 bp (12 bp on average; Supplementary Table S5). The only TE insertion detected as shorter than annotated is an LTR (FBti0020118; Supplementary Table S5). Thus, T-lex2 allows identifying miss-annotated TEs and manual curation allows re-annotating ∼10% (65/709) of the analyzed TEs.

T-lex2 provides accurate TE presence/absence detection in human data

We also tested the accuracy of T-lex2 in the human genome, a genome with a very different TE composition and TE dynamics compared to that of the Drosophila genome. We used a curated subset of the MEI catalog (see Materials and Methods) (40). Out of the 1549 insertions analyzed, 755 TEs were excluded based on the results of the TE-analysis module because the repeat density in their flanking regions was higher than 50%. Twelve additional TEs were removed from the final data set because T-lex2 returned ‘no data’ as a call. Thus, T-lex2 yielded highly accurate TE calls for a total of 782 TEs (Supplementary Table S6). We compared T-lex2 results with available validation data for the same subset of TEs obtained by PCR and a combination of different software (Table 4). T-lex2 results show an overall of 96,9% matches, which is higher than that obtained by the read pair and split read methodology used by Steward et al. (40) (93%). T-lex2 has a high TE call accuracy in human data with 97.65% sensitivity and 93.26% positive predictive value, a proxy for specificity (see Materials and Methods). Finally, false discovery rate was low: 6.74% relative to the PCR and/or assembly validated calls from Steward et al. (40).

Table 4. Comparison between T-lex2 results and MEI catalog in humans.

graphic file with name gku1250tbl4.jpg

DISCUSSION

We present here the new version of the T-lex pipeline: T-lex2 (Figure 1). This integrated Perl computational pipeline designed to analyze TE insertions in NGS data is freely available and user-friendly since only one command line is necessary to run it (see http://petrov.stanford.edu/cgi-bin/Tlex.html or http://sourceforge.net/projects/tlex). T-lex2 is also flexible. The pipeline is composed of five distinct modules that can be launched independently and each one of them includes a large number of options that allows performing flexible and customizable analyses (see T-lex2 manual at http://petrov.stanford.edu/cgi-bin/Tlex2_manual.html). T-lex2 runs with single-end and/or PE reads and can be customized for NGS data sets with different read lengths, read quality and read depth. Thus, T-lex is not only able to work with the different NGS data sets already available, but also with upcoming data sets resulting from improvements in the NGS technology.

Briefly, T-lex2 has improved and/or expanded the four modules present in T-lex and has incorporated a new module that allows detecting TSDs in an unbiased and accurate way (Table 1). Besides individual strain NGS data, T-lex2 can now estimate TE frequencies from pooled NGS data. This is an important new feature of T-lex2 since sequencing pooled samples is an inexpensive and efficient means of obtaining population genome data (26,39). Another important new feature of T-lex2 is the ability to identify TEs likely to return wrong calls, i.e. miss-annotated TEs, TEs part of segmental duplications and TEs flanked by other TEs. This has been accomplished by the expansion of the TE-filter module of T-lex and by the incorporation of the TSD detection module (Figure 1). Being able to identify TEs likely to give wrong calls is especially important for the analyses of TE population frequencies in genomes that do not have a high-quality TE annotation. Note that even in an extremely high quality genome as the D. melanogaster one, we re-annotated ∼10% of the analyzed TEs. This result also supports the opportunities offered by the NGS data to improve TE annotations (49).

We tested the reliability of T-lex2 calls and frequency estimates using the fly and the human genome. Indeed, the pipeline can be used for any species for which NGS data and TE annotation is available. T-lex2 can be run for all type of TEs in any species because (i) it is a flexible pipeline that allows to customize runs for NGS data sets of different qualities, (ii) it is able to identify putatively wrong calls and re-annotate miss-annotate TEs and (iii) no information about the biology of the TE is needed to get accurate frequency estimates.

There are other available tools that can be used to estimate TE frequencies (10,2130). However, T-lex (version 1 and 2) is to date the only tool that (i) combines presence and absence detection analysis that allows the identification of heterozygotes and/or polymorphic TE insertions (31); (ii) can be used for individual and pooled samples; (iii) provides the user with an output file with the frequency of the analyzed TEs; (iv) can be easily customized to be run with different NGS data sets in any organism; and (v) assess the quality of the calls allowing to re-annotate miss-annotated TEs. Thus overall, T-lex2 is the most broadly applicable and flexible tool available to date.

The flexibility of T-lex2, exemplified by the incorporation of the TE-TSD detection module in this new version, will allow us to easily add new modules in near future. We are currently working on a new module designed to identify TEs not annotated in reference genomes. However, T-lex2 can be used to genotype and estimate frequencies of de novo TEs identified with already available software, such as RetroSeq (23). The ability of T-lex2 to re-annotate TEs would be specially useful to study this de novo TEs as their junctions might not be as well annotated as the ones in the reference genomes.

Overall, we are providing a versatile tool that allows exploring the impact of TEs on genome evolution as well as learning about TE biology. By analyzing the frequency of TEs in different populations, we will be able to determine which proportion of the TE-induced mutations are subject to strong purifying selection or are likely to be evolving under positive selection. Additionally, accurate annotation of TE insertions will allow us to study, for example, the target site preferences of different TE families. The availability of a tool, such as T-lex2, improves our ability to explore TE dynamics and TE biology in an increasing number of species. This is important, as we cannot hope to understand genome structure and function without a thorough understanding of its most active, diverse and ancient component.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

SUPPLEMENTARY DATA

Acknowledgments

We would like to thank Lidia Mateo for technical support and all the members of the Petrov lab for their comments. We also would like to thank the ScaleGenomics (http://scalegen.com/) and Proclus (Stanford University; http://www.stanford.edu/) computational platforms that we used to investigate this study.

FUNDING

National Institutes of Health [R01GM089926 to D.A.P.]; European Commission [PCIG-GA-2011-293860 to J.G.]; Spanish Government [BFU-2011-24397 to J.G.]; France-Stanford center 2013–2014 [to D.A.P. and A.S.F.L.]. PIF fellowship from the Universitat Autònoma de Barcelona [ to M.G.B.]. Ramón y Cajal fellow [RYC-2010-07306 to J.G.]. Funding for open access charge: National Institutes of Health [R01GM089926 to D.A.P.]; European Commission [PCIG-GA-2011-293860 to J.G.]; Spanish Government [BFU-2011-24397 to J.G.]; France-Stanford center 2013–2014 [to D.A.P. and A.S.F.L.]. PIF fellowship from the Universitat Autònoma de Barcelona [ to M.G.B.]. Ramón y Cajal fellow [RYC-2010-07306 to J.G.].

Conflict of interest statement. None declared.

REFERENCES

  • 1.Gilad Y., Pritchard J.K., Thornton K. Characterizing natural variation using next-generation sequencing technologies. Trends Genet. 2009;25:463–471. doi: 10.1016/j.tig.2009.09.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Koboldt D.C., Steinberg K.M., Larson D.E., Wilson R.K., Mardis E.R. The next-generation sequencing revolution and its impact on genomics. Cell. 2013;155:27–38. doi: 10.1016/j.cell.2013.09.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Shapiro E., Biezuner T., Linnarsson S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nat. Rev. Genet. 2013;14:618–630. doi: 10.1038/nrg3542. [DOI] [PubMed] [Google Scholar]
  • 4.Treangen T.J., Salzberg S.L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 2012;13:36–46. doi: 10.1038/nrg3117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Wicker T., Sabot F., Hua-Van A., Bennetzen J.L., Capy P., Chalhoub B., Flavell A., Leroy P., Morgante M., Panaud O., et al. A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 2007;8:973–982. doi: 10.1038/nrg2165. [DOI] [PubMed] [Google Scholar]
  • 6.Kapitonov V.V., Jurka J. A universal classification of eukaryotic transposable elements implemented in Repbase. Nat. Rev. Genet. 2008;9:411–412. doi: 10.1038/nrg2165-c1. author reply 414. [DOI] [PubMed] [Google Scholar]
  • 7.Biemont C. A brief history of the status of transposable elements: from junk DNA to major players in evolution. Genetics. 2010;186:1085–1093. doi: 10.1534/genetics.110.124180. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.de Koning A.P., Gu W., Castoe T.A., Batzer M.A., Pollock D.D. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 2011;7:e1002384. doi: 10.1371/journal.pgen.1002384. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Devos K.M., Beales J., Ogihara Y., Doust A.N. Comparative sequence analysis of the phytochrome C gene and its upstream region in allohexaploid wheat reveals new data on the evolution of its three constituent genomes. Plant Mol. Biol. 2005;58:625–641. doi: 10.1007/s11103-005-6801-z. [DOI] [PubMed] [Google Scholar]
  • 10.Ewing A.D., Kazazian H.H., Jr Whole-genome resequencing allows detection of many rare LINE-1 insertion alleles in humans. Genome Res. 2011;21:985–990. doi: 10.1101/gr.114777.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Iskow R.C., McCabe M.T., Mills R.E., Torene S., Pittard W.S., Neuwald A.F., Van Meir E.G., Vertino P.M., Devine S.E. Natural mutagenesis of human genomes by endogenous retrotransposons. Cell. 2010;141:1253–1261. doi: 10.1016/j.cell.2010.05.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Cowley M., Oakey R.J. Transposable elements re-wire and fine-tune the transcriptome. PLoS Genet. 2013;9:e1003234. doi: 10.1371/journal.pgen.1003234. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Casacuberta E., Gonzalez J. The impact of transposable elements in environmental adaptation. Mol. Ecol. 2013;22:1503–1517. doi: 10.1111/mec.12170. [DOI] [PubMed] [Google Scholar]
  • 14.Magwire M.M., Bayer F., Webster C.L., Cao C., Jiggins F.M. Successive increases in the resistance of Drosophila to viral infection through a transposon insertion followed by a Duplication. PLoS Genet. 2011;7:e1002337. doi: 10.1371/journal.pgen.1002337. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Guio L., Barron M.G., Gonzalez J. The transposable element Bari-Jheh mediates oxidative stress response in Drosophila. Mol. Ecol. 2014;23:2020–2030. doi: 10.1111/mec.12711. [DOI] [PubMed] [Google Scholar]
  • 16.Mateo L., Ullastres A., Gonzalez J. A transposable element insertion confers xenobiotic resistance in Drosophila. PLoS Genet. 2014;10:e1004560. doi: 10.1371/journal.pgen.1004560. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Akagi K., Li J., Symer D.E. How do mammalian transposons induce genetic variation? A conceptual framework: the age, structure, allele frequency, and genome context of transposable elements may define their wide-ranging biological impacts. Bioessays. 2013;35:397–407. doi: 10.1002/bies.201200133. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Chenais B., Caruso A., Hiard S., Casse N. The impact of transposable elements on eukaryotic genomes: from genome size increase to genetic adaptation to stressful environments. Gene. 2012;509:7–15. doi: 10.1016/j.gene.2012.07.042. [DOI] [PubMed] [Google Scholar]
  • 19.Hua-Van A., Le Rouzic A., Boutin T.S., Filee J., Capy P. The struggle for life of the genome's selfish architects. Biol. Direct. 2011;6:19. doi: 10.1186/1745-6150-6-19. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Makalowski W., Pande A., Gotea V., Makalowska I. Transposable elements and their identification. Methods Mol. Biol. 2012;855:337–359. doi: 10.1007/978-1-61779-582-4_12. [DOI] [PubMed] [Google Scholar]
  • 21.Linheiro R.S., Bergman C.M. Whole genome resequencing reveals natural target site preferences of transposable elements in Drosophila melanogaster. PloS ONE. 2012;7:e30008. doi: 10.1371/journal.pone.0030008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Hormozdiari F., Hajirasouliha I., Dao P., Hach F., Yorukoglu D., Alkan C., Eichler E.E., Sahinalp S.C. Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics. 2010;26:i350–i357. doi: 10.1093/bioinformatics/btq216. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Keane T.M., Wong K., Adams D.J. RetroSeq: transposable element discovery from next-generation sequencing data. Bioinformatics. 2013;29:389–390. doi: 10.1093/bioinformatics/bts697. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Lee E., Iskow R., Yang L., Gokcumen O., Haseley P., Luquette L.J., 3rd, Lohr J.G., Harris C.C., Ding L., Wilson R.K., et al. Landscape of somatic retrotransposition in human cancers. Science. 2012;337:967–971. doi: 10.1126/science.1222077. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Robb S.M., Lu L., Valencia E., Burnette J.M., 3rd, Okumoto Y., Wessler S.R., Stajich J.E. The use of RelocaTE and unassembled short reads to produce high-resolution snapshots of transposable element generated diversity in rice. G3. 2013;3:949–957. doi: 10.1534/g3.112.005348. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Kofler R., Betancourt A.J., Schlotterer C. Sequencing of pooled DNA samples (Pool-Seq) uncovers complex dynamics of transposable element insertions in Drosophila melanogaster. PLoS Genet. 2012;8:e1002487. doi: 10.1371/journal.pgen.1002487. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Wu J., Lee W.P., Ward A., Walker J.A., Konkel M.K., Batzer M.A., Marth G.T. Tangram: a comprehensive toolbox for mobile element insertion detection. BMC Genom. 2014;15:795. doi: 10.1186/1471-2164-15-795. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Zhuang J., Wang J., Theurkauf W., Weng Z. TEMP: a computational method for analyzing transposable element polymorphism in populations. Nucleic Acids Res. 2014;42:6826–6838. doi: 10.1093/nar/gku323. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.J.M. Cridland., Thornton K.R. Validation of rearrangement break points identified by paired-end sequencing in natural populations of Drosophila melanogaster. Genome biol. evol. 2010;2:83–101. doi: 10.1093/gbe/evq001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Platzer A., Nizhynska V., Long Q. TE-Locate: A Tool to Locate and Group Transposable Element Occurrences Using Paired-End Next-Generation Sequencing Data. Biology. 2012;1:395–410. doi: 10.3390/biology1020395. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Fiston-Lavier A.S., Carrigan M., Petrov D.A., Gonzalez J. T-lex: a program for fast and accurate assessment of transposable element presence using next-generation sequencing data. Nucleic Acids Res. 2011;39:e36. doi: 10.1093/nar/gkq1291. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Lerat E. Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity (Edinb) 2010;104:520–533. doi: 10.1038/hdy.2009.165. [DOI] [PubMed] [Google Scholar]
  • 33.Kent W.J. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.David M., Dzamba M., Lister D., Ilie L., Brudno M. SHRiMP2: sensitive yet practical SHort Read Mapping. Bioinformatics. 2011;27:1011–1012. doi: 10.1093/bioinformatics/btr046. [DOI] [PubMed] [Google Scholar]
  • 35.Rumble S.M., Lacroute P., Dalca A.V., Fiume M., Sidow A., Brudno M. SHRiMP: accurate mapping of short color-space reads. PLoS Comput. Biol. 2009;5:e1000386. doi: 10.1371/journal.pcbi.1000386. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Mackay T.F., Richards S., Stone E.A., Barbadilla A., Ayroles J.F., Zhu D., Casillas S., Han Y., Magwire M.M., Cridland J.M., et al. The Drosophila melanogaster Genetic Reference Panel. Nature. 2012;482:173–178. doi: 10.1038/nature10811. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Petrov D.A., Fiston-Lavier A.S., Lipatov M., Lenkov K., Gonzalez J. Population genomics of transposable elements in Drosophila melanogaster. Mol. Biol. Evol. 2011;28:1633–1644. doi: 10.1093/molbev/msq337. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Gonzalez J., Lenkov K., Lipatov M., Macpherson J.M., Petrov D.A. High rate of recent transposable element-induced adaptation in Drosophila melanogaster. PLoS Biol. 2008;6:e251. doi: 10.1371/journal.pbio.0060251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Zhu Y., Bergland A.O., Gonzalez J., Petrov D.A. Empirical validation of pooled whole genome population re-sequencing in Drosophila melanogaster. PloS ONE. 2012;7:e41901. doi: 10.1371/journal.pone.0041901. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Stewart C., Kural D., Stromberg M.P., Walker J.A., Konkel M.K., Stutz A.M., Urban A.E., Grubert F., Lam H.Y., Lee W.P., et al. A comprehensive map of mobile element insertion polymorphisms in humans. PLoS Genet. 2011;7:e1002236. doi: 10.1371/journal.pgen.1002236. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Gonzalez J., Karasov T.L., Messer P.W., Petrov D.A. Genome-wide patterns of adaptation to temperate environments associated with transposable elements in Drosophila. PLoS Genet. 2010;6:e1000905. doi: 10.1371/journal.pgen.1000905. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Ellinghaus D., Kurtz S., Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformat. 2008;9:18. doi: 10.1186/1471-2105-9-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Szak S.T., Pickeral O.K., Makalowski W., Boguski M.S., Landsman D., Boeke J.D. Molecular archeology of L1 insertions in the human genome. Genome Biol. 2002;3:research0052. doi: 10.1186/gb-2002-3-10-research0052. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Rawal K., Ramaswamy R. Genome-wide analysis of mobile genetic element insertion sites. Nucleic Acids Res. 2011;39:6864–6878. doi: 10.1093/nar/gkr337. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Kapitonov V.V., Jurka J. Molecular paleontology of transposable elements in the Drosophila melanogaster genome. Proc. Natl. Acad. Sci. U.S.A. 2003;100:6569–6574. doi: 10.1073/pnas.0732024100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Berezikov E., Bucheton A., Busseau I. A search for reverse transcriptase-coding sequences reveals new non-LTR retrotransposons in the genome of Drosophila melanogaster. Genome Biol. 2000;1 doi: 10.1186/gb-2000-1-6-research0012. RESEARCH0012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Dunsmuir P., Brorein W.J., Jr, Simon M.A., Rubin G.M. Insertion of the Drosophila transposable element copia generates a 5 base pair duplication. Cell. 1980;21:575–579. doi: 10.1016/0092-8674(80)90495-x. [DOI] [PubMed] [Google Scholar]
  • 48.Rubin G.M., Brorein W.J., Jr, Dunsmuir P., Flavell A.J., Levis R., Strobel E., Toole J.J., Young E. Copia-like transposable elements in the Drosophila genome. Cold Spring Harb. Symp. Quant. Biol. 1981;45:619–628. doi: 10.1101/sqb.1981.045.01.080. [DOI] [PubMed] [Google Scholar]
  • 49.Chaparro C., Sabot F. Methods and software in NGS for TE analysis. Methods Mol. Biol. 2012;859:105–114. doi: 10.1007/978-1-61779-603-6_6. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

SUPPLEMENTARY DATA

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

RESOURCES