Abstract
Background
Sequencing technologies give access to a precise picture of the molecular mechanisms acting upon genome regulation. One of the biggest technical challenges with sequencing data is to map millions of reads to a reference genome. This problem is exacerbated when dealing with repetitive sequences such as transposable elements that occupy half of the mammalian genome mass. Sequenced reads coming from these regions introduce ambiguities in the mapping step. Therefore, applying dedicated parameters and algorithms has to be taken into consideration when transposable elements regulation is investigated with sequencing datasets.
Results
Here, we used simulated reads on the mouse and human genomes to define the best parameters for aligning transposable element-derived reads on a reference genome. The efficiency of the most commonly used aligners was compared and we further evaluated how transposable element representation should be estimated using available methods. The mappability of the different transposon families in the mouse and the human genomes was calculated giving an overview into their evolution.
Conclusions
Based on simulated data, we provided recommendations on the alignment and the quantification steps to be performed when transposon expression or regulation is studied, and identified the limits in detecting specific young transposon families of the mouse and human genomes. These principles may help the community to adopt standard procedures and raise awareness of the difficulties encountered in the study of transposable elements.
Keywords: Retrotransposon, High-throughput sequencing, Data analysis, Mapping, Quantification
Background
Transposable elements (TEs) comprise approximately half of the mammalian genomes [1]. Based on de novo repeat identification, it has been suggested that two-thirds of the human genome is in fact composed of repetitive elements [2].TEs are first classified according to their ability to invade the genome and their related molecular mechanisms. DNA transposons use a cut-and-paste mechanism where the element is excised and inserted into a new locus. Retrotransposons use an intermediate RNA template to insert into new genomic locations, in a copy-and-paste manner. These are classified into Long-Terminal Repeat (LTR) elements that are similar to retroviruses, and non-LTR elements. Non-LTR elements are more abundant compared to LTR elements and DNA transposons in mammalian genomes. The vast majority of TE insertions are incapable of mobilization, due to invalidating truncations, internal rearrangements or mutations. However, based on cell culture assays, it has been estimated that 80–100 L1HS elements are competent for retrotransposition in the human genome [3] and around 3000 L1 elements from the Tf, A and Gf subfamilies are potentially capable of retrotransposition in the mouse genome [4]. De novo insertions of TEs -mainly Alu, L1 and SVA non-LTR families- have been associated with more than 100 human diseases [5]. In reaction, cells have developed several restraining mechanisms against TE activity. At the transcriptional level, DNA methylation and repressive histone modifications block TE expression. In the cytoplasm, some restriction factors degrade retrotransposon RNAs. Other factors play a role in the nucleus by interfering with the DNA integration step [6].
The emergence of high-throughput sequencing technologies has allowed making tremendous progress in our understanding of the regulation and functional impact of TEs. However, the characterization of these elements remains computationally challenging, mainly due to their repetitiveness [6]. As they are not unique in the genome, repeated sequences create ambiguities in the alignment step, which can lead to misleading biological conclusions if inappropriate parameters are applied [7, 8]. Different algorithms have been developed for the purpose of mapping reads according to the sequencing application [9]. By default, most of these tools are parameterized to randomly report one genomic position among the set of possible alignments. Additional parameters or filters are implemented to keep uniquely mapped reads, to report all possible positions of reads or to return up to a given number of valid alignments. Benchmarkings of these methods have also been reported to compare their efficiency. Some of them investigated specific biological applications, such as Whole-Genome Bisulfite Sequencing (WGBS) [10] and RNA-seq [11] or specific sequencing platforms [12]. Schbath et al. assessed the power of tools to retrieve all the read occurences. However, their study relied on simulated short single-end reads of 40 bp without any insertions/deletions (indels). Hatem et al. investigated the effect of different mapping parameters such as number of mismatches, seed and read length, gapped vs ungapped alignment. Nevertheless, they did not investigate the power of the different algorithms to align TE-derived reads.
Some tools were developed to quantify TEs within sequencing data. TEtools uses TE annotation to create Bowtie2 index and performs mapping by reporting randomly one position [13, 14]. RepEnrich recommends performing the mapping with Bowtie to retrieve unique alignments [15, 16]. It enables quantifying unique reads emanating from specific families (referred to repEnrich Unique in this study) and the total number of reads, unique and multiple, mapped to each TE family (repEnrich Total). The repEnrich Fractional method counts reads that map to a single TE family and assigns multi-mapped reads to corresponding families using a fractional value 1/n, where n is the number of TE families the read maps to. SQuIRE [17] allows quantifying TE single copies and families performing the alignment with STAR [18] and using an iterative method to assign multi-mapped reads (SQuIRE). Finally, TEtranscripts [19] advises to generate BAM files with the STAR mapper, and performs TE quantification using only uniquely-mapped reads (TEtranscripts Unique), or using multi-mapped reads with an iterative method (TEtranscripts Multiple).
In this study, we propose to benchmark at once the efficiency of the most used aligners and available tools for TE quantification. Using simulated data with mouse and human genomes, Bowtie, Bowtie2, STAR, Novoalign (http://www.novocraft.com), BWA aln [20] and mem [21] alignment algorithms were compared. We also assessed the effect of using paired-end library compared to single-end library with TE-derived reads. Reporting unique reads, randomly one position and all possible locations were compared when TE abundance was estimated. In addition, TE quantification was compared to TE-simulated abundance using the most recent and used RepeatMasker-based tools, TEtools, repEnrich, SQuIRE and TEtranscript. Finally, the efficiency to map reads from each TE subfamily within the mouse and the human genome was computed and revealed the difficulties of accessing specific young TE families.
Results
Mapping based on STAR and PE libraries are highly recommended to align reads coming from transposable elements
To compare different mapping algorithms and their efficiency to align reads from repeated sequences, we relied on simulated data (Fig. 1a). Using a reference genome, 2x100bp paired-end reads were simulated with ART v2.5.8 [22] mimicking Illumina HiSeq 2500 technology (mean fragment size = 200 bp, standard deviation = 20 bp and technology-specific sequencing errors). Reads overlapping with RepeatMasker annotations were kept (Smit, R, & P, 2013–2015). Three independent datasets were simulated at a 10X coverage and aligned using Bowtie1, Bowtie2, BWA aln and mem algorithms, Novoalign and STAR. Only one end of the simulated fragments (single-end (SE) alignment) or both ends (paired-end (PE) alignment) were used, allowing us to compare the performance of both library types when TE-derived reads are aligned. Algorithms were run while enabling unique, randomly-reported or multi-mapped reads, except for BWA algorithms that do not give the possibility to return several hits per read. Reported alignments were compared to the simulated positions. When congruent, alignments were flagged as true-positive (TP) and weighted by the number of reported hits for the corresponding read in the multi-mapped mode. This approach allowed penalizing algorithms that report too many positions per read.
In Fig. 1b, TP rate and percentage of mapping were represented using the chromosome 1 of the mouse genome as the reference genome for the data simulation (Additional file 1: Figure S1A for the chromosome 1 of the human genome). In the top panel, uniquely-reported reads were considered. Around 92 and 95% of the reads were aligned in the SE and PE libraries respectively, highlighting the importance of using PE library to increase the uniqueness of fragments derived from transposon sequences. Conversely, Bowtie1 is the only tool which does not capitalize on the PE library to improve the mapping results. Some uniquely-mapped reads with SE library were not anymore mapped using paired-end information because the second read of the pair had different valid alignments.
Bowtie2, BWA mem and aln algorithms do not allow reporting uniquely mapped reads with defined parameters. Post-mapping filtering is therefore required. In this case, these mappers had the same performance with both SE and PE libraries compared to STAR and Novoalign (Fig. 1b and Additional file 1: Figure S1A and Tables 1 and 2).
Table 1.
Algorithm | Library | Mode | Mapping percentage | True Positive rate | Memory in gbytes | Running Time in minutes | Output size in Mbytes |
---|---|---|---|---|---|---|---|
bowtie | PE | unique | 91.87823 | 99.97913 | 0.92 | 3.00 | 583.36 |
bowtie | SE | unique | 92.05224 | 99.92287 | 0.69 | 1.33 | 311.38 |
bowtie2 | PE | unique | 94.57886 | 99.93802 | 1.28 | 38.00 | 572.58 |
bowtie2 | SE | unique | 92.08282 | 99.84845 | 1.18 | 32.67 | 294.64 |
Bwa aln | PE | unique | 94.62602 | 99.88782 | 2.66 | 15.67 | 553.86 |
Bwa aln | SE | unique | 96.60879 | 95.82612 | 1.85 | 3.00 | 310.30 |
Bwa mem | PE | unique | 94.54763 | 99.95728 | 8.77 | 19.33 | 563.50 |
Bwa mem | SE | unique | 92.08548 | 99.89624 | 8.40 | 4.67 | 299.76 |
novoalign | PE | unique | 95.55760 | 99.61473 | 7.62 | 226.33 | 609.08 |
novoalign | SE | unique | 92.08982 | 99.92307 | 7.61 | 31.67 | 315.96 |
STAR | PE | unique | 95.37882 | 99.80753 | 16.67 | 2.00 | 553.24 |
STAR | SE | unique | 92.23340 | 99.73004 | 16.18 | 2.33 | 285.06 |
bowtie | PE | random | 99.95300 | 93.67212 | 0.93 | 3.00 | 596.75 |
bowtie | SE | random | 99.99001 | 93.04126 | 0.69 | 2.33 | 317.67 |
bowtie2 | PE | random | 99.99991 | 95.89737 | 1.28 | 35.67 | 607.86 |
bowtie2 | SE | random | 99.98093 | 92.97406 | 1.18 | 25.67 | 324.26 |
Bwa aln | PE | random | 99.99998 | 95.94218 | 2.66 | 17.67 | 604.39 |
Bwa aln | SE | random | 99.99801 | 93.01531 | 1.85 | 4.00 | 322.33 |
Bwa mem | PE | random | 99.99998 | 95.94068 | 9.42 | 18.33 | 612.39 |
Bwa mem | SE | random | 99.99998 | 93.01096 | 7.96 | 6.33 | 329.82 |
novoalign | PE | random | 99.99998 | 95.84899 | 7.62 | 272.00 | 616.78 |
novoalign | SE | random | 99.99989 | 93.03697 | 7.61 | 30.67 | 322.72 |
STAR | PE | random | 99.94380 | 95.93094 | 16.67 | 5.00 | 583.02 |
STAR | SE | random | 99.99024 | 93.01921 | 16.26 | 2.00 | 314.19 |
bowtie | PE | multi | 99.95300 | 92.89719 | 0.98 | 18.33 | 7289.52 |
bowtie | SE | multi | 99.99001 | 93.01711 | 0.71 | 9.67 | 2747.64 |
bowtie2 | PE | multi | 99.99998 | 76.80653 | 11.53 | 28658.67 | 228148.51 |
bowtie2 | SE | multi | 99.99998 | 70.81391 | 8.74 | 8205.33 | 161697.48 |
novoalign | PE | multi | 99.99998 | 95.85903 | 7.62 | 307.67 | 2627.41 |
novoalign | SE | multi | 99.99989 | 93.03718 | 7.61 | 99.00 | 3176.37 |
STAR | PE | multi | 99.94380 | 95.93265 | 23.95 | 7.00 | 2575.59 |
STAR | SE | multi | 99.99024 | 93.02143 | 26.64 | 4.00 | 2831.57 |
Values correspond to the average values of three independent simulated libraries with a 10X coverage. SE and PE refer to single end and paired end, respectively. Post-mapping filtering were applied for Bowtie2, Bwa mem and aln algorithms in order to extract uniquely-mapped reads
Table 2.
Algorithm | Library | Mode | Mapping percentage | True Positive rate | Memory in gbytes | Running Time in minutes | Output size in Mbytes |
---|---|---|---|---|---|---|---|
bowtie | PE | unique | 96.12725 | 99.99703 | 1.07 | 4.00 | 717.33 |
bowtie | SE | unique | 96.26772 | 99.98760 | 0.80 | 1.67 | 381.52 |
bowtie2 | PE | unique | 97.58530 | 99.99163 | 1.42 | 36.00 | 720.57 |
bowtie2 | SE | unique | 96.25897 | 99.93671 | 1.33 | 25.33 | 375.46 |
Bwa aln | PE | unique | 97.58600 | 99.99135 | 3.01 | 13.67 | 703.84 |
Bwa aln | SE | unique | 98.40958 | 98.52603 | 2.18 | 6.33 | 381.22 |
Bwa mem | PE | unique | 97.57669 | 99.99745 | 5.65 | 8.33 | 715.38 |
Bwa mem | SE | unique | 96.28285 | 99.98096 | 5.45 | 4.67 | 379.88 |
novoalign | PE | unique | 97.83211 | 99.99187 | 8.31 | 99.67 | 745.17 |
novoalign | SE | unique | 96.28793 | 99.98755 | 8.31 | 21.00 | 385.94 |
STAR | PE | unique | 97.79129 | 99.99166 | 18.12 | 2.33 | 693.70 |
STAR | SE | unique | 96.29801 | 99.96226 | 17.71 | 1.00 | 363.12 |
bowtie | PE | random | 99.95306 | 97.78786 | 1.07 | 4.00 | 722.46 |
bowtie | SE | random | 99.98993 | 97.48616 | 0.80 | 2.33 | 383.45 |
bowtie2 | PE | random | 99.99967 | 98.68378 | 1.42 | 47.00 | 738.73 |
bowtie2 | SE | random | 99.97064 | 97.42861 | 1.33 | 35.67 | 391.06 |
Bwa aln | PE | random | 99.99998 | 98.68727 | 3.01 | 13.67 | 733.20 |
Bwa aln | SE | random | 99.99814 | 97.47704 | 2.18 | 7.33 | 387.77 |
Bwa mem | PE | random | 99.99998 | 98.69222 | 6.05 | 9.33 | 744.88 |
Bwa mem | SE | random | 99.99998 | 97.47710 | 5.26 | 3.00 | 397.18 |
novoalign | PE | random | 99.99998 | 98.68797 | 8.31 | 100.67 | 748.47 |
novoalign | SE | random | 99.99998 | 97.48725 | 8.31 | 27.67 | 388.19 |
STAR | PE | random | 99.94355 | 98.68767 | 18.12 | 3.33 | 709.61 |
STAR | SE | random | 99.99103 | 97.47578 | 17.70 | 2.00 | 378.46 |
bowtie | PE | multi | 99.95306 | 97.41469 | 1.09 | 4.33 | 1032.87 |
bowtie | SE | multi | 99.98993 | 97.47888 | 0.82 | 2.00 | 540.64 |
bowtie2 | PE | multi | 99.99998 | 85.55682 | 11.92 | 71150.67 | 81772.06 |
bowtie2 | SE | multi | 99.99998 | 77.59895 | 6.34 | 62006.33 | 123387.84 |
novoalign | PE | multi | 99.99998 | 98.68698 | 8.31 | 83.67 | 800.39 |
novoalign | SE | multi | 99.99998 | 97.48601 | 8.31 | 24.00 | 572.07 |
STAR | PE | multi | 99.94355 | 98.69066 | 18.12 | 4.00 | 754.66 |
STAR | SE | multi | 99.99103 | 97.47921 | 17.64 | 2.00 | 541.40 |
Values correspond to the average values of three independent simulated libraries with a 10X coverage. SE and PE refer to single end and paired end, respectively. Post-mapping filtering were applied for Bowtie2, Bwa mem and aln algorithms in order to extract uniquely-mapped reads
When randomly-reported and multi-mapped reads were allowed (middle and bottom panels, Fig 1b and Additional file 1: Figure S1A), the percentage of mapping increased close to 100%, leading to a decrease of TP rate around 93% for Bowtie1, 93% for the others in SE and 96% in PE. In addition, we also observed a big drop in Bowtie2 TP rate in the multi-mapped mode. Bowtie2 did not guarantee that the reported alignments are the best possible in terms of alignment score. Consequently, more alignments were reported, leading to a decrease of TP rate compared to other algorithms. As in unique mode, Bowtie1 was less efficient using PE library than SE library compared to Novoalign and STAR.
Computation time, BAM file size and memory usage were finally reported (Tables 1 and 2, Fig. 1c for mouse simulation and Additional file 1: Figure S1B for the human simulation) for all applied mappers and modes. The runtime measurement includes post-mapping filtering in the unique mode for bowtie2, BWA mem and aln algorithms. All algorithms required less than 10GB, except STAR which required 26GB at most. On the other hand, STAR was at least 15 times faster compared to Novoalign. Reporting all possible alignments per read increased at least four times the output size in PE mode compared to randomly-reported alignments for the mouse simulation. Output size of Bowtie2 in multi-mapped mode confirmed the fact that Bowtie2 reported too many alignments per read inducing a decrease of TP rate. In conclusion, STAR gave the best compromise in terms of mapping efficiency and accuracy, as well as computing time.
Quantification of TE families: random and multiple counting methods give the best estimations
Regarding its better performance, STAR was used as the mapping algorithm in subsequent quantification analyses. One library was simulated at a 10X coverage using the pipeline described previously with the human and mouse genomes as reference. The same mapping parameters than in previous analyses were applied for the human simulation. However, mapping parameters were adjusted (see Additional file 5) for the mouse simulation allowing higher number of multi-mapped reads, to account for the more complex TE content in this species. TE-estimated quantification from different developed methods was compared to simulated abundance. TE families were quantified using uniquely-mapped reads (referred to FeatureCounts Unique alignments), randomly-reported position (FeatureCounts Random alignments) and all valid alignments (FeatureCounts Multiple alignments). In the FeatureCounts Multiple alignments, alignments were weighted by the number of corresponding hits. Quantifications were performed using featureCounts [23]. In addition, repEnrich, TEtools, SQuIRE and TEtranscripts were evaluated using recommended parameters. TE-simulated abundance and estimated abundance were correlated for the different methods (Fig. 2a for mouse simulation and Additional file 3: Figure S2A for human simulation).
Methods using only unique reads (FeatureCounts Unique alignments, repEnrich Unique, TEtranscripts Unique) underestimated some TE families of all classes (LTR, LINEs and SINEs), with repEnrich Unique being the least accurate. In contrast, counting the total number of reads mapping to each TE family -as it is the case with repEnrich Total- induced an overestimation. On the other hand, weighting by the number of hits (FeatureCounts Multiple alignments) or reporting randomly one position (TEtools and FeatureCounts Random alignments) gave the most satisfactory TE estimation with a correlation close to 1. To test whether coverage could influence these results, we repeated the simulation with 5X, 10X, 25X, 50X and 100X coverage, focusing on specific TE families known to be potentially active (B2_Mm1a, IAPEz-int and L1MdA_I for the mouse genome and AluYa5, HERVK-int, L1HS and SVA_F for the human genome). Independently of the coverage depth, methods using unique reads (FeatureCounts Unique alignments, repEnrich Unique, TEtranscripts Unique) consistently underestimated TE families (Additional file 3: Figure S3A and B), while FeatureCounts using random and multiple alignments and TEtools gave the best estimation, confirming the 10X genome-wide simulation.
By proposing to map reads on TE annotations only, TEtools contrasts with other mapping methods that align reads genome-wide and then extract TE-derived reads only. However, because transposable elements represent only half of the mammalian genomes, we wanted to estimate whether TEtools could introduce some biases. New datasets were then simulated uniformly genome-wide, including non-repeated sequences, by generating PE libraries with a 10X coverage from mouse and human genomes. Compared to the FeatureCounts Random alignments (with STAR for the mapping), TEtools clearly introduced an overestimation of both LINE1 and LTR elements by forcing non-derived reads to map to TE sequences (Fig. 2b and Additional file 2: Figure S2B).
Evolutionarily young families suffer from low percentage of mapping and low true positive rate
Using PE library simulated on the mouse and human genomes, we found that 89.8 and 93.4% of the reads were uniquely mapped, respectively, with a TP rate of 99.9% (Fig. 3a and Additional file 4: Figure S4A). However, we noticed that some TE families displayed a lower mapping percentage. This was the case for the L1HS family –a recent human-specific L1 family- whereby 49% of simulated reads had 88% of TP rate upon unique mapping. In the mouse genome, 25 families had less than 50% of mapping when uniquely-reported reads were allowed, six of them being annotated in the LINE order. Using estimated evolutionary age of mouse and human LINE1 families [24, 25], we found that the youngest families were the ones with the lowest percentage of mapping and TP rate (Fig. 3b and Additional file 4: Figure S4B). These two metrics appear therefore as new classifiers to rank L1 subfamilies according to their age. The link between mappability and the age of L1 families was previously reported by Sexton and Han for the human genome [26]. The lower the age is, the lower mappability is as well.
Among the 25 mouse TE families with less than 50% of mapping, 19 were annotated as LTR retrotransposons, with representatives of the three different classes of LTR defined by their similarities to exogenous retroviruses [27]. In the ERV1 class, MURVY-int, its related LTR (RLTR5_MM) and RLTR4_MM (LTR flanking Murine Leukemia virus elements) had less than 25% of mapping. In the ERVK class, reads corresponding to the IAPEz-int annotation had 28% of mapping. This annotation represents the internal portion of IAPLTR1 elements, which are the young active elements from the IAP subtypes [28]. Finally, MERVL-int annotations, which represent active members of the ERVL class, had only 30% of mapping [29].
As depicted in Fig. 1b and Additional file 1: Figure S1A, using PE library improved the mapping step by producing a higher percentage of uniquely-mapped reads: more precisely, 6 and 2% of additional uniquely-mapped reads were gained in genome-wide mouse and human simulations, respectively. However, there was a strong inter-TE family variability in the improvement (Fig. 3c and Additional file 4: Figure S4C). Mouse L1MdGf_II, L1MdA_III and L1MdF_I (Fig. 3c) and human L1PA3 and L1PA2 (Additional file 4: Figure S4C) showed a 30% mapping gain when a PE library was used. The gain was slightly less satisfactory for the youngest LINE1 families compared to the slightly older families mentioned above, with human L1HS gaining 22% (Additional file 4: Figure S4C) and mouse L1MdTf_I, L1MdTf_II and L1MdA_I gaining 10% only on average (Fig. 3c). Similarly, in the human genome, mapping was improved by 20% or more when using PE over SE libraries for the youngest, hominoid-specific SVA subtypes (SVA_E and SVA_F) and the youngest subfamilies of the AluY type (AluYa5 and AluYb9)(Additional file 4: Figure S4C) [29–31]. These results demonstrate the importance of paired-end sequencing libraries, especially for the study of evolutionarily young TE families, provided that they are not completely identical in sequence.
Discussion
Because of their repetitive nature, TE-derived sequences are complex to analyze. The objective of the present study was to provide objective guidelines for the analysis of transposable elements within high-throughput sequencing datasets.
Sample and library preparation
At the beginning of a project, experimental design and sample preparation should be conceived in order to retrieve as much information as possible. Chhangawala et al (2015) already showed that single-end reads increased the number of multi-mapped reads. In contrast, paired-end reads lower the amount of multi-mapped reads and increase splicing event detection [32]. Our study confirms the importance of using paired-end library instead of single-end when analyzing TE-derived reads, especially for evolutionarily young families such as SVA_F, AluYb9 and L1HS in the case of human-based analyses. Read length is another parameter to take into consideration when TE-derived reads are sequenced. Chhangawala et al (2015) showed that longer reads increased the uniqueness of sequenced fragments. Longer fragment size should also help during the mapping step, because the chance for the sequenced fragment to fall into the boundaries or to cover a polymorphism will increase with the size of the fragment. As a result, the mappability of the given fragment should increase. However, having longer reads is a limitation of the Illumina technology. It is also a limiting factor in some applications, such as ChIP-seq, CUT&RUN and WGBS, where fragment size is determined by obligate fragmentation steps (sonication, micrococcal nuclease digestion or bisulfite-induced DNA degradation).
Mapping
After quality control, read alignment against a reference genome is the first step in NGS analyses. Appropriate parameters and algorithms are needed to align as many TE-derived reads as possible. BWA algorithms (mem and aln) and bowtie2 have no defined parameter for retrieving uniquely mapped reads. In such case, post-mapping filtering has to be applied. In contrast, Novoalign, bowtie and STAR have dedicated parameters to report uniquely-mapped reads. However, bowtie does not capitalize on the information of paired-end reads. If a 5’end read -R1 read- is uniquely mapped and the corresponding 3’end read -R2 read- is a multi-mapper, bowtie discards the valid alignment from the R1 read. In contrast, Novoalign and STAR use the information from the R1 read and increase the percentage of mapping with paired-end library.
In the multiple-hit mode, Bowtie2 searches for up to k valid alignments per read, where k is a threshold given by the user (k was set to 5000 in this study). In Bowtie2’s reference manual, it is mentioned: “Bowtie 2 does not guarantee that the k alignments reported are the best possible in terms of alignment score” (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml). Other alignments with different alignment scores are reported in addition to the best alignment, which creates a low true positive rate and a bigger BAM file compared to STAR and Novoalign (Tables 1 and 2).
We found that reporting multi-mapped reads or reporting randomly one position increases the percentage of mapping close to 100% but at the cost of lower precision, which confirms previous results [11, 33]. Discarding multi-mapped reads is a real cost for evolutionary young families due to quasi-identical copies. However, these families are the ones that are mostly regulated in the genome, by repression histone marks and DNA methylation [34, 35]. As a conclusion, using multi-mapped reads or reporting randomly one position has to be done with caution to avoid discarding the most important information of the TE fraction of the genome.
As with the uniquely-mapped reads, STAR and Novoalign were the best compromise to report multi-mapped reads or a random valid alignment. However, Novoalign had a big disadvantage, its computing time, especially using PE reads. Starting with more than three millions of paired-end reads simulated from the mouse chromosome 1, Novoalign randomly aligned this set of reads in 4.5 h (Tables 1 and 2), while STAR completed the same task in 5 min. As the amount of sequenced reads and the number of projects with sequencing data are growing, fast algorithms are requested. This is why we recommend using STAR for the mapping step. Nevertheless, specific parameters have to be adapted for the study of transposable elements. This is especially important for young families that display a low mappability score. Unadapted parameters can mask relevant results or on the contrary, create incorrect conclusions. By default, STAR reports up to 10 alignments per read. The ENCODE project recommends to report up to 20 alignments per reads for long RNA sequencing pipeline. These guidelines are adapted for pseudogenes. In the case of TE studies and genomes with high TE content, these parameters have to be tuned (see Methods). A previous study based on ChIP-seq data estimated that a threshold of 10,000 positions per read is optimal in term of computing time and storage, without significant loss of sequence information (0.25% of reads eliminated on average) [35].
Quantification of transposable elements
To highlight TE regulation, transposable element quantification is estimated and compared in different biological conditions. Dedicated methods have to be applied according to the parameters used during the alignment step. We demonstrated that quantification methods relying on uniquely-mapped reads underestimated the abundance of the youngest TE families, because of their low level of sequence diversity and consequently, low mapping performance.
When using reads with multiple hits, we found that reporting randomly one position or weighting multi-mapped reads with the number of hits give rise to the same estimation. However, reporting multi-hits is more consuming in terms of storage and time. In the case of mouse simulation, the output is five times bigger (500 Mbytes to 2500 Mbytes) when multi-hits are reported in comparison to the random mode. The increase in the rate and amount of sequencing data represents a high storage challenge for the community. Data analyses within TE studies has to be conducted with taking care of the amount of processed data. For this reason, we recommend to report randomly one position per read.
We also studied the specific case of TEtools, which quantifies TEs using randomly reported reads with Bowtie or Bowtie2. However, this tool considers a list of TE sequences extracted from a genome or manually annotated- as genomic references for the mapping. We showed that, in the case of available assembled genomes, performing the mapping onto the reference genome gives rise to a better estimation of TE quantity in comparison to the strategy applied by TEtools. Indeed, using only a part of the genome assembly introduces a bias in the alignment by forcing the mapping to this genome extract, the extent of which results from a combination of technology-specific sequencing errors and mismatch allowance in the alignment settings. Consequently, regions represented in this genome extract are overestimated. The method used by TEtools is analogous to a strategy where TE consensus sequences provided by RepBase are used for the mapping step [36]. Aligning reads against consensus sequences should also lead to an overestimation of the abundance of TEs; it adds moreover another confounding factor by allowing more mismatches. In the case of available assembly genomes, we therefore recommend to align reads with the reference genome and extract expression with FeatureCounts. Then, for RNA-seq analyses, gene quantification can be performed in the same time taking, advantage of only one step. Gene and transposon-based differential expression should be called in the same analysis, as it is done in SQuIRE and TEtranscripts.
Transposable elements and their evolution
Human and mouse genomes are estimated to contain 48.5 and 41.8% of TEs, respectively. Interestingly, using genome-wide simulation on these species, we observed a higher mappability in the human genome compared to the mouse one. These differences likely reflect a more recent activity of certain TE families in the mouse genome, and therefore a higher proportion of sequence homology among TE copies. The overview we provide here on the TE-specific mappability rate should help researchers qualifying their conclusions made on specific families. For instance, in the mouse, using uniquely-mapped reads on L1 young families, IAPEz and MERVL families will undoubtedly induce an underestimation of their abundance in NGS datasets. We demonstrate and quantify here that significant improvement − 20 to 30% of mapping gain- can be obtained for these young TE families by using PE library. This is truly important, particularly in RNA-seq datasets, as these families are the ones that have more intact sequences, including at transcription factor binding sites, and therefore the potential for being transcribed.
Conclusions
By comparing different available algorithms with simulated data generated onto the mouse and human genomes, we demonstrated the difficulty of analyzing evolutionarily young TE families. Improvements can nonetheless be gained if the following recommendations are followed:
paired-end library should be used to increase the uniqueness of sequenced fragments.
During the alignment step, STAR is the best compromise between efficiency and speed. Parameters have to be set according to the TE content.
Reporting randomly one position and using FeatureCounts to quantify TE families gives the best estimation values.
When TE annotation on an assembled genome is available, mapping and quantification should be done with the reference genome.
Evolutionarily young families suffer from low mappability rate and are severely underestimated if uniquely-mapped reads are reported.
Methods
Reconstruction of repeatMasker annotations
Transposon annotations were downloaded from the RepeatMasker website (Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-4.0. 2013–2015 <http://www.repeatmasker.org>). As described in Bailly-Bechet et al., 2014, a dictionary was constructed for LTR retrotransposons that associated elements corresponding to the internal sequence and those corresponding to LTR sequences. With the latter and the RepeatMasker database, fragments of transposable elements corresponding to the same copy were merged if the distance between them is less than 1000 bp.
Simulation data pipeline
2x100bp paired-end reads were simulated with ART v2.5.8 [22] mimicking Illumina HiSeq 2500 technologies (−m 200 –s 10). Simulated reads overlapping with reconstructed repeatMasker annotation were kept using Bedtools intersectBed v2.21.0.
Mapping comparison
The following tools were used: Bowtie v1.0.0, Novoalign v3.2.11, STAR v2.5.2b, Bowtie2 v2.1.0, BWA aln v0.7.15, BWA mem v0.7.15. All the mappers were run with four threads (except for Novoalign that can be run with only 1 thread). Parameters used for the unique, random and multiple mode are detailed in Additional file 5.
Quantification comparison
The following tools were compared. Command lines and parameters are detailed in Additional file 5.
repEnrich
as recommended, reads were first mapped with Bowtie v1.2 reporting unique alignments and retrieving multi-hits in fastq files (−m1 --max multimap.fastq). TE families were quantified using repEnrich v0.1.
TEtools
repeatMasker annotation was first extended 300 bp upstream and downstream in order to map reads located in the boundaries. TEtools v1.0.0 was used with Bowtie2 v2.2.4.
TEtranscripts
STAR v2.5.2b was used with the recommended parameters (−- outAnchorMultimapNmax 100 --outFilterMultimapNmax 100). TEtranscipts v1.5.1 was run using unique and multiple modes.
SQuIRE
To compare TE-estimated abundance with other tools, the same TE annotation was provided to the clean folder (−c option in SQuIRE count). SQuIRE v0.9.9.92 was run.
FeatureCounts unique, random and multiple alignments
featureCounts v1.5.1 was used with specific options (−s 0 -p). The option -M was used for random and multiple counting methods. In the multiple counting method, −-fraction option was also used in order to weight the counts for multi-mapped reads. Quantification of TE family was performed by summing all copies from each family.
Supplementary information
Acknowledgements
We would like to thank M. Greenberg for editing the manuscript.
Abbreviations
- ERV
Endogenous Retrovirus
- LINE
Long INterspersed Element
- LTR
Long Terminal Repeat
- PE
Paired-End
- SE
Single-End
- SVA
SINE-R, VNTR, and Alu
- TE
Transposable Element
- TP
True Positive
Authors’ contributions
AT, NS, EB and DB conceived the study. AT conducted the research and drafted the manuscript with NS and DB. All authors read and approved the final manuscript.
Funding
The laboratory of D.B. is part of the Laboratoire d’Excellence (LABEX) entitled DEEP (11-LBX0044). This research was supported by the ERC (grant ERC-Cog EpiREPRO).
Availability of data and materials
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Nicolas Servant, Email: nicolas.servant@curie.fr.
Deborah Bourc’his, Email: deborah.bourchis@curie.fr.
Supplementary information
Supplementary information accompanies this paper at 10.1186/s13100-019-0192-1.
References
- 1.Platt RN, Vandewege MW, Ray DA. Mammalian transposable elements and their impacts on genome evolution. Chromosom Res. 2018;26(1–2):25–43. doi: 10.1007/s10577-017-9570-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.de Koning APJ, Gu W, Castoe TA, Batzer MA, Pollock DD. Repetitive elements may comprise over two-thirds of the human genome. PLoS Genet. 2011;7(12):e1002384. doi: 10.1371/journal.pgen.1002384. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Brouha B, Schustak J, Badge RM, Lutz-Prigge S, Farley AH, Moran JV, et al. Hot L1s account for the bulk of retrotransposition in the human population. Proc Natl Acad Sci U S A. 2003;100(9):5280–5285. doi: 10.1073/pnas.0831042100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Goodier JL, Ostertag EM, Du K, Kazazian HH., Jr A novel active L1 retrotransposon subfamily in the mouse. Genome Res. 2001;11(10):1677–1685. doi: 10.1101/gr.198301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Hancks DC, Kazazian HH. Roles for retrotransposon insertions in human disease. Mobile DNA. 2016;7:9. [DOI] [PMC free article] [PubMed]
- 6.Treangen Todd J., Salzberg Steven L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics. 2011;13(1):36–46. doi: 10.1038/nrg3117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Royo H, Stadler MB, Peters AHFM. Alternative computational analysis shows no evidence for nucleosome enrichment at repetitive sequences in mammalian spermatozoa. Dev Cell. 2016;37(1):98–104. doi: 10.1016/j.devcel.2016.03.010. [DOI] [PubMed] [Google Scholar]
- 8.Marinov GK, Wang J, Handler D, Wold BJ, Weng Z, Hannon GJ, et al. Pitfalls of mapping high-throughput sequencing data to repetitive sequences: Piwi’s genomic targets still not identified. Dev Cell. 2015;32(6):765–771. doi: 10.1016/j.devcel.2015.01.013. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Fonseca N. A, rung J, Brazma a, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics. 2012;28(24):3169–3177. doi: 10.1093/bioinformatics/bts605. [DOI] [PubMed] [Google Scholar]
- 10.Tsuji J, Weng Z. Evaluation of preprocessing, mapping and postprocessing algorithms for analyzing whole genome bisulfite sequencing data. Brief Bioinform. 2016;17(6):938–52. [DOI] [PMC free article] [PubMed]
- 11.Baruzzo G, Hayer KE, Ji Kim E, Di Camillo B, FitzGerald GA, Grant GR. Simulation-based comprehensive benchmarking of RNA-seq aligners. Nat Methods. 2016;4(December):1–9. doi: 10.1038/nmeth.4106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Caboche S, Audebert C, Lemoine Y, Hot D. Comparison of mapping algorithms used in high-throughput sequencing: application to ion torrent data. BMC Genomics. 2014;15(1):1–16. doi: 10.1186/1471-2164-15-264. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Lerat E, Fablet M, Modolo L, Lopez-Maestre H, Vieira C. TEtools facilitates big data expression analysis of transposable elements and reveals an antagonism between their activity and that of piRNA genes. Nucleic Acids Res. 2016;45(4):gkw953. doi: 10.1093/nar/gkw953. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods. 2012;9(4):357–359. doi: 10.1038/nmeth.1923. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Criscione SW, Zhang Y, Thompson W, Sedivy JM, Neretti N. Transcriptional landscape of repetitive elements in normal and cancer human cells. BMC Genomics. 2014;15(1):583. doi: 10.1186/1471-2164-15-583. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25. doi: 10.1186/gb-2009-10-3-r25. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Yang Wan R, Ardeljan Daniel, Pacyna Clarissa N, Payer Lindsay M, Burns Kathleen H. SQuIRE reveals locus-specific regulation of interspersed repeat expression. Nucleic Acids Research. 2019;47(5):e27–e27. doi: 10.1093/nar/gky1301. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Dobin A, Davis C a, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21. doi: 10.1093/bioinformatics/bts635. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Jin Y, Tam OH, Paniagua E, Hammell M. TEtranscripts: a package for including transposable elements in differential expression analysis of RNA-seq datasets. Bioinformatics. 2015;31(22):3593–3599. doi: 10.1093/bioinformatics/btv422. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Li H, Durbin R. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics. 2009;25(14):1754–1760. doi: 10.1093/bioinformatics/btp324. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv Prepr arXiv. 2013;00(00):3. [Google Scholar]
- 22.Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28(4):593–594. doi: 10.1093/bioinformatics/btr708. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Liao Y, Smyth GK, Shi W. FeatureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–930. doi: 10.1093/bioinformatics/btt656. [DOI] [PubMed] [Google Scholar]
- 24.Sookdeo A, Hepp CM, McClure M a, Boissinot S. Revisiting the evolution of mouse LINE-1 in the genomic era. Mob DNA. 2013;4(1):3. doi: 10.1186/1759-8753-4-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Khan H, Smit A, Boissinot S. Molecular evolution and tempo of amplification of human LINE-1 retrotransposons since the origin of primates. Genome Res. 2006;16(1):78–87. doi: 10.1101/gr.4001406. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Sexton CE, Han MV. Paired-end mappability of transposable elements in the human genome. Mobile DNA. 2019;10:29 [DOI] [PMC free article] [PubMed]
- 27.McCarthy Eugene M, McDonald John F. Genome Biology. 2004;5(3):R14. doi: 10.1186/gb-2004-5-3-r14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Qin C, Wang Z, Shang J, Bekkari K, Liu R, Pacchione S, et al. Intracisternal a particle genes: distribution in the mouse genome, active subtypes, and potential roles as species-specific mediators of susceptibility to cancer. Mol Carcinog. 2010;49(1):54–67. doi: 10.1002/mc.20576. [DOI] [PubMed] [Google Scholar]
- 29.Costas J. Molecular characterization of the recent intragenomic spread of the murine endogenous retrovirus MuERV-L. J Mol Evol. 2003;56(2):181–186. doi: 10.1007/s00239-002-2392-3. [DOI] [PubMed] [Google Scholar]
- 30.Wang H, Xing J, Grover D. Hedges Kyudong Han DJ, Walker JA, Batzer MA. SVA elements: a hominid-specific retroposon family. J Mol Biol. 2005;354(4):994–1007. doi: 10.1016/j.jmb.2005.09.085. [DOI] [PubMed] [Google Scholar]
- 31.Batzer MA, Deininger PL. Alu repeats and human genomic diversity. Nat Rev Genet. 2002;3(5):370–379. doi: 10.1038/nrg798. [DOI] [PubMed] [Google Scholar]
- 32.Chhangawala S, Rudy G, Mason CE, Rosenfeld JA. The impact of read length on quantification of differentially expressed genes and splice junction detection. Genome Biol Genome Biology. 2015;16(1):1–10. doi: 10.1186/s13059-015-0697-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Yu X, Guda K, Willis J, Veigl M, Wang Z, Markowitz S, et al. How do alignment programs perform on sequencing data with varying qualities and from repetitive regions? Bio Data Min. 2012;5(1):1. doi: 10.1186/1756-0381-5-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Barau J, Teissandier A, Zamudio N, Roy S, Nalesso V, Hérault Y, et al. The DNA methyltransferase DNMT3C protects male germ cells from transposon activity. Science. 2016;354(6314):909–912. doi: 10.1126/science.aah5143. [DOI] [PubMed] [Google Scholar]
- 35.Pezic D, Manakov SA, Sachidanandam R, Aravin AA. piRNA pathway targets active LINE1 elements to establish the repressive H3K9me3 mark in germ cells. Genes Dev. 2014;28(13):1410–1428. doi: 10.1101/gad.240895.114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Bao W, Kojima KK, Kohany O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mob DNA. 2015;6(1):11. doi: 10.1186/s13100-015-0041-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.