Bioinformatics. 2019 Oct 4;36(4):1167–1173. doi: 10.1093/bioinformatics/btz724

L1EM: a tool for accurate locus specific LINE-1 RNA quantification

Wilson McKerrow 1,2, David Fenyö 3,4
Editor: Janet Kelso
PMCID: PMC8215917  PMID: 31584629

Abstract

Motivation

LINE-1 elements are retrotransposons that are capable of copying their sequence to new genomic loci. LINE-1 derepression is associated with a number of disease states, and has the potential to cause significant cellular damage. Because LINE-1 elements are repetitive, it is difficult to quantify LINE-1 RNA at specific loci and to separate transcripts with protein coding capability from other sources of LINE-1 RNA.

Results

We provide a tool, L1EM, that uses the expectation maximization algorithm to quantify LINE-1 RNA at each genomic locus, separating transcripts that are capable of generating retrotransposition from those that are not. We show the accuracy of L1EM on simulated data and against long read sequencing from HEK cells.

Availability and implementation

L1EM is written in Python. The source code and the necessary annotations are available at https://github.com/FenyoLab/L1EM and are distributed under GPLv3.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Retrotransposons are genomic sequences that are able to copy themselves to a new location via an RNA intermediate. Long Interspersed Element 1 (LINE-1) is the only retrotransposon known to be capable of autonomous retrotransposition in the human genome. LINE-1 retrotransposition begins with transcription from one or more genomic loci. The unspliced LINE-1 RNA is polyadenylated and exported from the nucleus, where two proteins—ORF1p and ORF2p—are translated from separate open reading frames on the LINE-1 RNA. ORF1p is an RNA binding protein (Martin, 2006) while ORF2p possesses the endonuclease and reverse transcriptase activity necessary for retrotransposition (Feng et al., 1996; Hattori et al., 1986). The LINE-1 RNA then forms a ribonucleoprotein complex (RNP) with its protein products (Wei et al., 2001). The RNP is imported into the nucleus, likely during the S phase of cell cycle (Mita et al., 2018), where it is reverse transcribed and inserted at a new genomic locus. At least 17% of the human genome is derived from LINE-1 activity. However, in the human genome, only about 100 LINE-1 copies, from the L1HS subfamily, remain capable of retrotransposition. Most of the about 500 000 LINE-1 copies in the human genome are either ancient, truncated or have premature stop codons that prevent the translation of intact ORF1p or ORF2p.

In humans, LINE-1 activity has been observed in early embryonic development (Kano et al., 2009), during neuronal differentiation (Coufal et al., 2009), in cancer (Rodić et al., 2014; Tubio et al., 2014) and during aging (de Cecco et al., 2019). LINE-1 is theorized to contribute to genomic instability in cancer (Kemp and Longworth, 2015), play a regulatory role in stem cells and in early development (Percharde et al., 2018; Rodriguez-Terrones and Torres-Padilla, 2018), trigger inflammation in senescent cells (de Cecco et al., 2019) and provide genomic plasticity to neurons (Singer et al., 2010). However, despite a growing appreciation for the ubiquity of LINE-1 RNA, the repetitive nature of LINE-1 sequences presents significant challenges to the accurate quantification of LINE-1 expression from RNA-seq data. Firstly, because the sequences of many LINE-1 elements, especially the young (L1HS) elements, are highly similar, most reads do not map uniquely to a single locus. Secondly, because LINE-1 elements are interspersed throughout the genome, they often overlap with other transcripts that are not driven by LINE-1 expression. Thus, LINE-1 aligning RNA is often present even when LINE-1 is not directly expressed. We call this ‘passive co-transcription’ of LINE-1. Finally, only about a quarter of the L1HS loci in the human genome are full length, and only about a quarter of those retain intact ORF1 and ORF2. Thus, knowing which loci are expressed is germane to determining whether LINE-1 is actively retrotransposing under a given condition.

Recently, significant progress has been made toward RNA quantification for LINE-1 and other repetitive sequences. TEtranscripts (Jin et al., 2015) adapted the expectation maximization algorithm (EM) (Dempster et al., 1977), used widely for the quantification of gene isoforms [e.g. (Li and Dewey, 2011; Patro et al., 2017)], to estimate the expression of transposable elements at the subfamily level. SQuIRE (Yang et al., 2019) extended this method to quantify specific transposable element loci. In an orthogonal method, Philippe et al. (2016) sidestepped the alignment challenge by combining RNA-seq reads from transcripts that overrun the 3’ end of LINE-1 with an H3K4 methylation ChIP-seq signal that extends upstream of the locus.

In this paper, we build on those successes by proposing L1EM, an EM based method that explicitly models the potential sources of LINE-1 RNA: canonical LINE-1 transcripts that begin at the 5’UTR and terminate at the polyA tail, run-on transcripts that begin at the 5’UTR but terminate downstream, antisense transcripts that originate from an antisense transcription start site in the LINE-1 5’UTR (Speek, 2001), and passive transcripts that include LINE-1 sequence but are not driven by the LINE-1 promoter. This method sacrifices some of the generality provided by previous methods for increased accuracy when quantifying LINE-1 expression. L1EM performs well against simulated data and against long read L1 5’RACE sequencing.

2 Materials and methods

2.1 The transcripts

All LINE-1 loci in the hg38 human reference genome that are labeled as L1HS or L1PA* (i.e. L1PA2, L1PA3, etc.) in RepeatMasker (Bao et al., 2015) are divided into two categories. Category 1 consists of all elements that include the LINE-1 5’UTR; category 0 consists of elements that do not include the 5’UTR. Because the 5’UTR functions as the LINE-1 promoter, only elements in category 1 are allowed to form LINE-1 transcripts. For each element in category 1, five transcripts, illustrated in Figure 1A, are considered:

  1. The ‘Only’ transcript that runs sense and includes only the annotated element, supported by sense reads that fall entirely within the LINE-1 element.

  2. The ‘Run-on’ transcript that runs sense and includes downstream sequence, supported by sense reads that overlap the LINE-1 element and do not extend upstream.

  3. The ‘Passive’ sense transcript that runs sense and includes both upstream and downstream sequence, supported by sense reads that overlap the LINE-1 element.

  4. The ‘Passive’ antisense transcript that runs antisense and includes both upstream and downstream sequence, supported by antisense reads that overlap the LINE-1 element.

  5. The ‘Antisense’ transcript that runs antisense and includes only the first 500 bases of the element plus upstream (downstream on the antisense strand) sequence (Speek, 2001), supported by antisense reads that do not extend more than 500 bases into the LINE-1 element.

Fig. 1.


Transcripts and pipeline. (A) Types of transcripts that include LINE-1 sequence. ‘Run-on’, ‘only’ and ‘antisense’ are only allowed at loci with 5’UTRs. ‘Passive’ transcripts are allowed at all reference loci. (B) Outline of the L1EM pipeline. L1HS/L1PA* reads are collected, and alignments to transcripts indicated in part A are stored in a reads-by-transcripts matrix. EM iterations are performed until convergence to the maximum likelihood estimate

For elements without a 5’UTR (category 0), only the ‘Passive’ transcripts are included. Because they both can lead to retrotransposition, we use the term ‘proper’ to refer to ‘only’ and ‘run-on’ transcripts together.
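The transcript enumeration described above can be sketched as a small data structure. This is a minimal illustration only: the class `Locus`, the function `candidate_transcripts` and the type labels are hypothetical names chosen for this example and are not taken from the L1EM source code.

```python
from dataclasses import dataclass

# Illustrative labels for the five transcript types of Section 2.1 (assumed names).
CATEGORY_1_TYPES = ("only", "runon", "passive_sense", "passive_antisense", "antisense")
CATEGORY_0_TYPES = ("passive_sense", "passive_antisense")
PROPER_TYPES = ("only", "runon")  # the 'proper' transcripts


@dataclass(frozen=True)
class Locus:
    name: str
    has_5utr: bool  # True for category 1, False for category 0


def candidate_transcripts(loci):
    """Return the (locus name, transcript type) pairs allowed at each locus."""
    transcripts = []
    for locus in loci:
        allowed = CATEGORY_1_TYPES if locus.has_5utr else CATEGORY_0_TYPES
        transcripts.extend((locus.name, kind) for kind in allowed)
    return transcripts


# Hypothetical loci: one with a 5'UTR, one without.
loci = [Locus("L1HS_a", True), Locus("L1PA3_b", False)]
transcripts = candidate_transcripts(loci)
```

Here the category 1 locus contributes five candidate transcripts and the category 0 locus contributes only the two passive ones.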

These LINE-1 transcripts can then be quantified using an EM method similar to those used to quantify gene isoforms [e.g. (Li and Dewey, 2011; Patro et al., 2017)]. L1EM starts with a list of transcripts that includes each of the possible transcript types at each locus. Then it calculates the extent to which each read supports each transcript and fractionally assigns reads to transcripts based on this support. The fractional assignments provide an initial estimate of the relative transcript abundances, which can be used to refine the fractional assignments, which in turn are used to refine the transcript abundance estimates. These steps are repeated until convergence. Through this process, EM will not only make use of unique alignments, but also recruit multimapping reads to fill in gaps and generate transcript estimates and read assignments that are mutually consistent. This iterative process provides the relative transcript abundances that make the observed RNA-seq reads most likely under the following model.

2.2 The generative model

List of random variables:

  • $X = (X_1, \ldots, X_i, \ldots, X_n)$, where $\sum_i X_i = 1$, is the relative abundance of each transcript enumerated above. $n$ is the total number of transcripts.

  • $R = (R_1, \ldots, R_j, \ldots, R_m)$ are the L1PAx aligning read sequences. $m$ is the total number of L1PAx aligning reads.

  • $A = (A_1, \ldots, A_j, \ldots, A_m)$ are the read alignments.

The reads are assumed to be independently generated by first sampling a transcript according to X, then choosing a random location in that transcript, and finally introducing mismatch/indel read errors with probability ϵ (0.01 by default). This model yields the following likelihood function:

$$P(R \mid X) \propto \prod_j \sum_{A_j} \frac{\epsilon^{NM(A_j)}}{l_{\iota(A_j)}} X_{\iota(A_j)}$$

where $NM$ is the edit distance of an alignment, $l_i$ is the effective length of transcript $i$, and $\iota(A_j)$ is the locus that $A_j$ is an alignment to. For ‘passive’ transcripts, the effective length is the length of the element plus the median template length for read pairs. For ‘only’ transcripts, it is the element length minus the median template length. For ‘run-on’ transcripts it is the element length, and for ‘antisense’ transcripts it is 500. Transcripts are given a minimum effective length of 500.
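The effective length rules above translate directly into a small function. The following is a sketch under stated assumptions: the function name, argument names and type labels are hypothetical, and the two passive transcripts (sense and antisense) are treated identically.

```python
MIN_EFFECTIVE_LENGTH = 500  # minimum effective length stated in the text
ANTISENSE_LENGTH = 500      # effective length of 'antisense' transcripts


def effective_length(kind, element_length, median_template_length):
    """Effective transcript length following the rules of Section 2.2."""
    if kind == "passive":
        length = element_length + median_template_length
    elif kind == "only":
        length = element_length - median_template_length
    elif kind == "runon":
        length = element_length
    elif kind == "antisense":
        length = ANTISENSE_LENGTH
    else:
        raise ValueError(f"unknown transcript type: {kind}")
    # Every transcript is given an effective length of at least 500.
    return max(length, MIN_EFFECTIVE_LENGTH)
```

For a 6000 bp element and a 190 bp median template length, this gives 5810 for ‘only’, 6000 for ‘run-on’ and 6190 for ‘passive’.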

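The likelihood function can be simplified by rearranging the sum over $A_j$ to group alignments by transcript:

$$P(R \mid X) \propto \prod_j \sum_i \left[ \sum_{A_j : \iota(A_j) = i} \frac{\epsilon^{NM(A_j)}}{l_{\iota(A_j)}} \right] X_i \overset{\mathrm{def}}{=} \prod_j G(R_j) \cdot X$$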

where $G(R_j)$ is defined by the above equation: its $i$th entry is the expression in brackets. This simplification shows that the likelihood function is a product of linear functions of $X$, so its logarithm is concave (Jiang and Wong, 2009). We can therefore find the maximum likelihood estimate of the relative expression levels ($X$) using expectation maximization, without risking convergence to a local maximum that is not global.
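The argument can be spelled out by taking logarithms. The restatement below uses only the quantities already defined in this section.

```latex
\log P(R \mid X) \;=\; \sum_{j=1}^{m} \log\bigl( G(R_j) \cdot X \bigr) \;+\; \text{const},
\qquad X_i \ge 0, \quad \sum_{i=1}^{n} X_i = 1 .
```

Each term is the logarithm of a linear function of $X$ and is therefore concave, a sum of concave functions is concave, and the constraint set is a simplex, which is convex. Maximizing the likelihood is thus a convex optimization problem, and any local maximum is also a global maximum.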

2.3 Collection of L1HS/L1PA reads

If reads have not previously been aligned to hg38, they are first aligned to hg38 using bwa aln (Li and Durbin, 2009), allowing an edit distance of up to 3. If reads have already been aligned, the unaligned reads are extracted and realigned to hg38 using bwa aln. This realignment step is necessary because many aligners do not report alignments for highly ambiguous reads. Once genome alignments are complete, any read pair that overlaps an L1HS or L1PA* family element in either the initial alignment or the realignment is extracted for further analysis.

2.4 Generation of candidate alignments

Because it is not feasible to sum over all possible alignments of every read, we need to approximate $G(R_j)$ by only summing over a set of candidate alignments. This set of alignments is generated by using bwa aln to find all alignments to the transcripts enumerated above that have an edit distance less than or equal to 5, and no indel within 20 basepairs of the end of the read. L1EM run-time is highly sensitive to these mismatch parameters, with more candidate alignments leading to a longer compute time. Allowing fewer mismatches did not affect accuracy in simulation, but some loci may differ significantly from the reference in certain samples. Candidate alignments that have at most 2 more mismatches than the best alignment for that read are retained. The alignment likelihood, $\epsilon^{NM(A_j)} / l_{\iota(A_j)}$, is then calculated for each candidate alignment and added to an $m \times n$ sparse matrix, $G(R)$, where

$$G_i(R_j) = \sum_{A_j : \iota(A_j) = i} \frac{\epsilon^{NM(A_j)}}{l_{\iota(A_j)}}$$

is proportional to the likelihood that a random read from transcript $i$ will have the sequence $R_j$.
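As an illustration of how such a matrix could be assembled, the sketch below builds a reads-by-transcripts sparse matrix from a list of candidate alignments. It is not the L1EM implementation: the function `build_g_matrix` and the tuple layout of `alignments` are assumptions made for this example, and the edit distance is used as a stand-in for the mismatch count when filtering against the best alignment.

```python
from scipy.sparse import csr_matrix


def build_g_matrix(alignments, effective_lengths, n_reads, epsilon=0.01, max_extra=2):
    """Build G(R) from (read_index, transcript_index, edit_distance) tuples."""
    alignments = list(alignments)

    # Best (smallest) edit distance observed for each read.
    best = {}
    for j, _, nm in alignments:
        best[j] = min(nm, best.get(j, nm))

    rows, cols, vals = [], [], []
    for j, i, nm in alignments:
        # Keep candidates within max_extra of the read's best alignment.
        if nm - best[j] <= max_extra:
            rows.append(j)
            cols.append(i)
            vals.append(epsilon ** nm / effective_lengths[i])

    # Duplicate (row, col) entries are summed on construction, which matches
    # the sum over all alignments of one read to the same transcript.
    return csr_matrix((vals, (rows, cols)), shape=(n_reads, len(effective_lengths)))


# Hypothetical input: read 0 has three candidates, read 1 has one.
alignments = [(0, 0, 0), (0, 1, 1), (0, 1, 4), (1, 1, 2)]
G = build_g_matrix(alignments, effective_lengths=[1000.0, 2000.0], n_reads=2)
```

In this toy input the alignment with edit distance 4 is discarded because it has more than 2 extra edits relative to the best alignment of read 0.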

2.5 Estimation of transcript abundance

The EM algorithm requires iterating between two steps: first, during the E step, an estimate of the relative expression levels, $X^{(t)}$, is fixed, and the expected number of reads originating from each transcript is calculated. Second, during the M step, a new estimate of the expression level, $X^{(t+1)}$, is calculated by normalizing the expected counts. This yields the following iteration:

$$C^{(t)} = \sum_j \frac{G(R_j) * X^{(t)}}{G(R_j) \cdot X^{(t)}}$$

$$X^{(t+1)} = \frac{C^{(t)}}{m}$$

where the $C^{(t)}$ are the expected read counts at each transcript and the $*$ operation indicates elementwise multiplication. Because the likelihood function is a product of linear functions, we can start with any initial guess, $X^{(1)}$, that has no 0 entry and iterate these steps until we converge to the maximum likelihood estimate:

$$\hat{X} = \underset{X}{\arg\max}\; P(R \mid X)$$

L1EM uses the uniform distribution as an initial guess, and repeats the iteration until no entry of $X$ changes by more than a parameter $\delta$ ($10^{-7}$ by default). Because some fraction of ‘passive’ transcription can be mislabeled as ‘only’ or ‘run-on’, ‘only’ and ‘run-on’ transcripts are not reported if they make up less than ¾ of the expression at that locus.
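The iteration can be sketched with scipy sparse matrices as follows. This is a minimal illustration rather than the L1EM code: the function name `em_abundance` and the `max_steps` safeguard are assumptions, and every read is assumed to have at least one candidate alignment (no all-zero row). The expected counts are computed through the equivalent form $C_i = X_i \sum_j G_i(R_j) / (G(R_j) \cdot X)$, which needs only two sparse matrix-vector products per step.

```python
import numpy as np
from scipy.sparse import csr_matrix


def em_abundance(G, delta=1e-7, max_steps=100000):
    """Iterate the E and M steps from a uniform start until convergence."""
    G = csr_matrix(G)
    m, n = G.shape
    X = np.full(n, 1.0 / n)  # uniform initial guess, no zero entry
    for step in range(1, max_steps + 1):
        totals = G @ X                    # G(R_j) . X for every read j
        C = X * (G.T @ (1.0 / totals))    # expected read count per transcript
        X_new = C / m                     # M step: normalize by number of reads
        if np.max(np.abs(X_new - X)) <= delta:
            return X_new, step
        X = X_new
    return X, max_steps


# Toy example: three reads unique to transcript 0, one unique to transcript 1.
G_toy = csr_matrix(np.array([[1, 0], [1, 0], [1, 0], [0, 1]], dtype=float) * 1e-3)
X_hat, n_steps = em_abundance(G_toy)
```

With only uniquely aligning reads, as in this toy input, the estimate is simply the fraction of reads assigned to each transcript. Multimapping reads are what make more than one update necessary.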

In this analysis, all estimated expression levels are reported as read pairs (fragments) per million (FPM):

$$\frac{X^{*}\, m}{M / 10^{6}}$$

where $X^{*}$ is the final expression estimate, $m$ is the total number of L1PAx aligning read pairs, and $M$ is the total number of properly aligned read pairs anywhere in the genome.
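As a worked example of this normalization, the short function below converts a relative abundance into FPM. The function and argument names are hypothetical, and the numbers in the usage line are made up for illustration.

```python
def fragments_per_million(x_star, n_l1_pairs, n_proper_pairs):
    """FPM = (relative abundance * L1PAx read pairs) / (properly aligned pairs / 1e6)."""
    return x_star * n_l1_pairs / (n_proper_pairs / 1e6)


# Illustrative numbers only: a transcript with relative abundance 0.1 among
# 2000 L1PAx aligning read pairs, in a library of 40 million proper pairs.
fpm = fragments_per_million(0.1, 2000, 40_000_000)
```

In this example the transcript receives about 200 read pairs, which corresponds to roughly 5 FPM.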

2.6 Implementation

L1EM is implemented in python, with the pysam library used to read bam format alignment files and the scipy sparse matrix library used for all intensive computations. L1EM is available as an open source program under the GPLv3 license. Currently only annotations to quantify L1HS/PA elements in the hg38 human genome are available, but instructions for generating new annotations for other elements in other genomes are provided.

Because the G(R) matrix can be chunked, L1EM is highly parallelizable. For the ENCODE RNA-seq experiments considered in Section 3.3, an average of 2282 steps, each taking an average of 1.5 s on 16 cores with 64 GB memory, were needed to achieve our stringent threshold for convergence ($\delta = 10^{-7}$). In many cases, the bwa alignment and other preprocessing steps are more intensive than the EM step. For the ENCODE RNA-seq experiments the entire pipeline (not including data download) took an average of about 2 h on 16 cores with 64 GB memory.
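The reason chunking works is that the expected counts are a sum over reads, so blocks of rows of the matrix can be processed independently and their partial counts added. The sketch below shows this with a serial loop. The function names and the chunk size are assumptions, and dispatching the chunks to worker processes is left out.

```python
import numpy as np
from scipy.sparse import csr_matrix


def expected_counts(G_chunk, X):
    """Expected read counts per transcript contributed by one block of reads."""
    totals = G_chunk @ X
    return X * (G_chunk.T @ (1.0 / totals))


def chunked_expected_counts(G, X, chunk_size):
    """Sum the per-chunk expected counts; each chunk could run on its own core."""
    G = csr_matrix(G)
    C = np.zeros(G.shape[1])
    for start in range(0, G.shape[0], chunk_size):
        C += expected_counts(G[start:start + chunk_size], X)
    return C


# Hypothetical 5 reads x 3 transcripts matrix with no empty row.
G_demo = np.array([[1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 1.0, 1.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]])
X_demo = np.array([0.5, 0.3, 0.2])
C_chunked = chunked_expected_counts(G_demo, X_demo, chunk_size=2)
```

The chunked result equals the one obtained from the full matrix, and the counts sum to the number of reads.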

3 Results

3.1 On simulated data, L1EM provides accurate estimates at loci with sufficient coverage

Five simulations were performed using the Flux Simulator (Griebel et al., 2012). The Flux Simulator models and simulates each step of an RNA-seq experiment: reverse transcription, fragmentation, adapter ligation, PCR amplification, gel segregation and sequencing. For each of the five simulations, we combined reads from two simulations: 30 000 proper LINE-1 reads were simulated from transcripts that exactly span each L1HS locus that includes the 5’ UTR and 500 000 passive LINE-1 reads were simulated from transcripts that begin 400 bases upstream of each L1HS or L1PA* element and continue 400 bases downstream of the element. The simulation of proper LINE-1 reads is considered the ground truth for LINE-1 expression and the simulation of passive LINE-1 reads is treated as contamination from passive co-transcription. The 76 basepair error model was used and the transcription start site variation was decreased from 25 to 5. This simulation set up was chosen to approximately align with the HEK293 RNA-seq data considered in Section 3.2.

To evaluate L1EM, we compared the number of reads assigned to L1HS ‘only’ transcripts at each locus in each simulation to the simulated expression levels in the first (proper L1HS) simulation. L1EM shows good agreement with these simulations ($r^2 = 0.83$), especially at loci where at least 100 read pairs are estimated to align (dotted line in Fig. 2, left panel). We also tested SQuIRE (Yang et al., 2019) and counted unique L1HS alignments in these simulations. SQuIRE does not explicitly differentiate between proper and passive transcription, but it does estimate the start and stop of each repeat associated transcript. Therefore, we defined transcripts that start 50 bases or more upstream to be passive, and treated other transcripts as being indicative of proper LINE-1 expression. SQuIRE provides similar estimates at most loci (Fig. 2, center), but some proper transcription is misclassified as passive and vice versa, decreasing the overall agreement between simulation and estimation ($r^2 = 0.7$). Merely counting unique alignments yields a poor estimate (Fig. 2, right).

Fig. 2.


Simulated data. Simulations were performed using the Flux Simulator. Note that plots have a square root scaling. Left: L1EM shows good agreement with simulation. Center: SQuIRE performs well but sometimes struggles to separate proper LINE-1 transcription from passive co-transcription. Right: Simply counting unique alignments yields a poor estimate

To test our choice of aligner and alignment parameters, we calculated the fraction of simulated L1HS reads that align, and the fraction of those for which the correct alignment is one of the candidates. bwa aln aligned 79% of the simulated reads, and for 99% of those the correct alignment was one of the candidates. Using TCGA STAR parameters, 72% aligned, with 67% having the correct alignment as one of the candidates. Increasing the number of allowed alignments to 1000 (as per SQuIRE), the fraction of reads that align rose to 77%, with 93% of those having the correct alignment as a candidate. Surprisingly, further increasing this parameter to 10 000 yielded worse results.

3.2 L1EM estimates agree with L1 5’RACE long read sequencing in HEK293 cells

Deininger et al. (2017) used the 5’RACE technique to specifically amplify LINE-1 transcripts beginning with the L1 5’UTR in HEK293 cells. PacBio reads exceeding 1000 letters in length were sequenced from the first 1237 bases of the amplified transcripts. We accessed these reads from the SRA database (SRR4099955), aligned them to hg38 using bwa mem, and then counted unique alignments to quantify LINE-1 transcripts beginning with the L1 5’UTR. As in the original analysis we found one L1HS element (22q12.1) that dominates expression, an L1HS element and a few L1PA2 elements that are moderately expressed, and a tail of loci with diminishing expression. Deininger et al. did not perform Illumina RNA-seq from these cells, so we ran L1EM on 2x150 bp strand-specific HEK293 RNA-seq reads from another study (Aktaş et al., 2017) (SRR3997504-7). To match the L1 5’RACE data, we pooled ‘only’ and ‘run-on’ transcripts. In their original analysis Deininger et al. found that most of the L1 5’RACE reads aligned to L1HS or L1PA2 elements. Thus, we focused on these two subfamilies.

For the elements that are moderately or highly expressed, L1EM does a good job of estimating expression levels (Fig. 3A). Estimates are less accurate when expression at a particular locus is supported only by a small number of reads. Coverage at the 5’ breakpoint is necessary to distinguish passive co-transcription from proper LINE-1 transcription, so at the very least we would not expect accurate results at loci with mean coverage below 1 (dashed line in Fig. 3). Note that because, for most fragments, the read pairs overlap, the median template length (190 bp) rather than the read length (2 × 150 bp) is used to draw this line. Furthermore, because some reads do not align uniquely, each read provides less information. The dotted lines in Figure 3 mark 100 read pairs estimated, a rough cutoff for where we see accurate results.

Fig. 3.


Biological data. L1EM shows good agreement with long read L1 5’RACE sequencing in HEK293 when compared to competing methods. The y=1.064x (blue) line is a least squares fit for L1EM versus L1 5’RACE (A). Dashed line indicates a mean coverage of 1, below which L1EM is unlikely to give meaningful results. Dotted lines indicate an estimate of 100 read pairs, the cutoff used in our ENCODE analysis. Note the square root scaling. (A) L1EM compared to L1 5’RACE. (B) SQuIRE compared to L1 5’RACE. (C) A comparison of uniquely mapping RNAseq reads to L1 5’RACE. (D) L1EM compared to L1 5’RACE when the reads are trimmed to 50 bp. (E) SQuIRE compared to L1 5’RACE when the RNAseq reads are trimmed to 50 bp. (F) L1EM compared to L1 5’RACE when the RNAseq strand specificity is ignored and the reads are shortened to 50 bp (Color version of this figure is available at Bioinformatics online.)

We also ran this RNA-seq data through SQuIRE (Yang et al., 2019), retaining full length, sense L1HS/L1PA2 transcripts that begin within 50 bp of the annotated starting position, and tested a simple method, consisting of counting only unique alignments. SQuIRE performed similarly, but slightly worse than L1EM (Fig. 3B): $r^2 = 0.81$ for SQuIRE versus $r^2 = 0.87$ for L1EM. Simply counting unique alignments yielded a poor estimate (Fig. 3C).

Finally, we tested the effect of trimming the reads and ignoring strand specificity on our ability to accurately estimate LINE-1 expression. Trimming reads did not affect the accuracy of L1EM: $r^2 = 0.89$ for trimmed reads versus $r^2 = 0.87$ for full length reads (Fig. 3D). With the trimmed reads, SQuIRE struggled to accurately differentiate proper and passive transcription (Fig. 3E). In general, there is much more passive transcription on the antisense strand. As a result, differentiating proper from passive transcription is significantly harder without strand specific reads, and L1EM accuracy suffers somewhat: $r^2 = 0.8$ (Fig. 3F).

3.3 LINE-1 is expressed in some stem and cancer cells, but expressed little or not at all in tissue samples

We applied L1EM to all human 100 bp paired end, strand specific polyA RNAseq experiments in the ENCODE database (127 datasets including 49 from cell-lines, 16 from in vitro differentiated cells and 62 from tissues). The full results of this analysis can be found in Supplementary Tables S1–S4. Based on our tests using simulated (Fig. 2) and L1 5’RACE data (Fig. 3), we made the somewhat conservative assertion that we can be confident in L1EM’s expression estimate at loci where at least 100 read pairs are estimated to align. We therefore used a cutoff of at least 100 read pairs and FPM > 0.5 in at least one replicate to call expression at a given locus in a particular sample type. Figure 4, top panel, shows a heat map with all intact loci that meet these criteria in at least one sample of total or cytoplasmic RNA, and all such samples that meet these criteria for at least one intact locus. Intact loci are defined to be full length insertions that have no nonsense mutation in either ORF1 or ORF2.
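The expression call described above can be written as a simple predicate. This sketch assumes that both thresholds must be met within the same replicate, which is one reading of the criterion. The function name and input layout are hypothetical.

```python
def locus_is_expressed(replicates, min_pairs=100, min_fpm=0.5):
    """replicates: (estimated_read_pairs, fpm) tuples for one locus in one sample type."""
    return any(pairs >= min_pairs and fpm > min_fpm for pairs, fpm in replicates)


# Illustrative values only.
called = locus_is_expressed([(150, 0.7), (40, 0.2)])
not_called = locus_is_expressed([(150, 0.3), (80, 0.9)])
```

The second example is not called because no single replicate satisfies both the read pair and the FPM threshold.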

Fig. 4.


ENCODE data. Estimates of L1HS locus expression from ENCODE data (100 bp paired end, strand specific). Top: Loci expressed as only or run-on LINE-1 transcripts. Bottom: Loci expressed from the antisense transcription start site

This analysis shows LINE-1 expression in several cancer cell lines (MCF-7, NCI-H60, HepG2 and K562), along with embryonic stem cells (H1-hESC) and several H1 derived cells including mesendoderm, trophoblast, mesenchymal stem cells and neural stem cells. Of the tissues tested, only esophagus samples achieved our expression criteria. However, in two of two brain and one of two testis samples, an intact locus was expressed at >0.5 FPM, but these samples were not sequenced deeply enough to achieve the 100 read pair cutoff. This result is broadly in agreement with what is known about LINE-1 activities in cancer, in early development and during neuronal differentiation.

One locus, chr22: 28663283-28669315 (chr22q12.1) located in the tetratricopeptide repeat gene TTC28, is the most highly expressed or one of the most highly expressed loci in all of these samples. Across all of the ENCODE samples tested, it accounts for 17% of intact L1HS expression. It is also the most highly expressed locus in HEK293 (see Section 3.2). The overall expression pattern in H1-hESC and derived mesendoderm cells is quite different from the pattern in cancer cell-lines. In H1-hESC and mesendoderm cells, LINE-1 is expressed from a broad set of loci, but in the cancer cell-lines, one or a few loci dominate.

We also looked at estimates of transcription originating from the LINE-1 antisense transcription start site. Samples for which at least one read pair per million properly mapped is assigned to L1HS antisense transcripts are shown in Figure 4, bottom. As noted previously (Philippe et al., 2016) we see little overlap between proper LINE-1 expression and antisense expression. Only one locus (7p14.3) achieves our criteria for both proper and antisense expression, but it is not expressed in overlapping samples. One third of all antisense transcription is estimated to come from a locus for which the LINE-1 antisense TSS is coopted to be the TSS for the focadhesin (FOCAD) gene.

Finally, it is worth pointing out that there is a significant amount of passive co-transcription associated with several samples (see Supplementary Figs S3 and S4). H1 derived neural stem cells, H1 derived neurons and several heart tissue samples have loci that are passively transcribed on par with the proper LINE-1 expression seen in H1, MCF-7 and mesendodermal cells. In neurons and neural stem cells, this transcription is antisense and can thus be detected by strand specific techniques, but in the heart tissue samples this passive co-transcription is sense to the LINE-1 element, highlighting the importance of considering passive co-transcription when identifying LINE-1 expression.

3.4 The fraction of transcripts that ‘run-on’ varies considerably and depends on polyA length

Because L1EM explicitly models both LINE-1 transcripts that terminate at the polyA tail (‘only’) and transcripts that terminate downstream (‘run-on’), it provides an estimate of the fraction of proper LINE-1 transcripts that fail to terminate at the LINE-1 polyA tail. To determine whether the fraction of transcripts that run on varies between loci, we reanalyzed 2 × 150 bp strand specific RNA-seq data from MCF-7 cells (ERR973734, Philippe et al., 2016) using L1EM. This dataset provides extremely deep coverage of L1HS elements, with the top 50 most expressed loci all supported by more than 100 read pairs.

We find that the fraction of transcripts that fail to terminate varies widely by locus, observing ‘only’ transcription without ‘run-on’ transcription at many loci (Fig. 5). This is likely due to variation in the 3’ polyA sequence that strengthens or weakens the polymerase II termination signal. Considering the top 50 expressed LINE-1 loci and looking for a window at the 3’ end of each element that is at least 90% A, we find that loci with ‘run-on’ sequence have an average polyA window of length 22.8, while loci that do not ‘run-on’ have an average polyA window of length 36.4 (t-test p < 0.0003).
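One way to measure such a polyA window is sketched below. The text does not give the exact window definition, so this is an assumption: the function returns the length of the longest stretch that ends at the 3’ end of the sequence, begins with an A, and is at least 90% A. The function name is hypothetical.

```python
def polya_window_length(seq, min_fraction=0.9):
    """Length of the longest 3'-terminal window that starts with A and is >= min_fraction A."""
    best = 0
    a_count = 0
    # Walk from the 3' end toward the 5' end, growing the window one base at a time.
    for k, base in enumerate(reversed(seq.upper()), start=1):
        if base == "A":
            a_count += 1
            if a_count / k >= min_fraction:
                best = k
    return best


# Illustrative sequences: a clean 20 bp tail, and a tail interrupted by one G.
clean_tail = polya_window_length("CGTCG" + "A" * 20)
interrupted_tail = polya_window_length("CCC" + "A" * 10 + "G" + "A" * 10)
```

Under this definition a single interrupting base does not end the window as long as the overall A fraction stays at or above 90%.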

Fig. 5.


Run-on transcripts. Fraction of proper transcripts that terminate downstream of the polyA tail versus the polyA tail length

4 Discussion

We have shown that L1EM provides accurate estimates of RNA expression from young LINE-1 loci. It does this by combining the expectation maximization algorithm (EM) with our pre-existing knowledge of LINE-1 transcription: namely (i) that it can only occur from elements that retain the 5’UTR sequence, (ii) that it sometimes runs beyond the 3’ polyA sequence and (iii) that LINE-1 aligning RNA reads can be generated passively from transcription not related to LINE-1 activity. Here we show that L1EM performs accurately on simulated data (Fig. 2) and that it provides good agreement with L1 5’RACE PacBio sequencing (Fig. 3).

As a maximum likelihood method, EM will give a good estimate provided that (i) no two loci are exactly identical, (ii) the underlying model is correct and (iii) the sequence coverage is sufficiently deep. These conditions also highlight the weaknesses of L1EM. If two loci are identical, L1EM will divide expression evenly between the two loci. While there are some identical loci annotated in the human genome, we have not yet observed this behavior. The model underlying L1EM differs from LINE-1’s true biology in at least two key ways. Firstly, because we use the bwa aln aligner, splicing is not considered. However, splicing is not part of LINE-1’s normal lifecycle, so we do not expect this to be a major concern. Potentially more importantly, L1EM only considers reference loci, when in fact individuals can have dozens of polymorphic LINE-1 insertions (Gardner et al., 2017) that may be expressed. If non-reference insertions are expressed, their RNA would be assigned to reference loci, most likely to the parent element since it would bear the greatest sequence similarity. Finally, and likely most importantly, no sequencing experiment achieves an infinite depth of coverage, and LINE-1 expression tends to be low compared to host genes. Accurately distinguishing passive from proper transcription requires the presence of at least one read overlapping the 5’ junction. Furthermore, because of the high degree of sequence similarity, a LINE-1 aligning read provides less information about expression than one that aligns to an exon. Thus, when analyzing the ENCODE data, we focus on loci that are supported by at least 100 read pairs.

Our analysis of ENCODE RNA-seq data largely confirms what is known about LINE-1 expression: that LINE-1 is expressed during early development and many cancers, but is silenced in healthy somatic tissues. We find that a particular locus (located at 22q12.1) is nearly universally present in LINE-1 expressing cells. We also find that the fraction of LINE-1 transcripts that run through the polyA tail varies locus by locus. When ‘run-on’ transcripts are reinserted into the genome, they generate 3’ transductions that reveal the parent element for a novel insertion. 3’ transductions have been extensively analyzed to reveal ‘hot’ LINE-1 loci (e.g. Tubio et al., 2014). Our analysis suggests that the set of loci that form 3’ transductions may be a subset of the elements that generate novel insertions. It is possible that for some loci lacking LINE-1 RNA, it is actually a non-reference daughter insertion that is expressed. However, because we find an inverse relationship between polyA length and ‘run-on’ transcripts, we do not believe that non-reference insertions fully explain this result.

Previous methods have been developed to address locus-specific LINE-1 expression, but none are as scalable and comprehensive as L1EM. Philippe et al. (2016) measured LINE-1 locus specific expression by identifying an H3K4 methylation signal upstream of an element and a run-on signal downstream of the element. The reliance on ChIP-seq data and 3’ run-on transcription limits this method. Deininger et al. (2017) identified specifically expressed LINE-1 loci using the subset of reads that align uniquely to a particular LINE-1 element. However, because the intact L1HS LINE-1 elements are highly repetitive, few of the reads aligning to these elements align uniquely, making accurate quantification difficult. Finally, several other methods apply expectation maximization to transposable elements (including LINE-1), but only SQuIRE (Yang et al., 2019) provides locus specific estimates, and none include L1EM’s explicit modeling of the sources of LINE-1 RNA.

5 Conclusion

L1EM uses the expectation maximization algorithm to quantify LINE-1 transcripts arising from the 5’ UTR of each L1HS locus. It provides accurate locus specific quantifications both from simulated data and from real data. L1EM makes it possible to specifically measure intact LINE-1 transcripts that are (potentially) retrotransposition competent, and to identify critical loci.

Supplementary Material

btz724_Supplementary_Data

Acknowledgements

We would like to acknowledge Dr Jef Boeke for providing insight into the mechanisms of LINE-1 transcription and the potential sources of LINE-1 RNA.

Funding

This work was supported by the National Institutes of Health National Institute on Aging [P01AG051449 subcontract to D.F.].

Conflict of Interest: none declared.

Contributor Information

Wilson McKerrow, Institute for Systems Genetics, USA; Department for Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA.

David Fenyö, Institute for Systems Genetics, USA; Department for Biochemistry and Molecular Pharmacology, NYU School of Medicine, New York, NY, USA.

References

  1. Aktaş T. et al. (2017) DHX9 suppresses RNA processing defects originating from the Alu invasion of the human genome. Nature, 544, 115–119.
  2. Bao W. et al. (2015) Repbase update, a database of repetitive elements in eukaryotic genomes. Mobile DNA, 6, 11.
  3. Coufal N.G. et al. (2009) L1 retrotransposition in human neural progenitor cells. Nature, 460, 1127–1131. doi: 10.1038/nature08248.
  4. De Cecco M. et al. (2019) L1 drives IFN in senescent cells and promotes age-associated inflammation. Nature, 566, 73.
  5. Deininger P. et al. (2017) A comprehensive approach to expression of L1 loci. Nucleic Acids Res., 45, e31.
  6. Dempster A.P. et al. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodological), 39, 1–22.
  7. Feng Q. et al. (1996) Human L1 retrotransposon encodes a conserved endonuclease required for retrotransposition. Cell, 87, 905–916.
  8. Gardner E.J. et al. (2017) The Mobile Element Locator Tool (MELT): population-scale mobile element discovery and biology. Genome Res., 27, 1916.
  9. Griebel T. et al. (2012) Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res., 40, 10073–10083.
  10. Hattori M. et al. (1986) L1 family of repetitive DNA sequences in primates may be derived from a sequence encoding a reverse transcriptase-related protein. Nature, 321, 625.
  11. Jiang H., Wong W.H. (2009) Statistical inferences for isoform expression in RNA-Seq. Bioinformatics, 25, 1026–1032.
  12. Jin Y. et al. (2015) TEtranscripts: a package for including transposable elements in differential expression analysis of RNA-Seq datasets. Bioinformatics (Oxford, England), 31, 3593–3599.
  13. Kano H. et al. (2009) L1 retrotransposition occurs mainly in embryogenesis and creates somatic mosaicism. Genes Devel., 23, 1303–1312.
  14. Kemp J.R., Longworth M.S. (2015) Crossing the LINE toward genomic instability: LINE-1 retrotransposition in cancer. Front. Chem., 3, 68.
  15. Li B., Dewey C.N. (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12, 323.
  16. Li H., Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics (Oxford, England), 25, 1754–1760.
  17. Martin S.L. (2006) The ORF1 protein encoded by LINE-1: structure and function during L1 retrotransposition. J. Biomed. Biotechnol., 2006, 45621.
  18. Mita P. et al. (2018) LINE-1 protein localization and functional dynamics during the cell cycle. ELife, 7, e30058.
  19. Patro R. et al. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods, 14, 417–419.
  20. Percharde M. et al. (2018) A LINE1-nucleolin partnership regulates early development and ESC identity. Cell, 174, 391–405.e19.
  21. Philippe C. et al. (2016) Activation of individual L1 retrotransposon instances is restricted to cell-type dependent permissive loci. ELife, 5, e13926.
  22. Rodić N. et al. (2014) Long interspersed element-1 protein expression is a hallmark of many human cancers. Am. J. Pathol., 184, 1280–1286.
  23. Rodriguez-Terrones D., Torres-Padilla M.-E. (2018) Nimble and ready to mingle: transposon outbursts of early development. Trends Genet., 34, 806–820.
  24. Singer T. et al. (2010) LINE-1 retrotransposons: mediators of somatic variation in neuronal genomes? Trends Neurosci., 33, 345–354.
  25. Speek M. (2001) Antisense promoter of human L1 retrotransposon drives transcription of adjacent cellular genes. Molecul. Cell. Biol., 21, 1973–1985.
  26. Tubio J.M.C. et al. (2014) Mobile DNA in Cancer. Extensive transduction of nonrepetitive DNA mediated by L1 retrotransposition in cancer genomes. Science (New York, N.Y.), 345, 1251343.
  27. Wei W. et al. (2001) Human L1 retrotransposition: cis preference versus trans complementation. Mol. Cell. Biol., 21, 1429–1439.
  28. Yang W.R. et al. (2019) SQuIRE reveals locus-specific regulation of interspersed repeat expression. Nucleic Acids Res., 47, e27.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
