Identifying transposon insertions and their effects from RNA-sequencing data

Julian R de Ruiter; Sjors M Kas; Eva Schut; David J Adams; Marco J Koudijs; Lodewyk F A Wessels; Jos Jonkers

doi:10.1093/nar/gkx461

Nucleic Acids Res. 2017 Jun 1;45(12):7064–7077. doi: 10.1093/nar/gkx461

PMCID: PMC5499543 PMID: 28575524

Abstract

Insertional mutagenesis using engineered transposons is a potent forward genetic screening technique used to identify cancer genes in mouse model systems. In the analysis of these screens, transposon insertion sites are typically identified by targeted DNA-sequencing and subsequently assigned to predicted target genes using heuristics. As such, these approaches provide no direct evidence that insertions actually affect their predicted targets or how transcripts of these genes are affected. To address this, we developed IM-Fusion, an approach that identifies insertion sites from gene-transposon fusions in standard single- and paired-end RNA-sequencing data. We demonstrate IM-Fusion on two separate transposon screens of 123 mammary tumors and 20 B-cell acute lymphoblastic leukemias, respectively. We show that IM-Fusion accurately identifies transposon insertions and their true target genes. Furthermore, by combining the identified insertion sites with expression quantification, we show that we can determine the effect of a transposon insertion on its target gene(s) and prioritize insertions that have a significant effect on expression. We expect that IM-Fusion will significantly enhance the accuracy of cancer gene discovery in forward genetic screens and provide initial insight into the biological effects of insertions on candidate cancer genes.

INTRODUCTION

Transposon-based insertional mutagenesis (TIM) is a high-throughput method for cancer gene discovery in mice (1). In TIM, discrete DNA elements called transposons can migrate throughout the genome by a cut-and-paste mechanism, in which they are excised from their original location in the genome and randomly reintegrated elsewhere (2). Depending on the location and orientation of their reintegration, these integrations can activate oncogenes or inactivate tumor suppressors, thereby inducing tumor development and progression (3). By identifying genomic loci that are recurrently affected by transposon insertions in multiple independent tumors, this approach can be used to identify candidate cancer genes (1,3,4).

Transposon insertion sites are typically identified using targeted DNA-sequencing approaches, in which junction fragments containing transposon and flanking genomic sequences are selectively amplified and sequenced (5). The genomic parts of these sequences are mapped to the reference genome to identify insertion sites and their genomic locations (6). These insertions are then assigned to their putative target gene(s) using heuristics, typically picking genes in the direct vicinity of the insertion. Examples of such heuristics are nearest gene (6), fixed window (7) and rule-based mapping approaches (8).

A significant drawback of DNA-sequencing approaches is that they do not provide any direct evidence that an insertion actually affects a gene. In ambiguous cases with multiple genes in the vicinity of an insertion, heuristic approaches are frequently unable to identify the true target(s) of the insertion. This typically leads to an arbitrary selection of a single gene (nearest gene), potentially selecting the wrong gene or missing other targets (false negatives). Alternatively, heuristics may select many genes in the direct vicinity of the insertion (fixed window, rule-based mapping), resulting in the selection of many non-target genes (false positives).

Additionally, DNA-sequencing approaches provide limited insight into how the expression of a target gene is affected by a transposon insertion and which novel transcripts may result from the insertion. This has two main drawbacks. First, it prevents prioritizing insertions that have a strong effect on gene expression and are therefore likely of more importance than insertions without an effect on expression. This limits effective discrimination between driver and passenger insertions, resulting in long lists of candidate loci which are likely to include a substantial fraction of false positives that do not affect expression. Second, it limits our understanding of how gene expression or the expression of (novel) gene transcripts is affected by insertions. These insights may be key to ultimately understanding the biological effect of insertions and how they may contribute to tumorigenesis.

In previous work, Temiz et al. have demonstrated that insertions can be identified in paired-end RNA-sequencing data using their tool Fusion Finder (9). In Fusion Finder, insertions are detected from discordant mate pair alignments, in which one mate aligns to a genomic sequence and the other to part of the transposon sequence. A drawback of this approach is that it does not use information from chimeric reads overlapping the fusion boundary between the gene and the transposon (split reads), limiting the accuracy and sensitivity of insertion detection. Additionally, the dependency on mate pair information prevents its use for analyzing datasets based on single-end RNA-sequencing.

In this work, we present an approach called IM-Fusion, which uses fusion-aware RNA-seq alignment to identify transposon insertions from splicing events between endogenous genes and the transposon. Key advantages of this approach are that it identifies exactly which gene(s) are affected by a transposon insertion and how the transposon is incorporated into the resulting gene transcript. Additionally, by using both split reads and discordant mate pairs to identify insertions, IM-Fusion is more sensitive than existing approaches and can be used to analyze single-end RNA-sequencing datasets. Finally, by combining insertions with exon-level expression data, we are able to accurately predict the consequences of integrations on gene transcripts.

MATERIALS AND METHODS

IM-Fusion

Identifying insertion sites

First, we create an augmented reference genome by adding the transposon as an extra sequence to the reference genome. Then, for each sample, we align sequence reads to the augmented reference genome using a fusion-aware RNA-seq aligner such as STAR (10) or Tophat-Fusion (11). By default, STAR is used for alignment, with the argument ‘--chimSegmentMin’ to ensure that chimeric read alignments are produced. Chimeric alignments from STAR are filtered to select alignments that represent fusions between the transposon and genomic sequences. Alignments that overlap with the fusion junction (represented by split-read alignments) are grouped by the position of their breakpoints, as these reads precisely identify the location of a fusion. Each such group is considered to represent a single gene-transposon fusion. For paired-end sequencing data, alignments that do not overlap with the fusion boundary are grouped if their mate positions fall within a pre-defined distance, which depends on the insert size of the dataset. Where possible, these ‘spanning’ read groups are assigned as additional support for fusions identified from split reads. For cases where no such fusion is found, approximate locations for the corresponding fusions are predicted based on the bounds provided by the spanning reads.
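The grouping of split-read alignments by breakpoint can be illustrated with a minimal sketch. The `ChimericAlignment` record and its fields are hypothetical simplifications introduced here for illustration; they do not reflect the actual data structures of the imfusion package or the STAR chimeric output format.

```python
from collections import Counter
from typing import NamedTuple


class ChimericAlignment(NamedTuple):
    """Hypothetical, simplified record of one gene-transposon split read."""
    chrom: str           # chromosome of the genomic breakpoint
    genome_pos: int      # genomic breakpoint position
    strand: int          # genomic strand (1 or -1)
    transposon_pos: int  # breakpoint position within the transposon sequence


def count_fusion_support(alignments):
    """Group split reads sharing identical breakpoints; return reads per group.

    Each distinct (chrom, genome_pos, strand, transposon_pos) key is treated
    as one candidate gene-transposon fusion.
    """
    return Counter(
        (a.chrom, a.genome_pos, a.strand, a.transposon_pos) for a in alignments
    )


# Toy input: two reads support one breakpoint, one read supports another.
reads = [
    ChimericAlignment("11", 100, 1, 1541),
    ChimericAlignment("11", 100, 1, 1541),
    ChimericAlignment("4", 500, -1, 1541),
]
support = count_fusion_support(reads)
```

Spanning mate pairs, which only bound the breakpoint, would need a separate distance-based grouping step that is not shown here.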

The identified fusions are annotated to identify which gene(s) and which transposon feature(s) are involved in each fusion. Fusions that do not involve splice acceptor (SA) or splice donor (SD) features of the transposon or fusions that represent biologically implausible situations (such as fusions between transposon features and gene exons in opposite orientations) are considered artifacts and removed from the list of fusions. Optionally, fusions supported by fewer than a pre-defined number of reads can be removed to avoid fusions with low support. For this filtering, we provide two distinct measures: a support score and an FFPM (fusion fragments per million) score. The support score simply indicates the number of reads/mates that support the corresponding fusion. The FFPM score is a scaled version of the support score, which is normalized for differences in sequencing depth between samples. This score is analogous to the FFPM score used by STAR-Fusion (12). The list of filtered fusions is used to predict approximate locations of the corresponding insertion sites, based on the breakpoints of the fusions.
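As a rough illustration of the two filtering measures, the sketch below scales a support count to fragments per million. It assumes that the denominator is the total number of sequenced fragments in the sample; the text only states that the score is normalized for sequencing depth, so the exact denominator and the function names are assumptions.

```python
def ffpm(support: int, total_fragments: int) -> float:
    """Fusion fragments per million: support scaled by sequencing depth.

    `total_fragments` is assumed to be the number of reads (single-end)
    or read pairs (paired-end) sequenced for the sample.
    """
    if total_fragments <= 0:
        raise ValueError("total_fragments must be positive")
    return support / (total_fragments / 1_000_000)


def passes_filter(support, total_fragments, min_support=0, min_ffpm=0.0):
    """Keep a fusion only if it meets both (optional) thresholds."""
    return (support >= min_support
            and ffpm(support, total_fragments) >= min_ffpm)


example_score = ffpm(10, 20_000_000)  # 10 supporting fragments, 20M total
```

Because the FFPM score is relative to depth, the same threshold can be applied across samples sequenced to different depths, whereas a raw support threshold favors deeply sequenced samples.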

Transcript assembly

To identify cases in which insertions lead to the expression of non-canonical transcripts, IM-Fusion provides an optional step which uses StringTie (13) to perform a reference-guided assembly of novel transcripts using the read alignment from STAR. The produced transcript annotation is used to assign any previously unannotated insertions to any novel transcripts that overlap with the insertion. If such a novel transcript overlaps with any known genes, the corresponding insertion is also assigned to these known genes, as the transcript likely represents an alternative transcript of these existing genes.

Selecting commonly targeted genes

Commonly targeted genes (CTGs) are selected by testing if genes are affected by insertions more frequently than would be expected by chance according to the Poisson distribution. The Poisson distribution expresses the probability of a given number of events occurring in a fixed interval of time or space, as long as the expected number of events in a fixed window is known and events occur independently. Specifically,

$$P(X = k) = \frac{\lambda_g^{k} e^{-\lambda_g}}{k!}$$

where k is the number of events and λ_g is the expected number of events in a fixed window. Here, each insertion represents an independent event and the fixed window is the genomic region of the gene of interest, optionally expanded to include a window around the gene. The expected number of insertions is calculated based on the size of the gene window, the size of the transcriptome (the union of windows for all genes) and the total number of insertions within the transcriptome windows.

In detail, we first count the number of insertions that were identified for a given gene g (by the insertion identification step) and were located within a pre-defined window (by default 20 kb) around the gene. This count is denoted as N_g. Second, we calculate the expected number of insertions in gene g (λ_g) based on its window size and the total number of insertions within the transcriptome as follows:

$$\lambda_g = N_t \cdot \frac{W_g}{W_t}$$

in which W_g corresponds to the size of the window around gene g, W_t the size of the transcriptome windows (the sum of windows for all genes in the genome, corrected for overlap between gene windows) and N_t represents the total number of insertions within the transcriptome windows. Using λ_g, we then calculate the probability of observing N_g or more insertions in gene g as:

$$P(X \geq N_g) = 1 - \sum_{k=0}^{N_g - 1} \frac{\lambda_g^{k} e^{-\lambda_g}}{k!}$$

After testing all genes of interest (by default all genes with at least one insertion in the gene), calculated P-values are corrected for multiple testing using Bonferroni correction.
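The CTG test described above can be sketched in a few lines of standard-library Python. The function names and the example numbers are illustrative assumptions; this is not the imfusion implementation.

```python
import math


def expected_insertions(gene_window, transcriptome_window, total_insertions):
    """lambda_g: insertions expected in the gene window by chance."""
    return total_insertions * gene_window / transcriptome_window


def poisson_pvalue(n_observed, lam):
    """P(X >= n_observed) for X ~ Poisson(lam), with lam > 0."""
    # Sum P(X = k) for k < n_observed in log space to avoid overflow.
    below = sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        for k in range(n_observed)
    )
    return min(1.0, max(0.0, 1.0 - below))


def bonferroni(pvalues):
    """Multiply each P-value by the number of tests, capped at 1."""
    n_tests = len(pvalues)
    return [min(1.0, p * n_tests) for p in pvalues]


# Toy numbers: a 50 kb gene window in a 1 Gb transcriptome, 2000 insertions.
lam = expected_insertions(50_000, 1_000_000_000, 2000)
p_single = poisson_pvalue(1, lam)    # one insertion: unremarkable
p_recurrent = poisson_pvalue(5, lam)  # five insertions: very unlikely by chance
```

With an expected count of 0.1, a single insertion gives a P-value of roughly 0.095, while five insertions give a P-value many orders of magnitude smaller, which is the recurrence signal the test is designed to pick up.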

If the transposon employed in the screen is known to be biased toward integrating at specific nucleotide sequences, λ_g can be calculated differently to take this integration bias into account. In this case, instead of using the size of the gene windows, we use the number of occurrences of the nucleotide sequence within the gene window (S_g) and within the transcriptome windows (S_t) to calculate λ_g:

$$\lambda_g = N_t \cdot \frac{S_g}{S_t}$$

To account for a potential bias in integrations on the chromosome on which the transposon concatemer is located, insertions and genes on the donor chromosome can be excluded from the analysis. In this case, genes on the donor chromosome are also excluded when calculating the transcriptome size (W_t/S_t) and the number of insertions (N_t).
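For the sequence-biased variant of λ_g, the counts S_g and S_t could be obtained by counting motif occurrences in the window sequences. The sketch below is a hedged illustration: the helper names are hypothetical, and the motif "TA" is only an example input (Sleeping Beauty is generally reported to integrate at TA dinucleotides; the text above does not name a specific sequence).

```python
def count_motif(sequence: str, motif: str) -> int:
    """Count (possibly overlapping) occurrences of motif in sequence."""
    sequence, motif = sequence.upper(), motif.upper()
    count, start = 0, sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count


def expected_insertions_biased(s_gene, s_transcriptome, total_insertions):
    """lambda_g based on motif counts instead of window sizes."""
    return total_insertions * s_gene / s_transcriptome


# Toy example: a short 'gene window' sequence and an assumed genome-wide count.
s_g = count_motif("ggTATAcctagTAa", "TA")
lam_biased = expected_insertions_biased(s_g, 1_000_000, 2000)
```

When donor-chromosome genes are excluded, the same exclusion would have to be applied to the sequences used for S_t and to the insertions counted in N_t so that numerator and denominator stay consistent.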

Differential expression analysis

To test for differential expression, we first generate exon expression counts from the read alignments using featureCounts (14). For this count summarization, we use a flattened version of the reference GTF file, which is similar to the flattened GTF files produced by DEXSeq (15). This flattened GTF is required to ensure that overlapping exons from different transcripts of the same gene are only counted once by featureCounts.

Next, to test a given gene g for differential expression, we divide the exons of gene g into two groups: those before the transposon insertions in the gene (the 'before' exons) and those after the insertions (the 'after' exons). We assume that the expression counts of the before exons are not directly affected by the presence of an insertion and therefore reflect differences in the overall expression of the gene between samples. Based on this assumption, we normalize the counts of each sample for differences in overall expression of the gene by dividing the counts by a sample-specific normalization factor, which is calculated from the counts of the before exons using DESeq2's median-of-ratios approach (16). We then sum the normalized counts of the after exons per sample, to get a single (normalized) count of expression after the insertion site for each sample. Finally, to actually test for differential expression in the presence of an insertion in gene g, we use a two-tailed Mann–Whitney-U test to compare the distribution of these counts between samples with an insertion in gene g and samples without an insertion in the gene.
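The normalization step can be sketched as follows. This is a simplified, standard-library re-implementation of the median-of-ratios idea applied to the exons before the insertion, written for illustration only; the function names and the dictionary-based input layout are assumptions, and DESeq2 itself is not called.

```python
import math
from statistics import median


def size_factors(before_counts):
    """Median-of-ratios factors from counts of exons before the insertion.

    before_counts maps sample -> list of exon counts (same exon order).
    Exons with a zero count in any sample are skipped, since their
    geometric mean across samples is zero.
    """
    samples = list(before_counts)
    n_exons = len(before_counts[samples[0]])
    log_geo_means = {}
    for i in range(n_exons):
        values = [before_counts[s][i] for s in samples]
        if all(v > 0 for v in values):
            log_geo_means[i] = sum(math.log(v) for v in values) / len(values)
    if not log_geo_means:
        raise ValueError("no exon is expressed in all samples")
    return {
        s: math.exp(median(math.log(before_counts[s][i]) - lg
                           for i, lg in log_geo_means.items()))
        for s in samples
    }


def normalized_after_counts(before_counts, after_counts):
    """Sum the after-insertion exon counts per sample, scaled by its factor."""
    factors = size_factors(before_counts)
    return {s: sum(after_counts[s]) / factors[s] for s in after_counts}


# Toy data: sample "b" expresses the gene twice as highly as "a" overall,
# but has four times the counts after the insertion site.
before = {"a": [10, 20], "b": [20, 40]}
after = {"a": [5, 5], "b": [20, 20]}
factors = size_factors(before)
norm = normalized_after_counts(before, after)
```

The resulting per-sample values would then be compared between samples with and without an insertion, for example with `scipy.stats.mannwhitneyu(x, y, alternative="two-sided")`.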

In some cases, the above test is not possible because some samples do not have at least one exon before and after their insertion sites. This mostly occurs when insertions are located upstream of the first exon of the gene. To handle these cases, we first try to remove these problematic samples and repeat the test using the remaining samples. For cases where this does not leave us with any samples to test, we provide an additional gene-level test, which compares the expression of the overall gene between samples with/without insertions after normalizing for overall differences in sequencing depth.

By default, we do not use multiple testing correction for the differential expression test, as we primarily select CTGs using the Poisson-based test and use the differential expression test as an extra test to determine whether to keep the CTG. Additionally, not all CTGs may be subjected to the same test, as some genes may be tested using the gene-level test if the exon-level version is not applicable.

Single-sample differential expression

To test for differential expression in a single sample (as opposed to the group-wise test described above), we provide an alternative approach that uses the same normalization procedure, but uses a negative binomial distribution to compare the expression of the sample of interest to samples without an insertion. In this approach, a negative binomial distribution is fitted to the after-insertion counts of samples without an insertion in the gene. The after-insertion count of the sample of interest is then compared to this distribution using a two-tailed test to determine if the gene is differentially expressed.
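A minimal sketch of such a single-sample test is given below. The text does not specify how the negative binomial is fitted or how the two-tailed P-value is defined, so two assumptions are made here and should be read as such: the distribution is fitted by the method of moments, and the P-value is taken as twice the smaller tail probability.

```python
import math
from statistics import mean, variance


def fit_nb_moments(counts):
    """Method-of-moments fit (an assumption): returns (size, prob)."""
    values = [float(c) for c in counts]
    m, v = mean(values), variance(values)
    if v <= m:
        raise ValueError("counts are not overdispersed; NB fit undefined")
    size = m * m / (v - m)
    return size, size / (size + m)


def nb_pmf(k, size, prob):
    """Negative binomial probability of observing count k."""
    return math.exp(
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(prob) + k * math.log1p(-prob)
    )


def nb_two_tailed_pvalue(observed, background):
    """Compare one sample's after-insertion count to insertion-free samples."""
    size, prob = fit_nb_moments(background)
    k = int(round(observed))  # normalized counts are rounded to integers
    lower = sum(nb_pmf(i, size, prob) for i in range(k + 1))   # P(X <= k)
    upper = max(0.0, 1.0 - (lower - nb_pmf(k, size, prob)))    # P(X >= k)
    return min(1.0, 2.0 * min(lower, upper))


# Toy background: after-insertion counts of samples without an insertion.
background = [10, 12, 8, 30, 15, 5, 20, 9]
p_typical = nb_two_tailed_pvalue(13, background)
p_extreme = nb_two_tailed_pvalue(200, background)
```

A count close to the background mean yields a large P-value, whereas a count far above the background range yields a very small one.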

Implementation

For convenience and reusability, we implemented the different steps of IM-Fusion in a Python package called imfusion, which is freely available on GitHub (https://github.com/nki-ccb/imfusion). Jupyter notebooks containing the code and results of the various computational analyses are also available on GitHub (https://github.com/jrderuiter/imfusion-analyses).

The Python package provides commands for each main step of IM-Fusion, including the construction of the custom reference genome, identification of insertions from RNA-seq reads, selection of CTGs and analysis of differential expression. The current implementation supports the use of STAR or Tophat-Fusion to detect fusions, although support for additional fusion-aware aligners may be added in the future. For full functionality, working installations of STAR/Tophat2, StringTie and featureCounts are required: STAR or Tophat2 (which implements Tophat-Fusion) is used to align reads and detect fusions, StringTie is used to detect novel transcripts and featureCounts is used to generate the expression counts. Optionally, STAR-Fusion (12) can also be used to detect endogenous gene fusions as part of the STAR insertion detection pipeline.

Datasets

ILC dataset (RNA-seq)

Single-end RNA-sequencing data from 123 tumors were obtained from a dataset of a Sleeping Beauty (SB) transposon screen in a mouse model of invasive lobular breast carcinoma (ILC) (17). The RNA-seq data were downloaded from ENA in fastq format (accession number PRJEB14134) and analyzed using IM-Fusion (version 0.3.1) to detect SB insertion sites in each sample, as well as subsequently identify CTGs and their effects. For this analysis, we created an augmented reference genome using the mm10 version of the mouse genome and the T2/Onc transposon sequence (18). STAR (version 2.5.2b) was used to perform the alignment, StringTie (version 1.3.0) was used for transcript assembly and featureCounts (version 1.5.0-post3) was used to generate expression counts. Reference genome features were downloaded from Ensembl 76.

ILC dataset (ShearSplink)

DNA-sequencing data prepared using the ShearSplink protocol (19) for the same tumors as the ILC RNA-seq dataset were downloaded from Figshare (DOI: 10.6084/m9.figshare.4765111) and analyzed using the ShearSplink pipeline in PyIM (version 0.2.0, https://github.com/jrderuiter/pyim) to identify SB insertion sites. In essence, this pipeline first extracts the genomic sequence from each read by removing the transposon and linker sequences. The genomic sequences are then aligned to the reference genome using Bowtie2 (version 2.2.8) (20), and the resulting alignments are grouped by sample and position to identify the location of insertion sites. Finally, identified insertions are assigned to their predicted target genes using the windows outlined in KC-RBM (8). To reduce the number of identified target genes for each insertion, we selected a single target gene for each insertion by picking the closest gene identified by KC-RBM. In cases where this was not possible, e.g. due to overlapping genes, we retained multiple target genes.

B-ALL dataset

Insertion data and paired-end RNA-seq data from 20 B-cell acute lymphoblastic leukemias (B-ALLs) were obtained from a previously published dataset of an SB screen performed in a mouse model of B-ALL (21). The RNA-seq data were downloaded from ENA in fastq format (study ID: ERP005291, ArrayExpress ID: E-ERAD-264). The insertion data were obtained from the Supplementary Materials of the publication or through personal communication. Control samples were omitted from the performed analyses.

Methods—ILC dataset

Gene-transposon fusion validation in RNA

Tumor RNA was extracted as previously described (22) and 300 ng was converted to complementary DNA (cDNA) with a Moloney murine leukemia virus reverse transcriptase using random hexamer primers according to manufacturer's protocol (Tetro cDNA synthesis kit, Bioline). Gene-transposon fusions were detected by standard polymerase chain reaction (PCR) with an annealing temperature of 58°C. The following primer sequences were used:

SA reverse 5΄-TTCCCGCGAATCCATCTTTC-3΄
En2SA reverse 5΄-GTCGACTGCAGAATTCGATGA-3΄
SD forward 5΄-GCCCATCAAGCTTGCTACTA-3΄
Myh9 forward 5΄-CTGTGTGGTCATCAACCCTTAT-3΄
Trp53bp2 reverse 5΄-ATCGCTCTGGTTTCGATAAGG-3΄
Ctnnd1 forward 1 5΄-GCTACATGCCTTGACAGATGA-3΄
Ctnnd1 forward 2 5΄-GAGAGGAGAAAGGCAGGAAAG-3΄
Hprt forward 5΄-CTGGTGAAAAGGACCTCTCG-3΄
Hprt reverse 5΄-TGAAGTACTCATTATAGTCAAGGGCA-3΄

Effects of insertions

To study the effects of individual SB insertions on expression, we visualized single insertions together with the expression of each of their targets in the affected sample and tested for differential expression over the insertion site in the sample. The visualization was generated using the Python package geneviz, which is freely available on GitHub (https://github.com/jrderuiter/geneviz). Gene annotations for the plot were obtained from Ensembl 76. Expression profiles were generated from the RNA-seq alignment of the sample using pysam (23), by counting the number of reads overlapping each nucleotide position in the plotted range. Junction strengths were derived from the junction files (SJ.out.tab) generated by STAR during the alignment. To test for differential expression, we used the single-sample exon-level test implemented by IM-Fusion.

Effects on CTGs

To identify biases in SA/SD insertions for the various CTGs, we counted the number of times each transposon feature (SD, SA, En2SA) was involved in the insertions affecting each CTG. The results were visualized to show the different distributions across CTGs. To test for differential expression, we applied IM-Fusion's group-wise DE test for each CTG.

Insertion comparison

To compare the overlap in insertions between IM-Fusion and ShearSplink, we matched two insertions between IM-Fusion and ShearSplink under the following conditions: both insertions were identified in the same sample, had the same predicted target gene, and their relative locations and orientations were compatible. The latter restriction was used to ensure that a ShearSplink insertion was in the correct location to generate the fusion observed by IM-Fusion in the RNA-seq data. Insertions matched between the two approaches were marked as ‘Shared’; unmatched insertions were designated ‘IM-Fusion only’ or ‘ShearSplink only’ depending on the approach that identified them.

To identify features distinguishing shared insertions from insertions that were unique to either approach, we compared the set of shared insertions to the IM-Fusion- and ShearSplink-specific insertions. For both comparisons (Shared/ShearSplink and Shared/IM-Fusion), we first defined a set of features that could potentially affect insertion detection by either method. We then trained a logistic regression model on these features to predict whether an insertion was matched or unique to the corresponding approach. This model was used to determine the significance of each feature. Finally, we visualized the distributions of significant features for both the matched/unmatched insertions using kernel density estimation (KDE) plots for interpretation.

Candidate gene comparison

To compare the candidate genes identified by ShearSplink and IM-Fusion, we first identified significant common insertion sites (CISs) and differentially expressed CTGs (DE CTGs) separately using the respective approaches. We then visualized the resulting gene rankings, linking genes that were identified as candidate genes by both approaches. Candidate genes were colored to distinguish whether they (i) were shared between both approaches (black), (ii) were identified to have insertions but were not selected as a CTG/CIS by the other approach (blue), (iii) were selected as a CTG/CIS but were not differentially expressed (green), (iv) were not selected as a CTG/CIS and were not differentially expressed (purple) or (v) were omitted entirely by the other approach (red).

ShearSplink insertion validation in DNA

Tumor DNA was isolated using a phenol–chloroform extraction. Transposon insertions were detected in 500 ng DNA by standard PCR with an annealing temperature of 58°C. The following primer sequences were used:

En2SA forward 5΄-GCTTGTGGAAGGCTACTCGAA-3΄
Nf1 11KOU029-R5.INS_12 reverse 5΄-CTCACGTGAAGTGGGAAAGACA-3΄
Nf1 12SKA029-R3.INS_15 reverse 5΄-GGCGCACACCTTTAATCCTAAC-3΄
Nf1 12SKA033-R3.INS_10 reverse 5΄-TAGCTCCCTGTGTGTTCCTTTG-3΄
Nf1 12SKA068-L3.INS_15 reverse 5΄-AAGGGTGAAGCAGGAGGATTAC-3΄
Nf1 12SKA092-L2.INS_10 reverse 5΄-ACGGAGAAGGAGAGAGGGAAA-3΄
Nf1 12SKA104-R3.INS_1 reverse 5΄-CCAACATCCCTGTTGTGTGTATG-3΄
Hprt forward 5΄-CTGGTGAAAAGGACCTCTCG-3΄
Hprt reverse 5΄-TGAAGTACTCATTATAGTCAAGGGCA-3΄

Endogenous fusion identification

Endogenous gene fusions were identified by applying STAR-Fusion (version 0.5.4) (12) to the raw RNA-seq data (fastq files) using recommended settings. The resulting lists of fusions were combined across samples and filtered for fusions with breakpoints at known splice junctions, as these are most likely to reflect proper gene fusions. The filtered fusions were prioritized by grouping fusions by the genes involved and ranking these gene pairs by their recurrence across samples. The fusions involving Fgfr2 were validated using the same approach as for the gene-transposon fusions, with the following additional primers:
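The prioritization step (grouping by gene pair, then ranking by recurrence) can be sketched as below. The input tuples are made-up toy data with hypothetical sample identifiers, and the function name is an assumption; the sketch does not parse real STAR-Fusion output.

```python
from collections import defaultdict


def rank_gene_pairs(fusions):
    """Rank gene pairs by the number of distinct samples carrying them.

    fusions is an iterable of (sample, left_gene, right_gene) tuples that
    have already been filtered for breakpoints at known splice junctions.
    Returns (pair, n_samples) tuples, most recurrent first.
    """
    samples_per_pair = defaultdict(set)
    for sample, left_gene, right_gene in fusions:
        samples_per_pair[(left_gene, right_gene)].add(sample)
    return sorted(
        ((pair, len(samples)) for pair, samples in samples_per_pair.items()),
        key=lambda item: (-item[1], item[0]),  # recurrence, then name
    )


# Toy input only; sample names and fusion calls are invented for illustration.
toy_fusions = [
    ("s1", "Fgfr2", "Myh9"),
    ("s1", "Fgfr2", "Myh9"),   # duplicate call in the same sample
    ("s2", "Fgfr2", "Myh9"),
    ("s3", "Fgfr2", "Kif16b"),
]
ranking = rank_gene_pairs(toy_fusions)
```

Counting distinct samples rather than fusion calls keeps a single sample with many supporting calls from dominating the ranking.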

Fgfr2 forward 5΄-TGGCCAGGGATATCAACAAC-3΄
Kif16b reverse 5΄-CTTTCCTGAGGGCTAGAGTTTG-3΄
Myh9 reverse 5΄-GATAGCGCCTTTGTCTCCTT-3΄
Tbc1d1 reverse 5΄-CCAGGCTGTGAGAAGGATTT-3΄

Methods—B-ALL dataset

Candidate gene comparison

To compare IM-Fusion with the DNA-seq results from the original publication, we applied IM-Fusion to the paired-end RNA-seq data and compared the identified DE CTGs with the published candidate genes (DE CISs). To avoid selecting CTGs with very low support in this relatively deeply sequenced dataset (as these are more likely to represent false positives), we filtered insertions with fewer than 10 supporting reads or mates from the CTG analysis.

Effect of sequencing depth

The B-ALL samples were downsampled to depths of 15, 30, 50 and 70 million reads using Seqtk (https://github.com/lh3/seqtk). IM-Fusion was applied to each of these downsampled datasets to identify DE CTGs, using the same settings as were used for the full dataset. The number of insertions and DE CTGs were compared between the different depths, as well as the overlap in DE CTGs between depths.

Single- versus paired-end comparison

A single-end version of the dataset was simulated by supplying only the first mate of each read pair as input to IM-Fusion. The results from the paired-end and single-end analyses were compared by juxtaposing DE CTGs and insertions in these genes between the two analyses.

Fusion Finder comparison

We created an augmented version of the mm10 reference genome containing the T2/Onc transposon sequence in the same manner as described by Temiz et al. (9). This reference was modified to mask the En2 and Foxf2 gene loci, which contain sequences homologous to parts of the transposon sequence. Tophat2 (version 2.1.0) (24) was used to align reads to this augmented reference, after which the Fusion Finder script (version 3.1) was used to identify insertions in each sample. The results were compared with IM-Fusion's DE CTGs and published candidate genes by analyzing the overlap between the identified insertions and the CTGs/CISs. To determine why certain CTGs/candidates were not identified by Fusion Finder, we visualized the distribution of the used transposon features and compared the alignments of reads supporting insertions unique to IM-Fusion between the Tophat2 and STAR alignments using pysam (23).

Endogenous fusion identification

Endogenous gene fusions were identified in the same manner as for the ILC dataset.

RESULTS

Identifying insertion sites from gene-transposon fusions

Transposon insertions can affect the expression of nearby genes, potentially leading to the activation of oncogenes or the inactivation of tumor suppressors. For example, consider the T2/Onc transposon (Figure 1A) that is used in this work. When integrated in the vicinity of a gene, this transposon can induce (over)expression of nearby genes by initiating transcription from its promoter sequence (MSCV) and then splicing into the gene using the SD sequence (Figure 1B). Alternatively, the transposon can truncate transcripts using either of its SA sites (SA/En2SA) and their corresponding polyA (pA) sites (Figure 1C). Depending on the gene and the location of the transposon, these truncations can inactivate the gene by resulting in an unstable transcript or inactive protein, or activate the gene by removing inhibitory protein domains.

Figure 1. Overview of the *T2/Onc* transposon and its effects on gene expression. (A) The transposon sequence contains two splice acceptor sequences (SA and En2SA) with corresponding polyA sequences (pA), and a single promoter sequence (MSCV) combined with a splice donor (SD) sequence. (B) Sense insertions of the transposon either within or upstream of a gene may drive overexpression of the downstream gene sequence by initiating expression from the transposon's promoter and SD sequence. (C) Insertions within genes (in either orientation) may truncate gene transcripts by splicing to either of the SA sites (SA or En2SA). The resulting truncations may inactivate tumor suppressor genes, but can also activate oncogenes by removing inhibitory domains from the resulting protein.

In both of these cases, part of the transposon sequence is incorporated into the resulting mRNA transcript(s) via splicing between the affected gene and the transposon. As such, these transcripts effectively represent fusions between the transposon sequence and the affected gene. We therefore hypothesized that it should be possible to detect transposon insertion sites from RNA-sequencing by identifying gene-transposon fusions using existing gene fusion detection tools. By further analyzing the breakpoints of each fusion, we could determine exactly which gene and which feature of the transposon are involved in the fusion, and use this information to predict the location of the corresponding insertion site.