ChimeRScope: a novel alignment-free algorithm for fusion transcript prediction using paired-end RNA-Seq data

You Li; Tayla B Heavican; Neetha N Vellichirammal; Javeed Iqbal; Chittibabu Guda

doi:10.1093/nar/gkx315

. 2017 May 2;45(13):e120. doi: 10.1093/nar/gkx315

ChimeRScope: a novel alignment-free algorithm for fusion transcript prediction using paired-end RNA-Seq data

You Li ^1,^2,³, Tayla B Heavican ⁴, Neetha N Vellichirammal ¹, Javeed Iqbal ⁴, Chittibabu Guda ^1,^5,^*

PMCID: PMC5737728 PMID: 28472320

Abstract

The RNA-Seq technology has revolutionized transcriptome characterization not only by accurately quantifying gene expression, but also by the identification of novel transcripts like chimeric fusion transcripts. The ‘fusion’ or ‘chimeric’ transcripts have improved the diagnosis and prognosis of several tumors, and have led to the development of novel therapeutic regimen. The fusion transcript detection is currently accomplished by several software packages, primarily relying on sequence alignment algorithms. The alignment of sequencing reads from fusion transcript loci in cancer genomes can be highly challenging due to the incorrect mapping induced by genomic alterations, thereby limiting the performance of alignment-based fusion transcript detection methods. Here, we developed a novel alignment-free method, ChimeRScope that accurately predicts fusion transcripts based on the gene fingerprint (as k-mers) profiles of the RNA-Seq paired-end reads. Results on published datasets and in-house cancer cell line datasets followed by experimental validations demonstrate that ChimeRScope consistently outperforms other popular methods irrespective of the read lengths and sequencing depth. More importantly, results on our in-house datasets show that ChimeRScope is a better tool that is capable of identifying novel fusion transcripts with potential oncogenic functions. ChimeRScope is accessible as a standalone software at (https://github.com/ChimeRScope/ChimeRScope/wiki) or via the Galaxy web-interface at (https://galaxy.unmc.edu/).

INTRODUCTION

A major characteristic of many neoplasias is the presence of chromosomal re-arrangements often leading to higher chances of abnormal fusion of two separate genes (1). These fusion products are considered to be present at DNA level; and are therefore considered unique to cancer. However deep-sequencing technology revealed many more fusion transcripts, without detectable rearrangement at the DNA level and are partly generated via intergenic splicing (2). Over the last decade, an increasing number of fusion transcripts have been identified in major malignancies (3–5). A fusion transcript can contribute to oncogenicity by promoting the expression of a proto-oncogene (6); by deregulating a tumor suppressor gene (7); or by modifying the original structure/function of a protein to form a novel abnormal protein that stimulates tumorigenesis (8). Since oncogenic chimeric transcripts are cancer-specific (9), they offer a unique opportunity to be identified as cancer biomarkers for diagnostic (10,11), prognostic (12) and therapeutic (13,14) purposes to provide targeted cancer treatment. For instance, the first reported fusion gene, known as BCR-ABL1, was discovered in chronic myeloid leukemia (CML) in the early 1970s (15). This fusion gene was found to be associated with the Philadelphia chromosome (16), a recurrent translocation event found in more than 90% of CML patients. BCR-ABL1 encodes a constantly active tyrosine kinase that promotes leukemogenesis (17). Imatinib (18), one of the first drugs that target fusion genes, significantly improves the overall survival rate of CML patients (19) by inhibiting the tyrosine kinase activity of the BCR-ABL1 fusion protein.

High-throughput transcriptome sequencing, or RNA-Seq, has been widely used for fusion transcript detection, where the whole transcriptome from tumor cells, along with the chimeric fusion transcripts, are extracted and sequenced. To date, more than 20 tools have been developed for fusion transcript detection using RNA-Seq data, majority of which are alignment-based (20). Alignment-based methods in general report the optimal alignment results between short reads and the reference transcript sequences, and only achieve better performance when the compared sequences show high levels of homology (21). However, due to the highly perturbed nature of cancer genomes, aligning short reads generated from cancer genomes against the normal reference genome limits these methods from achieving good alignment. This is especially true for the reads derived from complex rearranged regions where fusion events often occur, thereby resulting in low prediction accuracies for these alignment-based methods (20). Moreover, identification of fusion transcripts by the majority of alignment-based methods depends heavily on the number of detected fusion reads. This lead to a bias toward identifying fusion transcripts with moderate to high expression (20) and could potentially leave novel, low expressed fusions undetected.

In order to address issues mentioned above, we developed a novel alignment-free method named ChimeRScope. Alignment-free approaches generally employ a broad collection of methods, including those based on k-mer frequencies or substrings, information theory, graphical representation or sequence clustering. ChimeRScope, interpreted as Chimeric RNA Scope or Chi(k)-meR Scope, predicts fusion transcripts by assessing the gene fingerprint sequences (in the form of k-mers) from RNA-Seq paired-end reads. In this method, reads containing two sets of gene fingerprints from two different genes will be scored based on the quality and the quantity of the fingerprint sequences. Such reads will be marked as fusion event supporting reads (FESRs) for downstream analysis. Using this approach, ChimeRScope eliminates the issues associated with using poorly aligned reads when predicting fusion candidates, thereby avoiding the issues often encountered with alignment-based methods. While the scoring algorithm uses an alignment-free approach, ChimeRScope can take advantage of the discordant reads from the pre-aligned datasets by third-party tools such as TopHat, as it only needs the discordant reads as input for fusion transcript prediction. Also, to identify the exact fusion junctions of the predicted fusion transcripts, toward the end, ChimeRScope also uses a targeted alignment step to visualize the fusion transcripts by aligning the identified FESRs against corresponding fusion partners. The final output from the ChimeRScope pipeline comprises of a list of predicted fusion transcripts with confidence scores and detailed information of the fusion events, fusion orientations and predicted fusion junction sequences, presented as text files and vector-based images.

ChimeRScope, implemented in JAVA, is a platform-independent software that can be installed and configured with minimal effort. It can be run as a command-line application, or integrated into online tools such as Galaxy server (22) for accessing the user-friendly Graphical User Interfaces. For demonstration purpose, we installed ChimeRScope on our local Galaxy server. The ChimeRScope suite, manuals and instructions (including Galaxy server configuration files for ChimeRScope) are available for downloading at ChimeRScope wiki site.

MATERIALS AND METHODS

ChimeRScope method overview

ChimeRScope is a novel alignment-free method that identifies fusion transcript candidates based on k-mer frequencies. It uses each k-mer as an independent unit and a potential gene fingerprint assigned with a weightage score that is negatively correlated with its frequency. ChimeRScope stores all k-mers in the transcriptome into a huge hash table (Gene-Fingerprint library, GF-library or k-mer library; Figure 1 A and B). Each hash key in the table represents a unique k-mer with the corresponding hash value representing the list of transcripts that contain the k-mer. When determining the origins of discordant reads, k-mers with low frequencies across different transcripts are considered as valuable gene fingerprints because they are more discriminative in the scoring function. Using this idea, a paired-end read will be classified as an FESR by ChimeRScope if this read embeds two sets of fingerprint sequences from two different genes with high confidence, which suggests a potential fusion event between these two genes (Figure 1 C–F). ChimeRScope summarizes the confidence scores of all FESRs and outputs a list of high-quality fusion transcripts after processing the entire discordant reads. To further investigate the reported fusion candidates, we also developed a module that performs targeted alignment between FESRs and the associated fusion transcript partners. This module outputs detailed information on fusion partners (e.g. chromosomal locations, orientations, fusion junction sequences and coordinates), all of which aids in downstream experimental validation and functional analysis. This information is also transformed into vector-based (i.e. Scalable Vector Graphics, or SVG) images to understand the fusion events better.

Figure 1. — ChimeRScope strategy for identifying FESRs: an example. A k-mer library is created by first, (A) generating all the k-mer profiles for all the genes. Next, we compare all k-mer profiles so that for each possible k-mer, the list of genes that contains that k-mer can be quickly identified. (B) A snapshot of a hypothetical example of the k-mer library. (C) The Circos map on the right illustrates an example of how ChimeRScope determine an FESR. A discordant paired-end read (100 bp × 2) that fails the stringent alignment against the reference genome is plotted in a circular layout with each nucleobase type represented by a unique color. (D) Four different variations of each k-mer in the read (e.g. highlighted region shows the 11th 17-mer for read 1 from base 11 to 27) will be created and searched against the k-mer library in order to obtain (E) a list of gene IDs that uses the corresponding k-mer as fingerprint. Each block represents a k-mer and each color here represents a unique gene ID. For example, four genes (G1: red, G2: green, G3: yellow and G4: orange) are related to the 11th 17-mer (from the 11th nucleotide to 27th nucleotide, as highlighted in gray region) and two genes (G1 and G4) are associated with the 29th 17-mer (highlighted in light yellow). (F) A complete graph is drawn for all eight matched genes. Each vertex in the complete graph represents a unique gene with the size of the vertex proportional to the overall fingerprint score for that gene. The edge value between two genes is defined by the distance (denoted as d) of two closest fingerprints of the gene pair (only listed several values). Gene pairs with small distance values tend to be false positives due to the similar sequences (See ‘Discussion’ and ‘Materials and Methods’ sections). If we consider only those gene pairs with an overlap of <5 bp (or with the distance more than 17 − 5 = 12) are valid fusion candidates, this read will be classified as a FESR that supports the fusion between G1 and G6 because G1 and G6 are two of the largest vertices (suggesting the most possible origins of the read based on the fingerprint sequences) with the distance value >12.

We designed the ChimeRScope algorithm into four distinct modules (Figure 2). The first module, ChimeRScope Builder, constructs a GF-library that serves as a reference library for searching against the k-mers from the sample reads. This is a one-time step that is carried out prior to processing the sample reads for each reference transcriptome. Next, the ChimeRScope Scanner module identifies an FESR based on the k-mer content of the read using a complete graph network model. This is the most memory and compute-intensive step because of the enormous number of search queries against the GF-library. The third module, ChimeRScope Sweeper, summarizes all the FESRs for each fusion candidate and predicts high quality fusion transcripts. The last module, ChimeRScore Examiner, performs targeted alignment of the FESRs against the reference transcripts of the fusion partners for better visualization of the fusion event to define the fusion junctions. ChimeRScope can use the total raw reads or the unmapped reads from previous alignments as input. However, as it only needs the unmapped reads for fusion transcript prediction, using the unaligned raw reads adds extensive computational burden to the program resulting in slow processing. Hence, we recommend the use of discordantly-aligned or unmapped reads as input for this program for expedited processing. Since the alignment step is a part of the standard RNA-Seq data analysis protocol, the input reads for ChimeRScope Scanner can be extracted directly from the sequence alignment files generated from these standard protocols such as TopHat. Therefore, ChimeRScope pipeline for fusion transcript detection can be integrated with popular RNA-Seq data analysis pipelines like the Tuxedo pipeline (23) for simultaneous analysis of the RNA-Seq datasets for both gene expression and fusion transcript detection.

ChimeRScope Builder

To generate the gene fingerprint library, we extracted all mRNA sequences (Figure 1A) by combining RefSeq human genome and gene model (GRCh38/hg38) using gtf_to_fasta from Tophat (24) package. For each mRNA sequence, ChimeRScope Builder poses a sliding window of size k, in order to generate the complete k-mer profile for each mRNA sequence. ChimeRScope Builder then compares all the k-mers profiles to obtain a huge hash table that uses each existing k-mer as a hash key, and the list of genes that contains that k-mer as the hash value (Figure 1B). All the hash keys and hash values were coded into binaries to reduce memory usage.

ChimeRScope Scanner

ChimeRScope Scanner takes the discordant paired-end reads (in forward-reverse orientation) in fastq format as inputs. We used the TopHat aligner from the Tuxedo pipeline as the example here since it is one of the most popular pipelines for RNA-Seq data analysis. However, other aligners (e.g. STAR (25) and bowtie (26)) that work well on transcriptome sequencing data can also be used with ChimeRScope for fusion transcript prediction. Samtools (27) was used to combine unmapped.bam (TopHat output file) with all the reads that are not properly mapped as read pairs (27) (SAM flag not equal to 2) from the accepted_hits.bam (TopHat output file). Next, all the discordant reads were extracted from the merged bam file using bamToFastq application from BEDTools (28).

For each discordant paired-end read of length 2L (length L for each end), 2(L − k + 1) k-mers were generated (Figure 1C) and four different k-mer variations (original, reverse, complementary and reverse-complementary k-mer) for each k-mer were created and searched against the k-mer library (Figure 1D and E) in order to cover all possible orientations of a fusion event. Next, for each k-mer, a list of transcripts that are associated with the k-mer and its variations were obtained. ChimeRScope Scanner then assigns a normalized weightage score that is negatively correlated with the size of that transcript list. Additionally, to avoid computational costs on common k-mers like poly-A sequence, k-mers associated with more than 100 transcripts was given zero weightage. The normalized weightage score for each k-mer is calculated using Inline graphic , where w is the weightage score and N is the size of the gene list. After processing the k-mers, ChimeRScope Scanner constructs a complete graph (in the form of 2D matrix) where each vertex represents a unique gene that existed in the k-mer profile of that read (total of n genes). The magnitude of the vertex Inline graphic is calculated by summarizing all the weightage scores of the associated k-mers. The edge value for each gene pair in the graph is calculated by the distance of the two closest gene fingerprints of that pair (Figure 1F). A small distance between two genes indicates that these two genes share high similar gene fingerprint profiles (e.g. distance = 1 means these two genes have k-1 number of nucleotides overlap near the fusion junction) in this read and are more likely to be false positives introduced by point mutations or sequencing errors. If there are several vertices Inline graphic connected with the biggest vertex/vertices with the edge value more than the distance cut-off c, this read will be classified as a FESR that supports the fusion event between and , which should be two of the most evident origins for a FESR (Figure 1F). The weightage score for each FESR is calculated using Inline graphic , where K_f and K_r are the maximum number of k-mers each end can have for the read. A FESR will achieve its largest weightage score only when , which shows a perfect fusion transcript event where the k-mer profile for each end of the read is the unique gene fingerprint set for each gene in the pair. Here, FESRs with very low scores (default cut-off is 0.1) are filtered out to prevent false positives.

ChimeRScope Sweeper

The final confidence score for each fusion transcript is calculated based on the overall distribution of the related FESRs’ weightage scores using an iteration function (Supplementary File 1). This function takes the FESR with the smallest weightage score from the FESR stack and updated it to the confidence score. Therefore, the updated confidence score is always larger than the current FESR weightage score. This guarantees that the final score is always higher than the score of the best FESR. Therefore, fusion transcripts with only a few FESRs can have high confidence scores if they have high quality FESRs. On the other hand, the confidence score accumulates at a slower rate when processing a subset of FESRs with the same level of weightage scores. Using this strategy, it prevents false positives from having high confidence scores when only large amounts of low quality FESRs were present. The final confidence score for a fusion transcript is ranges from 0 to 1. Any fusion transcript with a score of more than 0.5 is considered as a true fusion transcript candidate. The cut-off score indicates that a fusion transcript will be classified as true when it receives, for example, 70% support in average from each fusion partner (0.7*0.7 = 0.49).

Before ChimeRScope reports the final list of fusion transcripts, several filters were applied to remove false positives. These include the homology, sequence similarity, occurrences and genomic-distance filters. ChimeRScope can introduce false positives if the reported fusion transcript pair shares homologous fingerprint profiles (Figure 3A). Therefore, fusion pairs that comes from the same gene family (29) are filtered out. Moreover, ChimeRScope also uses Smith–Waterman algorithm (30) to further exclude those fusion transcripts with similar sequences within the fusion pairs. Adapter contaminations can also introduce false positives (Figure 3B). Although adapter sequences in RNA-Seq are designed not to match any gene sequences, the reversed adapter sequences can be part of the fingerprints for a subset of genes. Reads with adapter contamination sequenced from any expressed genes would exhibit a fusion-like pattern between the expressed gene and genes from that subset. To remove such false positives, ChimeRScope calculates the number of occurrences where one gene has been reported to be fused with other genes among the same dataset. Fusion transcripts involved with such genes with high counts (The default cut-off value is 200. The selection of this value is based on the results from the tested simulated datasets) should be false positives. This is because if the adapter sequences are included in the paired-end reads, they can pair with any associated genes, creating fusion-like patterns between the adapter sequences and these genes. Furthermore, the genomic-distance filter is also used to filter out read-throughs, a class of chimeric transcripts with true fusion-like patterns but are not biologically significant for tumorigenesis (31). At last, for fusion transcript analysis on real RNA-Seq datasets, an alignment search for the fusion sequences against non-coding genes (NCBI BLAST against NR) is recommended for filtering out potential false positives because the regular k-mer library only includes coding mRNAs. This step will filter out false positives caused by non-coding RNAs with fusion-like patterns.

Figure 3. — False positive fusion reads caused by genes with similar sequences and adapter contaminations. (A) An example of the false positives caused by similar sequence region. The reference sequences of G1 and G2 (regions were filled with different patterns) share a similar sequence (illustrated using the same pattern) near the 5΄ region. There are only a few nucleotides that are different (highlighted in red in G2). A paired-end read sequenced from the subject G1 cDNA contains a mutation that changes the fingerprint sequences in the region identical to the sequence of G2. This paired-end read, originated only from G1, will now be classified as a fusion read because it supports the fusion between G1 and G2 due to the above mentioned specific mutation contained in this read. We filtered out this type of false positives by screening for such similar regions in the reported fusion partners. (B) An example of the false positives caused by adapter sequences. Genes and adapters (including the reversed adapter sequence) were filled using different patterns to represent the differences of these sequences. Paired-end reads can have adapter contaminations when the template is smaller than the length of the read or due to other technical issues. ChimeRScope checks four different variations of the k-mers, including the reversed sequence. Although adapter sequences are designed not to significantly match any of the gene sequences, the reverse of the adapter sequences can match to certain genes (G4 matches the reverse adapter sequence in this example). All the paired-end reads sequenced from a cDNA library with adapter contaminations could be classified as fusion reads by ChimeRScope. We filter out this class of false positives by removing fusion transcripts involving partners with high counts because this kind of fusion partners tend to pair with a large number of genes.

ChimeRScope examiner

The Examiner module is designed as a post-analysis module that outputs the models of the fusion transcripts as both text files and SVG images. It performs target alignment using Smith–Waterman algorithm between each fusion transcript reported from the previous module and the corresponding FESRs. ChimeRScope Examiner calculates the estimated fusion junction from each FESR and reports the consensus fusion coordinates for the fusion transcript from the alignment results. In addition, ChimeRScope Examiner also reports the orientations and the strands of the fusion partners, along with the resolved fusion sequences near the fusion junction (+/− 100 bp to the fusion junction). These results were also automatically transformed into vector objects for improved interpretation of the fusion events (Figure 4). The SVG figure is automatically generated for each fusion and can be converted to publication-quality images without changing the aspect ratio. Moreover, most of the elements in the SVG image have their unique name attributes (e.g. transcript ID, exon number, paired-end read name). The name attributes are incorporated at the bottom of the image when users perform mouse-over action on these elements for enhanced understanding of the fusion event.

Figure 4. — A screenshot of the graphical output for fusion transcript *DAB2&FRYL*. Four different tracks are plotted to illustrate the fusion event. The first track of the SVG image lists the name of the fusion partners with their original orientations and the second track shows the transcripts that might be involved in the fusion event. Each rectangle in the second track represents a coding exon with the size of the box proportional to that exon. The third track, along with the highlighted region between track two and track three, show the amplified region near the fusion junction from each fusion partner. The last track of the image plots a maximum of 71 (determined by the dimension of the forth track) paired-end reads to show the alignment results of selected FESRs against the fusion junction. This figure shows that *DAB2* are fused with *FRYL* near the second exon (5΄ end) of *DAB2* and the second exon (3΄ end) of *FRYL*.

Shannon Index for k-mer size optimization

Shannon Index, also known as Shannon's entropy, was originally described in ecology (32) to quantify the uncertainty in predicting the identity of a species that is taken at random from the dataset. Similarly, k-mer libraries with larger Shannon index values indicate higher uncertainties of predicting the correct origins of randomly sequenced reads (shotgun sequencing). The Shannon index of a k-mer library is calculated using Inline graphic . Here, R is a collection of k-mer classes that have been identified in less than 100 transcripts in a library and p_i is the probability of each k-mer class in R. Those k-mers associated with 100 transcripts or more are not included in the k-mer library due to their low specificities; thus they were excluded from the Shannon index calculation.

Local Galaxy server implementation

Apart from the standalone ChimeRScope package, we also developed a web-based application of ChimeRScope using Galaxy server (22). The Galaxy server instance was installed on our local data analysis server following the instructions provided by Galaxy group (https://wiki.galaxyproject.org/). We changed our local Galaxy server into a production data analysis server using the recommended configurations. We also implemented Galaxy-specific options for ChimeRScope for enhanced compatibility with Galaxy server data management system. All the wrapper scripts used for ChimeRScope applications were implemented in XML (eXtensible Markup Language).

RNA-Seq datasets and data analysis procedures

Simulated datasets

We used two different simulated datasets to test the prediction performance of ChimeRScope against other popular methods. The first simulated datasets, namely 50_pos_set, was included in the FusionMap (33) package. It consists of 50 simulated fusion transcripts with the fusion reads ranging from only 2 read pairs to 1587 read pairs. In total, it contains 57 209 75 bp paired-end reads, with only 4300 reads that spans the fusion junctions. Another simulated dataset we used contains 15 different combinatorial subsets with 3 different read lengths (50, 75 and 100 bp) and 5 different coverage levels (5×, 20×, 50×, 100× and 200×). This dataset, namely comp_sim_set, was obtained from GC. Tseng's group (34). Those 15 datasets contain the same 150 artificial fusion events, all of which were simulated using sequences from Ensembl annotation.

Published RNA-Seq datasets

To evaluate the performance of ChimeRScope in real tumor samples, we tested ChimeRScope on transcriptome data from four breast cancer cell lines (BT-474, MCF-7, KPL-4, skBR-3) (35) with 26 experimentally validated fusion transcripts reported by the original study (The fusion CSE1L-ENSG00000236127 was removed from the list due to the deprecation of ENSG00000236127). The breast cancer cell line datasets were downloaded from NCBI Sequence Read Archive (SRA. http://www.ncbi.nlm.nih.gov/sra) with accession number SRP003186. Additionally, we also tested ChimeRScope on a subset of 272 patient samples from a large glioma study (36) (accession number SRP027383). This validation list consists of 13 RNA-Seq datasets with 31 validated fusion transcripts and it was also used by the JAFFA study (37).

In-house natural killer (NK) cell lines datasets

To further demonstrate the utility of ChimeRScope in vivo, we examined three natural killer (NK) cell lines for experimental validation. Here, four different NK cell line datasets (KHYG1, NKYS, NK92-PMIG, NK92-PRDM1) were downloaded from NCBI SRA database (SRP049695), which can be divided into three different cell lines (KHYG1, NKYS and NK92). The original NK92 cell line samples used for sequencing were transduced with either PMIG, a control vector, or a vector to knock-down PRDM1, a known tumor suppressor. We only used the normal/non-transduced NK92 RNA in the experimental validation step for those fusion transcripts predicted from NK92 samples since the original vector treated NK92 RNA used in the transcriptome sequencing was not available.

Data analysis

We installed ChimeRScope, SOAPfuse (V1.26), FusionCatcher (version 0.99.4d beta), JAFFA (version 1.06), EricScript (version 0.5.5) and MapSplice (version 2.2.1) on a local Linux-based data analysis server (160 cores and 512 GB memory). We processed all simulated datasets and cell lines RNA-Seq datasets mostly using the default settings. Results on NK cell lines were further filtered using awk command. All the settings and the reference database versions can be found in Supplementary File 2.

Experimental validation

The NK cell lines used for fusion detection were KHYG1, NKYS and NK92, which were cultured under standard conditions and total RNA was extracted as per standard protocol. The RNA extracted from these cells lines was reverse transcribed into cDNA using ProtoScript First Strand cDNA Synthesis Kit (New England BioLabs) as per manufacturer's protocol, followed by polymerase chain reaction (PCR) using the oligonucleotide primers designed using Vector NTI Advance (version 11.5.4) and Primer3Plus (38). NCBI primer blast (39) was used to check the off-target amplicons to ensure the binding site specificity of the primers. PCR was performed for each primer set using either the standard Taq PCR kit (New England BioLabs) or the One Taq PCR kit with GC buffer (New England BioLabs) depending on the GC content of the target sequences. All PCR products were analyzed by agarose gel electrophoresis. The PCR products of the approximate expected amplicon size were extracted from the agarose gel using the GeneJET Gel Extraction Kit (ThermoScientific). All the amplicons extracted from the agarose gel were Sanger sequenced using Applied Biosystems (ABI) 3730 DNA Analyzer as per manufacturer's protocol.

RESULTS

Optimization of k-mer size

It is expected that the choice of the k-mer size affects many factors such as the distribution of k-mer frequencies and the computational speed, while it is also restricted by factors like the read length; hence determining the optimal size of k in k-mers is crucial to achieve optimal accuracy and performance of ChimeRScope. Our choice of k-mer size for the ChimeRScope GF-library is based on three criteria. First, k should not be too small, because a small k will always yield less unique gene fingerprint sequences. Second, k should not be too large since sequence variations within a string can affect higher percentage of the k-mers when k is large (e.g. the maximum proportion of k-mers affected by a SNP is calculated by k/(L − k + 1), where L is the length of the read; larger k increases the proportion of the affected k-mers). Last, k should be an odd number because ChimeRScope tracks each k-mer using the index of the middle nucleotide; hence reversing a k-mer (see ‘Materials and Methods’ section) when k is an odd value will not alter the index. To decide the optimal k value for human reference genome GRCh38/hg38 (38 834 mRNA transcripts in RefSeq annotation), we plot the percentages of all k-mer classes (categorized by the occurrences in unique transcripts) for all k-mer libraries using the odd k values ranged from 13 to 29 (Figure 5A). We believe that GF-libraries with a higher level of discriminative k-mers (or lower Shannon Index (40)) are generally more advantageous when evaluating fingerprint sequences (41). Our results (Figure 5 and Supplementary Table S1 in Supplementary File 3) show that k = 17 is the optimal k-mer size for human reference model (GRCh38/hg38) as it is the smallest k size that gives highest levels of discriminative fingerprints (or low Shannon Index) compared with those of the GF-libraries with larger k values.

Figure 5. — k-mer percentages for nine selected k-mer libraries for GRCh38/hg38 (mRNAs only). (A) The k-mer percentages for nine selected k-mer libraries (k = 13, 15, 17, 19, 21, 23, 25, 27, 29) are plotted using 3D line chart, where each line in this chart represents a unique k-mer library. Libraries with larger k values are plotted to the further side of the figure. The x-axis lists all k-mer classes (characterized by the number of transcripts that use the k-mer as a fingerprint) and y-axis shows the corresponding percentages of the k-mer class in the certain k-mer library. For instance, ∼48% of the k-mers in the 15-mer library are unique fingerprint sequences (y = 48% when x = 1 and k = 15). For each k-mer library, the majority of the k-mer classes (∼99%) are discriminative k-mers with <10 associated genes. Overall, larger k often gives better k-mer library because it contains more discriminative k-mers (higher value toward left part of the x-axis). (B) Shannon Indices for all nine GF-libraries. GF-libraries for k = 17 or larger have similar low levels of Shannon Indices (∼1.4, highlighted in the red box). Consequently, k = 17 is the optimal k-mer size for GRCh38/hg38 because k = 17 is the smallest k-mer size that gives highest levels of discriminative k-mers (or low Shannon Index).

In total, the 17-mer GF-library for GRCh38/hg38 consists of ∼62.7 million k-mers (Supplementary Table S2 in Supplementary File 3). Out of which, 52.8% (∼33.1 million) are unique fingerprint sequences (identified in only one transcript) and 23.5% (∼14.5 million) are observed in only two different transcripts (Supplementary Table S3 in Supplementary File 3). Additionally, more than 99% of the k-mer sequences in 17-mer GF-library are associated with no more than eight transcripts indicating the significance of 17-mers across the human transcriptome. Notably, common fingerprints like poly-A sequences can drastically slow down the computational time due to their occurrences in a large number of transcripts. Therefore, we defined that k-mers associated with more than 100 transcripts (531 k-mers. See Supplementary Table S2 in Supplementary File 3) were assigned a weightage score of zero and thus excluded from the scoring step. Furthermore, we also calculated the number of the most discriminative 17-mers in the human transcriptome (Supplementary Table S4 in Supplementary File 4) and ranked all the transcripts by the numbers of the unique 17-mers (Supplementary File 4). Results have shown that 82% of all the transcripts (32001/38834, Supplementary File 4) have at least 10 unique fingerprint sequences. Supplementary Table S4 also shows that more than 99.4% (38587/38834) of the transcripts have at least one of these high-quality k-mers (k-mers found in no more than 10 transcripts). The rest of the transcripts (0.6%) can be grouped into 34 genes (Supplementary Table S4), most of which can be further classified into only four different gene families (Cancer/Testis antigen family, Ubiquitin Specific Peptidase 17-like family, GAGE cancer/testis antigen family and PRoline-Rich protein gene family). Overall, these statistics suggest that the 17-mer GF-library contains sufficient fingerprint sequences for most of the transcripts/genes.

Optimization of time and memory usage

Memory utilization is a common problem often encountered by alignment-free algorithms (42). ChimeRScope Scanner also consumes a lot of computational resources due to the huge hash table (the GF-library containing millions of k-mers) and the large search space (millions of reads). To overcome this issue, we optimized the algorithm by transforming all the k-mers into binaries, reducing the number of iterations and optimizing data structures (Supplementary methods in Supplementary File 5). These modifications led to a 6-fold improvement on the RAM usage for ChimeRScope Builder, while also reducing the run time by 60-fold compared with our first version of ChimeRScope prior to optimization (Table 1). The improved version takes only 6 GB RAM and 12 min for ChimeRScope Scanner to load the GF-library into the memory. The processing speed for ChimeRScope Scanner was enhanced by at least eight times with only 10% of the original memory cost in the improved version. This enables the use of ChimeRScope on projects with a large number of samples. For example, analysis on RNA-Seq datasets with several million discordant reads took more than 10 h using 10 threads with more than 100 GB RAM usage in the original ChimeRScope version. In comparison, the optimized version can analyze the same datasets within 2 h with the same number of threads and 20 GB RAM.

Table 1. Run time for ChimeRScope Scanner, before and after code optimization.

	Memory usage (in GigaByte)*		Time cost/speed
	Before	After	Before	After
Construct 17-mer GF-library	>100	17.4	25 h	25 min
Loading 17-mer GF-library	42	6	1 h	12 min
Parsing reads (per thread)	10	1	∼400/min	>5000/min**

Open in a new tab

* Memory usage was estimated using TotalMem()-freeMem() functions during the run time. The actual memory usage may vary.

** The processing speed for parsing reads was estimated using the 50_pos_set, simulated with different volumes (10×, 100× and 1000×).

Performance evaluation using simulated datasets

Results from a recently published paper (34) along with other studies (20,37,43–45) suggest that SOAPfuse (43), FusionCatcher (44), JAFFA (37), EricScript (46) and MapSplice (47) are generally the best-performing fusion transcript prediction methods. Therefore, we compared the prediction performance of ChimeRScope with these five methods (results reported either from this study or from previously published studies, if available). We used two different simulated datasets to evaluate the sensitivities and false discovery rates (FDRs). The first simulated dataset was published by FusionMap group with 50 positive fusion events (50_pos_set). The second set of simulated datasets was obtained from a recent study (34) that comprehensively evaluated the performance of 15 different fusion transcript prediction algorithms. These datasets, namely comp_sim_set, are comprised of 15 different combinatorial subsets with 3 different read lengths (50, 75 and 100 bp) and 5 different coverage levels (5×, 20×, 50×, 100× and 200×). Each dataset contains the same 150 artificial fusion events that were simulated based on the Ensembl (48) annotation. We analyzed these datasets and carried out head-to-head comparisons of ChimeRScope against other five methods using the same F-scores (34) for performance assessment. The complete list of fusion transcripts predicted by these six methods on simulated datasets can be found in Supplementary File 6.

The comparative performance of the selected methods on 50_pos_set is shown in Table 2. Because some of the simulated fusion transcripts can be classified as biologically insignificant (e.g. runaway transcripts between neighboring genes) by some filters implemented in ChimeRScope and in other methods like FusionCatcher, we ran our tests both with and without applying these filters on the simulated datasets. We were unable to run JAFFA locally on the 50_pos_set in this study; hence, we used the statistics reported in the original JAFFA publication (37) in Table 2 and we took the best results from the JAFFA hybrid mode output. Overall, ChimeRScope demonstrates the highest F-score (0.97) by predicting 47 true positives out of 50 without reporting any false positives (with all filters disabled). Even with all filters applied, ChimeRScope still reports a higher F-score than other tested methods. The second best-performing method on this dataset, FusionCatcher, also reports 47 true positives but with three false positives in its preliminary result (Supplementary File 6). The filters used by FusionCatcher removed these three false positives at the cost of losing 16 true positives, suggesting that FusionCatcher is very conservative in predicting fusion transcripts.

Table 2. Fusion transcript prediction on the 50_pos_set.

50_pos_set	Study	TP*	FP	Precision	Recall	FDR	F-score	Note	Best score
ChimeRScope	Current study	47	0	1.00	0.94	0.00	0.97	Without filters	0.97
	Current study	45	0	1.00	0.90	0.00	0.95	With all filters**
SOAPfuse	Current study	38	1	0.97	0.76	0.03	0.85		0.90
	PMID:26862001	41	0	1.00	0.82	0.00	0.90
JAFFA	PMID:26019724	44	0	1.00	0.88	0.00	0.94	JAFFA-hybrid	0.94
FusionCatcher	Current study	31	0	1.00	0.62	0.00	0.77	Final result	0.94
	Current study	47	3	0.94	0.94	0.06	0.94	Raw result
EricScript	Current study	14	0	1.00	0.28	0.00	0.44		0.88
	PMID:26862001	39	0	1.00	0.78	0.00	0.88
MapSplice	Current study	42	0	1.00	0.84	0.00	0.91		0.92
	PMID:26862001	43	0	1.00	0.86	0.00	0.92

Open in a new tab

*TP: true positive; FP: false positive; Recall: sensitivity; FDR: false discovery rate.

**Two fusion transcripts were filtered out by the similarity filter (PRKCA&USP49 score: 384; FKTN&SCAI score: 367). However, the F-score was still higher compared with others methods even with these two fusion transcripts removed.

ChimeRScope predicts 47 true positives (out of 50 true fusion transcripts) without reporting any false positives. Our method achieved the highest F-score with highest sensitivity (recall = 94%) and lowest FDR (0%). In comparison, JAFFA predicts in total 44 true positives and 0 false negatives, reporting the second highest F-score as FusionCatcher (47 true positives and 3 false positives in the raw result).

Next, we analyzed a more extensively simulated dataset, comp_sim_set (with 15 subsets) and compared our results (Supplementary Table S5 in Supplementary File 7) with the original study for this dataset (34). The initial results showed that ChimeRScope was only able to identify a maximum of 135 true fusion transcripts (Supplementary File 6). A closer inspection of these 15 fusion transcripts not predicted by ChimeRScope reveals that they were undetected due to the sequence differences between Ensembl and RefSeq annotation. Specifically, the 15 fusion transcripts in the comp_sim_set are simulated from the Ensembl sequences and fused at non-coding regions defined by RefSeq annotation (Supplementary File 8). Since ChimeRScope GF-library is built entirely based on RefSeq annotation, ChimeRScope could not detect any of these true fusion transcripts. Therefore, we removed these 15 fusion transcripts for all the methods from the final analysis (Table 3). We were unable to run JAFFA successfully on datasets with read lengths of 50 and 75 bps due to the computational issues caused by JAFFA assembly mode; and hence these results were not reported for these datasets in Table 3. Overall, ChimeRScope displays the best F-scores for 13 out of 15 datasets tested. Notably, ChimeRScope has significantly higher F-scores especially on the datasets with low coverage depth (5×), suggesting its unique advantage in detecting fusion transcripts with low expression levels or from datasets with low coverage depth in general. Furthermore, the variation of the F-scores reported by ChimeRScope is subtle (max = 0.957 and min = 0.905) across all the datasets with different read lengths and depth of coverage, indicating that our method is very robust and it generates consistently accurate results irrespective of the read length or sequencing coverage. On the other hand, the F-scores reported by alignment-based methods depend on longer read lengths and larger coverage depths, suggesting that the performances of the alignment-based methods are extremely sensitive to the technical variations in the Next-generation sequencing (NGS) data.

Table 3. F-scores for six major methods on comp_sim_set with 15 incompatible fusion transcripts removed.

comp_sim_set (TP = 135)	50 bp					75bp					100 bp
	5×	20×	50×	100×	200×	5×	20×	50×	100×	200×	5×	20×	50×	100×	200×
ChimeRScope	0.948	0.954	0.947	0.908	0.905	0.948	0.949	0.957	0.954	0.947	0.940	0.957	0.957	0.957	0.957
SOAPfuse	0.807	0.922	0.935	0.913	0.943	0.836	0.921	0.921	0.929	0.925	0.840	0.942	0.928	0.932	0.929
FusionCatcher	0.357	0.856	0.894	0.895	0.899	0.692	0.855	0.878	0.888	0.896	0.707	0.884	0.891	0.891	0.891
JAFFA											0.609	0.849	0.856	0.856	0.856
EricScript	0.234	0.256	0.296	0.317	0.298	0.331	0.425	0.434	0.432	0.429	0.361	0.461	0.458	0.458	0.464
MapSplice	0.235	0.514	0.548	0.579	0.594	0.381	0.525	0.586	0.591	0.598	0.383	0.541	0.568	0.583	0.577

Open in a new tab

The results were calculated from 135 effective fusion transcripts. ChimerRScope achieved the highest F-scores in 13 out of the 15 datasets. SOAPfuse alone reported higher F-scores in 100×_50 bp and 200×_50 bp, with marginal increase of 0.005 and 0.038 in F-scores, respectively.

Performance evaluation using cancer transcriptome data

To evaluate the performance of ChimeRScope in real tumor samples, we tested ChimeRScope on the transcriptome data from four breast cancer cell lines (Table 4) and 13 glioma patient samples (Table 5). We analyzed these datasets using other five methods for comparative analysis (the complete prediction results for both datasets can be found in Supplementary File 6). Results on four breast cancer cell lines have shown that ChimeRScope identifies the highest number of validated fusion transcripts (22 out of 26). Furthermore, ChimeRScope also reports higher FESRs for fusion transcripts with low coverages or expression levels (Supplementary Table S6 in Supplementary File 9), suggesting that the sensitivity of ChimeRScope to identify such fusion transcripts is higher. Results on the glioma samples have also demonstrated that ChimeRScope performed better than other methods. ChimeRScope reports an F-score of 0.210, which is better than SOAPfuse (0.178), FusionCatcher (0.152), JAFFA (0.106), MapSplice (0.080) and EricScript (0.015). Moreover, ChimeRScope is the only method that identifies FESRs for all 31 fusion transcripts (Three fusion genes, AP2A2&SBF2, CD81&SPAG6 and TPM3&ADAR were filtered out by the similarity filter and the adapter filter. See Supplementary Table S7 in Supplementary File 9). Consequently, these results also demonstrate that ChimeRScope is a better fusion transcript prediction method for cancer transcriptome datasets.

Table 4. Fusion transcript prediction on breast cancer cell lines for selected methods.

	BT474(11)		SKBR3(9)*		MCF7(3)		KPL4(3)		Total TP	Total prediction
	TP**	Total prediction	TP	Total prediction	TP	Total prediction	TP	Total prediction
ChimeRScope***	10	24	6	24	3	14	3	4	22	66
SOAPfuse	9	35	6	19	2	6	3	8	20	68
SOAPfuse***	7	26	6	11	3	7	3	4	19	48
FusionCatcher	9	31	6	24	2	7	2	5	19	67
FusionCatcher***	4	11	4	5	2	2	2	2	12	20
JAFFA	8	15	5	9	2	6	2	2	17	32
JAFFA***	7	16	4	7	3	8	1	2	15	33
EricScript	8	31	4	37	2	10	2	5	16	83
EricScript***	2	21	3	22	0	8	0	7	5	58
MapSplice	8	27	4	15	2	6	2	5	16	53
MapSplice***	8	28	5	11	3	8	2	5	18	52

Open in a new tab

*CSE1L-ENSG00000236127 was removed because ENSG00000236127 is no longer a valid gene in the latest Ensembl database.

**TP: true positive.

***Marked rows are the results reported by our study. Stats for other rows are reported from Liu et al. (34).

The numbers in parenthesis next to the cell line names (header line) are the total number of validated fusion transcripts in each cell line. ChimeRScope reports the highest number of true positives across four different breast cancer cell lines. Overall, ChimeRScope identified 22 true fusion transcripts out of 26 validated fusion transcripts.

Table 5. Fusion transcript prediction on 13 glioma samples with a total of 31 validated fusion transcripts.

	Total prediction	True positives	Recall (sensitivity)	Precision*	F-score
ChimeRScope	236	28	0.903	0.119	0.210
SOAPfuse	183	19	0.613	0.104	0.178
JAFFA	478	27	0.871	0.056	0.106
FusionCatcher	245	21	0.677	0.086	0.152
EricScript	1464	11	0.355	0.008	0.015
MapSplice	567	24	0.774	0.042	0.080

Open in a new tab

* Precision for each method was calculated by dividing the number of true positives by the total prediction (considering that all the untested fusion transcripts as false positives).

ChimeRScope successfully identifies the highest number (28 out of 31) of fusion transcripts, achieving the highest sensitivity, precision and F-score. SOAPfuse reports the second best precision while JAFFA reports the second best sensitivity.

Experimental validation of ChimeRScope predictions

We used paired-end RNA-Seq datasets from three NK lymphoma cell lines that we have published earlier (49). We analyzed these RNA-Seq datasets using all selected methods (Supplementary File 2). In total, ChimeRScope predicted 10 unique fusion transcripts, compared with 25 unique fusion transcripts by SOAPfuse, three by JAFFA and only one by FusionCatcher (Supplementary File 6). On the other hand, EricScript and MapSplice predicted unusually high number (378 and 127 fusion transcripts, respectively) of fusion transcripts compared to the other four methods. Considering that the total number of curated fusion transcripts in all cancers were only around 1000 gene pairs (ChimerKB from ChimerDB 3.0 at http://ercsb.ewha.ac.kr/fusiongene), and the consistently superior performance of the first four methods (ChimeRScope, SOAPfuse, JAFFA and FusionCatcher) against multiple datasets tested, we have limited the experimental validation of the predicted fusion genes to these four methods. Among the four methods, 30 unique fusion transcripts were predicted, including five that were reported by at least two of the four methods (Figure 6). Further analysis on fusion transcripts predicted by SOAPfuse has shown that BOLA2B&SMG1P2, TVP23C&CDRT4 and DSCR4&DSCR4-IT1 are directly associated with well-annotated read-through mRNAs (with NM IDs in NCBI Five Reference Sequence database). This type of predictions is classified as false fusion event by ChimeRScope (see ‘Discussion’ section) and hence is filtered out. In addition, we failed to design primers for ORC6&PLEKG4B and MAPK8&NMU due to low complexity regions. Therefore, these five fusion transcripts predicted by SOAPfuse were also excluded from our experimental validation (Supplementary Table S8 in Supplementary File 10).

Figure 6. — Venn diagram of all fusion transcripts reported by ChimeRScope, SOAPfuse, JAFFA and FusionCatcher on lymphoma cell lines. In total, 30 unique fusion transcripts were reported. ChimeRScope and SOAPfuse predicted the most number of common fusion transcripts (five genes), with two of them exclusively reported by these two methods. The results reported by FusionCatcher have the lowest overlaps with other methods (only one gene).

Of all the fusions predicted by different algorithms, we designed primers for 25 unique fusion transcripts and confirmed 14 fusion transcripts (56%; 14 out of 25 tested fusions) by RT-PCR and Sanger sequencing (Table 6; Supplementary Tables S9 and 10 in Supplementary Files 10 and 11, respectively). Due to space limitations, we chose to show validation results from only four fusion transcripts that are predicted by ChimeRScope with relatively lower number of fusion reads (five to seven FESRs, see Table 6). The Sanger sequencing chromatograms for other validated fusions can be found in Supplementary File 11. Figure 7 highlights the PCR results, primer target regions, Sanger sequencing chromatograms, and the exact fusion junctions marked by red lines (except for RPL14&SRP14 and LRRC37A3&NSF) for these four fusion transcripts. In total, ChimeRScope predicted 10 fusion transcripts from the lymphoma cell lines and we were able to experimentally validate all of these predictions. Thus, there are no false positives predicted by our method (FDR: 0%). However, our method missed four true positives that were predicted by other methods, hence the sensitivity of this method is at 71.4% (10 out of 14), which is the highest among all methods tested. Comparatively, among the 20 tested fusion transcripts from SOAPfuse predictions, only nine fusion transcripts were experimentally confirmed (sensitivity is 64.3% and FDR is 55%). All the fusion transcripts reported by JAFFA (three fusions) and FusionCatcher (one fusion) are also predicted by ChimeRScope and SOAPfuse. Therefore, both JAFFA and FusionCatcher achieve 100% precision rate, but with only 21.4 and 7.1% sensitivities, respectively. We also checked the validated fusion transcripts against those predicted by EricScript and MapSplice. EricScript and MapSplice have correctly predicted only six (sensitivity is 42.9%) and three (sensitivity is 21.4%) of these validated fusion transcripts, respectively. Overall, ChimeRScope reported the best F-score (0.833) for this dataset, compared with 0.529 for SOAPfuse, 0.353 for JAFFA and 0.133 for FusionCatcher (Table 6). We did not include EricScript and MapSplice in the experimental validation; hence, the F-scores and other statistics for these two methods are not available.

Table 6. List of the fusion transcripts confirmed by both PCR and Sanger sequencing in NK cell lines.

Cell line	Fusion transcript	Fusion read counts*
		ChimeRScope		JAFFA	FusionCatcher		SOAPfuse
KHYG1	PEX2&YWHAZ	34		22	(3)		49
KHYG1	ARIH2&PRKAR2A	(3)		(2)	(2)		5
KHYG1	CTSC&RAB38	(2)		(0)	(5)		10
KHYG1	PRKCH&FLJ22447	(0)		(0)	(1)		6
NKYS	LRRFIP1&RBM44	10		(0)	(64)		192
NKYS	RPL14&SRP14	7		(0)	(0)		(0)
NK92	C15orf57&CBX3	6		(6)	(1)		10
NK92	DAB2&FRYL	48		33	30		59
NK92	LEP&SND1	92		51	(60)		121
NK92	LRRC37A3&NSF	5		(0)	(0)		(4)
NK92	MAST2&METTL21A	8		(0)	(0)		(0)
NK92	NCOR2&UBC	5		(3)	(5)		(4)
NK92	NPIPB5&SMG1	15		(0)	(0)		(0)
NK92	PTMA&NPM1	(9)		(0)	(0)		13

Tool	Unique prediction	Test size	TP	FP	Sensitivity	FDR	F-score
ChimeRScope	10	10	10	0	0.714	0	0.833
SOAPfuse**	25	20	9	11	0.643	0.55	0.529
JAFFA	3	3	3	0	0.214	0	0.353
FusionCatcher	1	1	1	0	0.071	0	0.133
EricScript	378		6		0.429
MapSplice	127		3		0.214

Open in a new tab

*The number of identified fusion reads for each fusion transcript identified by each method is listed in the corresponding cells. Cells with parenthesis indicate that the fusion transcripts were filtered out by the corresponding tools and thus not reported in their final results.

**Five fusion transcripts predicted by SOAPfuse were excluded from the validation list because either the complete fusion sequence were associated with well annotated read-through mRNAs or the specific primer binding sites were not available due to repeated nucleotide sequences.

Fourteen fusion transcripts from the NK cell lines were validated by both PCR and Sanger sequencing. In total, ChimeRScope predicts all 10 positive fusion transcripts with no false positive. Moreover, ChimeRScope reports fusion reads for 13 of the total 14 experimentally validated fusion transcripts, higher than SOAPfuse, JAFFA and FusionCatcher. Overall, ChimeRScope reports the highest F-score (0.833) on the NK cell lines, compared with 0.529, 0.353 and 0.133 for SOAPfuse, JAFFA and FusionCatcher, respectively. Fusion transcripts predicted by EricScript and MapSplice were not included in the experimental validation; thus, the FDRs and F-scores were not calculated for these two methods.

Figure 7. — PCR and Sanger sequencing results for four fusion transcripts with low number of FESRs from ChimeRScope predictions. For each fusion transcript track, the left panel is the PCR panel and the right panel displays the predicted fusion sequence and the primer binding site, along with the Sanger sequencing chromatogram. Specifically, each PCR image has four lanes for a 100-bp ladder marker, the fusion transcript amplicon with the band of the matched product pointed by a red arrow, the positive control (actin beta, or *ACTB*) and negative control (water). The right panel shows the name of the fusion partners, the predicted fusion junction sequence (100 bp upstream and downstream, separated by the wildcard ‘N’), the binding sites of the primer pair used in the PCR panel, the chromatogram for the highlighted region (mostly the fusion junction, if applicable). The PCR experiments and the Sanger sequencing results confirmed the existences of these four genes in the NK cell lines. We were unable to resolve the fusion junctions for *RPL14&SRP14* and *LRRC37A3&NSF* due to the poor Sanger sequencing data quality. Therefore, the exact fusion junctions of these two fusions were not marked in the chromatograms.

Literature searches for all experimentally validated fusion transcripts have suggested that some of the chimeric transcripts are potentially oncogenic. For instance, a fusion transcript that is predicted by ChimeRScope, NCOR2&UBC, has also been reported previously in CLL patients (50). NCOR2 is a nuclear receptor corepressor that interacts with members of MAPK-signaling (51), Notch and NF-kappa-B pathways (52). The altered expression of this gene is associated with cell cycle progression and apoptosis in multiple cancers (53,54). Figure 8 illustrates the fusion model of NCOR2&UBC with the predicted (55) functional domains in the resulting chimeric protein. This chimeric transcript combines the first exon of NCOR2 and the second exon of UBC, creating a new transcript with the loss of the SANT (named after switching-defective protein 3 or SWI3, adaptor 2 or ADA2, nuclear receptor co-repressor or N-CoR, transcription factor IIIB or TFIIIB) domain that is responsible for chromatin-remodeling and transcription regulation (56,57). Another validated fusion transcript, LRRC37A3&NSF is predicted only by ChimeRScope method. This fusion involves a gene named N-ethylmaleimide sensitive factor (NSF). Studies have shown that NSF directly interacts with CD28 (58), a gene responsible for T-cell activation and survival. Although triggering of human NK cells by CD80 and CD86 (ligands of CD28) seems to be independent of CD28 (59), the absence of CD28 expressions in NK cell lines (59) could be the result of the LRRC37A3&NSF fusion event. Other exclusive ChimeRScope's predictions like MAST2&METTL21A and NPIPB5&SMG1 are kinase fusions (60) that are more likely to have oncogenic functions in cancer because they involve kinases like MAST2 (microtubule-associated serine/threonine kinase 2) and SMG1 (nonsense-mediated mRNA decay associated PI3K related kinase). At last, the only fusion transcript predicted by all four methods, DAB2&FRYL, includes a potential tumor suppressor gene named DAB2 (disabled homolog 2) which has been found to be associated with tumorigenesis in different cancers (61–63). These fusion transcripts mentioned above warrants further investigation to confirm their specific roles in tumorigenesis.

Figure 8. — The fusion model of *NCOR2&UBC* and the predicted functional domains. This fusion transcript is fused between the first exon of *NCOR2* (3΄ end) and the second exon of *UBC* (5΄ end). The SANT domain from *NCOR2* and the ubiquitin domains from *UBC* are plotted to the approximate position of the corresponding exons. Because the exons with the SANT motif sequence are not included in *NCOR2&UBC*, the predicted domains of the *NCOR2&UBC* fusion transcript only contain the ubiquitin domains from *UBC*.

Computational cost

We compared the computational cost of ChimeRScope against the other methods, in order to check the feasibility of using ChimeRScope on large-scale data analysis. Since each of these methods were developed in different programming languages with different ways of analyzing transcriptome data (Supplementary Table S11 in Supplementary File 12), we chose to run the same four samples from NK cell lines on each method by assigning similar amount of resources and compared the run time of these methods. We executed runs for one tool at a time on our local data analysis server (with 160 cores and 512 GB memory) to prevent potential competition for computational resources if the jobs were run simultaneously. We uniformly assigned 20 threads for each tool with multithreading features. The total run time for each method is listed in Table 7. On average, ChimeRScope pipeline only takes 1.67 h per sample (if integrated with standard RNA-Seq data analysis pipeline) outperforming the other five methods by far with a large margin. EricScript reported the second fastest time (5.21 h per sample). The Scanner and Sweeper modules of ChimeRScope that identify the FESRs and fusion transcript candidates, respectively, took over 90% of the total time due to the compute-intensive nature of these tasks. It is worth mentioning that the search space for ChimeRScope is only around 10% of the total reads (unmapped and discordantly mapped reads), hence our method can be integrated with the standard RNA-Seq data analysis pipeline as an add-on module of post-alignment step using the unaligned read data as input. In contrast, the other five methods are alignment-based methods that independently process the entire read data to predict fusion transcripts.

Table 7. The run time for fusion transcript prediction on NK cell lines using ChimeRScope and other five major methods.

	Total reads	Discordant reads	ChimeRScope				SF*	JF*	FC*	ES*	MS*
			Scanner	Sweeper	Examiner	Total
SRR1648334	39 555 460	4 920 296	0.60	0.78	0.07	1.45		57.03	7.67	4.02	9.87
SRR1648335	64 685 980	7 584 282	0.63	1.17	0.10	1.90		90.75	12.28	7.40	15.07
SRR1648336	48 472 502	6 043 845	1.02	0.57	0.13	1.72		72.74	8.37	4.40	9.92
SRR1648337	49 849 026	6 297 921	0.98	0.53	0.10	1.61		66.63	8.67	5.00	10.05
Time			3.23	3.05	0.40	6.68	101.75	287.15	36.99	20.82	44.91
Time per sample			0.81	0.76	0.10	1.67	25.44	71.79	9.25	5.21	11.23

Open in a new tab

* SF: SOAPfuse; JF: JAFFA; FC: FusionCatcher; ES: EricScript; MS: MapSplice.

The total number of reads for these four samples ranged from 39.6 to 64.7 million. Roughly 12% of the total reads in each sample are defined as discordant reads after TopHat alignment. ChimeRScope only takes >7 h to analyze all discordant reads. The time cost for each step is also listed in this table. Comparatively, the other five methods take significantly longer time to finish, possibly due to the time spent on alignment of the total reads. The run times for SOAPfuse on each individual samples are not available because SOAPfuse analyzes all four different samples in a single run.

Software and web application development

To increase the ease of accessibility to the research community, we developed both the standalone software package and a web-based application of the ChimeRScope using Galaxy server (22). The standalone package is distributed in a single Java Archive (JAR) file and it can be easily integrated with other third-party NGS data analysis software packages. For demonstration purpose and better accessibility, we also installed ChimeRScope on our local Galaxy server. Additionally, we ensured that no differences exist between the web-based Galaxy server version and the command line version of ChimeRScope. The online Galaxy server version of ChimeRScope will benefit the research community, especially for researchers with limited programming experiences. All the scripts and instructions to integrate ChimeRScope into Galaxy server are incorporated in the ChimeRScope wiki page.

DISCUSSION