BMC Genomics. 2025 Jun 30;26:670. doi: 10.1186/s12864-025-11866-6

Gene fusion detection in long-read transcriptome sequencing data with GFvoter

Xiaolan Zhao 1,2,#, Zitong Ren 1,#, Junhai Qi 1, Enfeng Qi 3, Xiaoyu Zhao 4, Guojun Li 1, Ting Yu 1,✉,#
PMCID: PMC12269178  PMID: 40676509

Abstract

Gene fusion is a prevalent occurrence in cancer patients, and fusions are significant both as diagnostic biomarkers and as therapeutic targets for cancer. Long-read transcriptome sequencing technology provides new opportunities for gene fusion detection. In this research, we have developed GFvoter, a novel method that employs a multivoting strategy to identify gene fusions from long-read transcriptome sequencing data. GFvoter calls two RNA-seq aligners, two fusion detection tools, and a newly designed scoring mechanism to conduct the so-called voting process in turn, which enables the accurate detection of potential fusions. We validated GFvoter using both simulated and real cell line datasets from PacBio and Nanopore and found that GFvoter significantly outperforms alternative methods. Moreover, GFvoter successfully reported the RPS6KB1:VMP1 gene fusion in the MCF-7 cell line, while none of the other tested tools detected this fusion. Overall, our findings show that GFvoter can accurately identify gene fusions from long-read RNA-seq data, which has the potential to improve cancer diagnosis and treatment. GFvoter is available at https://github.com/xiaolan-z/GFvoter.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12864-025-11866-6.

Keywords: Gene fusion detection, Long-read transcriptome sequencing, Multivoting, Scoring mechanism

Background

Gene fusion, one of the significant ways in which new genes are generated, is the process by which partial or complete sequences of two different genes are joined together via some mechanism, ultimately forming a new gene. The most common mechanisms of gene fusion are chromosomal translocation (partial cross exchange of two chromosomal fragments), gene deletion (loss of a certain interval between genes), and chromosomal inversion (reconnection of a segment of the same chromosome after 3’ to 5’ reversal). Fusion genes are transcribed to produce fusion RNAs, also known as fusion transcripts or chimeric transcripts, which contain portions of two originally independent genes and encode a fusion protein [1–3].

Numerous studies have shown that gene fusion is closely related to the occurrence and development of various diseases, especially cancer. Fusions drive the development of 16.5% of cancer cases and play a unique driving role in more than 1% of cases [4]. Chronic myeloid leukaemia (CML) is a typical example of a malignant tumour defined by a gene fusion event, BCR-ABL1, caused by mutual translocation between chromosomes 9 and 22 [5], leading to genomic instability and disruption of signalling pathways [6]. Other examples of gene fusions that play key roles in disease occurrence are the EML4-ALK fusion and CD74-NRG1 fusion in non-small cell lung cancer and the FGFR3-TACC3 fusion in glioblastoma [7] and bladder cancer [8]. Therefore, gene fusion detection is important and meaningful for in-depth research.

Third-generation or long-read sequencing technology (represented by PacBio and Oxford Nanopore) produces long reads (usually more than 1 kb) at low cost, allowing most transcript sequences to be covered by a single read, avoiding the need for transcriptome assembly, and achieving accuracy comparable to that of second-generation sequencing technology [9]. This approach has special advantages for the analysis of complex regions in the genome, especially regions with complex structures, and provides new opportunities for gene fusion detection. With the development of this sequencing technology, several alignment software programs suitable for third-generation sequencing data have emerged, such as minimap2 [10], Winnowmap2 [11], and ngmlr [12], which is conducive to gene fusion detection. To date, several tools have been developed for detecting gene fusions in long-read transcriptome sequencing data. LongGF [13] aligns reads to the genome with minimap2 and identifies reads that align to multiple genomic positions; JAFFAL uses minimap2 for double alignment to a reference transcriptome and genome; Genion [14] uses deSALT [15] to align reads twice with the genome; AERON [16] uses GraphAligner [17] to align reads with the transcriptome; and FusionSeeker [18] identifies fusions and reconstructs their transcript sequence with partial order alignment. However, the quantity and authenticity of reported fusions still need to be improved. This requires the further development of novel bioinformatic approaches for fusion detection in long-read data.

Here, we present GFvoter, a tool that uses a new multivoting strategy to detect gene fusions from long-read transcriptome sequencing data. GFvoter takes long reads as input and calls Minimap2 and Winnowmap2, LongGF, JAFFAL and a newly designed scoring mechanism to conduct the voting process in turn and finally outputs a gene fusion list. We evaluated the performance of GFvoter using real cancer cell line data from both PacBio and Oxford Nanopore platforms, and further validated its robustness and generalizability using simulated datasets, and compared the results with those of LongGF, JAFFAL, and FusionSeeker. Our results indicated that GFvoter exhibited superior performance over the alternatives. Notably, when we applied these four fusion callers to a PacBio dataset from the human MCF-7 breast cancer cell line, GFvoter successfully reported the RPS6KB1:VMP1 fusion, while the others did not. The usage of GFvoter is described in Supplementary Materials S1.

Results

To evaluate the performance of GFvoter, we conducted a comparative analysis with three state-of-the-art fusion detection tools, namely, LongGF, JAFFAL, and FusionSeeker. This evaluation involved seventeen datasets, which included ten simulated datasets and seven real datasets. The real datasets included public long-read transcriptome sequencing of three cancer cell lines, namely, MCF-7, HCT-116, and A549, and one sample from a patient with acute myeloid leukaemia (AML). The ten simulated datasets were generated for the same 2500 fusion genes by Haas et al. [19].

The ground truth for the real data utilized in this study was derived from the fusion genes obtained from the Mitelman database, which is similar to JAFFAL; these fusion genes have undergone technical validation and are supported by relevant literature. The ground truth for the simulated data comes from 2500 fusion events simulated by Haas et al. The reported fusions mentioned in this article refer to the fusion genes detected by gene fusion detection tools, whereas known fusions refer to the reported fusion genes belonging to the ground truth. Notably, all the results presented in this article have undergone deduplication, that is, fusions with multiple breakpoints were counted as a single true or false positive.

GFvoter exhibits high accuracy on both real and simulated datasets

The gene fusion detection results of GFvoter, LongGF, JAFFAL and FusionSeeker on real and simulated datasets are summarized in Table 1. We evaluated the performance of the four tested fusion detection methods on the basis of these results. Accuracy was measured in terms of precision (the percentage of reported known fusions out of all the predicted ones), and sensitivity was measured in terms of recall (the fraction of reported known fusions in the ground truth). Similar to the approach used in previous studies that evaluated algorithms on real datasets [20], the ground truth used to calculate recall in this study is the union of the known fusions detected by the four methods for each real dataset. This choice reflects the fact that, for real data, the true set of gene fusions (i.e., the absolute ground truth) is usually unknown: databases of known fusions are generally incomplete, and the fusions truly present in a sample cannot be fully enumerated. For the simulated data, we tested the four fusion callers on ten simulated datasets with sequence identity levels ranging from 75% to 95%. The results for two of the simulated datasets are presented in Table 1, and the results for all the simulated datasets are provided in Supplementary Material S2.

Table 1.

The number of gene fusions reported across nine long-read sequencing datasets by GFvoter, LongGF, JAFFAL and FusionSeeker, including those previously validated as known fusions, which are indicated in parentheses

Each cell: #reported fusions (known fusions)

Dataset               GFvoter     LongGF      JAFFAL      FusionSeeker
Real datasets
  Pac MCF-7 Rep1      9(5)        3(2)        11(4)       1(1)
  Pac MCF-7 Rep2      39(10)      28(7)       45(10)      12(3)
  Pac MCF-7 Rep3      45(28)      72(29)      488(40)     107(23)
  Pac HCT-116         8(2)        0(0)        9(2)        0(0)
  ONT MCF-7           16(10)      36(12)      100(13)     46(12)
  ONT AML             37(2)       102(2)      502(11)     66(2)
  ONT A549            1(1)        57(0)       198(4)      37(0)
Simulated datasets
  Pac fus_sim 90      401(385)    365(344)    429(370)    295(217)
  ONT fus_sim 90      361(343)    356(335)    426(362)    319(227)

On the basis of the results from these nine datasets, GFvoter demonstrated the ability to capture fusion gene information accurately, achieving the highest precision on most of the tested datasets and recall higher than or comparable to that of the other tools. Specifically, GFvoter achieved the highest average precision (58.6%) across the nine datasets, surpassing LongGF and FusionSeeker (average precision values of 39.5% and 35.6%, respectively). JAFFAL had the lowest average precision (30.8%). For the PacBio MCF-7 Rep1 dataset, GFvoter did not exhibit the highest precision but reached the highest recall. Specifically, GFvoter reported 9 gene fusions, 5 of which were previously discovered and validated. GFvoter was the tool that detected the highest number of known fusions among the four detection methods. Although FusionSeeker achieved 100% precision, it reported only one fusion, which was significantly lower than the number reported by GFvoter. In addition, for the ONT MCF-7 dataset, where the recall of GFvoter was the lowest, GFvoter detected 16 gene fusions, of which 10 were previously validated, whereas JAFFAL reported 100 fusions, of which only 13 were known fusions; thus, GFvoter reported a much smaller number of false positive fusions. For the ONT AML and ONT A549 datasets, although GFvoter did not have the highest recall, its precision was obviously better than that of the other tools. Overall, it can be concluded that the GFvoter method exhibited superior performance in balancing precision and recall. In addition, GFvoter also achieved better performance on the two simulated datasets (Table 1; Fig. 1B).
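The average precision values quoted above can be rechecked directly from the counts in Table 1. The sketch below is illustrative only: the variable and function names are ours, and it assumes that a tool reporting no fusions on a dataset (0(0) in Table 1) is assigned a precision of 0 for that dataset, a convention that happens to reproduce the reported averages.

```python
# (reported fusions, known fusions) per dataset, copied from Table 1 in row order.
TABLE1 = {
    "GFvoter": [(9, 5), (39, 10), (45, 28), (8, 2), (16, 10),
                (37, 2), (1, 1), (401, 385), (361, 343)],
    "LongGF": [(3, 2), (28, 7), (72, 29), (0, 0), (36, 12),
               (102, 2), (57, 0), (365, 344), (356, 335)],
    "JAFFAL": [(11, 4), (45, 10), (488, 40), (9, 2), (100, 13),
               (502, 11), (198, 4), (429, 370), (426, 362)],
    "FusionSeeker": [(1, 1), (12, 3), (107, 23), (0, 0), (46, 12),
                     (66, 2), (37, 0), (295, 217), (319, 227)],
}


def precision(reported, known):
    """Fraction of reported fusions that are known fusions."""
    # Assumption: an empty report counts as precision 0 rather than undefined.
    return known / reported if reported else 0.0


def average_precision(counts):
    """Unweighted mean of the per-dataset precision values."""
    return sum(precision(r, k) for r, k in counts) / len(counts)


AVERAGE_PRECISION = {
    tool: round(100 * average_precision(counts), 1)
    for tool, counts in TABLE1.items()
}
print(AVERAGE_PRECISION)
# {'GFvoter': 58.6, 'LongGF': 39.5, 'JAFFAL': 30.8, 'FusionSeeker': 35.6}
```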

Fig. 1.

Evaluation of the performance of the four fusion callers on the nine datasets. A The F1 score of the four fusion callers on the nine datasets. The four fusion callers are shown in different colors, and GFvoter has the highest F1 score. B The precision (y-axis) and recall (x-axis) of the four fusion callers on the nine datasets. GFvoter has the highest precision on eight of these nine datasets. Please note that the x-axis and y-axis in this figure use non-zero starting points to emphasize relative performance differences between methods. While this visualization strategy highlights subtle but meaningful distinctions in precision and recall values, readers are cautioned to interpret the visual scale in conjunction with the exact numerical values provided in Table 1 for accurate comparison

We further calculated the F1 score, the harmonic mean of recall and precision (calculated as $F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$), to evaluate the overall performance of each fusion caller. The results are shown in Fig. 1A. Of the tools tested, GFvoter produced the highest F1 score on all nine experimental datasets. Specifically, GFvoter achieved an average F1 score of 0.569 (ranging from 0.080 to 0.972), whereas the average F1 score of JAFFAL was 0.386 (ranging from 0.039 to 0.902), that of LongGF was 0.407 (ranging from 0 to 0.910), and that of FusionSeeker was 0.291 (ranging from 0 to 0.645). Overall, GFvoter showed significantly better precision-recall balance on real datasets and better or comparable performance to LongGF and JAFFAL on simulated datasets (Supplementary Materials Fig. S1-S2).
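The evaluation scheme for real data (precision against a database of known fusions, recall against the union of known fusions found by any tool, and F1 as their harmonic mean) can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: the function name, the toy fusion names and the convention of returning 0 when a denominator is zero are all assumptions.

```python
def evaluate_callers(reported_by_tool, known_database):
    """Return {tool: (precision, recall, f1)} for a single dataset.

    reported_by_tool: {tool name: set of deduplicated fusion names}
    known_database:   set of previously validated fusion names
    """
    # Known fusions per tool: reported fusions that appear in the database.
    known_hits = {tool: calls & known_database
                  for tool, calls in reported_by_tool.items()}
    # Recall reference: union of the known fusions detected by all tools.
    reference = set().union(*known_hits.values())

    results = {}
    for tool, calls in reported_by_tool.items():
        true_positives = len(known_hits[tool])
        # Assumption: undefined ratios (empty denominators) are scored as 0.
        p = true_positives / len(calls) if calls else 0.0
        r = true_positives / len(reference) if reference else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        results[tool] = (p, r, f1)
    return results


# Toy input with made-up fusion names, for illustration only.
known = {"A:B", "C:D", "E:F"}
calls = {"tool1": {"A:B", "C:D", "X:Y", "Z:W"}, "tool2": {"E:F"}}
SCORES = evaluate_callers(calls, known)
print(SCORES)
```

Because the recall reference is built from the tools under comparison, recall values obtained this way are relative to the pooled detections rather than to the true (unknown) set of fusions in the sample.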

Evaluation of GFvoter’s component contributions

GFvoter offers a highly effective pipeline for detecting fusion genes, and each of the aligners and fusion predictors used contributes to its accuracy. We tested the performance of GFvoter on nine datasets when Winnowmap2, LongGF, and JAFFAL were removed individually. The results (Supplementary Materials S3.1) indicated that the contribution of each tool is difficult to quantify and varies depending on the dataset.

Benchmarking under varying read thresholds

In addition, we evaluated the performance of each fusion predictor under varying minimum read thresholds. Because the minimum support threshold for fusions reported by different detection tools varies, LongGF may not report fusions that are supported by only a single read under default parameters. We used three minimum read thresholds (1, 2, 3) to run GFvoter, LongGF, JAFFAL, and FusionSeeker on two datasets. As detailed in Table S4 (Supplementary Materials S3.2), increasing the minimum read threshold generally enhances the precision of the four fusion detection tools while decreasing recall. However, GFvoter consistently ranks first in F1 score under the different thresholds, demonstrating its advantage over the other available tools.

Identification of gene fusions in the human breast cancer cell line MCF-7

We tested these four software tools on data from the human breast cancer cell line MCF-7, in which fusions had been previously validated via RT‒PCR and Sanger sequencing and there was orthogonal evidence of a translocation from whole-genome sequencing. The test results demonstrated the potential of GFvoter in detecting fusion genes using its designed scoring mechanism. The detailed results of the experiment are presented in Table 2. A total of 12 gene fusions were identified by the four callers in the real PacBio data from MCF-7. GFvoter detected 9 gene fusions, 5 of which were validated. GFvoter reported the RPS6KB1:VMP1 gene fusion, while the other three tools did not detect this fusion. RPS6KB1 and VMP1 are neighbouring genes located on the long arm of chromosome 17 at position 23. The fusion of VMP1 and RPS6KB1 was reported for the first time in 2011 [21]. In this work, the authors described an RPS6KB1/VMP1 fusion transcript that is the product of a tandem duplication and is present in breast cancer samples. The VMP1 gene is normally upstream of RPS6KB1. The tandem duplication changes this orientation, merging the 5’ part of RPS6KB1 and the 3’ part of VMP1. Since the fusion does not include an intact functional protein domain, the authors surmised that the fusion transcript may be a complex genomic “indicator” of genetic instability at the 17q23 locus that leads to gene amplification or the overexpression of critical oncogenic elements. The results for the remaining six real datasets are presented in Supplementary Materials S4.

Table 2.

Fusion genes reported by GFvoter, LongGF, JAFFAL, and FusionSeeker from the PacBio MCF-7 cell line data. GFvoter successfully reported the fusion gene RPS6KB1:VMP1, while none of the other tested tools detected this fusion

Gene fusions reported Fusion Detection Tools Known fusion
GFvoter LongGF JAFFAL FusionSeeker
RPS6KB1:DIAPH3 yes
C16orf62:IQCK
SLC25A24:NBPF6 yes
DPH7:PNPLA7 yes
CCNI:C11orf80
NCOA3:ACTL6A
BCAS4:BCAS3 yes
PMPCA:C1orf95
RPS6KB1:VMP1 yes
RP3-430N8.11:ARMCX3
HIPK1:DENND2C
AC099850.1:VMP1

Computational resources

The runtime and memory consumption of each method we compared (GFvoter, LongGF, JAFFAL and FusionSeeker) are shown in Supplementary Materials S7. The results show that most of GFvoter’s time is spent on invoking other tools, whereas the scoring stage of GFvoter consumes very little time, which aligns with our design goal of minimizing the overhead of the final scoring process. Regarding memory consumption, the differences among the tools are not substantial on small-scale datasets, but GFvoter consumes more memory than the other three tools. Considering the improved performance of GFvoter in accurately detecting gene fusions, as demonstrated in our previous performance evaluation sections (F1-score range from 0.453, 0.222, 0.381 to 0.606, respectively), the increased memory consumption can be regarded as a reasonable trade-off for achieving higher accuracy in gene fusion detection.

Conclusions

Among various sequencing technologies, long-read sequencing has become increasingly popular because of its ability to capture nearly full-length transcript sequences. This feature offers more convenient detection of gene fusions, transcript reconstruction, and analysis of expression differences between genes. False-positive fusion genes stem primarily from inherent technical limitations in sequencing and data analysis. For example, sequencing errors (such as base substitution, insertion, or deletion during long-read sequencing), transcriptional noise (non-functional chimeric transcripts generated by random splicing events in cells), and genomic structural complexity (e.g., repetitive sequences or copy number variations) can all mislead fusion detection algorithms into identifying non-existent fusion events. In particular, the use of long-read sequencing data for gene fusion detection can avoid the ambiguity introduced by the transcript assembly step. Here, we introduce GFvoter, a tool for identifying gene fusions from long-read transcriptome sequencing data that uses a multivoting strategy. We demonstrate that GFvoter outperforms three existing fusion detection tools, namely, LongGF, JAFFAL, and FusionSeeker, on both real and simulated data. This superior performance may have a significant impact on disease prediction and treatment development in human cancer research.

GFvoter introduces two main innovations in the detection of fusion genes. One is the use of two aligners to align each set of data twice. The combination of the two results greatly enriches the alignment information, providing more candidates for subsequent gene fusion detection (Supplementary Materials S5). The second innovation is the addition of four proportions for evaluating the quality of the fusion supporting reads. A high-quality read tends to have a near-full-length alignment to the genome, as well as a high overlap of the alignment region with exons. By considering the ratio between them, some false-positive fusions can be filtered out, thus improving the precision.

However, there are several limitations to the use of GFvoter. First, this method requires the annotation of alignments with gene and exon information from the gene annotation file. Consequently, GFvoter may face limitations in detecting gene fusions involving unknown genes. However, this drawback is shared among existing gene fusion detection methods. The detection of gene fusions involving novel genes could be improved in the future through the addition of other genes and their exons to the standard gene annotation file. Second, GFvoter currently considers only one pair of primary and supplementary alignments per read rather than all alignments of the read. For reads with ≥ 3 gene alignments, manual curation or specialized multi-fusion detection modules may be required to distinguish true multi-gene fusions from alignment noise. Third, in the voting session with the scoring mechanism, many conditions are set relatively strictly, which may lead to many real fusions being missed. For example, the filtering criterion of GFvoter (requiring read alignments to fall within gene boundaries) was intended to reduce ambiguous cases where reads may partially align to intergenic or unannotated regions, potentially confounding gene-level analysis. On the one hand, gene fusions typically result from structural rearrangements between functional genes, such as chromosomal translocations. Alignments involving non-genic regions (e.g., intergenic DNA or repetitive elements) are less likely to correspond to functional chimeric transcripts and are more prone to sequencing or assembly artifacts. On the other hand, most validated gene fusions involve exonic regions of known protein-coding genes. By restricting alignments to annotated-gene boundaries, GFvoter aims to reduce false positives from non-genic alignments and focus on fusions with higher potential for functional relevance.
In addition, GFvoter’s improvement in accuracy comes at the cost of increased computational demands compared to some existing tools, which may limit its applicability in large-scale studies. Future work will focus on optimizing GFvoter’s efficiency to reduce computational overhead while maintaining high accuracy.

In future work, we may consider optimizing the long-read data, for example, by reconstructing reads first, and using a reliable clustering algorithm to output more accurate breakpoint positions. We have also attempted to incorporate more fusion gene detection tools into the GFvoter pipeline. To date, we have tried to add FusionSeeker, but the results were not ideal (Supplementary Materials S3.3). We will continue to optimize our scoring mechanism. During the process, we may encounter situations where the accuracy is high but the recall is extremely low, which will require us to continuously adjust the algorithm to increase the recall.

Methods

Framework of GFvoter

GFvoter uses a multivoting approach to call fusion genes accurately, drawing on the strengths of both RNA-seq aligners and fusion gene callers. GFvoter takes long RNA-seq reads as input and produces a reliable fusion gene list as output. First, GFvoter collects potential fusion genes from two sources. One set is obtained by aligning long RNA-seq reads to the reference genome using Minimap2 (v2.28) and Winnowmap2 (v2.03). The other set is discovered by calling two fusion detection tools, LongGF (v0.1.2) and JAFFAL (v2.3). Next, GFvoter calls Minimap2, Winnowmap2, LongGF, JAFFAL and a newly designed scoring mechanism to conduct the so-called voting process on these candidates individually. Finally, the candidates with a vote number exceeding a preset threshold are retained as the gene fusion events to be reported (Fig. 2).

Fig. 2.

GFvoter pipeline for fusion detection. The testing process is divided into two steps: multiple voting and filtering. In the figure, the two dashed boxes represent the input and output of the GFvoter; a dashed square represents a “candidate”, i.e., a gene fusion to be identified; golden triangles represent ballots; and the five hexagons represent the five “voters”. The number of golden triangles in front of the five “voters” represents the number of votes owned by the “voter” in each decision, and the result of each decision of each voter is either zero or the number of votes. The table on the right shows the votes for some candidates

Obtaining fusion candidates from Minimap2 and Winnowmap2

Gene fusions are detected by first aligning long reads to a reference genome (hg38.fa) using the long-read aligners Minimap2 and Winnowmap2 with the option -ax splice, resulting in two BAM files containing all alignment records. The information in the downloaded gene annotation file is sorted and deduplicated, and then the alignment information of long reads from the two BAM files is annotated and summarized into one file, named alignment.info.

Next, we select the “qualified” long reads that may assist us in identifying fusion genes. On the basis of the alignment information, reads that aligned to only one genomic position or mapped to different locations of the same gene are filtered out (as shown in Fig. 3 (a), (b), where long read 1 and long read 2 were aligned to only one gene). The remaining reads have at least two alignment records in different genomic positions (as shown in Fig. 3 (c), (d), (e)), from which GFvoter determines a primary alignment using alignment information obtained from Minimap2 and Winnowmap2; the others are considered supplementary alignments if there are fewer mapped bases in the long read overlapping with mapped bases in the primary alignment; otherwise, they are considered secondary alignments. We eliminate long reads with no supplementary alignment since they are not informative for detecting fusion genes and may even interfere with the detection of fusion genes.
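The read selection described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not GFvoter's implementation: the `Alignment` record, the use of the aligner's primary flag to pick the primary alignment, and the `max_overlap` cutoff that separates supplementary from secondary alignments are all hypothetical, since the text does not give the exact rule or threshold.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    gene: str                 # annotated gene hit by this aligned segment
    read_start: int           # start of the aligned segment on the read (0-based)
    read_end: int             # end of the aligned segment on the read (exclusive)
    is_primary: bool = False  # assumption: taken from the aligner's output


def read_overlap(a, b):
    """Number of read bases shared by two aligned segments."""
    return max(0, min(a.read_end, b.read_end) - max(a.read_start, b.read_start))


def split_alignments(alignments, max_overlap=0.5):
    """Return (primary, supplementary, secondary), or None for uninformative reads.

    max_overlap is a hypothetical cutoff: an alignment sharing less than this
    fraction of its read bases with the primary is treated as supplementary.
    """
    # Reads whose alignments all fall in one gene cannot support a fusion.
    if len({a.gene for a in alignments}) < 2:
        return None
    primary = next((a for a in alignments if a.is_primary), None)
    if primary is None:
        return None
    supplementary, secondary = [], []
    for a in alignments:
        if a is primary:
            continue
        length = a.read_end - a.read_start
        shared = read_overlap(a, primary) / length if length else 1.0
        (supplementary if shared < max_overlap else secondary).append(a)
    # Reads without any supplementary alignment are discarded.
    return (primary, supplementary, secondary) if supplementary else None


chimeric = [Alignment("RPS6KB1", 0, 600, True), Alignment("VMP1", 590, 1200)]
print(split_alignments(chimeric))
```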

Fig. 3.

Alignments of several long reads. Each black rectangle represents a read, i.e., a fragment of DNA or RNA sequence obtained from sequencing; the sky blue and green rectangles represent a portion of read sequence that is perfectly matched to a section of a gene (there may be more than one such read in a single long read); each dark blue rectangle represents the sequence of a gene; and the positions indicated by dashed arrows are the start and end positions of the gene segment aligned with the read

For long reads with more than one supplementary alignment, we select the supplementary alignment with the highest overlap with exons. Next, it is necessary to check whether both the primary alignment and supplementary alignment of a read are aligned within the boundaries of a gene. Specifically, the start position of the alignment should not begin before the start position of the gene, and the end position of the alignment should not extend beyond the end position of the gene. We remove long reads that do not meet these conditions, such as long read 4 and long read 5 in Fig. 3 (d) and (e). Each of the remaining reads, such as the long read shown in Fig. 3 (c), is considered a “qualified” long read that may support gene fusion. Here, combining Minimap2 and Winnowmap2 to detect gene fusions enables more alignment information to be obtained, thus ensuring the precise capture of more fusion candidates.
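The two checks in this paragraph, choosing among several supplementary alignments by exon overlap and requiring each alignment to lie within its gene, can be illustrated as below. The function names, the tuple layout `(gene, start, end)` and the half-open coordinate convention are assumptions made for the sketch; exons are assumed to be non-overlapping intervals.

```python
def within_gene(aln_start, aln_end, gene_start, gene_end):
    """True if the alignment neither starts before nor ends after the gene."""
    return gene_start <= aln_start and aln_end <= gene_end


def exon_overlap(aln_start, aln_end, exons):
    """Total number of aligned genomic bases that fall inside exons.

    exons: list of (start, end) intervals, assumed half-open and non-overlapping.
    """
    return sum(
        max(0, min(aln_end, exon_end) - max(aln_start, exon_start))
        for exon_start, exon_end in exons
    )


def best_supplementary(candidates, exons_by_gene):
    """Pick the (gene, start, end) candidate with the largest exon overlap."""
    return max(
        candidates,
        key=lambda c: exon_overlap(c[1], c[2], exons_by_gene.get(c[0], [])),
    )


# Toy coordinates, for illustration only.
exons_by_gene = {"G1": [(50, 150), (250, 400)], "G2": [(0, 10)]}
candidates = [("G1", 100, 300), ("G2", 0, 50)]
print(best_supplementary(candidates, exons_by_gene))  # ('G1', 100, 300)
print(within_gene(100, 300, 50, 400))                 # True
```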

Finally, we summarize the fusions supported by the “qualified” reads and count the number of distinct fusions; each fusion is denoted as A:B. Note that fusions composed of different fusion partners are considered different fusions. GFvoter represents the s fusion candidates obtained via Minimap2 with a vector Inline graphic and the Inline graphic fusion candidates obtained via Winnowmap2 with a vector Inline graphic, where Inline graphic and Inline graphic denote fusion candidates obtained via Minimap2 and Winnowmap2. Inline graphic.

We define a new operation Inline graphic, whose dimension is the number of elements in the union set of fusions related to all the components in X and Y, and the value of its components is the sum of the corresponding components in X and Y. We record the voting results of Minimap2 and Winnowmap2 as U, i.e., Inline graphic, where Inline graphic. If Inline graphic, the fusion candidate corresponding to the i-th component is derived solely from Minimap2 or Winnowmap2; otherwise, i.e., when Inline graphic, the fusion candidate corresponding to the i-th component appears in the results of both Minimap2 and Winnowmap2. Thus, with the alignment information from Minimap2 and Winnowmap2, we obtain k fusion candidates whose votes are the corresponding components of U.
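The combining operation defined above (take the union of the candidates in X and Y, then add the corresponding components) maps naturally onto dictionaries keyed by fusion name. The sketch below is an illustration under assumptions: the function name is ours, a candidate absent from one vector is treated as contributing 0, and the example assumes each aligner casts one vote per candidate. The assignment of fusions to aligners is made up and does not reflect actual results.

```python
def combine_votes(x, y):
    """Union the candidates of two vote vectors and sum their votes.

    x, y: {fusion name: votes}. A candidate missing from one vector is
    treated as having 0 votes there (assumption).
    """
    return {
        fusion: x.get(fusion, 0) + y.get(fusion, 0)
        for fusion in sorted(x.keys() | y.keys())
    }


# Hypothetical candidate lists; one vote per aligner is assumed.
minimap2_votes = {"RPS6KB1:VMP1": 1, "BCAS4:BCAS3": 1}
winnowmap2_votes = {"BCAS4:BCAS3": 1, "DPH7:PNPLA7": 1}

U = combine_votes(minimap2_votes, winnowmap2_votes)
print(U)
# {'BCAS4:BCAS3': 2, 'DPH7:PNPLA7': 1, 'RPS6KB1:VMP1': 1}
```

With this representation, a candidate found by both aligners ends up with a larger component than one found by a single aligner, which is the distinction the text draws between the two cases.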

Fusion detection with the scoring mechanism

To identify true fusions from fusion candidates accurately, we design a scoring mechanism that targets the k fusion candidates corresponding to each component in U. The mechanism first quantifies the reliability of each read by scoring the alignment quality. Then, a score is assigned to each fusion candidate according to the scores of all the reads supporting that candidate.

Each “qualified” long read r can be expressed as Inline graphic, where Inline graphic and Inline graphic represent the primary alignment and supplementary alignment, respectively; according to the starting position of each alignment on the read, these two alignments are also known as the left alignment and right alignment, recorded as Inline graphic, Inline graphic. Inline graphic is a two-dimensional vector. Inline graphic is a triplet Inline graphic where Inline graphic and Inline graphic denote the length of left soft-mapping and right soft-mapping (soft-mapping allows for a certain degree of inexact matching or mismatches), and conversely, Inline graphic is the length of the read that perfectly matches the reference sequence. Inline graphic is also a triplet Inline graphic, where Inline graphic and Inline graphic represent the start position and the end position on the genome for the alignment, respectively, and Inline graphic denotes the total length of the exon overlapping with the mapping genomic interval. With respect to the results of Minimap2 and Winnowmap2, GFvoter counts the number of supporting reads for each fusion candidate, denoted as Inline graphic. To eliminate some randomness in the library preparation process, the number of supporting reads for each candidate should be greater than 2. As shown in Fig. 3 (f), (g), there should be an appropriate distance between the primary alignment and supplementary alignment, which is given by Eq. 2, and we restrict the distance between the two alignments of each read by a preset value c. The quality of alignment is also related to the total length of overlap with exons and whether it contains secondary alignment (Eq. 3). Additionally, we define four ratios regarding the alignments of each read (Eq. 4, Eq. 5, Eq. 6 and Eq. 7), as shown in Fig. 3 (h), (i), where Inline graphic and Inline graphic are the start and end positions of the gene, respectively. Finally, each read is assigned a reliability score through Eq. 1 (in our experiment, Inline graphic, Inline graphic, Inline graphic). To provide a more reasonable score for each read, we stipulate that Eq. 1 is applied only to reads that meet the criteria of Inline graphic, and Inline graphic. Reads that do not meet these conditions are given a score of 0. For each fusion candidate, its score is calculated by combining the scores of all the reads supporting it. As defined by Eq. 8, the score is the average of the positive scores of all the reads supporting the candidate fusion if there is at least one score greater than 0 among the scores of all the reads supporting that fusion; otherwise, it is 0.

Finally, we represent the voting result as a k-dimensional vector, i.e., Inline graphic, where Inline graphic. When the score of the fusion candidate exceeds a default threshold (400), its corresponding component f takes a value of 6; otherwise, it takes a value of 0.

[Equation 1]
[Equation 2]
[Equation 3]
[Equation 4]
[Equation 5]
[Equation 6]
[Equation 7]
[Equation 8]
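The candidate-level rule stated in the text for Eq. 8 (average the positive read scores, or 0 if none is positive) and the subsequent vote (6 if the candidate score exceeds the default threshold of 400, otherwise 0) can be sketched as follows. The per-read scores themselves come from Eq. 1, which is not reproduced here, so the read scores in the example are made-up inputs; the function names are ours.

```python
def candidate_score(read_scores):
    """Mean of the positive read scores, or 0.0 if no read scores above 0."""
    positive = [s for s in read_scores if s > 0]
    return sum(positive) / len(positive) if positive else 0.0


def scoring_vote(read_scores, threshold=400, votes=6):
    """Votes cast by the scoring mechanism for one fusion candidate.

    The threshold (400) and vote value (6) are the defaults stated in the text.
    """
    return votes if candidate_score(read_scores) > threshold else 0


# Hypothetical read scores for two candidates.
print(candidate_score([0, 900, 500]), scoring_vote([0, 900, 500]))  # 700.0 6
print(candidate_score([0, 500, 300]), scoring_vote([0, 500, 300]))  # 400.0 0
```

Note that reads scored 0 do not pull the average down; they are simply excluded, so one well-supported read can carry a candidate even when the remaining reads fail the Eq. 1 criteria.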

Fusion detection by LongGF and JAFFAL

LongGF and JAFFAL are tools developed to detect gene fusions from long-read transcriptome sequencing data. We obtained the results of fusion gene detection from the downloaded data using these two tools separately. With respect to LongGF, we used different parameters for the PacBio and Nanopore datasets. The parameters used for the PacBio datasets were set to “100 50 100 100 0 0 1”, whereas the parameters used for the Nanopore datasets were set to “100 50 100 100 0 0 2”. Then, we represented the voting results of LongGF and JAFFAL as Inline graphic, where Inline graphic and Inline graphic, where Inline graphic. Inline graphic or Inline graphic represents that LongGF or JAFFAL recognizes the gene fusion corresponding to the i-th component.

Filtering and output of reliable fusions

The results obtained from the five votes, i.e., Minimap2, Winnowmap2, LongGF, JAFFAL, and the custom scoring mechanism, are summed to obtain the final result, i.e., Inline graphic, where Inline graphic and Inline graphic. Thus, the five callers identify a total of h fusion candidates, and the value of each component is the total number of votes for the corresponding fusion candidate. GFvoter filters out the candidates whose votes do not exceed the predefined threshold (6), and the remaining fusions are output as reliable gene fusions. The GFvoter output file includes the fusion ID, fusion name, number of supporting reads, chromosomes and breakpoint positions. GFvoter also indicates whether the fusion has been experimentally verified, that is, whether it is a known fusion, as shown in Table 3. The breakpoint is the average of the breakpoints reported by LongGF, JAFFAL and the scoring mechanism, and the number of supporting reads is the maximum of the values reported by these three sources. Parameter -f of GFvoter can be set to output the score from each fusion method during the voting process, as detailed in Supplementary Materials S6.

Table 3.

Output results of the GFvoter detection of a set of MCF-7 cell line data

| ID | Gene fusion | supporting_reads | chrom1 | breakpoint1 | chrom2 | breakpoint2 | Known |
|---|---|---|---|---|---|---|---|
| 1 | RPS6KB1:DIAPH3 | 4 | chr17 | 59,930,174 | chr13 | 59,666,846 | yes |
| 2 | C16orf62:IQCK | 4 | chr16 | 19,591,873 | chr16 | 19,856,486 | |
| 3 | RPS6KB1:VMP1 | 3 | chr17 | 59,911,140 | chr17 | 59,839,521 | yes |
| 4 | SLC25A24:NBPF6 | 3 | chr1 | 108,161,182 | chr1 | 108,470,593 | yes |
| 5 | BCAS4:BCAS3 | 2 | chr20 | 50,795,173 | chr17 | 61,357,466 | yes |
| 6 | CCNI:C11orf80 | 1 | chr11 | 77,066,406 | chr4 | 66,788,269 | |
| 7 | PMPCA:C1orf95 | 1 | chr9 | 136,412,199 | chr1 | 226,596,802 | |
| 8 | NCOA3:ACTL6A | 1 | chr20 | 47,633,639 | chr3 | 179,573,368 | |
| 9 | DPH7:PNPLA7 | 1 | chr9 | 137,555,648 | chr9 | 137,515,378 | yes |
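
The vote summation, filtering, and output summarization described in this subsection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the GFvoter implementation: the function names (`tally_votes`, `filter_fusions`, `summarize_support`) are hypothetical, and because the per-voter vote values appear only as images in the source, each voter's result is passed in as a mapping from fusion name to the vote value it assigns. Only the filter threshold (6) and the average/maximum rules are taken from the text.

```python
from collections import Counter


def tally_votes(voter_results):
    """Sum the votes of all voters for every fusion candidate.

    voter_results: iterable of dicts mapping fusion name -> vote value.
    """
    totals = Counter()
    for votes in voter_results:
        totals.update(votes)  # adds the vote values per fusion name
    return dict(totals)


def filter_fusions(totals, threshold=6):
    """Keep only the candidates whose total votes exceed the threshold."""
    return {fusion: v for fusion, v in totals.items() if v > threshold}


def summarize_support(calls):
    """Combine per-source calls for one fusion.

    calls: list of (breakpoint1, breakpoint2, supporting_reads) tuples.
    Breakpoints are averaged; supporting reads take the maximum.
    """
    n = len(calls)
    bp1 = round(sum(c[0] for c in calls) / n)
    bp2 = round(sum(c[1] for c in calls) / n)
    reads = max(c[2] for c in calls)
    return bp1, bp2, reads


if __name__ == "__main__":
    # Hypothetical vote values and calls, for illustration only.
    voters = [{"A:B": 1}, {"A:B": 1, "C:D": 1}, {"A:B": 6}]
    totals = tally_votes(voters)
    print(filter_fusions(totals))
    print(summarize_support([(100, 200, 3), (102, 204, 5), (104, 202, 1)]))
```

Keeping the vote values as inputs rather than constants lets the same tally work regardless of how each voter weights its calls.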

Acknowledgements

Not applicable.

Project information

Project name: GFvoter: A software tool for accurate detection of gene fusion in long-read transcriptome sequencing data.

Project home page: https://github.com/xiaolan-z/GFvoter.

Operating system(s): Linux.

Programming language: Python.

Other requirements: Python 3.7.3 or higher, g++ 5.2.0 or higher.

License: MIT License.

Authors’ contributions

T.Y., Z.R., and XL.Z. conceived and designed the experiments. XL.Z. and Z.R. performed the experiments. T.Y., Z.R., and X.Z. analysed the data. XL.Z., Z.R., T.Y., E.Q., XY.Z. and J.Q. contributed reagents, materials, and analysis tools. XL.Z., Z.R., and T.Y. wrote the paper. Z.R. and XL.Z. designed the software used in the analysis. T.Y. and G.L. oversaw the project. All the authors reviewed the manuscript.

Funding

This work was supported by the National Key R&D Program of China with code 2020YFA0712400; the General Program of Guangxi Natural Science Foundation with code 2023GXNSFAA026410; the Fundamental Research Funds for the Central Universities; the National Natural Science Foundation of China with codes 12471461 and 62402286; the Special Grant Fund of the China Postdoctoral Science Foundation with code 2024T170510; and the Shandong Natural Science Foundation with code ZR2023QA059. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data availability

The annotation file we used is gencode.v22.chr_patch_hapl_scaff.annotation.gtf, downloaded from https://www.gencodegenes.org/human/release_22.html. The simulated fusion data are available for download from https://figshare.com/articles/dataset/Long_Read_Fusion_Simulation/14459007, and their ground truths were downloaded from https://data.broadinstitute.org/Trinity//CTAT-LR-Fusion_PAPER/simulated_reads/jaffal_sim_data/. For real data, PacBio sequencing datasets for the MCF-7 and HCT-116 cell lines are available from the Sequence Read Archive (SRA) under SRR1853200, SRR14638376 and SRR5009523. The ONT sequencing datasets for the MCF-7 and A549 cell lines are available at https://github.com/GoekeLab/sg-nex-data. The sequencing data from the acute myeloid leukaemia (AML) patient sample are available from the SRA under SRR12048357. The known fusions within the real data provided by JAFFAL can be downloaded from https://github.com/Oshlack/JAFFA/blob/master/known_fusions.txt. The output of each tool across all the tested datasets is available at https://sourceforge.net/projects/gfvoter/files/. The source code of GFvoter is available at https://github.com/xiaolan-z/GFvoter.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Xiaolan Zhao, Zitong Ren and Ting Yu contributed equally to this work.

Contributor Information

Xiaoyu Zhao, Email: ustcxyz@hotmail.com.

Guojun Li, Email: guojunsdu@gmail.com.

Ting Yu, Email: tingy@sdu.edu.cn.

References

  • 1. Davare MA, Tognon CE. Detecting and targetting oncogenic fusion proteins in the genomic era. Biol Cell. 2015;107(5):111–29.
  • 2. Dorney R, Dhungel BP, Rasko JEJ, Hebbard L, Schmitz U. Recent advances in cancer fusion transcript detection. Brief Bioinform. 2023;24(1):bbac519.
  • 3. Mertens F, Johansson B, Fioretos T, Mitelman F. The emerging complexity of gene fusions in cancer. Nat Rev Cancer. 2015;15(6):371–81.
  • 4. Gao Q, Liang WW, Foltz SM, Mutharasu G, Jayasinghe RG, Cao S, Liao WW, Reynolds SM, Wyczalkowski MA, Yao L, et al. Driver fusions and their implications in the development and treatment of human cancers. Cell Rep. 2018;23(1):227–238.e3.
  • 5. Quintás-Cardama A, Cortes J. Molecular biology of bcr-abl1-positive chronic myeloid leukemia. Blood. 2009;113(8):1619–30.
  • 6. Kang ZJ, Liu YF, Xu LZ, Long ZJ, Huang D, Yang Y, Liu B, Feng JX, Pan YJ, Yan JS, et al. The Philadelphia chromosome in leukemogenesis. Chin J Cancer. 2016;35:48.
  • 7. Singh D, Chan JM, Zoppoli P, Niola F, Sullivan R, Castano A, Liu EM, Reichel J, Porrati P, Pellegatta S, et al. Transforming fusions of FGFR and TACC genes in human glioblastoma. Science. 2012;337(6099):1231–5.
  • 8. Guo G, Sun X, Chen C, Wu S, Huang P, Li Z, Dean M, Huang Y, Jia W, Zhou Q, et al. Whole-genome and whole-exome sequencing of bladder cancer identifies frequent alterations in genes involved in sister chromatid cohesion and segregation. Nat Genet. 2013;45(12):1459–63.
  • 9. Wenger AM, Peluso P, Rowell WJ, Chang PC, Hall RJ, Concepcion GT, Ebler J, Fungtammasan A, Kolesnikov A, Olson ND, et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 2019;37(10):1155–62.
  • 10. Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34(18):3094–100.
  • 11. Jain C, Rhie A, Hansen NF, Koren S, Phillippy AM. Long-read mapping to repetitive reference sequences using Winnowmap2. Nat Methods. 2022;19(6):705–10.
  • 12. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, Schatz MC. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15(6):461–8.
  • 13. Liu Q, Hu Y, Stucky A, Fang L, Zhong JF, Wang K. LongGF: computational algorithm and software tool for fast and accurate detection of gene fusions by long-read transcriptome sequencing. BMC Genomics. 2020;21(Suppl 11):793.
  • 14. Karaoglanoglu F, Chauve C, Hach F. Genion, an accurate tool to detect gene fusion from long transcriptomics reads. BMC Genomics. 2022;23(1):129.
  • 15. Liu B, Liu Y, Li J, Guo H, Zang T, Wang Y. DeSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index. Genome Biol. 2019;20(1):274.
  • 16. Rautiainen M, Durai DA, Chen Y, Xin L, Low HM, Göke J, Marschall T, Schulz MH. AERON: Transcript quantification and gene-fusion detection using long reads. bioRxiv. 2020. 2020.01.27.921338; doi:10.1101/2020.01.27.921338.
  • 17. Rautiainen M, Marschall T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 2020;21(1):253.
  • 18. Chen Y, Wang Y, Chen W, Tan Z, Song Y, Chen H, Chong Z. Gene fusion detection and characterization in Long-Read Cancer transcriptome sequencing data with fusionSeeker. Cancer Res. 2023;83(1):28–33.
  • 19. Haas BJ, Dobin A, Li B, Stransky N, Pochet N, Regev A. Accuracy assessment of fusion transcript detection via read-mapping and de Novo fusion transcript assembly-based methods. Genome Biol. 2019;20(1):213.
  • 20. Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019;20(1):278.
  • 21. Inaki K, Hillmer AM, Ukil L, Yao F, Woo XY, Vardy LA, Zawack KF, Lee CW, Ariyaratne PN, Chan YS, et al. Transcriptional consequences of genomic structural aberrations in breast cancer. Genome Res. 2011;21(5):676–87.


Articles from BMC Genomics are provided here courtesy of BMC
