STAR Protocols. 2022 Feb 28;3(1):101185. doi: 10.1016/j.xpro.2022.101185

FusionAI, a DNA-sequence-based deep learning protocol reduces the false positives of human fusion gene prediction

Pora Kim 1,5,6,7, Hua Tan 1,5, Jiajia Liu 1,4, Himansu Kumar 1, Xiaobo Zhou 1,2,3
PMCID: PMC8892011  PMID: 35252882

Summary

Although many tools have been developed to predict fusion genes from NGS data, the high number of false positives remains a problem. Using the genomic features around fusion gene breakpoints can help identify reliable fusion genes efficiently. To this end, we developed FusionAI, a deep learning pipeline that predicts human fusion gene breakpoints from DNA sequence. FusionAI is freely available at https://compbio.uth.edu/FusionGDB2/FusionAI.

For complete details on the use and execution of this protocol, please refer to Kim et al. (2021b).

Subject areas: Bioinformatics, Health Sciences, Genomics, Molecular Biology, Computer sciences

Graphical abstract


Fusion genes are among the biomarkers of the cancer genome. Identifying reliable fusion genes from RNA-seq data produces many false positives. Using the genomic features around the genomic breakpoints can help select reliable fusion gene breakpoints. FusionAI was trained on 36K human fusion genes and reached an accuracy of 97.4% (Kim et al., 2021b). Users can follow this protocol to prepare the input data for FusionAI, run FusionAI, and generate the related genomic feature information around the fusion breakpoint area.

Highlights

  • FusionAI can predict the fusion breakpoints from the given DNA sequence

  • FusionAI can reduce the false positives among fusion genes predicted by other tools

  • FusionAI can identify the genomic features related to the genomic breakage

  • FusionAI creates a landscape image of 44 human genomic features around the breakpoints



Before you begin

With the rapid accumulation of next-generation sequencing data, many tools have been developed to predict fusion genes from RNA-seq data, such as STAR-Fusion (Haas et al., 2019), Arriba (Uhrig et al., 2021), SOAPfuse (Jia et al., 2013), deFuse (McPherson et al., 2011), and FusionScan (Kim et al., 2019). These tools differ mainly in how they handle RNA sequencing reads that align far apart and reads that map to repeat regions. However, a high number of false positives has been the main problem in fusion gene prediction, and researchers have regarded fusion genes predicted by more than two tools as reliable. This selection approach removes some false positives, but its benefit is limited because all of these tools rely on split RNA sequencing reads. Using other types of information, such as genomic sequence features around the breakpoint area, can be an efficient way to remove more false positives. To help identify reliable fusion genes efficiently, we developed FusionAI, a deep learning pipeline that predicts human fusion gene breakpoints from DNA sequences. For a given fusion gene breakpoint, FusionAI provides the probability that the position is used as a fusion gene breakpoint, together with landscapes of human genomic features around the breakpoint. FusionAI is freely available at https://compbio.uth.edu/FusionGDB2/FusionAI.

The protocol below describes the specific steps for running FusionAI on the fusion genes predicted in the K562 cell line using STAR-Fusion (Haas et al., 2019). By combining the FusionAI output with these predicted fusion genes, we obtain more reliable fusion genes with fewer false positives, using the genomic features around the fusion gene breakpoints in the fusion DNA sequence.

Software prerequisites and data requirements

Our model is installed and run under Linux. Before launching the program, Python (>= v3.0) and the TensorFlow and Keras modules must be installed. You should also prepare fusion gene information for your cancer sample, predicted with other existing tools. Examples of the prerequisites and the input data format can be found on our website: https://compbio.uth.edu/FusionGDB2/FusionAI. All the R packages required to visualize the 44 human genomic features in a 20 Kb DNA sequence are listed in the key resources table under the “R packages to draw feature landscape image” category. The R package “bedtoolsr” can only be installed using devtools::install_github("PhanstielLab/bedtoolsr"); the other R packages can be installed using the install.packages() function.
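Before running the pipeline, it can help to confirm that the Python prerequisites are visible to the interpreter. The following is a minimal sketch, not part of the FusionAI package: the module list is copied from the key resources table, and the helper names `missing_modules` and `python_version_ok` are hypothetical.

```python
import importlib.util
import sys

# Module names taken from the key resources table; adjust to your setup.
REQUIRED_MODULES = ["tensorflow", "keras", "pandas", "numpy", "argparse"]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_version_ok(minimum=(3, 0)):
    """Check the running interpreter against the minimum version in the protocol."""
    return sys.version_info[:2] >= minimum


# Report the interpreter status and any prerequisite that still needs installing.
print("Python version OK:", python_version_ok())
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

An empty list of missing modules suggests the Python side is ready; the R packages for the landscape images have to be checked separately in R.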

Key resources table

REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data

newdat_newmod_jj.h5 FusionAI model in this paper. https://compbio.uth.edu/FusionGDB2/FusionAI/newdat_newmod_jj.h5
gencode_hg19v19_.txt Gene structure information file with UCSC genome browser known gene format of GENCODE version 19. https://compbio.uth.edu/FusionGDB2/FusionAI/gencode_hg19v19_.txt
nib_files_hg19.tar.gz Nib files of all chromosomes of hg19, which were transformed from fasta files provided from the UCSC genome browser. https://compbio.uth.edu/FusionGDB2/FusionAI/nib_files_hg19.tar.gz
chromosome_size.txt This paper https://compbio.uth.edu/FusionGDB2/FusionAI/chromosome_size.txt
features_info.txt This paper https://compbio.uth.edu/FusionGDB2/FusionAI/features_info.txt
feature.tar.gz This paper https://compbio.uth.edu/FusionGDB2/FusionAI/feature.tar.gz

Software and algorithms

Python (>=3.0) Python Software Foundation, 2021: high-level programming language https://www.python.org/downloads/
nibFrag Converts portions of a .nib file back to fasta format. http://hgdownload.soe.ucsc.edu/admin/jksrc.zip
TensorFlow TensorFlow is an end-to-end open source platform for machine learning. https://anaconda.org/conda-forge/tensorflow
keras A deep learning framework developed by François Chollet https://github.com/keras-team/keras
pandas A community project for fast and easy data analysis and manipulation https://pandas.pydata.org/about/
numpy Community project, 2021: array processing for numbers, strings, records, and objects https://numpy.org/
argparse A python module that makes it easy to write user-friendly command-line interfaces https://docs.python.org/3/library/argparse.html
FusionAI_pred.py This paper https://compbio.uth.edu/FusionGDB2/FusionAI/FusionAI_pred.py
FusionAI_FIS.py This paper https://compbio.uth.edu/FusionGDB2/FusionAI/FusionAI_FIS.py
pre_processing_for_FusionAI_from_tab_delim.py This paper https://compbio.uth.edu/FusionGDB2/FusionAI/pre_processing_for_FusionAI_from_tab_delim.py
bedtools (>=2.26.0) (Quinlan and Hall, 2010): a powerful toolset for genome arithmetic https://bedtools.readthedocs.io/en/latest/content/installation.html
R (>=3.5) (Team, 2019): software environment for statistical computing and graphics https://www.r-project.org/
devtools (>=1.13.6) (Wickham et al., 2018): developing R Packages tool https://cran.r-project.org/web/packages/devtools/index.html
bedtoolsr (2.30.0.1) (Patwardhan et al., 2019): genomic data analysis and manipulation http://phanstiel-lab.med.unc.edu/bedtoolsr-install.html
optparse (>=1.6.0) (Davis, 2018): Command Line Option Parser https://cran.r-project.org/web/packages/optparse/index.html
doParallel (1.0.16) (Corporation and Weston, 2020): parallel backend https://cran.r-project.org/web/packages/doParallel/index.html
iterators (1.0.13) (Analytics and Weston, 2020): a package to allow a programmer to traverse through all the elements of a vector, list, or other collection of data https://cran.r-project.org/web/packages/iterators/index.html
magrittr (2.0.1) (Bache and Wickham, 2020): A Forward-Pipe Operator for R https://cran.r-project.org/web/packages/magrittr/index.html
foreach (1.5.1) (Microsoft and Weston, 2020): an idiom that allows for iterating over elements in a collection, without the use of an explicit loop counter. https://cran.r-project.org/web/packages/foreach/index.html
ggplot2 (3.3.5) (Wickham, 2016): Elegant Graphics for Data Analysis https://cran.r-project.org/web/packages/ggplot2/index.html
gridExtra (2.3) (Auguie, 2017): a package to arrange multiple grid-based plots on a page https://cran.r-project.org/web/packages/gridExtra/index.html
scales (1.1.1) (Wickham and Seidel, 2020): Graphical scales map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends. https://cran.r-project.org/web/packages/scales/index.html
cowplot (1.1.1) (Wilke, 2020): a set of themes, functions to align plots and arrange them into complex compound figures, and functions that make it easy to annotate plots and or mix plots with images. https://cran.r-project.org/web/packages/cowplot/index.html
ggpubr (>=0.1.7) (Kassambara, 2018): 'ggplot2' Based Publication Ready Plots https://cran.r-project.org/web/packages/ggpubr/index.html

Materials and equipment

The program in this protocol was written in the Ubuntu Linux system using Python language (>=v.3.0). All experiments were carried out and evaluated under the Ubuntu system with the computational resources listed in Table 1.

CRITICAL: The implementation of the model is lightweight. However, the required memory usage in practice depends on the size of your own data.

Alternatives: 1. Our model can work with fewer CPU cores and less RAM, although a large dataset may take longer. While running the example input, FusionAI used 17.6% of a CPU and 0.3% of the memory of the server described in Table 1. 2. If the feature images are not needed, the software and algorithms for drawing the feature landscape images listed in the key resources table do not need to be installed.

Table 1.

Computation resources used in this study

Operating system Version
CentOS Linux 7.9.2009
CPU information Parameter
RAM Memory 93 GB
Thread(s) per core 2
Core(s) per socket 2
Model 85
Model name Intel(R) Xeon(R) Gold 6254 CPU @ 3.10 GHz
CPU MHz: 2899.816
CPU(s) 36

Step-by-step method details

Download our package and install the prerequisites

Timing: < 10 min

Prepare input data of 20 Kb DNA sequence of fusion genes

Timing: < 1 min

FusionAI takes fusion gene breakpoint information as input, either from other fusion gene prediction tools or from known fusion genes (k562_starfusion.txt and Table 2). The preprocessing script builds a 20 Kb DNA sequence for each fusion gene by combining the +/-5 Kb flanking sequences around the genomic breakpoint of each of the two fusion partner genes (Figure 1 and Table 3).

  • 2.

    Run the preprocessing script to make a 20 Kb DNA sequence from the given fusion gene information. The fusion gene information should include the following fields in tab-delimited format: Hgene, Hchr, Hbp, Hstrand, Tgene, Tchr, Tbp, Tstrand. The command is shown below. Here, $INPUT_FILE is the output file after checking the junction position of the fusion breakpoints in step 2.

> python pre_processing_for_FusionAI_from_tab_delim.py [INPUT_FILE]

> python pre_processing_for_FusionAI_from_tab_delim.py k562_starfusion.txt

CRITICAL: The timing depends on the number of fusion genes in the input file.
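The input format for this step can be checked before the preprocessing script is run. The sketch below is illustrative only and is not the code of pre_processing_for_FusionAI_from_tab_delim.py: it validates the eight tab-delimited fields and computes a 10 Kb window around each breakpoint. The helper names, the optional header handling, and the half-open window convention are assumptions, since the text does not describe the exact boundary convention of the script.

```python
import csv
import io

# Field names as listed in the protocol text.
COLUMNS = ["Hgene", "Hchr", "Hbp", "Hstrand", "Tgene", "Tchr", "Tbp", "Tstrand"]
FLANK = 5000  # +/-5 Kb around each breakpoint


def parse_fusions(handle):
    """Parse tab-delimited fusion rows; raise ValueError on a malformed row."""
    records = []
    reader = csv.reader(handle, delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        if not row or row == COLUMNS:  # skip blank lines and an optional header (assumption)
            continue
        if len(row) != len(COLUMNS):
            raise ValueError(
                f"line {line_no}: expected {len(COLUMNS)} tab-separated columns, got {len(row)}"
            )
        record = dict(zip(COLUMNS, row))
        for key in ("Hbp", "Tbp"):
            record[key] = int(record[key])  # raises ValueError on non-numeric positions
        for key in ("Hstrand", "Tstrand"):
            if record[key] not in ("+", "-"):
                raise ValueError(f"line {line_no}: {key} must be '+' or '-'")
        records.append(record)
    return records


def flank_window(bp, flank=FLANK):
    """Return an assumed half-open (start, end) window of 2 * flank bases around bp."""
    return max(0, bp - flank), bp + flank


# One row from Table 2 as an in-memory example.
example = "BCR\tchr22\t23632600\t+\tABL1\tchr9\t133729450\t+\n"
for rec in parse_fusions(io.StringIO(example)):
    print(rec["Hgene"], flank_window(rec["Hbp"]), rec["Tgene"], flank_window(rec["Tbp"]))
```

Two such windows, one per fusion partner, add up to the 20 Kb described above; how the script orders and orients them by strand is not covered by this sketch.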

Table 2.

Example fusion gene information predicted for the K562 cell line by STAR-Fusion

Hgene Hchr Hbp Hstrand Tgene Tchr Tbp Tstrand
BCR chr22 23632600 + ABL1 chr9 133729450 +
BAG6 chr6 31619433 - SLC44A4 chr6 31833561 -
NUP214 chr9 134074402 + XKR3 chr22 17288973 -

Figure 1. Make input data of FusionAI

Table 3.

Example FusionAI input data produced by the preprocessing script

Hgene Hchr Hbp Hstrand Tgene Tchr Tbp Tstrand 20 Kb fusion DNA sequence
BCR chr22 23632600 + ABL1 chr9 133729450 + TACCAGAGCGGCTGCCAAC…
BAG6 chr6 31619433 - SLC44A4 chr6 31833561 - CAGTGATGCTTCTGCCTCC…
NUP214 chr9 134074402 + XKR3 chr22 17288973 - GATAAAATTTTTTCACTAA…

Run FusionAI

Timing: < 2 s (depending on your data)

FusionAI takes the 20 Kb DNA sequence of each fusion gene from the previous step and outputs two probabilities: that the sequence is not used, and that it is used, as a fusion gene breakpoint (Figure 2).

  • 3.

    Run the FusionAI prediction script to predict the fusion breakpoint tendency with the FusionAI model. Here, $INPUT_FILE is the output file produced when making the 20 Kb DNA sequence in the previous step. $COLA and $COLB are the DNA sequences of the 5′ and 3′ fusion partner genes created in the previous step. To run FusionAI on one specific fusion gene, set $INDEX_OF_FUSION to the row index of the line of interest in the input file.

> python FusionAI_pred.py [-h] -f [INPUT_FILE] -m [MODEL, default: newdat_newmod_jj.h5] -o [OUTPUT_FILE] -A [COLA] -B [COLB] -I [INDEX_OF_FUSION]

> python FusionAI_pred.py -f k562_starfusion.FusionAI.input -o k562_starfusion.FusionAI.output -m newdat_newmod_jj.h5

Figure 2. Diagram of fusion gene breakpoints classification by FusionAI

Select high-scoring fusion genes (or fusion genes of interest) from the FusionAI output

Timing: < 5 s

From the FusionAI output scores for the fusion candidates predicted by other tools, users can select high-scoring fusion genes or fusion genes of interest. This can be done in a text editor or another tool of choice. Users can stop the pipeline at this step if they do not need further analyses, such as the feature importance analysis or drawing a landscape image of human genomic features in fusion genes, which take relatively long. The FusionAI output scores alone still allow users to reduce false positives. For better understanding, Table 4 compares the results across different FusionAI score cutoffs, other prediction tools, and experimentally validated fusion genes. Table 5 shows the accuracy comparisons. Using a higher threshold on the FusionAI output score reduced the false positives efficiently.

  • 4.

    Sort the FusionAI prediction output by the FusionAI score of each fusion gene and select the high-scoring fusion genes. Users can choose the cutoff score, which should be larger than 0.5. Table 4 shows examples chosen with cutoffs of 0.5 and 0.95. The selected fusion genes are then used in the following steps for further analyses, such as screening the feature importance scores and drawing the landscape of human genomic features across the 20 Kb fusion DNA sequence.
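The sorting and cutoff selection in this step can also be scripted instead of done by hand. The following is a minimal sketch under stated assumptions: the rows are already loaded as dictionaries, and the key name `FusionAI_score` is hypothetical because the column layout of the FusionAI output file is not described here. The example scores are four values copied from Table 4.

```python
def select_high_scoring(rows, cutoff=0.5, score_key="FusionAI_score"):
    """Keep rows whose score exceeds the cutoff, sorted from highest to lowest score."""
    kept = [row for row in rows if float(row[score_key]) > cutoff]
    return sorted(kept, key=lambda row: float(row[score_key]), reverse=True)


# Four fusion candidates with scores taken from Table 4.
candidates = [
    {"Hgene": "C16orf87", "Tgene": "ORC6", "FusionAI_score": 0.6423692},
    {"Hgene": "BCR", "Tgene": "ABL1", "FusionAI_score": 0.9999999},
    {"Hgene": "RP11-680G10.1", "Tgene": "GSE1", "FusionAI_score": 0.30633911},
    {"Hgene": "NUP214", "Tgene": "XKR3", "FusionAI_score": 0.95663476},
]

for cutoff in (0.5, 0.95):
    names = [f"{r['Hgene']}-{r['Tgene']}" for r in select_high_scoring(candidates, cutoff)]
    print(cutoff, names)
```

A stricter cutoff keeps fewer candidates, which is the trade-off between precision and recall summarized in Table 5.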

Table 4.

Selection of common fusion genes between FusionAI and other tools based on the FusionAI score including validated fusion genes for the user’s information

Hgene Hchr Hbp Hstrand Tgene Tchr Tbp Tstrand STAR-fusion STAR-fusion & FusionAI >0.5 STAR-fusion & FusionAI >0.95 STAR-fusion & arriba Validated FusionAI score
BCR chr22 23632600 + ABL1 chr9 133729450 + X X X X X 0.9999999
IMMP2L chr7 111127293 - DOCK4 chr7 111409733 - X X X X X 0.9999999
BAG6 chr6 31619432 - SLC44A4 chr6 31833561 - X X X X 0.99999857
RP11-344E13.3 chr17 20771998 + UBBP4 chr17 21730694 + X X X X 0.9999932
BAG6 chr6 31619432 - SLC44A4 chr6 31833378 - X X X X 0.9999831
C10orf76 chr10 103799769 - KCNIP2 chr10 103588956 - X X X X 0.99743265
RP11-321F6.1 chr15 66874586 + SMAD6 chr15 67004005 + X X X 0.9900406
NUP214 chr9 134074402 + XKR3 chr22 17288973 - X X X X X 0.95663476
RP11-96H19.1 chr12 46781755 + RP11-446N19.1 chr12 47046172 + X X 0.93317753
RP11-96H19.1 chr12 46781755 + RP11-446N19.1 chr12 46965038 + X X 0.9303843
RP5-964N17.1 chrX 113181480 - LRCH2 chrX 114398346 - X X 0.8816845
UPF3A chr13 115070392 + CDC16 chr13 115037658 + X X X X 0.8794392
CTC-786C10.1 chr16 85205413 + RP11-680G10.1 chr16 85391068 + X X 0.8380846
C16orf87 chr16 46858297 - ORC6 chr16 46729473 + X X X 0.6423692
RP11-680G10.1 chr16 85391249 + GSE1 chr16 85667519 + X 0.30633911
C16orf87 chr16 46858297 - ORC6 chr16 46727004 + X X 0.13516404
RP11-680G10.1 chr16 85391249 + GSE1 chr16 85682157 + X 0.040422514

Table 5.

Accuracies across different comparisons of results for the users’ information

STAR-fusion FusionAI > 0.5 FusionAI > 0.95 Arriba Validated
TP 6 6 5 4 6
FP 11 8 3 4 0
TN 0 3 8 9 11
FN 0 0 1 2 0
Precision 0.35 0.43 0.63 0.50 1.00
Recall 1.00 1.00 0.83 0.67 1.00
Accuracy 0.35 0.53 0.76 0.68 1.00
F-measure 0.52 0.60 0.71 0.57 1.00
MCC NA 0.34 0.54 0.34 1.00
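The measures in Table 5 follow from the four confusion-matrix counts in each column. As a sketch of the standard definitions (the function name `classification_metrics` is illustrative, and no claim is made that this is the code used to produce the table), the values can be recomputed as follows. The Matthews correlation coefficient is returned as None when its denominator is zero, which corresponds to the NA entry for STAR-fusion.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, accuracy, F-measure, and MCC from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else None
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Counts from the "FusionAI > 0.95" and "STAR-fusion" columns of Table 5.
print(classification_metrics(tp=5, fp=3, tn=8, fn=1))
print(classification_metrics(tp=6, fp=11, tn=0, fn=0))
```

This simple version assumes at least one predicted positive and one actual positive, so the precision and recall denominators are never zero.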

Calculate the feature importance scores across 20 Kb DNA sequence

Timing: < 33 min

After selecting the reliable fusion gene candidates, users can check the distribution of the feature importance scores of individual fusion genes across the 20 Kb fusion DNA sequence. To calculate the feature importance score (FIS), we masked 20 bp at a time by setting all 20 values to zero and measured the change in the prediction outcome caused by this masking. We slid this 20 bp window along the whole 20 Kb input sequence in steps of 20 nucleotides and repeated the procedure to obtain the FIS for every 20 bp segment. In this way, we obtained 20,000/20 = 1,000 FIS values for each input sequence.
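The masking procedure can be illustrated with a small sketch. It is not the code of FusionAI_FIS.py. It assumes a four-channel one-hot encoding with N encoded as all zeros, defines the score as the baseline prediction minus the masked prediction, and uses a toy `toy_predict` function in place of the trained model; all three choices are assumptions made for illustration.

```python
# Assumed one-hot encoding; the encoding used by FusionAI may differ.
ENCODING = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0, 0, 0, 0],
}


def one_hot(seq):
    """Encode a DNA string as a list of per-base channel vectors."""
    return [list(ENCODING[base]) for base in seq]


def feature_importance_scores(encoded, predict, window=20):
    """Zero out one non-overlapping window at a time and record the prediction change."""
    baseline = predict(encoded)
    width = len(encoded[0])
    scores = []
    for start in range(0, len(encoded) - window + 1, window):
        saved = encoded[start:start + window]  # keep the original rows
        encoded[start:start + window] = [[0] * width for _ in range(window)]
        scores.append(baseline - predict(encoded))  # assumed sign convention
        encoded[start:start + window] = saved  # restore before the next window
    return scores


def toy_predict(encoded):
    """Stand-in for the model: fraction of positions encoded as G."""
    return sum(row[2] for row in encoded) / len(encoded)


sequence = "G" * 20 + "A" * 80
print(feature_importance_scores(one_hot(sequence), toy_predict))
```

With a 20,000-base input and a 20-base window, the loop produces 1,000 scores, matching the count given above. In practice `predict` would wrap the trained model, and each of the 1,000 evaluations is a full forward pass, which is consistent with this step taking much longer than prediction.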

  • 5.

    Run the FusionAI feature importance score script to get the feature importance scores across the 20 Kb fusion DNA sequence. Here, $INPUT_FILE is the output file after making the 20 Kb DNA sequence in step 3. $COLA and $COLB are the DNA sequences of the 5′ and 3′ fusion partner genes created in step 3. To run specific fusion genes only, set $INDEX_OF_FUSION to the row indexes of the lines of interest in the input file. If multiple GPUs are available, the number of GPUs can be controlled with the NGPUS parameter. However, a GPU is not necessary (Figure 3).

> python FusionAI_FIS.py [-h] -f FILENAME [-m MODEL, default: newdat_newmod_jj.h5] [-o OUTPUT] [-A COLA] [-B COLB] [-I ROWI] [-N NGPUS]

> python FusionAI_FIS.py -f k562_starfusion.FusionAI.output -o k562_starfusion.FusionAI.output.FIS

Figure 3. Calculate the feature importance scores across 20 Kb fusion DNA sequence

Visualize 44 human genomic features across 20 Kb DNA sequence

Timing: < 1 h 10 min and < 20 min for steps 6 and 7, respectively

After obtaining reliable fusion gene candidates and feature importance scores, it is important to interpret them in terms of human genomic features. In our original work, we integrated 44 human genomic features spanning five categories of important cellular mechanisms: integration sites of 6 viruses, 13 types of repeats, 5 types of structural variants, 15 chromatin states, and 5 gene expression regulatory features (Kim et al., 2021a, 2021b). In this step, users can create two figures showing the landscape of the fusion gene breakpoint-related genomic features across the 20 Kb fusion DNA sequence. Each script creates separate figures for the individual fusion genes that have FIS values from the previous step. All figures are created under the user-defined directory. The first figure shows the overlap between the 20 Kb fusion DNA sequence and the 44 genomic features, and the second figure shows the overlap between the top 1% FIS regions and the 44 genomic features (Figure 4).

  • 6.

    Visualize 44 human genomic features across a 20 Kb DNA sequence. Run the FusionAI genomic feature analysis script to make a landscape image of the overlap between the fusion breakpoint area (+/- 5 Kb) and the 44 human genomic features.

> Rscript FusionAI_genomic_features.R -g [FUSION_GENE_FILE] -f [FEATURE_PATH] -s [CHROMOSOME_SIZE_FILE] -i [FEATURE_INFO_FILE] -o [OUTPUT_FILE_PATH]

> Rscript FusionAI_genomic_features.r -g K562_STARfusion.FusionAI.output.FIS -f ./features/ -s chromosome_size.txt -i features_info.txt -o ./K562/whole_features/

  • 7.

    Visualize the overlaps between the top 1% FIS regions and 44 human genomic features across the 20 Kb DNA sequence. Run the FusionAI genomic feature analysis script to draw the landscape of the overlap between the high-FIS regions of fusion genes and the 44 human genomic features.

> Rscript FusionAI_genomic_features2.R -g [FUSION_GENE_FILE] -f [FEATURE_PATH] -s [CHROMOSOME_SIZE_FILE] -i [FEATURE_INFO_FILE] -o [OUTPUT_FILE_PATH]

> Rscript FusionAI_genomic_features2.r -g K562_STARfusion.FusionAI.output.FIS -f ./features/ -s chromosome_size.txt -i features_info.txt -o ./K562/top1pct_features/
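For orientation, the selection of the top 1% FIS regions can be sketched in a few lines. This is an illustrative assumption about the selection, not the logic of the R script: it ranks windows by the highest score values and reports half-open positions within the 20 Kb sequence, and the function name `top_fis_windows` is hypothetical.

```python
def top_fis_windows(scores, fraction=0.01, window=20):
    """Return (start, end) positions of the highest-scoring windows, in sequence order."""
    n_top = max(1, int(round(len(scores) * fraction)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_top]
    return [(i * window, (i + 1) * window) for i in sorted(ranked)]


# Toy FIS vector of 1,000 windows with two clearly important regions.
fis = [0.0] * 1000
fis[3] = 0.9
fis[500] = 0.8
regions = top_fis_windows(fis)
print(len(regions), regions[:3])
```

With 1,000 windows per fusion gene, the top 1% corresponds to 10 windows of 20 bp each. Mapping these positions back to genomic coordinates depends on the breakpoint positions and strands of the two partner genes and is not covered here.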

Figure 4. Left: distribution of 44 human genomic features across 20 Kb fusion DNA sequence. Right: overlap between the top 1% FIS regions and 44 different types of human genomic features across 20 Kb fusion DNA sequence.

Expected outcomes

The above commands will generate the following results from your fusion gene candidates’ information:

  • FusionAI output scores: the FusionAI result will be saved in the current working directory under your preferred output file name.

  • Feature importance scores: 1,000 feature importance scores for each fusion gene predicted as a potential fusion breakpoint will be saved in the current working directory under your preferred output file name.

  • Genomic feature landscape images: the distribution of 44 human genomic features across the 20 Kb fusion DNA sequence will be saved in the current working directory under your preferred output file name.

Limitations

For prediction tasks, since our program provides an option of taking one fusion at a time, there is no problem running it on a CPU.

Troubleshooting

Problem 1

Installation of FusionAI fails due to uninstalled prerequisites (step 1).

Potential solution

Please install the required dependencies manually through the links we provided in the key resources table, and then try installing FusionAI again.

Problem 2

Installation of FusionAI fails because an old version of Python is used (step 1).

Potential solution

Please install a recent version of Python (at least v3.0), and then try installing FusionAI again.

Problem 3

The preprocessing script fails to read the fusion gene information (step 2).

Potential solution

Please make the fusion gene information following the format described in step 2.

Problem 4

FusionAI fails to read the input file or parse it correctly.

Potential solution

Currently, FusionAI can only parse tab- or space-separated files. Please check the format of the input file and make sure that each column is properly separated and each row has the same number of columns.

Problem 5

FusionAI fails at the one-hot encoding step.

Potential solution

Make sure the input DNA sequences contain only five letters: A, C, G, T, and N.
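A quick scan of the sequences can locate the characters that cause this failure. The sketch below is an illustration with a hypothetical helper name, `find_invalid_bases`. It treats lowercase letters as invalid, which is an assumption; whether FusionAI accepts lowercase (soft-masked) bases is not stated here.

```python
ALLOWED_BASES = set("ACGTN")


def find_invalid_bases(seq, allowed=ALLOWED_BASES):
    """Return (position, character) pairs for every character outside the allowed set."""
    return [(i, ch) for i, ch in enumerate(seq) if ch not in allowed]


print(find_invalid_bases("ACGTN"))
print(find_invalid_bases("ACXGT-"))
```

If lowercase bases are reported, converting the sequence with `seq.upper()` before rerunning may be worth trying.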

Problem 6

Running FusionAI fails due to missing parameters (step 3).

Potential solution

Please provide the essential parameters to run FusionAI such as input and output file names, and then run FusionAI again.

Problem 7

Creating the genomic feature landscape image fails because the human genomic feature information files were not downloaded (step 3).

Potential solution

Please download the human genomic feature information files from the link we provided in the key resources table, and then run the script again.

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Pora Kim (pora.kim@uth.tmc.edu).

Materials availability

This study did not generate new unique reagents.

Acknowledgments

This work was partially supported by the National Institutes of Health grants (R35GM138184) to P.K. The funders had no role in study design, data collection, analysis, decision to publish, or preparation of the manuscript. Funding for open access charge: Startup Fund to P.K. from the University of Texas Health Science Center at Houston.

Author contributions

Model data preparation, P.K.; model development, H.T. and P.K.; genomic feature data preparation, P.K. and J.L.; visualization of genomic features, P.K. and J.L.; test, H.K.; manuscript writing, P.K.; figures, P.K.; supervision, P.K. and X.Z.

Declaration of interests

The authors declare no competing interests.

Data and code availability

Code is available at https://compbio.uth.edu/FusionGDB2/FusionAI/.

References

  1. Analytics R., Weston S. 2020. Iterators: Provides Iterator Construct. R Package Version 1.0.13.
  2. Auguie B. 2017. gridExtra: Miscellaneous Functions for "Grid" Graphics. R package version 2.3.
  3. Bache S.M., Wickham H. 2020. Magrittr: A Forward-Pipe Operator for R. R Package Version 2.0.1.
  4. Corporation M., Weston S. 2020. doParallel: foreach parallel adaptor for the 'parallel' package. R package version 1.0.16.
  5. Davis T. 2018. Optparse: Command Line Option Parser. R Package version 1.6.0.
  6. Haas B.J., Dobin A., Li B., Stransky N., Pochet N., Regev A. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome Biol. 2019;20:213. doi: 10.1186/s13059-019-1842-9.
  7. Jia W., Qiu K., He M., Song P., Zhou Q., Zhou F., Yu Y., Zhu D., Nickerson M.L., Wan S., et al. SOAPfuse: an algorithm for identifying fusion transcripts from paired-end RNA-Seq data. Genome Biol. 2013;14:R12. doi: 10.1186/gb-2013-14-2-r12.
  8. Kassambara A. 2018. ggpubr: 'ggplot2' Based Publication Ready Plots. R package version 0.1.7.
  9. Kim P., Jang Y.E., Lee S. FusionScan: accurate prediction of fusion genes from RNA-Seq data. Genomics Inform. 2019;17:e26. doi: 10.5808/GI.2019.17.3.e26.
  10. Kim P., Tan H., Liu J., Lee H., Jung H., Kumar H., Zhou X. FusionGDB 2.0: fusion gene annotation updates aided by deep learning. Nucleic Acids Res. 2021;50:D1221–D1230. doi: 10.1093/nar/gkab1056.
  11. Kim P., Tan H., Liu J., Yang M., Zhou X. FusionAI: predicting fusion breakpoint from DNA sequence with deep learning. iScience. 2021;24:103164. doi: 10.1016/j.isci.2021.103164.
  12. McPherson A., Hormozdiari F., Zayed A., Giuliany R., Ha G., Sun M.G., Griffith M., Heravi Moussavi A., Senz J., Melnyk N., et al. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Comput. Biol. 2011;7:e1001138. doi: 10.1371/journal.pcbi.1001138.
  13. Microsoft, Weston S. 2020. Foreach: Provides Foreach Looping Construct. R package version 1.5.1.
  14. Patwardhan M.N., Wenger C.D., Davis E.S., Phanstiel D.H. Bedtoolsr: an R package for genomic data analysis and manipulation. J. Open Source Softw. 2019;4:1742. doi: 10.21105/joss.01742.
  15. Quinlan A.R., Hall I.M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–842. doi: 10.1093/bioinformatics/btq033.
  16. Team R.C. R Foundation for Statistical Computing; 2019. R: A Language and Environment for Statistical Computing.
  17. Uhrig S., Ellermann J., Walther T., Burkhardt P., Frohlich M., Hutter B., Toprak U.H., Neumann O., Stenzinger A., Scholl C., et al. Accurate and efficient detection of gene fusions from RNA sequencing data. Genome Res. 2021;31:448–460. doi: 10.1101/gr.257246.119.
  18. Wickham H., Hester J., Chang W. 2018. Devtools: Tools to Make Developing R Packages Easier. R package version 1.13.6.
  19. Wickham H., Seidel D. 2020. Scales: Scale Functions for Visualization. R Package version 1.1.1.
  20. Wickham H. Springer-Verlag; 2016. ggplot2: Elegant Graphics for Data Analysis.
  21. Wilke C.O. 2020. Cowplot: Streamlined Plot Theme and Plot Annotations for 'ggplot2'. R Package version 1.1.1.


Articles from STAR Protocols are provided here courtesy of Elsevier
