Abstract
Cancer associated gene fusions (GF) are a potential source for highly immunogenic neo-antigens, but the lack of computational tools for accurate, sensitive identification of personal GFs has limited their targeting in personalized cancer immunotherapy. Here, we present EasyFuse, a machine learning computational pipeline for detecting cancer-specific GFs in transcriptome data obtained from human cancer samples. We provide an extensive experimental confirmation dataset and demonstrate that EasyFuse predicts personal GFs with high precision and sensitivity, outperforming previously described tools. By testing immunogenicity with autologous blood lymphocytes from patients with cancer, we detected pre-established CD4+ and CD8+ T-cell responses for 10 of 21 (48%), and for 1 of 30 (3%) of identified GFs, respectively. The high frequency of T-cell responses detected in cancer patients supports the relevance of individual GFs as neo-antigens that may be targeted in personalized immunotherapies, especially for tumors with low mutation burdens.
Introduction
Cancer is driven by genetic alterations including small variants, such as single nucleotide substitutions, small insertions and deletions, and large structural variants (SVs). SVs can give rise to the expression of gene fusions (GFs) that can drive tumor onset and progression1. Well-known examples are BCR-ABL and NTRK GFs that pioneered the field of targeted and histology-agnostic therapy2,3. More recently, tumor-specific epitopes derived from non-synonymous somatic mutations, so-called neo-epitopes, have been exploited as targets for personalized immunotherapy approaches, enabling T-cells to specifically recognize and kill tumor cells4–6. These approaches utilize the long tail of low-frequency and individual mutations as a source of neo-antigens7,8. However, current approaches focusing on melanoma and cancer types with high mutational burden consider only neo-antigens derived from small variants and omit SV-derived GFs, despite reported evidence for their immunogenicity9,10.
Current bioinformatics approaches predict either SVs from whole genome sequencing data or GFs directly from RNA sequencing (RNA-Seq) data11,12. With exome- and RNA-Seq from archival formalin-fixed paraffin embedded (FFPE) samples being the standard for clinical trials, rapid clinical implementation of GF detection into diagnostic and therapeutic approaches requires highly accurate prediction from FFPE samples. However, available tools verified their predictions only on a paucity of previously described recurrent GFs, a low number of cell line-derived verified GFs, or purely on simulated data11. Therefore, their ability to predict GFs accurately in clinically relevant FFPE samples remains unclear.
Another challenge is the occurrence of GFs in normal tissue, arising from germline SVs that run as polymorphisms in the population (e.g. KANSL-ARL17A/B), or from read-through fusions that occur between proximal genes (cis-near) without underlying mutation (e.g. CTBS-GNG5)13–15. Such non-somatic GFs might appear recurrent in tumor samples, but only cancer-specific GFs can encode true neo-antigens.
Here, we present EasyFuse (https://github.com/TRON-Bioinformatics/easyfuse), a novel tool to predict tumor-specific GFs from clinical tumor samples. Furthermore, we demonstrate immunogenicity of GF-encoded neo-antigens.
Results
Sensitive detection currently requires multiple tools
We initially tested 17 publicly available tools for prediction of 52 previously published GFs in MCF7 and SKBR3 cell lines (Supplementary Figure 1 and Supplementary Table 1)16–21. Only five tools met our baseline criteria; of these, FusionCatcher predicted by far the most candidates, followed by SOAPfuse and InFusion. MapSplice2 and STAR-Fusion predicted lower numbers and showed the highest concordance between sequencing replicates (Figure 1a and Supplementary Figure 2a)11,22–25. Despite these differences, all five tools consistently predicted 29-33 (combined 39) of the 52 published GFs, of which 34 could be confirmed by qRT-PCR (Supplementary Figure 2b). However, 94% of predictions came from single tools and only 12% of these were found within both sequencing replicates (Figure 1b and Supplementary Figure 2c). Since this diversity is not reflected in the small set of published GFs, we designed a semi-automated verification strategy by qRT-PCR and amplicon size confirmation and tested 133 GFs (Supplementary Figure 2d-f and Supplementary Table 2). Although we observed slightly higher confirmation rates for GFs predicted with multiple tools and in sequencing replicates, validation success for single-tool predictions was surprisingly high at 61% (Figure 1c), indicating that performance evaluations based on consensus predictions or published GFs are insufficient.
Next, we predicted GFs with the same five tools from RNA sequencing data of 14 fresh frozen (FF) primary breast cancer samples and obtained a median number of 302 candidates per sample (Figure 1d). Similar to the cell lines, only a small fraction (8%) was identified by multiple tools (Figure 1d and Supplementary Figure 2g). Using the previously established validation approach, we tested 492 GFs and observed higher verification rates (78-100%) for multiple-tool predictions (Figure 1e and Supplementary Table 3). However, with a verification rate of 61%, single-tool predictions form the largest group of estimated positive GFs, accounting for 90% of them (Figure 1f). Of note, single-tool predicted positive-validated GFs are less supported by junction reads and spanning reads (Supplementary Figure 2h and i). Our data indicate that current consensus approaches, which disregard single-tool hits, come at the cost of a dramatic loss in sensitivity11.
EasyFuse improves prediction of tumor-specific GFs
We investigated whether GFs were recurrently found among the 14 breast cancer samples and identified in total 425 (14%) GFs in at least two samples (Figure 2a). We observed that the majority (71%) of those have breakpoints in cis-near configuration (same chromosome, same strand, within 1 Mb) and could therefore be the result of read-through transcription (Figure 2b). The validation rate of recurrent cis-near GFs is comparable to others, suggesting that they are not predominantly caused by prediction artefacts (Supplementary Figure 3a). To investigate tumor-specificity, we analyzed 136 unrelated samples from 48 different normal tissues including four from breast tissue. Here, we observed a high overlap between GFs recurrently identified in tumor samples and GFs in normal breast tissue samples with a high fraction (74%) being in cis-near configuration (Figure 2c). From all recurrent cis-near GFs 39% were identified in normal breast samples and further 49% were identified in other normal tissues, whereas this was observed for only 1% and 5% for unique trans-like GFs, demonstrating that cis-near GFs are enriched for tumor-unspecific transcripts (Supplementary Figure 3b).
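The cis-near criterion stated above (same chromosome, same strand, breakpoints within 1 Mb) can be written down as a small predicate. The following is a minimal sketch, not the EasyFuse implementation: the `Breakpoint` type and the function name are hypothetical, the 1 Mb boundary is treated as inclusive by assumption, and every pair that is not cis-near is simply labeled trans-like, which is a simplification of the grouping used in the text.

```python
from dataclasses import dataclass

# 1 Mb distance limit for cis-near fusions, as stated in the text.
CIS_NEAR_MAX_DISTANCE = 1_000_000


@dataclass(frozen=True)
class Breakpoint:
    """Hypothetical container for one side of a fusion breakpoint."""
    chrom: str
    pos: int
    strand: str  # "+" or "-"


def classify_configuration(bp1: Breakpoint, bp2: Breakpoint) -> str:
    """Return "cis-near" or "trans-like" for a breakpoint pair.

    Assumption: the 1 Mb limit is inclusive, and everything that is
    not cis-near is grouped as trans-like.
    """
    same_chrom = bp1.chrom == bp2.chrom
    same_strand = bp1.strand == bp2.strand
    close = abs(bp1.pos - bp2.pos) <= CIS_NEAR_MAX_DISTANCE
    if same_chrom and same_strand and close:
        return "cis-near"
    return "trans-like"
```

Under these assumptions, two breakpoints on the same chromosome and strand that lie 500 kb apart are classified as cis-near, while the same pair on opposite strands or on different chromosomes is classified as trans-like.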
In order to improve prediction of trans-like GFs (associated with tumor-specificity), we developed the EasyFuse pipeline tailored towards best computational performance, sensitivity and precision (Figure 3a and Supplementary Figure 4a). An initial filtering step retains only discordant read pairs (>200 kb), split reads and unmapped reads leading to >90% reduction of total reads, a 10-fold improved runtime for all five prediction tools and reduced maximal memory consumption from up to 90 GB down to 30 GB (Figure 3b and c and Supplementary Figure 4b). With read filtering, the overall number of initial predicted cis-near and trans-like GFs is greatly reduced to 15% and 22% respectively (Figure 3d). Removed predictions were largely single-tool predictions, which led to an overall increase in tool concordance from 8% to 23%. Importantly, very high sensitivity (97%) towards qRT-PCR confirmed trans-like GFs is maintained (Figure 3e). Furthermore, after read filtering, 499 additional trans-like and only 80 additional cis-near GFs were identified (Figure 3d). These were mainly predicted by SOAPfuse and STAR-Fusion and are characterized by a lower number of junction reads (Supplementary Figure 4c and d). By testing 77 of these additional GFs, we confirmed 17 (22%) as true positive GFs (Supplementary Figure 4e). Although the validation success was relatively low, it nonetheless indicates that the read filtering step further increases sensitivity for trans-like GFs (Figure 3e). When considering all validated GFs (although not randomly selected and therefore potentially biased), SOAPfuse offers superior sensitivity towards confirmed cis-near GFs, while EasyFuse has superior sensitivity towards trans-like GFs (Supplementary Figure 4f). This shift of sensitivity from cis-near towards trans-like is in line with 86% reduction of predicted GFs also found in normal breast tissue, indicative of higher tumor-specificity (Supplementary Figure 4g).
Within EasyFuse, we provide a uniform quantification of GF supporting reads and aimed to provide best separation between qRT-PCR-positive and -negative GFs. We observed a significant difference between negative and positive fusion genes, but differences are low (Supplementary Figure 5a-c). Therefore, any cutoff to enrich for true positive GFs would also impede sensitivity dramatically. Of note, we observed a high correlation between quantification of filtered and unfiltered read data, indicating that filtering maintains junction and spanning reads (Supplementary Figure 5d and e). To further evaluate the sensitivity of EasyFuse with read filtering, we analyzed 2,425 simulated trans-like GFs from a previously published benchmark11. Contrary to the original publication, we required prediction of the exact breakpoint. For both 50 bp and 101 bp paired-end sequencing data, we also observed improved sensitivity in the simulated data after read filtering (Supplementary Figure 6a). In comparison to the qRT-PCR-confirmed GFs, simulated GFs were more consistently predicted by multiple tools and had higher read support (Supplementary Figure 6b and c). Together, read filtering and independent quantification of GF supporting reads (1) markedly decrease runtime of fusion prediction tools, (2) reduce the number of predicted cis-near GFs that are often not tumor-specific, (3) further increase sensitivity for trans-like GFs, and (4) allow for consistent quantification of supporting reads.
Machine learning enables specific prediction in FFPE samples
Next, we wanted to confirm and improve the performance of EasyFuse in clinically relevant FFPE samples. We sequenced technical replicates of 14 samples from primary and metastatic tumors, which represent diverse cancer entities and had varying tumor content (20-90%). We predicted a median of 205.5 GF candidates per sample (Figure 4a). For sensitive validation, we tested by qRT-PCR on FFPE and matching FF samples derived from the same resection. Initially, we validated 853 GFs, of which 535 were identified in one and 318 in both sequencing replicates (Figure 4b and c and Supplementary Table 4). The total number of predicted GFs and the positive validation rate did not correlate with tumor type or content, suggesting that EasyFuse provides comparable performance across a wide range of sample types. From GFs concordantly predicted in both sequencing replicates, 79% were positive while only 32% predicted in one replicate were positive (Figure 4c).
To further boost sensitivity and precision, we built a training data set with confirmation data for 890 GF calls in sequencing replicates of 11 samples (Figure 4b and Supplementary Figure 7a and b) to train a random forest classifier on four different subsets of features (Figure 4d and Supplementary Table 5) and optimize hyper-parameters (Supplementary Figure 7c and d). We found the features breakpoint configuration (type), match with known exon boundaries (exon_boundary) and the spanning read pair quantification by EasyFuse (ft_span) to be most predictive (Figure 4d). The prediction scores were markedly higher for fusion candidates detected in both replicates (Supplementary Figure 7e). We assessed the performance in a test data set with validation data for 281 GF calls in replicates of three samples. Both "full" feature sets, with and without information on identification in sequencing replicates, performed equally well (AUC 0.92 vs. 0.91, Figure 4e and f). Two models with smaller feature sets, which operate independently of the prediction tools and provide more flexibility, achieved slightly lower performance (AUC 0.88 and 0.86). For best performance and versatility, the model using the full feature set without replicate information ("EF_fuN") was chosen for further analysis.
Next, we assessed performance of EasyFuse in comparison with the tools Arriba, FusionCatcher, InFusion, MapSplice2, SOAPfuse, and STAR-Fusion on the set of simulated GFs. Here, high sensitivities were observed for multiple tools, whereby Arriba had the highest sensitivity (0.77 for 50-bp-reads and 0.91 for 101-bp-reads) and EasyFuse ranked second (0.76 for 50-bp-reads and 0.87 for 101-bp-reads, Supplementary Figure 8a). Of note, the model in EasyFuse was trained only with 50 bp reads, but also performed well for 101 bp reads, indicating robust generalization.
To benchmark the tool performance on clinically relevant FFPE samples, we ran the tools on the original unfiltered sequencing data of the three test samples (Supplementary Figure 8b)11,22–26.
However, as we initially considered only candidates from EasyFuse, GFs predicted by the other tools were strongly underrepresented in the confirmation data. We therefore separated prediction data into concordance bins (according to the number of detecting tools) and validated 122 additional GFs from underrepresented concordance bins (Supplementary Figure 8c and d and Supplementary Table 4). Using the extended confirmation data, the overall sensitivity and precision (PPV) were calculated according to weighted bin-specific values (Methods, Supplementary Figure 8e)27. Considering all predicted GFs, sensitivity is relatively low across all tools with values ranging from 0.02 for Arriba to 0.40 for EasyFuse (Figure 4g). EasyFuse achieved an overall precision of 0.72, while the other tools ranged from 0.26 for Arriba to 0.64 for STAR-Fusion. For trans-like GFs, which we consider most relevant, EasyFuse even more clearly outperforms the other tools by achieving a sensitivity of 0.43 and a precision of 0.71, while most other tools showed a clear drop in performance (Figure 4h). Also when confirmation data was not weighted by concordance bins or when requiring confirmed GFs to be detected in each individual sequencing replicate, EasyFuse consistently outperformed other tools in prediction performance for trans-like GFs (Supplementary Figure 8f and g and Supplementary Figure 9). Taken together, read filtering, re-quantification and the machine learning model enable EasyFuse to achieve relatively high sensitivity and specificity for the prediction of GFs from clinically relevant FFPE samples.
Predicted GFs elicit spontaneous T-cell responses
We used EasyFuse to predict GFs in a cohort of 14 FFPE melanoma samples and selected encoded neo-antigens to test their immune recognition by autologous T-cells. A median number of 46 GFs was predicted per sample (Figure 5a). We filtered the predicted data set for neo-antigen candidates by removing non-coding GFs and likely germline events (Supplementary Table 6). For the remaining targets, we prioritized HLA-haplotype matched HLA class I and class II neo-epitopes. A median of nine neo-antigen candidates (defined as filtered GF with at least one predicted epitope) was predicted per sample (Figure 5b and Supplementary Figure 10a).
In total, 30 predicted fusion neo-antigen candidates across all patients were tested for CD4+ or CD8+ T-cell reactivity by IFN-γ ELISPOT assay after in vitro stimulation of patient-derived peripheral blood mononuclear cells (PBMCs). Of 21 fusion peptides, evaluable for class II immune response, 10 (48%) elicited a positive CD4+ T-cell response. Of 30 reliable CD8+ assays, one (3%) showed positive CD8+ T-cell responses (Figure 5c, d and Supplementary Table 7 and 8). One response (PPP1R12C-CNN2) was directed against a wild-type peptide, while all remaining ones were directed against novel peptide sequences, either across the breakpoint or in an out-of-frame translated peptide sequence. ZNF417-TSPAN11 was positive for both CD8+ and CD4+ T-cell reactivity, with responses directed against two overlapping but distinct peptides across the breakpoint (Figure 5e). When assessing expression level of GFs from the number of supporting reads, we did not find a correlation with positive immune responses (Supplementary Figure 10b and c). However, all identified T-cell responses were directed against neo-epitopes with predicted binding affinities below 500 nM (Figure 5f). When considering only those: 1/24 GFs (4%) showed a CD8+ T-cell response and 10/16 GFs (63%) a CD4+ T-cell response. For four GF peptides with strong reactivity in patient-derived T-cells, we tested T-cell recognition and stimulation in PBMCs from three unrelated healthy donors. Two fusion peptides led to CD4+ T-cell responses in two healthy donors, and another fusion peptide elicited CD8+ T-cell response in all three donors (Supplementary Figure 10d). This data indicates that neo-epitope-specific T-cell responses can be detected in the T-cell repertoire of healthy donors, which might be attributed to pre-established, cross-reactive memory T-cells10,28.
In order to determine the frequency and relevance of GFs as a class of neo-epitopes, we used EasyFuse to predict GFs across a larger cohort of 57 FF breast cancer samples, including the 14 samples described above. Here, we predicted a median of 12 fusion neo-antigens (57 GFs) per sample (Figure 5g and Supplementary Figure 10e). The vast majority (95%) of neo-antigens were detected only in individual samples, indicating that in this breast cancer cohort, immunogenic neo-epitopes from GFs are not linked with recurrence.
Discussion
Current GF prediction tools lack reliability and sufficient validation, which hinders their application especially for prediction in individual samples11,20,29. A recent pan-cancer study used prediction tools validated on only 28 previously confirmed GFs and a recent benchmark paper used consensus prediction as ground truth without any further verification data1,11. While using consensus predictions is an unbiased approach, it cannot reflect positive single-tool predictions. Considering that these make up the majority of all true positive events in our analysis, this is a major shortcoming of such evaluation approaches. Furthermore, read-through GFs that are often not restricted to the tumor tissue are also more consistently and recurrently predicted across tools and samples and are, therefore, likely overestimated.
Here, we provide extensive data for performance benchmarking with six other prediction tools and demonstrate poor sensitivity, especially towards trans-like GFs that we consider most relevant. The low sensitivity compared to previous benchmarks is on the one hand due to stringent demands such as exact breakpoint identification, which is required for neo-epitope prediction, and on the other hand due to the relatively low numbers of supporting reads for most true positive GFs in our data set. The majority are individual, randomly occurring, likely non-functional GFs, and their expression level is likely lower compared to recurrent oncogenic driver GFs. Their detection in heterogeneous real patient samples is therefore more challenging, which is, if at all, only partly reflected in current benchmarking papers using simulated reads or RNA spike-ins11,30. Nonetheless, our immune data highlights their importance for immunotherapy approaches and thus the need for more sensitive detection tools. Although EasyFuse is significantly more sensitive in predicting trans-like GFs, its sensitivity is still limited to 43%. Since supporting reads are conceptually the best evidence of true positive GFs, improvements to the alignment process might help to further increase overall performance. This is also reflected by the high number of GFs only predicted by SOAPfuse, which uses its own internal aligner.
A number of previous reports have described the immunogenicity of recurrent GFs such as BCR-ABL, ETV6-AML1 and DEK-CAN in leukemia, as well as SYT-SSX and PAX-FKHR in sarcomas31–35. Thus, recurrent GFs of oncogenic driver genes are considered as targets of particular relevance for immunotherapy36. However, here we show that the vast majority of potentially immunogenic GFs in a breast cancer cohort are non-recurring, unique events. The detected immunogenicity rate of 48% for GFs is significantly higher than the 19% rate of spontaneous immune responses against point mutations that we had previously described37. Compared to point mutations, GFs might have two main advantages: (1) out-of-frame GFs have increased chance (by size) of encoding multiple neo-epitopes and (2) GF neo-epitopes may provide a higher dissimilarity to self-antigens potentially improving their immunogenicity38,39. The vast majority of these immune responses are CD4+ T-cell responses, which is in line with previous reports for neo-epitopes derived from somatic point mutations. As reported previously for GFs10 and point mutations40, we observed immune responses in the T-cell repertoire of healthy donors, indicating that pre-established, cross-reactive memory T-cells can recognize tumor neo-antigens28,41,42. The higher dissimilarity of especially out-of-frame GFs to self-antigens, might stimulate a more diverse TCR repertoire and, thereby, increase the chance to activate pre-established, cross-reactive memory T-cells43.
Furthermore, our data indicates that individual GFs could be a more abundant source for tumor-specific neo-antigens, with a median of 57 GFs compared to previous estimates of 3 or 4.2 per breast cancer sample1,44.
Our data suggest that GFs can provide a rich source of neo-epitopes of particular relevance for patients with a low burden of small mutations that would otherwise not qualify for personalized immunotherapy approaches. Moreover, EasyFuse describes more accurately the landscape of individual GFs and can therefore become a crucial tool for cancer research. The presented data on verified GFs is also valuable as a resource for further design and optimization of prediction tools. Finally, EasyFuse can enable individual therapy decisions for targetable GFs as well as for personalized immunotherapy approaches.
Methods
RNA sequencing from FFPE samples
Total RNA was extracted from FFPE sample material using either AmpTec ExpressArt FFPE Clear RNAready kit or on the Maxwell RSC instrument using the Maxwell RSC RNA FFPE kit. The RNA samples were fragmented, primed and reverse-transcribed using NEBNext® RNA First Strand Synthesis Module and NEBNext® Ultra™ Directional RNA Second Strand Synthesis Module. The cDNA was end-repaired and adenylated. The appropriate NEXTflex single index-adapter from BioScientific Corp to distinguish different samples was then ligated followed by a pre-amplification. All steps of end repair, A-tailing, ligation and pre-amplification were done using Roche/KAPA Biosystems KAPA Hyper Prep Kit. After library preparation, target regions were captured by hybridization of biotinylated RNA-baits (library probes) using Agilent SureSelect XT Target Enrichment Kit and Agilent SureSelect XT Human All Exon V6. The formed DNA-RNA hybrids were then captured using streptavidin-coated magnetic beads to bind the biotinylated RNA-baits. This enriched fraction was subsequently amplified via PCR using Roche/KAPA Biosystems KAPA Library Amplification kit with Primer Mix. Sequencing libraries were further checked for quantity and quality using Qubit 3 fluorometer and Agilent Bioanalyzer 2100. From each extracted FFPE RNA, two sequencing libraries were prepared as technical replicates. The libraries were sequenced in paired-end mode (2 x 50 nt) on a NovaSeq6000 S2 flow cell resulting in ~75 million distinct sequencing reads per library. Demultiplexing was performed with bcl2fastq v2.20.0.422.
RNA sequencing from FF samples
Total RNA was extracted from FF sample material using Qiagen's RNeasy tissue mini kit. Sequencing libraries were generated using Illumina's TruSeq stranded mRNA kit: mRNA molecules containing poly-A were purified using magnetic beads bound to poly-T-oligo. After purification, the mRNA was fragmented into small pieces. RNA fragments were copied into first strand cDNA using reverse transcriptase and random primers, followed by second strand cDNA synthesis using DNA Polymerase I and RNase H. These cDNA fragments then had the addition of a single 'A' base and subsequent ligation of the adapter. The products were then purified and PCR-enriched to create the final cDNA library. Sequencing libraries were further checked for quantity and quality using Qubit fluorometer and Agilent Bioanalyzer 2100. The libraries were sequenced 2-plex in paired-end mode (2 x 50 nt) on a HiSeq2500 HS V3 flow cell resulting in ~150-200 million distinct sequencing reads per library. Demultiplexing was performed with bcl2fastq v2.20.0.422.
Computational GF prediction from RNA-seq samples
Sequencing reads were first quality-trimmed using a combination of FastQC (0.11.8, using parameters "--nogroup -extract") and Skewer (v0.2.2, with "-m 0.75"). The quality-trimmed reads were then aligned against the human reference genome hg38 and Ensembl GRCh38.86 gene models using STAR1 with the following parameters: "--chimSegmentMin 10 --chimJunctionOverhangMin 10 --alignSJDBoverhangMin 10 --alignMatesGapMax {4} --alignIntronMax {4} --chimSegmentReadGapMax 3 --alignSJstitchMismatchNmax 5 -1 5 5 --seedSearchStartLmax 20 --winAnchorMultimapNmax 50 --outSAMtype BAM SortedByCoordinate --chimOutType Junctions SeparateSAMold --chimOutJunctionFormat 1". In a subsequent step, non-discordant reads were filtered out of the input data; read pairs mapped intra-chromosomally with a minimum distance of 200,000 base pairs, or mapped trans-chromosomally, were classified as discordant. FusionCatcher (1.0)2, STAR-Fusion (1.5.0)3 and SOAPfuse (1.2.7)4 were run with standard parameters and reference data if provided, including default candidate filtering steps according to internal exclusion lists. MapSplice2 (2.2.1)5 was run with the parameters "--qual-scale phred33 --bam --seglen 20 --min-map-len 40 -fusion". InFusion (0.8)6 was run with "--skip-finished --min-unique-alignment-rate 0 --min-unique-split-reads 0 --allow-non-coding".
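The read filtering rule can be illustrated with a short sketch. This is an assumption-laden illustration rather than the EasyFuse code: the `ReadPair` type and function names are hypothetical, the 200,000 bp threshold is treated as inclusive, and the retention of split and unmapped reads follows the description in the Results section.

```python
from dataclasses import dataclass
from typing import Optional

# Minimum intra-chromosomal mate distance for a discordant pair (from the text).
MIN_DISCORDANT_DISTANCE = 200_000


@dataclass(frozen=True)
class ReadPair:
    """Hypothetical, simplified view of one aligned read pair."""
    chrom1: Optional[str]  # None if mate 1 is unmapped
    pos1: Optional[int]
    chrom2: Optional[str]  # None if mate 2 is unmapped
    pos2: Optional[int]
    has_split_alignment: bool = False


def is_discordant(pair: ReadPair) -> bool:
    """Discordance test for a pair in which both mates are mapped."""
    if pair.chrom1 != pair.chrom2:
        return True  # trans-chromosomal pair
    # Assumption: the 200,000 bp threshold is inclusive.
    return abs(pair.pos1 - pair.pos2) >= MIN_DISCORDANT_DISTANCE


def keep_for_fusion_calling(pair: ReadPair) -> bool:
    """Keep unmapped reads, split reads and discordant pairs; drop the rest."""
    unmapped = pair.chrom1 is None or pair.chrom2 is None
    if unmapped or pair.has_split_alignment:
        return True
    return is_discordant(pair)
```

In a real pipeline these fields would be derived from the aligner's BAM output; the sketch only captures the decision logic.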
After running the detection tools, results were gathered into a single table and were annotated with a unique identifier (BPID) based on the breakpoint location in genomic coordinates and transcriptional sense of the affected genes. Additionally, a so-called "context sequence" was generated, consisting of 400 bp of pure exonic sequence downstream and upstream of the breakpoint. This context sequence was given a unique context_sequence_id calculated as the XXH64 hash value of the sequence itself. Finally, we aligned the filtered input reads against the context sequences using STAR1 with the parameters described above to quantify junction reads and spanning read pairs. Junction reads were defined as reads overlapping the breakpoint position in the context sequence by at least 10 bp. Spanning pairs were defined as read pairs that mapped concordantly with each mate to a different side of the breakpoint. These steps are implemented in EasyFuse version 1.3.0, which is the version used in this study.
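The definitions of junction reads and spanning pairs can be expressed as two predicates on alignments to the context sequence. The sketch below is illustrative only: function names are hypothetical, coordinates are assumed to be 0-based half-open intervals on the context sequence, and "overlapping the breakpoint by at least 10 bp" is interpreted as requiring at least 10 aligned bases on each side of the breakpoint.

```python
MIN_ANCHOR = 10  # minimum overlap across the breakpoint, from the text


def is_junction_read(start: int, end: int, breakpoint: int,
                     min_anchor: int = MIN_ANCHOR) -> bool:
    """True if the read [start, end) crosses the breakpoint with enough anchor.

    Assumption: the 10 bp requirement applies to both sides of the breakpoint.
    """
    left_anchor = breakpoint - start
    right_anchor = end - breakpoint
    return left_anchor >= min_anchor and right_anchor >= min_anchor


def is_spanning_pair(mate1: tuple, mate2: tuple, breakpoint: int) -> bool:
    """True if the two mates lie entirely on opposite sides of the breakpoint.

    Each mate is a (start, end) tuple in half-open coordinates.
    """
    (s1, e1), (s2, e2) = mate1, mate2
    mate1_left = e1 <= breakpoint and s2 >= breakpoint
    mate2_left = e2 <= breakpoint and s1 >= breakpoint
    return mate1_left or mate2_left
```

For a full-length context sequence with 400 bp on each side, the breakpoint would sit at position 400 in this coordinate system; a 50 bp read aligned at 380 to 430 would then count as a junction read, whereas one aligned at 395 to 445 would not.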
Selection of GF candidates for testing by qRT-PCR
Testing of GF candidates in the two cell lines (MCF7 and SKBR3) focused on verification of a set of previously published GFs. A small number of selected candidates across different concordance bins was also tested. Concordance bins were defined according to the number of predicting tools. The same approach was used to test GFs predicted in 14 FF breast tissue samples. This validation data was intended to set up the EasyFuse validation pipeline; the highly biased nature of the data made it unsuitable for unbiased training of a machine learning model.
In order to develop a machine learning model based on EasyFuse GF prediction we applied a different selection strategy for 14 matched FF/FFPE tumor samples to get a comprehensive training data set. Here, we validated a comparable number of randomly selected GFs that were detected in either both or only one sequencing replicate. These data are roughly balanced and well suited to train the machine learning model.
For performance comparison with other tools, we ran the tools Arriba (version 1.1.0), FusionCatcher, InFusion, MapSplice2, SOAPfuse, and STAR-Fusion on the original unfiltered sequencing data independently on both replicates of three test samples. We defined a fusion call as the identified breakpoint in a sample and assigned each fusion call to a unique concordance bin. We defined concordance as the number of combinations of tool and replicate in which a given fusion breakpoint pair was detected for each sample. Each combination of tools within a concordance bin contains a given number of calls, and we randomly selected a comparable number of calls from each such subset for performance comparison.
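The concordance definition (number of tool and replicate combinations in which a breakpoint pair was detected per sample) amounts to counting distinct pairs. The following is a minimal sketch under assumed input: the tuple layout `(sample, bpid, tool, replicate)` and the function name are hypothetical and do not reflect the actual EasyFuse data structures.

```python
from collections import defaultdict


def concordance_per_call(calls):
    """Count distinct (tool, replicate) combinations per (sample, breakpoint).

    `calls` is an iterable of (sample, bpid, tool, replicate) tuples; this
    layout is an assumption made for illustration. Duplicate records of the
    same tool and replicate are counted once.
    """
    seen = defaultdict(set)
    for sample, bpid, tool, replicate in calls:
        seen[(sample, bpid)].add((tool, replicate))
    return {key: len(combos) for key, combos in seen.items()}


example_calls = [
    ("S1", "bpA", "Arriba", 1),
    ("S1", "bpA", "Arriba", 2),
    ("S1", "bpA", "STAR-Fusion", 1),
    ("S1", "bpA", "Arriba", 1),  # duplicate record, counted once
    ("S1", "bpB", "SOAPfuse", 2),
]
bins = concordance_per_call(example_calls)
```

With six tools and two replicates, concordance values under this definition range from 1 to 12 per fusion call.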
Primer Design for GF testing
Primer design was done using a custom R script: Context sequences were used for primer design in PrimerBlat7, which uses the Primer38,9 software with the following restrictions: (1) no primer should align within 20 bp up- and downstream of the breakpoint, (2) amplicons must span the breakpoint and (3) amplicon size should not exceed 150 bp. PrimerBlat employs a DNA fasta file converted into 2bit-format to test for specificity. The output of PrimerBlat was passed to NCBI Primer-BLAST via a web request. Primer pairs with alignments to off-target loci were removed and remaining primer pairs were returned for each input site. Only one primer pair was reported per context sequence. If no technically suitable primer pair was identified for a given context sequence, the candidate was excluded from testing. Primers were ordered as "Custom DNA Oligos" at Eurofins Genomics in 0.01 μmol synthesis scale and salt-free purification conditions. All oligos used in the study are available in Supplementary Table 2, 3 and 4.
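The three geometric restrictions on primer pairs can be checked with a small helper. This is a sketch under stated assumptions, not the authors' R script: the function name is hypothetical, coordinates are 0-based half-open intervals on the context sequence, and the forward primer is assumed to lie upstream of the reverse primer.

```python
EXCLUSION_ZONE = 20   # bp on each side of the breakpoint free of primers
MAX_AMPLICON = 150    # maximum amplicon size in bp


def primer_pair_is_valid(fwd: tuple, rev: tuple, breakpoint: int,
                         exclusion: int = EXCLUSION_ZONE,
                         max_amplicon: int = MAX_AMPLICON) -> bool:
    """Check the three restrictions for a (start, end) primer pair.

    (1) no primer within `exclusion` bp of the breakpoint,
    (2) the amplicon spans the breakpoint,
    (3) the amplicon is not longer than `max_amplicon`.
    """
    fwd_start, fwd_end = fwd
    rev_start, rev_end = rev
    outside_zone = (fwd_end <= breakpoint - exclusion
                    and rev_start >= breakpoint + exclusion)
    spans_breakpoint = fwd_start < breakpoint < rev_end
    amplicon_size = rev_end - fwd_start
    return outside_zone and spans_breakpoint and amplicon_size <= max_amplicon
```

Specificity testing against the genome, which the text assigns to PrimerBlat and NCBI Primer-BLAST, is outside the scope of this sketch.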
Quantitative Real-Time PCR
Quantitative Real-Time PCR (qRT-PCR) analysis was performed on Applied Biosystems 7300 Real-Time PCR System using Applied Biosystems Sequence Detection Software Version 1.4 and QIAGEN QuantiTect SYBR Green PCR Kit. Total RNA was extracted from the sample material using QIAGEN RNeasy Mini Kit. Template cDNA was created via reverse transcription reaction from 1 μg total RNA using TAKARA PrimeScript RT Reagent Kit with gDNA Eraser and the supplied random hexamer/oligo dT primer mix following the manufacturer's manual. Final volume after 1:3 dilution with water was 60 μl. Primers, designed to amplify estimated fusion events in the qRT-PCR, were used at a final concentration of 0.333 μM each. 40 cycles of three-step qRT-PCR were run on 2.5 μl cDNA per reaction in a final volume of 30 μl at an annealing temperature of 60°C. To discriminate true positive signals from unspecific background, e.g. partial binding to wild type transcripts, a reliability cut-off was set at Ct 35. Negative controls, replacing cDNA template with the equivalent volume of water, were run in parallel to examine primer dimers. Results were analyzed with 7300 System Sequence Detection Software 1.4 (Applied Biosystems).
Capillary gel electrophoresis
After qRT-PCR, the PCR products were analyzed for expected sizes and potential side products using QIAGEN QIAxcel capillary gel electrophoresis instrument with QIAGEN DNA Screening Kit. Amplicon sizes were determined using QIAGEN QX Alignment Marker 15 bp/500 bp. Results were analyzed using the BioCalculator Software 3.2 (QIAGEN).
Sanger sequencing
For verification of fusion transcript amplicons, selected candidates were sent to Eurofins Genomics for Sanger sequencing. The PCR reactions were cleaned up using either the QIAGEN MinElute PCR Purification Kit or the Invitrogen PureLink Pro96 PCR Purification Kit. Sanger sequencing was performed in both directions using the corresponding forward and reverse primers.
Evaluation of qRT-PCR results
qRT-PCR and capillary gel electrophoresis results were evaluated using a custom R script. Based on those results, each primer pair was first evaluated as "positive", "negative" or "unclear" using a fixed set of rules. For a positive evaluation, an amplicon had to be detected in qRT-PCR below a cycle-threshold value of 35 (Ct < 35) in the respective tumor sample (either FF or FFPE for the matched cohort) and at least 5 cycles earlier than in the water and -RT negative control samples. Furthermore, the amplicon product size had to match the expected amplicon size (maximum difference of 15% in amplicon size).
Only results that were clearly positive (Ct value below threshold, clean negative controls and correct amplicon size) were assigned as "positive". If no amplification was observed, the result was assigned as "negative". Results with either very weak amplification above the threshold or amplification in negative controls were assigned as "unclear" and excluded from further consideration. Validation results for each primer pair were matched to the respective breakpoints. In cases where multiple transcript variants per breakpoint were possible and multiple primer pairs were required for testing, the breakpoint was considered positive when at least one primer pair was evaluated as positive.
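The rule set can be summarized in a few lines of code. The Python sketch below is not the custom R script used in the study; the function names and the use of `None` for "no amplification" are assumptions. It also assumes that a well-amplified product of the wrong size is reported as "unclear", a case the rules above do not spell out.

```python
from typing import Optional, Sequence

CT_CUTOFF = 35.0           # reliability cut-off (Ct < 35)
MIN_DELTA_CT = 5.0         # controls must be at least 5 cycles later
MAX_SIZE_DEVIATION = 0.15  # maximum relative difference in amplicon size


def evaluate_primer_pair(tumor_ct: Optional[float],
                         control_cts: Sequence[Optional[float]],
                         observed_size: Optional[float],
                         expected_size: float) -> str:
    """Classify one primer pair; None means no amplification was observed."""
    if tumor_ct is None:
        return "negative"
    # Water and -RT controls must be empty or at least 5 cycles later.
    controls_clean = all(c is None or c - tumor_ct >= MIN_DELTA_CT
                         for c in control_cts)
    size_ok = (observed_size is not None and
               abs(observed_size - expected_size)
               <= MAX_SIZE_DEVIATION * expected_size)
    if tumor_ct < CT_CUTOFF and controls_clean and size_ok:
        return "positive"
    # Assumption: everything else with some amplification is "unclear".
    return "unclear"


def breakpoint_is_positive(primer_pair_results: Sequence[str]) -> bool:
    """A breakpoint counts as positive if any of its primer pairs is positive."""
    return any(result == "positive" for result in primer_pair_results)
```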
Preparation of simulated GFs
Fastq files with simulated GFs and information on simulated breakpoints were obtained from: https://data.broadinstitute.org/Trinity/CTAT_FUSIONTRANS_BENCHMARKING/on_simulated_data/. To enable exact breakpoint matching, the provided positions were converted from the hg19 annotation to hg38 using the UCSC liftOver tool. The 2425 breakpoint pairs that were unambiguously lifted over were used as the list of true-positive GFs.
Training and evaluation of machine learning model
In order to train and validate the prediction model on independent data, the cohort with matched FF/FFPE data from 14 samples from 11 patients was randomly split by patient into a training and a test data set. The training data set consisted of 11 samples from 8 individual patients with positive or negative confirmation data for 890 candidate fusion breakpoint calls from each replicate. The test data set consisted of 3 samples from 3 independent patients with confirmation data for 281 fusion breakpoint calls from each replicate. The training data were used to optimize hyper-parameters and train the prediction models, while the test data were only used to assess the final performance and for benchmarking against other tools.
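Splitting by patient rather than by sample keeps all samples from one patient on the same side of the split. A minimal Python sketch of such a grouped split follows; the function name, the sample and patient identifiers and the seed are hypothetical, and the study's actual split was not necessarily produced this way.

```python
import random
from collections import defaultdict


def split_by_patient(sample_to_patient, n_test_patients, seed=0):
    """Randomly assign whole patients to the test set; return (train, test) samples."""
    by_patient = defaultdict(list)
    for sample, patient in sample_to_patient.items():
        by_patient[patient].append(sample)
    patients = sorted(by_patient)  # sort for reproducibility
    rng = random.Random(seed)
    test_patients = set(rng.sample(patients, n_test_patients))
    train = sorted(s for p in patients if p not in test_patients
                   for s in by_patient[p])
    test = sorted(s for p in test_patients for s in by_patient[p])
    return train, test


# Toy cohort: 14 samples from 11 patients (three patients contribute two samples).
mapping = {f"s{i}": f"p{i}" for i in range(11)}
mapping.update({"s11": "p0", "s12": "p1", "s13": "p2"})
train_samples, test_samples = split_by_patient(mapping, n_test_patients=3, seed=1)
```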
To predict whether a given candidate fusion transcript is validated as positive or negative, we trained Random Forest models using the R package randomForest (v.4.6-14) in R (v3.6.1)10. Thereby, we used as input features multiple annotations of candidate GFs including, for instance, the fraction of supporting junction and spanning reads.
Besides the input features, the Random Forest algorithm has internal parameters (hyper-parameters), which we optimized for each model separately using five times repeated 5-fold inner cross-validation within the training data. We selected the parameter combination with maximal auROC, allowing for "ntree" 250, 500 and 750, for "mtry" 3, 4, 5 and 7 and for "nodesize" 1, 5, 10, 20, 30, 40, 50 and 100 (Supplementary Figure 7a). For performance assessment of the final models and comparison to other fusion detection tools, models were trained on the entire training data and subsequently applied to the test data set. A GF was classified as positive if the prediction probability score was ≥ 0.5. For applying EasyFuse to other cohorts, we trained for each feature sub-set a final model with the same hyper-parameters on all FFPE samples.
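The search space and resampling scheme can be enumerated explicitly. The Python sketch below only illustrates the grid and the repeated 5-fold splits; model fitting was done with the R package randomForest and is not reproduced here. The names `parameter_grid`, `repeated_kfold` and `select_best` are hypothetical, and `score_fn` stands in for "mean auROC over all inner splits". Whether folds were additionally grouped by patient is not stated in the text, so the folds here are plain random partitions.

```python
import itertools
import random

GRID = {
    "ntree": [250, 500, 750],
    "mtry": [3, 4, 5, 7],
    "nodesize": [1, 5, 10, 20, 30, 40, 50, 100],
}


def parameter_grid(grid):
    """All combinations of the hyper-parameter values (3 x 4 x 8 = 96)."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*(grid[k] for k in keys))]


def repeated_kfold(n_items, k=5, repeats=5, seed=0):
    """Yield (train_indices, validation_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for held_out in range(k):
            train = [i for f, fold in enumerate(folds) if f != held_out
                     for i in fold]
            yield train, folds[held_out]


def select_best(grid, score_fn):
    """Return the parameter combination with the highest score."""
    return max(parameter_grid(grid), key=score_fn)
```

With 96 combinations and 25 splits each, an exhaustive search over this grid amounts to 2400 model fits per feature sub-set.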
Unbiased performance evaluation in tool benchmark
To compare the prediction performance between tools, we used calls from the tools Arriba (2.1.0), FusionCatcher (1.0), InFusion (0.8), MapSplice (2.2.1), SOAPfuse (1.2.7), STAR-Fusion (1.5.0) and EasyFuse (1.3.0) in individual replicates of the 3 test set samples. Fusions with breakpoints in intergenic regions or in the same gene (according to Ensembl GRCh38.86) were discarded. We define a call as a unique fusion breakpoint pair in a sample that was detected in at least one of the sequencing replicates. The concordance of a call is defined as the number of tools that detect the call. Although we sampled calls for validation experiments with the goal of distributing calls equally among tools and constrained among concordance bins, the final accuracy needs to be weighted according to the number of actual calls from all tools per concordance bin11. The sensitivity or true positive rate (TPR) was calculated as follows:

$$\mathrm{TPR} = \sum_{c} \frac{n_c}{N}\,\mathrm{TPR}_c$$
where $N$ is the total number of calls from all tools and $n_c$ the number of calls per concordance bin $c$ by all tools. $\mathrm{TPR}_c$ is the true positive rate calculated from all validated calls (from all tools) with concordance $c$. For the weighted calculation of the precision or positive predictive value (PPV), we weight the concordance bins by the number of calls detected by the specific tool under consideration. This is motivated by the fact that only detected calls are considered in the calculation of the PPV and therefore, tools are weighted according to the number of calls they have in each bin. Formally, the PPV for tool $t$ was calculated as

$$\mathrm{PPV}_t = \sum_{c} \frac{n_{ct}}{\sum_{c'} n_{c't}}\,\mathrm{PPV}_{ct}$$

where $n_{ct}$ is the number of calls per concordance bin $c$ that were detected by tool $t$, and $\mathrm{PPV}_{ct}$ denotes the precision among the validated calls of tool $t$ in bin $c$ (the per-bin symbol and the normalization by $\sum_{c'} n_{c't}$ are inferred from the weighting described above).
The F1 score was calculated as the harmonic mean of PPV and sensitivity.
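A short Python sketch of the concordance-weighted metrics follows. It assumes that both the weighted TPR and the weighted PPV are call-count-weighted means of per-bin rates, as the definitions above suggest; the function names and the representation of a call as a tuple are hypothetical and do not come from the EasyFuse code base.

```python
from collections import Counter


def concordance(calls_by_tool):
    """Map each call to the number of tools that detected it.

    `calls_by_tool` maps a tool name to an iterable of hashable calls,
    e.g. (sample, breakpoint1, breakpoint2) tuples.
    """
    counts = Counter()
    for calls in calls_by_tool.values():
        counts.update(set(calls))  # count each tool at most once per call
    return dict(counts)


def weighted_mean(calls_per_bin, rate_per_bin):
    """Weight a per-bin rate by the number of calls in each concordance bin."""
    total = sum(calls_per_bin.values())
    return sum(n / total * rate_per_bin[c] for c, n in calls_per_bin.items())


def f1_score(ppv, tpr):
    """Harmonic mean of precision and sensitivity."""
    return 0.0 if ppv + tpr == 0 else 2 * ppv * tpr / (ppv + tpr)
```

For the TPR, `calls_per_bin` would hold the calls of all tools per bin; for the PPV of one tool, it would hold only that tool's calls per bin.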
GF prediction and neo-antigen candidate selection for immunoscreening
To select neo-antigens for immunogenicity testing and to analyze neo-antigens in the breast cancer cohort, EasyFuse 1.3.0 was run for each sample's technical replicate and the model "EF_full" was applied. All fusion transcripts were annotated with the resulting neo-peptide sequence. Patient HLA class I and class II alleles were obtained from the corresponding RNA-seq data using seq2HLA v2.212. NetMHCpan 4.013 was used to predict MHC class I epitopes of 8, 9, 10 and 11 amino acids in length and NetMHCIIpan 4.014 to predict MHC class II epitopes of 15 amino acids in length. Only epitopes with a binding affinity score <500 nM for MHC class I or class II were considered. Furthermore, epitopes with a perfect sequence match to any protein in the human SwissProt database were disregarded.
Fusions for which a single partner gene or the gene pair was in a curated exclusion list of candidates reported to be detected in normal tissues were filtered out. Only those fusions positively predicted by the model EF_full that also lead to a valid open reading frame and have breakpoints matching exon boundaries of both genes were considered. In the immunogenicity cohort, 30 candidates were selected for immunogenicity testing.
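The epitope filters (peptide length, binding affinity below 500 nM, no perfect match to a self protein) can be sketched as follows. This Python example is a simplification: `affinity_nm` is a hypothetical dictionary of predicted affinities that would in practice be parsed from the predictor output per HLA allele, and `self_proteins` stands in for the human SwissProt sequences.

```python
def kmers(sequence, lengths):
    """Yield all substrings of the given lengths."""
    for k in lengths:
        for start in range(len(sequence) - k + 1):
            yield sequence[start:start + k]


def select_epitopes(neo_peptide, self_proteins, affinity_nm,
                    lengths=(8, 9, 10, 11), max_nm=500.0):
    """Return candidate epitopes that pass the affinity and self-match filters."""
    selected = []
    # dict.fromkeys removes duplicate peptides while keeping their order
    for peptide in dict.fromkeys(kmers(neo_peptide, lengths)):
        # discard peptides that occur verbatim in any self protein
        if any(peptide in protein for protein in self_proteins):
            continue
        affinity = affinity_nm.get(peptide)
        # keep only peptides with a predicted affinity below the threshold
        if affinity is not None and affinity < max_nm:
            selected.append(peptide)
    return selected
```

For MHC class II, the same function could be called with `lengths=(15,)`.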
Peptides
Synthetic 15-mer peptides with 11 amino acid overlaps, covering the predicted GF sequence adjacent to and across the GF breakpoint and referred to as an overlapping peptide pool (OLP), were used. All synthetic peptides were purchased from JPT Peptide Technologies GmbH and dissolved in DMSO to a final concentration of 3 mM.
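With 15-mers overlapping by 11 residues, consecutive peptides are shifted by 4 amino acids. A minimal Python sketch of how such a pool can be derived from a sequence is given below; the function name is hypothetical, and the handling of a trailing stretch that does not fill a complete window (simply not covered here) is an assumption.

```python
def overlapping_peptides(sequence, length=15, overlap=11):
    """Tile a sequence with fixed-length peptides sharing `overlap` residues."""
    step = length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than peptide length")
    return [sequence[start:start + length]
            for start in range(0, len(sequence) - length + 1, step)]


# Toy 27-residue sequence: yields four 15-mers starting at positions 0, 4, 8 and 12.
pool = overlapping_peptides("ACDEFGHIKLMNPQRSTVWYACDEFGH")
```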
In vitro stimulation of PBMCs
The patient PBMC material was collected during a multicentre phase I study NCT02035956 or as part of the RB_T002 research program (DRKS-ID: DRKS00011790). The studies were carried out in accordance with the Declaration of Helsinki and good clinical practice guidelines and with approval by the institutional review board or independent ethics committee of each participating site and the competent regulatory authorities. All patients provided written informed consent.
PBMCs for immunogenicity testing were isolated by Ficoll-Hypaque (Amersham Biosciences) density gradient centrifugation from peripheral blood or leukapheresis samples. For generation of immature DCs15 or fast DCs16, monocytes were purified using the MACS CD14 isolation kit (Pan Monocyte Isolation Kit human, Cat 130-096-537, Miltenyi Biotec, Bergisch Gladbach, Germany) and were subsequently cultured in six-well plates (0.5–1.5 × 10⁶ cells/ml) in fresh complete medium (CellGro GMP Serum-free Dendritic Cell Medium (DC), Cellgenix GmbH, Freiburg, Germany) supplemented with 1000 U/ml GM-CSF, 1000 U/ml IL-4 and 50 U/ml Penicillin-Streptomycin for 24 h. For the generation of fast DCs, an incubation with a combination of proinflammatory mediators (1000 U/ml TNF-α, 10 ng/ml IL-1β, 1000/ml IL-6, and 1 μM PGE2) followed. Incubation time was again 24 h.
Alternatively, monocytes were cultured for 4 days with GM-CSF (1000 U/ml), IL-4 (1000 U/ml) and Penicillin-Streptomycin (50 U/ml) (iDCs).
CD4+ and CD8+ T-cells were isolated from cryopreserved PBMCs using microbeads (Miltenyi Biotec). Individual IVS cultures were set up using OLPs covering the predicted GF. For this, CD4+ T-cells were expanded in the presence of fast DCs at an effector-to-target ratio of 10:1 (E:T = 10:1) and the respective target peptide pools. For the expansion of CD8+ T-cells, CD4-depleted PBMCs were co-cultured with purified CD8+ T-cells (E:T = 1:10) in the presence of IL-4 and GM-CSF (each 1000 U/mL) and the respective target peptide pool. One day after starting the IVS, fresh culture medium containing 10 U/ml IL-2 (Proleukin S, Novartis), 5 ng/mL IL-15 (Peprotech) and IL-4 and GM-CSF (each 1000 U/mL, Miltenyi Biotec) was added. Seven days after setting up the IVS cultures, IL-2 was replenished (10 U/mL).
After 11 days of stimulation, cells were used in ELISpot assays.
IFN-γ ELISpot
Multiscreen filter plates (Merck Millipore), pre-coated with antibodies specific for IFN-γ (Mabtech, 1:133), were washed with PBS and blocked with X-VIVO 15 (Lonza) containing 2% human serum albumin (CSL-Behring) for 1–4 hours. 0.5–1.0 × 10⁵ effector cells/well were stimulated for 16–20 h with autologous DCs loaded with the respective OLPs used for in vitro stimulation and the individual 15-mer peptides. All tests were performed in duplicate or triplicate and included a positive control (anti-CD3, Mabtech, 1:1000). Spots were visualized with a biotin-conjugated anti-IFN-γ antibody (Mabtech, 1:1000) followed by incubation with ExtrAvidin-Alkaline Phosphatase (Sigma-Aldrich) and BCIP/NBT substrate (Sigma-Aldrich). Plates were scanned using CTL's ImmunoSpot® Series S five Versa ELISpot Analyzer (S5Versa-02-9038) and analyzed by ImmunoCapture V6.3. Spot counts were summarized as mean values for each triplicate or duplicate. T-cell responses stimulated by peptides were compared to control peptide-loaded target cells. A response was defined as positive with a minimum of 25 spots per 5.0 × 10⁴ cells in the post-IVS setting as well as a spot count that was more than twice as high as the respective control.
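The positivity criterion combines an absolute and a relative threshold. The following Python sketch shows one way to apply it to replicate wells; the function name and arguments are hypothetical, and the linear normalization of spot counts to 5.0 × 10⁴ cells for wells seeded with a different cell number is an assumption.

```python
from statistics import mean


def is_positive_response(test_spots, control_spots, cells_per_well,
                         min_spots=25.0, reference_cells=5.0e4, fold=2.0):
    """Apply the ELISpot positivity criterion to replicate wells.

    Spot counts are averaged over replicates and scaled to `reference_cells`.
    """
    scale = reference_cells / cells_per_well
    test_mean = mean(test_spots) * scale
    control_mean = mean(control_spots) * scale
    # at least 25 spots per 5 x 10^4 cells AND more than twice the control
    return test_mean >= min_spots and test_mean > fold * control_mean
```

For example, triplicate counts of 60, 70 and 65 at 1.0 × 10⁵ cells per well correspond to 32.5 spots per 5.0 × 10⁴ cells, which is called positive if the matched control averages fewer than 16.25 spots on the same scale.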
Acknowledgement
We thank the patients of the multicentre phase I study NCT02035956 and the RB_T002 research program (DRKS-ID: DRKS00011790), from whom the analyzed samples were obtained, and we thank the involved study site teams for their support and collaboration. We thank Karen Chu and Claudia Büchner for proof-reading and assistance with the manuscript. We thank Oezlem Akilli-Oeztuerk for support with biosampling, Rüdiger Siek for support with installation of tools and setup on our servers and Christoph Ritzel for testing of the EasyFuse pipeline and Docker image. Furthermore, we thank Stefania Gangi Maurici for technical support with qRT-PCR analysis and Alina Henrich, Stefanie Burchard and Patricia Do Dinh for technical support with the RNA sequencing. This work was supported by a European Research Council (ERC) Advanced Grant to U.S. [ERC-AdG 789256].
Footnotes
Author contributions
U.S. conceptualized the work and strategy. D.W., J.I., M.S. and I.V. planned and analyzed experiments. K.S. and M.S. performed the experiments. D.W., J.I., P.S., C.H., U.L., B.S., F.L. and M.L. performed data analysis. D.W., J.I. and U.S. interpreted data and wrote the manuscript.
Competing interests statement
U.S. is a board member and employee at BioNTech SE (Mainz, Germany). U.L., K.S. and I.V. are employees at BioNTech SE. U.S. is CEO and stock owner of BioNTech SE. U.S., K.S. and I.V. have securities in BioNTech SE. The remaining authors declare no competing interests.
Data availability
Sequence data from this study have been deposited at the Sequence Read Archive (SRA accession: PRJNA607061 used in Figures 1, 2 and 3; NCBI BioProject ID: PRJNA764684 used in Figure 2) or the European Genome-phenome Archive (EGA accession: EGAS00001004877; used in Figures 4 and 5). Previously published sequencing data (immunogenicity cohort Figure 5, samples 10-14) are available at EGA (EGA accession: EGAD00001004455). Previously sequenced cell line data used in Figure 1 are available at SRA accession: PRJNA543964. Raw predicted GFs for all samples are available on figshare (https://figshare.com/s/f5c9c9a3b1b1d9860955).
Code availability
The source code and documentation of EasyFuse are available at GitHub (https://github.com/TRON-Bioinformatics/easyfuse).
References
- 1. Gao Q, et al. Driver Fusions and Their Implications in the Development and Treatment of Human Cancers. Cell reports. 2018;23:227–238.e3. doi: 10.1016/j.celrep.2018.03.050.
- 2. Shtivelman E, Lifshitz B, Gale RP, Canaani E. Fused transcript of abl and bcr genes in chronic myelogenous leukaemia. Nature. 1985;315:550–554. doi: 10.1038/315550a0.
- 3. Amatu A, Sartore-Bianchi A, Siena S. NTRK gene fusions as novel targets of cancer therapy across multiple tumour types. ESMO open. 2016;1:e000023. doi: 10.1136/esmoopen-2015-000023.
- 4. Sahin U, Tureci O. Personalized vaccines for cancer immunotherapy. Science (New York, N.Y.) 2018;359:1355–1360. doi: 10.1126/science.aar7112.
- 5. Carreno BM, et al. Cancer immunotherapy. A dendritic cell vaccine increases the breadth and diversity of melanoma neoantigen-specific T cells. Science (New York, N.Y.) 2015;348:803–808. doi: 10.1126/science.aaa3828.
- 6. Ott PA, et al. An immunogenic personal neoantigen vaccine for patients with melanoma. Nature. 2017;547:217–221. doi: 10.1038/nature22991.
- 7. Garraway LA, Lander ES. Lessons from the cancer genome. Cell. 2013;153:17–37. doi: 10.1016/j.cell.2013.03.002.
- 8. Chang MT, et al. Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. Nature biotechnology. 2016;34:155–163. doi: 10.1038/nbt.3391.
- 9. Bosch GJ, Joosten AM, Kessler JH, Melief CJ, Leeksma OC. Recognition of BCR-ABL positive leukemic blasts by human CD4+ T cells elicited by primary in vitro immunization with a BCR-ABL breakpoint peptide. Blood. 1996;88:3522–3527.
- 10. Yang W, et al. Immunogenic neoantigens derived from gene fusions stimulate T cell responses. Nature medicine. 2019;25:767–775. doi: 10.1038/s41591-019-0434-2.
- 11. Haas BJ, et al. Accuracy assessment of fusion transcript detection via read-mapping and de novo fusion transcript assembly-based methods. Genome biology. 2019;20:213. doi: 10.1186/s13059-019-1842-9.
- 12. Kosugi S, et al. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome biology. 2019;20:117. doi: 10.1186/s13059-019-1720-5.
- 13. Zhou JX, et al. Identification of KANSARL as the first cancer predisposition fusion gene specific to the population of European ancestry origin. Oncotarget. 2017;8:50594–50607. doi: 10.18632/oncotarget.16385.
- 14. Pintarelli G, et al. Read-through transcripts in normal human lung parenchyma are down-regulated in lung adenocarcinoma. Oncotarget. 2016;7:27889–27898. doi: 10.18632/oncotarget.8556.
- 15. Babiceanu M, et al. Recurrent chimeric fusion RNAs in non-cancer tissues and cells. Nucleic acids research. 2016;44:2859–2872. doi: 10.1093/nar/gkw032.
- 16. Sorn P, Hohlstrater C, Lower M, Sahin U, Weber D. ArtiFuse - Computational validation of fusion gene detection tools without relying on simulated reads. Bioinformatics (Oxford, England) 2019. doi: 10.1093/bioinformatics/btz613.
- 17. Asmann YW, et al. A novel bioinformatics pipeline for identification and characterization of fusion transcripts in breast cancer and normal cell lines. Nucleic acids research. 2011;39:e100. doi: 10.1093/nar/gkr362.
- 18. Edgren H, et al. Identification of fusion genes in breast cancer by paired-end RNA-sequencing. Genome biology. 2011;12:R6. doi: 10.1186/gb-2011-12-1-r6.
- 19. Kangaspeska S, et al. Reanalysis of RNA-sequencing data reveals several additional fusion genes with multiple isoforms. PloS one. 2012;7:e48745. doi: 10.1371/journal.pone.0048745.
- 20. Maher CA, et al. Transcriptome sequencing to detect gene fusions in cancer. Nature. 2009;458:97–101. doi: 10.1038/nature07638.
- 21. Sakarya O, et al. RNA-Seq mapping and detection of gene fusions with a suffix array algorithm. PLoS computational biology. 2012;8:e1002464. doi: 10.1371/journal.pcbi.1002464.
- 22. Nicorici D, et al. FusionCatcher - a tool for finding somatic fusion genes in paired-end RNA-sequencing data. 2014.
- 23. Wang K, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic acids research. 2010;38:e178. doi: 10.1093/nar/gkq622.
- 24. Jia W, et al. SOAPfuse. An algorithm for identifying fusion transcripts from paired-end RNA-Seq data. Genome biology. 2013;14:R12. doi: 10.1186/gb-2013-14-2-r12.
- 25. Okonechnikov K, et al. InFusion. Advancing Discovery of Fusion Genes and Chimeric Transcripts from Deep RNA-Sequencing Data. PloS one. 2016;11:e0167417. doi: 10.1371/journal.pone.0167417.
- 26. Uhrig S, et al. Accurate and efficient detection of gene fusions from RNA sequencing data. Genome research. 2021;31:448–460. doi: 10.1101/gr.257246.119.
- 27. Pan-cancer analysis of whole genomes. Nature. 2020;578:82–93. doi: 10.1038/s41586-020-1969-6.
- 28. Leng Q, Tarbe M, Long Q, Wang F. Pre-existing heterologous T-cell immunity and neoantigen immunogenicity. Clinical & translational immunology. 2020;9:e01111. doi: 10.1002/cti2.1111.
- 29. Heyer EE, et al. Diagnosis of fusion genes using targeted RNA sequencing. Nature communications. 2019;10:1388. doi: 10.1038/s41467-019-09374-9.
- 30. Creason A, et al. A community challenge to evaluate RNA-seq, fusion detection, and isoform quantification methods for cancer discovery. Cell systems. 2021. doi: 10.1016/j.cels.2021.05.021.
- 31. Buzyn A, et al. Peptides derived from the whole sequence of BCR-ABL bind to several class I molecules allowing specific induction of human cytotoxic T lymphocytes. European journal of immunology. 1997;27:2066–2072. doi: 10.1002/eji.1830270834.
- 32. Gambacorti-Passerini C, et al. Human CD4 lymphocytes specifically recognize a peptide representing the fusion region of the hybrid protein pml/RAR alpha present in acute promyelocytic leukemia cells. Blood. 1993;81:1369–1375.
- 33. Makita M, et al. Leukemia-associated fusion proteins, dek-can and bcr-abl, represent immunogenic HLA-DR-restricted epitopes recognized by fusion peptide-specific CD4+ T lymphocytes. Leukemia. 2002;16:2400–2407. doi: 10.1038/sj.leu.2402742.
- 34. Sato Y, et al. Detection and induction of CTLs specific for SYT-SSX-derived peptides in HLA-A24(+) patients with synovial sarcoma. Journal of immunology (Baltimore, Md.: 1950) 2002;169:1611–1618. doi: 10.4049/jimmunol.169.3.1611.
- 35. van den Broeke LT, Pendleton CD, Mackall C, Helman LJ, Berzofsky JA. Identification and epitope enhancement of a PAX-FKHR fusion protein breakpoint epitope in alveolar rhabdomyosarcoma cells created by a tumorigenic chromosomal translocation inducing CTL capable of lysing human tumors. Cancer research. 2006;66:1818–1823. doi: 10.1158/0008-5472.CAN-05-549.
- 36. Yang W, et al. Immunogenic neoantigens derived from gene fusions stimulate T cell responses. Nature medicine. 2019;25:767–775. doi: 10.1038/s41591-019-0434-2.
- 37. Sahin U, et al. Personalized RNA mutanome vaccines mobilize poly-specific therapeutic immunity against cancer. Nature. 2017;547:222–226. doi: 10.1038/nature23003.
- 38. Richman LP, Vonderheide RH, Rech AJ. Neoantigen Dissimilarity to the Self-Proteome Predicts Immunogenicity and Response to Immune Checkpoint Blockade. Cell systems. 2019;9:375–382.e4. doi: 10.1016/j.cels.2019.08.009.
- 39. Bjerregaard A-M, et al. An Analysis of Natural T Cell Responses to Predicted Tumor Neoepitopes. Frontiers in immunology. 2017;8:1566. doi: 10.3389/fimmu.2017.01566.
- 40. Strønen E, et al. Targeting of cancer neoantigens with donor-derived T cell receptor repertoires. Science (New York, N.Y.) 2016;352:1337–1341. doi: 10.1126/science.aaf2288.
- 41. Balachandran VP, et al. Identification of unique neoantigen qualities in long-term survivors of pancreatic cancer. Nature. 2017;551:512–516. doi: 10.1038/nature24462.
- 42. Bessell CA, et al. Commensal bacteria stimulate antitumor responses via T cell cross-reactivity. JCI insight. 2020;5. doi: 10.1172/jci.insight.135597.
- 43. Nelson RW, et al. T cell receptor cross-reactivity between similar foreign and self peptides influences naive cell population size and autoimmunity. Immunity. 2015;42:95–107. doi: 10.1016/j.immuni.2014.12.022.
- 44. Robinson DR, et al. Functionally recurrent rearrangements of the MAST kinase and Notch gene families in breast cancer. Nature medicine. 2011;17:1646–1651. doi: 10.1038/nm.2580.
- 1. Dobin A, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England) 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
- 2. Nicorici D, et al. FusionCatcher - a tool for finding somatic fusion genes in paired-end RNA-sequencing data. 2014.
- 3. Haas BJ, et al. STAR-Fusion: Fast and Accurate Fusion Transcript Detection from RNA-Seq. bioRxiv. 2017.
- 4. Jia W, et al. SOAPfuse. An algorithm for identifying fusion transcripts from paired-end RNA-Seq data. Genome biology. 2013;14:R12. doi: 10.1186/gb-2013-14-2-r12.
- 5. Wang K, et al. MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic acids research. 2010;38:e178. doi: 10.1093/nar/gkq622.
- 6. Okonechnikov K, et al. InFusion. Advancing Discovery of Fusion Genes and Chimeric Transcripts from Deep RNA-Sequencing Data. PloS one. 2016;11:e0167417. doi: 10.1371/journal.pone.0167417.
- 7. Kent WJ. BLAT - the BLAST-like alignment tool. Genome research. 2002;12:656–664. doi: 10.1101/gr.229202.
- 8. Untergasser A, et al. Primer3--new capabilities and interfaces. Nucleic acids research. 2012;40:e115. doi: 10.1093/nar/gks596.
- 9. Koressaar T, Remm M. Enhancements and modifications of primer design program Primer3. Bioinformatics (Oxford, England) 2007;23:1289–1291. doi: 10.1093/bioinformatics/btm091.
- 10. Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2002;2:18–22.
- 11. Pan-cancer analysis of whole genomes. Nature. 2020;578:82–93. doi: 10.1038/s41586-020-1969-6.
- 12. Boegel S, et al. HLA typing from RNA-Seq sequence reads. Genome medicine. 2012;4:102. doi: 10.1186/gm403.
- 13. Jurtz V, et al. NetMHCpan-4.0. Improved Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data. Journal of immunology (Baltimore, Md.: 1950) 2017;199:3360–3368. doi: 10.4049/jimmunol.1700893.
- 14. Jensen KK, et al. Improved methods for predicting peptide binding affinity to MHC class II molecules. Immunology. 2018;154:394–406. doi: 10.1111/imm.12889.
- 15. Holtkamp S, et al. Modification of antigen-encoding RNA increases stability, translational efficacy, and T-cell stimulatory capacity of dendritic cells. Blood. 2006;108:4009–4017. doi: 10.1182/blood-2006-04-015024.
- 16. Dauer M, et al. Mature Dendritic Cells Derived from Human Monocytes Within 48 Hours. A Novel Strategy for Dendritic Cell Differentiation from Blood Precursors. The Journal of Immunology. 2003;170:4069–4076. doi: 10.4049/jimmunol.170.8.4069.