Abstract
Background
Transcriptomic structural variants (TSVs), large-scale transcriptome sequence changes due to structural variation, are common in cancer. TSV detection from high-throughput sequencing data is a computationally challenging problem. Among all the confounding factors, sample heterogeneity, where each sample contains multiple distinct alleles, poses a critical obstacle to accurate TSV prediction.
Results
To improve TSV detection in heterogeneous RNA-seq samples, we introduce the Multiple Compatible Arrangements Problem (MCAP), which seeks k genome arrangements that maximize the number of reads that are concordant with at least one arrangement. This models a heterogeneous or diploid sample. We prove that MCAP is NP-complete and provide a $\frac{1}{4}$-approximation algorithm for $k=1$ and a $\frac{3}{4}$-approximation algorithm for the diploid case ($k=2$) assuming an oracle for $k=1$. Combining these, we obtain a $\frac{3}{16}$-approximation algorithm for MCAP when $k=2$ (without an oracle). We also present an integer linear programming formulation for general k. We characterize the conflict structures in the graph that require $k>1$ alleles to satisfy read concordancy and show that such structures are prevalent.
Conclusions
We show that the solution to MCAP accurately addresses sample heterogeneity during TSV detection. Our algorithms have improved performance on TCGA cancer samples and cancer cell line samples compared to a TSV calling tool, SQUID. The software is available at https://github.com/Kingsford-Group/diploidsquid.
Keywords: Transcriptomic structural variation, Integer linear programming, Heterogeneity
Background
Transcriptomic structural variations (TSVs) are transcriptome sequence alterations due to genomic structural variants (SVs). TSVs may join parts of different genes, producing fusion-gene events. Fusion genes are known for their association with various types of cancer. For example, the joint protein products of BCR-ABL1 genes are prevalent in leukemia [1]. In addition to fusion genes, the joining of intergenic and genic regions, called non-fusion-gene events, is also related to cancer [2].
TSV events are best studied with RNA-seq data. Although SVs are more often studied with whole genome sequencing (WGS) [3–8], the models built on WGS data lack the flexibility to describe alternative splicing and differences in expression levels of transcripts affected by TSVs. In addition, RNA-seq data is far more common [9] than WGS data in some data cohorts, for example, in The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov).
Many methods have been proposed that identify fusion genes with RNA-seq data. Generally, these tools identify candidate TSV events by investigating read alignments that are discordant with the reference genome (e.g. [10–15]). A read alignment is concordant with a reference sequence if the alignment to the sequence agrees with the read library preparation. For example, in paired-end Illumina sequencing, the orientation of the forward read should be 5'-to-3' and the reverse for the mate read. Otherwise the alignment is discordant with the reference. A series of filtering or scoring functions is applied to each TSV candidate to eliminate errors in alignment or data preparation. The performance of the filters often relies heavily on a large set of method parameters and requires prior annotation [16]. Furthermore, most fusion-gene detection methods limit their scope to the joining of protein-coding regions and ignore the joining of intergenic regions that could also affect the transcriptome. An approach that correctly models both fusion-gene and non-fusion-gene events without a large number of ad hoc assumptions is desired.
An intuitive TSV model is the one that describes directly the rearrangement of the genome. For example, when an inversion happens, two double-strand breaks (DSB) are introduced to the genome and the segment between the DSBs is flipped. After a series of SVs are applied to a genome, a rearranged genome is produced. In order to identify the TSVs, we can attempt to infer the rearranged genome from the original genome and keep track of the arrangements of genome segments. Since a model of the complete genome is produced, both fusion-gene and non-fusion-gene events can be detected. A recently published TSV detection tool, SQUID [9], models TSV events in this way by determining a single rearrangement of a reference genome that can explain the maximum number of observed sequencing reads. SQUID finds one arrangement of genome segments such that a maximum number of reads are concordant with it. Novel transcriptomic adjacencies appearing in the arrangement are predicted as TSVs while the ones not appearing are regarded as sequencing or alignment errors.
Despite the generally good performance of SQUID, it relies on the assumption that the sample is homogeneous, i.e. the original genome contains only one allele that can be represented by a single rearranged string. This assumption is unrealistic in diploid (or higher-ploidy) organisms. When TSV events occur within the same regions on different alleles, read alignments may suggest multiple conflicting ways of placing a segment. Under the homogeneous assumption, conflicting TSV candidates are regarded as errors. The assumption therefore discards conflicting TSV candidates that would be compatible on separate alleles, which limits the discovery of true TSVs. Conflicting SV candidates are addressed in a few SV detection tools such as VariationHunter-CR [6]. However, VariationHunter-CR assumes a diploid genome, and its model is built for WGS data and cannot handle RNA-seq data.
We present an improved model of TSV events in heterogeneous contexts. We address the limitation of the homogeneous assumption by extending the assumption to k alleles. We introduce the multiple compatible arrangements problem (MCAP), which seeks, assuming the number of alleles k is known, an optimal set of k arrangements of segments such that the number of sequencing reads that are concordant with any of the arrangements is maximized. Each arrangement is a permutation and reorientation of all segments from the reference genome, representing the altered sequence of one allele. A connection between segments is predicted as a TSV if its supporting reads are discordant in the original genome but are concordant in any of the k arrangements; otherwise the connection either agrees with the reference genome or is considered an error. We show that MCAP is NP-complete. To address NP-completeness, we propose a $\frac{1}{4}$-approximation algorithm for the $k=1$ case and a $\frac{3}{4}$-approximation solution to the $k=2$ case using an oracle for $k=1$. Combining these, we obtain a $\frac{3}{16}$-approximation algorithm for MCAP when $k=2$ (without an oracle). We also present an integer linear programming (ILP) formulation that gives an optimal solution for general k.
We characterize the patterns of reads that result in conflicting TSV candidates under a single-allele assumption. We show that these patterns are prevalent in both cancer cell lines and TCGA samples, thereby further motivating the importance of SV detection approaches that directly model heterogeneity.
We apply our algorithms to 381 TCGA samples from 4 cancer types and show that many more TSVs can be identified under a diploid assumption compared to a haploid assumption. We also evaluate an exact ILP formulation under a diploid assumption (D-SQUID) on previously annotated cancer cell lines HCC1395 and HCC1954, identifying several previously known and novel TSVs. We also show that, in most of the TCGA samples, the performance of the approximation algorithm is very close to optimal and the worst case of $\frac{3}{16}$-approximation is rare.
The Genome Segment Graph (GSG)
A Genome Segment Graph, similar to a splice graph [17], encodes relationships between genomic segments and a set of reads. A segmentation S of the genome is a partition of the genome into disjoint intervals according to concordant and discordant paired-end alignments with respect to the reference genome. The genome partitioning, edge construction and edge filtering are done in the same way as in Ma et al. [9].
Definition 1
(Genome Segment Graph) A genome segment graph is a weighted, undirected graph $G=(V,E,w)$ derived from a segmentation S of the genome and a collection of reads. The vertex set, $V=\{s_h, s_t : s \in S\}$, includes a vertex for both endpoints, head (h) and tail (t), of each segment $s \in S$. The head of a segment is the end that is closer to the 5' end of the genome. The tail is the end that is closer to the 3' end. Pairs of reads that span more than one segment are represented by edges. There are four types of connections: head-head, head-tail, tail-head and tail-tail. Each edge $e=(u_i,v_j) \in E$, where $i,j \in \{h,t\}$, is undirected and connects endpoints of two segments. The weight $w(e)$ is the number of sequencing reads that support edge e.
We also define the weight of a subset of edges $E' \subseteq E$ as $w(E') = \sum_{e \in E'} w(e)$. (More details on the GSG are provided in Ma et al. [9].)
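A toy construction may help make the definition concrete. The sketch below is not the SQUID implementation: it assumes that each supporting read pair has already been reduced to a pair of segment endpoints, and the names `build_gsg` and `subset_weight` are hypothetical.

```python
from collections import Counter

ENDS = ("h", "t")  # head and tail endpoints of a segment


def build_gsg(segments, read_connections):
    """Return (vertices, weights) for a toy genome segment graph.

    segments: iterable of hashable segment ids.
    read_connections: iterable of ((u, a), (v, b)) pairs, one per supporting
        read pair, where u, v are segment ids and a, b are "h" or "t".
    """
    vertices = {(s, end) for s in segments for end in ENDS}
    weights = Counter()
    for x, y in read_connections:
        if x not in vertices or y not in vertices:
            raise ValueError(f"unknown endpoint in {(x, y)}")
        if x[0] == y[0]:
            raise ValueError("an edge must join two different segments")
        # a frozenset key makes the edge undirected: (x, y) and (y, x) coincide
        weights[frozenset((x, y))] += 1
    return vertices, weights


def subset_weight(weights, edges):
    """Weight of an edge subset: the sum of its edge weights."""
    return sum(weights[e] for e in edges)


segments = [1, 2, 3]
reads = ([((1, "t"), (2, "h"))] * 3 + [((2, "h"), (1, "t"))]
         + [((1, "t"), (3, "t"))] * 2)
vertices, weights = build_gsg(segments, reads)
```

Here the tail-head edge between segments 1 and 2 collects the reads reported in either direction, and the tail-tail edge between segments 1 and 3 would be a discordant connection with respect to the reference.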
Definition 2
(Permutation, orientation function and arrangement) A permutation $\pi$ is a function where $\pi(u)=i$, where i is the index of segment $u$ in an ordering of a set S of segments. We also define the orientation function $f$, with $f(u)=1$ if segment u should remain in the original orientation, or 0 if it should be inverted. An arrangement is a pair of permutation and orientation functions $(\pi, f)$.
If $\pi(u) < \pi(v)$, we say that segment u is closer to the 5' end of the rearranged genome than segment v. Each arrangement is a concatenation of segments from different chromosomes, which retrieves the sequences affected by inter- and intra-chromosomal TSV events. The arrangement of genome segments imitates the movements of genomic sequences by SVs. One crucial difference between an arrangement in GSG and sequence movements by SVs is that an arrangement in GSG only captures the movements that are relevant to transcriptome sequence alterations. Such alterations can either fuse two transcript sequences or incorporate previously non-transcribed sequences into transcripts as long as they are present in RNA-seq reads.
Definition 3
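(Concordant and discordant edges) Let e be an edge connecting segment u on end a and segment v on end b ($a,b \in \{h,t\}$). Given arrangement $(\pi, f)$, suppose $\pi(u) < \pi(v)$; edge e is concordant with respect to the arrangement if $f(u) = \mathbb{1}[a=t]$ and $f(v) = \mathbb{1}[b=h]$. Denote the concordance as $e \sim (\pi, f)$. Otherwise, e is discordant, denoted as $e \nsim (\pi, f)$.
Combining the permutation and orientation function, the edge concordance condition can be equivalently expressed as
$$\mathbb{1}[\pi(u) < \pi(v)] = 1 - \left| f(u) - \mathbb{1}[a=t] \right| = \left| f(v) - \mathbb{1}[b=t] \right|.$$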
Since edges are constructed based on segment connections indicated by read alignments, the concordance and discordance of edges are extensions from read alignments. A discordant edge represents a set of discordant read alignments. Examples of discordant edges with tail-tail and head-head connections are shown in Fig. 1a. Concordant edges, when connecting nodes that belong to the same chromosome, represent concordant alignments that are either continuous alignments or split-alignments due to alternative splicing. Due to alternative splicing, a node can be incident to multiple concordant edges given an arrangement. Edges that initially spanned two chromosomes but become concordant in an arrangement represent inter-chromosomal translocation events.
Segments connected by discordant edges can be arranged so that some of the discordant edges become concordant. See Fig. 1b,c for examples of arrangements that make tail-tail and head-head connections concordant.
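The concordance test can be written as a short predicate. This is a minimal sketch that assumes the reading of Definition 3 in which the earlier segment must expose its connected end on its right side and the later segment must expose its connected end on its left side. The tuple encoding of edges and the name `is_concordant` are illustrative choices, not part of SQUID.

```python
def is_concordant(edge, pi, f):
    """Concordance of one edge with one arrangement.

    edge: ((u, a), (v, b)) with segment ids u, v and ends a, b in {"h", "t"}.
    pi:   dict mapping segment id -> position in the arrangement.
    f:    dict mapping segment id -> 1 (original orientation) or 0 (inverted).
    """
    (u, a), (v, b) = edge
    if pi[u] > pi[v]:
        # relabel so that u is the segment that comes first
        (u, a), (v, b) = (v, b), (u, a)
    # earlier segment: the connected end is its tail unless it is inverted;
    # later segment: the connected end is its head unless it is inverted
    return f[u] == int(a == "t") and f[v] == int(b == "h")


tail_head = ((1, "t"), (2, "h"))
tail_tail = ((1, "t"), (2, "t"))
head_head = ((1, "h"), (2, "h"))

reference = ({1: 0, 2: 1}, {1: 1, 2: 1})   # original order and orientation
invert_2 = ({1: 0, 2: 1}, {1: 1, 2: 0})    # segment 2 flipped
invert_1 = ({1: 0, 2: 1}, {1: 0, 2: 1})    # segment 1 flipped
```

Under these assumptions the tail-tail edge becomes concordant once segment 2 is inverted, and the head-head edge once segment 1 is inverted, which is the kind of rearrangement the arrangement examples describe.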
Definition 4
(Conflicts among a set of edges) Given GSG $G=(V,E,w)$ and a subset of edges $E' \subseteq E$, the edges in set $E'$ are in conflict with each other if there is no single arrangement $(\pi, f)$ such that $e \sim (\pi, f)$ for all $e \in E'$. Otherwise, edges in set $E'$ are compatible with each other.
Definition 5
(Transcriptomic structural variant (TSV)) A TSV is a new adjacency in transcript sequences that cannot be explained by alternative splicing.
In GSG, the adjacencies in transcript sequences are represented by edges. New adjacencies that cannot be explained by alternative splicing belong to one of two categories: (1) edges that are discordant with respect to the original arrangement but concordant in the rearranged genome, and (2) edges that are concordant in both the original and the rearranged genomes but connect segments that are further apart than a user-specified distance, or that lie on different chromosomes. Edges in both categories are output as TSVs. Here, as in Ma et al. [9], edges in the second category are identified during a post-processing step in the implementation.
The Multiple Compatible Arrangements Problem (MCAP)
Problem statement
Given an input GSG $G=(V,E,w)$ and a positive integer k, the multiple compatible arrangements problem seeks a set of k arrangements $A=\{(\pi_i, f_i)\}_{i=1}^{k}$ that are able to generate the maximum number of sequencing reads:
$$\max_{A} \sum_{e \in E} w(e) \cdot \mathbb{1}[e \sim A], \tag{1}$$
where $\mathbb{1}[e \sim A]$ is 1 if edge e is concordant in at least one $(\pi_i, f_i) \in A$, and 0 otherwise.
This objective function aims to find an optimal set of k arrangements of segments where the sum of concordant edge weights is maximized in the arranged alleles, where k is the number of alleles and is assumed to be known. The objective seeks to maximize the agreement between arranged allelic sequences and observed RNA-seq data. Assuming that the majority of RNA-seq reads are sequenced correctly, the concordant edges with respect to the optimal set of arrangements represent the most confident transcriptomic adjacencies. In heterogeneous samples where $k > 1$, MCAP separates the conflicting edges onto k alleles as shown in an example in Fig. 1.
When $k=1$, the problem reduces to finding a single arranged genome to maximize the number of concordant reads, which is the problem that SQUID [9] solves. We refer to the special case when $k=1$ as the single compatible arrangement problem (SCAP).
Predicted TSVs are the concordant edges with respect to any of the arrangements in a solution to MCAP that were either discordant with respect to the reference genome or spanning multiple chromosomes.
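The objective can be evaluated directly for a candidate set of arrangements. The following sketch assumes the concordance rule of Definition 3 (repeated here so that the block is self-contained) and a hypothetical two-edge instance. It only scores given arrangements and does not search for optimal ones.

```python
def is_concordant(edge, pi, f):
    """Definition 3: the earlier segment connects by its right-hand end,
    the later segment by its left-hand end."""
    (u, a), (v, b) = edge
    if pi[u] > pi[v]:
        (u, a), (v, b) = (v, b), (u, a)
    return f[u] == int(a == "t") and f[v] == int(b == "h")


def mcap_objective(weights, arrangements):
    """Total weight of edges concordant in at least one arrangement.

    weights: dict mapping edge ((u, a), (v, b)) -> read count.
    arrangements: list of (pi, f) pairs, one per allele.
    An edge concordant in several arrangements is counted once.
    """
    return sum(w for edge, w in weights.items()
               if any(is_concordant(edge, pi, f) for pi, f in arrangements))


# Two conflicting edges between segments 1 and 2 (hypothetical weights).
weights = {((1, "t"), (2, "h")): 5, ((1, "t"), (2, "t")): 3}
reference = ({1: 0, 2: 1}, {1: 1, 2: 1})
inverted_2 = ({1: 0, 2: 1}, {1: 1, 2: 0})

one_allele = mcap_objective(weights, [reference])
two_alleles = mcap_objective(weights, [reference, inverted_2])
```

With a single arrangement only one of the two conflicting edges can contribute, whereas two arrangements place each edge on its own allele.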
NP-completeness of SCAP and MCAP
Theorem 1
SCAP is NP-complete.
Proof
We prove NP-completeness by reduction from the Fragment Orientation Problem (FOP), which was formulated and studied by Kececioglu et al. [18]. In FOP, for any pair of fragments, there is evidence for or against the two fragments having the same orientation. FOP asks for an assignment of fragment orientations that maximizes the agreement with this evidence. We rephrase the problem statement as follows.
Input: A set of fragments and a score function that satisfies the following two conditions:
Output: An orientation of fragments .
Objective: Maximize the sum of score according to the orientation,
Kececioglu et al. [18] defined two symmetric functions, same and opp, and used them to express the objective function in a more specific way:
where is defined as , and is defined as .
Given any FOP instance, a SCAP instance is constructed in polynomial time by constructing a segment for each fragment in $\mathcal{F}$ and assigning edge weights based on the same and opp function values. Specifically, for fragment $F_u \in \mathcal{F}$, construct a segment $u \in S$. For any pair of segments $u, v$, construct four edges with the following weights: $w(u_t, v_h) = \mathrm{same}(F_u, F_v)$, $w(u_h, v_t) = \mathrm{same}(F_u, F_v)$, $w(u_h, v_h) = \mathrm{opp}(F_u, F_v)$, and $w(u_t, v_t) = \mathrm{opp}(F_u, F_v)$. Because of the one-to-one correspondence between segments S and fragments $\mathcal{F}$, the two can be viewed as a parameter substitution and used interchangeably in FOP and SCAP.
Because the constructed GSG is a complete graph except that there are no within-segment edges, the maximization of SCAP over permutation $\pi$ and orientation f can be rewritten as
In the last step of the above equation, since the objective function does not contain the permutation $\pi$, we can drop $\pi$ from the optimization parameters. That means that for any permutation $\pi$ the maximum sum of concordant edge weights is the same. After renaming segments to fragments and replacing the segment orientation function f with the fragment orientation function O, the above maximization problem is the same as FOP. As a result, an optimal solution of either SCAP or FOP also maximizes the objective of the other.
Therefore, given any instance of FOP, an instance of SCAP can be constructed in polynomial time whose solution contains an orientation function that also maximizes the FOP instance. Since FOP is NP-complete, SCAP is also NP-complete.
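The key step of the reduction, that the permutation drops out of the objective, can be checked by brute force on a small instance. The sketch below assumes a weight assignment in which the tail-head and head-tail edges of a pair carry the same score and the head-head and tail-tail edges carry the opp score. The scores themselves and all function names are hypothetical.

```python
from itertools import combinations, permutations, product


def is_concordant(edge, pi, f):
    """Definition 3, repeated so that the block is self-contained."""
    (u, a), (v, b) = edge
    if pi[u] > pi[v]:
        (u, a), (v, b) = (v, b), (u, a)
    return f[u] == int(a == "t") and f[v] == int(b == "h")


def fop_score(fragments, same, opp, orient):
    """FOP objective: 'same' for equally oriented pairs, 'opp' otherwise."""
    return sum(same[p] if orient[p[0]] == orient[p[1]] else opp[p]
               for p in combinations(fragments, 2))


def fop_to_scap(fragments, same, opp):
    """Build the four edges per pair used in the reduction (assumed form)."""
    weights = {}
    for u, v in combinations(fragments, 2):
        weights[((u, "t"), (v, "h"))] = same[(u, v)]
        weights[((u, "h"), (v, "t"))] = same[(u, v)]
        weights[((u, "h"), (v, "h"))] = opp[(u, v)]
        weights[((u, "t"), (v, "t"))] = opp[(u, v)]
    return weights


def scap_score(weights, pi, f):
    return sum(w for e, w in weights.items() if is_concordant(e, pi, f))


fragments = ["A", "B", "C"]
same = {("A", "B"): 4, ("A", "C"): 1, ("B", "C"): 2}
opp = {("A", "B"): 3, ("A", "C"): 5, ("B", "C"): 7}
weights = fop_to_scap(fragments, same, opp)

# Compare the two objectives for every order and every orientation.
mismatches = 0
for order in permutations(fragments):
    pi = {s: i for i, s in enumerate(order)}
    for bits in product((0, 1), repeat=len(fragments)):
        f = dict(zip(fragments, bits))
        if scap_score(weights, pi, f) != fop_score(fragments, same, opp, f):
            mismatches += 1
```

For every pair exactly one of the four edges is concordant, and which one depends only on the two orientations, so the SCAP score equals the FOP score whatever the order of the segments.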
Corollary 1
MCAP is NP-complete.
Proof
SCAP is a special case of MCAP with $k=1$, so the NP-completeness of MCAP is immediate.
A $\frac{1}{4}$-approximation algorithm for SCAP
We provide a greedy algorithm for SCAP that achieves an approximation ratio of at least $\frac{1}{4}$ and takes O(|V||E|) time. The main idea of the greedy algorithm is to place each segment into the current order one by one by choosing the current “best” position. The current “best” position is determined by the concordant edge weights between the segment to be placed and the segments already in the current order.
Theorem 2
Algorithm 1 approximates SCAP with an approximation ratio of at least $\frac{1}{4}$.
Proof
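Denote by $E'$ the concordant edges in the arrangement of Algorithm 1. Let OPT be the optimal value of SCAP. We are to prove $w(E') \ge \frac{1}{4} OPT$.
For iteration i in the for loop, the edges $E_i$ between the segment being placed and the segments already in the current order are considered when comparing the options. Each of the four options makes a subset of $E_i$ concordant. These subsets are non-overlapping and their union is $E_i$. Specifically, each option pairs an insertion position (beginning or end of the current order) with an orientation (1 or 0) of the new segment, and by Definition 3 every edge in $E_i$ is concordant under exactly one such pair once the orientation of its already-placed endpoint is fixed.
By selecting the option with the largest sum of concordant edge weights, the concordant edges $E'_i$ in iteration i satisfy $w(E'_i) \ge \frac{1}{4} w(E_i)$. Therefore, the overall concordant edge weights of all iterations in the for loop satisfy
$$w(E') = \sum_i w(E'_i) \ge \frac{1}{4} \sum_i w(E_i).$$
Each edge must appear in one and only one of the $E_i$, and thus $\sum_i w(E_i) = w(E)$. This implies $w(E') \ge \frac{1}{4} w(E) \ge \frac{1}{4} OPT$.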
Algorithm 1 can be further improved in practice by considering more order and orientation options when inserting a segment into the current order. In Algorithm 1, only two possible insertion places are considered: the beginning and the end of the current order. However, a new segment can be inserted between any pair of adjacent segments in the current order. We provide an extended greedy algorithm that takes the extra insertion positions into account (Algorithm 2). Algorithm 2 has a time complexity of $O(|V|^2|E|)$, but it may achieve a higher total concordant edge weight in practice.
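The greedy idea of inserting each segment at the beginning or end of the current order, in either orientation, can be sketched as follows. This is an illustrative reading of the description above rather than a transcription of Algorithm 1: the insertion order, the tie-breaking rule and the edge encoding are assumptions, and no attempt is made to match the stated running time.

```python
def is_concordant(edge, pi, f):
    """Definition 3, repeated so that the block is self-contained."""
    (u, a), (v, b) = edge
    if pi[u] > pi[v]:
        (u, a), (v, b) = (v, b), (u, a)
    return f[u] == int(a == "t") and f[v] == int(b == "h")


def greedy_scap(segments, weights):
    """Place segments one by one at the front or back of the current order,
    in orientation 1 or 0, keeping the option with the largest gain."""
    order, f = [], {}
    for s in segments:
        placed = set(order)
        # edges between s and segments that are already placed
        local = {e: w for e, w in weights.items()
                 if s in (e[0][0], e[1][0])
                 and {e[0][0], e[1][0]} <= placed | {s}}
        best_gain, best_order, best_orient = -1, None, None
        for new_order in (order + [s], [s] + order):
            pi = {seg: i for i, seg in enumerate(new_order)}
            for orient in (1, 0):
                trial_f = {**f, s: orient}
                gain = sum(w for e, w in local.items()
                           if is_concordant(e, pi, trial_f))
                if gain > best_gain:  # ties keep the first option tried
                    best_gain, best_order, best_orient = gain, new_order, orient
        order, f[s] = best_order, best_orient
    return {seg: i for i, seg in enumerate(order)}, f


# Hypothetical instance with two pairs of conflicting edges.
weights = {
    ((1, "t"), (2, "h")): 5,
    ((1, "t"), (2, "t")): 3,
    ((2, "t"), (3, "h")): 4,
    ((1, "h"), (3, "h")): 2,
    ((1, "t"), (3, "h")): 1,
}
pi, f = greedy_scap([1, 2, 3], weights)
score = sum(w for e, w in weights.items() if is_concordant(e, pi, f))
```

Because new segments are only added at the two ends, edges made concordant in earlier iterations stay concordant, which is what allows the per-iteration gains to be summed in the proof of Theorem 2.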
A $\frac{3}{4}$-approximation of MCAP with $k=2$ using a SCAP Oracle
If an optimal SCAP solution can be computed, one way to approximate the optimal solution of MCAP is to solve a series of SCAP instances iteratively to obtain multiple arrangements. Here, we prove that the solution based on iteratively solving SCAP has an approximation ratio of $\frac{3}{4}$ for the special case of MCAP with $k=2$.
Theorem 3
Algorithm 3 is a $\frac{3}{4}$-approximation of MCAP with $k=2$. Denote the optimal objective sum of edge weights in MCAP with $k=2$ as OPT, and the sum of edge weights in the two iterative SCAP rounds as W; then $W \ge \frac{3}{4} OPT$.
Proof
Denote MCAP with $k=2$ as 2-MCAP. Let $E_1^d$ and $E_2^d$ be the concordant edges in the two optimal arrangements of 2-MCAP. It is always possible to make the concordant edge sets of the arrangements disjoint by removing the intersection from one of them, that is, $E_2^d \leftarrow E_2^d \setminus E_1^d$. Let $E^d = E_1^d \cup E_2^d$. The optimal value is $OPT = w(E^d)$.
Denote the optimal set of concordant edges in the first round of Algorithm 3 as $E^s$. The optimal value of SCAP is $w(E^s)$. $E^s$ can overlap with the two concordant edge sets of the 2-MCAP optimal solution. Let the intersections be $I_1 = E^s \cap E_1^d$ and $I_2 = E^s \cap E_2^d$. Let the unique concordant edges be $D_1 = E_1^d \setminus E^s$, $D_2 = E_2^d \setminus E^s$ and $S = E^s \setminus E^d$.
After separating the concordant edges in 2-MCAP into the intersections and unique sets, the optimal value of 2-MCAP can be written as $OPT = w(I_1) + w(I_2) + w(D_1) + w(D_2)$, where the four subsets are disjoint. Therefore the smallest weight among the four subsets must be no greater than $\frac{OPT}{4}$. We prove the approximation ratio under the following two cases and discuss the weight of the second round of SCAP separately:
Case (1): the weight of either $D_1$ or $D_2$ is smaller than $\frac{OPT}{4}$. Because the two arrangements in 2-MCAP are interchangeable, we only prove the case where $w(D_1) < \frac{OPT}{4}$. A valid arrangement for the second round of SCAP is the second arrangement in 2-MCAP, though it may not be optimal. The maximum concordant edge weight added by the second round of SCAP must be no smaller than $w(D_2)$. Combining the optimal values of the two rounds of SCAP, the concordant edge weight is
$$W \ge w(E^s) + w(D_2) \ge w(I_1) + w(I_2) + w(D_2) = OPT - w(D_1) \ge \frac{3}{4} OPT. \tag{2}$$
Case (2): both $w(D_1) \ge \frac{OPT}{4}$ and $w(D_2) \ge \frac{OPT}{4}$. The subset with the smallest sum of edge weights is now either $I_1$ or $I_2$. Without loss of generality, we assume $I_1$ has the smallest sum of edge weights and $w(I_1) \le \frac{OPT}{4}$. Because the first round of SCAP is optimal for the SCAP problem, its objective value should be no smaller than the concordant edge weight of either arrangement in 2-MCAP. Thus
$$w(E^s) \ge w(E_2^d) = w(I_2) + w(D_2). \tag{3}$$
A valid arrangement for the second round of SCAP can be either of the arrangements in the 2-MCAP optimal solution. Picking the first arrangement of 2-MCAP as a possible (but not necessarily optimal) arrangement for the second round of SCAP, the concordant edge weight added by the second round of SCAP must be no smaller than $w(D_1)$. Therefore, the total sum of concordant edge weights of the optimal solutions of both rounds of SCAP is
$$W \ge w(E^s) + w(D_1) \ge w(I_2) + w(D_2) + w(D_1) = OPT - w(I_1) \ge \frac{3}{4} OPT. \tag{4}$$
Corollary 2
An approximation algorithm for MCAP with $k=2$ can be created by using Algorithm 1 as the oracle for SCAP in Algorithm 3. This approximation algorithm runs in O(|V||E|) time and achieves an approximation ratio of at least $\frac{3}{16}$.
The proof of the corollary is similar to the proof of Theorem 3. By adding a multiplier of $\frac{1}{4}$ to the right of inequality (3) when lower bounding $w(E^s)$ by $w(I_2) + w(D_2)$, the approximation ratio can be derived accordingly.
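The iterative scheme can be sketched with an exhaustive search standing in for the SCAP oracle. The sketch assumes that the second round is run on the edges left discordant by the first round, which is one natural reading of "iteratively solving SCAP". `brute_force_scap` and `two_round_scap` are hypothetical names, and the exhaustive search is only feasible for a handful of segments.

```python
from itertools import permutations, product


def is_concordant(edge, pi, f):
    """Definition 3, repeated so that the block is self-contained."""
    (u, a), (v, b) = edge
    if pi[u] > pi[v]:
        (u, a), (v, b) = (v, b), (u, a)
    return f[u] == int(a == "t") and f[v] == int(b == "h")


def concordant_edges(weights, pi, f):
    return {e for e in weights if is_concordant(e, pi, f)}


def brute_force_scap(segments, weights):
    """Exhaustive SCAP 'oracle': tries every order and every orientation."""
    best_score, best = -1, None
    for order in permutations(segments):
        pi = {s: i for i, s in enumerate(order)}
        for bits in product((1, 0), repeat=len(segments)):
            f = dict(zip(segments, bits))
            score = sum(weights[e] for e in concordant_edges(weights, pi, f))
            if score > best_score:
                best_score, best = score, (pi, f)
    return best


def two_round_scap(segments, weights, scap_solver=brute_force_scap):
    """Solve SCAP, drop the edges it explains, then solve SCAP again."""
    first = scap_solver(segments, weights)
    covered = concordant_edges(weights, *first)
    remaining = {e: w for e, w in weights.items() if e not in covered}
    second = scap_solver(segments, remaining)
    covered |= concordant_edges(weights, *second)
    return [first, second], sum(weights[e] for e in covered)


weights = {
    ((1, "t"), (2, "h")): 5,
    ((1, "t"), (2, "t")): 3,
    ((2, "t"), (3, "h")): 4,
    ((1, "h"), (3, "h")): 2,
    ((1, "t"), (3, "h")): 1,
}
arrangements, total = two_round_scap([1, 2, 3], weights)
```

Passing a greedy solver as `scap_solver` instead of the exhaustive one gives the oracle-free variant discussed in Corollary 2, at the cost of the weaker guarantee.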
Integer linear programming formulation for MCAP
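MCAP, for general k, can be formulated as an integer linear program (ILP) to obtain an optimal solution. We rewrite the i-th permutation ($\pi_i$), orientation ($f_i$) and decision ($\mathbb{1}[e \sim (\pi_i, f_i)]$) functions with three boolean variables $z^i_{uv}$, $y^i_u$ and $x^i_e$. For $i \in \{1, \dots, k\}$ and $u, v \in S$, we have:
$x^i_e = 1$ if edge $e \sim (\pi_i, f_i)$ and 0 otherwise.
$y^i_u = 1$ if $f_i(u) = 1$ for segment u and 0 if $f_i(u) = 0$.
$z^i_{uv} = 1$ if $\pi_i(u) < \pi_i(v)$, i.e. segment u is in front of v in arrangement i, and 0 otherwise.
In order to account for the edges that are concordant in more than one arrangement in the summation in Equation 1, we define $q_e$ such that $q_e = 1$ if edge e is concordant in one of the k arrangements and 0 otherwise. The constraints for $q_e$ are as follows:
$$q_e \le \sum_{i=1}^{k} x^i_e, \tag{5}$$
$$q_e \le 1. \tag{6}$$
The objective function becomes
$$\max \sum_{e \in E} w(e) \cdot q_e. \tag{7}$$
We then add ordering and orientation constraints. If an edge is a tail-head connection, i.e. concordant to the reference genome, $x^i_e = 1$ if and only if $z^i_{uv} = y^i_u = y^i_v$. If an edge is a tail-tail connection, $x^i_e = 1$ if and only if $z^i_{uv} = y^i_u = 1 - y^i_v$. If an edge is a head-tail connection, $x^i_e = 1$ if and only if $1 - z^i_{uv} = y^i_u = y^i_v$. If an edge is a head-head connection, $x^i_e = 1$ if and only if $z^i_{uv} = 1 - y^i_u = y^i_v$. The constraints for a tail-head connection are listed below in Equation 8, which enforce the assignment of boolean variables $x^i_e$, $y^i_u$, $y^i_v$ and $z^i_{uv}$:
$$\begin{aligned} x^i_e &\le y^i_u - y^i_v + 1, \\ x^i_e &\le y^i_v - y^i_u + 1, \\ x^i_e &\le y^i_u - z^i_{uv} + 1, \\ x^i_e &\le z^i_{uv} - y^i_u + 1. \end{aligned} \tag{8}$$
The constraints of other types of connections are similar and detailed in Ma et al. [9]. Additionally, constraints are added so that all segments are put into a total order within each allele. For two segments u, v, segment u will either precede or follow segment v, i.e. $z^i_{uv} + z^i_{vu} = 1$. For three segments u, v, w, if u precedes v and v precedes w, then u has to precede w: $1 \le z^i_{uv} + z^i_{vw} + z^i_{wu} \le 2$.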
The total number of constraints as a function of k is $O(k(|E| + |V|^3))$. When k increases, the number of constraints grows linearly. When $k=1$, the ILP formulation reduces to the same formulation as SQUID.
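The effect of the linear constraints can be checked by enumeration without an ILP solver. The sketch below assumes a common linearization in which four inequalities allow x to be 1 only when three binary variables are equal, an upper bound ties q to the per-arrangement indicators, and a sum bound on three ordering variables excludes cyclic orders. The function names are hypothetical and the inequalities are an assumed form, not a transcription of the released implementation.

```python
from itertools import product


def tail_head_ok(x, y_u, y_v, z_uv):
    """Assumed tail-head constraints: x may be 1 only if y_u == y_v == z_uv."""
    return (x <= y_u - y_v + 1 and x <= y_v - y_u + 1
            and x <= y_u - z_uv + 1 and x <= z_uv - y_u + 1)


def q_ok(q, xs):
    """q may be 1 only if the edge is concordant in some arrangement."""
    return q <= sum(xs) and q <= 1


def order_ok(z_uv, z_vw, z_wu):
    """Excludes the two cyclic 'orders' u<v<w<u and u>v>w>u."""
    return 1 <= z_uv + z_vw + z_wu <= 2


# For which (y_u, y_v, z_uv) is x = 1 feasible?
x_allowed = {bits: tail_head_ok(1, *bits) for bits in product((0, 1), repeat=3)}

# Which assignments of (z_uv, z_vw, z_wu) describe a valid total order?
acyclic = [z for z in product((0, 1), repeat=3) if order_ok(*z)]
```

Since the objective maximizes a weighted sum of the q variables, upper bounds are sufficient: the solver sets each indicator to 1 whenever the constraints permit it.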
Characterizing the conflict structures that imply heterogeneity
In this section, we ignore edge weights and characterize the graph structures where the homogeneous assumption cannot explain all edges. We add a set of segment edges, $\hat{E}$, to the GSG. Each $\hat{e} \in \hat{E}$ connects the two endpoints of a segment, i.e. $\hat{e} = (s_h, s_t)$ for $s \in S$. The representation of GSG becomes $G = (V, E, \hat{E})$.
Definition 6
(Conflict structures and compatible structures) A conflict structure is a subgraph of a GSG where there exists a set of edges that cannot be made concordant using any single arrangement. A compatible structure is a subgraph of a GSG where there exists a single arrangement such that all edges can be made concordant in it.
Definition 7
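(Simple cycle in GSG) A simple cycle, $C = (V_C, E_C, \hat{E}_C)$, is a subgraph of a GSG such that $E_C \subseteq E$ and $\hat{E}_C \subseteq \hat{E}$, with $V_C = \{v_1, \dots, v_n\}$ and an edge in $E_C \cup \hat{E}_C$ between each pair of consecutive vertices $v_i, v_{i+1}$, where $v_i \ne v_j$ when $i \ne j$ except $v_1 = v_n$.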
Definition 8
(Degree and special degree of a vertex in subgraphs of GSG) Given a subgraph of GSG, $G' = (V', E', \hat{E}')$, $\deg_E(v)$ refers to the degree of vertex $v \in V'$ that counts only the edges $e \in E'$ that connect to v. deg(v) refers to the number of edges $e \in E' \cup \hat{E}'$ that connect to v.
Theorem 4
Any acyclic subgraph of GSG is a compatible structure.
Proof
We show by induction that any acyclic subgraph of GSG with N edges ($N \ge 1$), $G_N$, is a compatible structure.
When $N=1$, $G_1$ is a compatible structure because no other edge in $G_1$ is in conflict with its only edge.
Assume the theorem holds for any acyclic subgraph that contains n edges. Let $G_{n+1}$ be an acyclic subgraph with $n+1$ edges. Since $G_{n+1}$ is acyclic, there must be a leaf edge that is incident to a leaf node. Denote the leaf node as $v_a$ and the leaf edge as $e = (u_b, v_a)$ ($a, b \in \{h, t\}$). After removing edge e and leaf node $v_a$, the remaining subgraph $G_n$ is also acyclic and contains n edges. According to the assumption, $G_n$ is a compatible structure and there is an arrangement of the segments in which all edges in $G_n$ are concordant. Because no other edge in $G_{n+1}$ except e connects to $v_a$, it is always possible to place segment v back into the arrangement such that e is concordant. Specifically, one of the four placing options will satisfy edge e: the beginning of the arrangement with orientation 1, the beginning with orientation 0, the end with orientation 1 and the end with orientation 0. Therefore, $G_{n+1}$ is a compatible structure.
By induction, an acyclic subgraph of GSG with any number of edges $N \ge 1$ is a compatible structure.
Theorem 5
A simple cycle $C$ is a compatible structure if and only if there are exactly two vertices, $v_j$ and $v_k$, such that $\deg_E(v_j) = \deg_E(v_k) = 2$ and $v_j$ and $v_k$ belong to different segments.
Proof
We prove necessity and sufficiency separately in Lemma 1 and Lemma 2.
Lemma 1
If C is a compatible structure, there are exactly two vertices, $v_j$ and $v_k$, that belong to different segments, such that $\deg_E(v_j) = \deg_E(v_k) = 2$.
Proof
We discuss compatibility in two cases:
Case (1):All edges are concordant inC. Sort the vertices by genomic locations in ascending order and label the first vertex and the last , assuming . Similarly, sort the set of segments in C by the values of their permutation function and label the first segment and the last , assuming . Since concordant connections can only be tail-head connections (e.g. Figure 1 b,c), and . Since C is a simple cycle, all vertices have . Because and are the first and last vertices in this arrangement, the edges incident to or must be in . It follows that the two edges incident to connects to and . Similarly, edges incident to connects to and . Therefore, we have . Any other vertex () is connected by one and one and thus has .
Case (2): Some edges are discordant in C. If discordant edges exist in cycle C, according to the definition of compatible structure, segments in C can be arranged such that all edges are concordant. This reduces to case (1).
Lemma 2
If there are exactly two vertices in $C$ that belong to different segments, $v_j$ and $v_k$, such that $\deg_E(v_j) = \deg_E(v_k) = 2$, then C is a compatible structure.
Proof
Let and be the one of the end points of segments and , respectively. We can arrange and such that , and that , . Rename to and to . Since C is a simple cycle, we can find two simple paths, and , between and and there is no edge between and . Let and denote and that exclude and and the edges incident to and . Since and as acyclic subgraphs of GSG, according to Theorem 4, and are compatible structures and therefore segments in and can be arranged so that all edges are concordant. Denote the first and last vertices in the arranged as and , and the first and last vertices in the arranged as and . Because all the edges are concordant in , and are the head and tail of the first and last segments in . Because only and have in C, must be connected to or and must be connected to or . A similar argument applies to and . To ensure concordance of edges connected to and , if is connected to and is connected to , we flip all the segments in . The similar operation is applied to , and . Now we have a compatible structure.
Corollary 3
A necessary condition for a subgraph to be a conflict structure is that it contains cycles. A sufficient condition for a subgraph to be a conflict structure is that it contains a simple cycle which is not a compatible structure.
The corollary is a direct derivation from Theorem 4 and Theorem 5 when considering general graph structures.
In practice, we determine whether a discordant edge, $e = (u, v)$, is involved in a conflict structure by enumerating all simple paths between u and v that omit edge e, using a modified depth-first search implemented in Networkx [19, 20]. We add e to each path to form a simple cycle. If the simple cycle satisfies Corollary 3, we stop path enumeration and label e as a discordant edge involved in a conflict structure. If the running time of path enumeration exceeds 0.5 seconds, we shuffle the order of DFS and repeat the enumeration. If path enumeration for e exceeds 1000 reruns, we label e as undecided.
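Once a simple cycle has been formed, the degree criterion of Theorem 5 is inexpensive to evaluate. The sketch below assumes that the criterion counts, for each vertex, the incident cycle edges that are not segment edges, and that a cycle is given as a list of (segment, end) vertices in cyclic order with at least three vertices. It does not perform the path enumeration, and the function names are hypothetical.

```python
def is_segment_edge(x, y):
    """A segment edge joins the head and the tail of the same segment."""
    return x[0] == y[0] and x[1] != y[1]


def cycle_is_compatible(cycle):
    """Check the degree criterion on one simple cycle.

    cycle: list of (segment, end) vertices in cyclic order, each listed once;
        consecutive vertices (and the last and first) are joined by an edge.
    Returns True if exactly two vertices have both incident cycle edges
    outside the segment edges and these two vertices lie on different segments.
    """
    n = len(cycle)
    if n < 3:
        raise ValueError("a simple cycle needs at least three vertices")
    full = []
    for i, v in enumerate(cycle):
        neighbours = (cycle[i - 1], cycle[(i + 1) % n])
        deg_e = sum(not is_segment_edge(v, nb) for nb in neighbours)
        if deg_e == 2:
            full.append(v)
    return len(full) == 2 and full[0][0] != full[1][0]


# Splicing-like cycle: 1_t-2_h, segment 2, 2_t-3_h, and a skipping edge 1_t-3_h.
splicing = [(1, "t"), (2, "h"), (2, "t"), (3, "h")]
# Conflict: segment 1's tail is joined to both ends of segment 2.
both_ends = [(1, "t"), (2, "h"), (2, "t")]
# Conflict: a tail-head and a head-tail edge between the same two segments.
crossed = [(1, "t"), (2, "h"), (2, "t"), (1, "h")]
```

A discordant edge would be labelled as involved in a conflict structure as soon as one of its cycles fails this check.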
Results
To produce an efficient, practical algorithm for TSV detection in diploid organisms, we use the following approach, which we denote as D-SQUID: Run the ILP under the diploid assumption by setting $k=2$ on every connected component of GSG separately. If the ILP finishes or the running time of the ILP exceeds one hour, output the current arrangements.
D-SQUID identifies more TSVs in TCGA samples than SQUID
We calculate the fraction of discordant edges involved in conflict structures (Fig. 2a) in 381 TCGA samples from four types of cancers: bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), lung adenocarcinoma (LUAD) and prostate adenocarcinoma (PRAD). Among all samples, fewer than 0.5% of all discordant edges are undecided. The distribution of the fraction of discordant edges within conflict structures differs among cancer types. The more discordant edges are involved in conflict structures, the more heterogeneous the sample is. Among the four cancer types, PRAD samples exhibit the highest extent of heterogeneity and BRCA samples exhibit the lowest. On average, more than 90% of discordant edges are within conflict structures in all samples across the four cancer types. This suggests that TCGA samples are usually heterogeneous, which may be partially explained by the fact that TCGA samples are usually a mixture of tumor cells and normal cells [21].
We compare the number of TSVs found by D-SQUID and SQUID (Fig. 2b). In all of our results, all of the TSVs found by SQUID belong to a subset of TSVs found by D-SQUID. D-SQUID identifies many more TSVs than SQUID on all four types of cancers.
A discordant edge is termed resolved if it is made concordant in one of the arrangements. Among all discordant edges in all samples, D-SQUID is able to resolve most of them (Fig. 2c), while SQUID is only able to resolve fewer than 50% of them. The results demonstrate that D-SQUID is more capable of resolving conflict structures in heterogeneous contexts, such as cancer samples, than SQUID.
D-SQUID identifies more true TSV events than SQUID in cancer cell lines
We compare the ability of D-SQUID and SQUID to detect fusion-gene and non-fusion-gene events on previously studied breast cancer cell lines HCC1395 and HCC1954 [22]. The annotation of validated TSVs is taken from Ma et al. [9]. In both cell lines, D-SQUID discovers more TSVs than SQUID. In HCC1954, D-SQUID identifies the same number of known TSVs as SQUID, including fusions of gene (G) regions and intergenic (IG) regions. In HCC1395, D-SQUID identifies 2 more true TSV events that are fusions of genic regions. We tally the fraction of discordant edges in conflict structures (Fig. 3c) and find similar fractions in HCC1395 and HCC1954, which indicates that the extent of heterogeneity in the two samples is similar. Compared to Fig. 2a, the fraction in HCC samples is much lower than that in TCGA samples. This matches the fact that the two HCC samples contain the same cell type and are both cell line samples, which are known to be less heterogeneous than TCGA samples.
D-SQUID predicts TSVs in biologically significant genes in cancer cell lines
Figure 4 gives two examples of TSVs predicted by D-SQUID but not by SQUID. Such TSVs are involved in conflict structures and can only be resolved by separating discordant edges into different arrangements.
An example of a validated TSV is shown in Fig. 4a. The head-tail connection between segment and conflicts with the tail-head connections between segments and and segments and . Such a conflict structure is resolved by separating edge into the second arrangement. Notice that since no discordant edges are made concordant in the first arrangement, no new TSVs are predicted. Therefore, the corresponding gene model for the first arrangement is the same as that of the original arrangement. The affected regions are exons of ERO1A and FERMT2 genes. As predicted by D-SQUID, this TSV involves an insertion of the sixth and the seventh exons of FERMT2 between the sixth and seventh exons of ERO1A.
Some of the unvalidated TSVs predicted by D-SQUID affect genes that are associated with breast cancer. The TSV shown in Fig. 4b involves an insertion of the 3’ untranslated region (UTR) of CLPSL1 and the entire CLPS gene between the first and second exons of CLPSL1. It has been reported that CLPSL1 is associated with a prognostic factor of breast cancer [23].
A full list of affected regions in HCC samples can be found in Additional file 1.
Evaluation of approximation algorithms
We evaluate the approximation algorithms for diploid MCAP (k = 2) using two different subroutines described in previous sections. In this subsection, A1 refers to using Algorithm 1 with worst-case runtime O(|V||E|) as a subroutine and A2 refers to using Algorithm 2 with worst case runtime as a subroutine. Both A1 and A2 solve SCAP by greedily inserting segments into the best position in the current ordering. While A1 only looks at the beginning and end of the ordering, A2 looks at all positions.
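The difference between the two insertion strategies can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that an edge is concordant when the right-facing end of the left segment (tail if forward, head if reversed) is joined to the left-facing end of the right segment, and that both orientations are tried at every candidate slot. The names `concordant_weight`, `greedy_arrangement` and the flag `all_positions` are hypothetical.

```python
from typing import List, Tuple

# (segment u, end of u, segment v, end of v, weight); ends are "h" (head) or "t" (tail)
Edge = Tuple[int, str, int, str, float]
# (segment id, True if placed in forward orientation)
Placed = Tuple[int, bool]


def concordant_weight(order: List[Placed], edges: List[Edge]) -> float:
    """Total weight of edges that are concordant in the given ordering.

    Assumed rule: the left segment must contribute its right-facing end and
    the right segment its left-facing end.
    """
    pos = {seg: (i, fwd) for i, (seg, fwd) in enumerate(order)}
    total = 0.0
    for u, end_u, v, end_v, w in edges:
        if u == v or u not in pos or v not in pos:
            continue
        left, right = (u, end_u), (v, end_v)
        if pos[u][0] > pos[v][0]:
            left, right = right, left
        left_fwd = pos[left[0]][1]
        right_fwd = pos[right[0]][1]
        if left[1] == ("t" if left_fwd else "h") and right[1] == ("h" if right_fwd else "t"):
            total += w
    return total


def greedy_arrangement(segments: List[int], edges: List[Edge],
                       all_positions: bool) -> List[Placed]:
    """Insert segments one at a time at the best-scoring slot.

    all_positions=False mimics an A1-style search (only the two ends);
    all_positions=True mimics an A2-style search (every slot).
    """
    order: List[Placed] = []
    for seg in segments:
        if all_positions:
            slots = list(range(len(order) + 1))
        else:
            slots = sorted({0, len(order)})
        best_score, best_order = None, None
        for slot in slots:
            for fwd in (True, False):
                cand = order[:slot] + [(seg, fwd)] + order[slot:]
                score = concordant_weight(cand, edges)
                if best_score is None or score > best_score:
                    best_score, best_order = score, cand
        order = best_order
    return order


# Toy graph (hypothetical weights): a chain 0 -> 1 -> 2 plus a shortcut 0 -> 2.
edges = [(0, "t", 1, "h", 2.0), (1, "t", 2, "h", 2.0), (0, "t", 2, "h", 1.0)]
ends_only = greedy_arrangement([0, 2, 1], edges, all_positions=False)
every_slot = greedy_arrangement([0, 2, 1], edges, all_positions=True)
```

On this toy input, the search over every slot can place segment 1 between segments 0 and 2, whereas the ends-only search cannot, which is the trade-off between speed and approximation quality described above.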
To compare the approximation algorithms with the exact ILP-based algorithm, we run D-SQUID, A1 and A2 on TCGA samples. The algorithms are evaluated on runtime and on the total weight of concordant edges in the rearranged genomes. “Fold difference” on the axes of Fig. 5 refers to the ratio of the axis value of D-SQUID over that of A1 or A2. Both A1 and A2 output results in a much shorter time than D-SQUID. A2 achieves a better approximation than A1, as shown by a ratio of total concordant edge weight that is closer to one, at the cost of a longer run time.
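As a small illustration of how the fold difference is read, the sketch below computes the ratio for made-up numbers. The function name `fold_difference` and all values are hypothetical and are not taken from Fig. 5.

```python
def fold_difference(dsquid_value: float, approx_value: float) -> float:
    """Ratio of the D-SQUID value over the value of an approximation (A1 or A2)."""
    if approx_value == 0:
        raise ValueError("approximation value must be non-zero")
    return dsquid_value / approx_value


# Hypothetical runtimes in seconds: a ratio above 1 means D-SQUID was slower.
runtime_ratio = fold_difference(3600.0, 36.0)

# Hypothetical concordant edge weights: a ratio below 1 means the
# approximation made more edge weight concordant than D-SQUID did.
weight_ratio = fold_difference(90.0, 100.0)
```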
The run time of D-SQUID ILP exceeds 1 h on 4.5% of all connected components in all TCGA samples. D-SQUID outputs sub-optimal arrangements in such cases. As a result, approximation algorithms, especially A2, appear to resolve more high-weight discordant edges than D-SQUID in some of the samples in Fig. 5, which is demonstrated by data points that fall below 1 on the y axes. A1 resolves more high-weight edges in 10 samples and A2 resolves more high-weight edges in 54 samples than D-SQUID.
Conclusions
We present approaches to identify TSVs in heterogeneous samples via the multiple compatible arrangements problem (MCAP). We characterize sample heterogeneity in terms of the fraction of discordant edges involved in conflict structures. In the majority of TCGA samples, the fractions of discordant edges in conflict structures are high compared to HCC samples, which indicates that TCGA samples are more heterogeneous than HCC samples. This matches the fact that bulk tumor samples often contain more heterogeneous genomes than cancer cell lines, which suggests that the fraction of conflicting discordant edges is a valid measure of sample heterogeneity.
We show that obtaining exact solutions to MCAP is NP-complete. We derive an integer linear programming (ILP) formulation to solve MCAP exactly. We provide a -approximation algorithm for MCAP when the number of arrangements is two (k = 2), which runs in time O(|V||E|). It approximates the exact solutions well in TCGA samples.
MCAP addresses this heterogeneity. In 381 TCGA samples, D-SQUID is able to resolve more conflicting discordant edges than SQUID. Since D-SQUID solves MCAP by separating conflicting TSVs onto two alleles, D-SQUID’s power to find TSVs generally increases as the extent of heterogeneity increases. In HCC cell lines, D-SQUID achieves better performance than SQUID. Aside from validated TSV events, D-SQUID discovers unvalidated fusion-gene events that impact genes associated with cancer, which require further investigation.
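The benefit of separating conflicting edges onto two arrangements can be sketched with a toy example. The sketch assumes that an edge is concordant when the right-facing end of the left segment (tail if forward, head if reversed) is joined to the left-facing end of the right segment, and that the MCAP objective counts an edge once if it is concordant in at least one arrangement. The names `is_concordant` and `mcap_objective` and the two-edge graph are hypothetical.

```python
from typing import List, Tuple

# (segment u, end of u, segment v, end of v, weight); ends are "h" (head) or "t" (tail)
Edge = Tuple[int, str, int, str, float]
# Ordered list of (segment id, True if forward orientation)
Arrangement = List[Tuple[int, bool]]


def is_concordant(edge: Edge, arrangement: Arrangement) -> bool:
    """Assumed concordance rule for one edge in one arrangement."""
    u, end_u, v, end_v, _ = edge
    pos = {seg: (i, fwd) for i, (seg, fwd) in enumerate(arrangement)}
    if u == v or u not in pos or v not in pos:
        return False
    left, right = (u, end_u), (v, end_v)
    if pos[u][0] > pos[v][0]:
        left, right = right, left
    left_fwd = pos[left[0]][1]
    right_fwd = pos[right[0]][1]
    return left[1] == ("t" if left_fwd else "h") and right[1] == ("h" if right_fwd else "t")


def mcap_objective(arrangements: List[Arrangement], edges: List[Edge]) -> float:
    """Total weight of edges concordant with at least one arrangement."""
    return sum(e[4] for e in edges if any(is_concordant(e, a) for a in arrangements))


# Two edges that cannot both be concordant in any single arrangement.
edges = [(0, "t", 1, "h", 1.0), (1, "t", 0, "h", 1.0)]
first = [(0, True), (1, True)]
second = [(1, True), (0, True)]

single = mcap_objective([first], edges)           # one arrangement
diploid = mcap_objective([first, second], edges)  # two arrangements
```

With a single arrangement, at most one of the two edges is concordant. With two arrangements, each edge is concordant in one of them, so both contribute to the objective.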
Several open problems remain. MCAP relies on the number of arrangements (k) to make predictions. It is not trivial to determine the optimal k for any sample. In addition, although MCAP is solved by separating TSVs onto different alleles, there are typically many equivalent phasings. Developing techniques for handling these alternative phasings is an interesting direction for future work. Analyzing the impact of TSVs, especially non-fusion-gene ones, on cellular functions and diseases is another direction for future work.
Another potential future direction to improve the accuracy of TSV prediction is to incorporate the distance between breakpoints and read pairs into the optimization formulation. A long distance between read pairs mapped to the reference genome indicates a potential TSV event induced by deletion events. Ignoring such long distances leads to false negatives. On the other hand, long distances between breakpoints of a fusion-gene TSV in the rearranged genome can potentially indicate false positive predictions. We show that thresholding distances during pre- and post-processing steps of D-SQUID is helpful in reducing false negatives, but not as effective in reducing false positives partially due to the lack of distance consideration in the current problem formulation (Additional files 1, 2). Investigating and evaluating potential ways to incorporate the distance information, such as adding a distance threshold to the edge concordance definition or adding distance penalties into the ILP, is a future direction for improvement.
Acknowledgements
The results shown here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC) [24].
Abbreviations
- BLCA: Bladder urothelial carcinoma
- CS: Conflict structure
- BRCA: Breast invasive carcinoma
- GSG: Genome Segment Graph
- ILP: Integer linear programming
- LUAD: Lung adenocarcinoma
- MCAP: Multiple Compatible Arrangements Problem
- PRAD: Prostate adenocarcinoma
- SCAP: Single Compatible Arrangement Problem
- SRA: Sequencing Read Archive
- TCGA: The Cancer Genome Atlas
- TSV: Transcriptomic structural variant
Authors’ contributions
YQ and CK designed this study. YQ and CM developed the computational methods and ran the experiments. YQ, CM and HX wrote the manuscript. All authors read and approved the final manuscript.
Funding
This work was supported in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative (GBMF4554 to C.K.); the US National Institutes of Health (R01GM122935); and The Shurl and Kay Curci Foundation. This project is funded, in part, by a Grant (4100070287) from the Pennsylvania Department of Health. The department specifically disclaims responsibility for any analyses, interpretations, or conclusions.
Availability of data and materials
HCC1395 and HCC1954 sequencing data analyzed during the current study are available in the SRA repository. The accession numbers are
HCC1395 RNA-seq: SRR2532336 [25]
HCC1954 RNA-seq: SRR2532344 [26] and SRR925710 [27]
TCGA WGS and RNA-seq data are available through application to dbGaP [28].
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
C.K. is co-founder of Ocean Genomics, Inc.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Yutong Qiu and Cong Ma contributed equally to this work
Supplementary information
Supplementary information accompanies this paper at 10.1186/s13015-020-00170-5.
References
1. Deininger MW, Goldman JM, Melo JV. The molecular biology of chronic myeloid leukemia. Blood. 2000;96(10):3343–3356. doi: 10.1182/blood.V96.10.3343.
2. Wang X, Zamolyi RQ, Zhang H, Pannain VL, Medeiros F, Erickson-Johnson M, Jenkins RB, Oliveira AM. Fusion of HMGA1 to the LPP/TPRG1 intergenic region in a lipoma identified by mapping paraffin-embedded tissues. Cancer Genet Cytogenet. 2010;196(1):64–67. doi: 10.1016/j.cancergencyto.2009.09.003.
3. Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009;6(9):677. doi: 10.1038/nmeth.1363.
4. Layer RM, Chiang C, Quinlan AR, Hall IM. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 2014;15(6):84. doi: 10.1186/gb-2014-15-6-r84.
5. Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28(18):333–339. doi: 10.1093/bioinformatics/bts378.
6. Hormozdiari F, Hajirasouliha I, Dao P, Hach F, Yorukoglu D, Alkan C, Eichler EE, Sahinalp SC. Next-generation variationhunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics. 2010;26(12):350–357. doi: 10.1093/bioinformatics/btq216.
7. Dixon JR, Xu J, Dileep V, Zhan Y, Song F. Integrative detection and analysis of structural variation in cancer genomes. Nat Genet. 2018;50(10):1388. doi: 10.1038/s41588-018-0195-8.
8. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, Schatz MC. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15(6):461–468. doi: 10.1038/s41592-018-0001-7.
9. Ma C, Shao M, Kingsford C. SQUID: transcriptomic structural variation detection from RNA-seq. Genome Biol. 2018;19(1):52. doi: 10.1186/s13059-018-1421-5.
10. Huang Z, Jones DT, Wu Y, Lichter P, Zapatka M. confFuse: high-confidence fusion gene detection across tumor entities. Front Genet. 2017;8:137. doi: 10.3389/fgene.2017.00137.
11. McPherson A, Hormozdiari F, Zayed A, Giuliany R, Ha G. deFuse: an algorithm for gene fusion discovery in tumor RNA-Seq data. PLoS Comput Biol. 2011;7(5):1001138. doi: 10.1371/journal.pcbi.1001138.
12. Davidson NM, Majewski IJ, Oshlack A. Jaffa: High sensitivity transcriptome-focused fusion gene detection. Genome Med. 2015;7(1):43. doi: 10.1186/s13073-015-0167-x.
13. Nicorici D, Satalan M, Edgren H, Kangaspeska S, Murumagi A, Kallioniemi O, Virtanen S, Kilkku O. FusionCatcher–a tool for finding somatic fusion genes in paired-end RNA-sequencing data. BioRxiv. 2014;011650.
14. Torres-García W, Zheng S, Sivachenko A, Vegesna R, Wang Q, Yao R, Berger MF, Weinstein JN, Getz G, Verhaak RG. PRADA: pipeline for RNA sequencing data analysis. Bioinformatics. 2014;30(15):2224–2226. doi: 10.1093/bioinformatics/btu169.
15. Jia W, Qiu K, He M, Song P, Zhou Q. SOAPfuse: an algorithm for identifying fusion transcripts from paired-end RNA-Seq data. Genome Biol. 2013;14(2):12. doi: 10.1186/gb-2013-14-2-r12.
16. Liu S, Tsai W-H, Ding Y, Chen R, Fang Z. Comprehensive evaluation of fusion transcript detection algorithms and a meta-caller to combine top performing methods in paired-end RNA-seq data. Nucleic Acids Res. 2015;44(5):47. doi: 10.1093/nar/gkv1234.
17. Heber S, Alekseyev M, Sze S-H, Tang H, Pevzner PA. Splicing graphs and EST assembly problem. Bioinformatics. 2002;18(suppl-1):181–188. doi: 10.1093/bioinformatics/18.suppl_1.S181.
18. Kececioglu JD, Myers EW. Combinatorial algorithms for DNA sequence assembly. Algorithmica. 1995;13(1–2):7. doi: 10.1007/BF01188580.
19. Hagberg A, Swart P, Chult SD. Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States). 2008.
20. Sedgewick R. Algorithms in C, part 5: graph algorithms. 3. Boston: Addison-Wesley Professional; 2001.
21. Aran D, Sirota M, Butte AJ. Systematic pan-cancer analysis of tumour purity. Nat Commun. 2015;6:8971. doi: 10.1038/ncomms9971.
22. Gazdar AF, Kurvari V, Virmani A, Gollahon L, Sakaguchi M. Characterization of paired tumor and non-tumor cell lines established from patients with breast cancer. Int J Cancer. 1998;78(6):766–774. doi: 10.1002/(SICI)1097-0215(19981209)78:6<766::AID-IJC15>3.0.CO;2-L.
23. Xiu Y, Liu W, Wang T, Liu Y, Ha M. Overexpression of ect2 is a strong poor prognostic factor in er (+) breast cancer. Mol Clin Oncol. 2019;10(5):497–505. doi: 10.3892/mco.2019.1832.
24. Nystrom NA, Levine MJ, Roskies RZ, Scott J. Bridges: a uniquely flexible HPC resource for new communities and data analytics. In: Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, p. 30. 2015.
25. Marcotte R, Sayad A, Brown KR, Sanchez-Garcia F, Reimand J, Haider M, Virtanen C, Bradner JE, Bader GD, Mills GB et al. Functional genomic landscape of human breast cancer drivers, vulnerabilities, and resistance. Elsevier 2016. https://www.ncbi.nlm.nih.gov/sra/?term=SRR2532336
26. Marcotte R, Sayad A, Brown KR, Sanchez-Garcia F, Reimand J, Haider M, Virtanen C, Bradner JE, Bader GD, Mills GB et al. Functional genomic landscape of human breast cancer drivers, vulnerabilities, and resistance. Elsevier 2016. https://www.ncbi.nlm.nih.gov/sra/?term=SRR2532344
27. Daemen A, Griffith OL, Heiser LM, Wang NJ, Enache OM, Sanborn Z, Pepin F, Durinck S, Korkola JE, Griffith M et al. Modeling precision treatment of breast cancer. BioMed Central 2013. https://www.ncbi.nlm.nih.gov/sra/?term=SRR925710
28. Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, Hao L, Kiang A, Paschall J, Phan L. The ncbi dbgap database of genotypes and phenotypes. Nat Genet. 2007;39(10):1181. doi: 10.1038/ng1007-1181.