Inferring the global structure of chromosomes from structural variations

Tomohiro Yasuda; Satoru Miyano

doi:10.1186/1471-2164-16-S2-S13

2015 Jan 21;16(Suppl 2):S13. doi: 10.1186/1471-2164-16-S2-S13

PMCID: PMC4331713 PMID: 25707904

Abstract

Background

Next generation sequencing (NGS) technologies have made it possible to exhaustively detect structural variations (SVs) in genomes. Although various methods for detecting SVs have been developed, the global structure of chromosomes, i.e., how segments in a reference genome are extracted and ordered in an unknown target genome, cannot be inferred by detecting only individual SVs.

Results

Here, we formulate the problem of inferring the global structure of chromosomes from SVs as an optimization problem on a bidirected graph. This problem takes into account the aberrant adjacencies of genomic regions, the copy numbers, and the number and length of chromosomes. Although the problem is NP-complete, we propose its polynomial-time solvable variation by restricting instances of the problem using a biologically meaningful condition, which we call the weakly connected constraint. We also explain how to obtain experimental data that satisfies the weakly connected constraint.

Conclusion

Our results establish a theoretical foundation for the development of practical computational tools that could be used to infer the global structure of chromosomes based on SVs. The computational complexity of the inference can be reduced by detecting the segments of the reference genome at the ends of the chromosomes of the target genome and also the segments that are known to exist in the target genome.

Background

Next-generation sequencing (NGS) technologies have drastically reduced the cost of genome sequencing [1]. As more genomic sequences have become available, it has become clear that genomes contain many structural variations (SVs), which include large insertions, deletions, tandem duplications, and translocations. SVs have already been associated with diverse diseases [2]. For example, the fusion genes BCR-ABL and EML4-ALK play key roles in the development of cancer, and it is believed that other recurrent rearrangements remain to be discovered [3]. In cancer genomes, many SVs are occasionally concentrated in a small region of the genome [4-6]. It has been suggested that a single catastrophic mutational event, known as chromothripsis [6], causes these concentrations. A study of prostate cancer also uncovered a distinct type of complex rearrangement termed chromoplexy [7,8], wherein rearrangements are unclustered but involve multiple chromosomes. Complex genomic rearrangements have even been observed in germline mutations, resulting in serious congenital diseases [9]. Because SVs are important to the functions of the genome, various methods have been developed for finding them [10-16]. When genomic rearrangements are complex, enumerating only individual SVs is insufficient for elucidating the global structure of chromosomes, i.e., how the segments in a reference genome are extracted and ordered in an unknown target genome. Here, the reference genome is known and is a pre-existing sequenced genome of the same organism, such as the GRCh38 build of the human genome [17].

In this study, we address the problem of inferring the global structure of chromosomes based on SV data, by which we mean aberrant adjacencies of genomic regions and copy number variations (CNVs). By solving this problem, we can determine the order of the genomic regions in the target genome. This order affects the structure of proteins if the genomic regions contain coding regions, and the regulation of genes if the genomic regions include promoters or enhancers. In addition, raw SV data could be corrected by inferring the global structure of chromosomes because an optimal global structure would ignore false positive detections of aberrant adjacencies or correct wrongly estimated copy numbers. The task of inferring chromosomes is formulated as an optimization problem on a graph, which we term a chromosome graph. Our contributions are summarized as follows:

• To infer the global structure of chromosomes, we formulate a computational problem that takes into account the number and length of chromosomes, as well as aberrant adjacencies and CNVs caused by genomic rearrangements. By taking SV data as the input, relatively low-depth NGS sequencing can be used.

• We prove that the problem is NP-complete.

• We propose a biologically meaningful restriction that makes the problem solvable in polynomial time. We also show an algorithm that solves the restricted problem.

Oesper et al. [18] presented a pioneering work that aimed to infer the global structure of chromosomes from SV data. They formulated the copy number and adjacency genome reconstruction problem. Their formulation is based on graphs that they termed interval-adjacency graphs. These graphs are essentially the same as our chromosome graphs, except that we used bidirected graphs [19,20] while they used alternating paths to exclude paths on the graph that do not correspond to chromosomes. They also implemented an efficient algorithm called paired-end reconstruction of genome organization (PREGO) that solved their problem and obtained promising results. Our work includes the following results that were not addressed by Oesper et al. First, we present a formulation that takes into account the number and length of chromosomes determined experimentally. Second, we prove that the problem is NP-complete. Finally, we propose a variation of the problem that can be solved in polynomial time.

Some methods can also be applied to analyze the global structure of genomes by using non-SV data. First, de novo sequence assembly aims at reconstructing target genomes from raw NGS sequences [19,21-25]. It includes a step to order fragments of genomes obtained by assembling NGS sequences. The step is usually implemented as an optimization problem, involving searching for paths that cover all vertices or all edges corresponding to substrings of genome sequences [19,21]. By contrast, we allow some vertices and edges to be ignored because some portions of the reference genome might not appear in the target genome. Second, reference-assisted assembly [26], also known as comparative assembly [27], aims at ordering segments of an unknown target genome by using known genomes of other organisms. By contrast, we order segments so that the chromosomes in the solution are most consistent with the SV data and the experimentally determined number and length of chromosomes. Finally, methods based on permutations of integers [28] compare two genomes represented by two sequences of integers corresponding to genes or markers in the genome. Instead of using such sequences, we exploit SV data.

The rest of this paper is organized as follows. First, we present types of experimental data from which we infer the global structure of chromosomes. Next, we give our formulation of the problem of inferring the global structure of chromosomes, and show that the problem is NP-complete. Then, we show a variation of the problem that is solvable in polynomial time. Finally, we discuss our results and state our conclusions.

Results

Experimental data

We assume the following experimental data as input.

Aberrant adjacencies

In the target genome, distant segments in the reference genome may be adjacent because of rearrangements (Figure 1). Such aberrant adjacencies are detected by using NGS technologies as follows. First, NGS technologies can generate read pairs that are a few hundred bases apart from each other in the target genome. If two reads of a pair are not mapped to the reference genome with the expected orientations and mapped distance, the pair is called a discordant pair and is likely to be caused by SVs [12-14]. Second, if the alignment of a read and reference genome is split into more than one portion, such a split read also indicates a rearrangement [16]. A breakpoint is a position at a boundary of a rearrangement. Here, we ignore small differences between the real breakpoints and their estimations.

Copy numbers

The number of occurrences of a subsequence in the reference genome may change because of rearrangements. This phenomenon results in copy number variations (CNVs). Traditionally, CNVs have been analyzed by using DNA microarrays [11]. Several recent methods detect CNVs by finding changes in the depth of coverage of NGS sequences [4,15]. Although tumor samples are usually a mixture of normal cells and various tumor cells, the copy numbers of a cancer cell can still be estimated by single-cell analysis [29]. In this paper, for the sake of conciseness, the boundaries of CNVs are also called breakpoints.

Number of chromosomes and truncations

Identifying chromosomes and finding aberrant chromosomes by microscopy is an important part of clinical diagnostics [30]. The number of chromosomes, denoted by n_N in this paper, is available after inspection. Throughout this paper, we assume that n_N ≥ 1. In addition, we take into account the number of chromosomal truncations, which we denote as n_T. Chromosomal truncations are detected as a decrease in copy numbers without aberrant adjacencies. We consider n_N and n_T to improve the inference of the global structure of chromosomes from SV data.

Chromosome length

The length of chromosomes can be estimated experimentally from flow karyotyping, and, approximately, from microscopic images [31]. Here, the estimated length is denoted by λ_i for 1 ≤ i ≤ N_L, where N_L (≥ n_N) is the maximum possible number of chromosomes.

Problem definition

Any instance of our problem is modeled as a graph that we term a chromosome graph. The graph contains elements derived from the reference genome and experimental data. Each vertex corresponds to a location in the reference genome. In addition, each edge corresponds to either a segment in the reference genome, an adjacency of flanking segments in the reference genome, or an aberrant adjacency in the target genome caused by rearrangements.

We assume that the target genome is a set of chromosomes, each of which is a concatenation of segments in the reference genome. Each chromosome in the target genome is represented as a path on the graph, and these paths explain how segments in the reference genome are incorporated into the target genome. The goodness of the estimated target genome is measured by a cost function, and we search for an optimal set of chromosomes that minimizes this cost function.

We first define a graph that contains some of the elements described above. Then, we extend the graph to a chromosome graph. Finally, we present the formal definition of the problem.

Prototype chromosome graph

We first construct an undirected graph called a prototype chromosome graph, G = (V, E) (Figure 2). Let N_C be the number of chromosomes of the reference genome and n_i be the number of breakpoints in the i-th chromosome of the reference genome. Then, V contains the following vertices.

• Vertices corresponding to breakpoints:

V_{M} = \{v_{i, j} \mid 1 \leq i \leq N_{C}, 1 \leq j \leq n_{i}\} .

• Vertices corresponding to the beginning of chromosomes in the reference genome:

V_{5} = \{v_{i, 0} \mid 1 \leq i \leq N_{C}\} .

• Vertices corresponding to the end of chromosomes in the reference genome:

V_{3} = \{v_{i, n_{i} + 1} \mid 1 \leq i \leq N_{C}\} .

Then, we define V = V₅ ∪ V₃ ∪ V_M.

Next, we define a set of edges, E. We make the following two types of edges.

• Edges corresponding to segments between two breakpoints that are next to each other in the reference genome. For each 1 ≤ i ≤ N_C and 0 ≤ j ≤ n_i, we make an edge e_{i,j} = (v_{i,j}, v_{i,j+1}).

• Edges corresponding to an aberrant adjacency of two segments in the reference genome. Let N_A be the number of detected aberrant adjacencies. For the k-th aberrant adjacency (1 ≤ k ≤ N_A) that links positions corresponding to $v_{i_{1}, j_{1}}$ and $v_{i_{2}, j_{2}}$, we make an edge $e_{L k} = (v_{i_{1}, j_{1}}, v_{i_{2}, j_{2}})$.

Then, we define

\begin{gathered} E_{S} = \{e_{i, j} \mid 1 \leq i \leq N_{C}, 0 \leq j \leq n_{i}\}, \\ E_{L} = \{e_{L k} \mid 1 \leq k \leq N_{A}\}, \\ E = E_{S} \cup E_{L} . \end{gathered}
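The construction above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `build_prototype_graph`, the encoding of a vertex v_{i,j} as the tuple `(i, j)`, and the toy input are assumptions made for this example and are not part of the original formulation.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[int, int]            # (i, j) stands for v_{i,j}
Edge = Tuple[Vertex, Vertex]


def build_prototype_graph(breakpoints: Dict[int, int], adjacencies: List[Edge]):
    """Build the vertex and edge sets of a prototype chromosome graph.

    breakpoints maps a reference chromosome index i to its breakpoint count n_i.
    adjacencies lists the detected aberrant adjacencies as pairs of vertices.
    """
    V5 = [(i, 0) for i in breakpoints]                       # chromosome starts
    VM = [(i, j) for i, n in breakpoints.items() for j in range(1, n + 1)]
    V3 = [(i, n + 1) for i, n in breakpoints.items()]        # chromosome ends
    # One segment edge e_{i,j} = (v_{i,j}, v_{i,j+1}) for 0 <= j <= n_i.
    ES = [((i, j), (i, j + 1)) for i, n in breakpoints.items() for j in range(n + 1)]
    known = set(V5) | set(VM) | set(V3)
    for u, v in adjacencies:
        if u not in known or v not in known:
            raise ValueError(f"adjacency {(u, v)} uses an unknown vertex")
    EL = list(adjacencies)                                   # aberrant adjacencies
    return V5, VM, V3, ES, EL


# Hypothetical input: two reference chromosomes with 2 and 1 breakpoints,
# and one aberrant adjacency linking v_{1,1} and v_{2,1}.
V5, VM, V3, ES, EL = build_prototype_graph({1: 2, 2: 1}, [((1, 1), (2, 1))])
```

A reference chromosome with n_i breakpoints contributes n_i + 1 segment edges, so the toy input yields five edges in E_S.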

Chromosome graph

In a prototype chromosome graph, a path might visit two edges in E_L contiguously. Such a path does not correspond to a real chromosome. To exclude such paths, we use a technique similar to that of Oesper et al. [18]. Although Oesper et al. [18] used alternating paths, their formulation can be represented by using a bidirected graph whose edges have directions at both ends [19,32]. We directly define our graph by using a bidirected graph (Figure 3). Let d(e, v) ∈ {+,−} be the direction of an edge e at a vertex v, and −d(e, v) be the opposite direction of d(e, v).

• Each vertex v_{i,j} ∈ V_M is split into two vertices $v_{i, j}^{+}$ and $v_{i, j}^{-}$. The set V_M is redefined as

V_{M} = \{v_{i, j}^{-}, v_{i, j}^{+} \mid 1 \leq i \leq N_{C}, 1 \leq j \leq n_{i}\} .

Vertices in V₅ and V₃ are renamed so that

\begin{gathered} V_{5} = \{v_{i, 0}^{-} \mid 1 \leq i \leq N_{C}\}, \\ V_{3} = \{v_{i, n_{i} + 1}^{+} \mid 1 \leq i \leq N_{C}\} . \end{gathered}

• An edge e_{i,j} = (v_{i,j}, v_{i,j+1}) ∈ E_S is reconnected to $v_{i, j}^{-}$ and $v_{i, j + 1}^{+}$. In addition, $d (e_{i, j}, v_{i, j}^{-}) = -$ and $d (e_{i, j}, v_{i, j + 1}^{+}) = +$.

• Let e ∈ E_L be an edge connected to v_{i,j} in the prototype chromosome graph. If e corresponds to an aberrant adjacency involving the segment that stretches toward v_{i,j+1}, e is reconnected to $v_{i, j}^{-}$ and $d (e, v_{i, j}^{-})$ is set to '+'. Otherwise, e is reconnected to $v_{i, j}^{+}$ and $d (e, v_{i, j}^{+})$ is set to '−'.

• We add the following set of new edges:

E_{R} = \{{\hat{e}}_{i, j} = (v_{i, j}^{+}, v_{i, j}^{-}) \mid 1 \leq i \leq N_{C}, 1 \leq j \leq n_{i}\} .

Directions are set so that $d ({\hat{e}}_{i, j}, v_{i, j}^{+}) = -$ and $d ({\hat{e}}_{i, j}, v_{i, j}^{-}) = +$.

The modified graph represents a chromosome graph.
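As a rough illustration of the rules above, the following Python sketch assigns directions to the three edge types. The encoding is an assumption made for this example: a vertex is a tuple `(i, j, sign)`, an edge is a pair of `(vertex, direction)` ends, and each end of an aberrant adjacency is given as `(i, j, toward_next)`, where the hypothetical flag `toward_next` is true when the adjacency involves the segment stretching toward v_{i,j+1}.

```python
def attach(i, j, toward_next):
    """Return (vertex, direction) for one end of an aberrant adjacency at v_{i,j}."""
    if toward_next:
        return ((i, j, "-"), "+")    # reconnect to v^-_{i,j} with direction '+'
    return ((i, j, "+"), "-")        # reconnect to v^+_{i,j} with direction '-'


def build_chromosome_graph(breakpoints, adjacencies):
    """Return the bidirected edge sets (ES, ER, EL) keyed by edge name.

    breakpoints maps a reference chromosome index i to n_i.
    adjacencies is a list of pairs of (i, j, toward_next) ends.
    """
    ES, ER, EL = {}, {}, {}
    for i, n in breakpoints.items():
        for j in range(n + 1):
            # Segment edge e_{i,j}: '-' at v^-_{i,j} and '+' at v^+_{i,j+1}.
            ES[("S", i, j)] = (((i, j, "-"), "-"), ((i, j + 1, "+"), "+"))
        for j in range(1, n + 1):
            # Reference adjacency: '-' at v^+_{i,j} and '+' at v^-_{i,j}.
            ER[("R", i, j)] = (((i, j, "+"), "-"), ((i, j, "-"), "+"))
    for k, (end1, end2) in enumerate(adjacencies, start=1):
        EL[("L", k)] = (attach(*end1), attach(*end2))
    return ES, ER, EL


# Hypothetical input: the segment before v_{1,1} is joined to the segment
# after v_{2,1}.
ES, ER, EL = build_chromosome_graph({1: 1, 2: 1}, [((1, 1, False), (2, 1, True))])
```

With these directions, an edge in E_L and an edge in E_R always have the same direction at a shared vertex, so a path can never pass from one to the other without a segment edge in between.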

Paths and chromosomes

A path c = v₁e₁v₂e₂v₃ ... e_l v_{l+1} on a chromosome graph G is an alternating sequence of vertices and edges, which has the following properties:

• The first and the last elements of c are vertices.

• For any subsequence of the form e_k v_{k+1} e_{k+1} (1 ≤ k ≤ l − 1), d(e_k, v_{k+1}) = −d(e_{k+1}, v_{k+1}).

A path c is said to visit an edge e if c contains e. Similarly, c is said to visit a vertex v if c contains v. When a path is written as a sequence of vertices and edges, for simplicity, we omit the vertices if they are clear from the context. Let C = {c₁, c₂, ..., c_|C|} be a multi-set of paths on G. We define C as a multi-set so that more than one identical path can exist. In addition, let m(c, e) be the number of times c visits an edge e, and $m (C, e) = \sum_{c_{i} \in C} m (c_{i}, e)$. A cycle is a path whose first and last vertices are identical and whose first and last edges have opposite directions at that vertex. A chromosome on G is a path whose first and last edges are both in E_S.
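The definitions of a path, a chromosome, and m(C, e) can be checked mechanically. The sketch below is one possible encoding, not the authors' implementation: an edge is stored as a dictionary from its two end vertices to the direction at each end (which assumes the two ends are distinct vertices), and the names `is_path`, `is_chromosome`, and `visit_counts` are hypothetical.

```python
from collections import Counter

OPPOSITE = {"+": "-", "-": "+"}


def is_path(edges, seq):
    """Check that seq = [v1, e1, v2, ..., e_l, v_{l+1}] is a path.

    edges maps an edge name to {vertex: direction}; this representation
    assumes the two ends of every edge are distinct vertices.
    """
    if len(seq) % 2 == 0:            # must start and end with a vertex
        return False
    verts, names = seq[0::2], seq[1::2]
    for k, name in enumerate(names):
        ends = edges.get(name)
        if ends is None or set(ends) != {verts[k], verts[k + 1]}:
            return False
    # Consecutive edges must have opposite directions at their shared vertex.
    for k in range(len(names) - 1):
        v = verts[k + 1]
        if edges[names[k]][v] != OPPOSITE[edges[names[k + 1]][v]]:
            return False
    return True


def is_chromosome(edges, seq, segment_edges):
    """A chromosome is a path whose first and last edges are both in E_S."""
    names = seq[1::2]
    return (is_path(edges, seq) and len(names) > 0
            and names[0] in segment_edges and names[-1] in segment_edges)


def visit_counts(paths):
    """Return m(C, e) for every edge e visited by the multi-set of paths."""
    return Counter(name for seq in paths for name in seq[1::2])


# Toy graph: one reference chromosome with a single breakpoint.
# a = v^-_{1,0}, b = v^+_{1,1}, c = v^-_{1,1}, d = v^+_{1,2}.
edges = {
    "e10": {"a": "-", "b": "+"},
    "r11": {"b": "-", "c": "+"},
    "e11": {"c": "-", "d": "+"},
}
reference = ["a", "e10", "b", "r11", "c", "e11", "d"]
turn_back = ["a", "e10", "b", "e10", "a"]   # re-enters e10 with the same direction
```

In this toy graph, `reference` passes the direction test at both interior vertices, whereas `turn_back` fails it at b because the same end of e10 is used twice in a row.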

Copy numbers and lengths

Two integers are assigned to each e ∈ E. First, n(e) for e ∈ E_S represents an experimentally estimated copy number of the corresponding segment in the reference genome. Second, |e| for e ∈ E_S represents the length of the corresponding segment in the reference genome. For e ∈ E_L ∪ E_R, we set n(e) and |e| to 0. The length of a path c is defined as $| c | = \sum_{e \in E} | e | m (c, e)$. To describe all properties of e together, we use the following notation:

e = ⟨ d (e, v_{1}) v_{1}, d (e, v_{2}) v_{2}, n (e), | e | ⟩ .

Upper bound on parameters

Campbell et al. [4] presented examples of amplified regions in cancer cells. The copy numbers were less than 100 in these regions. Therefore, we assume that the copy numbers are at most in the hundreds. We also assume that short repeat elements are masked in advance in order to exclude segments that appear spuriously. Based on the details given above, we assume that n_N, n_T, and n(e) for e ∈ E_S are all less than a fixed constant U. The value of U does not have to be determined because U is only used in the analysis of computational complexity.

Formulation of the problem

To find an optimal set of chromosomes, we define an optimization problem over a chromosome graph. We define a cost function to be used as a target function of the optimization problem. This function imposes costs on the number of chromosomes, the number of chromosomal truncations, and the number of visits to edges, penalizing for deviations from those that are experimentally expected.

Let C = {c₁, c₂, ..., c_|C|} be a multi-set of chromosomes on G, and w_N(C) be the cost of the difference between n_N and |C|. Also let Tr(C) be the number of ends of chromosomes in V_M, and w_T(C) be the cost of the difference between n_T and Tr(C). In addition, w(e, x) for e ∈ E_S is defined as the cost when e is visited x times. For e ∈ E_L ∪ E_R, w(e, x) is set to 0.

We assume that w_N(C), w_T(C), and w(e, x) for e ∈ E_S monotonically increase as ||C| − n_N|, |Tr(C) − n_T|, and |x − n(e)| increase, respectively. Then, we define the cost function W(C) as follows:

W (C) = w_{N} (C) + w_{T} (C) + \sum_{e \in E} w (e, m (C, e)) .

(1)

We assume that each term is 0 if and only if

\left. \begin{gathered} | C | = n_{N}, \\ Tr (C) = n_{T}, \\ m (C, e) = n (e) \text{ for } e \in E_{S} . \end{gathered} \right\}

(2)

With these notations, we formulate the problem of inferring the global structure of chromosomes as follows:

Definition 1 (Chromosome problem (ChrP)) Suppose that we are given a chromosome graph G = (V, E), a cost function W(C), and parameters λ_i (1 ≤ i ≤ N_L), where N_L is the maximum possible number of chromosomes. Then, find a multi-set of chromosomes C on G that minimizes W(C) under the constraint that |c_i| ≤ λ_i for c_i ∈ C.

Although a similar problem was proposed previously [18], its computational complexity was not analyzed.

Theorem 1 ChrP is NP-complete.

In the Methods section, we prove Theorem 1.

Polynomial-time solvable variation

We propose a variation of ChrP that is solvable in polynomial time. For e ∈ E_L ∪ E_R, it is highly likely that m(C, e) ≥ 1 if e is supported by a large number of paired reads. Therefore, it is worth considering a variation in which some edges in E_L ∪ E_R must appear in the target genome. We refer to these edges as required edges. In addition, because chromosomal truncations can be detected, it is also worth considering a variation in which we know where the ends of the chromosomes of the target genome exist in the reference genome. Because the definition of W(C) is abstract, we focus on a cost function such that

\left. \begin{gathered} w_{N} (C) = Q_{N} \bigl| | C | - n_{N} \bigr|, \\ w_{T} (C) = Q_{T} | Tr (C) - n_{T} |, \\ w (e, x) = | e | \, | x - n (e) |, \end{gathered} \right\}

(3)

where Q_N and Q_T are constants given as parameters. The values of Q_N and Q_T are tuned in advance so that known global structures of genomes are well reconstructed.
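For concreteness, the cost function W(C) of (1) with the terms of (3) can be evaluated as in the following Python sketch. The function name `chromosome_cost`, its argument names, and the numbers in the example are assumptions chosen for illustration; the values of Q_N and Q_T are arbitrary here.

```python
def chromosome_cost(m, n, length, num_chromosomes, num_truncated_ends,
                    n_N, n_T, Q_N, Q_T):
    """Evaluate W(C) = w_N(C) + w_T(C) + sum_e w(e, m(C, e)) under (3).

    m      : visit counts m(C, e) for segment edges (missing edges count as 0)
    n      : estimated copy numbers n(e) for every segment edge in E_S
    length : segment lengths |e| for every segment edge in E_S
    Edges in E_L and E_R are omitted because their cost is 0.
    """
    w_N = Q_N * abs(num_chromosomes - n_N)          # chromosome count term
    w_T = Q_T * abs(num_truncated_ends - n_T)       # truncation term
    w_E = sum(length[e] * abs(m.get(e, 0) - n[e]) for e in n)
    return w_N + w_T + w_E


# Hypothetical instance with two segment edges.
n = {"e10": 1, "e11": 2}
length = {"e10": 100, "e11": 50}
cost_perfect = chromosome_cost({"e10": 1, "e11": 2}, n, length, 1, 0,
                               n_N=1, n_T=0, Q_N=1000, Q_T=10)
cost_off = chromosome_cost({"e10": 1, "e11": 1}, n, length, 1, 1,
                           n_N=1, n_T=0, Q_N=1000, Q_T=10)
```

In the second call, one unexpected truncated end costs Q_T = 10 and one missing copy of a segment of length 50 costs 50, so the copy number term is weighted by segment length as in (3).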

Weakly connected constraint

Let G = (V, E) be a general bidirected graph. A subgraph g of G is a weakly connected component if g is a connected component when all directions are removed [33]. In addition, g is maximal if g is not a subgraph of a larger weakly connected component. For a subset E' of E, we define CC(G,E') as a set of maximal weakly connected components of a graph induced from G by removing the edges not in E'.

Definition 2 (Weakly connected constraint (WCC)) Let G = (V, E) be a chromosome graph. Also let V_W and E_W be subsets of V and E, respectively. Each g ∈ CC(G, E_W) is good if g contains at least one vertex in V_W. Then, G satisfies the weakly connected constraint (WCC) if all g ∈ CC(G, E_W) are good.

We use WCC by setting V_W to a set of vertices that correspond to ends of chromosomes in the target genome, and E_W = {e ∈ E_S | n(e) ≥ 1} ∪ {e ∈ E_L ∪ E_R | e is required}. See Figure 4 for an example. An instance that satisfies WCC can be obtained as follows. First, V_W is obtained by finding the positions of chromosomal truncations, as well as the ends of the chromosomes of the reference genome that remain in the target genome. Because a chromosome that does not include detected ends can be in a solution, V_W does not need to contain all ends of chromosomes in the target genome. We assume that n_T ≥ |V_W|. Next, if g ∈ CC(G, E_W) is not good, edges e ∈ E on some path connecting g and a good g' ∈ CC(G, E_W) are added to E_W. To do this, if possible, we experimentally confirm that n(e) ≥ 1 if e ∈ E_S or that e is required if e ∈ E_L ∪ E_R. Finally, if some g ∈ CC(G, E_W) that are not good still remain, edges in g are forcibly removed from E_W by setting n(e) to 0 if e ∈ E_S or by marking e as not required if e ∈ E_L ∪ E_R.

**An example of a chromosome graph that satisfies WCC**. Gray circles are vertices in V_W and thick arrows are edges in E_W.
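Whether a given instance satisfies WCC can be tested with a union-find pass over E_W, as in the Python sketch below. Two assumptions are made for this example and are not stated in the text: only components that contain at least one edge of E_W are checked (isolated vertices are ignored), and edge directions are dropped by passing each edge as a plain pair of vertices. The function name `satisfies_wcc` is hypothetical.

```python
def satisfies_wcc(edges_W, V_W):
    """Return True if every weakly connected component spanned by edges_W
    contains at least one vertex of V_W.

    edges_W : iterable of (u, v) vertex pairs, one per edge in E_W
    V_W     : set of vertices known to be ends of target chromosomes
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges_W:
        parent[find(u)] = find(v)           # merge the two components

    # Roots of the components that contain a vertex of V_W are "good".
    good_roots = {find(v) for v in V_W if v in parent}
    return all(find(x) in good_roots for x in list(parent))


# Hypothetical instance with two components, {a, b, c} and {d, e}.
edges_W = [("a", "b"), ("b", "c"), ("d", "e")]
```

With V_W = {a} the component {d, e} is not good, which corresponds to the situation in which edges must be added to or removed from E_W as described above.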

Definition 3 (Chromosome problem with WCC (ChrW)) Let G = (V, E) be a chromosome graph that satisfies WCC with respect to some V_W ⊂ V and E_W ⊂ E. Then, find a set C of chromosomes on G that minimizes W(C), where the terms of W(C) are given by (3).

Theorem 2 The problem ChrW can be solved in O(|E|² log |V| log |E|) time.

See the Methods section for the algorithm that solves ChrW.

Restriction on the length of chromosomes

In ChrW, we removed restrictions on the length of chromosomes. This relaxation is necessary to make the problem solvable in polynomial time.

Definition 4 (ChrW with restriction on length (ChrL)) ChrW with restriction on length (ChrL) is the same problem as ChrW, except that the length of each chromosome c_i is bounded by a parameter λ_i (1 ≤ i ≤ N_L), where N_L is the maximum possible number of chromosomes.

Theorem 3 The problem ChrL is NP-complete.

See the Methods section for the proof that ChrL is NP-complete.

Discussion

Handling practical situations

Solutions to the chromosome problems are affected by errors in given SV data. However, some errors can be mitigated as follows. First, a false positive aberrant adjacency may be correctly ignored in the optimal solution because a set of chromosomes that uses such an adjacency is expected to have a larger cost than those ignoring the adjacency. Second, the effects of a missing aberrant adjacency may be limited to segments including its ends because a chromosome that contains the missing adjacency may be recognized as two split chromosomes. Finally, there is a chance that incorrect copy numbers will be corrected if they are inconsistent with other SVs.

In addition to segments in the reference genome, our method can handle newly inserted fragments not in the reference genome. Such a fragment is incorporated into a chromosome graph as a new chromosome. In particular, an edge e, where |e| is equal to the length of the fragment, is added to E_S, and edges that connect vertices in a chromosome graph to e are added to E_L. If any breakpoints are contained within the new fragment, vertices and edges are added to V_M and E_R, respectively. If a breakpoint corresponds to any aberrant adjacency, edges are also added to E_L.

If a gene duplication has occurred in the target genome, it causes an increased copy number and aberrant adjacencies flanking the gene. If it is a tandem duplication, an aberrant adjacency connecting the upstream and downstream regions of the gene should exist. If these SVs exist in given SV data, any solution to our problem has to take into account gene duplication.

Limitations

A mixture of many cells cannot be handled because it is difficult to correctly estimate copy numbers. However, our method may generate meaningful results for data obtained from multiple cells if the sum of copy numbers is correctly estimated. In this case, the solution is a mixture of chromosomes of all cells in the sample, although some of the chromosomes might be fused.

Note that many optimal solutions may exist depending on how an optimal circulation is converted into chromosomes (Figure 5). Choosing the right solution requires additional information such as the mate-pairs of long genomic fragments, or the result of experiments involving such techniques as fluorescence in situ hybridization (FISH) that indicate whether or not distant genomic regions are in the same chromosome.

**An example of a chromosome graph that has more than one optimal solution**. Bold digits represent an optimal circulation on this graph. The chromosome graph in this figure has two optimal solutions, {e_{1,0} e_{L1} e_{2,1} e_{L2} e_{1,2}, e_{2,0} ê_{2,1} e_{2,1} ê_{2,2} e_{2,2}} and {e_{1,0} e_{L1} e_{2,1} ê_{2,2} e_{2,2}, e_{2,0} ê_{2,1} e_{2,1} e_{L2} e_{1,2}}. Edges in E_N ∪ E_D are omitted, and the flow on each edge in E_D has been subtracted from the flow of the corresponding edge in E_S.

Toward implementation

For implementation, we require an algorithm that can calculate an optimal circulation on the bidirected graph. It would be difficult to implement Gabow's algorithm because no efficient implementation is currently known. Another option would be to use Medvedev's algorithm [19]. Any solver for general integer programming could also be used, as demonstrated by Oesper et al. [18], although the computational time bound is not guaranteed.

Conclusions

Continuing technological innovations in DNA sequencing will, in the future, allow the prediction of an enormous number of SVs. However, detecting only individual SVs cannot reveal the global structure of chromosomes. Here, we formulated the problem of inferring chromosomes from the aberrant adjacencies of genomic regions, copy number variations (CNVs), and the number and length of chromosomes. The problem, which we term the chromosome problem (ChrP), was proved to be NP-complete. However, if an instance of ChrP satisfies a constraint, which we call the weakly connected constraint (WCC), and if the length of chromosomes is ignored, the problem can be solved in O(|E|² log |V| log |E|) time.

This work provides a theoretical basis for the development of practical computational tools that are emerging for use in analysis of the global structure of chromosomes based on SVs.

Methods

In this section, we show how we proved the theorems stated in the Results section.

Proof of Theorem 1

We first present an upper bound on the size of an optimal solution of ChrP to show that ChrP is in NP. Then, we prove that ChrP is NP-hard.

Lemma 1 Let G = (V, E) be a chromosome graph. Also, let C be a multi-set of chromosomes on G that minimizes W(C) such that |c_i| ≤ λ_i for c_i ∈ C. Then, C has at most U(4|V| + 1)(|E| + 1) edges.

Proof Let c ∈ C be a chromosome in C. We define an edge e in c as non-excessive if e ∈ E_S and m(C, e) ≤ n(e), and excessive otherwise. Let t_c be the number of non-excessive edges visited by c. If t_c > 0, c can be written as c = p₁e₁p₂e₂ ... e_{t_c} p_{t_c+1}, where e_k (1 ≤ k ≤ t_c) is a non-excessive edge and p_k (1 ≤ k ≤ t_c + 1) is a possibly empty path that contains only excessive edges (Figure 6). If p_k contains a cycle as its subpath, the cycle can be removed to decrease W(C), a contradiction. Accordingly, p_k does not contain a cycle. This implies that p_k visits at most 2|V| vertices and, thus, 2|V| edges. Therefore, at most, 4|V| excessive edges are visited for each non-excessive edge. Note that a non-excessive edge e can be visited, at most, n(e) times. Therefore, $\sum_{c \in C} t_{c} \leq \sum_{e \in E_{S}} n (e)$.

**An example of a chromosome that consists of non-excessive and excessive edges**. Straight arrows represent non-excessive edges, while jagged lines represent sequences of excessive edges.

Chromosomes such that t_c = 0 can exist only if they contribute to the decrease of the first or the second term of W(C) defined by (1). Accordingly, the number of such chromosomes is, at most, n_N + n_T. In addition, a chromosome c such that t_c = 0 does not contain any cycles because such a cycle can be removed to decrease W(C). Therefore, c visits, at most, 2|V| vertices and, thus, 2|V| edges.

Consequently, C contains, at most, 2|V|(n_N + n_T) + (4|V| + 1) ∑_{e∈E_S} n(e) ≤ U(4|V| + 1)(|E| + 1) edges. □
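The final inequality can be spelled out as follows. This is only a worked restatement that uses the assumption from the section on upper bounds, namely that n_N, n_T, and every n(e) are less than U, together with |E_S| ≤ |E|:

```latex
\begin{aligned}
2|V|(n_N + n_T) + (4|V| + 1) \sum_{e \in E_S} n(e)
  &\leq 2|V| \cdot 2U + (4|V| + 1)\, U\, |E_S| \\
  &\leq U(4|V| + 1) + U(4|V| + 1)\, |E| \\
  &= U(4|V| + 1)(|E| + 1).
\end{aligned}
```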

Lemma 2 The problem ChrP is in NP.

Proof Once an optimal solution C is given, whether or not W(C) is greater than a given constant can be determined in O(|V||E|) time by Lemma 1. □

Lemma 3 The problem ChrP is NP-hard.

Proof The Hamiltonian Cycle problem (HC) is a problem of finding a cycle that visits each vertex of a graph exactly once, and is a well-known NP-complete problem [34]. Here, we reduce HC to ChrP. Consider HC on a directed graph H = (V', E'), where $V^{'} = \{v_{1}^{'}, v_{2}^{'}, \dots, v_{| V^{'} |}^{'}\}$ is a set of vertices and E' is a set of edges. We construct a chromosome graph G = (V, E) from H (Figure 7), where

**An instance of ChrP for solving the Hamiltonian Cycle problem (HC)**. In this graph, solid edges are constructed for each vertex in a graph H of HC, whereas dashed edges correspond to edges in H.

V = \bigcup_{1 \leq i \leq | V^{'} |} \{v_{i, 0}^{-}, v_{i, 1}^{+}, v_{i, 1}^{-}, v_{i, 2}^{+}, v_{i, 2}^{-}, v_{i, 3}^{+}\}

is a set of vertices, and E = E_S ∪ E_L ∪ E_R is a set of edges. Here, E_S consists of

\begin{gathered} e_{1, 0} = ⟨ - v_{1, 0}^{-}, + v_{1, 1}^{+}, 1, 1 ⟩, \\ e_{1, 1} = ⟨ - v_{1, 1}^{-}, + v_{1, 2}^{+}, 2, 1 ⟩, \\ e_{1, 2} = ⟨ - v_{1, 2}^{-}, + v_{1, 3}^{+}, 1, 1 ⟩, \\ e_{i, 0} = ⟨ - v_{i, 0}^{-}, + v_{i, 1}^{+}, 0, 1 ⟩ (2 \leq i \leq | V^{'} |), \\ e_{i, 1} = ⟨ - v_{i, 1}^{-}, + v_{i, 2}^{+}, 1, 1 ⟩ (2 \leq i \leq | V^{'} |), \\ e_{i, 2} = ⟨ - v_{i, 2}^{-}, + v_{i, 3}^{+}, 0, 1 ⟩ (2 \leq i \leq | V^{'} |) . \end{gathered}

E_R consists of

\begin{gathered} {\hat{e}}_{i, 1} = ⟨ - v_{i, 1}^{+}, + v_{i, 1}^{-}, 0, 0 ⟩ (1 \leq i \leq | V^{'} |), \\ {\hat{e}}_{i, 2} = ⟨ - v_{i, 2}^{+}, + v_{i, 2}^{-}, 0, 0 ⟩ (1 \leq i \leq | V^{'} |) . \end{gathered}

E_L consists of

e_{i^{'} : i} = ⟨ - v_{i^{'}, 2}^{+}, + v_{i, 1}^{-}, 0, 0 ⟩ ((v_{i^{'}}^{'}, v_{i}^{'}) \in E^{'}) .

In addition, we set n_N = 1, n_T = 0, and λ_i = |V'| + 3 for any i. Then, we prove that H has a Hamiltonian cycle if, and only if, ChrP on G has a solution C such that W(C) = 0. Suppose that h is a Hamiltonian cycle on H. Let c be a chromosome that begins with $e_{1, 0} {\hat{e}}_{1, 1} e_{1, 1}$, then visits $e_{i^{'} : i} e_{i, 1}$ in the order that edges $(v_{i^{'}}^{'}, v_{i}^{'})$ appear in h from i' = 1, and finally ends with $e_{1, 1} {\hat{e}}_{1, 2} e_{1, 2}$. Then, a set of a single chromosome C = {c} satisfies W(C) = 0 and $| c | = | V^{'} | + 3 \leq λ_{1}$.

Conversely, let C be a solution of ChrP that satisfies W(C) = 0. Because (2) holds, |C| = 1, Tr(C) = 0, and m(C, e) = n(e). Let c be the only chromosome in C. Because n(e_{1,1}) = 2 and n(e_{i,1}) = 1 for 2 ≤ i ≤ |V'|, a path that visits vertices $v_{i}^{'} \in V^{'}$ in the order that e_{i,1} appears in c is a Hamiltonian cycle on H. □
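The reduction can be made concrete with a short Python sketch that builds the ChrP instance from a directed graph H. The encoding is an assumption made for illustration: an edge ⟨d₁v₁, d₂v₂, n(e), |e|⟩ is stored as the tuple `(d1, v1, d2, v2, n, length)`, a vertex as `(i, j, sign)`, and the function name `hc_to_chrp` is hypothetical. The sketch also assumes that the edge of E_L built for (v'_{i'}, v'_i) ∈ E' joins the '+' copy of vertex 2 in block i' to the '−' copy of vertex 1 in block i.

```python
def hc_to_chrp(num_vertices, arcs):
    """Build the edge sets and parameters of the ChrP instance for HC.

    num_vertices : |V'| of the directed graph H
    arcs         : iterable of (a, b) for directed edges (v'_a, v'_b), 1-based
    """
    ES, ER, EL = {}, {}, {}
    for i in range(1, num_vertices + 1):
        # Copy numbers of e_{i,0}, e_{i,1}, e_{i,2}; block 1 is special.
        copies = (1, 2, 1) if i == 1 else (0, 1, 0)
        for j in range(3):
            ES[(i, j)] = ("-", (i, j, "-"), "+", (i, j + 1, "+"), copies[j], 1)
        for j in (1, 2):
            ER[(i, j)] = ("-", (i, j, "+"), "+", (i, j, "-"), 0, 0)
    for a, b in arcs:
        EL[(a, b)] = ("-", (a, 2, "+"), "+", (b, 1, "-"), 0, 0)
    params = {"n_N": 1, "n_T": 0, "lambda": num_vertices + 3}
    return ES, ER, EL, params


# Hypothetical input: the directed 3-cycle 1 -> 2 -> 3 -> 1.
ES, ER, EL, params = hc_to_chrp(3, [(1, 2), (2, 3), (3, 1)])
# Total length demanded by the copy numbers: sum of n(e) * |e| over E_S.
required_length = sum(e[4] * e[5] for e in ES.values())
```

The total length demanded by the copy numbers is 4 for block 1 plus 1 for every other block, i.e., |V'| + 3, which is exactly the bound λ_i. A single chromosome with W(C) = 0 therefore has no room for any extra segment.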

Theorem 1 follows directly from Lemmas 2 and 3.

Proof of Theorem 2

Circulation on a bidirected graph

Let G = (V, E) be a bidirected graph, and a_{v,e} for v ∈ V and e ∈ E be an integer such that

a_{v, e} = \begin{cases} 2 & \text{if } e \text{ has two '+'-ends at } v, \\ 1 & \text{if } e \text{ has only one '+'-end at } v, \\ - 1 & \text{if } e \text{ has only one '−'-end at } v, \\ - 2 & \text{if } e \text{ has two '−'-ends at } v, \\ 0 & \text{if } e \text{ is not connected to } v . \end{cases}

Also let b_v be an integer defined for each v ∈ V, Z be the set of non-negative integers, and l(e) and u(e) be two non-negative integers assigned to each edge e ∈ E called a lower bound and an upper bound, respectively. Unless otherwise specified, in this study l(e) = 0 and u(e) = ∞.

Definition 5 A bidirected flow (biflow) [19, 20] is a mapping f : E → Z such that

l (e) \leq f (e) \leq u (e) \quad \text{for each } e \in E,

(4)

\sum_{e \in E} a_{v, e} f (e) = b_{v} \quad \text{for each } v \in V .

(5)

The cost of f is defined as W(f) = ∑_{e∈E} w(f, e), where w(f, e) is a cost of f on e ∈ E. A circulation is a biflow such that b_v = 0 for any v ∈ V.
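Conditions (4) and (5) can be verified directly for a candidate flow. The Python sketch below is illustrative only: an edge is encoded as a list of its two `(vertex, direction)` ends, and the name `is_biflow` and the toy graph are assumptions. Summing the signs of an edge's ends at a vertex reproduces a_{v,e} for the cases listed in the definition above.

```python
import math
from collections import defaultdict

SIGN = {"+": 1, "-": -1}


def is_biflow(edges, f, b=None, lower=None, upper=None):
    """Check conditions (4) and (5) for a candidate flow f.

    edges : edge name -> list of two (vertex, direction) ends
    f     : edge name -> non-negative integer flow (missing edges carry 0)
    b     : vertex -> required balance b_v (default 0, i.e., a circulation)
    lower, upper : edge name -> l(e), u(e) (defaults 0 and infinity)
    """
    b, lower, upper = b or {}, lower or {}, upper or {}
    balance = defaultdict(int)
    for name, ends in edges.items():
        x = f.get(name, 0)
        if not isinstance(x, int) or x < 0:
            return False                                   # f maps into Z
        if not lower.get(name, 0) <= x <= upper.get(name, math.inf):
            return False                                   # condition (4)
        for v, d in ends:
            balance[v] += SIGN[d] * x                      # adds a_{v,e} f(e)
    vertices = set(balance) | set(b)
    return all(balance[v] == b.get(v, 0) for v in vertices)  # condition (5)


# Toy bidirected graph: two edges between x and y, and a loop at x with
# two '+'-ends (so that a_{x,loop} = 2).
edges = {
    "p": [("x", "-"), ("y", "+")],
    "q": [("y", "-"), ("x", "+")],
    "loop": [("x", "+"), ("x", "+")],
}
```

Calling `is_biflow` without `b` tests for a circulation. For instance, sending two units along both p and q balances x and y, whereas any flow on the loop alone leaves a surplus of 2 at x per unit of flow.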

Circular chromosome graph

Definition 6 (Circular chromosome graph) Let G = (V, E) be a chromosome graph, and let v_N and v_T be new vertices. In addition, let E_N be the set of the following edges: for 1 ≤ i ≤ N_C,

\begin{gathered} e_{t} (v_{i, 0}^{-}) = ⟨ - v_{N}, + v_{i, 0}^{-}, 0, 0 ⟩, \\ e_{t} (v_{i, n_{i}}^{+}) = ⟨ - v_{N}, - v_{i, n_{i}}^{+}, 0, 0 ⟩, \\ e_{t} (v_{i, j}^{+}) = ⟨ - v_{T}, - v_{i, j}^{+}, 0, 0 ⟩ (1 \leq j \leq n_{i}), \\ e_{t} (v_{i, j}^{-}) = ⟨ - v_{T}, + v_{i, j}^{-}, 0, 0 ⟩ (1 \leq j \leq n_{i}), \end{gathered}

and

\begin{gathered} e_{T} = ⟨ - v_{N}, + v_{T}, n_{T}, Q_{T} ⟩, \\ e_{N} = ⟨ + v_{N}, + v_{N}, n_{N}, Q_{N} ⟩ . \end{gathered}

Also, let E_D be the set of the following edges for e ∈ E_S ∪ {e_N, e_T}:

\bar{e} = ⟨ - d (e, v_{i_{1}, j_{1}}) v_{i_{1}, j_{1}}, - d (e, v_{i_{2}, j_{2}}) v_{i_{2}, j_{2}}, 0, | e | ⟩,

where $v_{i_{1}, j_{1}}$ and $v_{i_{2}, j_{2}}$ are the vertices at the ends of e. The graph $\tilde{G} = (V \cup \{v_{N}, v_{T}\}, E \cup E_{N} \cup E_{D})$ is called a circular chromosome graph.

See Figure 8 for an example. Let n(e_N) = n_N and n(e_T) = n_T. For e ∈ E_S ∪ {e_N, e_T}, we set l(e) = n(e), $l (\bar{e}) = 0$, and $u (\bar{e}) = n (e)$. For e ∈ E_L ∪ E_R, we set l(e) = 1. We also set l(e_t(v)) to 1 for v ∈ V_W because these edges have to be visited in the solution.

**Figure 8. An example of a circular chromosome graph**. The problem of optimizing multiple chromosomes is converted to the problem of finding a cycle on this graph. For simplicity, e_t(·) is omitted, except for the leftmost chromosome in the reference genome.

Lemma 4 Let w(f, e) = |e|f(e) and W_0 = Q_N n_N + Q_T n_T + ∑_{e∈E} |e|n(e). For any multi-set C of chromosomes on G, there is a circulation f on $\tilde{G}$ such that

W (f) = W (C) + W_{0} .

(6)

Conversely, for any circulation f on $\tilde{G}$ that minimizes W(f), there is a multi-set C of chromosomes on G that satisfies (6). In addition, C can be calculated in $O (\sum_{e \in E \cup E_{N} \cup E_{D}} f (e))$ time.

Let E_+ = {e ∈ E ∪ E_N ∪ E_D | l(e) ≥ 1 or n(e) ≥ 1}. Note that $CC (\tilde{G}, E_{+})$ has only one weakly connected component because of WCC.

Proof First, we show that for any multi-set C of chromosomes on G, there exists a circulation f on $\tilde{G}$ that satisfies (6). Let End(v) be the number of chromosomes that begin or end with v. Consider the following f:

\begin{array}{rcll} f (e) & = & \max \{n (e), m (C, e)\} & (e \in E_{S}), \\ f (\bar{e}) & = & \max \{0, n (e) - m (C, e)\} & (e \in E_{S}), \\ f (e) & = & m (C, e) & (e \in E_{L} \cup E_{R}), \\ f (e_{t} (v)) & = & End (v) & (v \in V), \\ f (e_{N}) & = & \max \{n_{N}, | C |\}, \\ f ({\bar{e}}_{N}) & = & \max \{0, n_{N} - | C |\}, \\ f (e_{T}) & = & \max \{n_{T}, Tr (C)\}, \\ f ({\bar{e}}_{T}) & = & \max \{0, n_{T} - Tr (C)\} . \end{array}

Then, f is a circulation on $\tilde{G}$ because f satisfies (4) and (5). Thus, we observe that

w (e, m (C, e)) = | e | f (e) + | e | f (\bar{e}) - | e | n (e),

for e ∈ E_S, and

\begin{gathered} w_{N} (C) = | e_{N} | f (e_{N}) + | e_{N} | f ({\bar{e}}_{N}) - Q_{N} n_{N}, \\ w_{T} (C) = | e_{T} | f (e_{T}) + | e_{T} | f ({\bar{e}}_{T}) - Q_{T} n_{T} . \end{gathered}

Therefore, because |e| = 0 for e ∈ E_L ∪ E_R ∪ {e_t(v) | v ∈ V} and w(f, e) = |e|f(e), f satisfies (6).
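The per-edge identity used in this direction of the proof can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and it relies on the observation that, for the f defined above, the right-hand side |e|f(e) + |e|f(ē) − |e|n(e) evaluates to |e|·|m(C, e) − n(e)|. Whether that quantity coincides with the definition of w(e, ·) given earlier in the paper is taken on trust here.

```python
def flow_from_counts(n, m):
    """Return (f(e), f(e_bar)) for a required copy number n and an observed usage m."""
    return max(n, m), max(0, n - m)

def edge_cost_via_flow(length, n, m):
    """|e| f(e) + |e| f(e_bar) - |e| n(e) for the flow built from (n, m)."""
    f_e, f_bar = flow_from_counts(n, m)
    return length * f_e + length * f_bar - length * n

def check_identity(max_count=8, length=7):
    """Verify that the flow-based cost equals |e| * |m - n| on a small grid."""
    for n in range(max_count + 1):
        for m in range(max_count + 1):
            if edge_cost_via_flow(length, n, m) != length * abs(m - n):
                return False
    return True
```

The two cases behind the check are m ≥ n, where f(e) = m and f(ē) = 0, and m < n, where f(e) = n and f(ē) = n − m; in both, f(e) − f(ē) = m.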

Conversely, let f be a circulation on $\tilde{G}$ that minimizes W(f). We show how to construct a multi-set C of chromosomes on G that satisfies (6).

First, for e ∈ E_S ∪ {e_N, e_T}, we subtract $f (\bar{e})$ from f(e) and then set $f (\bar{e})$ to 0.

Second, we construct a set R of cycles such that m(R, e) = f(e) for any edge e in $\tilde{G}$ . For directed graphs, the flow decomposition theorem [35] ensures that such R can be obtained in $O (\sum_{e \in E \cup E_{N} \cup E_{D}} f (e))$ time. This is also true for bidirected graphs.

Third, we merge cycles in R. Whenever a vertex is shared by two cycles in R, they are merged into a single cycle. Because of WCC, $CC (\tilde{G}, E_{+})$ consists of only one weakly connected component. This implies that all cycles that contain edges in E_+ can be merged into a single cycle. Note that any r ∈ R contains at least one edge in E_+, because otherwise r could be removed to decrease W(f). Therefore, all cycles in R can be merged into a single cycle $\tilde{r}$.

Finally, let C be the multi-set of paths generated by removing v_N, v_T, and the edges in E_N from $\tilde{r}$. Because each c ∈ C is connected to edges in E_N in $\tilde{r}$, the first and last edges of c are in E_S owing to the directions of these edges. Accordingly, c is a chromosome. Therefore, C is a multi-set of chromosomes on G.
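The final step, cutting the merged cycle at the edges of E_N, can be sketched as follows. This is a simplified illustration, not the authors' implementation: the cycle is represented as a cyclic list of edge labels, the labels and the function name `split_cycle` are hypothetical, and the orientation in which each bidirected edge is traversed is ignored.

```python
def split_cycle(cycle, removed):
    """Cut a cyclic sequence of edge labels at every label in `removed`.

    Returns the maximal runs of remaining edges, in cyclic order.
    Raises ValueError if the cycle contains no edge to cut at.
    """
    cut_positions = [i for i, edge in enumerate(cycle) if edge in removed]
    if not cut_positions:
        raise ValueError("cycle contains no removable edge")
    # Rotate so that the sequence starts at a removed edge; no run then wraps around.
    start = cut_positions[0]
    rotated = list(cycle[start:]) + list(cycle[:start])
    paths, current = [], []
    for edge in rotated:
        if edge in removed:
            if current:
                paths.append(current)
            current = []
        else:
            current.append(edge)
    if current:
        paths.append(current)
    return paths

# Hypothetical labels: "eN" and "et_*" stand for edges of E_N, "s*" for segment
# edges, and "r1" for a reference adjacency.
example_cycle = ["s2", "et_b", "eN", "et_c", "s3", "et_d", "eN", "et_a", "s1", "r1"]
example_removed = {"eN", "et_a", "et_b", "et_c", "et_d"}
example_paths = split_cycle(example_cycle, example_removed)
```

In this toy cycle, the run `s1, r1, s2` wraps around the end of the list, which is why the sequence is rotated before it is cut.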

All of these steps can be completed in $O (\sum_{e \in E \cup E_{N} \cup E_{D}} f (e))$ time. In addition, we observe that the following equations hold:

\begin{gathered} | C | = f (e_{N}) - f ({\bar{e}}_{N}), \\ Tr (C) = f (e_{T}) - f ({\bar{e}}_{T}), \\ m (C, e) = f (e) - f (\bar{e}) (e \in E_{S}) . \end{gathered}

Accordingly, $w (e, m (C, e)) = w (f, e) + w (f, \bar{e}) - | e | n (e)$ for e ∈ E_S, and

\begin{gathered} w_{N} (C) = w (f, e_{N}) + w (f, {\bar{e}}_{N}) - Q_{N} n_{N}, \\ w_{T} (C) = w (f, e_{T}) + w (f, {\bar{e}}_{T}) - Q_{T} n_{T}, \\ w (e, m (C, e)) = 0 (e \in E_{L} \cup E_{R}) . \end{gathered}

Therefore, C satisfies (6).

By Lemma 4, the solution of ChrW can be obtained by calculating a circulation f on $\tilde{G}$ that minimizes W(f). By Lemma 1, setting u(e) = U(4|V| + 1)(|E| + 1) does not affect the solution. In addition, |E_N| = O(|E|) and |E_D| = O(|E|). Accordingly, the circulation f can be calculated in O(|E|^2 log |V| log |E|) time by using Gabow's algorithm [20]. Therefore, the optimal solution can be calculated in O(|E|^2 log |V| log |E|) time.

Proof of Theorem 3

ChrL is in NP because of Lemma 1.

Here, we show that the well-known PARTITION problem [34] can be reduced to ChrL. Let n be a positive integer and S = {i ∈ Z | 1 ≤ i ≤ n}. Also, let s(i) be an integer function defined for i ∈ S such that s(i) > 0, and let S_Σ = ∑_{i∈S} s(i). The problem of finding a subset S' ⊂ S such that

\sum_{i \in S^{'}} s (i) = \sum_{i \in S - S^{'}} s (i) = S_{Σ} / 2

is called the partition problem (hereafter referred to as PARTITION) [34]. It is well known that PARTITION is NP-complete. We reduce PARTITION to ChrL by constructing a chromosome graph whose solution for ChrL contains two chromosomes that correspond to two subsets of a solution of PARTITION.
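For concreteness, a small PARTITION instance can be solved with the standard subset-sum dynamic programme shown below. This sketch is not part of the paper; the function name `partition` is hypothetical, and indices are 1-based to match S = {1, ..., n}. Its running time is pseudo-polynomial (it grows with S_Σ rather than with the number of bits needed to write the input), so it does not conflict with the NP-completeness of PARTITION.

```python
def partition(s):
    """Return a list of 1-based indices S' with sum(s(i) for i in S') == S_Sigma / 2,
    or None if no such subset exists."""
    total = sum(s)
    if total % 2 != 0:
        return None  # an odd total cannot be split into two equal halves
    target = total // 2
    # reach[t] holds one index list whose values sum to t.
    reach = {0: []}
    for i, value in enumerate(s, start=1):
        # Iterate over a snapshot so that each element is used at most once.
        for t, subset in list(reach.items()):
            new_sum = t + value
            if new_sum <= target and new_sum not in reach:
                reach[new_sum] = subset + [i]
    return reach.get(target)

def is_partition_solution(s, subset):
    """Check that `subset` (1-based indices) carries exactly half of the total."""
    return 2 * sum(s[i - 1] for i in subset) == sum(s)
```

For example, `partition([3, 1, 1, 2, 2, 1])` returns some index set whose values sum to 5, while `partition([2, 2, 6])` returns `None`.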

Let G = (V, E) be a chromosome graph, where

V = ⋃_{1 \leq i \leq n + 1} \{v_{i, 0}^{-}, v_{i, 1}^{+}, v_{i, 1}^{-}, v_{i, 2}^{+}, v_{i, 2}^{-}, v_{i, 3}^{+}\}

is a set of vertices, and E = E_S ∪ E_L ∪ E_R is a set of edges. Here, E_S consists of

\begin{gathered} e_{i, 0} = ⟨ - v_{i, 0}^{-}, + v_{i, 1}^{+}, 1, 9 S_{Σ} ⟩ (1 \leq i \leq n), \\ e_{i, 1} = ⟨ - v_{i, 1}^{-}, + v_{i, 2}^{+}, 2, s (i) ⟩ (1 \leq i \leq n) \\ e_{i, 2} = ⟨ - v_{i, 2}^{-}, + v_{i, 3}^{+}, 1, S_{Σ} - s (i) ⟩ (1 \leq i \leq n) \\ e_{n + 1, 0} = ⟨ - v_{n + 1, 0}^{-}, + v_{n + 1, 1}^{+}, 2, 9 S_{Σ} / 2 ⟩, \\ e_{n + 1, 1} = ⟨ - v_{n + 1, 1}^{-}, + v_{n + 1, 2}^{+}, n + 2, 0 ⟩, \\ e_{n + 1, 2} = ⟨ - v_{n + 1, 2}^{-}, + v_{n + 1, 3}^{+}, 2, 5 S_{Σ} ⟩ . \end{gathered}

In addition, E_Rconsists of

\begin{gathered} {\hat{e}}_{i, 1} = ⟨- v_{i, 1}^{+}, + v_{i, 1}^{-}, 0, 0⟩ (1 \leq i \leq n + 1), \\ {\hat{e}}_{i, 2} = ⟨- v_{i, 2}^{+}, + v_{i, 2}^{-}, 0, 0⟩ (1 \leq i \leq n + 1), \end{gathered}

and E_Lconsists of

\begin{gathered} e_{L i} = ⟨+ v_{i, 1}^{-}, - v_{n + 1, 2}^{+}, 0, 0⟩ (1 \leq i \leq n), \\ e_{L i}^{'} = ⟨- v_{i, 2}^{+}, + v_{n + 1, 1}^{-}, 0, 0⟩ (1 \leq i \leq n) . \end{gathered}

We set λ_i = 10S_Σ for any i ≥ 1, Q_N = Q_T = 100S_Σ, n_N = n + 2, and n_T = 0. See Figure 9 for an example. In addition, we set V_W to V₅ ∪ V₃, and E_W to E by making all edges in E_L ∪ E_R required so that G satisfies WCC.

**Figure 9. An example of a chromosome graph for solving the partition problem (PARTITION)**. In this example, n = 4.

We show that PARTITION for S has a solution S' ⊂ S if, and only if, there exists a solution C of ChrL such that W(C) = 0. First, suppose that PARTITION has a solution S'. Let r_{S'} be a cycle generated by merging the cycles $e_{n + 1, 1} e_{L i} e_{i, 1} e_{L i}^{'}$ for i ∈ S'. We define r_{S−S'} in the same way. Consider a multi-set C = {c_1, ..., c_{n+2}}, where c_i ∈ C is a chromosome on G such that

\begin{gathered} c_{i} = e_{i, 0} {\hat{e}}_{i, 1} e_{i, 1} {\hat{e}}_{i, 2} e_{i, 2} (1 \leq i \leq n), \\ c_{n + 1} = e_{n + 1, 0} {\hat{e}}_{n + 1, 1} r_{S^{'}} e_{n + 1, 1} {\hat{e}}_{n + 1, 2} e_{n + 1, 2}, \\ c_{n + 2} = e_{n + 1, 0} {\hat{e}}_{n + 1, 1} r_{S - S^{'}} e_{n + 1, 1} {\hat{e}}_{n + 1, 2} e_{n + 1, 2} . \end{gathered}

Then, W(C) = 0 because |C| = n + 2, Tr(C) = 0, and m(C, e) = n(e) for e ∈ E_S. In addition, C visits all required edges. Furthermore, |c_i| = 10S_Σ ≤ λ_i for 1 ≤ i ≤ n + 2.
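The length bookkeeping of this construction can be verified with a few lines of Python. The sketch below only uses the copy numbers and lengths of the edges in E_S listed above (edges in E_L and E_R have length 0); the function names are hypothetical, and `Fraction` is used so that 9S_Σ/2 stays exact when S_Σ is odd.

```python
from fractions import Fraction

def short_chromosome_lengths(s):
    """|c_i| for 1 <= i <= n: e_{i,0}, e_{i,1}, e_{i,2} visited once each."""
    total = sum(s)  # S_Sigma
    return [9 * total + s_i + (total - s_i) for s_i in s]

def long_chromosome_length(s, indices):
    """Length of a chromosome e_{n+1,0} ... r ... e_{n+1,2}, where the cycle r
    visits e_{i,1} once for every 1-based index i in `indices`."""
    total = sum(s)
    return Fraction(9 * total, 2) + sum(s[i - 1] for i in indices) + 5 * total

def total_required_length(s):
    """Sum of |e| n(e) over all segment edges of the constructed graph."""
    n, total = len(s), sum(s)
    return (n * 9 * total                    # e_{i,0}: copy number 1
            + sum(2 * s_i for s_i in s)      # e_{i,1}: copy number 2
            + sum(total - s_i for s_i in s)  # e_{i,2}: copy number 1
            + 2 * Fraction(9 * total, 2)     # e_{n+1,0}: copy number 2
            + (n + 2) * 0                    # e_{n+1,1}: length 0
            + 2 * 5 * total)                 # e_{n+1,2}: copy number 2

# Hypothetical instance with S_Sigma = 10 and the balanced subset S' = {1, 2, 3}.
sizes = [3, 1, 1, 2, 2, 1]
chosen = [1, 2, 3]
others = [4, 5, 6]
```

With a balanced subset, both long chromosomes have length 10S_Σ, as do all short ones, and the total required length equals 10(n + 2)S_Σ; an unbalanced subset makes one long chromosome exceed λ_i = 10S_Σ.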

Conversely, suppose that ChrL for G has an optimal solution C that satisfies W(C) = 0. Because W(C) = 0, we obtain |C| = n + 2, Tr(C) = 0, and m(C, e) = n(e) for e ∈ E. Because ∑_{e∈E} |e|n(e) = 10(n + 2)S_Σ and each chromosome has length at most 10S_Σ, |c| = 10S_Σ for each c ∈ C. Let c_i be a chromosome that begins with e_{i,0} for 1 ≤ i ≤ n. The other two chromosomes are denoted by c_{n+1} and c_{n+2}. Then, c_1 begins with $e_{1, 0} {\hat{e}}_{1, 1} e_{1, 1}$. Suppose that c_1 does not visit ${\hat{e}}_{1, 2} e_{1, 2}$. Then, there is a chromosome c_i that visits ${\hat{e}}_{1, 2} e_{1, 2}$, and the edge preceding them in c_i has to be e_{1,1}. Therefore, for some paths p_1 and p_2,

\begin{gathered} c_{1} = e_{1, 0} {\hat{e}}_{1, 1} e_{1, 1} p_{1} \\ c_{i} = p_{2} e_{1, 1} {\hat{e}}_{1, 2} e_{1, 2} . \end{gathered}

(7)

Because of (7), $| c_{1} | = | e_{1, 0} | + | {\hat{e}}_{1, 1} | + | e_{1, 1} | + | p_{1} | = 10 S_{Σ} = | e_{1, 0} | + | {\hat{e}}_{1, 1} | + | e_{1, 1} | + | {\hat{e}}_{1, 2} | + | e_{1, 2} |$ . Therefore, $| p_{1} | = | {\hat{e}}_{1, 2} | + | e_{1, 2} |$ . We modify C so that

\begin{gathered} c_{1} = e_{1, 0} {\hat{e}}_{1, 1} e_{1, 1} {\hat{e}}_{1, 2} e_{1, 2}, \\ c_{i} = p_{2} e_{1, 1} p_{1} . \end{gathered}

The modified C still satisfies the required conditions. After this modification is repeated for 2 ≤ i ≤ n until no more modifications can be applied, C satisfies $c_{i} = e_{i, 0} {\hat{e}}_{i, 1} e_{i, 1} {\hat{e}}_{i, 2} e_{i, 2}$ for 1 ≤ i ≤ n. For each 1 ≤ i ≤ n, another chromosome, which is either c_{n+1} or c_{n+2}, visits e_{i,1}. Let S' = {i | m(c_{n+1}, e_{i,1}) > 0}. Then, ∑_{i∈S'} s(i) = 10S_Σ − (9/2 + 5)S_Σ = S_Σ/2. Therefore, S' is a solution of PARTITION.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

TY formulated the problem, proved the NP-completeness, developed a polynomial-time algorithm, and composed the manuscript. SM critically revised the manuscript.

Declarations

TY would like to acknowledge financial support from the Human Genome Center, Institute of Medical Science, University of Tokyo.

This article has been published as part of BMC Genomics Volume 16 Supplement 2, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcgenomics/supplements/16/S2

References

[1] DNA Sequencing Costs. http://www.genome.gov/sequencingcosts/

[2] Weischenfeldt J, Symmons O, Spitz F, Korbel JO. Phenotypic impact of genomic structural variation: insights from and for human disease. Nat Rev Genet. 2013;14(2):125–138. doi: 10.1038/nrg3373.

[3] Bashir A, Volik S, Collins C, Bafna V, Raphael BJ. Evaluation of paired-end sequencing strategies for detection of genome rearrangements in cancer. PLoS Comput Biol. 2008;4(4):e1000051. doi: 10.1371/journal.pcbi.1000051.

[4] Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Edwards PAW, Bignell GR, Stratton MR, Futreal PA. Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet. 2008;40(6):722–9. doi: 10.1038/ng.128.

[5] Malhotra A, Lindberg MR, Faust GG, Leibowitz ML, Clark RA, Layer R, Quinlan AR, Hall IM. Breakpoint profiling of 64 cancer genomes reveals numerous complex rearrangements spawned by homology-independent mechanisms. Genome Res. 2013;23(5):762–776. doi: 10.1101/gr.143677.112.

[6] Stephens PJ, Greenman CD, Fu B, Yang F, Bignell GR, Mudie LJ, Pleasance ED, Lau KW, Beare D, Stebbings LA, et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell. 2011;144(1):27–40. doi: 10.1016/j.cell.2010.11.055.

[7] Shen MM. Chromoplexy: A new category of complex rearrangements in the cancer genome. Cancer Cell. 2013;23(5):567–569. doi: 10.1016/j.ccr.2013.04.025.

[8] Van Allen. et al. Punctuated evolution of prostate cancer genomes. Cell. 2013;153(3):666–677. doi: 10.1016/j.cell.2013.03.021.

[9] Kloosterman WP, Tavakoli-Yaraki M, van Roosmalen MJ, van Binsbergen E, Renkens I, Duran K, Ballarati L, Vergult S, Giardino D, Hansson K, Ruivenkamp CAL, Jager M, van Haeringen A, Ippel EF, Haaf T, Passarge E, Hochstenbach R, Menten B, Larizza L, Guryev V, Poot M, Cuppen E. Constitutional chromothripsis rearrangements involve clustered double-stranded DNA breaks and nonhomologous repair mechanisms. Cell Rep. 2012;1(6):648–55. doi: 10.1016/j.celrep.2012.05.009.

[10] Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK, Chinwalla A, Conrad DF, Fu Y, Grubert F, Hajirasouliha I, Hormozdiari F, Iakoucheva LM, Iqbal Z, Kang S, Kidd JM, Konkel MK, Korn J, Khurana E, Kural D, Lam HYK, Leng J, Li R, Li Y, Lin C-Y, Luo R, et al. 1000 genomes project: Mapping copy number variation by population-scale genome sequencing. Nature. 2011;470(7332):59–65. doi: 10.1038/nature09708.

[11] Medvedev P, Stanciu M, Brudno M. Computational methods for discovering structural variation with next-generation sequencing. Nat Methods. 2009;6:13–20. doi: 10.1038/nmeth.1374.

[12] Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009;6(9):677–681. doi: 10.1038/nmeth.1363.

[13] Hormozdiari F, Alkan C, Eichler EE, Sahinalp SC. Combinatorial algorithms for structural variation detection in high-throughput sequenced genomes. Genome Res. 2009;19(7):1270–1278. doi: 10.1101/gr.088633.108.

[14] Quinlan AR, Clark RA, Sokolova S, Leibowitz ML, Zhang Y, Hurles ME, Mell JC, Hall IM. Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome. Genome Res. 2010;20(5):623–635. doi: 10.1101/gr.102970.109.

[15] Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: An approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011;21(6):974–984. doi: 10.1101/gr.114876.110.

[16] Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009;25(21):2865–2871. doi: 10.1093/bioinformatics/btp394.

[17] Genome Reference Consortium. http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/

[18] Oesper L, Ritz A, Aerni S, Drebin R, Raphael B. Reconstructing cancer genomes from paired-end sequencing data. BMC Bioinformatics. 2012;13(Suppl 6):10. doi: 10.1186/1471-2105-13-S6-S10.

[19] Medvedev P, Brudno M. Maximum likelihood genome assembly. J Comput Biol. 2009;16(8):1101–1116. doi: 10.1089/cmb.2009.0047.

[20] Gabow HN. An efficient reduction technique for degree-constrained subgraph and bidirected network flow problems. Proceedings of the 15th Annual ACM Symposium on Theory of Computing (STOC). 1983. pp. 448–456.

[21] Nagarajan N, Pop M. Parametric complexity of sequence assembly: Theory and applications to next generation sequencing. J Comput Biol. 2009;16(7):897–908. doi: 10.1089/cmb.2009.0005.

[22] Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010;20(2):265–272. doi: 10.1101/gr.097261.109.

[23] Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–829. doi: 10.1101/gr.074492.107.

[24] Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol İ. ABySS: A parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–1123. doi: 10.1101/gr.089532.108.

[25] Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, Berlin AM, Aird D, Costello M, Daza R, Williams L, Nicol R, Gnirke A, Nusbaum C, Lander ES, Jaffe DB. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci USA. 2011;108(4):1513–1518. doi: 10.1073/pnas.1017351108.

[26] Kim J, Larkin DM, Cai Q, Asan, Zhang Y, Ge R-L, Auvil L, Capitanu B, Zhang G, Lewin HA, Ma J. Reference-assisted chromosome assembly. Proc Natl Acad Sci USA. 2013;110(5):1785–1790. doi: 10.1073/pnas.1220349110.

[27] Pop M. Genome assembly reborn: recent computational challenges. Brief Bioinform. 2009;10(4):354–366. doi: 10.1093/bib/bbp026.

[28] Gaul E, Blanchette M. Ordering partially assembled genomes using gene arrangements. In: Bourque G, El-Mabrouk N, editors. Comparative Genomics. Lecture Notes in Computer Science. Vol. 4205. Springer, Germany; 2006. pp. 113–128.

[29] Navin N, Kendall J, Troge J, Andrews P, Rodgers L, McIndoo J, Cook K, Stepansky A, Levy D, Esposito D, Muthuswamy L, Krasnitz A, McCombie WR, Hicks J, Wigler M. Tumour evolution inferred by single-cell sequencing. Nature. 2011;472(7341):90–94. doi: 10.1038/nature09807.

[30] Jahani S, Setarehdan SK. Centromere and length detection in artificially straightened highly curved human chromosomes. International Journal of Biological Engineering. 2012;2(5):56–61. doi: 10.5923/j.ijbe.20120205.04.

[31] Kasai F, O'Brien PCM, Ferguson-Smith MA. Afrotheria genome; overestimation of genome size and distinct chromosome GC content revealed by flow karyotyping. Genomics. 2013;102(5-6):468–471. doi: 10.1016/j.ygeno.2013.09.002.

[32] Myers EW. The fragment assembly string graph. Bioinformatics. 2005;21:79–85. doi: 10.1093/bioinformatics/bti1114.

[33] Sorge M, van Bevern R, Niedermeier R, Weller M. A new view on rural postman based on Eulerian extension and matching. Journal of Discrete Algorithms. 2012;16:12–33.

[34] Garey MR, Johnson DS. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Company, New York; 1979.

[35] Ahuja RK, Magnanti TL, Orlin JB. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, New Jersey; 1993.
