Abstract
Homologous recombination between the maternal and paternal copies of a chromosome is a key mechanism for human inheritance and shapes population genetic properties of our species. However, a similar mechanism can also act between different copies of the same sequence, then called non-allelic homologous recombination (NAHR). This process can result in genomic rearrangements—including deletion, duplication, and inversion—and is underlying many genomic disorders. Despite its importance for genome evolution and disease, there is a lack of computational models to study genomic loci prone to NAHR. In this work, we propose such a computational model, providing a unified framework for both (allelic) homologous recombination and NAHR. Our model represents a set of genomes as a graph, where haplotypes correspond to walks through this graph. We formulate two founder set problems under our recombination model, provide flow-based algorithms for their solution, describe exact methods to characterize the number of recombinations, and demonstrate scalability to problem instances arising in practice.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13015-023-00241-3.
Keywords: Founder set reconstruction, Variation graph, Pangenomics, NAHR, Homologous recombination
Background
Twenty years ago, Esko Ukkonen introduced the problem of inferring founder sets from haplotyped single nucleotide polymorphism (SNP) sequences under allelic recombination [1]. Ukkonen’s work has since inspired a wealth of research addressing various aspects and applications of founder set reconstruction ranging from the reconstruction of ancestral recombinations and pangenomics to applications in phage evolution [2–4]. In its original setting, the problem sets out from a given set of m sequences of equal length n, where characters across sequences residing at the same index position correspond to a SNP. It then asks for a smallest set of sequences, called founder set, such that each given sequence can be constructed through a series of crossovers between sequences of the founder set, where each segment between two successive recombinations must meet a minimum length threshold. The Minimum Founder Set problem is NP-complete in general [5], but is solvable in linear time for the special case of founder sets of size two [1, 6]. Since its introduction, various heuristics and approximations have been proposed [6–8]. A variant of this problem restricts crossovers to coincide at certain positions, thereby decomposing the input sequences into a universally shared succession of blocks. The resulting problem, known as Minimum Segmentation Problem is polynomial [9]. In his seminal paper, Ukkonen devised a algorithm for its solution which has been substantially improved by Norri et al. [10] to linear time, i.e. O(nm) by exploiting the positional Burrows-Wheeler transform [11].
Just like the Minimum Founder Set Problem, the vast majority of population genetic analyses and genome-wide association studies have been focused on SNPs in the past decades, neglecting more complex forms of variation—mostly for technical difficulties in detecting them. In particular, structural variants (SVs), commonly defined as variants of at least 50bp, have posed substantial challenges and studies based on short sequencing reads typically detect less than half of all SVs present in a genome [12]. Recent technological and algorithmic advances help to overcome these limitations [13]. Long read technologies now enable haplotype-resolved de novo assembly of human genomes [14], which in turn enables a much more complete ascertainment of SVs [15]. In 2022, the first complete telomere-to-telomere assembly of a human genome was announced [16], heralding a new era of genomics where high-quality, haplotype-resolved assemblies of complex repetitive genomic structures become broadly available. Presently, the Human Pangenome Reference Consortium (HPRC), is applying these techniques to generate a large panel of haplotype-resolved genome assemblies from samples of diverse ancestries [17, 18]. These emerging data sets enable studying genetic loci involving duplicated sequence, called segmental duplications (SDs), which are amenable to NAHR, are therefore highly mutable, and show complicated evolutionary trajectories [19, 20]. The T2T-CHM13 study alone reports over 40 thousand segmental duplications that amount to 202Mb ( of the human genome) [16].
Interestingly, at loci with highly similar segments arranged in opposite orientations, such as Segment 3 in Fig. 1, NAHR can lead to inversion, i.e. the reversal of the interior sequence (Segment 4 in Fig. 1). Because of being flanked by a pair of copies of the same sequence (cf. Segment 3) that often comprises tens of thousands of bases, such events have been largely undetectable by sequencing technologies with read lengths below the length of the duplicated sequence; in particular by conventional short read sequencing. Recent studies applying multiple technologies reveal that inversions affect tens of megabases of sequence in a typical human genome [21]. Unlike most other classes of genetic variation, inversions are often recurrent with high mutation rates, that is, the same events have happened multiple times in human history [22]. Depending on the structures of duplicated sequence at a particular locus, individual human haplotypes can differ in their potential for NAHR. This can have important implications for the risk for a range of genetic disorders caused by NAHR-mediated mutations [22].
In the past two decades, various mathematical models and algorithms to study genome rearrangements have been proposed. These range from the classic reversal [23, 24] and transposition [25] model to composed models for two or more balanced rearrangements [26, 27], to generalized models such as the popular Double Cut and Join (DCJ) [28, 29]. As the research in this field continues, advanced models can additionally accommodate one or more types of unbalanced rearrangements, i.e., deletion, insertion, and duplication [30, 31]. Yet, none of these models adequately considers sequence similarity as a prerequisite for NAHR, which is a key molecular mechanism shaping many complex loci in the human genome. In summary, there are now technological opportunities to study the population history of recalcitrant SD loci that are prone to genome rearrangements and relevant to disease, but computational models to facilitate this have so far been lacking.
This work addresses this deficit by proposing a rearrangement model that is based on the molecular mechanism of homologous recombination and by solving variants of Ukkonen’s Founder Set Problem that can provide insights into the evolution of complex loci driven by NAHR. The genome model underlying the approach at hand represents DNA sequences at a level of abstraction where they are already decomposed into genomic markers with assigned homologies. Here, our notion of homology is a synonym for high DNA sequence similarity, as we adopt the terminology underlying the concept of homologous recombination. Our model permits recombination events to occur between homologous markers independent of their position within or between haplotypes, as long as the markers’ orientations are respected. In other words, a marker can only recombine with a homologous marker alongside the same direction, as illustrated by Fig. 1, because a recombination event can only occur between homologous markers if they are aligned to each other. By virtue of recapitulating the underlying NAHR, it implicitly allows for all the rearrangements this molecular mechanism can give rise to, including deletion, duplication, and inversion.
Marker decomposition and homology assignment can be done in practice with genome graph building tools such as MBG [32], minigraph [33], or pggb [34]. Our algorithms can work with any variation graph or pangenome graph with nodes corresponding to homologous DNA segments and edges between segments corresponding to observed adjacencies in a given set of haplotypes.
Methods
Preliminaries
A (genomic) marker m is an element of the finite universe of markers denoted by , and is associated with a fragment of a double-stranded DNA molecule. Each marker can be traversed in forward and reverse direction. A marker in forward orientation (which is the default orientation) is traversed from left to right. Overline notation indicates the reversal of a marker m, which is carried out relative to its orientation, i.e., . Similarly, represents the set of all reversed markers. We designate two forward markers as terminal markers. In what follows, we study terminal sequences, that is, sequences drawn from the alphabet of oriented markers that start with s or , end in S or and do not contain any further terminal markers in between. A terminal sequence can be traversed in forward and reverse direction. A haplotype is a terminal sequence that starts with s (source) and ends with S (sink).
Example 1
Consider in the following two sequences of genomic markers A and X drawn from the universe of markers , where and . Sequence A starts and ends with terminal markers and , respectively, thus constituting a haplotype over . Conversely, X starts with and ends in and therefore is a terminal sequence, but not a haplotype.
Given a sequence A, |A| indicates the length of A which corresponds to the number of A’s constituting elements. defines the reverse complementation of sequence A, i.e., the simultaneous reversal of the sequence and its constituting elements. The element at the ith position in sequence A is denoted by A[i]. A segment of sequence A starting at position i and ending at and including position j is denoted by A[i..j]. Then, and denote the prefix and suffix of A, respectively. Given two sequences A and B, then indicates that B is a segment of A, i.e., and there exists some with . Finally, the operator “” indicates the concatenation of two sequences.
Example 1
(cont’d) The length of A is ; its reverse complement is ; A[4..6] is a segment of A and corresponds to sequence , and consequently holds true; The segments and are a prefix and a suffix of X and A, respectively; The concatenation of prefix X[..6] and suffix A[7..] results in haplotype .
A recombination is an operation that acts on a shared oriented marker m of any two terminal sequences A and B: let ; recombination produces terminal sequence . For a given set of haplotypes , denotes the span, i.e., the set of all haplotypes generated by applying on haplotypes and the resulting terminal sequences. More precisely, let be the universe of terminal sequences, defined recursively by such that for any with some the recombinant and its reverse complement is also in . Then . Accordingly, we also say that “ is a generating set of )”. Conversely, given any (possibly infinite) set of haplotypes and some , is a generating set of if and only if .
Example 1
(cont’d) Recombination produces terminal sequence . Subsequent recombination produces haplotype . If is a given set of haplotypes, then .
In this paper, we study the following three problems:
Problem 1
(Founder Set) Given a set of haplotypes , find a generating set such that and is minimized.
We minimize total length because current knowledge on the evolution of complex genomic loci indicates their contained segmental duplications often causes them to expand over time [35]. Consequently we expect ancestral loci to be more compact and contain fewer duplications. We prefer this formulation over minimizing the founder set’s cardinality, because the latter would allow for solutions with founder sequences of unbounded length, which is biologically irrelevant. We call a solution to Problem 1 a founder set and its members founder sequences. The following problem is related to Ukkonen’s Minimum Segmentation Problem [1]:
Problem 2
(Recombination Count) Given a set of terminal sequences and a terminal sequence Q, count the number of recombinations r of the form , with and , that are necessary to generate Q from , i.e., and , if feasible and report its infeasibility otherwise.
At last, the combination of Problems 1 and 2 motivates the following:
Problem 3
(Parsimonious Founder Set) Given a set of haplotypes , find a founder set that minimizes the total number of recombinations to generate all founder sequences from .
Constructing founder sets
In this section, we present a three-step solution to Problem 1 that is based on a network flow analysis of the variation graph over the input set of haplotypes. To this end, we introduce the notion of variation graphs and describe their construction for our specific setting. Subsequently, we define network flow and detail how a founder set can be derived. Our proposed network flow problem is subordinate to the Chinese Postman Problem on edge-colored multigraphs for which Gutin et al. proposed a polynomial algorithm [36]. Consequently, all other steps of our solution being polynomial, Problem 1 can be solved in polynomial time. However, we propose an integer linear program in lieu of Gutin et al’s impractical algorithm. Then, in Section Results we show feasibility of our approach in experiments on simulated variation graphs and an exemplar biological data set.
Variation graph construction. We now address the construction of variation graph from a given set of haplotypes . Graph is an undirected edge-colored multigraph where each edge can have one of two colors corresponding to their membership in edge sets E and . In constructing , each marker m of the universe of forward-oriented markers is represented by a tuple of its extremities also called “tail” and “head” of m, respectively, and its reverse-oriented counterpart is represented as . (Note that our notation is based on common practice of illustrating markers as arrows, that, in natural reading direction, face from left, i.e., tail of the arrow, to right, i.e., head of the arrow.) Node set V of graph corresponds to the set of all marker extremities, and each marker gives rise to one marker edge . Further, any two (not necessarily distinct) nodes are connected by one adjacency edge if they occur in one of the haplotypes either in forward or reverse order. More formally, there is an adjacency edge if and only if there exists a sequence with such that , and .
Example 2
Let , , , and , then the variation graph of is as illustrated in Fig. 2, with marker edges drawn in gray and adjacency edges in black.
Proposition 1
Let be the variation graph of haplotypes , and the set of all walks between terminal markers and in with edges alternating between E and , then .
Proof
Observe that no recombination can create a new pair of consecutive markers that is not contained in any sequence . Therefore, each haplotype is a succession of consecutive markers drawn from sequences in , i.e., B can be delineated in by following adjacency edges corresponding to its succession of consecutive markers.
Given an alternating walk , we show how to express X as a series of recombination events:
Pick some haplotype and initialize ;
Let be a sequence such that for some position j, with and . Then .
Increase i by 2 and repeat step b unless .
Observe that by construction of the variation graph , a suitable sequence must exist in each iteration of step b.
Defining flows on variation graphs. We determine a minimum set of founder sequences by solving a network flow problem in variation graph where flow is allowed to travel along adjacency edges in either direction. Algorithm 1 describes the network flow problem. Each node is associated with two capacities corresponding to incoming and outgoing flowWe then find a non-negative flow such that the total flow of graph is minimized and satisfies constraints. Note that the flow can travel in both directions of an edge and that does not hold true in general.
Example 2
(cont’d) The drawing in Fig. 3 illustrates a flow solution on variation graph , with the direction and amount of flow along adjacency edges indicated by labeled arrowed arcs.
Deriving haplotypes from flows.
By applying the Flow Decomposition Theorem [37, p. 80f], any flow, i.e., solution to the above-specified constraints, is decomposable into a set of alternating paths going from source to sink and a set of alternating cycles. Ahuja et al. [37] give a simple and efficient algorithm that does so in polynomial time and which we describe below, adapted to our circumstances. The idea is to perform a random walk in the graph from source to sink or within a cycle, thereby consuming flow along adjacency edges until all flow is depleted. The proof of the algorithm remains unchanged to that given by Ahuja et al., thus is not repeated here.
Set .
Each node is adjacent to exactly one other node through a marker edge. Setting out from current node u, traverse this incident marker edge to some node v, choose any neighbor w of v for which . Follow the adjacency edge to v and decrease the flow by 1. Set .
As long as do as follows: if u has been visited in the traversal before, then extract the corresponding alternating cycle from the recorded sequence and report it. Proceed with the traversal by repeating Step 2.
However, if , follow the marker edge to and report the recorded sequence as a path.
If is incident with edges with positive flow, proceed with Step 1. Otherwise, there still might be strictly positive flow remaining in the graph corresponding to unreported cycles. In that case, pick any node such that for some node w, , and , and proceed with Step 2 in order to report the next cycle.
Example 2
(cont’d) The components of the flow solution on variation graph comprise two cycles C1 and C2, and two -paths P1 and P2 are illustrated in Fig. 5.
What remains is the integration of cycles into walks that then correspond to the haplotypes of the founder set. The integration is facilitated by a graph structure, the component graph. The component graph is an edge-labeled, directed multigraph, where, in its initial construction, each alternating -path and each cycle reported during flow decomposition is represented by a distinct node of . In the component graph , each cycle c of the flow decomposition sharing one or more markers with another component is connected by one or more directed edges to that component, with each edge’s label corresponding to one distinct shared marker, oriented according to the their succession in c (which may not be the same as in ). The component graph is then successively deconstructed until empty as follows:
Remove and report all -walks with in-degree 0 from node set . Note that by construction, -walks have out-degree 0, i.e., those with in-degree 0 are singleton in .
Pick a cycle with in-degree 0, or, if none such exists, any arbitrary cycle .
Pick an outgoing edge such that is a -walk. If no such exists, c is only adjacent to cycles, out of which one is picked arbitrarily. Let , . If marker m is embedded in in same orientation, i.e. , then linearize c in m, i.e., , and integrate it into such that . Otherwise, integrate the reversed linearization of c, i.e, . Remove cycle c and its outgoing edges from component graph .
Proceed with step 1 until no more components remain and all -walks are reported.
The search for components with in-degree 0 can be efficiently implemented through preorder traversal of . Note that each cycle must have at least one outgoing edge and that ultimately all cycles must be integrable into a -walk, otherwise this would imply that contains a disconnected, circular component that is not reachable by an alternating path from source to sink , thus contradicting the correctness of ’s construction. The reported -walks represent the wanted haplotypes of a founder set.
Example 2
(cont’d) The plots in Fig. 4 depict the component graph of components C1, C2, P1, and P2 (left) and the final two -walks that collectively represent a founder set of (right).
We define the multiplicity of a consecutive marker pair , for any , as the number of times it appears as segment in forward or reverse order in a set of sequences and introduce the following function for its retrieval:
Theorem 1
Let , is a solution to Problem 1 if and only if corresponds to a minimum network flow in .
Proof
Any flow of variation graph is decomposable into a set of haplotypes , as demonstrated above. Observe that the above-listed flow constraints enforce the derived haplotypes to cover the entire graph and consequently . This implies that . Further, the total number of consecutive markers in a haplotype sequence A equals and therefore solutions to the specified network flow problem minimize quantity . This is equivalent to minimizing , because it is not possible to reduce the founder set size by concatenating two or more founder sequences without increasing the number of consecutive markers by an equal amount. Conversely, in the network flow specification, the sink node has no outgoing flow to the source node and therefore any founder set derived by a flow solution cannot be reduced by concatenation.
We show that every founder set is also a solution to the specified minimum network flow problem. Assume that is a founder set of haplotypes and observe that multiplicities correspond to a valid flow in . Now assume that there exists another flow such that . Then, following the algorithm above, can be decomposed into haplotype set such that , contradicting the premise that is a solution to Problem 1.
Counting recombinations in founder sequences
We now provide a general algorithm for solving Problem 2. We show how this algorithm can be implemented to scale linearly with the input set of terminal sequences and query sequence Q in time and space by utilizing generalized suffix trees. Supplementary Note N2 further describes a solution based on suffix arrays that has the same asymptotic runtime and space guarantees, but is considered more practical. Our approach builds on the concept that each terminal sequence Q that can be generated from set is segmentable into a set of overlapping segments, where each such segment corresponds to a segment in a terminal sequence of . We call these segments -blocks for the remainder of this manuscript.
Lemma 2
Q can be generated from terminal sequences if and only if Q is segmentable into a sequence of overlapping -blocks , with , i.e., for each , with .
Proof
If Q can be generated from then there exists a series of recombinations , ,.., such that and . Consequently, Q can be segmented into the set of overlapping -blocks .
For a given segmentation , Q is generated by a series of recombinations from as follows: Let , ; For each x in 2..n, where , ; Observe that .
Finding a minimum -block segmentation is equivalent to computing the minimum number of recombinations—the former differs in size from the latter only by an increment of 1. The recursive function defined below calculates the number of recombinations to generate query sequence Q from terminal sequences by moving from one maximal -block of Q to the next. To this end, we define as the length of the longest prefix of Q that is a -block, i.e., .
1 |
indicates that Q cannot be generated from . We prove that the algorithm is optimal, i.e., computes the minimum number of recombinations:
Proof
We prove this by induction over the number of recombinations identified by Eq. 1. Note that the total number of recombinations is bounded by the length of query Q. We show that for every k with that reports the minimum number of recombinations for sequence .
-
(IB)
In iteration , receives the full-length query sequence Q and chooses the longest prefix of query Q that is a -block. It is clear that this is an optimal choice, since choosing a smaller prefix can only increase the number of recombinations. Note that if , Q cannot be generated from and R(Q) returns .
-
(IS)
Let be the -block identified in the k-th recurrence of with the current query sequence being . In step , will again identify the segment of maximal length l that is a -block.
Let us now claim that there is an shorter sequence of -blocks and , as illustrated in Fig. 6. Then there must be some for which . But if there were indeed a -block , then is a suffix of and that would be the longest common prefix chosen by in iteration , contradicting the definition of . Therefore, a shorter sequences of -blocks cannot exist.
The algorithm can efficiently count recombinations by utilizing the suffix tree data structure [38]. To this end, the suffix tree is constructed on sequence corresponding to a concatenation of terminal sequences of any given order, terminated by sentinel “”. In doing so, we assume that terminal markers , and abide lexicographic order $ . Suffix trees can be constructed in linear time and space [39], and matching substrings in can be performed in time linear to the length of the matching. To assess the time complexity of the recursion, observe that is recursed at most times, if all -blocks have length 2. We conclude:
Theorem 3
Problem 2 is solvable in time and space.
Minimizing recombinations in founder sequences
We now present an algorithm towards solving Problem 3, i.e., the problem of finding a founder set that minimizes the number of recombinations needed for its construction from a given set of haplotypes . Solving this problem requires the simultaneous computation of solutions to both the Founder Set and the Recombination Count problem and constitutes in combing through an exponentially large search space. We simplify the problem by presuming that the multiplicities of consecutive marker pairs in a solution to the Parsimonious Founder Set Problem are also optimal under the Founder Set problem. In other words, our approach is exact under the assumption that the overall multiplicity of each pair of consecutive markers in a founder set that is a solution to Problem 3 is known, yet the pair’s particular orientation and location in the founder sequences are not. To this end, we presume a function acting as oracle for the overall multiplicity of any given pair of consecutive oriented markers in a solution to Problem 3. More specifically, reports the total number of occurrences of and in founder set . Note that our experiments directly use the results of Problem 1 as input for Problem 3, i.e., reports the number of occurrences of in a solution to Problem 1. This makes our experimental solutions to Problem 3 heuristic. In addition, we make use of function to retrieve the multiplicity of any marker . Note that and are symmetric with respect to the relative orientation of markers, and . Our solution makes use of the flow graph that is defined in the subsequent paragraph. We calculate a matching in the flow graph that describes a set of founder sequences, each corresponding to a succession of segments of haplotypes . The objective of the matching is to minimize the total number of -blocks across all founder sequences which is equivalent to minimizing the number of recombinations for their construction from haplotype set .
Flow graph construction. The flow graph is a directed edge-colored multigraph with adjacency edges and marker edges , where each marker extremity with and , gives rise to elements in node set , representing many in (i) and many out (o) nodes. Hence, each node in the flow graph is represented by a triple of the form with the complete vertex set being . Each out node is incident with one and only one directed adjacency edge (u, v) connecting u to some in node v thereby realizing one occurrence of its representing pair of consecutive oriented markers in a founder sequence. Conversely, each forward-oriented marker contributes many directed marker edges that connect in/tail nodes with out/head nodes, i.e., . Analogously, each reverse-oriented marker contributes many in/head-to-out/tail-directed marker edges .
Example 2
(cont’d) Fig. 7 visualizes the flow graph for the given set of haplotypes , , , and a given .
Graph decomposition. A perfect matching of marker edges in flow graph produces a set of alternating walks and alternating cycles through , yet only half of the graph is eligible to form a solution to Problem 3. More precisely, for each marker , exactly half of the number of its associated nodes in must be saturated, i.e., incident with a matching edge. The other half as well as their incident edges must remain unsaturated. Further, we aim to admit only matchings that consist entirely of alternating -walks, because only those correspond to valid haplotypes of .
At last, we aim to assign to each saturated node a position in some haplotype A of given haplotype set . That way, we are able to determine whether the incident adjacency edge serves as continuation of the associated haploblock of A, or whether the incident saturated marker edge implies a recombination between two distinct -blocks.
The Integer Linear Program shown in Algorithm 2 implements the above-stated constraints.
Example 2
(cont’d) Fig. 8 illustrates a matching that is solution to Algorithm 2 for . The founder sequences are spelled out on the bottom, colored by haplotype (red, blue and green for haplotypes 2, 3 and 4 respectively). Unsaturated nodes and edges are grayed out, haplotype assignments implied by colored paths. The solution features two recombinations, marked by “” along their associated marker edges.
Objective. The ILP maximizes the sum over all variables, which corresponds to finding a set of founder sequences that has a maximum number of marker pairs associated with consecutive positions in any of the haplotypes . Conversely, any marker pair that is not linked to a position in a haplotype of represents a recombination event.
Matching constraints. Each edge and node of flow graph is associated with binary variables of and , respectively, that determine their saturation in a solution (cf. domains D.1 and D.2). Constraint C.01 ensures that each saturated marker edge is incident with saturated nodes. Perfect matching constraints, i.e., constraints that impose each saturated node being incident with exactly one marker edge, are implemented by constraint C.02. Similarly, constraint C.03 ensures that an adjacency edge is saturated if and only if its incident nodes are saturated. In other words, constraints C.01-C.03 together ensure that each component of the saturated graph corresponds to an alternating path or cycle component (the latter being prohibited by further constraints). The following two constraints C.04 and C.05 control the overall size of the saturated graph. In doing so, they ensure that, in a solution to Problem 3, the number of saturated nodes and adjacency edges matches the postulated multiplicity of markers , , and pairs of consecutive markers , , respectively.
Path constraints. Constraints C.05-C.08 force each component of the saturated graph to start and end in nodes associated with source and sink , respectively, thereby ruling out any cycles. To this end, they make use of a set of integer variables over all vertices (cf. Domain D.03) that define an increasing flow within each saturated component that is bounded by constant T corresponding to the total flow of the graph, i.e., . In each saturated marker edge, the flow is increased by 1 while along each adjacency edge, flow is kept constant. This prevents the formation of saturated cycles, because their flow would be infinite. Lastly, constraint C.08 preclude paths from starting in or ending in , leaving only one option for any saturated component open, that is, the formation of a -path.
Haplotype assignment. Each node in a solution to the ILP is associated with exactly one position in a haplotype A of , recorded by binary variables .Q Moreover, any marker edge whose incident pair of nodes is associated with the same position of the same haplotype corresponds to a -block, i.e, no recombination within this marker has taken place. Each marker edge that is linked by the ILP solver to a position j in a haplotype contributes a score unit to the objective function. These score units are encoded by binary variables (cf. domain D.05). Constraint C.09 ensures that each marker is associated with exactly one position j in a haplotype A of set , while C.10 confines incident nodes of adjacency edges to represent a consecutive marker pair . At last, constraint C.11 allows variables of marker edges to take on value 1 only if that marker edge is saturated and its incident nodes are associated with the same haplotype position.
Results
We implemented our methods in the programming language Rust [40] and used Gurobi [41] as the solver. Our software is open source and publicly available online [42]. To run Algorithm 2 on a given set of haplotypes , we estimated the overall multiplicity of pairs of consecutive markers from a network flow solution to Problem 1 on . Note that, because there is no guarantee that an optimal solution to Problem 3 exists that has also optimal flow under Problem 1, our approach does not guarantee exact solutions.
For benchmarking purposes, we ran Gurobi single-threaded and recorded wall clock time (in seconds) and Proportional Set Size (PSS) (in Megabytes (MB)) for memory usage. The choice of using PSS rather than measures such as Resident Set Size (RSS) or Unit Set Size (USS) is largely arbitrary, however all three measures were highly similar in all experiments and within 100 MB of each other at the extreme. Optimization time was capped at 30 min, beyond which the solver stops and returns its best-effort solution found thus far.
Experimental data
We benchmarked the performance of our algorithms by conducting experiments on both simulated data and a real-world data set. The former presumed a simulator, capable of generating haplotypes with duplicated and inverted markers that can produce intricate homologous recombinations while providing control over the degree of complexity. To this end, we implemented our own simulation tool that constructs a single haplotype sequence sampled at random to serve as seed. This seed sequence is adjustable by the following parameters: (i) number of distinct markers, i.e., the size of its variation graph, (ii) ratio of duplications, i.e., the number of additional edges inducing duplications in a walk of the graph, (iii) ratio of inversions, i.e., the proportion of inverted orientations within the set of duplications, and lastly (iv) the number of haplotypes that are input to subsequent founder set reconstruction. The latter are generated by performing random walks in the seed sequence’s variation graph and retaining only those leading from source to sink. In doing so, our simulator does not report nor have knowledge of a true founder set. Our simulator, discussed in more detail in Supplementary Note N1, enables us to explore various parameterizations that match different situations in biological data.
One important point concerns co-optimality. Problems 1 and 3 do not guarantee a unique solution. In fact, the pool of co-optimal solutions is often large for both problems. One contributing factor to co-optimality are cycles that are shared across multiple haplotypes, because they can be integrated in different orders. Further, the solution does not provide any information that could enable one to generate all co-optimal solutions nor discern between them, making a measure of accuracy challenging, since there is no guarantee that the “correct” founder sequence(s) will be seen in any number of trials.
In addition to simulated data, we applied our methods on a biological data set from the human 1p36.13 locus described by Porubsky et al. [22] to demonstrate the computational performance on realistic instances.
Simulation experiments
To assess the impact of parameter configurations on the results, we ran a number of different experiments wherein all but one parameters are fixed. A reasonable choice of constants seemed to be 100 distinct markers, of duplications, of inversions and 10 haplotypes, motivated by our data on the 1p36.13 locus (8 markers, 68 haplotypes, 57% of duplications) and statistics compiled by Porubsky et al. [22] ( duplications in the whole genome, inversions).
Reduction in number of recombinations. To evaluate the efficacy of our solution to Problem 3, we compared the number of recombinations returned by Algorithm 2 to that in a solution obtained by our network flow algorithm for Problem 1. To this end, we set the output of Algorithm 1 against an implementation of a solution to Problem 2, described in further detail in Supplementary Note N2. Figure 9 summarizes the outcome of this experiment. Overall, Algorithm 2 found a solution with fewer recombinations in all instances but a few where Gurobi returned barely best-effort solutions after reaching the time limit of 30 min, all of which exhibited a gap of at least 100%. The parameter settings in those cases were extremal.
Across all experiments, the mean estimated number of recombinations increased linearly by approximately 1.7 per 100 markers after minimization, compared to 4.5 per 100 without it. The values reached respectively 3.8 and 9.1 at 200 markers. The simulations here were carried out with a fixed number of haplotypes and ratios of duplications and of inversions. Results for experiments with other variable parameters are shown in Additional file 1: Figure S1.
Flow solution benchmark. Computing solutions with our network flow algorithm proved to be in almost all of our experiments near-instantaneous. By varying the number of distinct markers, the algorithm’s performance begins to deteriorate only with very large instances beyond 100k distinct markers and becomes excruciating for instances above 1M markers. When varying other parameters, we fixed the number of distinct markers to 100k rather than 100. Under 100k markers, execution completed after a mean wall clock time of seconds. In of all experiments, the solver’s runtime was too short to make sufficient measurements for benchmarking memory usage; the maximum PSS for the remaining ones measured at 78 MB. Over the 100k mark, both the graph size and duplication ratio began to reduce performance, with an average runtime of s. The ratio of inversions on the other hand did not affect performance (Suppl. Figure S3). We measured peak memory consumption at 758 MB across all conditions, which also occurred only at the very extremes of 100k distinct markers and a 100% ratio of duplications (Fig. 10).
Recombination minimization benchmark. As shown previously, Algorithm 2 successfully reduces the number of recombinations in solutions to Problem 1. However, its runtime increased dramatically with only moderate increments of any but one parameter of our simulator, the ratio of inversions, which did not play any role in performance (Additional file 1: Figure S2). For the remaining three, going beyond instances of 200 distinct markers, 20% of duplications, or 40 haplotypes typically did not allow for the optimization to finish in a reasonable amount of time (Fig. 11, Additional file 1: Figure S2). A similar but much less pronounced trend was seen with memory usage, which still remained relatively low. Peak memory usage was again observed at extreme parameter values with a PSS of 1072 MB with 50 haplotypes.
Application: locus 1p36.13
We obtained data from 68 human haplotypes (two per 34 individuals) at the 1p36.13 locus from Porubsky et al. [22] and the T2T-CHM13 human reference sequence [16]. The sequences comprise only eight distinct markers, terminal markers included. The sequences are attributed to five super populations, out of which 18 are of African origin (AFR), 16 of Eastern Asian (EAS), 12 of Admixed American (AMR), 12 of European (EUR), and 10 are South Asian (SAS). Their variation graph is densely connected with 26 edges (Fig. 12). The 68 haplotypes display a high degree of genetic diversity, with haplotype sequences differing in order, orientation, and copy number of the marker (Suppl. Table T1). Haplotype lengths in terms of the number of markers vary from 15 to 26, with a median of 19.
Our network flow algorithm determined that the data set can be generated from a single founder sequence. Our randomized algorithm for calculation of the minimum number of recombinations in a solution to Problem 1 asserted 15 recombinations after 1M trials, while Algorithm 1 obtained an optimal solution that revealed only 9 recombinations. Minimization completed in 60.3 s with a peak PSS of 225 MB. Note that there exists multiple other co-optimal solutions; Suppl. Figure S4 is an illustration of one.
Discussion
The advent of sequencing technology and genome assembly methodology to reconstruct full human genomes enables research into previously inaccessible segmental duplication loci. This exciting opportunity entails a demand for explanatory models that can infer evolutionary relationships and histories of complex repetitive genomic regions. In this work, we propose a model capable of explaining a broad range of balanced and unbalanced genome rearrangements. Our experiments on simulated data and on the 1p36.13 locus demonstrate that our algorithmic solutions to the founder set problem and the problem of minimizing recombinations in founder sets are capable of processing realistic instances. While the complexity of Problem 3 remains undetermined, we conjecture it to be NP hard.
Importantly, the model we are proposing is based on a molecular mechanism with a well-established role in shaping segmental duplication architecture. In our view, many past models of genome rearrangements have not sufficiently captured biological reality and there is an important need for further research aiming to incorporate knowledge of molecular mechanisms into such models. For instance, we envision future models that additionally include mechanisms like non-homologous end joining (NHEJ) and mobile element insertions. Furthermore, actual rates at which NAHR occurs depend on factors like the length of the duplicated sequence, the sequence similarity, as well as the presence of specific sequence motifs. In our current approach, these aspects are only partially and indirectly captured through the graph construction process. We aim to address and model these factors explicitly in future work.
Supplementary Information
Acknowledgements
The authors kindly thank Dr. Feyza Yilmaz of the Lee Lab (JAX) for providing the haplotype data of the 1p36.13 locus.
Abbreviations
- DCJ
Double cut and join
- HPRC
Human pangenome reference consortium
- ILP
Integer linear program
- LCA
Least common ancestor
- MB
Megabytes
- NAHR
Non-allelic homologous recombination
- NHEJ
Non-homologous end joining
- PSS
Proportional set size
- RMQ
Range minimum query
- RSS
Resident set size
- SD
Segmental duplication
- SNP
Single nucleotide polymorphism
- SV
Structural variant
- USS
Unit set size
Author contributions
TM initiated, TM and DD directed the research project. DD proposed solutions for Problems 1 and 3, and DD and KB implemented the algorithms. DD proposed the suffix tree-based algorithm, KB proposed and implemented the suffix array-based algorithm solving Problem 2. KB developed a method to simulate (N-)AHR, devised and implemented workflows, evaluation tools, and visualizations, and performed the experimental analysis. All authors wrote the manuscript, and read and approved its final version.
Funding
Open Access funding enabled and organized by Projekt DEAL. This work was supported in part by the National Institutes of Health grant 1U01HG010973 to T.M., by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant agreement No 956229, by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A), and by the MODS project funded from the programme “Profilbildung 2020” (grant no. PROFILNRW-2020-107-A), an initiative of the Ministry of Culture and Science of the State of Northrhine Westphalia.
Availibility of data and materials
All data used for the analysis of the 1p36.13 locus is included in Additional file 1: Table T1. The simulation experiments are solely based on data generated by a program named hapsim, available in source code form in the public repository [42].
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Tobias Marschall and Daniel Doerr are Joint last authors.
The original version of the article was reivsed: Additional file 1 has been corrected.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Change history
12/6/2023
A Correction to this paper has been published: 10.1186/s13015-023-00244-0
Contributor Information
Tobias Marschall, Email: tobias.marschall@hhu.de.
Daniel Doerr, Email: daniel.doerr@hhu.de.
References
- 1.Ukkonen E. Finding Founder Sequences from a Set of Recombinants. In: 2nd International Workshop on algorithms in bioinformatics (WABI 2002). Algorithms in bioinformatics. Berlin, Heidelberg: Springer Berlin Heidelberg; 2002:277-86.
- 2.Norri T, Cazaux B, Dönges S, Valenzuela D, Mäkinen V. Founder reconstruction enables scalable and seamless pangenomic analysis. Bioinformatics. 2021;37(24):4611–9. doi: 10.1093/bioinformatics/btab516. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Parida L, Melé M, Calafell F, Bertranpetit J, Consortium G. Estimating the ancestral recombinations graph (ARG) as compatible networks of SNP patterns. J Comput Biol. 2008;15(9):1133–53. doi: 10.1089/cmb.2008.0065. [DOI] [PubMed] [Google Scholar]
- 4.Swenson KM, Guertin P, Deschênes H, Bergeron A. Reconstructing the modular recombination history of Staphylococcus aureus phages. BMC Bioinform. 2013;14(15):1–9. doi: 10.1186/1471-2105-14-S15-S17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Rastas P, Ukkonen E. Haplotype Inference Via Hierarchical Genotype Parsing. In: Giancarlo R, Hannenhalli S, editors. 7th International Workshop on Algorithms in Bioinformatics (WABI 2007). Algorithms in Bioinformatics. Berlin, Heidelberg: Springer Berlin Heidelberg; 2007:85-97.
- 6.Wu Y, Gusfield D. Improved algorithms for inferring the minimum mosaic of a set of recombinants. In: Ma B, Zhang K, editors. Combinatorial Pattern Matching. Springer, Berlin Heidelberg: Berlin, Heidelberg; 2007. pp. 150–61. [Google Scholar]
- 7.Roli A, Benedettini S, Stützle T, Blum C. Large neighbourhood search algorithms for the founder sequence reconstruction problem. Comput Oper Res. 2012;39(2):213–24. doi: 10.1016/j.cor.2011.03.012. [DOI] [Google Scholar]
- 8.Roli A, Blum C, et al. Tabu search for the founder sequence reconstruction problem: a preliminary study. In: Omatu S, Rocha MP, Bravo J, Fernández F, Corchado E, Bustillo A, et al., editors. Distributed computing, artificial intelligence, bioinformatics, soft computing, and ambient assisted living. Springer, Berlin Heidelberg: Berlin, Heidelberg; 2009. pp. 1035–42. [Google Scholar]
- 9.Schwartz R, Clark AG, Istrail S. Methods for inferring block-wise ancestral history from haploid sequences. In: 2nd International Workshop on Algorithms in Bioinformatics (WABI 2002). Algorithms in Bioinformatics. Berlin, Heidelberg: Springer Berlin Heidelberg; 2002:44-59.
- 10.Norri T, Cazaux B, Kosolobov D, Mäkinen V. Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time. In: 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). vol. 113 of Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik; 2018:15:1-15:15.
- 11.Durbin R. Efficient haplotype matching and storage using the positional burrows-wheeler transform (PBWT) Bioinformatics. 2014;30(9):1266–72. doi: 10.1093/bioinformatics/btu014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Zhao X, Collins RL, Lee WP, Weber AM, Jun Y, Zhu Q, et al. Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies. Am J Hum Genet. 2021;108:919. doi: 10.1016/j.ajhg.2021.03.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Sedlazeck FJ, Lee H, Darby CA, Schatz MC. Piercing the dark matter: bioinformatics of long-range sequencing and mapping. Nat Rev Genet. 2018;19:329. doi: 10.1038/s41576-018-0003-4. [DOI] [PubMed] [Google Scholar]
- 14.Porubsky D, Ebert P, Audano PA, Vollger MR, Harvey WT, Marijon P, et al. Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads. Nat Biotechnol. 2020;39:302. doi: 10.1038/s41587-020-0719-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ, et al. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021 doi: 10.1126/science.abf7117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53. doi: 10.1126/science.abj6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Wang T, Antonacci-Fulton L, Howe K, Lawson HA, Lucas JK, Phillippy AM, et al. The human pangenome project: a global resource to map genomic diversity. Nature. 2022;604(7906):437–46. doi: 10.1038/s41586-022-04601-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Liao WW, Asri M, Ebler J, Doerr D, Haukness M, Hickey G, et al. A Draft Human Pangenome Reference. bioRxiv. 2022. https://www.biorxiv.org/content/early/2022/07/09/2022.07.09.499321. [DOI] [PMC free article] [PubMed]
- 19.Marques-Bonet T, Girirajan S, Eichler EE. The origins and impact of primate segmental duplications. Trends Genet. 2009;25(10):443–54. doi: 10.1016/j.tig.2009.08.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Vollger MR, Guitart X, Dishuck PC, Mercuri L, Harvey WT, Gershman A, et al. Segmental duplications and their variation in a complete human genome. Science. 2022;376(6588):eabj6965. doi: 10.1126/science.abj6965. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Chaisson MJP, Sanders AD, Zhao X, Malhotra A, Porubsky D, Rausch T, et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat Commun. 2019;10(1):1784. doi: 10.1038/s41467-018-08148-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Porubsky D, Höps W, Ashraf H, Hsieh P, Rodriguez-Martin B, Yilmaz F, et al. Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders. Cell. 2022;185:1986. doi: 10.1016/j.cell.2022.04.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Bafna V, Pevzner PA. Genome rearrangements and sorting by reversals. SIAM J Comput. 1996;25(2):272–89. doi: 10.1137/S0097539793250627. [DOI] [Google Scholar]
- 24.Bader DA, Moret BM, Yan M. A linear-time algorithm for computing inversion distance between signed permutations with an experimental study. In: 1st International Workshop on Algorithms in Bioinformatics (WABI 2001). Algorithms in Bioinformatics. Berlin, Heidelberg: Springer Berlin Heidelberg; 2001:365-76.
- 25.Bafna V, Pevzner PA. Sorting by transpositions. SIAM J Discret Math. 1998;11(2):224–40. doi: 10.1137/S089548019528280X. [DOI] [Google Scholar]
- 26.Walter MEM, Dias Z, Meidanis J. Reversal and transposition distance of linear chromosomes. In: Proceedings. String Processing and Information Retrieval: A South American Symposium (Cat. No. 98EX207). IEEE; 1998:96-102.
- 27.Dias Z, Meidanis J. Genome Rearrangements Distance by Fusion, Fission, and Transposition is Easy. In: spire. Citeseer; 2001:250-3.
- 28.Yancopoulos S, Attie O, Friedberg R. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005;21(16):3340–6. doi: 10.1093/bioinformatics/bti535. [DOI] [PubMed] [Google Scholar]
- 29.Bergeron A, Mixtacki J, Stoye J. A Unifying View of Genome Rearrangements. In: Bucher P, Moret BME, editors. 6th International Workshop on Algorithms in Bioinformatics (WABI 2006). vol. 4175 of Algorithms in Bioinformatics. Berlin, Heidelberg: Springer Berlin Heidelberg; 2006:163-73.
- 30.Shao M, Lin Y, Moret BME. An exact algorithm to compute the double-cut-and-join distance for genomes with duplicate genes. J Comput Biol. 2015;22(5):425–35. doi: 10.1089/cmb.2014.0096. [DOI] [PubMed] [Google Scholar]
- 31.Bohnenkämper L, Braga MD, Doerr D, Stoye J. Computing the rearrangement distance of natural genomes. J Comput Biol. 2021;28(4):410–31. doi: 10.1089/cmb.2020.0434. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Rautiainen M, Marschall T. MBG: minimizer-based sparse de Bruijn graph construction. Bioinformatics. 2021;37(16):2476–8. doi: 10.1093/bioinformatics/btab004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Li H, Feng X, Chu C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 2020;21(1):1–19. doi: 10.1186/s13059-020-02168-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Garrison E, Guarracino A, Heumos S, Villani F, Bao Z, Tattini L, et al.. pggb: the PanGenome Graph Builder; 2023. [Paper submission pending; Online, accessed 27-January-2023]. https://github.com/pangenome/pggb.
- 35.Höps W, Rausch T, Ebert P, Human Genome Structural Variation Consortium (HGSVC), Korbel JO, Sedlazeck FJ. Impact and characterization of serial structural variations across humans and great apes; 2023. [DOI] [PMC free article] [PubMed]
- 36.Gutin G, Jones M, Sheng B, Wahlström M, Yeo A. Chinese postman problem on edge-colored multigraphs. Discrete Appl Math. 2017;217:196–202. doi: 10.1016/j.dam.2016.08.005. [DOI] [Google Scholar]
- 37.Ahuja RK, Magnanti TL, Orlin JB. Network Flows: Theory, Algorithms, and Applications. 1st ed.; 1993.
- 38.Gusfield D. Algorithms on stings, trees, and sequences: computer science and computational biology. Acm Sigact News. 1997;28(4):41–60. doi: 10.1145/270563.571472. [DOI] [Google Scholar]
- 39.Ukkonen E. On-line construction of suffix trees. Algorithmica. 1995;14(3):249–60. doi: 10.1007/BF01206331. [DOI] [Google Scholar]
- 40.Matsakis ND, Klock II FS. The rust language. In: ACM SIGAda Ada Letters. vol. 34(3). ACM; 2014. p. 103-4.
- 41.Gurobi Optimization L. Gurobi Optimizer Reference Manual; 2019. http://www.gurobi.com.
- 42.Bonnet K, Doerr D. Analysis of the set of founder sequences under a homologous recombination model; 2023. https://github.com/marschall-lab/hrfs.
- 43.Wick RR, Schultz MB, Zobel J, Holt KE. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics. 2015;31(20):3350–2. doi: 10.1093/bioinformatics/btv383. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All data used for the analysis of the 1p36.13 locus is included in Additional file 1: Table T1. The simulation experiments are solely based on data generated by a program named hapsim, available in source code form in the public repository [42].