Algorithms for Molecular Biology : AMB
2023 Dec 1;18:17. doi: 10.1186/s13015-023-00239-x

New algorithms for structure informed genome rearrangement

Eden Ozeri, Meirav Zehavi, Michal Ziv-Ukelson
PMCID: PMC10691145  PMID: 38037088

Abstract

We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, ConstrainedTreeToStringDivergence (CTTSD), we define the basic structure-informed rearrangement measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The PQ-tree representing the gene cluster is ordered such that the series of gene IDs spelled by its leaves is equivalent to that of the reference gene order. Then, a structure-informed genome rearrangement distance is computed between the ordered PQ-tree and the target gene order. The second problem, TreeToStringDivergence (TTSD), generalizes CTTSD, where the gene order members are not necessarily permutations and the structure informed rearrangement measure is extended to also consider up to d_S and d_T gene insertion and deletion operations, respectively, when modelling the PQ-tree informed divergence process from the reference gene order to the target gene order. The first algorithm solves CTTSD in O(nγ² · (m_p·1.381^γ + m_q)) time and O(n²) space, where γ is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of CTTSD is 0, then the algorithm runs in O(nmγ²) time and O(n²) space. The second algorithm solves TTSD in O(n²γ²d_T²d_S²m²(m_p·5^γ γ + m_q)) time and O(d_T d_S m(mn + 5^γ)) space, where γ is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively, and allowing up to d_T deletions from the tree and up to d_S deletions from the string. The third algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of TTSD is 0) in O(nγ²d_T²d_S²m²(m_p·4^γ γ² n(d_T + d_S + m + n) + m_q)) time and O(γ²nm²d_T d_S(d_T + d_S + m + n)) space. The algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes.

Keywords: PQ-tree, Gene cluster, Breakpoint distance

Introduction

Recent advances in pyrosequencing techniques, combined with global efforts to study infectious diseases, yield huge and rapidly-growing databases of microbial genomes [1, 2]. These big new data statistically empower genomic-context based approaches to functional and evolutionary analysis: the biological principle underlying such analyses is that groups of genes that are located close to each other across many genomes often code for proteins that interact with one another, suggesting a common functional association.

Groups of genes that are co-locally conserved across many genomes are denoted gene clusters. The order of the genes in distinct genomic occurrences of a gene cluster may not be conserved. A specific order of the genes of a gene cluster, that is co-linearly conserved across many genomes, is denoted a gene order of the gene cluster. The distinct genomes in which a gene order occurs are denoted instances of the gene order. Gene clusters in prokaryotic genomes often correspond to (one or several) operons; those are neighbouring genes that constitute a single unit of transcription and translation.

In this paper, our biological goal is to study the evolution of gene clusters in prokaryotes, by computing the divergence between pairs of gene orders that belong to the same gene cluster, based on genome rearrangement scenarios. When defining this computational task as an optimization problem, one needs to take into account that parsimony considerations may not be sufficient: driven by the objective to keep the genome small and efficient, in spite of the high rate of gene shuffling in the prokaryotic genome, only gene orders that are reinforced by conveying some advantage in fitness (i.e., in adaptation to some niche) will be kept in the genome. This calls for a Structure-Informed Genome Rearrangement (SIGR) divergence measure that will interleave parsimony considerations with some learned structural and functional information regarding the gene cluster under study. Such a measure could more accurately assess the degree of divergence from one order of a gene cluster to another, and provide further understanding of gene-context level environmental-specific adaptations [3, 4].

To this end, we propose a new SIGR-based divergence measure and provide efficient algorithms to compute it. According to our approach, information regarding the structure of the gene cluster is learned from the known gene orders of the gene cluster and represented by a PQ-tree (formally defined in “Preliminaries” section). The PQ-tree is then utilized to both guide the allowed operations and to compute their weights. A motivating example for our proposed approach, including exemplifying figures, can be found in “A motivating example” section.

PQ-trees have been advocated as a representation for gene clusters [5–7]. A PQ-tree describes the possible permutations of a given sequence, and can be constructed in polynomial-time [8]. The PQ-tree representing a given gene cluster describes its hierarchical inner structure and the relations between instances of the gene cluster succinctly, assists in predicting the functional association between the genes in the gene cluster, yields insights into the evolutionary history of the gene cluster, and provides a natural and meaningful way of visualizing complex gene clusters.

The biological assumption underlying the representation of gene clusters as PQ-trees is that operons evolve mainly via progressive merging of sub-operons, where the most basic units in this recursive operon assembly are colinearly conserved sub-operons [9]. In the case where an operon is assembled from sub-operons that are colinearly dependent, the conserved gene order could correspond, e.g., to the order in which the transcripts of these genes interact in the metabolic pathway in which they are functionally associated [10]. Thus, rearrangement events that shuffle the order of the genes (or of smaller sub-operons) within this sub-operon could affect the function of its product. On the other hand, inversion events in which the genes participating in this sub-operon remain colinearly ordered with respect to the transcription order have less of an effect on the interactions between the sub-operon’s gene products.

The case of colinearly conserved sub-operons is represented in the PQ-tree by a Q-node (marked with a rectangle in the exemplifying figures), and by a Reversal operation in the corresponding pairwise gene order rearrangement scenario. In the case where an operon is assembled from sub-operons that are not colinearly co-dependent, convergent evolution could yield various orders of the assembled components [9]. This case is represented in the PQ-tree by a P-node (marked with a circle in the exemplifying figures), and by a Block Interchange operation in the corresponding pairwise gene order rearrangement scenario.

Background on structure informed genome rearrangement (SIGR) scenarios

A generic formulation of genome rearrangement problems is, given two genomes and some allowed edit operations, to transform one genome into the other using a minimum number of edit operations [11–14]. A famous algorithmic result related to genome rearrangements concerns the problem of sorting signed permutations by Reversals. This problem aims at computing a shortest sequence of Reversals that transforms one signed permutation into another, and can be solved in polynomial time [15–17]. It was later generalized to handle, still in polynomial time, multichromosomal genomes with linear chromosomes, using rearrangements such as Translocations, Chromosome Fusions and Fissions [18, 19]. Then, a general operation called Double Cut-and-Join (DCJ), was introduced in [20] for handling problem instances where the common intervals are organized in a nonlinear structure. A DCJ can be, among others, a Reversal, a Translocation, a Fusion or a Fission, but two consecutive DCJ operations can also simulate a Block-Interchange or a Transposition.

Previous works proposed related forms of SIGR by considering rearrangement scenarios on two permutations that preserve their common intervals (groups of co-localized genes). Such scenarios, which may not be shortest among all scenarios, are called perfect [21]. Computing a Reversal scenario of minimum length that preserves a given subset of the common intervals of two signed permutations is NP-hard [21], and several papers have explored this problem, describing families of instances that can be solved in polynomial time [22–25], fixed parameter tractable algorithms [23, 26], and an exponential time algorithm (which works also in the more general weighted case) [27]. We note that all the perfect scenarios mentioned above considered only Reversal operations, while for our settings Block-Interchange operations should also be considered. A heuristic approach that implements, among other operations, perfect reversals and block interchanges is described in [28].

In [29] the notion of perfect scenario was extended to the Perfect DCJ model, thus capturing additional operations in perfect scenarios, including Cut, Join and Block-Interchange. When considering the perfect rearrangement scenarios that best fit our problem, this is the model that is most relevant to our settings, as the other (non-heuristic) previous works do not include the Block-Interchange operation. The operations considered by the Perfect DCJ model are very general, which renders the problem computationally intractable in its general setting. Indeed, Berard et al. [29] only obtain positive algorithmic results for special cases that, in particular, do not encompass the structure of a PQ-tree, and with a parameter that can often be of the magnitude of the entire input size. For us, the aforementioned special cases are too restricted, and cannot model the problem we have at hand. On the other hand, fortunately for us, considering Cut and Join operations is an unjustified overhead. Specifically, we seek to model the considered evolutionary scenarios by a formulation that is more specific to our biological problem, in order to increase the divergence measure accuracy as well as tighten the parameters driving the complexity of the algorithms for the problem. Since, in our problem, we are dealing with prokaryotic gene clusters and the data in our experiment is typically confined to one chromosome per genome, we need not consider Cut and Join operations. In addition, the intervals in prokaryotic gene clusters follow a strongly conserved hierarchy, naturally modeled by the PQ-tree learned from the members of the gene cluster. In terms of divergence measure accuracy, we would like to enforce the PQ-tree structure as a constraint on the considered rearrangements. Furthermore, while the Perfect DCJ model is unweighted (and simply counts the number of DCJ operations applied), we use the PQ-tree as a guide affecting the weights of the applied rearrangement operations. In terms of tightening the parameters driving the complexity of the computation, the PQ-tree constraint enables us to use dynamic programming algorithms and to reduce the parameter from n to the out-degree of the tree. In particular, this means that the more hierarchical the input is, the smaller our parameter is likely to be, and the faster our algorithm is; in other words, the running time of our algorithm naturally scales with the amount of structure given by the PQ-tree.

In general, our algorithm is intended for widely conserved prokaryotic gene clusters (with cluster length bounded by tens of genes) that display a hierarchical structure. In such cases, the maximal degree of a P-node (which is the exponential factor in the time complexity of our proposed algorithm) is typically small, in particular with respect to the maximal degree of a Q-node (which affects the time complexity of the algorithm by a polynomial factor). This is due to the fact that gene clusters in bacteria are very highly colinearly conserved [30, 31], even when the benchmark dataset is large and spans a wide taxonomical range of prokaryotes [32]. Furthermore, the very same data set used in this study (consisting of 1487 prokaryotic genomes) was previously analyzed by Svetlitsky et al. [32] in order to study the degree of conserved collinearity among widely spread gene clusters in prokaryotic genomes. The study found a negative correlation between the proportion of shuffled gene clusters and their length, i.e. the longer the gene cluster is, the more it is colinearly conserved. All this inspired us to develop an algorithm that, on the one hand, captures the structure of gene clusters in an informed way, and, on the other hand, employs the P-node out-degree as a bound on its exponential factor. To this end we propose an FPT algorithm for the TTSD problem, where the structure of the gene cluster is represented by a PQ-tree, and the parameter bounding the exponential factor of the algorithm is the maximal out-degree of a P-node in the tree.

Our contribution

We propose a new, two-step approach to SIGR: In the first step, the internal topology of the gene cluster under study is learned from its known gene orders and a PQ-tree is constructed accordingly. Then, in the second step, given a reference gene order and a target gene order, a SIGR scenario is computed from the reference to the target, such that colinear dependencies among genes and between sub-operons, as learned by the PQ-tree, are taken into account by the penalties assigned to the rearrangement operations.

To this end, we define two new theoretical problems and propose three algorithms to solve them. In the first problem, denoted ConstrainedTreeToStringDivergence (CTTSD), we define the basic SIGR divergence measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The rearrangement operations considered by this problem include (weighted) Reversals and Block-Interchange operations. In this problem, the PQ-tree representing the gene cluster (Fig. 3A) is ordered such that the series of gene IDs spelled by its leaves is equivalent to the reference gene order (Fig. 3B). Then, a weighted SIGR measure is computed from the ordered PQ-tree to the target gene order (Fig. 3C).

Fig. 3 Re-considering the signed break-points (marked in blue) between the two gene orders exemplified in Fig. 2, this time taking into account the PQ-tree representing the corresponding gene cluster. A The PQ-tree. Note that in this figure, as well as the following figures, P-nodes are marked with circles and Q-nodes are marked with rectangles. B The first gene order. C The second gene order

The second problem, denoted TreeToStringDivergence (TTSD), is a generalization of the first problem, where the gene order members are not necessarily permutations and the genome rearrangement measure is extended to also consider up to d_S gene insertion operations and up to d_T gene deletion operations.

The first fixed parameter tractable (FPT) algorithm (in “Constrained TreeToString Divergence: algorithm” section) solves CTTSD in O(nγ² · (m_p·1.381^γ + m_q)) time and O(n²) space, where γ is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of CTTSD is defined to be 0, then the algorithm runs in O(nmγ²) time and O(n²) space.

The second FPT algorithm solves TTSD in O(n²γ²d_T²d_S²m²(m_p·5^γ γ + m_q)) time and O(d_T d_S m(mn + 5^γ)) space, where γ is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively, and allowing d_T deletions from the tree and d_S deletions from the string.

Dynamic programming is common both on trees and on sequences, and our algorithms combine the two. While our first algorithm is simple and intuitive (based on one dynamic programming and two greedy procedures), for our second algorithm (based on three dynamic programming procedures), more technical ingredients are required. For example, one challenge is the need to compute a vertex cover in a graph that is not fully known by any single entry of our dynamic programming table. Specifically, when we consider a single entry, some of the relevant vertices are not yet processed, and for those that are processed, we cannot store enough information (for the sake of efficiency) so as to deduce which edges exist between them.

The third FPT algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of TTSD is defined to be 0) in O(nγ²d_T²d_S²m²(m_p·4^γ γ² n(d_T + d_S + m + n) + m_q)) time and O(γ²nm²d_T d_S(d_T + d_S + m + n)) space. This algorithm employs the principle of inclusion–exclusion for the sake of space reduction, which, to the best of our knowledge, is not commonly used in the study of problems in computational biology.

The proposed general algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes (in “Results” section). Our preliminary results, based on the analysis of the 59 gene clusters, indicate that our proposed measure correlates well with an index that is computed by comparing the class composition of the genomic instances of the two compared gene orders. The correlations yielded by our measure are shown to significantly increase with the increase in conserved structure of the corresponding gene clusters (as modelled by PQ-trees).

Roadmap

The rest of the paper is organized as follows. In “A motivating example” section we present a motivating biological example. Previous works are reviewed in “Previous related works” section. In “Preliminaries” section, we formally define the terminology used throughout the paper, and, in particular, the two problems studied in this paper. In “Constrained TreeToString Divergence: algorithm” section, we present our first algorithm, which solves the CTTSD problem. In “TreeToString Divergence: algorithm” section, we present our second algorithm, which solves the TTSD problem. In “TTSD: polynomial space complexity” section, we present our third algorithm, which solves a variant of the TTSD problem and improves the space complexity of the second algorithm. In “Methods and datasets” section, we specify the details of our data set construction and experiment. Finally, in “Results” section, we compare the performance of our proposed rearrangement measure versus that of signed break-point distance on a benchmark of 59 chromosomal gene clusters. Concluding remarks are given in “Conclusions” section.

A motivating example

In this section we give a biological example to motivate the new problems defined in this paper. To this end, we chose to interpret the first result shown in Table 2 (found in “Appendix”), which corresponds to a gene cluster consisting of three gene orders of a known ABC transporter operon, encoding genes participating in a carbohydrate uptake system [33]. This example is illustrated in Fig. 1. In each of the gene orders shown in the figure, the product of gene number one is responsible for regulating the transcription of the other genes. The genes numbered two, three and four code for proteins that are part of the transporter machine. Gene number five codes for a protein that is responsible for some metabolism of the transported substrate, such as breaking down a complex sugar into smaller ones [34]. The genes numbered two, three, four and five are located close to each other in all gene orders and are included in a single operon, whose transcription is regulated by the product of gene number one [34].

Table 2.

59 gene clusters analyzed in our experiment

#  Gene cluster ID  Diverge  d_SBP  d_reversals  Size  PQ-tree
1 19876 1.000 0.655 0.500 3 ((0− [1− 2− 3−]) 4+)
2 21344 0.905 0.615 0.439 4 ([0− 1−] [2+ 3+])
3 19877 0.997 0.915 0.592 3 ([0+ 1+ 2+] 3+)
4 27180 0.828 0.828 0.436 3 ([0− 1− 2−] 3+)
5 14602 0.978 0.985 0.654 4 (0− [1− 2− 3− 4−])
6 23340 0.974 0.809 0.684 3 (0− (1+ [2+ 3+]))
7 19853 0.967 0.825 0.703 3 (0− ([1+ 2+ 3+] 4+))
8 14790 0.994 0.994 0.808 3 ([0− 1− 2− 3−] 4− 5−)
9 26244 0.971 0.996 0.817 3 (0− [1− 2− 3−])
10 26243 0.903 0.909 0.757 4 (([0− 1−] 2−) 3−)
11 26476 0.925 0.998 0.791 3 (0+ [1+ 2+ 3+])
12 15297 0.991 0.945 0.866 3 (0− ([1− 2−] [3− 4+]))
13 22866 0.845 0.600 0.722 5 ([0+ 1+] [2+ 3+])
14 26238 0.785 0.807 0.663 7 ([0− 1−] 2− 3−)
15 18995 0.770 0.621 0.681 9 ((0+ [1+ 2+]) 3+)
16 28119 0.559 0.559 0.478 5 (0− 1− 2− 3−)
17 25371 0.983 0.942 0.907 4 (0− ([1− 2−] 3−))
18 26231 0.974 0.802 0.909 4 ((0+ [1+ 2+]) 3+)
19 14796 0.760 0.686 0.700 9 (0− 1− [2− 3−] 4−)
20 20007 0.999 0.999 0.960 3 (0− 1− 2− [3− 4−])
21 22299 0.929 0.929 0.891 3 (0− 1− 2− 3−)
22 19255 0.948 0.940 0.918 5 ([0− 1−] [2− 3−] 4−)
23 20764 0.880 0.987 0.852 3 ([0+ 1+ 2+] 3+)
24 21553 0.982 0.982 0.961 3 (0− [1− 2−] 3− 4−)
25 27851 1.000 1.000 0.982 3 (0+ [1+ 2+] 3+)
26 27427 0.866 0.866 0.866 3 (0+ [1+ 2+ 3+])
27 15467 1.000 1.000 1.000 3 ([0− 1− 2− 3−] 4− 5−)
28 19610 0.998 0.998 0.998 3 (0− ([1− 2−] 3− 4−))
29 20078 0.982 0.945 0.982 3 (0+ (1+ [2+ 3+]))
30 22181 0.912 0.940 0.912 3 (([0+ 1+] 2+) 3+)
31 25612 0.997 0.888 0.997 4 ([0− 1−] [2− 3−])
32 27177 0.945 0.945 0.945 3 (0− [1+ 2+ 3+])
33 8962 0.933 0.933 0.933 3 (0− 1− [2− 3− 4− 5− 6−])
34 27530 0.999 0.974 0.999 3 [[0− 1−] 2− 3−]
35 19256 0.779 0.737 0.798 8 (0− 1− 2− 3−)
36 21317 0.974 0.974 0.995 3 (0+ 1+ [2+ 3+ 4+])
37 19852 0.890 0.821 0.919 6 (0− [1− 2−] 3−)
38 27250 0.967 0.967 0.997 3 ([0+ 1+] [2+ 3+])
39 30976 0.963 1.000 0.996 3 [0− 1+ [2+ 3+]]
40 28245 0.715 0.715 0.775 3 (0− 1− 2− 3−)
41 26257 0.933 0.988 0.999 3 ((0+ [1+ 2+]) 3+)
42 14839 0.860 0.912 0.928 8 ([0− 1−] 2− 3− 4−)
43 27207 0.748 0.749 0.824 7 (0− [1− 2−] 3+)
44 12685 0.917 0.917 0.994 3 ([0− 1− 2−] 3− 4− 5−)
45 21268 0.876 0.876 0.956 4 ([0+ 1+] [2+ 3+])
46 23083 0.832 0.916 0.912 5 (([0− 1−] 2−) 3−)
47 25597 0.810 0.817 0.892 5 (0+ [1+ 2+] 3+)
48 19005 0.902 0.997 0.997 3 ((0+ 1+ 2+) 3+)
49 15035 0.623 0.671 0.732 4 ((0+ ([1+ 2+] 3+)) 4+)
50 28547 0.828 0.828 0.961 7 (0− 1− 2− 3−)
51 25375 0.866 0.866 1.000 3 ([0− 1−] 2− 3−)
52 20332 0.860 0.860 0.997 5 (0+ 1+ [2+ 3+])
53 26364 0.839 0.869 0.988 5 (0+ 1+ 2+ 3+)
54 19909 0.818 0.818 0.968 6 (0− [1− 2−] 3−)
55 20601 0.586 0.586 0.756 5 (0− 1− 2− 3−)
56 15391 0.784 0.784 0.999 3 (0− (1− [2− 3− [4− 5−]]))
57 14573 0.501 0.501 0.722 4 (0− 1− (2− [3− 4−]) 5−)
58 25554 0.721 0.971 0.971 3 (([0+ 1+] 2+) 3+)
59 19526 0.543 0.471 0.814 4 ((0− [1− 2−]) 3−)

For each gene cluster (row in the table) we report the Pearson correlation versus the corresponding instance-abundance measure for each of the three genome rearrangement measures compared in our experiment: our proposed measure (Diverge), signed break-point distance (d_SBP) as in Definition 8, and CREx reversals distance (d_reversals) [28]. The column denoted “Size” gives the number of genes in the gene cluster, and the column denoted “PQ-tree” gives the corresponding PQ-tree representing the gene cluster. The functional category annotations for each cluster are given in Tables 3 and 4

Fig. 1 Gene cluster of a known sugar transporter operon. A Three gene order instances of the gene cluster. B The color-coding key for the functional categories of the genes

Note that even though random shuffling by evolution forms different gene orders, not every order is conserved. In prokaryotic genomes, gene clusters often correspond to operons, which are adjacently localized genes that are co-transcribed and co-translated. Here, selection for specific gene orders in operons can be due to the assembly order of a multifunctional enzymatic complex [35] or the performance of consecutive reactions in a metabolic pathway [36]. The conserved gene order within operons also facilitates the evolution of multiple post-transcriptional and post-translational feedback mechanisms that regulate the expression of enzymes catalyzing different steps in the pathway according to metabolite availability [37]. Thus, the gene orders we see today are functionally conserved. In order to take this into consideration in our genome-rearrangement approach, we use a PQ-tree to represent the gene cluster and to model the possible evolutionary events that yielded the various gene orders (see Fig. 3).

Considering the first two gene orders in the cluster, the signed break-point distance between them is 2 (see Fig. 2). Now, let’s use our knowledge of the evolution of the gene cluster, using the PQ-tree (in Fig. 3), and reconsider the break-point between genes one and two. Looking at the PQ-tree, we can see that the proposed break-point actually falls between gene one and the P-node y, which represents genes from the cluster that are included in the same operon. Notice that the adjacency of gene one and P-node y appears both in the first gene order and in the second gene order. Therefore, according to the PQ-tree, this proposed break-point is functionally irrelevant and should not be penalized by the corresponding SIGR score! This makes biological sense, since gene 1 is a transcription factor that typically appears upstream to the operon it regulates, regardless of how the genes are arranged within the operon [34].

Fig. 2 Computing the signed break-point distance between the first two gene orders from the example shown in Fig. 1. The two signed break-points are denoted by the red lines

Next, consider a comparative analysis between these two gene orders based on a perfect reversals distance scenario. A perfect reversals scenario would acknowledge the observation that the transcription factor gene is not colinearly dependent on a specific gene from the transporter operon; however, it would entail a long and expensive series of reversal operations: first, a reversal of the sequence of genes 2–5, and then the individual reversal of each one of these genes. Although this may have been the actual evolutionary scenario that led to the second gene order, this scenario is quite uninformative in terms of both assessing the functional distance and interpreting the functional discrepancy between the two gene orders. The point is, as would be observed by a PQ-tree guided block-interchange operation, that in the first order the sugar transporter is transcribed prior to the transcription of the sugar-metabolising gene, while in the second gene order the transcription order of these units is reversed.

Previous related works

Permutations on strings representing syntenic gene blocks on genomes have been studied earlier by [38–42], and the idea of a maximal permutation pattern was introduced by Eres et al. [41]. In [8] an algorithm was proposed for representation and detection of gene clusters in multiple genomes using PQ-trees. The proposed algorithm identifies a minimal consensus PQ-tree, and it is proven that this tree is equivalent to a maximal permutation pattern and that each subgraph of the PQ-tree corresponds to a non-maximal permutation pattern. In that paper, the authors also present a general scheme to handle gene multiplicity and missing genes in permutations and give a linear time algorithm to construct the minimal consensus PQ-tree. These previous works fall in the domain of permutation discovery and PQ-tree construction. In contrast, the problems addressed in our paper take as input a previously constructed PQ-tree.

Three additional related works fall in the domain of “PQ-tree language distance computations”. The first problem is defined and solved in [7]. It asks, given a PQ-tree T and a string s, to find, among all permutations that T can generate, a permutation p such that the edit distance between p and s is minimal. Thus, the work of [7] used the input PQ-tree only as a constraint to guide the alignment process, and did not project the guiding information to the comparative score, as we do. Furthermore, the work of [7] did not consider genome rearrangement operations at all, only string edit operations.

For break-point distance, two problems were proposed and solved in [43]. The first problem asks, given two PQ-trees over permutation gene orders and a parameter k, whether there are two strings S_1 and S_2, generated from the two trees, respectively, such that the break-point distance between S_1 and S_2 is at most k. The second problem asks, given a PQ-tree T, a set of p permutations and a parameter k, whether T can generate a permutation s such that the sum of the break-point distances between s and each of the given p permutations is bounded by k. However, in both of the problems addressed in [43], neither the order of the leaves in the input tree nor the actions taken on it are taken into account. Additionally, the break-point distance is computed between two strings, and the tree is not a part of that distance computation. In this paper we employ a scoring strategy that takes the order of the leaves in the tree and the actions on the tree into account. In particular, we define the divergence from an ordered PQ-tree to a string by the rearrangement actions applied on the tree.

Preliminaries

Let S = s_1…s_n be a string. Denote by S[i] the character in position i in S, i.e. S[i] = s_i. In addition, denote by S[i : j] (i ≤ j) the subsequence of S from position i to position j, i.e. S[i : j] = s_i…s_j.

PQ-tree—representing the pattern

A PQ-tree is a rooted tree with three types of nodes: P-nodes, Q-nodes and leaves. The children of a P-node can appear in any order, while the children of a Q-node must appear in either left-to-right order or right-to-left order. Booth and Lueker [5] were interested in permutations of a set U, thus every member of U appears exactly once as a label of a leaf in the PQ-tree. We, on the other hand, allow each member of the set to appear as a label any non-negative number of times. The possible reordering of the children nodes in a PQ-tree can potentially create many equivalent PQ-trees. Booth and Lueker defined two PQ-trees T and T′ as equivalent (denoted T ≡ T′) if and only if one tree can be obtained by legally reordering the nodes of the other; namely, randomly permuting the children of a P-node, and reversing the children of a Q-node. A generalization of their definition, to allow for insertions and deletions, is defined as follows.
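
To make these reorderings concrete, the following minimal Python sketch (our own illustration, not the authors' implementation; node and field names are assumptions) represents a PQ-tree and enumerates the frontiers of all equivalent trees obtained by permuting the children of P-nodes and optionally reversing the children of Q-nodes.

```python
# A minimal PQ-tree sketch (illustrative only; names are our own).
from dataclasses import dataclass, field
from itertools import permutations, product
from typing import List

@dataclass
class PQNode:
    kind: str                      # 'leaf', 'P' or 'Q'
    label: str = ''                # used only for leaves
    children: List['PQNode'] = field(default_factory=list)

def frontiers(node: PQNode):
    """Yield the frontier of every PQ-tree equivalent to the one rooted at node."""
    if node.kind == 'leaf':
        yield node.label
        return
    child_options = [list(frontiers(c)) for c in node.children]
    if node.kind == 'P':
        orders = permutations(range(len(node.children)))
    else:                          # Q-node: original order or its reversal
        orders = [tuple(range(len(node.children))),
                  tuple(reversed(range(len(node.children))))]
    seen = set()
    for order in orders:
        for pick in product(*(child_options[i] for i in order)):
            s = ''.join(pick)
            if s not in seen:
                seen.add(s)
                yield s

# Example: a P-node over a Q-node [a b] and a leaf c.
t = PQNode('P', children=[PQNode('Q', children=[PQNode('leaf', 'a'), PQNode('leaf', 'b')]),
                          PQNode('leaf', 'c')])
print(sorted(frontiers(t)))        # ['abc', 'bac', 'cab', 'cba']
```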

Definition 1

(Quasi-Equivalence) Two PQ-trees T, T′ are quasi-equivalent with parameter d, denoted by T ≡_d T′, if and only if T′ can be obtained by (a) randomly permuting the children of the P-nodes of T, (b) reversing the children of the Q-nodes of T, (c) deleting up to d leaves of T.

Denote by T_x the subtree of a PQ-tree T rooted in the node x. Denote by Leaves(x) the set of leaves of the PQ-tree T_x, span(x) = |Leaves(x)|, and, for a set of nodes U, span(U) = Σ_{v∈U} span(v). Denote by children(x) the set of children of the node x, and let root_T be the root node of the tree T.

Given a PQ-tree T, we denote the label of a leaf x of T by label(x). The frontier of the PQ-tree T, denoted F(T), is the sequence of labels on the leaves of T read from left to right. In addition to Definition 1, for each node in a PQ-tree (internal node or leaf), we define a unique “color” which will help us distinguish and map between nodes of two quasi-equivalent PQ-trees. Colors are used only for the analysis; they are not used explicitly in the algorithm. In the input we receive one PQ-tree (T) and “assign” arbitrary (but unique) colors to its nodes. In the algorithms, we perform actions on T, which reorder its nodes. We refer by T′ to the original tree (T) with reordered nodes. Thus, we use colors just to keep track of the nodes after the shuffling, see Fig. 4. From now on, when we say that two PQ-trees T, T′ are quasi-equivalent, we assume that the equivalence is with parameter d. In addition, we assume that the PQ-trees T, T′ are colored with the same unique colors (each color is assigned to exactly one node in T and to at most one node in T′, and the nodes in T′ have the same colors as their corresponding nodes in T). In addition, we say that the frontier of T′, F(T′), is derived from T, and we call this a derivation. We can also say that T is ordered as F(T). When a string S is derived from T_x, we also say that S is derived from x.

Fig. 4 a PQ-tree T. b PQ-tree T′, which is quasi-equivalent to T. The colors are represented by the number assigned to each node; the labels of the leaves are the letters. Each internal node in T is marked by a letter (x, for example), and its equivalent node in T′ is marked by the same letter with a dash (x′)

Definition 2

(Equivalent Nodes) Given two quasi-equivalent PQ-trees T, T′ with parameter d and two nodes x ∈ T and x′ ∈ T′, x and x′ are equivalent nodes if they share the same color.

In Fig. 4, the colors of the trees are shown as unique numbers near each node (notice that each node “keeps” its color in T′ compared to T). We say that the string “abab” is derived from T, because “abab” is the frontier of T′.

Break-point distances

The definitions of our problems make use of the notion of break-point distance [43] to determine the distance between two strings, as defined below.

Definition 3

(Gene Mapping) Let G = g_1,…,g_n and H = h_1,…,h_m be two strings. A gene mapping of G and H, denoted by M, is a set of pairs (i, j) ∈ {1,…,n} × {1,…,m} such that g_i = h_j and every position in G and H is in exactly one pair in M. When no confusion arises, we suppose that the gene mapping contains the pairs of genes (g_i, h_j) themselves.

Definition 4

(Break-Point) Given two strings G = g_1,…,g_n, H = h_1,…,h_m and a gene mapping M, a break-point between G and H is a pair of consecutive genes g_i g_{i+1} in G (resp. h_i h_{i+1} in H) such that the following is true: g_i and g_{i+1} (resp. h_i and h_{i+1}) belong to M, say, (g_i, h_j), (g_{i+1}, h_k) ∈ M (resp. (g_j, h_i), (g_k, h_{i+1}) ∈ M), but neither k = j+1 nor k = j−1.

Denote by NUM_BP(G, H, M) the number of break-points of G between G and H with respect to M.

Definition 5

(Break-Point Distance) Let S_1 and S_2 be two strings. The break-point distance between S_1 and S_2, denoted by d_BP(S_1, S_2), is the minimum of NUM_BP(S_1, S_2, M) among all gene mappings M of S_1 and S_2.

For example, suppose we are given the strings S_1 = abcd and S_2 = acbd. Then, there exists exactly one gene mapping of S_1 and S_2. The break-points of S_1 are the pairs (a, b) and (c, d). Therefore, the break-point distance between them is 2. We can also count the number of break-points of S_2, which are (a, c) and (b, d).
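
As an illustrative aid (our own sketch, not the paper's implementation), the following Python snippet counts break-points between two strings over the same set of distinct characters, where the gene mapping is unique; it reproduces the example above.

```python
# Count break-points of G with respect to H when both strings are permutations
# of the same character set, so the gene mapping is unique (illustrative sketch).
def breakpoints(G: str, H: str) -> int:
    pos_in_H = {ch: j for j, ch in enumerate(H)}   # unique mapping: character -> position in H
    count = 0
    for i in range(len(G) - 1):
        j, k = pos_in_H[G[i]], pos_in_H[G[i + 1]]
        if k != j + 1 and k != j - 1:              # neither adjacency is preserved
            count += 1
    return count

print(breakpoints("abcd", "acbd"))  # 2: the pairs (a, b) and (c, d)
print(breakpoints("acbd", "abcd"))  # 2: the pairs (a, c) and (b, d)
```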

We will use a variant of break-point distance that takes the signs of the characters in a string into account. Towards that, we define the notion of a signed string.

Definition 6

(Signed String) A signed string is a string where each character is assigned a sign (‘+’ or ‘−’).

When we say that a string S is a signed string, we suppose to have a sign function sign that returns the sign of the character in each position in S. For the sake of illustration, in our examples, we will indicate the sign of each character by a ‘+’ or ‘−’ on its left side.

Definition 7

(Signed Break-Point) Given two signed strings G = g_1,…,g_n, H = h_1,…,h_m and a gene mapping M, a signed break-point between G and H is a pair of consecutive genes g_i g_{i+1} in G (resp. h_i h_{i+1} in H) such that the following is true: g_i and g_{i+1} (resp. h_i and h_{i+1}) belong to M, say, (g_i, h_j), (g_{i+1}, h_k) ∈ M (resp. (h_i, g_j), (h_{i+1}, g_k) ∈ M), but neither [k = j+1, sign_G(i) = sign_H(j) and sign_G(i+1) = sign_H(k)] nor [k = j−1, sign_G(i) = −sign_H(j) and sign_G(i+1) = −sign_H(k)].

Denote by NUM_SBP(G, H, M) the number of signed break-points of G between G and H with respect to M.

Definition 8

(Signed Break-Point Distance) Let S_1 and S_2 be two signed strings. The signed break-point distance between S_1 and S_2, denoted by d_SBP(S_1, S_2), is the minimum of NUM_SBP(S_1, S_2, M) among all gene mappings M.

For example, consider the signed strings S_1 = +a+b+c+d and S_2 = +a−b−d−c. Then, there exists exactly one gene mapping of S_1 and S_2. The signed break-points of S_1 are the pairs (a, b) and (b, c). Therefore, d_SBP(S_1, S_2) = 2. We can also count the number of signed break-points of S_2, which are (a, b) and (b, d).
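
A small Python sketch (ours, under the assumption that the two signed strings are permutations of the same characters, so the gene mapping is unique) makes Definition 7 operational and reproduces the example above.

```python
# Count signed break-points of G with respect to H for signed permutations
# over the same character set (illustrative sketch; signs given as '+'/'-').
def signed_breakpoints(G, H):
    """G and H are lists of (sign, char) pairs, e.g. [('+', 'a'), ('-', 'b')]."""
    pos = {ch: j for j, (_, ch) in enumerate(H)}       # unique mapping: char -> position in H
    sign_H = {ch: s for s, ch in H}
    count = 0
    for i in range(len(G) - 1):
        (s1, c1), (s2, c2) = G[i], G[i + 1]
        j, k = pos[c1], pos[c2]
        keeps_order = (k == j + 1 and s1 == sign_H[c1] and s2 == sign_H[c2])
        keeps_reversed = (k == j - 1 and s1 != sign_H[c1] and s2 != sign_H[c2])
        if not keeps_order and not keeps_reversed:
            count += 1
    return count

S1 = [('+', 'a'), ('+', 'b'), ('+', 'c'), ('+', 'd')]
S2 = [('+', 'a'), ('-', 'b'), ('-', 'd'), ('-', 'c')]
print(signed_breakpoints(S1, S2))  # 2: the pairs (a, b) and (b, c)
print(signed_breakpoints(S2, S1))  # 2: the pairs (a, b) and (b, d)
```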

To accommodate deletions, we will use Definition 8 with respect to strings obtained from the given ones after deleting characters.

Problem preliminaries

Given an internal node x in a PQ-tree T, we define its sign, sign(x), as the majority sign of the leaves in T_x, Leaves(x). If the number of negative signed leaves is equal to the number of positive signed leaves, then we abuse notation and consider sign(x) as + as well as −.

Given a node x in a PQ-tree, let S(x) denote the signed string of colors of the nodes in children(x) as they are ordered in the tree (from left to right). For example, consider the PQ-tree shown in Fig. 5a where the character assigned to each internal node is its color, then S(z)=+b+c, S(y)=+a+z+d and S(x)=+y+e+f.
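
The following Python fragment (an illustrative sketch with our own field names) computes sign(x) as the majority sign of the leaves below x and builds the signed color string S(x) of its children; for simplicity it resolves ties to ‘+’, whereas the paper keeps both signs.

```python
# Illustrative sketch: majority sign of an internal node and the signed string S(x).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    color: str                      # unique color; for leaves we reuse the label
    sign: str = '+'                 # meaningful for leaves
    children: List['Node'] = field(default_factory=list)

def leaves(x: Node):
    return [x] if not x.children else [l for c in x.children for l in leaves(c)]

def node_sign(x: Node) -> str:
    """Majority sign of Leaves(x); ties resolved to '+' in this sketch."""
    if not x.children:
        return x.sign
    pos = sum(1 for l in leaves(x) if l.sign == '+')
    return '+' if 2 * pos >= len(leaves(x)) else '-'

def S(x: Node) -> str:
    """Signed string of the colors of children(x), in left-to-right order."""
    return ''.join(node_sign(c) + c.color for c in x.children)

# Mirroring Fig. 5a: z = [b c], y = (a z d), x = (y e f), all leaves positive.
z = Node('z', children=[Node('b'), Node('c')])
y = Node('y', children=[Node('a'), z, Node('d')])
x = Node('x', children=[y, Node('e'), Node('f')])
print(S(z), S(y), S(x))             # +b+c +a+z+d +y+e+f
```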

Fig. 5 a A PQ-tree T. b A PQ-tree T′ that is equivalent to T and is ordered as S

In order to measure the divergence from an ordered PQ-tree to a string, we take into account the actions performed on the PQ-tree to order it as needed. Towards defining the divergence from an (ordered) PQ-tree T to a string S, we first define a penalty for taking an action on an internal node of a PQ-tree, denoted by Δ_violation. The penalty Δ_violation is a combination of several types of penalties. The first type concerns cases where large units “jump” while reordering the children of a P-node x to have the same order as the children of its equivalent node x′. Specifically, we want to penalize according to the sizes of these units: we will not penalize a single leaf that jumps, but only units whose size is larger than 1, and the penalty for these units increases with their sizes. Thus, we consider the following penalty. If the size of a unit that jumps is t, then we penalize this operation by (t − 1)/2. In particular, a leaf does not get penalized. To do so, we build a graph G(x, x′) (defined formally later), whose vertex set is the children of the P-node. Each vertex has a weight relative to its size (defined as the span of the child it represents), as will be defined ahead. Roughly speaking, the graph G has an edge for each pair of children that “changed their order” (with respect to their signs). To be precise, we need the following definition.

Definition 9

(Change Of Signed Order) Let x, x′ be two equivalent P-nodes of two quasi-equivalent PQ-trees T and T′ with parameter d, let S(x) = c_1,…,c_n be the signed string of colors of the nodes in children(x) that were not deleted, as they are ordered in T, let S(x′) = c′_1,…,c′_n be the signed string of colors of the nodes in children(x′), as they are ordered in T′, and let M be the gene mapping of S(x) and S(x′). Given two nodes y (with color c_i, where y′ is the equivalent node of y in T′) and z (with color c_j, where z′ is the equivalent node of z in T′) in children(x), say, j > i, and such that (c_i, c′_k), (c_j, c′_t) ∈ M for some c′_k and c′_t, y and z change their signed order if the following are both false.

  • t > k, sign(y) = sign(y′) and sign(z) = sign(z′).

  • t < k, sign(y) = −sign(y′) and sign(z) = −sign(z′).

We denote by G[x, x′] the graph (described before Definition 9) of two equivalent P-nodes x and x′ of two quasi-equivalent PQ-trees. Formally, it is defined as follows.

Definition 10

(G[x, x′]) Given two equivalent P-nodes x, x′ of two quasi-equivalent PQ-trees with parameter d, G[x, x′] = (V, E, w) is the undirected graph with vertex set V, edge set E and vertex weight function w, which is defined as follows.

  • V = children(x).

  • E = {(v, u) | u and v changed their signed order}.

  • For each v ∈ V, w(v) = (span(v) − 1)/2.

Notice that the graph depends on the colors, which determine the edge set.

After building G[x, x′], we find the minimum weight of a vertex cover of G[x, x′] in order to sum all penalties for the units that jumped while reordering the children of a P-node. Observe that by computing the minimum weight of a vertex cover, we identify a “best” (in terms of penalty) set of nodes that jump.

Definition 11

(Minimum Weighted Vertex Cover) Let G = (V, E, w) be a graph with vertex set V, edge set E and vertex weight function w. The minimum weighted vertex cover of G is the minimum weight among all vertex covers of G.

Definition 12

(Δ_jump^P(x, x′)) Given two equivalent P-nodes x and x′ of two quasi-equivalent PQ-trees with parameter d, and the weight t of a minimum weighted vertex cover of G[x, x′], the jump violation between x and x′, denoted by Δ_jump^P(x, x′), is t.
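
Since the number of children of a P-node is bounded by γ, the minimum weighted vertex cover of G[x, x′] can be found by exhaustive search over the 2^γ vertex subsets; the Python sketch below (ours, for illustration, with weights w(v) = (span(v) − 1)/2 as in Definition 10) does exactly that.

```python
# Brute-force minimum weighted vertex cover over the children of a P-node
# (illustrative sketch; feasible because the number of children is at most gamma).
from itertools import combinations

def min_weight_vertex_cover(spans, edges):
    """spans: list of span(v) per child; edges: list of (i, j) index pairs that
    changed their signed order. Returns the jump violation Delta_jump^P."""
    weights = [(s - 1) / 2 for s in spans]          # w(v) = (span(v) - 1) / 2
    n = len(spans)
    best = float('inf')
    for r in range(n + 1):
        for cover in combinations(range(n), r):
            chosen = set(cover)
            if all(i in chosen or j in chosen for i, j in edges):
                best = min(best, sum(weights[i] for i in chosen))
    return best

# Two children of span 3 and 2 changed their signed order with a leaf (span 1):
print(min_weight_vertex_cover([3, 2, 1], [(0, 2), (1, 2)]))  # 0.0: the leaf covers both edges
```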

Now, we define a violation between two equivalent internal nodes. Towards that, given two equivalent nodes x, x′ of two quasi-equivalent PQ-trees T and T′ (where x is not deleted), let isFlipped(x, x′) be a procedure that returns 1 if x and x′ “flipped”, and 0 otherwise. That is, if x and x′ are leaves, isFlipped(x, x′) = 1 if sign(x) = −sign(x′); otherwise, isFlipped(x, x′) = 0. If x and x′ are internal nodes, isFlipped(x, x′) = 1 if for each child y ∈ children(x) that is not deleted, isFlipped(y, y′) = 1, where y′ is the child of x′ that is the equivalent node of y, and the order of children(x′) in T′ is the reversal of the order of children(x) in T; otherwise, isFlipped(x, x′) = 0.
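
A recursive Python sketch of isFlipped (our own simplification, ignoring deletions, i.e. d_T = 0, and matching equivalent nodes by their unique colors) may help to follow the recursion.

```python
# Illustrative isFlipped(x, x') sketch for the deletion-free case (d_T = 0).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    color: int                      # unique color shared with the equivalent node
    sign: str = '+'                 # meaningful for leaves
    children: List['Node'] = field(default_factory=list)

def find_by_color(root: Node, color: int) -> Optional[Node]:
    """Locate the equivalent node in T' via its color."""
    if root.color == color:
        return root
    for c in root.children:
        hit = find_by_color(c, color)
        if hit is not None:
            return hit
    return None

def is_flipped(x: Node, t_prime: Node) -> bool:
    xp = find_by_color(t_prime, x.color)   # equivalent node x' (assumed to exist here)
    if not x.children:                     # leaves: signs must be opposite
        return x.sign != xp.sign
    # Internal node: children order must be exactly reversed in T',
    # and every child must itself be flipped.
    if [c.color for c in xp.children] != [c.color for c in reversed(x.children)]:
        return False
    return all(is_flipped(c, t_prime) for c in x.children)

# Example: the Q-node [+a +b] flipped into [-b -a].
q  = Node(0, children=[Node(1, '+'), Node(2, '+')])
qp = Node(0, children=[Node(2, '-'), Node(1, '-')])
print(is_flipped(q, qp))  # True
```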

Given an internal node x, denote by S̃_d(x) the set of signed strings where S̃ ∈ S̃_d(x) if S̃ is obtained from S(x) by deleting up to d leaves from T_x.

Definition 13

(Δ_violation^d(x, x′)) Given two equivalent internal nodes x and x′ of two quasi-equivalent PQ-trees with parameter d, and input numbers δ_ord^Q and δ_flip^Q, the violation between x and x′, denoted by Δ_violation^d(x, x′), is defined as follows.

  • If x is a P-node, Δ_violation^P(x, x′) = min_{S̃ ∈ S̃_d(x)} d_SBP(S̃, S(x′)) + Δ_jump^P(x, x′).

  • If x is a Q-node, Δ_violation^Q(x, x′) = δ_ord^Q · min_{S̃ ∈ S̃_d(x)} d_SBP(S̃, S(x′)) + isFlipped(x, x′) · δ_flip^Q.

When we use the notation Δ_violation^d, if d is clear from the context, we will use the notation Δ_violation instead. In Definition 13, one of the penalties for reordering the children of a Q-node is the flip penalty (δ_flip^Q), and it is applied if the Q-node has flipped (as indicated by isFlipped). When considering this event, notice the case where a Q-node x flipped and all of the non-deleted children in children(x) flipped as well (including the P-nodes). In this case, we penalise each of the non-deleted children in children(x), and x as well, for flipping, but, in fact, the event is only the flip of x. Thus, we unnecessarily penalise the non-deleted children in children(x). Therefore, we employ a flip correction procedure as defined next.

Definition 14

(FlipCorrection(x, x′)) Given two equivalent internal nodes x and x′ of two quasi-equivalent PQ-trees with parameter d, the flip correction between them, denoted by FlipCorrection(x, x′), is 0 if not all of children(x) were flipped or deleted, and otherwise it is the sum of the flip penalties of all of the Q-nodes and leaves in children(x) that were not deleted.

Finally, after defining the violation between equivalent nodes, we simply sum up the violations (and corrections) of all internal nodes to define the divergence from an ordered and labeled PQ-tree to a signed string as follows.

Definition 15

(Diverge_{d_T,d_S}(T, S)) Let T be an (ordered and leaf-labeled) PQ-tree, and S be a signed string. Let O_T be the set of all PQ-trees T′ that are quasi-equivalent to T with parameter d_T (i.e. T′ ≡_{d_T} T) such that T′ is ordered as a subsequence of S obtained by deleting up to d_S characters from S. For any T′ ∈ O_T, let M_{T′} be the unique mapping that maps each node x′ in T′ to its equivalent node x in T. Then,

Diverge_{d_T,d_S}(T, S) = min_{T′ ∈ O_T} Σ_{(x, x′) ∈ M_{T′}} (Δ_violation(x, x′) − FlipCorrection(x, x′)).

In case d_T = 0 and d_S = 0, we will use the notation Diverge(T, S) instead of Diverge_{d_T,d_S}(T, S) for simplicity.

For example, consider the ordered PQ-tree T in Fig. 5, which is ordered as +a+b+c+d+e+f, and the signed string S = +f−c−b−a−d+e. To order T as S, we flip the Q-node z, and for this we pay Δ_violation^Q(z, z′) = δ_flip^Q. Now, consider the P-node y. We swap between a and z, so d_SBP(S(y), S(y′)) = 1, and hence Δ_violation^P(y, y′) = 1 (notice that there is no jump of a large unit, because the leaf a can jump). Finally, for the P-node x, Δ_jump^P(x, x′) = 0 (we can order the children by moving the leaf f to the left-most position). In addition, d_SBP(S(x), S(x′)) = 1. Thus, for x we pay Δ_violation^P(x, x′) = 1. In total, Diverge(T, S) = Δ_violation^Q(z, z′) + Δ_violation^P(y, y′) + Δ_violation^P(x, x′) = δ_flip^Q + 2.

Problem definitions

ConstrainedTreeToStringDivergence

The input to CTTSD consists of two signed permutations of length n, S_1 = σ_1⋯σ_n ∈ Σ^n, |Σ| = n, such that σ_i ≠ σ_j for all 1 ≤ i < j ≤ n, and S_2 = λ_1⋯λ_n ∈ Σ^n such that λ_i ≠ λ_j for all 1 ≤ i < j ≤ n; a PQ-tree T ordered as S_1 with m_p P-nodes and m_q Q-nodes; and two numbers δ_ord^Q and δ_flip^Q. We aim to perform actions on T to reorder it as S_2. That is, we reorder T as T′ so that F(T′) = S_2. If this is not possible, we answer “NO”. Else, we return the divergence from T to S_2. Concretely, CTTSD is defined as follows.

Definition 16

(ConstrainedTreeToStringDivergence) Given two signed permutations S_1 and S_2 of the same length and the same characters, two numbers δ_ord^Q and δ_flip^Q, and a PQ-tree T ordered as S_1, return Diverge(T, S_2) or answer “NO” if S_2 cannot be derived from T.

In “Constrained TreeToString Divergence: algorithm” section, we propose an algorithm to solve this problem.

TreeToStringDivergence

Generalizing CTTSD, in TTSD we do not assume that the input strings are permutations, and we allow deletions. The input to TTSD consists of two signed strings, S_1 = σ_1⋯σ_m ∈ Σ_T^m and S_2 = λ_1⋯λ_n ∈ Σ_S^n; a PQ-tree T ordered as S_1 with m_p P-nodes and m_q Q-nodes; d_T ∈ ℕ ∪ {0}, which specifies the number of allowed deletions from T; d_S ∈ ℕ ∪ {0}, which specifies the number of allowed deletions from S_2; and two numbers δ_ord^Q, δ_flip^Q indicating the penalty of the events of changing the order and flipping, respectively, of a Q-node. In TTSD we perform actions on T to reorder it as a subsequence S_2′ of S_2, allowing d_T deletions from T, and so that S_2′ is obtained from S_2 by using up to d_S deletions from S_2. That is, after reordering T as T′ (and performing up to d_T deletions), F(T′) = S_2′. If this is not possible, we answer “NO”. Else, we return the divergence from T to S_2 corresponding to d_T and d_S. Concretely, TTSD is defined as follows.

Definition 17

(TreeToStringDivergence) Given two signed strings S_1 and S_2, a PQ-tree T ordered as S_1, and parameters d_T, d_S, δ_ord^Q, δ_flip^Q, return Diverge_{d_T,d_S}(T, S_2) or “NO” if S_2 cannot be derived from T with respect to d_T and d_S.

Parameterized complexity

Let Π be an NP-hard problem. In the framework of Parameterized Complexity, each instance of Π is associated with a parameter k. Here, the goal is to confine the combinatorial explosion in the running time of an algorithm for Π to depend only on k. Formally, we say that Π is fixed-parameter tractable (FPT) if any instance (I, k) of Π is solvable in time f(k)·|I|^O(1), where f is an arbitrary function of k.

Algorithms preliminaries

Given a node x and the numbers of deletions k_T and k_S of a derivation, the length of the derived string S can be calculated using the length function given in Definition 18 below.

Definition 18

(The Length Function) L(x, k_T, k_S) = span(x) − k_T + k_S.

Using the length function and the start point s of the derivation, the end-point of the derivation can be calculated using the end-point function given in Definition 19 below.

Definition 19

(The End-Point Function) E(x, s, k_T, k_S) = s − 1 + L(x, k_T, k_S).

Let A be a DP table used in an algorithm. Addressing A with some of its indices given as dots refers to the subtable of A that is comprised of all entries of A indexed by the indices that are specified (i.e., all indices not marked by dots). For example, A[x, ·, ·, ·, ·] refers to the subtable of A that is comprised of all entries of A whose first index is x.

In the algorithms, for a given node x in a PQ-tree T, the number of positive signed leaves in T_x, denoted pos, a possible sign s ∈ {+, −}, and the number of deletions from T_x, k_T (if not implied by context, then k_T = 0), we say that pos is consistent with s (or s is consistent with pos) if s = + and pos ≥ (span(x) − k_T)/2, or if s = − and pos ≤ (span(x) − k_T)/2. Moreover, in some situations, for a given set of nodes C, we will sometimes use a set N ⊆ C to describe the nodes in C that have a negative sign: for a given node x ∈ C, if x ∈ N, then sign(x) = −; otherwise, sign(x) = +. In these situations, we say that pos is consistent with N (or N is consistent with pos) if x ∉ N and pos ≥ (span(x) − k_T)/2, or if x ∈ N and pos ≤ (span(x) − k_T)/2.
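
The two bookkeeping functions of Definitions 18 and 19 and the sign-consistency test are straightforward; the following Python helpers (our own sketch, with span(x) passed in as an integer) restate them.

```python
# Bookkeeping helpers used by the dynamic programming algorithms (illustrative sketch).

def length(span_x: int, k_t: int, k_s: int) -> int:
    """Definition 18: length of the derived substring, L(x, k_T, k_S)."""
    return span_x - k_t + k_s

def end_point(span_x: int, s: int, k_t: int, k_s: int) -> int:
    """Definition 19: end-point of a derivation starting at position s."""
    return s - 1 + length(span_x, k_t, k_s)

def consistent(pos: int, sign: str, span_x: int, k_t: int = 0) -> bool:
    """pos (number of positive leaves in T_x) is consistent with a candidate sign."""
    half = (span_x - k_t) / 2
    return pos >= half if sign == '+' else pos <= half

# A node spanning 5 leaves, 1 tree deletion and 2 string deletions within its interval:
print(length(5, 1, 2), end_point(5, 3, 1, 2))        # 6 8
print(consistent(4, '+', 5), consistent(1, '-', 5))  # True True
```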

ConstrainedTreeToStringDivergence: algorithm

In this section, we present a greedy algorithm to solve CTTSD. Our algorithm consists of three components: the main algorithm, and two procedures called P-Mapping and Q-Mapping. We first present and explain the main algorithm and the procedures. Afterwards, we demonstrate the execution of the algorithm and analyze its running time.

The main algorithm

Recall that the input to CTTSD consists of two signed permutations S_1 and S_2 of length n, two numbers δ_ord^Q and δ_flip^Q, and a PQ-tree T ordered as S_1. If S_2 can be derived from T, then the output of the algorithm is the divergence from T to S_2, Diverge(T, S_2). Otherwise, the output is “NO” (specifically, the algorithm returns ∞).

The main algorithm (whose pseudo-code is given in Algorithm 1) constructs a 2-dimensional DP table A of size m × n, where m = n + m_p + m_q is the number of nodes in T. For each node x in T and index ℓ, A has an entry A[x, ℓ]. In the algorithm, for each node x, we keep two indices ℓ and r (denoted by x.ℓ and x.r, respectively) such that S_2[x.ℓ : x.r] is derived from T_x. Then, the purpose of an entry of the DP table, A[x, x.ℓ], is to hold the divergence from the subtree T_x to the subsequence S_2[x.ℓ : x.r] of S_2. That is,

A[x, x.ℓ] = Diverge(T_x, S_2[x.ℓ : x.r]).

If no subsequence of S_2 starting at position ℓ can be derived from T_x, then A[x, ℓ] = ∞.

Some entries of the DP table define illegal derivations. Such are derivations where the length of the frontier of the subtree is larger than the length of the longest substring starting at the specified index ℓ. These entries are called invalid entries and their value is defined as ∞ throughout the algorithm. Formally, an entry A[x, ℓ] is invalid if span(x) > n − ℓ + 1.

The algorithm first initializes the entries of A that are meant to hold divergences of derivations of every possible subsequence of S_2 (a single character) from the leaves of T. Specifically, for a leaf x, if it did not flip, we put 0 in the corresponding entry. If x did flip, we put δ_flip^Q. After that, we update x.ℓ and x.r. As described, in the initialization, if label(x) = S_2[ℓ], then S_2[ℓ] (i.e., S_2[ℓ:ℓ]) is derived from x; in this case we put 0 or δ_flip^Q in A[x, ℓ], and we put ℓ in both x.ℓ and x.r.

After the initialization, all other entries of A are filled as follows. Go over the internal nodes of T in postorder. For every internal node x, go in descending order over every index 1 ≤ ℓ ≤ n that can be a start index for the subsequence of S_2 derived from T_x (in case of an invalid entry, we continue to the next iteration). For every x and ℓ, use the algorithm for P-mapping or Q-mapping according to the type of x. Both algorithms receive the following input: the node x, S_2, start and end indices ℓ, e of a subsequence of S_2, and the collection of derivations of the children of x (entries of A that have already been computed and hold the divergence of a derivation). In addition, the Q-Mapping algorithm receives as input the penalty parameters δ_ord^Q and δ_flip^Q. After being called, both algorithms return the divergence from T_x to S_2[ℓ:e], that is, Diverge(T_x, S_2[ℓ:e]).

[Algorithm 1: pseudo-code figure]

Finally, having filled the DP table, A[root_T, 1] holds the divergence from T to S_2 (Diverge(T, S_2)), and so we return A[root_T, 1].
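
The control flow of the main algorithm can be sketched in Python as follows (our own illustration, not the paper's code; the P-mapping and Q-mapping procedures are passed in as callables, and the node fields used here are assumptions).

```python
# Skeleton of the main CTTSD dynamic program (illustrative sketch; field names
# such as kind, label, sign, span, color and children are our own assumptions).
import math

def cttsd_main(root, postorder_nodes, S2, p_mapping, q_mapping, delta_flip_q):
    """S2 is a list of (sign, label) pairs; p_mapping/q_mapping are callables
    implementing the P- and Q-mapping procedures and returning a divergence or math.inf."""
    n = len(S2)
    A = {}                                          # A[(node color, l)] -> divergence
    for x in postorder_nodes:                       # leaves are visited before their parents
        for l in range(n, 0, -1):
            if x.span > n - l + 1:                  # invalid entry
                A[(x.color, l)] = math.inf
                continue
            if x.kind == 'leaf':
                sign, label = S2[l - 1]
                if label != x.label:
                    A[(x.color, l)] = math.inf
                else:
                    A[(x.color, l)] = 0 if sign == x.sign else delta_flip_q
                    x.l, x.r = l, l                 # record the derived interval
            else:
                e = l + x.span - 1                  # end-point; no deletions in CTTSD
                D = {c.color: A.get((c.color, getattr(c, 'l', -1)), math.inf)
                     for c in x.children}
                mapper = p_mapping if x.kind == 'P' else q_mapping
                value = mapper(x, S2, l, e, D)
                A[(x.color, l)] = value
                if value < math.inf:
                    x.l, x.r = l, e
    return A.get((root.color, 1), math.inf)
```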

P-node mapping: the algorithm

Recall that the input consists of a P-node x, a string S_2, two indices ℓ and e, and a set of derivations D. Notice that each value in D is the divergence from the subtree rooted in a child c of x to S_2[c.ℓ : c.e], where S_2[c.ℓ : c.e] is a subsequence of S_2[ℓ:e] that is derived from T_c. These values, A[c, c.ℓ] for each c ∈ children(x), were calculated in earlier iterations and saved in D. If S_2[ℓ:e] can be derived from T_x, then the output of the algorithm is the divergence from T_x to S_2[ℓ:e], Diverge(T_x, S_2[ℓ:e]). Otherwise, the output is “NO” (specifically, the algorithm returns ∞). Denote by T′_x the quasi-equivalent PQ-tree of T_x ordered as S_2[ℓ:e]. Note that if T′_x exists, then it is unique (because we deal with permutations and forbid deletions).

The algorithm (whose pseudo-code is given in Algorithm 2) first checks if the interval [ℓ, e] can be “completed” by all of the intervals defined by the indices of the children of x. Specifically, we check if there is any order of the children of x, say, ordered as c_1,…,c_{|children(x)|} ∈ children(x), such that c_1.ℓ = ℓ, c_{|children(x)|}.e = e, and for each 1 ≤ j ≤ |children(x)| − 1, c_j.e + 1 = c_{j+1}.ℓ. If there is no such order, then the interval [ℓ, e] cannot be completed, and so S_2[ℓ:e] cannot be derived from T_x. In this case, we return ∞. Otherwise, S_2[ℓ:e] can be derived from T_x by reordering the children of x according to the unique order that completes the interval [ℓ, e]. Second, we sum up all of the values in D (and store the sum in the variable childrenDist). Next, we calculate the violation between x and its equivalent node x′ in T′_x, Δ_violation^P(x, x′), according to Definition 13 (and store the result in the variable violation). Finally, we return the sum of childrenDist and violation, which is the divergence from T_x to S_2[ℓ:e], Diverge(T_x, S_2[ℓ:e]) (according to Definition 15).

[Algorithm 2: pseudo-code figure]
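
The “interval completion” test at the heart of the P-mapping procedure simply asks whether the children's intervals tile [ℓ, e] exactly, in some order; a compact Python sketch (ours) of that test is:

```python
# Check whether the children's intervals tile [l, e] exactly, in some order
# (the first step of the P-mapping procedure; illustrative sketch).
def completes_interval(child_intervals, l, e):
    """child_intervals: list of (c.l, c.e) pairs, one per child of the P-node."""
    ordered = sorted(child_intervals)                 # children may be reordered freely
    if not ordered or ordered[0][0] != l or ordered[-1][1] != e:
        return False
    return all(prev_e + 1 == next_l
               for (_, prev_e), (next_l, _) in zip(ordered, ordered[1:]))

print(completes_interval([(4, 5), (2, 3), (6, 6)], 2, 6))  # True
print(completes_interval([(2, 3), (5, 6)], 2, 6))          # False (position 4 uncovered)
```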

Q-node mapping: the algorithm

Recall that the input consists of a Q-node x, a string S_2, two indices ℓ and e, a set of derivations D, and the penalty parameters δ_ord^Q and δ_flip^Q. Notice that each value in D is the divergence from the subtree rooted in a child c of x to S_2[c.ℓ : c.e], where S_2[c.ℓ : c.e] is a subsequence of S_2[ℓ:e] that is derived from T_c. These values, A[c, c.ℓ] for each c ∈ children(x), were calculated in earlier iterations and saved in D. If S_2[ℓ:e] can be derived from T_x, then the output of the algorithm is the divergence from T_x to S_2[ℓ:e], Diverge(T_x, S_2[ℓ:e]). Otherwise, the output is “NO” (specifically, the algorithm returns ∞). Denote by T′_x the (unique) quasi-equivalent PQ-tree of T_x ordered as S_2[ℓ:e].

The algorithm (whose pseudo-code is given in Algorithm 3) first checks if the interval [ℓ, e] can be “completed consecutively” by all of the intervals defined by the indices of the children of x. Specifically, we check if there is a consecutive order of the children of x, say, ordered as c_1,…,c_{|children(x)|} ∈ children(x), such that c_1.ℓ = ℓ, c_{|children(x)|}.e = e, and for each 1 ≤ j ≤ |children(x)| − 1, c_j.e + 1 = c_{j+1}.ℓ. As opposed to a P-node, here the order of the children completing the interval [ℓ, e] must be consecutive with respect to their indices (the same order as the children of x in T, or the reverse order). If there is no such order, then the interval [ℓ, e] cannot be completed consecutively, and so S_2[ℓ:e] cannot be derived from T_x. In this case, we return ∞. Otherwise, S_2[ℓ:e] can be derived from T_x by keeping the order of the children of x, or flipping it. Second, we sum up all of the values in D (and store the sum in the variable childrenDist). Next, we calculate the violation between x and its equivalent node x′ in T′_x, Δ_violation^Q(x, x′), according to Definition 13 (and store the result in the variable violation). Afterwards, we calculate the flip correction according to Definition 14 (and store the result in the variable childrenFlipCorrection). Finally, we return the sum of childrenDist and violation minus childrenFlipCorrection, which is the divergence from T_x to S_2[ℓ:e], Diverge(T_x, S_2[ℓ:e]) (according to Definition 15).

[Algorithm 3: pseudo-code figure]
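
For a Q-node the same tiling test is restricted to the original child order or its reversal; a short Python sketch (ours) illustrates it:

```python
# Consecutive-completion test for a Q-node: the children's intervals must tile
# [l, e] in their original left-to-right order or in the reversed order (sketch).
def completes_consecutively(child_intervals, l, e):
    """child_intervals: list of (c.l, c.e) pairs in the children's order in T."""
    def tiles(seq):
        return (seq[0][0] == l and seq[-1][1] == e and
                all(a_e + 1 == b_l for (_, a_e), (b_l, _) in zip(seq, seq[1:])))
    return tiles(child_intervals) or tiles(child_intervals[::-1])

print(completes_consecutively([(2, 3), (4, 6)], 2, 6))  # True  (original order)
print(completes_consecutively([(4, 6), (2, 3)], 2, 6))  # True  (reversed order)
print(completes_consecutively([(2, 3), (5, 6)], 2, 6))  # False (gap at position 4)
```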

Example

Consider the following input: S_1 = +a+b+c+d+e+f, S_2 = +f−c−b−a−d−e, a PQ-tree T ordered as S_1, δ_ord^Q = 3 and δ_flip^Q = 3. We iterate through the nodes of the tree in postorder, thus we initialize the leaves before their parents. For each leaf x ∈ Leaves(root_T), if label(x) = S_2[ℓ], then A[x, ℓ] = 0 if x did not flip and δ_flip^Q if it did; otherwise, A[x, ℓ] = ∞.

Figure 6 shows the PQ-tree T and its quasi-equivalent PQ-tree T′ ordered as S2, after the initialization of the leaves only. Notice that the actual order of initialization is postorder; for simplicity, we show the tree at the point where only the leaves have been initialized. In addition, in the figures, the equivalent nodes of T and T′ are shown as the same nodes; in the explanations, in order to distinguish between them, the node of T′ corresponding to a node x∈T is denoted by x′. The pair of numbers shown in the figure near a node represents its ℓ and r values. In addition, the sign + or − near a node represents its sign. For example, for a∈T, a.ℓ=a.r=4 and sign(a)=+. For each internal node, the character assigned to it represents its color.

Fig. 6 a PQ-tree T. b PQ-tree T′, which is quasi-equivalent to T and ordered as S2

First, consider the iteration where A[z,2] is calculated (the values for all other entries with node z are ∞). The intervals of the children of z, [3, 3] and [2, 2], complete the interval [2,e] consecutively, where e=2+span(z)−1=3. Thus, the subsequence S2[2:3] is derived from Tz, and in order to generate it we need to flip the order of children(z). For this the penalty of flipping, δflipQ, is applied, and so ΔviolationQ(z,z′)=δflipQ=3. childrenDist=A[b,3]+A[c,2]=6 because A[b,3]=A[c,2]=3. FlipCorrection(z,z′)=6 because both b and c have been flipped. Therefore, A[z,2]=childrenDist+violation−childrenFlipCorrection=3. Now, we update z’s indices, z.ℓ=2 and z.r=3. Figure 7 shows the PQ-tree T and its quasi-equivalent PQ-tree T′ after calculating A[z,2].

Fig. 7 a PQ-tree T. b PQ-tree T′, which is quasi-equivalent to T and ordered as S2

Next, consider the iteration where A[y,2] is calculated (the values for all other entries with node y are ∞). Recall that Fig. 7 shows T and T′ after calculating the entries of the children of the P-node y. The intervals of the children complete the interval [2,e], where e=2+span(y)−1=5, so we can continue with the iteration. childrenDist=A[a,4]+A[z,2]+A[d,5]=9. dSBP(S(y),S(y′))=1, where the pair (z,d) is the only signed break-point (note that S(y)=+a+z+d, S(y′)=−z−a−d). ΔjumpP(y,y′)=0 because to reorder the children of y as S2[2:5] we can move z only (and its size is 1). Therefore, ΔviolationP(y,y′)=dSBP(S(y),S(y′))+ΔjumpP(y,y′)=1. Then, A[y,2]=10 and we can update y’s indices, y.ℓ=2 and y.r=5. Figure 8 shows the PQ-tree T and its quasi-equivalent PQ-tree T′ after calculating A[y,2].

Fig. 8 a PQ-tree T. b PQ-tree T′, which is quasi-equivalent to T and ordered as S2

Finally, consider the iteration where A[x,1] is calculated (the values for all other entries with node x are ∞). Figure 8 shows T and T′ after calculating the entries of the children of the P-node x. The intervals of the children complete the interval [1,e], where e=1+span(x)−1=6, so we can continue with the iteration. childrenDist=A[y,2]+A[e,6]+A[f,1]=10. dSBP(S(x),S(x′))=2, where the pairs (y,e) and (e,f) are the signed break-points (note that S(x)=+y+e+f, S(x′)=+f−y+e). ΔjumpP(x,x′)=0 because to reorder the children of x as S2 we can move f only (and its size is 1). Therefore, ΔviolationP(x,x′)=dSBP(S(x),S(x′))+ΔjumpP(x,x′)=2. Then, A[x,1]=10+2=12 and the algorithm returns 12. Figure 9 shows the PQ-tree T and its quasi-equivalent PQ-tree T′ after calculating A[x,1].

Fig. 9 a PQ-tree T. b PQ-tree T′, which is quasi-equivalent to T and ordered as S2

Complexity analysis

In this section we analyse the time and space complexities of Algorithm 1, which solves CTTSD. First, we analyse Algorithms 2 and 3, which are used as procedures in Algorithm 1. For a given PQ-tree, we denote by γ the maximum number of children of an internal node.

Lemma 5.1

The P-Mapping algorithm, Algorithm 2, takes O(1.381γγ2) time and O(γ2) space.

Proof

The most space consuming part of the algorithm, besides the computation of a vertex cover, is to store x′ (which is equivalent to x and ordered as a specific string). We simply save the children of x in a specific order. Thus, the space is O(γ).

The algorithm first checks if the interval [ℓ,e] can be “completed” by all of the intervals defined by the indices of the children of x. We can check this in O(γ2) time (naively). Then, we sum up the values in D. Notice that |D|=|children(x)|=O(γ), thus this step takes O(γ) time.

After that, we calculate ΔviolationP(x,x′), whose most time consuming part is the jump violation, ΔjumpP(x,x′), which requires finding a minimum weighted vertex cover of the graph G[x,x′]. In order to find such a vertex cover, we can use, for example, the algorithm in [44], which takes O(1.381γγ2) time and O(γ2) space. Thus, this step takes O(1.381γγ2) time and O(γ2) space.

All other steps in the algorithm are basic operations and thus they take O(1) time. Hence, the algorithm takes O(1.381γγ2) time and O(γ2) space.

Lemma 5.2

The Q-Mapping algorithm, Algorithm 3, takes O(γ2) time and O(γ) space.

Proof

The most space consuming part of the algorithm is to store x′ (which is equivalent to x and ordered as a specific string). We simply save the children of x in a specific order. Thus, the space of the algorithm is O(γ).

The algorithm first checks if the interval [ℓ,e] can be “completed consecutively” by all of the intervals defined by the indices of the children of x. We can check this in O(γ) time. Then, we sum up the values in D. Notice that |D|=|children(x)|=O(γ), thus this step takes O(γ) time. After that, we calculate ΔviolationQ(x,x′) and FlipCorrection(x,x′), which take O(γ2) time (naively). All other steps in the algorithm are basic operations and thus they take O(1) time. Hence, the algorithm takes O(γ2) time.

Lemma 5.3

The main algorithm, Algorithm 1, takes O(nγ2·(mp·1.381γ+mq)) time and O(n2) space.

Proof

The number of leaves in the PQ-tree T is n, hence there are O(n) nodes in the tree, thus the size of the first dimension of the DP table, A, is O(n). The size of the second dimension (1≤ℓ≤n) is also n. Thus, the DP table A is of size O(n2).

In the initialization step, all calculations are basic and take O(1) time. The P-Mapping algorithm is called for every P-node in T and every possible start index i, so the P-Mapping algorithm is called O(nmp) times. Similarly, the Q-Mapping algorithm is called O(nmq) times. Thus, it takes O(n·(mp·Time(P-Mapping)+mq·Time(Q-Mapping))) time to fill the DP table. In the final stage of the algorithm, we return A[rootT,1], which takes O(1) time.

From Lemma 5.1, the P-Mapping algorithm takes O(1.381γγ2) time and O(γ2) space, and from Lemma 5.2, the Q-Mapping algorithm takes O(γ2) time and O(γ) space. Thus, in total, our algorithm runs in O(1)+O(n·(mp·(1.381γγ2)+mq·(γ2)))=O(nγ2·(mp·1.381γ+mq)) time. Adding the space required for the P-Mapping and Q-Mapping algorithms to the space required for the main DP table results in a total space complexity of O(γ2)+O(γ)+O(n2)=O(n2).

Observe that if ΔjumpP is set to 0, then the computation of a vertex cover is not required. Hence, by the analysis above, we obtain the following time and space complexities.

Lemma 5.4

If ΔjumpP is set to 0, then the main algorithm, Algorithm 1, takes O(nmγ2) time and O(n2) space.

TreeToStringDivergence: algorithm

In this section, we develop a dynamic programming (DP) algorithm to solve TTSD. Our algorithm receives as input an instance of TTSD, (S1,S2,T,dT,dS,δordQ,δflipQ). The output of the algorithm is the minimum divergence from T to a subsequence of S2 with up to dT deletions from T and up to dS deletions from the subsequence (or “NO”).

Brief overview: On a high level, our algorithm consists of three components: the main algorithm, and two other algorithms that are used as procedures by the main algorithm. Apart from an initialization phase, the crux of the main algorithm is a loop that traverses the given PQ-tree, T. For each internal node x, it calls one of the two other algorithms: P-Mapping or Q-Mapping. These algorithms return the divergence from the subtree of T rooted in x, Tx, to subsequences of S2, based on the type of x (P-node or Q-node). Then, these divergences are stored in the DP table.

In the following sections, we describe the main algorithm (“The main algorithm” section), the P-Mapping algorithm (“P-node mapping: the algorithm” section) and the Q-Mapping algorithm (“Q-node mapping: the algorithm” section).

The main algorithm

The algorithm (whose pseudocode is given in Algorithm 4) constructs a 5-dimensional DP table A of size m′×m×n×(dT+1)×(dS+1), where m′=m+mp+mq is the number of nodes in T. The purpose of an entry of the DP table, A[x,pos,i,kT,kS], is to hold the divergence from the subtree Tx to a subsequence S′ of S2 starting at index i with kT deletions from Tx and kS deletions from S′, where exactly pos leaves of Tx have a positive sign. If S′ cannot be derived from Tx, A[x,pos,i,kT,kS]=∞.

Some entries of the DP table define illegal derivations, namely, derivations for which the number of deletions is inconsistent with the start index i, the derived node and S2. For example, such are derivations that have more deletions from the string than there are characters in the derived string. These entries are called invalid entries, and their value is defined as ∞ throughout the algorithm. Formally, an entry A[x,pos,i,kT,kS] is invalid if one of the following is true: pos>span(x)−kT, kT>span(x), kS>L(x,kT,kS), or E(x,i,kT,kS)>n.

Let x be a leaf, S be a signed string, i be an index, kS be the number of deletions from S and pos∈{0,1}. Then, we define Ix,S,i,kS,pos={j : i≤j≤i+kS, label(x)=S[j] and pos is consistent with sign(S[j])}. Intuitively, each j∈Ix,S,i,kS,pos corresponds to a possible alignment of x to S[j], therefore the label of x must match S[j] and pos must be consistent with sign(S[j]).

The algorithm first initializes the entries of A that are meant to hold scores of derivations of the leaves of T to every possible subsequence of S2 using the following rule (a sketch of this initialization is given after the list). For every 0≤kS≤dS, every leaf x∈Leaves(T) and each possible sign of x (pos∈{0,1}), do:

  1. A[x,pos,i,1,kS]=ρdelT+kS·ρdelS (if dT>0).

  2. A[x,pos,i,0,kS]=∞ if there is no derivation from x to a character in S2[i:i+kS].

    Otherwise:

    1. A[x,pos,i,0,kS]=0 if pos is consistent with sign(x) (no flipping).
    2. A[x,pos,i,0,kS]=δflipQ if pos is not consistent with sign(x) (flipping).
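The following is a minimal, illustrative Python sketch of this leaf initialization (not the paper's code); the table A is modelled as a dictionary keyed by (leaf, pos, i, kT, kS), leaves are assumed to expose label and sign attributes, and S2 is assumed to be a 0-indexed list of (label, sign) pairs.

```python
import math

def init_leaf_entries(A, x, S2, i, d_t, d_s, rho_del_t, rho_del_s, delta_flip_q):
    """Initialize the entries of A for one leaf x and one start index i,
    following the two rules above (a sketch under the stated assumptions)."""
    for k_s in range(d_s + 1):
        for pos in (0, 1):
            # Rule 1: the leaf itself is deleted (one deletion from the tree).
            if d_t > 0:
                A[(x, pos, i, 1, k_s)] = rho_del_t + k_s * rho_del_s
            # Rule 2: the leaf is mapped to some character of S2[i : i + kS];
            # this corresponds to the set I_{x,S,i,kS,pos} defined above.
            candidates = [
                j for j in range(i, min(i + k_s, len(S2) - 1) + 1)
                if S2[j][0] == x.label and (S2[j][1] == '+') == (pos == 1)
            ]
            if not candidates:
                A[(x, pos, i, 0, k_s)] = math.inf
            elif (pos == 1) == (x.sign == '+'):
                A[(x, pos, i, 0, k_s)] = 0.0            # no flip needed
            else:
                A[(x, pos, i, 0, k_s)] = delta_flip_q   # the leaf is flipped
```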

After the initialization, all other entries of A are filled as follows. Go over the internal nodes of T in postorder. For every internal node x, go in ascending order over every index i that can be a start index for a subsequence of S2 derived from Tx (the possible values of i are explained in the next paragraph). For every x and i, use the algorithm for P-mapping or Q-mapping according to the type of x. Both algorithms receive the following input: a subsequence S′ of S2, the node x, its children x1,…,xγ, the collection of all possible derivations of the children (denoted by D), which have already been computed and stored in A (as will be explained ahead), and the deletion arguments dT,dS. Q-Mapping also receives the penalty arguments δordQ and δflipQ as input. Intuitively, the subsequence S′ is the longest subsequence of S2 starting at index i that can be derived from Tx given dT and dS. After being called, both algorithms return a set of divergences of derivations of Tx to a prefix S2[i:e] of S′. The set holds the divergences of derivations for every E(x,i,dT,0)≤e≤E(x,i,0,dS) and for every 0≤kT≤dT, 0≤kS≤dS.

We now explain the possible values of i and the definition of S′ more formally. To this end, recall the length function given in Definition 18, L(x,kT,kS)=span(x)−kT+kS. Thus, on the one hand, a subsequence of maximum length is obtained when there are no deletions from the tree and dS deletions from the string. Hence, S′=S2[i:E(x,i,0,dS)]. On the other hand, a shortest subsequence is obtained when there are dT deletions from the tree and none from the string. Then, the length of the subsequence is L(x,dT,0)=span(x)−dT. Hence, the index i runs between 1 and n−(span(x)−dT)+1.
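As a small illustration of these two formulas, the following Python sketch (names are ours) computes the length function of Definition 18 and the range of admissible start indices.

```python
def subseq_length(span_x, k_t, k_s):
    """Length function of Definition 18: deletions from the tree shorten the
    derived subsequence, deletions from the string lengthen it."""
    return span_x - k_t + k_s

def start_indices(n, span_x, d_t):
    """Admissible start indices i for a subsequence of S2 derived from Tx,
    i.e. 1 <= i <= n - (span(x) - dT) + 1 (1-based, as in the text)."""
    return range(1, n - (span_x - d_t) + 2)
```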

We now turn to address the aforementioned input collection D in more detail. Formally, it contains derivations of every child x′ of x to every subsequence of S′ with up to dT and dS deletions from the tree and string, respectively. It is obtained from the entries A[x′,pos,i′,kT,kS] (where each entry yields one derivation) for all kT and kS, all i′ between i and the end index of S′, i.e., i≤i′≤E(x,i,0,dS), and all possible pos (0≤pos≤span(x′)−kT).

In the final stage of the main algorithm, when the DP table is full, the score of a best derivation is the minimum of {A[rootT,pos,i,kT,kS] : kT≤dT, kS≤dS, 1≤i≤n−(span(rootT)−kT)+1, 0≤pos≤span(rootT)−kT}.
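Putting the pieces together, the skeleton below sketches the main loop in Python; it is only a schematic view of Algorithm 4 under our own naming (the mapper callables stand in for the P-Mapping and Q-Mapping procedures and are assumed to return a dictionary from (e, kT, kS, pos) to a divergence for node x and start index i; the construction of D from A is omitted).

```python
import math

def fill_main_table(A, nodes_postorder, S2, n, d_t, d_s, p_mapping, q_mapping):
    """Schematic main loop: traverse T in postorder, call the appropriate
    mapping procedure for each internal node and start index, store the
    returned divergences in A, and finally take the minimum over the root's
    entries (a sketch, not the paper's pseudocode)."""
    for x in nodes_postorder:                  # leaves were initialized before
        if x.is_leaf:
            continue
        for i in range(1, n - (x.span - d_t) + 2):
            mapper = p_mapping if x.kind == 'P' else q_mapping
            for (e, k_t, k_s, pos), div in mapper(x, S2, i).items():
                A[(x, pos, i, k_t, k_s)] = div
    root = nodes_postorder[-1]
    return min(
        (A.get((root, pos, i, k_t, k_s), math.inf)
         for k_t in range(d_t + 1)
         for k_s in range(d_s + 1)
         for i in range(1, n - (root.span - k_t) + 2)
         for pos in range(root.span - k_t + 1)),
        default=math.inf,
    )
```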

P-node and Q-node mapping: terminology

Before describing the P-mapping and Q-mapping algorithms, we set up some useful terminology.

We first define the notion of a partial derivation. In the P-Mapping and Q-Mapping algorithms, derivations of the input node x are built by considering subsets C of its children. With respect to such a subset C, a derivation μ of x is built as if x had only the nodes in C as children, and is called a partial derivation. We denote by μ.v the root of the subtree of the derivation (Tμ.v). μ.pos denotes the number of positively signed leaves considered in the derivation (in Tμ.v). μ.delT and μ.delS denote the number of deletions from Tμ.v and from the string in the derivation, respectively. μ.score denotes the divergence score of the derivation μ.

Definition 20

Let x be a node. μ is a partial derivation between x and a string if μ.v=x and there is a subset of children C⊆children(x) such that the two following conditions are true.

  1. For every u∈children(x)\C, each of the leaves in Tu is neither mapped nor deleted under μ.

  2. For every u∈C, each of the leaves in Tu is either mapped or deleted under μ.

For every u∈children(x)\C, we say that u is ignored under μ. Notice that any derivation is a partial derivation, namely one in which the set of ignored nodes (children(x)\C above) is empty.

In the P-Mapping algorithm, for C⊆children(x), the notation x(C) is used to indicate that the node x is considered as if its only children are the nodes in C (the nodes in children(x)\C are ignored). Consequently, the span of x(C) is defined as span(x(C))=Σc∈C span(c), and the set D(C,N,y,kT,kS) (in Definition 21 where C={x(C)}) now refers to a set of partial derivations. To use x(C) to describe the base cases of the algorithm, let us define x(∅) (x(C) for C=∅) as a tree with no labeled leaves to map.

Since all derivations that are computed in a single call to the P-Mapping algorithm have the same start-point i, it can be omitted (for brevity) from the end-point function; thus, we denote E(x,kT,kS)=L(x,kT,kS). Also, for a set U of nodes, we define L(U,kT,kS)=Σx∈U span(x)+kS−kT, and, accordingly, E(U,kT,kS)=L(U,kT,kS).

We now define certain collections of derivations with common properties (such as having the same number of deletions and end-point).

Definition 21

Let C be a set of nodes, N⊆C be the set of nodes which have a negative sign among the nodes in C, and kT and kS be two numbers. The collection of all the derivations of y∈C to suffixes of S[1:E(C,kT,kS)] with exactly kT deletions from the tree and exactly kS deletions from the string that are consistent with N is denoted by D(C,N,y,kT,kS). By consistency with N we mean: if y∈N, then for each μ∈D(C,N,y,kT,kS), μ.pos≤(span(y)−kT)/2; if y∉N, then for each μ∈D(C,N,y,kT,kS), μ.pos≥(span(y)−kT)/2.

Definition 22

Let C be a set of nodes, N⊆C be a set of nodes which defines the signs of the nodes in C, and kT and kS be two numbers. The collection of all derivations of y∈C to suffixes of S[1:E(C,kT,kS)] with up to kT deletions from the tree, and up to kS deletions from the string, is denoted by D′(C,N,y,kT,kS). Specifically, for the node y∈C, k′T≤kT and k′S≤kS, the set D′(C,N,y,kT,kS) holds only one derivation of y to a suffix of S[1:E(C,kT,kS)] with k′T and k′S deletions from the tree and string, respectively, if such a derivation exists. In addition, the derivations in D′(C,N,y,kT,kS) are consistent with N: if y∈N, then for each μ∈D′(C,N,y,kT,kS), μ.pos≤(span(y)−kT)/2; if y∉N, then for each μ∈D′(C,N,y,kT,kS), μ.pos≥(span(y)−kT)/2.

It is important to distinguish between these two definitions. First, the derivations in D(C,N,y,kT,kS) have exactly kT and kS deletions, while the derivations in D′(C,N,y,kT,kS) have up to kT and kS deletions. Second, in D(C,N,y,kT,kS) there can be several derivations that differ only in their scores and in the one-to-one mappings that yield them, while in D′(C,N,y,kT,kS) there is only one derivation for every deletion combination pair (k′T,k′S). Note that the end-points of all of the derivations are equal.

In every step of the P-Mapping algorithm, a different set of derivations of the children of x is examined; thus, Definition 22 is used for C⊆children(x). In addition, the set of derivations D that is received as input to the algorithms can be described using Definition 22, as can be seen in Eq. 1 below. In this equation, the union is over all C⊆children(x) because in this way the derivations of all the children of x with every possible end-point are obtained (in contrast to having only C=children(x), which results in the derivations of all the children of x with the end-point E(children(x),kT,kS)).

D = ⋃C⊆children(x) ⋃N⊆C ⋃y∈C ⋃kT≤dT ⋃kS≤dS D′(C,N,y,kT,kS)   (1)


In the P-Mapping algorithm, we use the following notation.

Definition 23

Let C be a set of children of a P-node in a PQ-tree. For x,y∈C, the set of all nodes in C that are between x and y is denoted by Cx,y. That is, Cx,y={v∈C : v is between x and y}.

See an example in Fig. 10.

Fig. 10 Cy,f={e,g}, Ca,d={b}, Cy,e=∅
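As a toy illustration of Definition 23 (assuming, as the figure suggests, that “between” refers to the left-to-right order of the children in T), a short sketch:

```python
def nodes_between(children_in_order, u, v):
    """C_{u,v}: the children strictly between u and v, with respect to the
    children's left-to-right order in T (an assumption made for illustration)."""
    i, j = children_in_order.index(u), children_in_order.index(v)
    lo, hi = min(i, j), max(i, j)
    return set(children_in_order[lo + 1:hi])

# With a hypothetical child order consistent with Fig. 10:
# nodes_between(['a', 'b', 'd', 'y', 'e', 'g', 'f'], 'y', 'f') == {'e', 'g'}
```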

In the algorithms, we take into account signed break-points between the nodes in the derivations. So, we define a procedure which receives a node x, two children of x, y and z, that are considered to be adjacent in a quasi-equivalent tree T′x of Tx, and a set N that defines the signs of y and z in T′x. The procedure returns 1 if there is a signed break-point between y and z, and 0 otherwise. The procedure checks if the nodes are adjacent in the tree Tx and if they changed their signed order (recall Definition 9) according to N, as defined next.

Definition 24

BPDelta(x,y,z,N) = 0, if y and z are adjacent in Tx and did not change their signed order; 1, otherwise.   (2)

Similar to Definition 24, we define a procedure which is used in the case where we necessarily consider the two children y and z of x to be adjacent in the tree Tx (specifically, when Cy,z is deleted). This procedure returns 1 if there is a signed break-point between y and z, but does not check if they are adjacent.

Definition 25

BPDelta2(x,y,z,N) = 0, if y and z did not change their signed order; 1, otherwise.   (3)

In the P-Mapping algorithm, we calculate the jump violation (recall Definition 12). To do so, we define the procedure JumpViolationDelta, which receives a vertex cover C′ and a node x, and returns the penalty of x defined in Definition 12, as follows.

Definition 26

JumpViolationDelta(C′,x) = (span(x)−1)/2, if x∈C′; 0, otherwise.   (4)
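For concreteness, Definitions 24–26 translate directly into the following small helpers (a sketch in Python; the adjacency and signed-order checks of Definitions 9 and 13 are assumed to be evaluated by the caller and passed in as booleans):

```python
def bp_delta(adjacent_in_Tx, changed_signed_order):
    """Definition 24: no break-point only if y and z are adjacent in Tx and
    kept their signed order."""
    return 0 if (adjacent_in_Tx and not changed_signed_order) else 1

def bp_delta2(changed_signed_order):
    """Definition 25: y and z are forced to be adjacent (the nodes between
    them were deleted), so only the signed-order check remains."""
    return 0 if not changed_signed_order else 1

def jump_violation_delta(x_in_cover, span_x):
    """Definition 26: the jump penalty is charged only for nodes that belong
    to the chosen vertex cover of the jump graph."""
    return (span_x - 1) / 2 if x_in_cover else 0
```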

P-node mapping: the algorithm

Recall that the input consists of a P-node x, a string S, bounds on the number of deletions from T and S, dT and dS, respectively, and a set of derivations D (see Eq. 1). The output of the algorithm is the collection of divergences of derivations of x to every prefix of S having exactly kT deletions from Tx, kS deletions from the prefix of S and pos leaves with positive sign, for each combination of 0≤kT≤dT, 0≤kS≤dS and 0≤pos≤span(x). Thus, the output contains O(dT·dS·span(x)) derivations.

The algorithm (whose pseudocode is given in Algorithm 5) constructs a 7-dimensional DP table P, which has an entry for every 0≤kT≤dT, 0≤kS≤dS, subset C⊆children(x), subset C′⊆C, subset N⊆C, number 0≤pos≤span(x) and for every y∈C. Recall that we need to calculate JumpViolation for the children of x; to do so, we use the variables C′ and N. The purpose of an entry P[C,C′,N,pos,y,kT,kS] is to hold the divergence of a partial derivation rooted in x(C) to a prefix of S with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in T′x′(C), where x′ is the node equivalent to x, while considering the nodes in N to have a negative sign, C′ as a possible vertex cover of G[x(C),x′(C)] (minus the children that were deleted), and considering derivations of y only to suffixes of S[1:E(C,kT,kS)]. The children of x that are not in C are ignored under the partial derivation stored by the DP table entry P[C,C′,N,pos,y,kT,kS], thus they are neither deleted nor counted in the number of deletions from the tree, kT. (They will be accounted for in the computation of other entries of P.)

Similarly to the main algorithm, some of the entries of P are invalid, and their value is defined as ∞. Formally, an entry P[C,C′,N,pos,y,kT,kS] is invalid if one of the following is true: kT>span(C), L(x(C),kT,kS)>len(S), pos is not consistent with N, or C′ is not a vertex cover of G[x(C),x′(C)] where the signs of children(x′(C)) are defined by N.

Every entry P[C,C′,N,pos,y,kT,kS] for which |C|=1 is initialized with D(C,N,y,pos,kT,kS) (stored in D), which is the divergence of the derivation rooted in Ty to the suffix of S[1:E(C,kT,kS)], with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in Ty such that N is consistent with pos (if it exists, this derivation is stored in D). If such a derivation does not exist, D(C,N,y,pos,kT,kS)=∞.

After the initialization, the remaining entries of P are calculated using the recursion rule in Expression 1 ahead. The order of computation is ascending with respect to the size of the subsets C of the children of x, and for a given C⊆children(x), the order is ascending with respect to the number of deletions from both tree and string. With a lesser priority, the order is ascending with respect to the number of positively signed leaves in Tx(C).

In the algorithm, P[C,C′,N,pos,y,kT,kS] is computed by taking the minimum of the following expressions (we will refer to the minimum of the three expressions as Expression 1):

  1. P[C,C′,N,pos,y,kT,kS−1]+ρdelS Explanation: Intuitively, every entry P[C,C′,N,pos,y,kT,kS] defines some index e of S that is the end-point of every partial derivation in D(C,N,y,kT,kS). Thus, S[e] must be a part of any partial derivation μ∈D(C,N,y,kT,kS); so, either S[e] is deleted under μ or it is mapped under μ. The former option is captured by the first case of the recursion rule.

  2. min over μ∈D(C,N,y,kT,kS) s.t. μ.delT<span(y), and over z∈C\{y}, of P[C\{y}, C′\{y}, N\{y}, pos−μ.pos, z, kT−μ.delT, kS−μ.delS] + μ.score + BPDelta(x,y,z,N) + JumpViolationDelta(C′,y)
    Explanation: If S[e] is mapped under μ, then due to the hierarchical structure of Tx, it must be mapped under some derivation μ of one of the children of x that are in C. Thus, we obtain the second and third cases of the recursion rule. In the second case, we consider z∈C\{y} and derivations of z only to suffixes of S[1:E(C\{y},kT−μ.delT,kS−μ.delS)].
  3. min over μ∈D(C,N,y,kT,kS) s.t. μ.delT<span(y), and over z∈C\{y} s.t. Cy,z⊆C, of P[C\Cy,z\{y}, C′\Cy,z\{y}, N\Cy,z\{y}, pos−μ.pos, z, kT−μ.delT−span(Cy,z), kS−μ.delS] + μ.score + ρdelT·span(Cy,z) + BPDelta2(x,y,z,N) + JumpViolationDelta(C′,y)
    Explanation: The third case captures the option of deleting all the nodes between y and z (Cy,z), so that after the deletion we consider y and z as adjacent in Tx(C). In this case we consider z∈C\{y} and derivations of z only to suffixes of S[1:E(C\Cy,z\{y},kT−μ.delT−span(Cy,z),kS−μ.delS)].   (1)

Once the entire DP table is filled, for every combination of (pos,kT,kS) the algorithm returns the divergence from Tx to S. This is done by going over the partial derivations in P as well as deleting the ignored nodes (and penalizing the deletion accordingly).

Q-node mapping: the algorithm

The input and output of the Q-Mapping algorithm are the same as the input and output of the P-mapping algorithm (“P-node mapping: the algorithm” section), respectively, except for the type of the node x received as input, and the penalty parameters δordQ and δflipQ, which are also part of the input.

Given that the children of x in consecutive order are x1,x2,…,xγ and given an index 1≤i≤γ, x[i] denotes the set of the first i children of x. Formally, x[i]={x1,…,xi}. In addition, x(i) denotes the node x as if its only children are the nodes in x[i]. Consequently, the span of x(i) is defined as Σj=1..i span(xj), and the set D(x(i),N,y,kT,kS) (see Definition 21 where C={x[i]}) now refers to a set of partial derivations. In addition, x[i:j] denotes the set of children of x from child i to child j; formally, x[i:j]={xi,…,xj}. In case i>j, x[i:j]=∅.

The algorithm (whose pseudocode is given in Algorithm 6) constructs two 5-dimensional DP tables, Qℓ and Qr. Both have an entry for every 0≤kT≤dT, 0≤kS≤dS, index 0≤i≤γ, s∈{+,−} and number 0≤pos≤span(x). The purpose of an entry Qℓ[i,kT,kS,s,pos] (and similarly Qr[i,kT,kS,s,pos]) is to hold the divergence of a partial derivation in D(x(i),N,y,kT,kS), i.e. a partial derivation rooted in x(i) to a prefix of S with exactly kT deletions from the tree and kS deletions from the string, while s is the sign of xi in the derivation and pos is the number of positively signed leaves in x(i). The difference between Qℓ and Qr is in the order in which the children of x are arranged. In Qℓ the children of x are considered in a left-to-right order, namely, x1 is the leftmost child of x and x[i]ℓ is the set of the i leftmost children of x (then, x[i:j]ℓ is defined accordingly). In Qr the children of x are considered in a right-to-left order, namely, x1 is the rightmost child of x and x[i]r is the set of the i rightmost children of x (then, x[i:j]r is defined accordingly). For abbreviation, from now on, Qt is used when a notion is true for both Qℓ and Qr. The children of x that are not in x[i] are ignored under the partial derivation stored by the DP table entry Qt[i,kT,kS,s,pos], thus they are neither deleted nor counted in the number of deletions from the tree, kT. (They will be accounted for in the computation of other entries of Qt.)

Similarly to the main algorithm and the P-Mapping algorithm, some of the entries of the DP tables are invalid, and their value is defined as ∞. Formally, an entry Qt[i,kT,kS,s,pos] is invalid if one of the following is true: kT>span(x(i)), L(x[i],kT,kS)>|S|, or pos is not consistent with s.

Every entry Qt[i,kT,kS,s,pos] for which i=1 is initialized with D(x[i],s,xi,pos,kT,kS) (stored in D), which is the divergence of the derivation rooted in Txi to the suffix of S[1:E(x[i],kT,kS)], with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in Txi such that s is consistent with pos (if it exists, this derivation is stored in D). If such a derivation does not exist, D(x[i],s,xi,pos,kT,kS)=∞.

After the initialization, the remaining entries of Qt are calculated using the recursion rule in Expression 2 ahead. The order of computation is ascending with respect to the child index i, and for a given i, the order of computation is ascending with respect to the number of deletions from the string, and ascending with respect to the number of positively signed leaves in Tx(i). In Expression 2 we use the notation Nxi and Nxi,xj, defined as follows. Let xi and xj be two nodes whose signs in T′x(i) are s and s′, respectively. If s=− then xi∈Nxi and xi∈Nxi,xj, and if s′=− then xj∈Nxi,xj. If s=+ then xi∉Nxi and xi∉Nxi,xj, and if s′=+ then xj∉Nxi,xj.

In the algorithm, Qt[i,kT,kS,s,pos] is computed by taking the minimum of the following expressions (we will refer to the minimum of the two expressions as Expression 2):

  1. Qt[i,kT,kS-1,s,pos]+ρdelS

    Explanation: Intuitively, every entry Qt[i,kT,kS,s,pos] defines some index e of S that is the end-point of every partial derivation in D(x(i),N,y,kT,kS). Thus, S[e] must be a part of any partial derivation μ∈D(x(i),N,y,kT,kS); so, either S[e] is deleted under μ or it is mapped under μ. The former option is captured by this case.

  2. min over μ∈D(x[i]t,Nxi,xi,kT,kS), over 1≤j≤i−1 and over s′∈{+,−}, of Qt[j, kT−μ.delT−span(x[j+1:i−1]t), kS−μ.delS, s′, pos−μ.pos] + μ.score + δordQ·BPDelta2(x,xi,xj,Nxi,xj) + ρdelT·span(x[j+1:i−1]t)

    Explanation: If S[e] is mapped under μ, then due to the hierarchical structure of Tx, it must be mapped under some derivation μ of one of the children of x that are in x[i]. Thus, we obtain the second case of the recursion rule. In this case, we consider xj∈x[i−1] and derivations of xj only to suffixes of S[1:E(x[j],kT−μ.delT,kS−μ.delS)].

    (2)

Once both DP tables are filled, for every combination of (pos,kT,kS) the algorithm returns the divergence from Tx to S. This is calculated by going over the partial derivations, and deleting the ignored nodes (while penalizing the deletion accordingly), while considering both orders of children(x).

Recall that we want to take into account the option of x to flip, and penalize accordingly. Thus, we run Algorithm 6 twice, and store the outputs for each combination (kT,kS,pos), with small modifications in the second run. The first time we run it as it is (while ignoring the option of x to flip). In the second run, we capture the option of x to flip. We can do that by looking only at derivations where all the nodes changed their sign (if a node y∈children(x) has a sign of + in Tx, then we consider its sign in the derivation as −, and vice versa). Specifically, we run Algorithm 6 with the following modifications. In Algorithm 6, in line 3 we define t=r, in line 6 we define s as the opposite sign of xi in Tx, and in line 21 we define t=r and s as the opposite sign of xi in Tx. In Expression 2, we define s′ as the opposite sign of xj in Tx. In addition, for every combination (kT,kS,pos), we apply the penalty for flipping x by adding δflipQ to each value. After that, we need to apply FlipCorrection(x,x′) (recall Definition 14), where x′ is the equivalent node that corresponds to the derivation of the entry. Note that we need to correct the flip penalties of the children of x that are not deleted in the derivation; to do so, we can simply backtrack the values of the DP table to recover the derivation and, in particular, find the children that are not deleted.

Finally, we return the minimum of the two runs for every combination (kT,kS,pos).
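The way the two runs are merged can be sketched as follows (our own illustration; both inputs are assumed to be dictionaries mapping (kT, kS, pos) to a divergence, where the values of the flipped run are assumed to already include the FlipCorrection of Definition 14):

```python
import math

def combine_orientation_runs(unflipped, flipped, delta_flip_q):
    """Return, per (kT, kS, pos) combination, the minimum of the unflipped
    run and the flipped run plus the flip penalty for x itself."""
    out = {}
    for key in set(unflipped) | set(flipped):
        no_flip = unflipped.get(key, math.inf)
        with_flip = flipped.get(key, math.inf) + delta_flip_q
        out[key] = min(no_flip, with_flip)
    return out
```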

Complexity analysis

In this section we analyse the time and space complexities of the algorithm that solves TTSD, Algorithm 4. First, we analyse Algorithms 5 and 6, which are used as procedures in Algorithm 4. For a given PQ-tree, we denote by γ the maximum number of children of an internal node.

Lemma 6.1

The P-Mapping algorithm, Algorithm 5, takes O(5γm2ndT2dS2γ3) time and O(5γmdTdS) space.

Proof

The most space consuming part of the algorithm is the 7-dimensional DP table. The first dimension, C, can be any subset of the set children(x), and the second and third dimensions (C′ and N) can be any subsets of C, therefore the size of all three dimensions together is O(5γ), as explained next. Every choice of C,C′,N defines (one-to-one) a partition of children(x) into five sets: children(x)\C, C\(C′∪N), C′\N, N\C′, N∩C′. Thus, in order to go over all C,C′,N, we can go over all partitions of children(x) into five sets, which takes O(5|children(x)|)=O(5γ). The size of the fourth dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. The fifth dimension, y, can be any node in C, therefore the size of the dimension is O(γ). The sizes of the sixth and seventh dimensions (i.e. kT and kS) are dT+1 and dS+1, respectively. Hence, the space of the algorithm is O(5γmγdTdS)=O(5γmdTdS).

The algorithm has three parts: initialization, filling the DP table, and returning the derivations. The most time consuming calculation required in the initialization is the calculation of D(C,N,y,pos,kT,kS) and checking if C′ is a vertex cover of G[x(C),x′(C)]. For a given tuple (C,N,y,pos,kT,kS), it takes O(|D|) time to calculate the set D(C,N,y,pos,kT,kS) (by naively going over each derivation in D and checking if it fits the values in the tuple); notice that |D|=O(mndTdS). We calculate this set for each combination of (C,N,y,pos,kT,kS). Thus, the calculations of D(C,N,y,pos,kT,kS) take O(22γγm2ndT2dS2) time. In addition, we check if C′ is a vertex cover of G[x(C),x′(C)]. To generate G[x(C),x′(C)] we go over all pairs of nodes in C, check if they changed their signed order, and if required, connect them with an edge in the graph. For a pair of nodes, it takes O(γ) time to check if they changed their signed order (naively). Thus it takes O(γ3) time to generate the graph. After that, it takes O(γ2) time to check if C′ is a vertex cover (naively). Hence, the first part of the algorithm takes O(5γmdTdSγ3)+O(22γγm2ndT2dS2)=O(5γm2ndT2dS2γ3) time.

The second part of the algorithm is done by calculating the value of every entry among the O(5γmdTdS) entries of P, using the recursion rule in Expression 1. The first line of the rule takes O(1) time, since it involves looking in another entry of P and basic computations. The second line of the rule involves going over all derivations μ∈D(C,N,y,kT,kS). Namely, going over all derivations of y with a specific end-point, that have no more than a specific number of deletions from the tree and string, and that are consistent with N (i.e. μ.e=E(C,kT,kS), μ.v=y, μ.delT≤kT, μ.delS≤kS, and μ.pos≤(span(y)−μ.delT)/2 if y∈N or μ.pos≥(span(y)−μ.delT)/2 if y∉N). The numbers of deletions from the tree and string are bounded by dT and dS, respectively, and the number of values of pos is bounded by m. Afterwards, we go over all z∈C\{y}, whose number is bounded by γ. In addition, we use the procedure BPDelta, which can be calculated in O(γ) time, and JumpViolationDelta, which takes O(1) time. Thus the second line of the rule takes O(dTdSmγ2) time. The third line of the rule is similar to the second line, except that in addition we calculate Cy,z and the span of Cy,z, which take O(γ) time. Thus, the third line of the rule takes O(dTdSmγ2) time. Hence, the time to calculate one entry of P is O(dTdSmγ2). In total, the second part of the algorithm takes O(5γm2dT2dS2γ2) time.

Finally, to construct the returned set of derivations, the algorithm goes over every combination kT,kS,pos once, i.e. it takes O(dTdSm) time. In total, the algorithm takes O(5γm2ndT2dS2γ3)+O(5γm2dT2dS2γ2)+O(dTdSm)=O(5γm2ndT2dS2γ3) time.

Lemma 6.2

The Q-Mapping algorithm, Algorithm 6, takes O(γ2dT2dS2m2n) time and O(γdTdSm) space.

Proof

The most space consuming part of the algorithm is the 5-dimensional DP table. The first dimension, i, is bounded by γ. The sizes of the second and third dimensions (i.e. kT and kS) are dT+1 and dS+1, respectively. The size of the fourth dimension (i.e. s∈{+,−}) is 2. The size of the fifth dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. Hence, the space of the algorithm is O(γdTdSm).

The algorithm has three parts: initialization, filling the DP table, and returning the derivations. The most time consuming calculation required in the initialization is the calculation of D(x[i],s,xi,pos,kT,kS). For a given tuple (x[i],s,xi,pos,kT,kS), it takes O(|D|) time to calculate the set D(x[i],s,xi,pos,kT,kS) (by naively going over all derivations in D and checking if they fit the values in the tuple); notice that |D|=O(mndTdS). We calculate this set for each combination of (i,kT,kS,s,pos) where i=1. Thus, the calculations of D(x[i],s,xi,pos,kT,kS) take O(m2ndTdS) time all together. Hence, the first part of the algorithm takes O(m2ndTdS) time.

The second part of the algorithm is done by calculating the value of every entry among the O(γdTdSm) entries of Qt, using the recursion rule in Expression 2. The first line of the rule takes O(1) time, since it involves looking in another entry of Qt and basic computations. The second line of the rule involves going over all derivations μ∈D(x[i]t,Nxi,xi,kT,kS). Namely, going over all derivations of xi with a specific end-point, which have no more than a specific number of deletions from the tree and string, and which are consistent with Nxi (i.e. μ.e=E(x[i],kT,kS), μ.v=xi, μ.delT≤kT, μ.delS≤kS, and μ.pos≤(span(xi)−μ.delT)/2 if xi∈Nxi or μ.pos≥(span(xi)−μ.delT)/2 if xi∉Nxi). The numbers of deletions from the tree and string are bounded by dT and dS, respectively, and the number of values for pos is bounded by m. Afterwards, we go over indices 1≤j≤i−1 and signs s′∈{+,−}, whose number is bounded by O(γ). In addition, we use the procedure BPDelta2, which can be calculated in O(γ) time. Thus, the second line of the rule takes O(dTdSmγ) time. Hence, the second part of the algorithm takes O(γ2dT2dS2m2) time.

Finally, to construct the returned set of derivations, the algorithm goes over every combination kT,kS,pos once, taking the minimum over t∈{ℓ,r}, 1≤i≤γ and s∈{+,−}; so, this takes O(dTdSmγ) time. Recall that we calculate both Qℓ and Qr, but this does not affect the magnitude of the time. In addition, recall that we run the algorithm twice, while in the second run there are modifications that do not increase the time (in fact, they even improve it). Hence, in total, the algorithm takes O(m2ndTdS)+O(γ2dT2dS2m2)+O(dTdSmγ)=O(γ2dT2dS2m2n) time.

Lemma 6.3

The main algorithm, Algorithm 4, takes O(n2γ2dT2dS2m2(mp·5γγ+mq)) time and O(dTdSm(mn+5γ)) space.

Proof

The number of leaves in the PQ-tree T is m, hence there are O(m) nodes in the tree, i.e. the size of the first dimension of the DP table, A, is O(m). The size of the second dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. In the algorithm description (“The main algorithm” section) a bound for the possible start indices of subsequences derived from nodes in T is given (for a node x, the start index i runs between 1 and n-(span(x)-dT)+1). The node with the largest span in T is the root, which has a span of m. The root is mapped to the longest subsequence when there are dS deletions from the string. Hence, the size of the third dimension of A is O(n-(m+dS)+1)=O(n). The fourth and the fifth dimensions of A are of size dT+1 and dS+1, respectively. In total, the DP table A is of size O(dTdSm2n).

In the initialization step, O(dTdSm2n) entries of A are computed. We go over 0kSdS and pos{0,1}. The most time consuming step is the generation of the set Ix,S,i,kS,pos, and it can be generated in O(|children(x)|)=O(γ) time. Thus the initialization takes O(dTdSm2nγ). The P-Mapping algorithm is called for every P-node in T and every possible start index i, so the P-Mapping algorithm is called O(nmp) times. Similarly, the Q-Mapping algorithm is called O(nmq) times. Thus, it takes O(dTdSm2nγ)+O(n·(mp·Time(P-Mapping)+mq·Time(Q-Mapping))) time to fill the DP table.

In the final stage of the algorithm, the minimum over the entries corresponding to every combination of (0≤kT≤dT, 0≤kS≤dS, 1≤i≤n−(span(rootT)−kT)+1, 0≤pos≤span(rootT)−kT) is computed. So, it takes O(dTdSnm) time to find a derivation with minimum score.

From Lemma 6.1, the P-Mapping algorithm takes O(5γm2ndT2dS2γ3) time and O(5γmdTdS) space, and from Lemma 6.2, the Q-Mapping algorithm takes O(γ2dT2dS2m2n) time and O(γdTdSm) space. Thus, in total, our algorithm runs in O(dTdSm2nγ)+O(n·(mp·O(5γm2ndT2dS2γ3)+mq·O(γ2dT2dS2m2n)))=O(n2γ2dT2dS2m2(mp·5γγ+mq)) time. Adding to the space required for the main DP table the space required for the P-Mapping algorithm (the space needed for the Q-Mapping algorithm is negligible compared to that of the P-Mapping algorithm) results in a total space complexity of O(dTdSm2n)+O(5γmdTdS)=O(dTdSm(mn+5γ)).

TTSD: polynomial space complexity

In this section we propose an improved version (in terms of space complexity) of the algorithm presented in the “TreeToStringDivergence: algorithm” section, using the inclusion–exclusion principle. Specifically, the space complexity of the algorithm is polynomial instead of exponential. Our P-Mapping algorithm (see “P-node mapping: the algorithm”, Algorithm 5) uses a DP table whose size is exponential. Thus, we propose a new version of the P-Mapping algorithm, which uses a DP table whose size is polynomial. In what follows, we first describe the inclusion–exclusion principle (“Inclusion–exclusion principle” section). Then, we describe the new version of the P-Mapping algorithm (“P-node mapping: polynomial space complexity” section), which is used in Algorithm 4 instead of Algorithm 5. Note that in this version, we are unable to consider jumps of large units. So, when we calculate Diverge, we assume that ΔjumpP=0.

Inclusion–exclusion principle

Usually, the inclusion–exclusion principle is described as a formula for computing |⋃i=1..n Ai| for a collection of sets A1,…,An. We use the intersection version of inclusion–exclusion, which is described as a formula for computing |⋂i=1..n Ai|. Denote by [n] the set of indices from 1 to n, i.e. [n]={1,…,n}. Let A1,…,An⊆U, where U is a finite set. Denote ⋂i∈∅(U\Ai)=U. Then,

|⋂i∈[n] Ai| = ΣX⊆[n] (−1)|X| · |⋂i∈X (U\Ai)|.   (5)

In typical algorithmic applications of the inclusion–exclusion formula we need to count some objects that belong to a universe U, and doing so directly is in some sense hard. More precisely, we are interested in objects that satisfy n requirements A1,…,An, and each requirement is identified with the set of objects that satisfy it. Thus, the inclusion–exclusion formula translates the problem of computing |⋂i∈[n]Ai| into computing 2^n terms of the form |⋂i∈X(U\Ai)|. In our application, computing these terms is in some sense easy. If, for example, each of the terms |⋂i∈X(U\Ai)| can be computed in polynomial time, then the inclusion–exclusion formula gives an algorithm that performs 2^n·n^O(1) arithmetic operations.
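As a self-contained toy example of this counting pattern (our own illustration, not part of the paper), the snippet below counts the sequences of a given length over γ symbols that use every symbol at least once, which mirrors how derivations using every child of x are counted later; terms are grouped by |X| since all subsets of the same size contribute equally here.

```python
from math import comb

def count_covering_sequences(gamma, length):
    """|A_1 ∩ ... ∩ A_gamma| where A_i is the set of sequences over
    {1..gamma} of the given length that use symbol i at least once."""
    total = 0
    for r in range(gamma + 1):
        # There are comb(gamma, r) subsets X of size r, and for each of them
        # |⋂_{i∈X}(U \ A_i)| counts sequences avoiding those r symbols,
        # i.e. (gamma - r) ** length.
        total += (-1) ** r * comb(gamma, r) * (gamma - r) ** length
    return total

# Example: count_covering_sequences(3, 3) == 6, the permutations of 3 symbols.
```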

P-node mapping: polynomial space complexity

Recall that the input consists of an internal P-node x, a string S, bounds on the number of deletions from the tree T and the string S, dT and dS, respectively, and a set of derivations D (see Eq. 1). The output of the algorithm is the collection of divergences of derivations of x to every possible prefix of S having exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign, for each combination of 0≤kT≤dT, 0≤kS≤dS and 0≤pos≤span(x). Thus, there are O(dT·dS·span(x)) derivations in the output.

Denote by γ the number of children of x, i.e. γ=|children(x)|. First, we define the collection of sets A1,…,Aγ (see “Inclusion–exclusion principle” section). Notice that the derivations we seek take into account all of children(x), using each child exactly once (mapped or deleted). Thus, we define the collection A1,…,Aγ where Ai is the set of derivations that use the i’th child of x at least once (mapped or deleted), and use exactly γ children of x (including repetitions). In particular, note that the same child can be both mapped several times (to different substrings) and deleted several times (even if it has been mapped). As a result, ⋂i∈[γ]Ai is the set of derivations that take into account all of children(x), where each child is used exactly once. We define U as the set of all partial derivations using γ children of x (including repetitions). Therefore, ⋂i∈X(U\Ai) is the set of derivations that do not use the children whose indices are in X, and use exactly γ children of x (including repetitions).

Denote by ScoreSet the set of all possible divergence values of all partial derivations of x to S. We use the collection of sets A1,…,Aγ as described above, for every combination of (kT,kS,pos,score), where the divergence values of the derivations in A1,…,Aγ are at most score. Specifically, Algorithm 7 constructs a 4-dimensional DP table B, which has an entry for every 0≤kT≤dT, 0≤kS≤dS, 0≤pos≤span(x) and score∈ScoreSet. The purpose of an entry B[kT,kS,pos,score] is to hold the number of partial derivations rooted in x to a prefix of S with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in the derivation, whose divergence value is at most score. For each entry in the DP table, we calculate its value using the intersection version of the inclusion–exclusion formula, shown in the “Inclusion–exclusion principle” section.

Lemma 7.1

Assuming that Algorithm 8 is correct, Algorithm 7 is correct; that is, it returns the collection of divergences of derivations of x to every prefix of S having exactly kT deletions from Tx, kS deletions from the prefix of S and pos leaves with positive sign, for each combination of 0≤kT≤dT, 0≤kS≤dS and 0≤pos≤span(x).

Proof

For each combination of 0≤kT≤dT, 0≤kS≤dS, 0≤pos≤span(x) and score∈ScoreSet, we define Û⊆U as the set of all partial derivations using γ children of x (including repetitions) with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in the derivation, whose divergence value is at most score. We define Âi⊆Û as the set of derivations that use the i’th child of x at least once (mapped or deleted), and use exactly γ children of x (including repetitions). For each combination (kT,kS,pos,score), ⋂i∈[γ]Âi is the set of derivations with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in the derivation, whose divergence value is at most score, that take into account all of children(x), where each child is used exactly once. For each combination (kT,kS,pos,score), and given Y⊆[γ], Algorithm 8 returns |⋂i∈Y(Û\Âi)|, which is the number of derivations that do not use the children whose indices are in Y, and use exactly γ children of x (including repetitions). In Algorithm 7 we calculate |⋂i∈[γ]Âi| for each combination (kT,kS,pos,score) according to the inclusion–exclusion principle and store the results in table B, using Eq. 5 as follows.

|⋂i∈[γ] Âi| = ΣY⊆[γ] (−1)|Y| · |⋂i∈Y (Û\Âi)|.

From the correctness of Algorithm 8 and the correctness of Eq. 5, the values in B are correct.

Finally, for each combination of 0≤kT≤dT, 0≤kS≤dS and 0≤pos≤span(x), Algorithm 7 returns the minimum score∈ScoreSet for which B[kT,kS,pos,score]>0, meaning that score is the minimum value of the possible derivations, that is, the desired divergence value to be returned.
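The following sketch illustrates this use of the formula (again an illustration under our own naming: count_excluding plays the role of Algorithm 8, returning the number of partial derivations that avoid the children indexed by Y and have divergence at most score for the given (kT, kS, pos)):

```python
from itertools import combinations
import math

def min_divergence_by_inclusion_exclusion(gamma, count_excluding,
                                          score_set, k_t, k_s, pos):
    """For one (kT, kS, pos) combination: build B[kT,kS,pos,score] as the
    signed sum over all Y ⊆ [gamma] and return the smallest score whose
    count is positive (infinity if no derivation exists)."""
    for score in sorted(score_set):
        b = 0
        for r in range(gamma + 1):
            for Y in combinations(range(gamma), r):
                b += (-1) ** r * count_excluding(set(Y), k_t, k_s, pos, score)
        if b > 0:
            return score
    return math.inf
```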

Internal P-node mapping

In this section, we describe Algorithm 8. The input consists of an internal P-node x, a subset C⊆children(x), a string S, bounds on the number of deletions from the tree T and the string S, dT and dS, respectively, and a set of derivations D (see Eq. 1). The output of the algorithm is, for each combination (kT,kS,pos,score), the number of all partial derivations of x(C) to every prefix of S having exactly kT deletions from the tree, kS deletions from the string, pos leaves with positive sign, whose divergence value is at most score, and that use γ children from C (including repetitions). Thus, there are O(dT·dS·span(x)·|ScoreSet|) values in the output.

Algorithm 8 constructs an 8-dimensional DP table P, which has an entry for every 1≤i≤|children(x)|, y∈C, s∈{+,−}, 1≤j≤|S|, 0≤pos≤span(x), 0≤kT≤dT, 0≤kS≤dS, and for every score∈ScoreSet. The purpose of an entry P[i,y,s,j,pos,kT,kS,score] is to hold the number of all partial derivations rooted in x(C) to a suffix of S[1:j], which use exactly i children from C (not necessarily different children), which map y to the suffix of S[1:j], with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in Ty such that s is the sign of y in the derivation and it is consistent with pos, and whose divergence value is at most score.

Every entry P[i,y,s,j,pos,kT,kS,score] for which i=1 is initialized with |D(i,y,s,j,pos,kT,kS,score)|, which is the size of the set of all partial derivations rooted in Ty to a suffix of S[1:j], with exactly kT deletions from the tree, kS deletions from the string, pos leaves with positive sign in Ty such that s is the sign of y in the derivation and it is consistent with pos, and whose divergence value is at most score.

After the initialization, the remaining entries of P are calculated using the recursion rule in Expression 3 ahead. The order of computation is ascending with respect to the number of children that should be used, i, and it is also ascending with respect to the number of deletions from both tree and string. In addition, the order is ascending with respect to the number of positively signed leaves in the derivation, and it is also ascending with respect to the maximum score of the derivations.

In Expression 3 we use the notations Ny and Ny,z, defined as follows. Let y and z be two nodes whose signs in the derivation are considered to be s and s′, respectively. If s=− then y∈Ny and y∈Ny,z, and if s′=− then z∈Ny,z. If s=+ then y∉Ny and y∉Ny,z, and if s′=+ then z∉Ny,z. In addition, we use the notation D′(j,Ny,y,kT,kS) instead of D′(C,Ny,y,kT,kS) (see Definition 22). The two sets are similar, except that the derivations in D′(j,Ny,y,kT,kS) are derivations of y to suffixes of S[1:j].

In the algorithm, P[i,y,s,j,pos,kT,kS,score] is computed by summing the following expressions (we will refer to the sum of the three expressions as Expression 3):

  1. P[i,y,s,j-1,pos,kT,kS-1,score-ρdelS]

    Explanation: For every entry P[i,y,s,j,pos,kT,kS,score], j is the end-point of every partial derivation. Thus, S[j] must be a part of any partial derivation; so, S[j] is either deleted or mapped. The former option is captured by the first case of the recursion rule.

  2. Σz∈C Σs′∈{+,−} P[i−1, z, s′, j−span(y)+μ.delT, pos−μ.pos, kT−μ.delT, kS−μ.delS, score−μ.score−BPDelta(x,y,z,Ny,z)]

    Explanation: If S[j] is mapped, then due to the hierarchical structure of Tx, it must be mapped under some derivation μ of one of the children of x that are in C. Thus, we obtain the second and the third cases of the recursion rule. In these cases we take into account every z∈C to be aligned to the suffix of the derivation’s subsequence S[1:j−span(y)+μ.delT].

  3. Σz∈C s.t. Cy,z⊆C Σs′∈{+,−} P[i−|Cy,z|−1, z, s′, j−span(y)+μ.delT, pos−μ.pos, kT−μ.delT−span(Cy,z), kS−μ.delS, score−μ.score−ρdelT·span(Cy,z)−BPDelta2(x,y,z,Ny,z)]

    Explanation: The third case captures the option of deleting all the nodes between y and z (Cy,z), so that after the deletion we consider y and z as adjacent in Tx(C).

    (3)

Once the entire DP table is filled, for every combination of (pos,kT,kS,score) the algorithm returns the number of all partial derivations of x(C) to every prefix of S. This is calculated by going over the values of the DP table and deleting the ignored nodes, penalizing the deletions accordingly (C′⊆C is the set of children considered to be deleted).

Complexity analysis

In this section we analyse the time and space complexities of the improved algorithm that solves TTSD, Algorithm 4, which uses the improved P-Mapping algorithm. First, we analyse Algorithms 8 and 7, which are used as sub procedures in Algorithm 4 (instead of Algorithm 5).

Lemma 7.2

The Internal P-Mapping algorithm, Algorithm 8, takes O(2γγ4m2n2dT2dS2(dT+dS+m+n)) time and O(γ2nmdTdS(dT+dS+m+n)) space.

Proof

The most space consuming part of the algorithm is the 8-dimensional DP table. The first dimension, 1≤i≤|children(x)|, is of size O(|children(x)|)=O(γ). The second dimension, y, can be any node in C, therefore its size is O(|C|)=O(|children(x)|)=O(γ). The third dimension, s∈{+,−}, is of size 2. The fourth dimension, 1≤j≤|S|, is of size O(n). The size of the fifth dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. The sizes of the sixth and seventh dimensions (i.e. kT and kS) are dT+1 and dS+1, respectively. The eighth dimension, score, can be any score in ScoreSet. For fixed penalty parameters δordQ and δflipQ, in the worst case we pay the maximum number of possible break-points (bounded by m+n), pay the maximum number of deletions from the tree and string, dT+dS, and pay for a flip of all nodes in the tree (which yields m values). Therefore the size of the eighth dimension is O(|ScoreSet|)=O(dT+dS+m+n). Hence, the space of the algorithm is O(γ2nmdTdS(dT+dS+m+n)).

The algorithm has three parts: initialization, filling the DP table, and returning the derivations. The most time consuming calculation required in the initialization is the calculation of D(y,s,j,pos,kT,kS,score). For a given tuple (i,y,s,j,pos,kT,kS,score), it takes O(|D|) time to calculate the set D(y,s,j,pos,kT,kS,score) (by naively going over all derivations in D and checking if they fit the values in the tuple); notice that |D|=O(mndTdS). We calculate this set for each combination of (i,y,s,j,pos,kT,kS,score). Thus, the first part of the algorithm takes O(γn2m2dT2dS2(dT+dS+m+n)) time. The second part of the algorithm is done by calculating the value of every entry among the O(γ2nmdTdS(dT+dS+m+n)) entries of P, using the recursion rule in Expression 3. The first line of the rule takes O(1) time, since it involves looking in another entry of P and basic computations. The second and the third lines of the rule involve going over all derivations μ∈D′(j,Ny,y,kT,kS). Namely, going over all derivations of y with a specific end-point, that have no more than a specific number of deletions from the tree and string, and that are consistent with Ny (i.e. μ.e=j, μ.v=y, μ.delT≤kT, μ.delS≤kS, and μ.pos≤(span(y)−μ.delT)/2 if y∈Ny or μ.pos≥(span(y)−μ.delT)/2 if y∉Ny). The numbers of deletions from the tree and string are bounded by dT and dS, respectively, and the number of values of pos is bounded by m. Afterwards, we go over z∈C, whose number is bounded by γ, and we go over s′∈{+,−}. In addition, we use the procedures BPDelta and BPDelta2, which can be calculated in O(γ) time. Thus the second and third lines of the rule take O(dTdSmγ2) time. Hence, the time to calculate one entry of P is O(dTdSmγ2). In total, the second part of the algorithm takes O(γ4nm2dT2dS2(dT+dS+m+n)) time. Finally, to construct the returned set of derivations, we go over every combination kT,kS,pos,score, and for every combination, we take the maximum over C′⊆C, y∈C, s∈{+,−}, 0≤j≤|S|, i.e. it takes O(2γγdTdSmn(dT+dS+m+n)) time. In total, the algorithm takes O(γn2m2dT2dS2(dT+dS+m+n))+O(γ4nm2dT2dS2(dT+dS+m+n))+O(2γγdTdSmn(dT+dS+m+n))=O(2γγ4m2n2dT2dS2(dT+dS+m+n)) time.

Lemma 7.3

The P-Mapping2 algorithm, Algorithm 7, takes O(22γγ4m2n2dT2dS2(dT+dS+m+n)) time and O(γ2nmdTdS(dT+dS+m+n)) space.

Proof

The most space consuming part of the algorithm is the 4-dimensional table. The size of the first and second dimensions (i.e. kT and kS) are dT+1 and dS+1, respectively. The size of the third dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. The fourth dimension, score, can be any score in ScoreSet, therefore the size of the dimension is O(|ScoreSet|)=O(dT+dS+m+n). In total, the DP table B is of size O(dTdSm(dT+dS+m+n)).

The algorithm has two parts: filling the table, and returning the derivations. In the first part of the algorithm, we go over every subset Y⊆children(x), and for each one we call Algorithm 8. Thus we call Algorithm 8 O(2|children(x)|)=O(2γ) times. In the second part of the algorithm, after the table is full, for every 0≤kT≤dT, 0≤kS≤dS, 0≤pos≤span(x)−kT, we return the minimum over score∈ScoreSet such that the value of the corresponding entry of the table is positive. Hence, the second part of the algorithm takes O(dTdSm(dT+dS+m+n)) time.

From Lemma 7.2, the Internal P-Mapping algorithm takes O(2γγ4m2n2dT2dS2(dT+dS+m+n)) time and O(γ2nmdTdS(dT+dS+m+n)) space. Thus, in total, our algorithm runs in O(2γ·2γγ4m2n2dT2dS2(dT+dS+m+n))=O(22γγ4m2n2dT2dS2(dT+dS+m+n)) time. Adding to the space required for the table B the space required for the Internal P-Mapping algorithm results in a total space complexity of O(dTdSm(dT+dS+m+n))+O(γ2nmdTdS(dT+dS+m+n))=O(γ2nmdTdS(dT+dS+m+n)).

Lemma 7.4

The main algorithm, Algorithm 4, using the improved P-Mapping algorithm, takes O(nγ2dT2dS2m2(mp·22γγ2n(dT+dS+m+n)+mq)) time and O(γ2nm2dTdS(dT+dS+m+n)) space.

Proof

The number of leaves in the PQ-tree T is m, hence there are O(m) nodes in the tree, i.e. the size of the first dimension of the DP table, A, is O(m). The size of the second dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. In the algorithm description (“The main algorithm” section) a bound for the possible start indices of subsequences derived from nodes in T is given (for a node x, the start index i runs between 1 and n−(span(x)−dT)+1). The node with the largest span in T is the root, which has a span of m. The root is mapped to the longest subsequence when there are dS deletions from the string. Hence, the size of the third dimension of A is O(n−(m+dS)+1)=O(n). The fourth and the fifth dimensions of A are of size dT+1 and dS+1, respectively. In total, the DP table A is of size O(dTdSm2n).

In the initialization step, $O(d_T d_S m^2 n)$ entries of A are computed. We go over $0 \le k_S \le d_S$ and $pos \in \{0,1\}$. The most time-consuming step is the generation of the set $I_{x,S,i,k_S,pos}$, which can be done in $O(|children(x)|)=O(\gamma)$ time. Thus, the initialization takes $O(d_T d_S m^2 n \gamma)$ time. The improved P-Mapping algorithm is called for every P-node in T and every possible start index i, i.e. the P-Mapping algorithm is called $O(n m_p)$ times. Similarly, the Q-Mapping algorithm is called $O(n m_q)$ times. Thus, it takes $O(d_T d_S m^2 n \gamma) + O(n(m_p \cdot \text{Time(P-Mapping)} + m_q \cdot \text{Time(Q-Mapping)}))$ time to fill the DP table. In the final stage of the algorithm, the minimum over the entries corresponding to every combination of ($0 \le k_T \le d_T$, $0 \le k_S \le d_S$, $1 \le i \le n-(span(x)-d_T)+1$, $0 \le pos \le span(root_T)-k_T$) is computed. So, it takes $O(d_T d_S n m)$ time to find a derivation with minimum score.

From Lemma 7.3, the P-Mapping2 algorithm takes $O(2^{2\gamma} \gamma^4 m^2 n^2 d_T^2 d_S^2 (d_T+d_S+m+n))$ time and $O(\gamma^2 n m d_T d_S (d_T+d_S+m+n))$ space, and from Lemma 6.2, the Q-Mapping algorithm takes $O(\gamma^2 d_T^2 d_S^2 m^2)$ time and $O(\gamma d_T d_S m)$ space. Thus, in total, our algorithm runs in $O(d_T d_S m^2 n \gamma) + O(n(m_p \cdot O(2^{2\gamma} \gamma^4 m^2 n^2 d_T^2 d_S^2 (d_T+d_S+m+n)) + m_q \cdot O(\gamma^2 d_T^2 d_S^2 m^2))) = O(n \gamma^2 d_T^2 d_S^2 m^2 (m_p \cdot 2^{2\gamma} \gamma^2 n (d_T+d_S+m+n) + m_q))$ time. Adding the space required for the P-Mapping2 and Q-Mapping algorithms to the space required for the main DP table results in a total space complexity of $O(d_T d_S m^2 n) + O(\gamma^2 n m d_T d_S (d_T+d_S+m+n)) + O(\gamma d_T d_S m) = O(\gamma^2 n m^2 d_T d_S (d_T+d_S+m+n))$.
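The way the individual bounds combine into the stated totals can be traced with simple arithmetic; the sketch below (hypothetical Python, with parameter values chosen only for illustration) evaluates the additive time terms from the proof and confirms that the table-filling term dominates.

```python
def main_algorithm_time_terms(n, m, gamma, d_T, d_S, m_p, m_q):
    """Evaluate, up to constant factors, the additive time terms from the proof above."""
    score_range = d_T + d_S + m + n                         # bound on |ScoreSet|
    init = d_T * d_S * m**2 * n * gamma                     # initialization of A
    p_mapping2 = (2 ** (2 * gamma)) * gamma**4 * m**2 * n**2 \
        * d_T**2 * d_S**2 * score_range                     # one P-Mapping2 call
    q_mapping = gamma**2 * d_T**2 * d_S**2 * m**2           # one Q-Mapping call
    fill = n * (m_p * p_mapping2 + m_q * q_mapping)         # filling the DP table
    extract = d_T * d_S * n * m                             # extracting the minimum
    return init, fill, extract

init, fill, extract = main_algorithm_time_terms(n=12, m=8, gamma=3,
                                                d_T=1, d_S=1, m_p=2, m_q=1)
print(fill >= init and fill >= extract)  # True: the table-filling stage dominates
```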

Methods and datasets

Dataset and gene cluster generation

1487 fully sequenced prokaryotic strains with COG ID annotations were downloaded from GenBank (NCBI; ver 10/2012). The gene clusters were generated from this data using the tool CSBFinder-S [45]. CSBFinder-S was applied to all the genomes in the dataset after removing their plasmids, using parameters q=10 (a colinear gene cluster is required to appear in at least ten genomes) and k=0 (no insertions are allowed in a colinear gene cluster), resulting in 79,017 colinear gene orders. Of these, only gene orders whose number of distinct COGs is between 4 and 9 were kept, leaving 28,537 gene orders. Next, ignoring strand and gene order information, colinear gene orders that contain the exact same COGs were united, forming a generalized set of 91 gene clusters in which each gene cluster contains at least 3 gene orders and each COG appears only once in each gene order. For each gene cluster, the most abundant gene order was designated as the "reference" (centroid) gene order. Based on this, the clusters were further filtered to keep only the 63 gene clusters whose designated reference has instances in at least 30 genomes. Finally, clusters containing one or more gene orders that are identical to the designated reference gene order, in terms of the list of classes in which they have instances, were removed, leaving a benchmark set of 59 gene clusters.
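The filtering pipeline described above can be summarized by the following sketch (Python; a simplification under our own assumptions, not the actual CSBFinder-S post-processing code, and omitting the reference-abundance and class-composition filters).

```python
from typing import Dict, FrozenSet, List

def unite_into_clusters(gene_orders: List[List[str]],
                        min_cogs: int = 4, max_cogs: int = 9,
                        min_orders: int = 3) -> Dict[FrozenSet[str], List[List[str]]]:
    """Keep gene orders with 4-9 distinct COGs in which no COG repeats, then unite
    orders with identical COG content (ignoring strand and order) into clusters,
    and keep only clusters with at least `min_orders` gene orders."""
    clusters: Dict[FrozenSet[str], List[List[str]]] = {}
    for order in gene_orders:
        cogs = frozenset(cog.lstrip("+-") for cog in order)  # ignore strand signs
        if min_cogs <= len(cogs) <= max_cogs and len(cogs) == len(order):
            clusters.setdefault(cogs, []).append(order)
    return {k: v for k, v in clusters.items() if len(v) >= min_orders}

def designate_reference(cluster_orders: List[List[str]],
                        abundance: Dict[tuple, int]) -> List[str]:
    """Pick the most abundant gene order of a cluster as its reference (centroid)."""
    return max(cluster_orders, key=lambda order: abundance.get(tuple(order), 0))
```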

PQ-tree construction

The input PQ-trees for our algorithm were constructed using the tool PQFinder (available on GitHub [46]). PQFinder was applied to each of the gene clusters in the dataset to build the PQ-tree representing each cluster. In addition, each Q-node with exactly two children, whose height in the tree is greater than 1, was changed to a P-node (in this special case, all children of the node were observed in all shuffling options, which in our opinion better fits the syntax of a P-node than that of a Q-node).
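The Q-to-P relabelling step can be expressed as a short post-processing pass over the tree; the sketch below is an illustration in Python over a hypothetical PQ-tree structure, not PQFinder code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PQNode:
    kind: str                                   # "P", "Q", or "LEAF"
    children: List["PQNode"] = field(default_factory=list)

def height(node: PQNode) -> int:
    """Height of a node: 0 for a leaf, 1 + maximal child height otherwise."""
    return 0 if not node.children else 1 + max(height(c) for c in node.children)

def relabel_binary_q_nodes(node: PQNode) -> None:
    """Change every Q-node with exactly two children and height greater than 1
    into a P-node, as in the post-processing step described above."""
    if node.kind == "Q" and len(node.children) == 2 and height(node) > 1:
        node.kind = "P"
    for child in node.children:
        relabel_binary_q_nodes(child)
```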

Parameter settings

In our experiment, we set the parameters of the algorithm as follows: $\delta_{ordQ}=1.5$, $\delta_{flipQ}=0.5$, $d_T=0$, $d_S=0$. The stronger penalty for $\delta_{ordQ}$ versus $\delta_{flipQ}$ is based on the observation that gene clusters in prokaryotes are very strongly colinearly conserved [30, 31], even when the benchmark dataset is large and spans a wide taxonomical range of prokaryotes [32].
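For reference, the parameter setting above corresponds to a configuration of the following shape (the key names are ours and are not necessarily the ones expected by the MEM-Rearrange command line).

```python
# Illustrative configuration of the experiment's parameter setting.
PARAMS = {
    "delta_ord_Q": 1.5,   # the delta_ordQ penalty (set higher, favouring colinear conservation)
    "delta_flip_Q": 0.5,  # the delta_flipQ penalty
    "d_T": 0,             # no deletions allowed from the tree
    "d_S": 0,             # no deletions allowed from the string
}
```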

Strand information

Our approach to the comparison of two gene orders focuses on adaptive fitness (in terms of the order by which the gene products are produced). Therefore, we do not distinguish between two gene orders that appear on distinct strands but are identical in terms of the order and direction of their genes (with respect to the transcription start site). Furthermore, the tool by which the gene orders were identified (CSBFinder-S [45]) groups together instances from distinct strands into the same CSB. To this end, for any two gene orders being compared in our benchmarks, we compute the rearrangement distance twice: in the first computation, both gene orders preserve the original strand, and in the second computation, one of the compared gene orders is modified by reversing both the order and the directions of its genes.
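A minimal sketch of this double comparison is given below (Python; `divergence` stands in for the structure-informed measure computed by MEM-Rearrange, and taking the smaller of the two values is our reading of the protocol, not an explicit statement of the text above).

```python
from typing import Callable, List, Tuple

Gene = Tuple[str, int]          # (COG ID, strand: +1 or -1)
GeneOrder = List[Gene]

def reverse_and_flip(order: GeneOrder) -> GeneOrder:
    """Reverse the gene order and flip the strand (direction) of every gene."""
    return [(cog, -strand) for cog, strand in reversed(order)]

def strand_aware_distance(ref: GeneOrder, target: GeneOrder,
                          divergence: Callable[[GeneOrder, GeneOrder], float]) -> float:
    """Compute the rearrangement distance twice, once against the target as given and
    once against its reversed-and-flipped version, and keep the smaller value."""
    return min(divergence(ref, target), divergence(ref, reverse_and_flip(target)))
```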

Results

Evaluation

In this section we evaluate the accuracy of our approach in measuring the evolutionary divergence between two gene orders that belong to the same gene cluster. To this end, we aim to generate a set of “control” distances, computed from real data, against which the divergence scores computed by our tool can be compared and evaluated.

Recall that in our application, each of the input sequences does not correspond to a specific genomic sequence but rather represents a gene order that occurs in multiple genomes. In addition, abundant gene clusters typically display several paralogous occurrences of distinct gene orders, and possibly several paralogous occurrences of a specific gene order, within the same genome. Furthermore, each distinct occurrence of a specific gene order could differ substantially from another occurrence of the same gene order in terms of its encoding genomic sequence, since each COG represents a cluster of genomic sequences that are not identical but are similar enough (possibly based on local sequence similarity) to be clustered into the same gene orthology group. This poses a challenge in creating a set of "control" distances by which to evaluate the performance of our approach in comparison to extant genome rearrangement distances, since the "ground truth" regarding the evolutionary distance between two gene orders cannot be estimated by computing the sequence alignment between the underlying genomic sequences.

Thus, we chose to represent each gene order by the assemblage of its instances, i.e. the set of genomes in which it occurs, and to employ comparative assemblage analysis as a "control" measure. Several similarity (or overlap) indices based on presence/absence (incidence) data have been proposed in the literature [47, 48]. A classical and widely used index in comparative assemblage analysis is the Jaccard index [48]. In our comparative evaluation, the instance assemblages are used to estimate divergence rather than similarity, and therefore we use the inverse Jaccard index as an estimator of the instance-assemblage-based divergence between two gene orders.
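The exact formula of the inverse Jaccard index is not spelled out in this section; the sketch below assumes the standard Jaccard distance, $1 - |A \cap B|/|A \cup B|$, over the instance assemblages (a minimal illustration in Python).

```python
def inverse_jaccard(assemblage_a: set, assemblage_b: set) -> float:
    """Instance-assemblage divergence between two gene orders, taken here as
    1 minus the Jaccard index of their sets of taxonomic classes (in [0, 1])."""
    union = assemblage_a | assemblage_b
    if not union:
        return 0.0
    return 1.0 - len(assemblage_a & assemblage_b) / len(union)

# Example with two partially overlapping class assemblages: 1 - 1/3 = 0.666...
print(inverse_jaccard({"Bacilli", "Clostridia"}, {"Bacilli", "Gammaproteobacteria"}))
```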

Our proposed divergence measure was evaluated, per cluster, as follows: first, we applied our approach (Algorithm 1) to measure the structure-informed divergence from the cluster's designated reference (explained in the "Methods and datasets" section) to each of the other gene orders. Then, we calculated the inverse Jaccard based distance from the set of instances of the reference gene order to the sets of instances of each of the other gene orders. In order to tolerate the noise due to inter-species and inter-genus horizontal transfer of gene orders, we first converted the assemblages of genomes to the assemblages of (taxonomic) classes to which these genomes belong. This resulted in two series of scores, which were then subjected to the computation of Spearman and Pearson correlations between them. The same evaluation procedure was then repeated twice, once using the signed break-point distance (as in Definition 8) instead of our structure-informed divergence measure, and once using the CREx reversal distance [28].
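Per cluster, the evaluation thus reduces to correlating two series of scores; a minimal sketch (Python, using SciPy, with hypothetical input series) is shown below.

```python
from scipy.stats import pearsonr, spearmanr

def correlate_series(divergence_scores, assemblage_divergences):
    """Correlate the rearrangement-based divergences (reference vs. each other gene
    order of a cluster) with the corresponding inverse-Jaccard divergences."""
    pearson_r, _ = pearsonr(divergence_scores, assemblage_divergences)
    spearman_r, _ = spearmanr(divergence_scores, assemblage_divergences)
    return pearson_r, spearman_r

# Hypothetical per-cluster series, one entry per non-reference gene order.
print(correlate_series([0.5, 1.5, 2.0, 3.5], [0.20, 0.40, 0.55, 0.80]))
```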

To this end, the 59 gene clusters were distributed into three groups according to the number of Q-nodes in their representative PQ-trees, as we consider the number of Q-nodes in a PQ-tree to be a good estimate of the hierarchical complexity of its colinear components. This yielded 8 gene clusters whose representative tree has no Q-nodes, 41 gene clusters whose representative tree has one Q-node, and 10 gene clusters whose representative tree has two or more Q-nodes. For each group, the average Spearman and Pearson correlation scores were computed for each of the three compared measures. The results are summarized in Table 1. Additional details regarding the comparative results per gene cluster, as well as the functional categories of the genes in the cluster, are given in Tables 2 and 3. Note that the rows of Table 2 are sorted by decreasing value of the difference $Diverge - d_{reversals}$, and the first result from the table is interpreted in detail in the "A motivating example" section.

Table 1.

A comparison between our proposed rearrangement measure (Diverge), the signed break-point distance ($d_{SBP}$) as in Definition 8, and the CREx reversals distance ($d_{reversals}$) [28], based on their correlation to a taxonomical instance abundance measure

Num of Q-nodes | Correlation | Diverge | $d_{SBP}$ | $d_{reversals}$
0 | Pearson | 0.767 | 0.777 | 0.824
0 | Spearman | 0.680 | 0.673 | 0.647
1 | Pearson | 0.883 | 0.869 | 0.845
1 | Spearman | 0.775 | 0.753 | 0.697
2 | Pearson | 0.927 | 0.859 | 0.842
2 | Spearman | 0.907 | 0.845 | 0.823

Table 3.

The functional categories for the gene clusters described in Table 2, based on their COG annotations

# Gene Cluster ID Functional category
1 19876 Carbohydrate transport and metabolism—transcription
2 21344 Signal transduction mechanisms—cell wall/membrane/envelope biogenesis—defense mechanisms
3 19877 Carbohydrate transport and metabolism—general function prediction only
4 27180 Carbohydrate transport and metabolism—transcription
5 14602 Inorganic ion transport and metabolism
6 23340 Transcription—inorganic ion transport and metabolism
7 19853 Transcription—carbohydrate transport and metabolism
8 14790 Amino acid transport and metabolism—lipid transport and metabolism
9 26244 Carbohydrate transport and metabolism
10 26243 Carbohydrate transport and metabolism—transcription
11 26476 Inorganic ion transport and metabolism
12 15297 Energy production and conversion—coenzyme transport and metabolism
13 22866 Signal transduction mechanisms—cell wall/membrane/envelope biogenesis
14 26238 Carbohydrate transport and metabolism—transcription
15 18995 Carbohydrate transport and metabolism—transcription
16 28119 Carbohydrate transport and metabolism
17 25371 Coenzyme transport and metabolism—amino acid transport and metabolism
18 26231 Carbohydrate transport and metabolism
19 14796 Amino acid transport and metabolism
20 20007 Amino acid transport and metabolism
21 22299 Lipid transport and metabolism
22 19255 Posttranslational modification, protein turnover, chaperones—amino acid transport and metabolism
23 20764 Carbohydrate transport and metabolism
24 21553 Carbohydrate transport and metabolism
25 27851 Cell wall/membrane/envelope biogenesis—inorganic ion transport and metabolism
26 27427 Transcription—inorganic ion transport and metabolism
27 15467 Energy production and conversion
28 19610 Lipid transport and metabolism—general function prediction only
29 20078 Cell wall/membrane/envelope biogenesis—general function prediction only
30 22181 Energy production and conversion—posttranslational modification, protein turnover, chaperones
31 25612 Amino acid transport and metabolism
32 27177 Carbohydrate transport and metabolism
33 8962 Signal transduction mechanisms—inorganic ion transport and metabolism
34 27530 Energy production and conversion—transcription
35 19256 Amino acid transport and metabolism
36 21317 General function prediction only—posttranslational modification, protein turnover, chaperones
37 19852 Carbohydrate transport and metabolism
38 27250 Energy production and conversion
39 30976 Transcription—cell wall/membrane/envelope biogenesis—secondary metabolites biosynthesis, transport and catabolism
40 28245 Coenzyme transport and metabolism—carbohydrate transport and metabolism
41 26257 Carbohydrate transport and metabolism
42 14839 Amino acid transport and metabolism
43 27207 Cell wall/membrane/envelope biogenesis—defense mechanisms—transcription
44 12685 Posttranslational modification, protein turnover, chaperones—general function prediction only
45 21268 Defense mechanisms—signal transduction mechanisms
46 23083 Carbohydrate transport and metabolism—general function prediction only
47 25597 Amino acid transport and metabolism—posttranslational modification, protein turnover, chaperones

The first 47 gene clusters (out of 59); this table is continued in Table 4

Table 1 indicates that, in general, the rearrangement-based divergence between a reference gene order and its target gene orders correlates well with the divergence in the taxonomic distribution of their corresponding instances, which is a very interesting result on its own, supporting our choice of gene order instance assemblage-based distance as a control measure for our comparative analysis.

As expected, in gene clusters where the conserved structure is very weak or absent, the results are comparable to those of the other measures. However, the table shows that as the conserved structure of the gene cluster increases, so does our advantage over the signed break-point distance and the CREx reversal distance. In Fig. 11 we further analyse the effect of the level of colinear component co-dependency on the correlation results. Interestingly, a Kruskal–Wallis test indicates a significant difference between the correlations computed for our proposed measure Diverge across the three groups of gene clusters ($\chi^2(2)=7.537$, p-value = 0.023), while this is not the case for the other compared measures, $d_{SBP}$ ($\chi^2(2)=2.599$, p-value = 0.273) and $d_{reversals}$ ($\chi^2(2)=0.981$, p-value = 0.612).
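The group comparison reported above is a standard Kruskal–Wallis test over the per-cluster correlations of the three groups; a sketch with made-up correlation values (the real values underlie Table 1 and Fig. 11) is shown below.

```python
from scipy.stats import kruskal

# Hypothetical per-cluster Pearson correlations for the Diverge measure, split by
# the number of Q-nodes in the cluster's PQ-tree (values are illustrative only).
group_no_q_nodes  = [0.62, 0.71, 0.80, 0.75, 0.68, 0.83, 0.79, 0.77]
group_one_q_node  = [0.85, 0.90, 0.88, 0.92, 0.87, 0.91, 0.86, 0.93]
group_two_or_more = [0.94, 0.96, 0.91, 0.95, 0.93, 0.92, 0.90, 0.97]

h_statistic, p_value = kruskal(group_no_q_nodes, group_one_q_node, group_two_or_more)
print(h_statistic, p_value)  # a p-value below 0.05 indicates a significant difference
```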

Fig. 11. Distributions of Pearson correlations computed between a series of genomic rearrangement scores and the corresponding series of instance abundance indexes. A Our proposed rearrangement measure (Diverge). B Signed break-point distance ($d_{SBP}$) as in Definition 8. C CREx reversals distance ($d_{reversals}$) [28]. For each measure in A–C, the distribution is computed and shown separately per gene cluster group, as described above: "corr" (Y-axis) denotes the Pearson correlation, "num of Q-nodes" denotes the number of Q-nodes in the PQ-trees of the gene clusters belonging to the specific group

Our implementation of the first algorithm took 0.72 s to complete the analysis of the full dataset when run on a laptop with an Intel(R) Core(TM) i7-8550U CPU (1.99 GHz) using 8 GB RAM. Our implementation of the second algorithm took 5.4 min to complete the same analysis on the same laptop.

Conclusions

In this paper, we defined two (genome rearrangement-based) problems in comparative genomics, denoted TTSD (TreeToStringDivergence) and CTTSD (Constrained TTSD), where the second problem is a special case of the first one. Both problems take as input two sequences of genes $S_1$ and $S_2$, and a PQ-tree T representing the known gene orders of a gene cluster of interest, with its leaves ordered according to sequence $S_1$. TTSD also takes as input integer arguments $d_T$ and $d_S$ (we assume that in CTTSD, $d_T=d_S=0$). The objective is to reorder T as a subsequence $S_2'$ of $S_2$, allowing up to $d_T$ deletions of leaves from T, and such that $S_2'$ is obtained from $S_2$ by using up to $d_S$ deletions from $S_2$, while calculating a corresponding score that serves as the objective divergence measure.

We proposed an algorithm that solves CTTSD in $O^*(1.381^\gamma)$ time and $O^*(1)$ space, where the parameter $\gamma$ is the maximum degree of a P-node in T and $O^*$ is used to hide polynomial factors in the input size. In the special case where the jump penalty is set to 0, the time complexity of our proposed algorithm is $O^*(1)$. In addition, we proposed a parameterized algorithm that solves TTSD in $O^*(5^\gamma)$ time and $O^*(5^\gamma)$ space. Lastly, we proposed a parameterized algorithm that solves a variant of TTSD, where the jump penalty is set to 0, and that reduces the space complexity of the prior algorithm using the inclusion–exclusion principle. This algorithm takes $O^*(4^\gamma)$ time and $O^*(1)$ space.

The proposed general algorithm was implemented as a software tool and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes. Our preliminary results, based on the analysis of the 59 gene clusters, indicate that our proposed measure correlates well with an instance-abundance index that is computed by comparing the class composition of the genomic instances of two compared gene orders. Comparative analysis versus two extant methods (using very preliminary and simple scoring parameter values for our proposed engine) yields equivalent results in terms of the average correlations to the instance-abundance index values obtained for the benchmark dataset. In future work we propose to assemble a large dataset of prokaryotic gene clusters (large enough to train a more sophisticated scoring scheme model) and use it to train our scoring scheme parameters.

Our proposed measure, however, is shown to more sensitively capture and utilize the conserved structural information characterizing a gene cluster: the statistical test we conducted indicates that as the structural information of a gene cluster, as encoded by its PQ-tree, increases, our proposed measure significantly improves in terms of computing rearrangement scenarios between gene orders belonging to the cluster. This is in contrast to the other two extant measures tested in the experiment.

One of the downsides of using PQ-trees to represent gene clusters is that very rare gene orders taken into account in the tree construction could greatly increase the number of allowed rearrangements and thus substantially lower the specificity of the PQ-tree. Thus, a natural continuation of our research would be to increase the specificity of the model by considering a stochastic variation of the algorithms presented in this paper, or alternatively to identify and exclude gene order outliers during PQ-tree construction. In addition, future extensions of this work could also aim to increase the sensitivity of the model by taking into account gene duplications (or at least tandem duplications) and gene-fusion events, which are typical events in gene cluster evolution.

Acknowledgements

Many thanks to the anonymous WABI reviewers for their very helpful comments.

Appendix

See Tables 2, 3, 4.

Table 4.

The functional categories for the gene clusters described in Table 2, based on their COG annotations

# Gene cluster ID Functional category
48 19005 Carbohydrate transport and metabolism
49 15035 Cell motility
50 28547 Cell wall/membrane/envelope biogenesis
51 25375 Coenzyme transport and metabolism
52 20332 Amino acid transport and metabolism—energy production and conversion
53 26364 Inorganic ion transport and metabolism
54 19909 Amino acid transport and metabolism
55 20601 Inorganic ion transport and metabolism—coenzyme transport and metabolism
56 15391 Amino acid transport and metabolism
57 14573 Transcription—carbohydrate transport and metabolism
58 25554 Amino acid transport and metabolism
59 19526 Cell motility

The last 12 gene clusters (out of 59); this table continues Table 3

Author contributions

MZ and MZU initiated and guided the research. EO developed and implemented the presented algorithms, and participated in the application of the algorithms to the experimental benchmarks. EO developed the bioinformatic pipeline and performed the experiments and the data analysis. EO wrote the paper, with guidance by MZ and MZU. All authors read and approved the final manuscript.

Funding

The research of EO was partially supported by the Planning and Budgeting Committee of the Council for Higher Education in Israel and by the Frankel Center for Computer Science at Ben Gurion University. The research of EO, MZ and MZU was partially supported by the Israel Science Foundation and by the European Research Council (ERC) Starting Grant titled PARAPATH.

Availability of data and materials

The code for our tool, the data used in the experiments, and the log file produced by the run of the reported benchmark, can be found on GitHub: http://www.github.com/edenozery/MEM-Rearrange.

Declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

1

That is, the character in position i in G is the same as the character in position j in H.

2

S(x) and S(x) are strings of colors (which are unique), thus there is only one gene mapping of S(x) and S(x).

3

The weight of a set is the sum of the weights of its vertices.

4

A vertex cover is a set $S \subseteq V$ such that every edge of G has at least one endpoint in S.

5

Note that uniqueness follows from our use of colors.

6

That is, given a P-node x, a subset $C \subseteq children(x)$, a string S, bounds on the number of deletions from the tree T and the string S, $d_T$ and $d_S$, respectively, and a set of derivations D, it returns the number of all partial derivations of x(C) to every prefix of S for each combination $(k_T, k_S, pos, score)$ having exactly $k_T$ deletions from the tree, $k_S$ deletions from the string, pos leaves with positive sign, whose divergence value is at most score, and that use $\gamma$ children from C (including repetitions).

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Tatusova T, Ciufo S, Fedorov B, O'Neill K, Tolstoy I. Refseq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 2014;42(D1):553–559. doi: 10.1093/nar/gkt1274.
2. Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, Gillespie JJ, Gough R, Hix D, Kenyon R, et al. Patric, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 2014;42(D1):581–591. doi: 10.1093/nar/gkt1099.
3. Gatt YE, Margalit H. Common adaptive strategies underlie within-host evolution of bacterial pathogens. Mol Biol Evol. 2021;38(3):1101–1121. doi: 10.1093/molbev/msaa278.
4. Alm E, Huang K, Arkin A. The evolution of two-component systems in bacteria reveals different strategies for niche adaptation. PLoS Comput Biol. 2006;2(11):143. doi: 10.1371/journal.pcbi.0020143.
5. Booth KS, Lueker GS. Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-tree algorithms. J Comput Syst Sci. 1976;13(3):335–379. doi: 10.1016/S0022-0000(76)80045-1.
6. Bergeron A, Gingras Y, Chauve C. Formal models of gene clusters. In: Bioinformatics algorithms: techniques and applications. John Wiley and Sons, Inc; 2008. p. 177–202.
7. Zimerman GR, Svetlitsky D, Zehavi M, Ziv-Ukelson M. Approximate search for known gene clusters in new genomes using PQ-trees; 2020. arXiv preprint arXiv:2007.03589.
8. Landau GM, Parida L, Weimann O. Gene proximity analysis across whole genomes via PQ trees. J Comput Biol. 2005;12(10):1289–1306. doi: 10.1089/cmb.2005.12.1289.
9. Fondi M, Emiliani G, Fani R. Origin and evolution of operons and metabolic pathways. Res Microbiol. 2009;160(7):502–512. doi: 10.1016/j.resmic.2009.05.001.
10. Wells JN, Bergendahl LT, Marsh JA. Operon gene order is optimized for ordered protein complex assembly. Cell Rep. 2016;14(4):679–685. doi: 10.1016/j.celrep.2015.12.085.
11. Fertin G, Labarre A, Rusu I, Vialette S, Tannier E. Combinatorics of genome rearrangements. MIT Press; 2009.
12. Braga MD, Sagot M-F, Scornavacca C, Tannier E. Exploring the solution space of sorting by reversals, with experiments and an application to evolution. IEEE/ACM Trans Comput Biol Bioinform. 2008;5(3):348–356. doi: 10.1109/TCBB.2008.16.
13. Darling AE, Miklós I, Ragan MA. Dynamics of genome rearrangement in bacterial populations. PLoS Genet. 2008;4(7):1000128. doi: 10.1371/journal.pgen.1000128.
14. Lemaitre C, Braga MD, Gautier C, Sagot M-F, Tannier E, Marais GA. Footprints of inversions at present and past pseudoautosomal boundaries in human sex chromosomes. Genome Biol Evol. 2009;1:56–66. doi: 10.1093/gbe/evp006.
15. Hannenhalli S, Pevzner PA. Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J ACM (JACM). 1999;46(1):1–27. doi: 10.1145/300515.300516.
16. Bergeron A, Mixtacki J, Stoye J. The inversion distance problem. In: Mathematics of evolution and phylogeny. Oxford University Press; 2005.
17. Tannier E, Bergeron A, Sagot M-F. Advances on sorting by reversals. Discr Appl Math. 2007;155(6–7):881–888. doi: 10.1016/j.dam.2005.02.033.
18. Hannenhalli S, Pevzner PA. Transforming men into mice (polynomial algorithm for genomic distance problem). In: Proceedings of IEEE 36th annual foundations of computer science. IEEE; 1995. p. 581–92.
19. Jean G, Nikolski M. Genome rearrangements: a correct algorithm for optimal capping. Inf Process Lett. 2007;104(1):14–20. doi: 10.1016/j.ipl.2007.04.011.
20. Yancopoulos S, Attie O, Friedberg R. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005;21(16):3340–3346. doi: 10.1093/bioinformatics/bti535.
21. Figeac M, Varré J-S. Sorting by reversals with common intervals. In: International workshop on algorithms in bioinformatics. Springer; 2004. p. 26–37.
22. Bérard S, Bergeron A, Chauve C. Conservation of combinatorial structures in evolution scenarios. In: RECOMB workshop on comparative genomics. Springer; 2004. p. 1–14.
23. Bérard S, Bergeron A, Chauve C, Paul C. Perfect sorting by reversals is not always difficult. IEEE/ACM Trans Comput Biol Bioinform. 2007;4:4–16. doi: 10.1145/1229968.1229972.
24. Sagot M-F, Tannier E. Perfect sorting by reversals. In: International computing and combinatorics conference. Springer; 2005. p. 42–51.
25. Diekmann Y, Sagot M-F, Tannier E. Evolution under reversals: parsimony and conservation of common intervals. IEEE/ACM Trans Comput Biol Bioinform. 2007;4(2):301–309. doi: 10.1109/TCBB.2007.1042.
26. Bérard S, Chauve C, Paul C. A more efficient algorithm for perfect sorting by reversals. Inf Process Lett. 2008;106(3):90–95. doi: 10.1016/j.ipl.2007.10.012.
27. Hartmann T, Bernt M, Middendorf M. An exact algorithm for sorting by weighted preserving genome rearrangements. IEEE/ACM Trans Comput Biol Bioinform. 2018;16(1):52–62. doi: 10.1109/TCBB.2018.2831661.
28. Bernt M, Merkle D, Ramsch K, Fritzsch G, Perseke M, Bernhard D, Schlegel M, Stadler PF, Middendorf M. CREx: inferring genomic rearrangements based on common intervals. Bioinformatics. 2007;23(21):2957–2958. doi: 10.1093/bioinformatics/btm468.
29. Bérard S, Chateau A, Chauve C, Paul C, Tannier E. Computation of perfect DCJ rearrangement scenarios with linear and circular chromosomes. J Comput Biol. 2009;16(10):1287–1309. doi: 10.1089/cmb.2009.0088.
30. Ling X, He X, Xin D. Detecting gene clusters under evolutionary constraint in a large number of genomes. Bioinformatics. 2009;25(5):571–577. doi: 10.1093/bioinformatics/btp027.
31. Winter S, Jahn K, Wehner S, Kuchenbecker L, Marz M, Stoye J, Böcker S. Finding approximate gene clusters with Gecko 3. Nucleic Acids Res. 2016;44(20):9600–9610. doi: 10.1093/nar/gkw843.
32. Svetlitsky D, Dagan T, Ziv-Ukelson M. Discovery of multi-operon colinear syntenic blocks in microbial genomes. Bioinformatics. 2020;36(Supplement-1):21–29. doi: 10.1093/bioinformatics/btaa503.
33. Chandravanshi M, Gogoi P, Kanaujia SP. Structural and thermodynamic correlation illuminates the selective transport mechanism of disaccharide α-glycosides through ABC transporter. FEBS J. 2020;287(8):1576–1597. doi: 10.1111/febs.15093.
34. Gopal S, Berg D, Hagen N, Schriefer E-M, Stoll R, Goebel W, Kreft J. Maltose and maltodextrin utilization by Listeria monocytogenes depend on an inducible ABC transporter which is repressed by glucose. PLoS ONE. 2010;5(4):10349. doi: 10.1371/journal.pone.0010349.
35. Marsh JA, Hernández H, Hall Z, Ahnert SE, Perica T, Robinson CV, Teichmann SA. Protein complexes are under evolutionary selection to assemble via ordered pathways. Cell. 2013;153(2):461–470. doi: 10.1016/j.cell.2013.02.044.
36. Jackson EN, Yanofsky C. The region between the operator and first structural gene of the tryptophan operon of Escherichia coli may have a regulatory function. J Mol Biol. 1973;76(1):89–101. doi: 10.1016/0022-2836(73)90082-X.
37. Plumbridge J. Regulation of the utilization of amino sugars by Escherichia coli and Bacillus subtilis: same genes, different control. Microbial Physiology. 2015;25(2–3):154–67. doi: 10.1159/000369583.
38. Uno T, Yagiura M. Fast algorithms to enumerate all common intervals of two permutations. Algorithmica. 2000;26(2):290–309. doi: 10.1007/s004539910014.
39. Heber S, Stoye J. Algorithms for finding gene clusters. In: International workshop on algorithms in bioinformatics. Springer; 2001. p. 252–63.
40. Bergeron A, Corteel S, Raffinot M. The algorithmic of gene teams. In: International workshop on algorithms in bioinformatics. Springer; 2002. p. 464–76.
41. Eres R, Landau GM, Parida L. A combinatorial approach to automatic discovery of cluster-patterns. In: WABI. Springer; 2003. p. 139–50.
42. Schmidt T, Stoye J. Quadratic time algorithms for finding common intervals in two and more sequences. In: Combinatorial pattern matching. Springer; 2004. p. 347–58.
43. Jiang H, Chauve C, Zhu B. Breakpoint distance and PQ-trees. In: Annual symposium on combinatorial pattern matching. Springer; 2010. p. 112–24.
44. Shachnai H, Zehavi M. A multivariate framework for weighted FPT algorithms. J Comput Syst Sci. 2017;89:157–189. doi: 10.1016/j.jcss.2017.05.003.
45. Svetlitsky D, Dagan T, Ziv-Ukelson M. Discovery of multi-operon colinear syntenic blocks in microbial genomes. Bioinformatics. 2020. doi: 10.1093/bioinformatics/btaa503.
46. Zimerman GR. The PQFinder tool. www.github.com/GaliaZim/PQFinder.
47. Magurran AE. Measuring biological diversity. Curr Biol. 2021;31(19):1174–1177. doi: 10.1016/j.cub.2021.07.049.
48. Chao A, Chazdon RL, Colwell RK, Shen T-J. Abundance-based similarity indices and their estimation when there are unseen species in samples. Biometrics. 2006;62(2):361–371. doi: 10.1111/j.1541-0420.2005.00489.x.
