Algorithms for Molecular Biology : AMB
2023 Dec 1;18:17. doi: 10.1186/s13015-023-00239-x

New algorithms for structure informed genome rearrangement

Eden Ozeri, Meirav Zehavi, Michal Ziv-Ukelson
PMCID: PMC10691145  PMID: 38037088

Abstract

We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, ConstrainedTreeToStringDivergence (CTTSD), we define the basic structure-informed rearrangement measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The PQ-tree representing the gene cluster is ordered such that the series of gene IDs spelled by its leaves is equivalent to that of the reference gene order. Then, a structure-informed genome rearrangement distance is computed between the ordered PQ-tree and the target gene order. The second problem, TreeToStringDivergence (TTSD), generalizes CTTSD, where the gene order members are not necessarily permutations and the structure informed rearrangement measure is extended to also consider up to d_S and d_T gene insertion and deletion operations, respectively, when modelling the PQ-tree informed divergence process from the reference gene order to the target gene order. The first algorithm solves CTTSD in O(nγ² · (m_p·1.381^γ + m_q)) time and O(n²) space, where γ is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of CTTSD is 0, then the algorithm runs in O(nmγ²) time and O(n²) space. The second algorithm solves TTSD in O(n²γ²d_T²d_S²m²(m_p·5^γ γ + m_q)) time and O(d_T d_S m(mn + 5^γ)) space, where γ is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively, and allowing up to d_T deletions from the tree and up to d_S deletions from the string. The third algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of TTSD is 0) in O(nγ²d_T²d_S²m²(m_p·4^γ γ² n(d_T + d_S + m + n) + m_q)) time and O(γ²nm²d_T d_S(d_T + d_S + m + n)) space. The algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes.

Keywords: PQ-tree, Gene cluster, Breakpoint distance

Introduction

Recent advances in pyrosequencing techniques, combined with global efforts to study infectious diseases, yield huge and rapidly-growing databases of microbial genomes [1, 2]. These big new data statistically empower genomic-context based approaches to functional and evolutionary analysis: the biological principle underlying such analyses is that groups of genes that are located close to each other across many genomes often code for proteins that interact with one another, suggesting a common functional association.

Groups of genes that are co-locally conserved across many genomes are denoted gene clusters. The order of the genes in distinct genomic occurrences of a gene cluster may not be conserved. A specific order of the genes of a gene cluster, that is co-linearly conserved across many genomes, is denoted a gene order of the gene cluster. The distinct genomes in which a gene order occurs are denoted instances of the gene order. Gene clusters in prokaryotic genomes often correspond to (one or several) operons; those are neighbouring genes that constitute a single unit of transcription and translation.

In this paper, our biological goal is to study the evolution of gene clusters in prokaryotes, by computing the divergence between pairs of gene orders that belong to the same gene cluster, based on genome rearrangement scenarios. When defining this computational task as an optimization problem, one needs to take into account that parsimony considerations may not be sufficient: driven by the objective to keep the genome small and efficient, in spite of the high rate of gene shuffling in the prokaryotic genome, only gene orders that are reinforced by conveying some advantage in fitness (i.e., in adaptation to some niche) will be kept in the genome. This calls for a Structure-Informed Genome Rearrangement (SIGR) divergence measure that will interleave parsimony considerations with some learned structural and functional information regarding the gene cluster under study. Such a measure could more accurately assess the degree of divergence from one order of a gene cluster to another, and provide further understanding of gene-context level environmental-specific adaptations [3, 4].

To this end, we propose a new SIGR-based divergence measure and provide efficient algorithms to compute it. According to our approach, information regarding the structure of the gene cluster is learned from the known gene orders of the gene cluster and represented by a PQ-tree (formally defined in “Preliminaries” section). The PQ-tree is then utilized to both guide the allowed operations and to compute their weights. A motivating example for our proposed approach, including exemplifying figures, can be found in “A motivating example” section.

PQ-trees have been advocated as a representation for gene clusters [5–7]. A PQ-tree describes the possible permutations of a given sequence, and can be constructed in polynomial-time [8]. The PQ-tree representing a given gene cluster describes its hierarchical inner structure and the relations between instances of the gene cluster succinctly, assists in predicting the functional association between the genes in the gene cluster, yields insights into the evolutionary history of the gene cluster, and provides a natural and meaningful way of visualizing complex gene clusters.

The biological assumption underlying the representation of gene clusters as PQ-trees is that operons evolve mainly via progressive merging of sub-operons, where the most basic units in this recursive operon assembly are colinearly conserved sub-operons [9]. In the case where an operon is assembled from sub-operons that are colinearly dependent, the conserved gene order could correspond, e.g., to the order in which the transcripts of these genes interact in the metabolic pathway in which they are functionally associated [10]. Thus, rearrangement events that shuffle the order of the genes (or of smaller sub-operons) within this sub-operon could affect the function of its product. On the other hand, inversion events in which the genes participating in this sub-operon remain colinearly ordered with respect to the transcription order have less of an effect on the interactions between the sub-operon’s gene products.

The case of colinearly conserved sub-operons is represented in the PQ-tree by a Q-node (marked with a rectangle in the exemplifying figures), and by a Reversal operation in the corresponding pairwise gene order rearrangement scenario. In the case where an operon is assembled from sub-operons that are not colinearly co-dependent, convergent evolution could yield various orders of the assembled components [9]. This case is represented in the PQ-tree by a P-node (marked with a circle in the exemplifying figures), and by a Block Interchange operation in the corresponding pairwise gene order rearrangement scenario.

Background on structure informed genome rearrangement (SIGR) scenarios

A generic formulation of genome rearrangement problems is, given two genomes and some allowed edit operations, to transform one genome into the other using a minimum number of edit operations [11–14]. A famous algorithmic result related to genome rearrangements concerns the problem of sorting signed permutations by Reversals. This problem aims at computing a shortest sequence of Reversals that transforms one signed permutation into another, and can be solved in polynomial time [15–17]. It was later generalized to handle, still in polynomial time, multichromosomal genomes with linear chromosomes, using rearrangements such as Translocations, Chromosome Fusions and Fissions [18, 19]. Then, a general operation called Double Cut-and-Join (DCJ), was introduced in [20] for handling problem instances where the common intervals are organized in a nonlinear structure. A DCJ can be, among others, a Reversal, a Translocation, a Fusion or a Fission, but two consecutive DCJ operations can also simulate a Block-Interchange or a Transposition.

Previous works proposed related forms of SIGR by considering rearrangement scenarios on two permutations that preserve their common intervals (groups of co-localized genes). Such scenarios, which may not be shortest among all scenarios, are called perfect [21]. Computing a Reversal scenario of minimum length that preserves a given subset of the common intervals of two signed permutations is NP-hard [21], and several papers have explored this problem, describing families of instances that can be solved in polynomial time [22–25], fixed parameter tractable algorithms [23, 26], and an exponential time algorithm (which works also in the more general weighted case) [27]. We note that all the perfect scenarios mentioned above considered only Reversal operations, while for our settings Block-Interchange operations should also be considered. A heuristic approach that implements, among other operations, perfect reversals and block interchanges is described in [28].

In [29] the notion of perfect scenario was extended to the Perfect DCJ model, thus capturing additional operations in perfect scenarios, including Cut, Join and Block-Interchange. When considering the perfect rearrangement scenarios that best fit our problem, this is the model that is most relevant to our settings, as the other (non-heuristic) previous works do not include the Block-Interchange operation. The operations considered by the Perfect DCJ model are very general, which renders the problem computationally intractable in its general setting. Indeed, Berard et al. [29] only obtain positive algorithmic results for special cases that, in particular, do not encompass the structure of a PQ-tree, and with a parameter that can often be of the magnitude of the entire input size. For us, the aforementioned special cases are too restricted, and cannot model the problem we have at hand. On the other hand, fortunately for us, considering Cut and Join operations is an unjustified overhead. Specifically, we seek to model the considered evolutionary scenarios by a formulation that is more specific to our biological problem, in order to increase the divergence measure accuracy as well as tighten the parameters driving the complexity of the algorithms for the problem. Since, in our problem, we are dealing with prokaryotic gene clusters and the data in our experiment is typically confined to one chromosome per genome, we need not consider Cut and Join operations. In addition, the intervals in prokaryotic gene clusters follow a strongly conserved hierarchy, naturally modeled by the PQ-tree learned from the members of the gene cluster. In terms of divergence measure accuracy, we would like to enforce the PQ-tree structure as a constraint on the considered rearrangements. Furthermore, while the Perfect DCJ model is unweighted (and simply counts the number of DCJ operations applied), we use the PQ-tree as a guide affecting the weights of the applied rearrangement operations. In terms of tightening the parameters driving the complexity of the computation, the PQ-tree constraint enables us to use dynamic programming algorithms and to reduce the parameter from n to the out-degree of the tree. In particular, this means that the more hierarchical the input is, the smaller our parameter is likely to be, and the faster our algorithm is; in other words, the running time of our algorithm naturally scales with the amount of structure given by the PQ-tree.

In general, our algorithm is intended for widely conserved prokaryotic gene clusters (with cluster length bounded by tens of genes) that display a hierarchical structure. In such cases, the maximal degree of a P-node (which is the exponential factor in the time complexity of our proposed algorithm) is typically small, in particular with respect to the maximal degree of a Q-node (which affects the time complexity of the algorithm by a polynomial factor). This is due to the fact that gene clusters in bacteria are very highly colinearly conserved [30, 31], even when the benchmark dataset is large and spans a wide taxonomical range of prokaryotes [32]. Furthermore, the very same data set used in this study (consisting of 1487 prokaryotic genomes) was previously analyzed by Svetlitsky et al. [32] in order to study the degree of conserved collinearity among widely spread gene clusters in prokaryotic genomes. The study found a negative correlation between the proportion of shuffled gene clusters and their length, i.e. the longer the gene cluster is, the more it is colinearly conserved. All this inspired us to develop an algorithm that, on the one hand, captures the structure of gene clusters in an informed way, and, on the other hand, employs the P-node out-degree as a bound on its exponential factor. To this end we propose an FPT algorithm for the TTSD problem, where the structure of the gene cluster is represented by a PQ-tree, and the parameter bounding the exponential factor of the algorithm is the maximal out-degree of a P-node in the tree.

Our contribution

We propose a new, two-step approach to SIGR: In the first step, the internal topology of the gene cluster under study is learned from its known gene orders and a PQ-tree is constructed accordingly. Then, in the second step, given a reference gene order and a target gene order, a SIGR scenario is computed from the reference to the target, such that colinear dependencies among genes and between sub-operons, as learned by the PQ-tree, are taken into account by the penalties assigned to the rearrangement operations.

To this end, we define two new theoretical problems and propose three algorithms to solve them. In the first problem, denoted ConstrainedTreeToStringDivergence (CTTSD), we define the basic SIGR divergence measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations. The rearrangement operations considered by this problem include (weighted) Reversals and Block-Interchange operations. In this problem, the PQ-tree representing the gene cluster (Fig. 3A) is ordered such that the series of gene IDs spelled by its leaves is equivalent to the reference gene order (Fig. 3B). Then, a weighted SIGR measure is computed from the ordered PQ-tree to the target gene order (Fig. 3C).

Fig. 3 Re-considering the signed break-points (marked in blue) between the two gene orders exemplified in Fig. 2, this time taking into account the PQ-tree representing the corresponding gene cluster. A The PQ-tree. Note that in this figure, as well as the following figures, P-nodes are marked with circles and Q-nodes are marked with rectangles. B The first gene order. C The second gene order

The second problem, denoted TreeToStringDivergence (TTSD), is a generalization of the first problem, where the gene order members are not necessarily permutations and the genome rearrangement measure is extended to also consider up to d_S gene insertion operations and up to d_T gene deletion operations.

The first fixed parameter tractable (FPT) algorithm (in “Constrained TreeToString Divergence: algorithm” section) solves CTTSD in O(nγ² · (m_p·1.381^γ + m_q)) time and O(n²) space, where γ is the maximum number of children of a node, n is the length of the string and the number of leaves in the tree, and m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively. If one of the penalties of CTTSD is defined to be 0, then the algorithm runs in O(nmγ²) time and O(n²) space.

The second FPT algorithm solves TTSD in O(n²γ²d_T²d_S²m²(m_p·5^γ γ + m_q)) time and O(d_T d_S m(mn + 5^γ)) space, where γ is the maximum number of children of a node, n is the length of the string, m is the number of leaves in the tree, m_p and m_q are the number of P-nodes and Q-nodes in the tree, respectively, and allowing d_T deletions from the tree and d_S deletions from the string.

Dynamic programming is common both on trees and on sequences, and our algorithms combine the two. While our first algorithm is simple and intuitive (based on one dynamic programming and two greedy procedures), for our second algorithm (based on three dynamic programming procedures), more technical ingredients are required. For example, one challenge is the need to compute a vertex cover in a graph that is not fully known by any single entry of our dynamic programming table. Specifically, when we consider a single entry, some of the relevant vertices are not yet processed, and for those that are processed, we cannot store enough information (for the sake of efficiency) so as to deduce which edges exist between them.

The third FPT algorithm is intended to reduce the space complexity of the second algorithm. It solves a variant of the problem (where one of the penalties of TTSD is defined to be 0) in O(nγ²d_T²d_S²m²(m_p·4^γ γ² n(d_T + d_S + m + n) + m_q)) time and O(γ²nm²d_T d_S(d_T + d_S + m + n)) space. This algorithm employs the principle of inclusion–exclusion for the sake of space reduction, which, to the best of our knowledge, is not commonly used in the study of problems in computational biology.

The proposed general algorithm is implemented as a software tool, denoted MEM-Rearrange, and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes (in “Results” section). Our preliminary results, based on the analysis of the 59 gene clusters, indicate that our proposed measure correlates well with an index that is computed by comparing the class composition of the genomic instances of the two compared gene orders. The correlations yielded by our measure are shown to significantly increase with the increase in conserved structure of the corresponding gene clusters (as modelled by PQ-trees).

Roadmap

The rest of the paper is organized as follows. In “A motivating example” section we present a motivating biological example. Previous works are reviewed in “Previous related works” section. In “Preliminaries” section, we formally define the terminology used throughout the paper, and, in particular, the two problems studied in this paper. In “Constrained TreeToString Divergence: algorithm” section, we present our first algorithm, which solves the CTTSD problem. In “TreeToString Divergence: algorithm” section, we present our second algorithm, which solves the TTSD problem. In “TTSD: polynomial space complexity” section, we present our third algorithm, which solves a variant of the TTSD problem and improves the space complexity of the second algorithm. In “Methods and datasets” section, we specify the details of our data set construction and experiment. Finally, in “Results” section, we compare the performance of our proposed rearrangement measure versus that of signed break-point distance on a benchmark of 59 chromosomal gene clusters. Concluding remarks are given in “Conclusions” section.

A motivating example

In this section we give a biological example to motivate the new problems defined in this paper. To this end, we chose to interpret the first result shown in Table 2 (found in “Appendix”), which corresponds to a gene cluster consisting of three gene orders of a known ABC transporter operon, encoding genes participating in a carbohydrate uptake system [33]. This example is illustrated in Fig. 1. In each of the gene orders shown in the figure, the product of gene number one is responsible for regulating the transcription of the other genes. The genes numbered two, three and four code for proteins that are part of the transporter machine. Gene number five codes for a protein that is responsible for some metabolism of the transported substrate, such as breaking down a complex sugar into smaller ones [34]. The genes numbered two, three, four and five are located close to each other in all gene orders and are included in a single operon, whose transcription is regulated by the product of gene number one [34].

Table 2.

59 gene clusters analyzed in our experiment

#  Gene cluster ID  Diverge  d_SBP  d_reversals  Size  PQ-tree
1 19876 1.000 0.655 0.500 3 ((0− [1− 2− 3−]) 4+)
2 21344 0.905 0.615 0.439 4 ([0− 1−] [2+ 3+])
3 19877 0.997 0.915 0.592 3 ([0+ 1+ 2+] 3+)
4 27180 0.828 0.828 0.436 3 ([0− 1− 2−] 3+)
5 14602 0.978 0.985 0.654 4 (0− [1− 2− 3− 4−])
6 23340 0.974 0.809 0.684 3 (0− (1+ [2+ 3+]))
7 19853 0.967 0.825 0.703 3 (0− ([1+ 2+ 3+] 4+))
8 14790 0.994 0.994 0.808 3 ([0− 1− 2− 3−] 4− 5−)
9 26244 0.971 0.996 0.817 3 (0− [1− 2− 3−])
10 26243 0.903 0.909 0.757 4 (([0− 1−] 2−) 3−)
11 26476 0.925 0.998 0.791 3 (0+ [1+ 2+ 3+])
12 15297 0.991 0.945 0.866 3 (0− ([1− 2−] [3− 4+]))
13 22866 0.845 0.600 0.722 5 ([0+ 1+] [2+ 3+])
14 26238 0.785 0.807 0.663 7 ([0− 1−] 2− 3−)
15 18995 0.770 0.621 0.681 9 ((0+ [1+ 2+]) 3+)
16 28119 0.559 0.559 0.478 5 (0− 1− 2− 3−)
17 25371 0.983 0.942 0.907 4 (0− ([1− 2−] 3−))
18 26231 0.974 0.802 0.909 4 ((0+ [1+ 2+]) 3+)
19 14796 0.760 0.686 0.700 9 (0− 1− [2− 3−] 4−)
20 20007 0.999 0.999 0.960 3 (0− 1− 2− [3− 4−])
21 22299 0.929 0.929 0.891 3 (0− 1− 2− 3−)
22 19255 0.948 0.940 0.918 5 ([0− 1−] [2− 3−] 4−)
23 20764 0.880 0.987 0.852 3 ([0+ 1+ 2+] 3+)
24 21553 0.982 0.982 0.961 3 (0− [1− 2−] 3− 4−)
25 27851 1.000 1.000 0.982 3 (0+ [1+ 2+] 3+)
26 27427 0.866 0.866 0.866 3 (0+ [1+ 2+ 3+])
27 15467 1.000 1.000 1.000 3 ([0− 1− 2− 3−] 4− 5−)
28 19610 0.998 0.998 0.998 3 (0− ([1− 2−] 3− 4−))
29 20078 0.982 0.945 0.982 3 (0+ (1+ [2+ 3+]))
30 22181 0.912 0.940 0.912 3 (([0+ 1+] 2+) 3+)
31 25612 0.997 0.888 0.997 4 ([0− 1−] [2− 3−])
32 27177 0.945 0.945 0.945 3 (0− [1+ 2+ 3+])
33 8962 0.933 0.933 0.933 3 (0− 1− [2− 3− 4− 5− 6−])
34 27530 0.999 0.974 0.999 3 [[0− 1−] 2− 3−]
35 19256 0.779 0.737 0.798 8 (0− 1− 2− 3−)
36 21317 0.974 0.974 0.995 3 (0+ 1+ [2+ 3+ 4+])
37 19852 0.890 0.821 0.919 6 (0− [1− 2−] 3−)
38 27250 0.967 0.967 0.997 3 ([0+ 1+] [2+ 3+])
39 30976 0.963 1.000 0.996 3 [0− 1+ [2+ 3+]]
40 28245 0.715 0.715 0.775 3 (0− 1− 2− 3−)
41 26257 0.933 0.988 0.999 3 ((0+ [1+ 2+]) 3+)
42 14839 0.860 0.912 0.928 8 ([0− 1−] 2− 3− 4−)
43 27207 0.748 0.749 0.824 7 (0− [1− 2−] 3+)
44 12685 0.917 0.917 0.994 3 ([0− 1− 2−] 3− 4− 5−)
45 21268 0.876 0.876 0.956 4 ([0+ 1+] [2+ 3+])
46 23083 0.832 0.916 0.912 5 (([0− 1−] 2−) 3−)
47 25597 0.810 0.817 0.892 5 (0+ [1+ 2+] 3+)
48 19005 0.902 0.997 0.997 3 ((0+ 1+ 2+) 3+)
49 15035 0.623 0.671 0.732 4 ((0+ ([1+ 2+] 3+)) 4+)
50 28547 0.828 0.828 0.961 7 (0− 1− 2− 3−)
51 25375 0.866 0.866 1.000 3 ([0− 1−] 2− 3−)
52 20332 0.860 0.860 0.997 5 (0+ 1+ [2+ 3+])
53 26364 0.839 0.869 0.988 5 (0+ 1+ 2+ 3+)
54 19909 0.818 0.818 0.968 6 (0− [1− 2−] 3−)
55 20601 0.586 0.586 0.756 5 (0− 1− 2− 3−)
56 15391 0.784 0.784 0.999 3 (0− (1− [2− 3− [4− 5−]]))
57 14573 0.501 0.501 0.722 4 (0− 1− (2− [3− 4−]) 5−)
58 25554 0.721 0.971 0.971 3 (([0+ 1+] 2+) 3+)
59 19526 0.543 0.471 0.814 4 ((0− [1− 2−]) 3−)

For each gene cluster (row in the table) we report the Pearson correlation versus the corresponding instance-abundance measure for each of the three genome rearrangement measures compared in our experiment: our proposed measure (Diverge), signed break-point distance (d_SBP) as in Definition 8, and CREx reversals distance (d_reversals) [28]. The column denoted “Size” gives the number of genes in the gene cluster, and the column denoted “PQ-tree” gives the corresponding PQ-tree representing the gene cluster. The functional category annotations for each cluster are given in Tables 3 and 4

Fig. 1 Gene cluster of a known sugar transporter operon. A Three gene order instances of the gene cluster. B The color-coding key for the functional categories of the genes

Note that even though random shuffling by evolution forms different gene orders, not every order is conserved. In prokaryotic genomes, gene clusters often correspond to operons, which are adjacently localized genes that are co-transcribed and co-translated. Here, selection for specific gene orders in operons can be due to the assembly order of a multifunctional enzymatic complex [35] or the performance of consecutive reactions in a metabolic pathway [36]. The conserved gene order within operons also facilitates the evolution of multiple post-transcriptional and post-translational feedback mechanisms that regulate the expression of enzymes catalyzing different steps in the pathway according to metabolite availability [37]. Thus, the gene orders we see today are functionally conserved. In order to take this into consideration in our genome-rearrangement approach, we use a PQ-tree to represent the gene cluster and to model the possible evolutionary events that yielded the various gene orders (see Fig. 3).

Considering the first two gene orders in the cluster, the signed break-point distance between them is 2 (see Fig. 2). Now, let’s use our knowledge of the evolution of the gene cluster, using the PQ-tree (in Fig. 3), and reconsider the break-point between genes one and two. Looking at the PQ-tree, we can see that the proposed break-point actually falls between gene one and the P-node y, which represents genes from the cluster that are included in the same operon. Notice that the adjacency of gene one and P-node y appears both in the first gene order and in the second gene order. Therefore, according to the PQ-tree, this proposed break-point is functionally irrelevant and should not be penalized by the corresponding SIGR score! This makes biological sense, since gene 1 is a transcription factor that typically appears upstream to the operon it regulates, regardless of how the genes are arranged within the operon [34].

Fig. 2 Computing the signed break-point distance between the first two gene orders from the example shown in Fig. 1. The two signed break-points are denoted by the red lines

Next, consider a comparative analysis between these two gene orders based on a perfect reversals distance scenario. A perfect reversals scenario would acknowledge the observation that the transcription factor gene is not colinearly dependent on a specific gene from the transporter operon; however, it would entail a long and expensive series of reversal operations: first, a reversal of the sequence of genes 2–5, and then the individual reversal of each one of these genes. Although this may have been the actual evolutionary scenario that led to the second gene order, this scenario is quite uninformative in terms of both assessing the functional distance and interpreting the functional discrepancy between the two gene orders. The point is, as would be observed by a PQ-tree guided block-interchange operation, that in the first order the sugar transporter is transcribed prior to the transcription of the sugar-metabolising gene, while in the second gene order the transcription order of these units is reversed.

Previous related works

Permutations on strings representing syntenic gene blocks on genomes have been studied earlier by [38–42], and the idea of a maximal permutation pattern was introduced by Eres et al. [41]. In [8] an algorithm was proposed for representation and detection of gene clusters in multiple genomes using PQ-trees. The proposed algorithm identifies a minimal consensus PQ-tree, and it is proven that this tree is equivalent to a maximal permutation pattern and that each subgraph of the PQ-tree corresponds to a non-maximal permutation pattern. In that paper, the authors also present a general scheme to handle gene multiplicity and missing genes in permutations and give a linear time algorithm to construct the minimal consensus PQ-tree. These previous works fall in the domain of permutation discovery and PQ-tree construction. In contrast, the problems addressed in our paper take as input a previously constructed PQ-tree.

Three additional related works fall in the domain of “PQ-tree language distance computations”. The first problem is defined and solved in [7]. It asks, given a PQ-tree T and a string s, to find, among all permutations that T can generate, a permutation p such that the edit distance between p and s is minimal. Thus, the work of [7] used the input PQ-tree only as a constraint to guide the alignment process, and did not project the guiding information to the comparative score, as we do. Furthermore, the work of [7] did not consider genome rearrangement operations at all, only string edit operations.

For break-point distance, two problems were proposed and solved in [43]. The first problem asks, given two PQ-trees over permutation gene orders and a parameter k, whether there are two strings S_1 and S_2, generated from the two trees, respectively, such that the break-point distance between S_1 and S_2 is at most k. The second problem asks, given a PQ-tree T, a set of p permutations and a parameter k, whether T can generate a permutation s such that the sum of the break-point distances between s and each of the given p permutations is bounded by k. However, in both of the problems addressed in [43], neither the order of the leaves in the input tree nor the actions taken on it are taken into account. Additionally, the break-point distance is computed between two strings, and the tree is not a part of that distance computation. In this paper we employ a scoring strategy that takes the order of the leaves in the tree and the actions on the tree into account. In particular, we define the divergence from an ordered PQ-tree to a string by the rearrangement actions applied on the tree.

Preliminaries

Let S = s_1…s_n be a string. Denote by S[i] the character in position i in S, i.e. S[i] = s_i. In addition, denote by S[i : j] (i ≤ j) the subsequence of S from position i to position j, i.e. S[i : j] = s_i…s_j.

PQ-tree—representing the pattern

A PQ-tree is a rooted tree with three types of nodes: P-nodes, Q-nodes and leaves. The children of a P-node can appear in any order, while the children of a Q-node must appear in either left-to-right order or right-to-left order. Booth and Lueker [5] were interested in permutations of a set U, thus every member of U appears exactly once as a label of a leaf in the PQ-tree. We, on the other hand, allow each member of the set to appear as a label any non-negative number of times. The possible reordering of the children nodes in a PQ-tree can potentially create many equivalent PQ-trees. Booth and Lueker defined two PQ-trees T and T′ as equivalent (denoted T ≡ T′) if and only if one tree can be obtained by legally reordering the nodes of the other; namely, randomly permuting the children of a P-node, and reversing the children of a Q-node. A generalization of their definition, to allow for insertions and deletions, is defined as follows.
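
To make these reorderings concrete, the following minimal Python sketch (our own illustration, not the authors' implementation; node and field names are assumptions) represents a PQ-tree and enumerates the frontiers of all equivalent trees obtained by permuting the children of P-nodes and optionally reversing the children of Q-nodes.

```python
# A minimal PQ-tree sketch (illustrative only; names are our own).
from dataclasses import dataclass, field
from itertools import permutations, product
from typing import List

@dataclass
class PQNode:
    kind: str                      # 'leaf', 'P' or 'Q'
    label: str = ''                # used only for leaves
    children: List['PQNode'] = field(default_factory=list)

def frontiers(node: PQNode):
    """Yield the frontier of every PQ-tree equivalent to the one rooted at node."""
    if node.kind == 'leaf':
        yield node.label
        return
    child_options = [list(frontiers(c)) for c in node.children]
    if node.kind == 'P':
        orders = permutations(range(len(node.children)))
    else:                          # Q-node: original order or its reversal
        orders = [tuple(range(len(node.children))),
                  tuple(reversed(range(len(node.children))))]
    seen = set()
    for order in orders:
        for pick in product(*(child_options[i] for i in order)):
            s = ''.join(pick)
            if s not in seen:
                seen.add(s)
                yield s

# Example: a P-node over a Q-node [a b] and a leaf c.
t = PQNode('P', children=[PQNode('Q', children=[PQNode('leaf', 'a'), PQNode('leaf', 'b')]),
                          PQNode('leaf', 'c')])
print(sorted(frontiers(t)))        # ['abc', 'bac', 'cab', 'cba']
```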

Definition 1

(Quasi-Equivalence) Two PQ-trees T, T′ are quasi-equivalent with parameter d, denoted by T ≡_d T′, if and only if T′ can be obtained by (a) randomly permuting the children of the P-nodes of T, (b) reversing the children of the Q-nodes of T, (c) deleting up to d leaves of T.

Denote by T_x the subtree of a PQ-tree T rooted in the node x. Denote by Leaves(x) the set of leaves of the PQ-tree T_x, span(x) = |Leaves(x)|, and, for a set of nodes U, span(U) = Σ_{v∈U} span(v). Denote by children(x) the set of children of the node x, and let root_T be the root node of the tree T.

Given a PQ-tree T, we denote the label of a leaf x of T by label(x). The frontier of the PQ-tree T, denoted F(T), is the sequence of labels on the leaves of T read from left to right. In addition to Definition 1, for each node in a PQ-tree (internal node or leaf), we define a unique “color” which will help us distinguish and map between nodes of two quasi-equivalent PQ-trees. Colors are used only for the analysis; they are not used explicitly in the algorithm. In the input we receive one PQ-tree (T) and “assign” arbitrary (but unique) colors to its nodes. In the algorithms, we perform actions on T, which reorder its nodes. We refer by T′ to the original tree (T) with reordered nodes. Thus, we use colors just to keep track of the nodes after the shuffling, see Fig. 4. From now on, when we say that two PQ-trees T, T′ are quasi-equivalent, we assume that the equivalence is with parameter d. In addition, we assume that the PQ-trees T, T′ are colored with the same unique colors (each color is assigned to exactly one node in T and to at most one node in T′, and the nodes in T′ have the same colors as their corresponding nodes in T). In addition, we say that the frontier of T′, F(T′), is derived from T, and we call this a derivation. We can also say that T is ordered as F(T). When a string S is derived from T_x, we also say that S is derived from x.

Fig. 4 a PQ-tree T. b PQ-tree T′, which is quasi-equivalent to T. The colors are represented by the number assigned to each node; the labels of the leaves are the letters. Each internal node in T is marked by a letter (x, for example), and its equivalent node in T′ is marked by the same letter with a dash (x′)

Definition 2

(Equivalent Nodes) Given two quasi-equivalent PQ-trees T, T′ with parameter d and two nodes x ∈ T and x′ ∈ T′, x and x′ are equivalent nodes if they share the same color.

In Fig. 4, the colors of the trees are shown as unique numbers near each node (notice that each node “keeps” its color in T′ compared to T). We say that the string “abab” is derived from T, because “abab” is the frontier of T′.

Break-point distances

The definitions of our problems make use of the notion of break-point distance [43] to determine the distance between two strings, as defined below.

Definition 3

(Gene Mapping) Let G = g_1,…,g_n and H = h_1,…,h_m be two strings. A gene mapping of G and H, denoted by M, is a set of pairs (i, j) ∈ {1,…,n} × {1,…,m} such that g_i = h_j and every position in G and H is in exactly one pair in M. When no confusion arises, we suppose that the gene mapping contains the pairs of genes (g_i, h_j) themselves.

Definition 4

(Break-Point) Given two strings G = g_1,…,g_n, H = h_1,…,h_m and a gene mapping M, a break-point between G and H is a pair of consecutive genes g_i g_{i+1} in G (resp. h_i h_{i+1} in H) such that the following is true: g_i and g_{i+1} (resp. h_i and h_{i+1}) belong to M, say, (g_i, h_j), (g_{i+1}, h_k) ∈ M (resp. (g_j, h_i), (g_k, h_{i+1}) ∈ M), but neither k = j+1 nor k = j−1.

Denote by NUM_BP(G, H, M) the number of break-points of G between G and H with respect to M.

Definition 5

(Break-Point Distance) Let S_1 and S_2 be two strings. The break-point distance between S_1 and S_2, denoted by d_BP(S_1, S_2), is the minimum of NUM_BP(S_1, S_2, M) among all gene mappings M of S_1 and S_2.

For example, suppose we are given the strings S_1 = abcd and S_2 = acbd. Then, there exists exactly one gene mapping of S_1 and S_2. The break-points of S_1 are the pairs (a, b) and (c, d). Therefore, the break-point distance between them is 2. We can also count the number of break-points of S_2, which are (a, c) and (b, d).
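
As an illustrative aid (our own sketch, not the paper's implementation), the following Python snippet counts break-points between two strings over the same set of distinct characters, where the gene mapping is unique; it reproduces the example above.

```python
# Count break-points of G with respect to H when both strings are permutations
# of the same character set, so the gene mapping is unique (illustrative sketch).
def breakpoints(G: str, H: str) -> int:
    pos_in_H = {ch: j for j, ch in enumerate(H)}   # unique mapping: character -> position in H
    count = 0
    for i in range(len(G) - 1):
        j, k = pos_in_H[G[i]], pos_in_H[G[i + 1]]
        if k != j + 1 and k != j - 1:              # neither adjacency is preserved
            count += 1
    return count

print(breakpoints("abcd", "acbd"))  # 2: the pairs (a, b) and (c, d)
print(breakpoints("acbd", "abcd"))  # 2: the pairs (a, c) and (b, d)
```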

We will use a variant of break-point distance that takes the signs of the characters in a string into account. Towards that, we define the notion of a signed string.

Definition 6

(Signed String) A signed string is a string where each character is assigned a sign (‘+’ or ‘−’).

When we say that a string S is a signed string, we suppose to have a sign function sign that returns the sign of the character in each position in S. For the sake of illustration, in our examples, we will indicate the sign of each character by a ‘+’ or ‘−’ on its left side.

Definition 7

(Signed Break-Point) Given two signed strings G = g_1,…,g_n, H = h_1,…,h_m and a gene mapping M, a signed break-point between G and H is a pair of consecutive genes g_i g_{i+1} in G (resp. h_i h_{i+1} in H) such that the following is true: g_i and g_{i+1} (resp. h_i and h_{i+1}) belong to M, say, (g_i, h_j), (g_{i+1}, h_k) ∈ M (resp. (h_i, g_j), (h_{i+1}, g_k) ∈ M), but neither [k = j+1, sign_G(i) = sign_H(j) and sign_G(i+1) = sign_H(k)] nor [k = j−1, sign_G(i) = −sign_H(j) and sign_G(i+1) = −sign_H(k)].

Denote by NUM_SBP(G, H, M) the number of signed break-points of G between G and H with respect to M.

Definition 8

(Signed Break-Point Distance) Let S_1 and S_2 be two signed strings. The signed break-point distance between S_1 and S_2, denoted by d_SBP(S_1, S_2), is the minimum of NUM_SBP(S_1, S_2, M) among all gene mappings M.

For example, consider the signed strings S_1 = +a+b+c+d and S_2 = +a−b−d−c. Then, there exists exactly one gene mapping of S_1 and S_2. The signed break-points of S_1 are the pairs (a, b) and (b, c). Therefore, d_SBP(S_1, S_2) = 2. We can also count the number of signed break-points of S_2, which are (a, b) and (b, d).
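
A small Python sketch (ours, under the assumption that the two signed strings are permutations of the same characters, so the gene mapping is unique) makes Definition 7 operational and reproduces the example above.

```python
# Count signed break-points of G with respect to H for signed permutations
# over the same character set (illustrative sketch; signs given as '+'/'-').
def signed_breakpoints(G, H):
    """G and H are lists of (sign, char) pairs, e.g. [('+', 'a'), ('-', 'b')]."""
    pos = {ch: j for j, (_, ch) in enumerate(H)}       # unique mapping: char -> position in H
    sign_H = {ch: s for s, ch in H}
    count = 0
    for i in range(len(G) - 1):
        (s1, c1), (s2, c2) = G[i], G[i + 1]
        j, k = pos[c1], pos[c2]
        keeps_order = (k == j + 1 and s1 == sign_H[c1] and s2 == sign_H[c2])
        keeps_reversed = (k == j - 1 and s1 != sign_H[c1] and s2 != sign_H[c2])
        if not keeps_order and not keeps_reversed:
            count += 1
    return count

S1 = [('+', 'a'), ('+', 'b'), ('+', 'c'), ('+', 'd')]
S2 = [('+', 'a'), ('-', 'b'), ('-', 'd'), ('-', 'c')]
print(signed_breakpoints(S1, S2))  # 2: the pairs (a, b) and (b, c)
print(signed_breakpoints(S2, S1))  # 2: the pairs (a, b) and (b, d)
```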

To accommodate deletions, we will use Definition 8 with respect to strings obtained from the given ones after deleting characters.

Problem preliminaries

Given an internal node x in a PQ-tree T, we define its sign, sign(x), as the majority sign of the leaves in T_x, Leaves(x). If the number of negative signed leaves is equal to the number of positive signed leaves, then we abuse notation and consider sign(x) as + as well as −.

Given a node x in a PQ-tree, let S(x) denote the signed string of colors of the nodes in children(x) as they are ordered in the tree (from left to right). For example, consider the PQ-tree shown in Fig. 5a where the character assigned to each internal node is its color, then S(z)=+b+c, S(y)=+a+z+d and S(x)=+y+e+f.
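
The following Python fragment (an illustrative sketch with our own field names) computes sign(x) as the majority sign of the leaves below x and builds the signed color string S(x) of its children; for simplicity it resolves ties to ‘+’, whereas the paper keeps both signs.

```python
# Illustrative sketch: majority sign of an internal node and the signed string S(x).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    color: str                      # unique color; for leaves we reuse the label
    sign: str = '+'                 # meaningful for leaves
    children: List['Node'] = field(default_factory=list)

def leaves(x: Node):
    return [x] if not x.children else [l for c in x.children for l in leaves(c)]

def node_sign(x: Node) -> str:
    """Majority sign of Leaves(x); ties resolved to '+' in this sketch."""
    if not x.children:
        return x.sign
    pos = sum(1 for l in leaves(x) if l.sign == '+')
    return '+' if 2 * pos >= len(leaves(x)) else '-'

def S(x: Node) -> str:
    """Signed string of the colors of children(x), in left-to-right order."""
    return ''.join(node_sign(c) + c.color for c in x.children)

# Mirroring Fig. 5a: z = [b c], y = (a z d), x = (y e f), all leaves positive.
z = Node('z', children=[Node('b'), Node('c')])
y = Node('y', children=[Node('a'), z, Node('d')])
x = Node('x', children=[y, Node('e'), Node('f')])
print(S(z), S(y), S(x))             # +b+c +a+z+d +y+e+f
```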

Fig. 5 a A PQ-tree T. b A PQ-tree T′ that is equivalent to T and is ordered as S

In order to measure the divergence from an ordered PQ-tree to a string, we take into account the actions performed on the PQ-tree to order it as needed. Towards defining the divergence from an (ordered) PQ-tree T to a string S, we first define a penalty for taking an action on an internal node of a PQ-tree, denoted by Δ_violation. The penalty Δ_violation is a combination of several types of penalties. The first type concerns cases where large units “jump” while reordering the children of a P-node x to have the same order as the children of its equivalent node x′. Specifically, we want to penalize according to the sizes of these units: we will not penalize a single leaf that jumps, but only units whose size is larger than 1, and the penalty for these units increases with their sizes. Thus, we consider the following penalty. If the size of a unit that jumps is t, then we penalize this operation by (t − 1)/2. In particular, a leaf does not get penalized. To do so, we build a graph G(x, x′) (defined formally later), whose vertex set is the children of the P-node. Each vertex has a weight relative to its size (defined as the span of the child it represents), as will be defined ahead. Roughly speaking, the graph G has an edge for each pair of children that “changed their order” (with respect to their signs). To be precise, we need the following definition.

Definition 9

(Change Of Signed Order) Let x, x′ be two equivalent P-nodes of two quasi-equivalent PQ-trees T and T′ with parameter d, let S(x) = c_1,…,c_n be the signed string of colors of the nodes in children(x) that were not deleted, as they are ordered in T, let S(x′) = c′_1,…,c′_n be the signed string of colors of the nodes in children(x′), as they are ordered in T′, and let M be the gene mapping of S(x) and S(x′). Given two nodes y (with color c_i, where y′ is the equivalent node of y in T′) and z (with color c_j, where z′ is the equivalent node of z in T′) in children(x), say, j > i, and such that (c_i, c′_k), (c_j, c′_t) ∈ M for some c′_k and c′_t, y and z change their signed order if the following are both false.

  • t > k, sign(y) = sign(y′) and sign(z) = sign(z′).

  • t < k, sign(y) = −sign(y′) and sign(z) = −sign(z′).

We denote by G[x, x′] the graph (described before Definition 9) of two equivalent P-nodes x and x′ of two quasi-equivalent PQ-trees. Formally, it is defined as follows.

Definition 10

(G[x, x′]) Given two equivalent P-nodes x, x′ of two quasi-equivalent PQ-trees with parameter d, G[x, x′] = (V, E, w) is the undirected graph with vertex set V, edge set E and vertex weight function w, which is defined as follows.

  • V = children(x).

  • E = {(v, u) | u and v changed their signed order}.

  • For each v ∈ V, w(v) = (span(v) − 1)/2.

Notice that the graph depends on the colors, which determine the edge set.

After building G[x, x′], we find the minimum weight of a vertex cover of G[x, x′] in order to sum all penalties for the units that jumped while reordering the children of a P-node. Observe that by computing the minimum weight of a vertex cover, we identify a “best” (in terms of penalty) set of nodes that jump.

Definition 11

(Minimum Weighted Vertex Cover) Let G = (V, E, w) be a graph with vertex set V, edge set E and vertex weight function w. The minimum weighted vertex cover of G is the minimum weight among all vertex covers of G.

Definition 12

(Δ_jump^P(x, x′)) Given two equivalent P-nodes x and x′ of two quasi-equivalent PQ-trees with parameter d, and the weight t of a minimum weighted vertex cover of G[x, x′], the jump violation between x and x′, denoted by Δ_jump^P(x, x′), is t.
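
Since the number of children of a P-node is bounded by γ, the minimum weighted vertex cover of G[x, x′] can be found by exhaustive search over the 2^γ vertex subsets; the Python sketch below (ours, for illustration, with weights w(v) = (span(v) − 1)/2 as in Definition 10) does exactly that.

```python
# Brute-force minimum weighted vertex cover over the children of a P-node
# (illustrative sketch; feasible because the number of children is at most gamma).
from itertools import combinations

def min_weight_vertex_cover(spans, edges):
    """spans: list of span(v) per child; edges: list of (i, j) index pairs that
    changed their signed order. Returns the jump violation Delta_jump^P."""
    weights = [(s - 1) / 2 for s in spans]          # w(v) = (span(v) - 1) / 2
    n = len(spans)
    best = float('inf')
    for r in range(n + 1):
        for cover in combinations(range(n), r):
            chosen = set(cover)
            if all(i in chosen or j in chosen for i, j in edges):
                best = min(best, sum(weights[i] for i in chosen))
    return best

# Two children of span 3 and 2 changed their signed order with a leaf (span 1):
print(min_weight_vertex_cover([3, 2, 1], [(0, 2), (1, 2)]))  # 0.0: the leaf covers both edges
```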

Now, we define a violation between two equivalent internal nodes. Towards that, given two equivalent nodes x, x′ of two quasi-equivalent PQ-trees T and T′ (where x is not deleted), let isFlipped(x, x′) be a procedure that returns 1 if x and x′ “flipped”, and 0 otherwise. That is, if x and x′ are leaves, isFlipped(x, x′) = 1 if sign(x) = −sign(x′); otherwise, isFlipped(x, x′) = 0. If x and x′ are internal nodes, isFlipped(x, x′) = 1 if for each child y ∈ children(x) that is not deleted, isFlipped(y, y′) = 1, where y′ is the child of x′ that is the equivalent node of y, and the order of children(x′) in T′ is the reversal of the order of children(x) in T; otherwise, isFlipped(x, x′) = 0.
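
A recursive Python sketch of isFlipped (our own simplification, ignoring deletions, i.e. d_T = 0, and matching equivalent nodes by their unique colors) may help to follow the recursion.

```python
# Illustrative isFlipped(x, x') sketch for the deletion-free case (d_T = 0).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    color: int                      # unique color shared with the equivalent node
    sign: str = '+'                 # meaningful for leaves
    children: List['Node'] = field(default_factory=list)

def find_by_color(root: Node, color: int) -> Optional[Node]:
    """Locate the equivalent node in T' via its color."""
    if root.color == color:
        return root
    for c in root.children:
        hit = find_by_color(c, color)
        if hit is not None:
            return hit
    return None

def is_flipped(x: Node, t_prime: Node) -> bool:
    xp = find_by_color(t_prime, x.color)   # equivalent node x' (assumed to exist here)
    if not x.children:                     # leaves: signs must be opposite
        return x.sign != xp.sign
    # Internal node: children order must be exactly reversed in T',
    # and every child must itself be flipped.
    if [c.color for c in xp.children] != [c.color for c in reversed(x.children)]:
        return False
    return all(is_flipped(c, t_prime) for c in x.children)

# Example: the Q-node [+a +b] flipped into [-b -a].
q  = Node(0, children=[Node(1, '+'), Node(2, '+')])
qp = Node(0, children=[Node(2, '-'), Node(1, '-')])
print(is_flipped(q, qp))  # True
```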

Given an internal node x, denote by S̃_d(x) the set of signed strings where S̃ ∈ S̃_d(x) if S̃ is obtained from S(x) by deleting up to d leaves from T_x.

Definition 13

(Δ_violation^d(x, x′)) Given two equivalent internal nodes x and x′ of two quasi-equivalent PQ-trees with parameter d, and input numbers δ_ord^Q and δ_flip^Q, the violation between x and x′, denoted by Δ_violation^d(x, x′), is defined as follows.

  • If x is a P-node, Δ_violation^P(x, x′) = min_{S̃ ∈ S̃_d(x)} d_SBP(S̃, S(x′)) + Δ_jump^P(x, x′).

  • If x is a Q-node, Δ_violation^Q(x, x′) = δ_ord^Q · min_{S̃ ∈ S̃_d(x)} d_SBP(S̃, S(x′)) + isFlipped(x, x′) · δ_flip^Q.

When we use the notation Δ_violation^d, if d is clear from the context, we will use the notation Δ_violation instead. In Definition 13, one of the penalties for reordering the children of a Q-node is the flip penalty (δ_flip^Q), and it is applied if the Q-node has flipped (as indicated by isFlipped). When considering this event, notice the case where a Q-node x flipped and all of the non-deleted children in children(x) flipped as well (including the P-nodes). In this case, we penalise each of the non-deleted children in children(x), and x as well, for flipping, but, in fact, the event is only the flip of x. Thus, we unnecessarily penalise the non-deleted children in children(x). Therefore, we employ a flip correction procedure as defined next.

Definition 14

(FlipCorrection(x, x′)) Given two equivalent internal nodes x and x′ of two quasi-equivalent PQ-trees with parameter d, the flip correction between them, denoted by FlipCorrection(x, x′), is 0 if not all of children(x) were flipped or deleted, and otherwise it is the sum of the flip penalties of all of the Q-nodes and leaves in children(x) that were not deleted.

Finally, after defining the violation between equivalent nodes, we simply sum up the violations (and corrections) of all internal nodes to define the divergence from an ordered and labeled PQ-tree to a signed string as follows.

Definition 15

(Diverge_{d_T,d_S}(T, S)) Let T be an (ordered and leaf-labeled) PQ-tree, and S be a signed string. Let O_T be the set of all PQ-trees T′ that are quasi-equivalent to T with parameter d_T (i.e. T′ ≡_{d_T} T) such that T′ is ordered as a subsequence of S obtained by deleting up to d_S characters from S. For any T′ ∈ O_T, let M_{T′} be the unique mapping that maps each node x′ in T′ to its equivalent node x in T. Then,

Diverge_{d_T,d_S}(T, S) = min_{T′ ∈ O_T} Σ_{(x, x′) ∈ M_{T′}} (Δ_violation(x, x′) − FlipCorrection(x, x′)).

In case d_T = 0 and d_S = 0, we will use the notation Diverge(T, S) instead of Diverge_{d_T,d_S}(T, S) for simplicity.

For example, consider the ordered PQ-tree T in Fig. 5, which is ordered as +a+b+c+d+e+f, and the signed string S = +f−c−b−a−d+e. To order T as S, we flip the Q-node z, and for this we pay Δ_violation^Q(z, z′) = δ_flip^Q. Now, consider the P-node y. We swap between a and z, so d_SBP(S(y), S(y′)) = 1, and hence Δ_violation^P(y, y′) = 1 (notice that there is no jump of a large unit, because the leaf a can jump). Finally, for the P-node x, Δ_jump^P(x, x′) = 0 (we can order the children by moving the leaf f to the left-most position). In addition, d_SBP(S(x), S(x′)) = 1. Thus, for x we pay Δ_violation^P(x, x′) = 1. In total, Diverge(T, S) = Δ_violation^Q(z, z′) + Δ_violation^P(y, y′) + Δ_violation^P(x, x′) = δ_flip^Q + 2.

Problem definitions

ConstrainedTreeToStringDivergence

The input to CTTSD consists of two signed permutations of length n, S_1 = σ_1⋯σ_n ∈ Σ^n, |Σ| = n, such that σ_i ≠ σ_j for all 1 ≤ i < j ≤ n, and S_2 = λ_1⋯λ_n ∈ Σ^n such that λ_i ≠ λ_j for all 1 ≤ i < j ≤ n; a PQ-tree T ordered as S_1 with m_p P-nodes and m_q Q-nodes; and two numbers δ_ord^Q and δ_flip^Q. We aim to perform actions on T to reorder it as S_2. That is, we reorder T as T′ so that F(T′) = S_2. If this is not possible, we answer “NO”. Else, we return the divergence from T to S_2. Concretely, CTTSD is defined as follows.

Definition 16

(ConstrainedTreeToStringDivergence) Given two signed permutations S_1 and S_2 of the same length and the same characters, two numbers δ_ord^Q and δ_flip^Q, and a PQ-tree T ordered as S_1, return Diverge(T, S_2) or answer “NO” if S_2 cannot be derived from T.

In “Constrained TreeToString Divergence: algorithm” section, we propose an algorithm to solve this problem.

TreeToStringDivergence

Generalizing CTTSD, in TTSD we do not assume that the input strings are permutations, and we allow deletions. The input to TTSD consists of two signed strings, S_1 = σ_1⋯σ_m ∈ Σ_T^m and S_2 = λ_1⋯λ_n ∈ Σ_S^n; a PQ-tree T ordered as S_1 with m_p P-nodes and m_q Q-nodes; d_T ∈ ℕ ∪ {0}, which specifies the number of allowed deletions from T; d_S ∈ ℕ ∪ {0}, which specifies the number of allowed deletions from S_2; and two numbers δ_ord^Q, δ_flip^Q indicating the penalty of the events of changing the order and flipping, respectively, of a Q-node. In TTSD we perform actions on T to reorder it as a subsequence S_2′ of S_2, allowing d_T deletions from T, and so that S_2′ is obtained from S_2 by using up to d_S deletions from S_2. That is, after reordering T as T′ (and performing up to d_T deletions), F(T′) = S_2′. If this is not possible, we answer “NO”. Else, we return the divergence from T to S_2 corresponding to d_T and d_S. Concretely, TTSD is defined as follows.

Definition 17

(TreeToStringDivergence) Given two signed strings S_1 and S_2, a PQ-tree T ordered as S_1, and parameters d_T, d_S, δ_ord^Q, δ_flip^Q, return Diverge_{d_T,d_S}(T, S_2) or “NO” if S_2 cannot be derived from T with respect to d_T and d_S.

Parameterized complexity

Let Π be an NP-hard problem. In the framework of Parameterized Complexity, each instance of Π is associated with a parameter k. Here, the goal is to confine the combinatorial explosion in the running time of an algorithm for Π to depend only on k. Formally, we say that Π is fixed-parameter tractable (FPT) if any instance (I, k) of Π is solvable in time f(k)·|I|^O(1), where f is an arbitrary function of k.

Algorithms preliminaries

Given a node x and the numbers of deletions k_T and k_S of a derivation, the length of the derived string S can be calculated using the length function given in Definition 18 below.

Definition 18

(The Length Function) L(x, k_T, k_S) = span(x) − k_T + k_S.

Using the length function and the start point s of the derivation, the end-point of the derivation can be calculated using the end-point function given in Definition 19 below.

Definition 19

(The End-Point Function) E(x, s, k_T, k_S) = s − 1 + L(x, k_T, k_S).

Let A be a DP table used in an algorithm. Addressing A with some of its indices given as dots refers to the subtable of A that is comprised of all entries of A indexed by the indices that are specified (i.e., all indices not marked by dots). For example, A[x, ·, ·, ·, ·] refers to the subtable of A that is comprised of all entries of A whose first index is x.

In the algorithms, for a given node x in a PQ-tree T, the number of positive signed leaves in T_x, denoted pos, a possible sign s ∈ {+, −}, and the number of deletions from T_x, k_T (if not implied by context, then k_T = 0), we say that pos is consistent with s (or s is consistent with pos) if s = + and pos ≥ (span(x) − k_T)/2, or if s = − and pos ≤ (span(x) − k_T)/2. Moreover, in some situations, for a given set of nodes C, we will sometimes use a set N ⊆ C to describe the nodes in C that have a negative sign: for a given node x ∈ C, if x ∈ N, then sign(x) = −; otherwise, sign(x) = +. In these situations, we say that pos is consistent with N (or N is consistent with pos) if x ∉ N and pos ≥ (span(x) − k_T)/2, or if x ∈ N and pos ≤ (span(x) − k_T)/2.
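
The two bookkeeping functions of Definitions 18 and 19 and the sign-consistency test are straightforward; the following Python helpers (our own sketch, with span(x) passed in as an integer) restate them.

```python
# Bookkeeping helpers used by the dynamic programming algorithms (illustrative sketch).

def length(span_x: int, k_t: int, k_s: int) -> int:
    """Definition 18: length of the derived substring, L(x, k_T, k_S)."""
    return span_x - k_t + k_s

def end_point(span_x: int, s: int, k_t: int, k_s: int) -> int:
    """Definition 19: end-point of a derivation starting at position s."""
    return s - 1 + length(span_x, k_t, k_s)

def consistent(pos: int, sign: str, span_x: int, k_t: int = 0) -> bool:
    """pos (number of positive leaves in T_x) is consistent with a candidate sign."""
    half = (span_x - k_t) / 2
    return pos >= half if sign == '+' else pos <= half

# A node spanning 5 leaves, 1 tree deletion and 2 string deletions within its interval:
print(length(5, 1, 2), end_point(5, 3, 1, 2))        # 6 8
print(consistent(4, '+', 5), consistent(1, '-', 5))  # True True
```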

ConstrainedTreeToStringDivergence: algorithm

In this section, we present a greedy algorithm to solve CTTSD. Our algorithm consists of three components: the main algorithm, and two procedures called P-Mapping and Q-Mapping. We first present and explain the main algorithm and the procedures. Afterwards, we demonstrate the execution of the algorithm and analyze its running time.

The main algorithm

Recall that the input to CTTSD consists of two signed permutations S_1 and S_2 of length n, two numbers δ_ord^Q and δ_flip^Q, and a PQ-tree T ordered as S_1. If S_2 can be derived from T, then the output of the algorithm is the divergence from T to S_2, Diverge(T, S_2). Otherwise, the output is “NO” (specifically, the algorithm returns ∞).

The main algorithm (whose pseudo-code is given in Algorithm 1) constructs a 2-dimensional DP table A of size m × n, where m = n + m_p + m_q is the number of nodes in T. For each node x in T and index ℓ, A has an entry A[x, ℓ]. In the algorithm, for each node x, we keep two indices ℓ and r (denoted by x.ℓ and x.r, respectively) such that S_2[x.ℓ : x.r] is derived from T_x. Then, the purpose of an entry of the DP table, A[x, x.ℓ], is to hold the divergence from the subtree T_x to the subsequence S_2[x.ℓ : x.r] of S_2. That is,

A[x, x.ℓ] = Diverge(T_x, S_2[x.ℓ : x.r]).

If no subsequence of S_2 starting at position ℓ can be derived from T_x, then A[x, ℓ] = ∞.

Some entries of the DP table define illegal derivations. Such are derivations where the length of the frontier of the subtree is larger than the length of the longest substring starting at the specified index ℓ. These entries are called invalid entries and their value is defined as ∞ throughout the algorithm. Formally, an entry A[x, ℓ] is invalid if span(x) > n − ℓ + 1.

The algorithm first initializes the entries of A that are meant to hold divergences of derivations of every possible subsequence of S_2 (a single character) from the leaves of T. Specifically, for a leaf x, if it did not flip, we put 0 in the corresponding entry. If x did flip, we put δ_flip^Q. After that, we update x.ℓ and x.r. As described, in the initialization, if label(x) = S_2[ℓ], then S_2[ℓ] (i.e., S_2[ℓ:ℓ]) is derived from x; in this case we put 0 or δ_flip^Q in A[x, ℓ], and we put ℓ in both x.ℓ and x.r.

After the initialization, all other entries of A are filled as follows. Go over the internal nodes of T in postorder. For every internal node x, go in descending order over every index 1 ≤ ℓ ≤ n that can be a start index for the subsequence of S_2 derived from T_x (in case of an invalid entry, we continue to the next iteration). For every x and ℓ, use the algorithm for P-mapping or Q-mapping according to the type of x. Both algorithms receive the following input: the node x, S_2, start and end indices ℓ, e of a subsequence of S_2, and the collection of derivations of the children of x (entries of A that have already been computed and hold the divergence of a derivation). In addition, the Q-Mapping algorithm receives as input the penalty parameters δ_ord^Q and δ_flip^Q. After being called, both algorithms return the divergence from T_x to S_2[ℓ:e], that is, Diverge(T_x, S_2[ℓ:e]).

[Algorithm 1: pseudo-code figure]

Finally, having filled the DP table, A[root_T, 1] holds the divergence from T to S_2 (Diverge(T, S_2)), and so we return A[root_T, 1].
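
The control flow of the main algorithm can be sketched in Python as follows (our own illustration, not the paper's code; the P-mapping and Q-mapping procedures are passed in as callables, and the node fields used here are assumptions).

```python
# Skeleton of the main CTTSD dynamic program (illustrative sketch; field names
# such as kind, label, sign, span, color and children are our own assumptions).
import math

def cttsd_main(root, postorder_nodes, S2, p_mapping, q_mapping, delta_flip_q):
    """S2 is a list of (sign, label) pairs; p_mapping/q_mapping are callables
    implementing the P- and Q-mapping procedures and returning a divergence or math.inf."""
    n = len(S2)
    A = {}                                          # A[(node color, l)] -> divergence
    for x in postorder_nodes:                       # leaves are visited before their parents
        for l in range(n, 0, -1):
            if x.span > n - l + 1:                  # invalid entry
                A[(x.color, l)] = math.inf
                continue
            if x.kind == 'leaf':
                sign, label = S2[l - 1]
                if label != x.label:
                    A[(x.color, l)] = math.inf
                else:
                    A[(x.color, l)] = 0 if sign == x.sign else delta_flip_q
                    x.l, x.r = l, l                 # record the derived interval
            else:
                e = l + x.span - 1                  # end-point; no deletions in CTTSD
                D = {c.color: A.get((c.color, getattr(c, 'l', -1)), math.inf)
                     for c in x.children}
                mapper = p_mapping if x.kind == 'P' else q_mapping
                value = mapper(x, S2, l, e, D)
                A[(x.color, l)] = value
                if value < math.inf:
                    x.l, x.r = l, e
    return A.get((root.color, 1), math.inf)
```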

P-node mapping: the algorithm

Recall that the input consists of a P-node x, a string S_2, two indices ℓ and e, and a set of derivations D. Notice that each value in D is the divergence from the subtree rooted in a child c of x to S_2[c.ℓ : c.e], where S_2[c.ℓ : c.e] is a subsequence of S_2[ℓ:e] that is derived from T_c. These values, A[c, c.ℓ] for each c ∈ children(x), were calculated in earlier iterations and saved in D. If S_2[ℓ:e] can be derived from T_x, then the output of the algorithm is the divergence from T_x to S_2[ℓ:e], Diverge(T_x, S_2[ℓ:e]). Otherwise, the output is “NO” (specifically, the algorithm returns ∞). Denote by T′_x the quasi-equivalent PQ-tree of T_x ordered as S_2[ℓ:e]. Note that if T′_x exists, then it is unique (because we deal with permutations and forbid deletions).

The algorithm (whose pseudo-code is given in Algorithm 2) first checks if the interval [ℓ, e] can be “completed” by all of the intervals defined by the indices of the children of x. Specifically, we check if there is any order of the children of x, say, ordered as c_1,…,c_{|children(x)|} ∈ children(x), such that c_1.ℓ = ℓ, c_{|children(x)|}.e = e, and for each 1 ≤ j ≤ |children(x)| − 1, c_j.e + 1 = c_{j+1}.ℓ. If there is no such order, then the interval [ℓ, e] cannot be completed, and so S_2[ℓ:e] cannot be derived from T_x. In this case, we return ∞. Otherwise, S_2[ℓ:e] can be derived from T_x by reordering the children of x according to the unique order that completes the interval [ℓ, e]. Second, we sum up all of the values in D (and store the sum in the variable childrenDist). Next, we calculate the violation between x and its equivalent node x′ in T′_x, Δ_violation^P(x, x′), according to Definition 13 (and store the result in the variable violation). Finally, we return the sum of childrenDist and violation, which is the divergence from T_x to S_2[ℓ:e], Diverge(T_x, S_2[ℓ:e]) (according to Definition 15).

[Algorithm 2: pseudo-code figure]
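
The “interval completion” test at the heart of the P-mapping procedure simply asks whether the children's intervals tile [ℓ, e] exactly, in some order; a compact Python sketch (ours) of that test is:

```python
# Check whether the children's intervals tile [l, e] exactly, in some order
# (the first step of the P-mapping procedure; illustrative sketch).
def completes_interval(child_intervals, l, e):
    """child_intervals: list of (c.l, c.e) pairs, one per child of the P-node."""
    ordered = sorted(child_intervals)                 # children may be reordered freely
    if not ordered or ordered[0][0] != l or ordered[-1][1] != e:
        return False
    return all(prev_e + 1 == next_l
               for (_, prev_e), (next_l, _) in zip(ordered, ordered[1:]))

print(completes_interval([(4, 5), (2, 3), (6, 6)], 2, 6))  # True
print(completes_interval([(2, 3), (5, 6)], 2, 6))          # False (position 4 uncovered)
```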

Q-node mapping: the algorithm

Recall that the input consists of a Q-node x, a string S_2, two indices ℓ and e, a set of derivations D, and the penalty parameters δ_ord^Q and δ_flip^Q. Notice that each value in D is the divergence from the subtree rooted in a child c of x to S_2[c.ℓ : c.e], where S_2[c.ℓ : c.e] is a subsequence of S_2[ℓ:e] that is derived from T_c. These values, A[c, c.ℓ] for each c ∈ children(x), were calculated in earlier iterations and saved in D. If S_2[ℓ:e] can be derived from T_x, then the output of the algorithm is the divergence from T_x to S_2[ℓ:e], Diverge(T_x, S_2[ℓ:e]). Otherwise, the output is “NO” (specifically, the algorithm returns ∞). Denote by T′_x the (unique) quasi-equivalent PQ-tree of T_x ordered as S_2[ℓ:e].

The algorithm (whose pseudo-code is given in Algorithm 3) first checks if the interval [ℓ, e] can be “completed consecutively” by all of the intervals defined by the indices of the children of x. Specifically, we check if there is a consecutive order of the children of x, say, ordered as c_1,…,c_{|children(x)|} ∈ children(x), such that c_1.ℓ = ℓ, c_{|children(x)|}.e = e, and for each 1 ≤ j ≤ |children(x)| − 1, c_j.e + 1 = c_{j+1}.ℓ. As opposed to a P-node, here the order of the children completing the interval [ℓ, e] must be consecutive with respect to their indices (the same order as the children of x in T, or the reverse order). If there is no such order, then the interval [ℓ, e] cannot be completed consecutively, and so S_2[ℓ:e] cannot be derived from T_x. In this case, we return ∞. Otherwise, S_2[ℓ:e] can be derived from T_x by keeping the order of the children of x, or flipping it. Second, we sum up all of the values in D (and store the sum in the variable childrenDist). Next, we calculate the violation between x and its equivalent node x′ in T′_x, Δ_violation^Q(x, x′), according to Definition 13 (and store the result in the variable violation). Afterwards, we calculate the flip correction according to Definition 14 (and store the result in the variable childrenFlipCorrection). Finally, we return the sum of childrenDist and violation minus childrenFlipCorrection, which is the divergence from T_x to S_2[ℓ:e], Diverge(T_x, S_2[ℓ:e]) (according to Definition 15).

[Algorithm 3: pseudo-code figure]
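
For a Q-node the same tiling test is restricted to the original child order or its reversal; a short Python sketch (ours) illustrates it:

```python
# Consecutive-completion test for a Q-node: the children's intervals must tile
# [l, e] in their original left-to-right order or in the reversed order (sketch).
def completes_consecutively(child_intervals, l, e):
    """child_intervals: list of (c.l, c.e) pairs in the children's order in T."""
    def tiles(seq):
        return (seq[0][0] == l and seq[-1][1] == e and
                all(a_e + 1 == b_l for (_, a_e), (b_l, _) in zip(seq, seq[1:])))
    return tiles(child_intervals) or tiles(child_intervals[::-1])

print(completes_consecutively([(2, 3), (4, 6)], 2, 6))  # True  (original order)
print(completes_consecutively([(4, 6), (2, 3)], 2, 6))  # True  (reversed order)
print(completes_consecutively([(2, 3), (5, 6)], 2, 6))  # False (gap at position 4)
```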

Example

Consider the following input: S_1 = +a+b+c+d+e+f, S_2 = +f−c−b−a−d−e, a PQ-tree T ordered as S_1, δ_ord^Q = 3 and δ_flip^Q = 3. We iterate through the nodes of the tree in postorder, thus we initialize the leaves before their parents. For each leaf x ∈ Leaves(root_T), if label(x) = S_2[ℓ], then A[x, ℓ] = 0 if x did not flip and δ_flip^Q if it did; otherwise, A[x, ℓ] = ∞.

Figure 6 shows the PQ-tree T and its quasi-equivalent PQ-tree T′ ordered as S2, after the initialization of the leaves only. Notice that the actual order of initialization is postorder; for simplicity, we show the tree at the point where only the leaves have been initialized. In addition, in the figures, the equivalent nodes of T and T′ are shown as the same nodes; in the explanations, in order to distinguish between them, the node of T′ corresponding to a node x∈T is denoted by x′. The pair of numbers shown in the figure near a node represents its ℓ and r values. In addition, the sign + or − near a node represents its sign. For example, for a∈T, a.ℓ=a.r=4 and sign(a)=+. For each internal node, the character assigned to it represents its color.

Fig. 6 a PQ-tree T. b PQ-tree T′, which is quasi-equivalent to T and ordered as S2

First, consider the iteration where A[z,2] is calculated (the values for all other entries with node z are ∞). The intervals of the children of z, [3, 3] and [2, 2], complete the interval [2,e] consecutively, where e=2+span(z)−1=3. Thus, the subsequence S2[2:3] is derived from Tz, and in order to generate it we need to flip the order of children(z). For this the penalty of flipping, δflipQ, is applied, and so ΔviolationQ(z,z′)=δflipQ=3. childrenDist=A[b,3]+A[c,2]=6 because A[b,3]=A[c,2]=3. FlipCorrection(z,z′)=6 because both b and c have been flipped. Therefore, A[z,2]=childrenDist+violation−childrenFlipCorrection=3. Now, we update z’s indices, z.ℓ=2 and z.r=3. Figure 7 shows the PQ-tree T and its quasi-equivalent PQ-tree T′ after calculating A[z,2].

Fig. 7 a PQ-tree T. b PQ-tree T′, which is quasi-equivalent to T and ordered as S2

Next, consider the iteration where A[y,2] is calculated (the values for all other entries with node y are ∞). Recall that Fig. 7 shows T and T′ after calculating the entries of the children of the P-node y. The intervals of the children complete the interval [2,e], where e=2+span(y)−1=5, so we can continue with the iteration. childrenDist=A[a,4]+A[z,2]+A[d,5]=9. dSBP(S(y),S(y′))=1, where the pair (z,d) is the only signed break-point (note that S(y)=+a+z+d, S(y′)=−z−a−d). ΔjumpP(y,y′)=0 because to reorder the children of y as S2[2:5] we can move z only (and its size is 1). Therefore, ΔviolationP(y,y′)=dSBP(S(y),S(y′))+ΔjumpP(y,y′)=1. Then, A[y,2]=10 and we can update y’s indices, y.ℓ=2 and y.r=5. Figure 8 shows the PQ-tree T and its quasi-equivalent PQ-tree T′ after calculating A[y,2].

Fig. 8 a PQ-tree T. b PQ-tree T′, which is quasi-equivalent to T and ordered as S2

Finally, consider the iteration where A[x,1] is calculated (the values for all other entries with node x are ∞). Figure 8 shows T and T′ after calculating the entries of the children of the P-node x. The intervals of the children complete the interval [1,e], where e=1+span(x)−1=6, so we can continue with the iteration. childrenDist=A[y,2]+A[e,6]+A[f,1]=10. dSBP(S(x),S(x′))=2, where the pairs (y,e) and (e,f) are the signed break-points (note that S(x)=+y+e+f, S(x′)=+f−y+e). ΔjumpP(x,x′)=0 because to reorder the children of x as S2 we can move f only (and its size is 1). Therefore, ΔviolationP(x,x′)=dSBP(S(x),S(x′))+ΔjumpP(x,x′)=2. Then, A[x,1]=10+2=12 and the algorithm returns 12. Figure 9 shows the PQ-tree T and its quasi-equivalent PQ-tree T′ after calculating A[x,1].

Fig. 9 a PQ-tree T. b PQ-tree T′, which is quasi-equivalent to T and ordered as S2

Complexity analysis

In this section we analyse the time and space complexities of Algorithm 1, which solves CTTSD. First, we analyse Algorithms 2 and 3, which are used as procedures in Algorithm 1. For a given PQ-tree, we denote by γ the maximum number of children of an internal node.

Lemma 5.1

The P-Mapping algorithm, Algorithm 2, takes O(1.381γγ2) time and O(γ2) space.

Proof

The most space consuming part of the algorithm, besides the computation of a vertex cover, is to store x′ (which is equivalent to x and ordered as a specific string). We simply save the children of x in a specific order. Thus, the space is O(γ).

The algorithm first checks if the interval [ℓ,e] can be “completed” by all of the intervals defined by the indices of the children of x. We can check this in O(γ2) time (naively). Then, we sum up the values in D. Notice that |D|=|children(x)|=O(γ), thus this step takes O(γ) time.

After that, we calculate ΔviolationP(x,x′), whose most time consuming part is the jump violation, ΔjumpP(x,x′), which requires finding a minimum weighted vertex cover of the graph G[x,x′]. In order to find such a vertex cover, we can use, for example, the algorithm in [44], which takes O(1.381γγ2) time and O(γ2) space. Thus, this step takes O(1.381γγ2) time and O(γ2) space.

All other steps in the algorithm are basic operations and thus they take O(1) time. Hence, the algorithm takes O(1.381γγ2) time and O(γ2) space.

Lemma 5.2

The Q-Mapping algorithm, Algorithm 3, takes O(γ2) time and O(γ) space.

Proof

The most space consuming part of the algorithm is to store x′ (which is equivalent to x and ordered as a specific string). We simply save the children of x in a specific order. Thus, the space of the algorithm is O(γ).

The algorithm first checks if the interval [ℓ,e] can be “completed consecutively” by all of the intervals defined by the indices of the children of x. We can check this in O(γ) time. Then, we sum up the values in D. Notice that |D|=|children(x)|=O(γ), thus this step takes O(γ) time. After that, we calculate ΔviolationQ(x,x′) and FlipCorrection(x,x′), which take O(γ2) time (naively). All other steps in the algorithm are basic operations and thus they take O(1) time. Hence, the algorithm takes O(γ2) time.

Lemma 5.3

The main algorithm, Algorithm 1, takes O(nγ2·(mp·1.381γ+mq)) time and O(n2) space.

Proof

The number of leaves in the PQ-tree T is n, hence there are O(n) nodes in the tree, thus the size of the first dimension of the DP table, A, is O(n). The size of the second dimension (1≤ℓ≤n) is also n. Thus, the DP table A is of size O(n2).

In the initialization step, all calculations are basic and take O(1) time. The P-Mapping algorithm is called for every P-node in T and every possible start index i, so the P-Mapping algorithm is called O(nmp) times. Similarly, the Q-Mapping algorithm is called O(nmq) times. Thus, it takes O(n·(mp·Time(P-Mapping)+mq·Time(Q-Mapping))) time to fill the DP table. In the final stage of the algorithm, we return A[rootT,1], which takes O(1) time.

From Lemma 5.1, the P-Mapping algorithm takes O(1.381γγ2) time and O(γ2) space, and from Lemma 5.2, the Q-Mapping algorithm takes O(γ2) time and O(γ) space. Thus, in total, our algorithm runs in O(1)+O(n·(mp·(1.381γγ2)+mq·(γ2)))=O(nγ2·(mp·1.381γ+mq)) time. Adding the space required for the P-Mapping and Q-Mapping algorithms to the space required for the main DP table results in a total space complexity of O(γ2)+O(γ)+O(n2)=O(n2).

Observe that if ΔjumpP is set to 0, then the computation of a vertex cover is not required. Hence, by the analysis above, we obtain the following time and space complexities.

Lemma 5.4

If ΔjumpP is set to 0, then the main algorithm, Algorithm 1, takes O(nmγ2) time and O(n2) space.

TreeToStringDivergence: algorithm

In this section, we develop a dynamic programming (DP) algorithm to solve TTSD. Our algorithm receives as input an instance of TTSD, (S1,S2,T,dT,dS,δordQ,δflipQ). The output of the algorithm is the minimum divergence from T to a subsequence of S2 with up to dT deletions from T and up to dS deletions from the subsequence (or “NO”).

Brief overview: On a high level, our algorithm consists of three components: the main algorithm, and two other algorithms that are used as procedures by the main algorithm. Apart from an initialization phase, the crux of the main algorithm is a loop that traverses the given PQ-tree, T. For each internal node x, it calls one of the two other algorithms: P-Mapping or Q-Mapping. These algorithms return the divergence from the subtree of T rooted in x, Tx, to subsequences of S2, based on the type of x (P-node or Q-node). Then, these divergences are stored in the DP table.

In the following sections, we describe the main algorithm (“The main algorithm” section), the P-Mapping algorithm (“P-node mapping: the algorithm” section) and the Q-Mapping algorithm (“Q-node mapping: the algorithm” section).

The main algorithm

The algorithm (whose pseudocode is given in Algorithm 4) constructs a 5-dimensional DP table A of size m′×m×n×(dT+1)×(dS+1), where m′=m+mp+mq is the number of nodes in T. The purpose of an entry of the DP table, A[x,pos,i,kT,kS], is to hold the divergence from the subtree Tx to a subsequence S′ of S2 starting at index i with kT deletions from Tx and kS deletions from S′, where exactly pos leaves of Tx have a positive sign. If S′ cannot be derived from Tx, A[x,pos,i,kT,kS]=∞.

Some entries of the DP table define illegal derivations, namely, derivations for which the number of deletions is inconsistent with the start index i, the derived node and S2. For example, such are derivations that have more deletions from the string than there are characters in the derived string. These entries are called invalid entries, and their value is defined as ∞ throughout the algorithm. Formally, an entry A[x,pos,i,kT,kS] is invalid if one of the following is true: pos>span(x)−kT, kT>span(x), kS>L(x,kT,kS), or E(x,i,kT,kS)>n.

Let x be a leaf, S be a signed string, i be an index, kS be the number of deletions from S and pos∈{0,1}. Then, we define Ix,S,i,kS,pos={j : i≤j≤i+kS, label(x)=S[j] and pos is consistent with sign(S[j])}. Intuitively, each j∈Ix,S,i,kS,pos corresponds to a possible alignment of x to S[j], therefore the label of x must match S[j] and pos must be consistent with sign(S[j]).

The algorithm first initializes the entries of A that are meant to hold scores of derivations of the leaves of T to every possible subsequence of S2 using the following rule (a sketch of this initialization is given after the list). For every 0≤kS≤dS, every leaf x∈Leaves(T) and each possible sign of x (pos∈{0,1}), do:

  1. A[x,pos,i,1,kS]=ρdelT+kS·ρdelS (if dT>0).

  2. A[x,pos,i,0,kS]=∞ if there is no derivation from x to a character in S2[i:i+kS].

    Otherwise:

    1. A[x,pos,i,0,kS]=0 if pos is consistent with sign(x) (no flipping).
    2. A[x,pos,i,0,kS]=δflipQ if pos is not consistent with sign(x) (flipping).
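The following is a minimal, illustrative Python sketch of this leaf initialization (not the paper's code); the table A is modelled as a dictionary keyed by (leaf, pos, i, kT, kS), leaves are assumed to expose label and sign attributes, and S2 is assumed to be a 0-indexed list of (label, sign) pairs.

```python
import math

def init_leaf_entries(A, x, S2, i, d_t, d_s, rho_del_t, rho_del_s, delta_flip_q):
    """Initialize the entries of A for one leaf x and one start index i,
    following the two rules above (a sketch under the stated assumptions)."""
    for k_s in range(d_s + 1):
        for pos in (0, 1):
            # Rule 1: the leaf itself is deleted (one deletion from the tree).
            if d_t > 0:
                A[(x, pos, i, 1, k_s)] = rho_del_t + k_s * rho_del_s
            # Rule 2: the leaf is mapped to some character of S2[i : i + kS];
            # this corresponds to the set I_{x,S,i,kS,pos} defined above.
            candidates = [
                j for j in range(i, min(i + k_s, len(S2) - 1) + 1)
                if S2[j][0] == x.label and (S2[j][1] == '+') == (pos == 1)
            ]
            if not candidates:
                A[(x, pos, i, 0, k_s)] = math.inf
            elif (pos == 1) == (x.sign == '+'):
                A[(x, pos, i, 0, k_s)] = 0.0            # no flip needed
            else:
                A[(x, pos, i, 0, k_s)] = delta_flip_q   # the leaf is flipped
```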

After the initialization, all other entries of A are filled as follows. Go over the internal nodes of T in postorder. For every internal node x, go in ascending order over every index i that can be a start index for a subsequence of S2 derived from Tx (the possible values of i are explained in the next paragraph). For every x and i, use the algorithm for P-mapping or Q-mapping according to the type of x. Both algorithms receive the following input: a subsequence S′ of S2, the node x, its children x1,…,xγ, the collection of all possible derivations of the children (denoted by D), which have already been computed and stored in A (as will be explained ahead), and the deletion arguments dT,dS. Q-Mapping also receives the penalty arguments δordQ and δflipQ as input. Intuitively, the subsequence S′ is the longest subsequence of S2 starting at index i that can be derived from Tx given dT and dS. After being called, both algorithms return a set of divergences of derivations of Tx to a prefix S2[i:e] of S′. The set holds the divergences of derivations for every E(x,i,dT,0)≤e≤E(x,i,0,dS) and for every 0≤kT≤dT, 0≤kS≤dS.

We now explain the possible values of i and the definition of S′ more formally. To this end, recall the length function given in Definition 18, L(x,kT,kS)=span(x)−kT+kS. Thus, on the one hand, a subsequence of maximum length is obtained when there are no deletions from the tree and dS deletions from the string. Hence, S′=S2[i:E(x,i,0,dS)]. On the other hand, a shortest subsequence is obtained when there are dT deletions from the tree and none from the string. Then, the length of the subsequence is L(x,dT,0)=span(x)−dT. Hence, the index i runs between 1 and n−(span(x)−dT)+1.
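As a small illustration of these two formulas, the following Python sketch (names are ours) computes the length function of Definition 18 and the range of admissible start indices.

```python
def subseq_length(span_x, k_t, k_s):
    """Length function of Definition 18: deletions from the tree shorten the
    derived subsequence, deletions from the string lengthen it."""
    return span_x - k_t + k_s

def start_indices(n, span_x, d_t):
    """Admissible start indices i for a subsequence of S2 derived from Tx,
    i.e. 1 <= i <= n - (span(x) - dT) + 1 (1-based, as in the text)."""
    return range(1, n - (span_x - d_t) + 2)
```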

We now turn to address the aforementioned input collection D in more detail. Formally, it contains derivations of every child x′ of x to every subsequence of S′ with up to dT and dS deletions from the tree and string, respectively. It is obtained from the entries A[x′,pos,i′,kT,kS] (where each entry yields one derivation) for all kT and kS, all i′ between i and the end index of S′, i.e., i≤i′≤E(x,i,0,dS), and all possible pos (0≤pos≤span(x′)−kT).

In the final stage of the main algorithm, when the DP table is full, the score of a best derivation is the minimum of {A[rootT,pos,i,kT,kS] : kT≤dT, kS≤dS, 1≤i≤n−(span(rootT)−kT)+1, 0≤pos≤span(rootT)−kT}.
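Putting the pieces together, the skeleton below sketches the main loop in Python; it is only a schematic view of Algorithm 4 under our own naming (the mapper callables stand in for the P-Mapping and Q-Mapping procedures and are assumed to return a dictionary from (e, kT, kS, pos) to a divergence for node x and start index i; the construction of D from A is omitted).

```python
import math

def fill_main_table(A, nodes_postorder, S2, n, d_t, d_s, p_mapping, q_mapping):
    """Schematic main loop: traverse T in postorder, call the appropriate
    mapping procedure for each internal node and start index, store the
    returned divergences in A, and finally take the minimum over the root's
    entries (a sketch, not the paper's pseudocode)."""
    for x in nodes_postorder:                  # leaves were initialized before
        if x.is_leaf:
            continue
        for i in range(1, n - (x.span - d_t) + 2):
            mapper = p_mapping if x.kind == 'P' else q_mapping
            for (e, k_t, k_s, pos), div in mapper(x, S2, i).items():
                A[(x, pos, i, k_t, k_s)] = div
    root = nodes_postorder[-1]
    return min(
        (A.get((root, pos, i, k_t, k_s), math.inf)
         for k_t in range(d_t + 1)
         for k_s in range(d_s + 1)
         for i in range(1, n - (root.span - k_t) + 2)
         for pos in range(root.span - k_t + 1)),
        default=math.inf,
    )
```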

P-node and Q-node mapping: terminology

Before describing the P-mapping and Q-mapping algorithms, we set up some useful terminology.

We first define the notion of a partial derivation. In the P-Mapping and Q-Mapping algorithms, derivations of the input node x are built by considering subsets C of its children. With respect to such a subset C, a derivation μ of x is built as if x had only the nodes in C as children, and is called a partial derivation. We denote by μ.v the root of the subtree of the derivation (Tμ.v). μ.pos denotes the number of positively signed leaves considered in the derivation (in Tμ.v). μ.delT and μ.delS denote the number of deletions from Tμ.v and from the string in the derivation, respectively. μ.score denotes the divergence score of the derivation μ.

Definition 20

Let x be a node. μ is a partial derivation between x and a string if μ.v=x and there is a subset of children C⊆children(x) such that the two following conditions are true.

  1. For every u∈children(x)\C, each of the leaves in Tu is neither mapped nor deleted under μ.

  2. For every u∈C, each of the leaves in Tu is either mapped or deleted under μ.

For every u∈children(x)\C, we say that u is ignored under μ. Notice that any derivation is a partial derivation, namely one in which the set of ignored nodes (children(x)\C above) is empty.

In the P-Mapping algorithm, for C⊆children(x), the notation x(C) is used to indicate that the node x is considered as if its only children are the nodes in C (the nodes in children(x)\C are ignored). Consequently, the span of x(C) is defined as span(x(C))=Σc∈C span(c), and the set D(C,N,y,kT,kS) (in Definition 21 where C={x(C)}) now refers to a set of partial derivations. To use x(C) to describe the base cases of the algorithm, let us define x(∅) (x(C) for C=∅) as a tree with no labeled leaves to map.

Since all derivations that are computed in a single call to the P-Mapping algorithm have the same start-point i, it can be omitted (for brevity) from the end-point function; thus, we denote E(x,kT,kS)=L(x,kT,kS). Also, for a set U of nodes, we define L(U,kT,kS)=Σx∈U span(x)+kS−kT, and, accordingly, E(U,kT,kS)=L(U,kT,kS).

We now define certain collections of derivations with common properties (such as having the same number of deletions and end-point).

Definition 21

Let C be a set of nodes, N⊆C be the set of nodes which have a negative sign among the nodes in C, and kT and kS be two numbers. The collection of all the derivations of y∈C to suffixes of S[1:E(C,kT,kS)] with exactly kT deletions from the tree and exactly kS deletions from the string that are consistent with N is denoted by D(C,N,y,kT,kS). By consistency with N we mean: if y∈N, then for each μ∈D(C,N,y,kT,kS), μ.pos≤(span(y)−kT)/2; if y∉N, then for each μ∈D(C,N,y,kT,kS), μ.pos≥(span(y)−kT)/2.

Definition 22

Let C be a set of nodes, N⊆C be a set of nodes which defines the signs of the nodes in C, and kT and kS be two numbers. The collection of all derivations of y∈C to suffixes of S[1:E(C,kT,kS)] with up to kT deletions from the tree, and up to kS deletions from the string, is denoted by D′(C,N,y,kT,kS). Specifically, for the node y∈C, k′T≤kT and k′S≤kS, the set D′(C,N,y,kT,kS) holds only one derivation of y to a suffix of S[1:E(C,kT,kS)] with k′T and k′S deletions from the tree and string, respectively, if such a derivation exists. In addition, the derivations in D′(C,N,y,kT,kS) are consistent with N: if y∈N, then for each μ∈D′(C,N,y,kT,kS), μ.pos≤(span(y)−kT)/2; if y∉N, then for each μ∈D′(C,N,y,kT,kS), μ.pos≥(span(y)−kT)/2.

It is important to distinguish between these two definitions. First, the derivations in D(C,N,y,kT,kS) have exactly kT and kS deletions, while the derivations in D′(C,N,y,kT,kS) have up to kT and kS deletions. Second, in D(C,N,y,kT,kS) there can be several derivations that differ only in their scores and in the one-to-one mappings that yield them, while in D′(C,N,y,kT,kS) there is only one derivation for every deletion combination pair (k′T,k′S). Note that the end-points of all of the derivations are equal.

In every step of the P-Mapping algorithm, a different set of derivations of the children of x is examined; thus, Definition 22 is used for C⊆children(x). In addition, the set of derivations D that is received as input to the algorithms can be described using Definition 22, as can be seen in Eq. 1 below. In this equation, the union is over all C⊆children(x) because in this way the derivations of all the children of x with every possible end-point are obtained (in contrast to having only C=children(x), which results in the derivations of all the children of x with the end-point E(children(x),kT,kS)).

D = ⋃C⊆children(x) ⋃N⊆C ⋃y∈C ⋃kT≤dT ⋃kS≤dS D′(C,N,y,kT,kS)   (1)


In the P-Mapping algorithm, we use the following notation.

Definition 23

Let C be a set of children of a P-node in a PQ-tree. For x,y∈C, the set of all nodes in C that are between x and y is denoted by Cx,y. That is, Cx,y={v∈C : v is between x and y}.

See an example in Fig. 10.

Fig. 10 Cy,f={e,g}, Ca,d={b}, Cy,e=∅
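As a toy illustration of Definition 23 (assuming, as the figure suggests, that “between” refers to the left-to-right order of the children in T), a short sketch:

```python
def nodes_between(children_in_order, u, v):
    """C_{u,v}: the children strictly between u and v, with respect to the
    children's left-to-right order in T (an assumption made for illustration)."""
    i, j = children_in_order.index(u), children_in_order.index(v)
    lo, hi = min(i, j), max(i, j)
    return set(children_in_order[lo + 1:hi])

# With a hypothetical child order consistent with Fig. 10:
# nodes_between(['a', 'b', 'd', 'y', 'e', 'g', 'f'], 'y', 'f') == {'e', 'g'}
```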

In the algorithms, we take into account signed break-points between the nodes in the derivations. So, we define a procedure which receives a node x, two children of x, y and z, that are considered to be adjacent in a quasi-equivalent tree T′x of Tx, and a set N that defines the signs of y and z in T′x. The procedure returns 1 if there is a signed break-point between y and z, and 0 otherwise. The procedure checks if the nodes are adjacent in the tree Tx and if they changed their signed order (recall Definition 9) according to N, as defined next.

Definition 24

BPDelta(x,y,z,N) = 0, if y and z are adjacent in Tx and did not change their signed order; 1, otherwise.   (2)

Similar to Definition 24, we define a procedure which is used in the case where we necessarily consider the two children y and z of x to be adjacent in the tree Tx (specifically, when Cy,z is deleted). This procedure returns 1 if there is a signed break-point between y and z, but does not check if they are adjacent.

Definition 25

BPDelta2(x,y,z,N) = 0, if y and z did not change their signed order; 1, otherwise.   (3)

In the P-Mapping algorithm, we calculate the jump violation (recall Definition 12). To do so, we define the procedure JumpViolationDelta, which receives a vertex cover C′ and a node x, and returns the penalty of x defined in Definition 12, as follows.

Definition 26

JumpViolationDelta(C′,x) = (span(x)−1)/2, if x∈C′; 0, otherwise.   (4)
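For concreteness, Definitions 24–26 translate directly into the following small helpers (a sketch in Python; the adjacency and signed-order checks of Definitions 9 and 13 are assumed to be evaluated by the caller and passed in as booleans):

```python
def bp_delta(adjacent_in_Tx, changed_signed_order):
    """Definition 24: no break-point only if y and z are adjacent in Tx and
    kept their signed order."""
    return 0 if (adjacent_in_Tx and not changed_signed_order) else 1

def bp_delta2(changed_signed_order):
    """Definition 25: y and z are forced to be adjacent (the nodes between
    them were deleted), so only the signed-order check remains."""
    return 0 if not changed_signed_order else 1

def jump_violation_delta(x_in_cover, span_x):
    """Definition 26: the jump penalty is charged only for nodes that belong
    to the chosen vertex cover of the jump graph."""
    return (span_x - 1) / 2 if x_in_cover else 0
```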

P-node mapping: the algorithm

Recall that the input consists of a P-node x, a string S, bounds on the number of deletions from T and S, dT and dS, respectively, and a set of derivations D (see Eq. 1). The output of the algorithm is the collection of divergences of derivations of x to every prefix of S having exactly kT deletions from Tx, kS deletions from the prefix of S and pos leaves with positive sign, for each combination of 0≤kT≤dT, 0≤kS≤dS and 0≤pos≤span(x). Thus, the output contains O(dT·dS·span(x)) derivations.

The algorithm (whose pseudocode is given in Algorithm 5) constructs a 7-dimensional DP table P, which has an entry for every 0≤kT≤dT, 0≤kS≤dS, subset C⊆children(x), subset C′⊆C, subset N⊆C, number 0≤pos≤span(x) and for every y∈C. Recall that we need to calculate JumpViolation for the children of x; to do so, we use the variables C′ and N. The purpose of an entry P[C,C′,N,pos,y,kT,kS] is to hold the divergence of a partial derivation rooted in x(C) to a prefix of S with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in T′x′(C), where x′ is the node equivalent to x, while considering the nodes in N to have a negative sign, C′ as a possible vertex cover of G[x(C),x′(C)] (minus the children that were deleted), and considering derivations of y only to suffixes of S[1:E(C,kT,kS)]. The children of x that are not in C are ignored under the partial derivation stored by the DP table entry P[C,C′,N,pos,y,kT,kS], thus they are neither deleted nor counted in the number of deletions from the tree, kT. (They will be accounted for in the computation of other entries of P.)

Similarly to the main algorithm, some of the entries of P are invalid, and their value is defined as ∞. Formally, an entry P[C,C′,N,pos,y,kT,kS] is invalid if one of the following is true: kT>span(C), L(x(C),kT,kS)>len(S), pos is not consistent with N, or C′ is not a vertex cover of G[x(C),x′(C)] where the signs of children(x′(C)) are defined by N.

Every entry P[C,C′,N,pos,y,kT,kS] for which |C|=1 is initialized with D(C,N,y,pos,kT,kS) (stored in D), which is the divergence of the derivation rooted in Ty to the suffix of S[1:E(C,kT,kS)], with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in Ty such that N is consistent with pos (if it exists, this derivation is stored in D). If such a derivation does not exist, D(C,N,y,pos,kT,kS)=∞.

After the initialization, the remaining entries of P are calculated using the recursion rule in Expression 1 ahead. The order of computation is ascending with respect to the size of the subsets C of the children of x, and for a given C⊆children(x), the order is ascending with respect to the number of deletions from both tree and string. With a lesser priority, the order is ascending with respect to the number of positively signed leaves in Tx(C).

In the algorithm, P[C,C′,N,pos,y,kT,kS] is computed by taking the minimum of the following expressions (we will refer to the minimum of the three expressions as Expression 1):

  1. P[C,C′,N,pos,y,kT,kS−1]+ρdelS Explanation: Intuitively, every entry P[C,C′,N,pos,y,kT,kS] defines some index e of S that is the end-point of every partial derivation in D(C,N,y,kT,kS). Thus, S[e] must be a part of any partial derivation μ∈D(C,N,y,kT,kS); so, either S[e] is deleted under μ or it is mapped under μ. The former option is captured by the first case of the recursion rule.

  2. min over μ∈D(C,N,y,kT,kS) s.t. μ.delT<span(y), and over z∈C\{y}, of P[C\{y}, C′\{y}, N\{y}, pos−μ.pos, z, kT−μ.delT, kS−μ.delS] + μ.score + BPDelta(x,y,z,N) + JumpViolationDelta(C′,y)
    Explanation: If S[e] is mapped under μ, then due to the hierarchical structure of Tx, it must be mapped under some derivation μ of one of the children of x that are in C. Thus, we obtain the second and third cases of the recursion rule. In the second case, we consider z∈C\{y} and derivations of z only to suffixes of S[1:E(C\{y},kT−μ.delT,kS−μ.delS)].
  3. min over μ∈D(C,N,y,kT,kS) s.t. μ.delT<span(y), and over z∈C\{y} s.t. Cy,z⊆C, of P[C\Cy,z\{y}, C′\Cy,z\{y}, N\Cy,z\{y}, pos−μ.pos, z, kT−μ.delT−span(Cy,z), kS−μ.delS] + μ.score + ρdelT·span(Cy,z) + BPDelta2(x,y,z,N) + JumpViolationDelta(C′,y)
    Explanation: The third case captures the option of deleting all the nodes between y and z (Cy,z), so that after the deletion we consider y and z as adjacent in Tx(C). In this case we consider z∈C\{y} and derivations of z only to suffixes of S[1:E(C\Cy,z\{y},kT−μ.delT−span(Cy,z),kS−μ.delS)].   (1)

Once the entire DP table is filled, for every combination of (pos,kT,kS) the algorithm returns the divergence from Tx to S. This is done by going over the partial derivations in P as well as deleting the ignored nodes (and penalizing the deletion accordingly).

Q-node mapping: the algorithm

The input and output of the Q-Mapping algorithm are the same as the input and output of the P-mapping algorithm (“P-node mapping: the algorithm” section), respectively, except for the type of the node x received as input, and the penalty parameters δordQ and δflipQ, which are also part of the input.

Given that the children of x in consecutive order are x1,x2,…,xγ and given an index 1≤i≤γ, x[i] denotes the set of the first i children of x. Formally, x[i]={x1,…,xi}. In addition, x(i) denotes the node x as if its only children are the nodes in x[i]. Consequently, the span of x(i) is defined as Σj=1..i span(xj), and the set D(x(i),N,y,kT,kS) (see Definition 21 where C={x[i]}) now refers to a set of partial derivations. In addition, x[i:j] denotes the set of children of x from child i to child j; formally, x[i:j]={xi,…,xj}. In case i>j, x[i:j]=∅.

The algorithm (whose pseudocode is given in Algorithm 6) constructs two 5-dimensional DP tables, Qℓ and Qr. Both have an entry for every 0≤kT≤dT, 0≤kS≤dS, index 0≤i≤γ, s∈{+,−} and number 0≤pos≤span(x). The purpose of an entry Qℓ[i,kT,kS,s,pos] (and similarly Qr[i,kT,kS,s,pos]) is to hold the divergence of a partial derivation in D(x(i),N,y,kT,kS), i.e. a partial derivation rooted in x(i) to a prefix of S with exactly kT deletions from the tree and kS deletions from the string, while s is the sign of xi in the derivation and pos is the number of positively signed leaves in x(i). The difference between Qℓ and Qr is in the order in which the children of x are arranged. In Qℓ the children of x are considered in a left-to-right order, namely, x1 is the leftmost child of x and x[i]ℓ is the set of the i leftmost children of x (then, x[i:j]ℓ is defined accordingly). In Qr the children of x are considered in a right-to-left order, namely, x1 is the rightmost child of x and x[i]r is the set of the i rightmost children of x (then, x[i:j]r is defined accordingly). For abbreviation, from now on, Qt is used when a notion is true for both Qℓ and Qr. The children of x that are not in x[i] are ignored under the partial derivation stored by the DP table entry Qt[i,kT,kS,s,pos], thus they are neither deleted nor counted in the number of deletions from the tree, kT. (They will be accounted for in the computation of other entries of Qt.)

Similarly to the main algorithm and the P-Mapping algorithm, some of the entries of the DP tables are invalid, and their value is defined as ∞. Formally, an entry Qt[i,kT,kS,s,pos] is invalid if one of the following is true: kT>span(x(i)), L(x[i],kT,kS)>|S|, or pos is not consistent with s.

Every entry Qt[i,kT,kS,s,pos] for which i=1 is initialized with D(x[i],s,xi,pos,kT,kS) (stored in D), which is the divergence of the derivation rooted in Txi to the suffix of S[1:E(x[i],kT,kS)], with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in Txi such that s is consistent with pos (if it exists, this derivation is stored in D). If such a derivation does not exist, D(x[i],s,xi,pos,kT,kS)=∞.

After the initialization, the remaining entries of Qt are calculated using the recursion rule in Expression 2 ahead. The order of computation is ascending with respect to the child index i, and for a given i, the order of computation is ascending with respect to the number of deletions from the string, and ascending with respect to the number of positively signed leaves in Tx(i). In Expression 2 we use the notation Nxi and Nxi,xj, defined as follows. Let xi and xj be two nodes whose signs in T′x(i) are s and s′, respectively. If s=− then xi∈Nxi and xi∈Nxi,xj, and if s′=− then xj∈Nxi,xj. If s=+ then xi∉Nxi and xi∉Nxi,xj, and if s′=+ then xj∉Nxi,xj.

In the algorithm, Qt[i,kT,kS,s,pos] is computed by taking the minimum of the following expressions (we will refer to the minimum of the two expressions as Expression 2):

  1. Qt[i,kT,kS-1,s,pos]+ρdelS

    Explanation: Intuitively, every entry Qt[i,kT,kS,s,pos] defines some index e of S that is the end-point of every partial derivation in D(x(i),N,y,kT,kS). Thus, S[e] must be a part of any partial derivation μ∈D(x(i),N,y,kT,kS); so, either S[e] is deleted under μ or it is mapped under μ. The former option is captured by this case.

  2. min over μ∈D(x[i]t,Nxi,xi,kT,kS), over 1≤j≤i−1 and over s′∈{+,−}, of Qt[j, kT−μ.delT−span(x[j+1:i−1]t), kS−μ.delS, s′, pos−μ.pos] + μ.score + δordQ·BPDelta2(x,xi,xj,Nxi,xj) + ρdelT·span(x[j+1:i−1]t)

    Explanation: If S[e] is mapped under μ, then due to the hierarchical structure of Tx, it must be mapped under some derivation μ of one of the children of x that are in x[i]. Thus, we obtain the second case of the recursion rule. In this case, we consider xj∈x[i−1] and derivations of xj only to suffixes of S[1:E(x[j],kT−μ.delT,kS−μ.delS)].

    (2)

Once both DP tables are filled, for every combination of (pos,kT,kS) the algorithm returns the divergence from Tx to S. This is calculated by going over the partial derivations, and deleting the ignored nodes (while penalizing the deletion accordingly), while considering both orders of children(x).

Recall that we want to take into account the option of x to flip, and penalize accordingly. Thus, we run Algorithm 6 twice, and store the outputs for each combination (kT,kS,pos), with small modifications in the second run. The first time we run it as it is (while ignoring the option of x to flip). In the second run, we capture the option of x to flip. We can do that by looking only at derivations where all the nodes changed their sign (if a node y∈children(x) has a sign of + in Tx, then we consider its sign in the derivation as −, and vice versa). Specifically, we run Algorithm 6 with the following modifications. In Algorithm 6, in line 3 we define t=r, in line 6 we define s as the opposite sign of xi in Tx, and in line 21 we define t=r and s as the opposite sign of xi in Tx. In Expression 2, we define s′ as the opposite sign of xj in Tx. In addition, for every combination (kT,kS,pos), we apply the penalty for flipping x by adding δflipQ to each value. After that, we need to apply FlipCorrection(x,x′) (recall Definition 14), where x′ is the equivalent node that corresponds to the derivation of the entry. Note that we need to correct the flip penalties of the children of x that are not deleted in the derivation; to do so, we can simply backtrack the values of the DP table to recover the derivation and, in particular, find the children that are not deleted.

Finally, we return the minimum of the two runs for every combination (kT,kS,pos).
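The way the two runs are merged can be sketched as follows (our own illustration; both inputs are assumed to be dictionaries mapping (kT, kS, pos) to a divergence, where the values of the flipped run are assumed to already include the FlipCorrection of Definition 14):

```python
import math

def combine_orientation_runs(unflipped, flipped, delta_flip_q):
    """Return, per (kT, kS, pos) combination, the minimum of the unflipped
    run and the flipped run plus the flip penalty for x itself."""
    out = {}
    for key in set(unflipped) | set(flipped):
        no_flip = unflipped.get(key, math.inf)
        with_flip = flipped.get(key, math.inf) + delta_flip_q
        out[key] = min(no_flip, with_flip)
    return out
```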

Complexity analysis

In this section we analyse the time and space complexities of the algorithm that solves TTSD, Algorithm 4. First, we analyse Algorithms 5 and 6, which are used as procedures in Algorithm 4. For a given PQ-tree, we denote by γ the maximum number of children of an internal node.

Lemma 6.1

The P-Mapping algorithm, Algorithm 5, takes O(5γm2ndT2dS2γ3) time and O(5γmdTdS) space.

Proof

The most space consuming part of the algorithm is the 7-dimensional DP table. The first dimension, C, can be any subset of the set children(x), and the second and third dimensions (C′ and N) can be any subsets of C, therefore the size of all three dimensions together is O(5γ), as explained next. Every choice of C,C′,N defines (one-to-one) a partition of children(x) into five sets: children(x)\C, C\(C′∪N), C′\N, N\C′, N∩C′. Thus, in order to go over all C,C′,N, we can go over all partitions of children(x) into five sets, which takes O(5|children(x)|)=O(5γ). The size of the fourth dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. The fifth dimension, y, can be any node in C, therefore the size of the dimension is O(γ). The sizes of the sixth and seventh dimensions (i.e. kT and kS) are dT+1 and dS+1, respectively. Hence, the space of the algorithm is O(5γmγdTdS)=O(5γmdTdS).

The algorithm has three parts: initialization, filling the DP table, and returning the derivations. The most time consuming calculation required in the initialization is the calculation of D(C,N,y,pos,kT,kS) and checking if C′ is a vertex cover of G[x(C),x′(C)]. For a given tuple (C,N,y,pos,kT,kS), it takes O(|D|) time to calculate the set D(C,N,y,pos,kT,kS) (by naively going over each derivation in D and checking if it fits the values in the tuple); notice that |D|=O(mndTdS). We calculate this set for each combination of (C,N,y,pos,kT,kS). Thus, the calculations of D(C,N,y,pos,kT,kS) take O(22γγm2ndT2dS2) time. In addition, we check if C′ is a vertex cover of G[x(C),x′(C)]. To generate G[x(C),x′(C)] we go over all pairs of nodes in C, check if they changed their signed order, and if required, connect them with an edge in the graph. For a pair of nodes, it takes O(γ) time to check if they changed their signed order (naively). Thus it takes O(γ3) time to generate the graph. After that, it takes O(γ2) time to check if C′ is a vertex cover (naively). Hence, the first part of the algorithm takes O(5γmdTdSγ3)+O(22γγm2ndT2dS2)=O(5γm2ndT2dS2γ3) time.

The second part of the algorithm is done by calculating the value of every entry among the O(5γmdTdS) entries of P, using the recursion rule in Expression 1. The first line of the rule takes O(1) time, since it involves looking in another entry of P and basic computations. The second line of the rule involves going over all derivations μ∈D(C,N,y,kT,kS). Namely, going over all derivations of y with a specific end-point, that have no more than a specific number of deletions from the tree and string, and that are consistent with N (i.e. μ.e=E(C,kT,kS), μ.v=y, μ.delT≤kT, μ.delS≤kS, and μ.pos≤(span(y)−μ.delT)/2 if y∈N or μ.pos≥(span(y)−μ.delT)/2 if y∉N). The numbers of deletions from the tree and string are bounded by dT and dS, respectively, and the number of values of pos is bounded by m. Afterwards, we go over all z∈C\{y}, whose number is bounded by γ. In addition, we use the procedure BPDelta, which can be calculated in O(γ) time, and JumpViolationDelta, which takes O(1) time. Thus the second line of the rule takes O(dTdSmγ2) time. The third line of the rule is similar to the second line, except that in addition we calculate Cy,z and the span of Cy,z, which take O(γ) time. Thus, the third line of the rule takes O(dTdSmγ2) time. Hence, the time to calculate one entry of P is O(dTdSmγ2). In total, the second part of the algorithm takes O(5γm2dT2dS2γ2) time.

Finally, to construct the returned set of derivations, the algorithm goes over every combination kT,kS,pos once, i.e. it takes O(dTdSm) time. In total, the algorithm takes O(5γm2ndT2dS2γ3)+O(5γm2dT2dS2γ2)+O(dTdSm)=O(5γm2ndT2dS2γ3) time.

Lemma 6.2

The Q-Mapping algorithm, Algorithm 6, takes O(γ2dT2dS2m2n) time and O(γdTdSm) space.

Proof

The most space consuming part of the algorithm is the 5-dimensional DP table. The first dimension, i, is bounded by γ. The sizes of the second and third dimensions (i.e. kT and kS) are dT+1 and dS+1, respectively. The size of the fourth dimension (i.e. s∈{+,−}) is 2. The size of the fifth dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. Hence, the space of the algorithm is O(γdTdSm).

The algorithm has three parts: initialization, filling the DP table, and returning the derivations. The most time consuming calculation required in the initialization is the calculation of D(x[i],s,xi,pos,kT,kS). For a given tuple (x[i],s,xi,pos,kT,kS), it takes O(|D|) time to calculate the set D(x[i],s,xi,pos,kT,kS) (by naively going over all derivations in D and checking if they fit the values in the tuple); notice that |D|=O(mndTdS). We calculate this set for each combination of (i,kT,kS,s,pos) where i=1. Thus, the calculations of D(x[i],s,xi,pos,kT,kS) take O(m2ndTdS) time all together. Hence, the first part of the algorithm takes O(m2ndTdS) time.

The second part of the algorithm is done by calculating the value of every entry among the O(γdTdSm) entries of Qt, using the recursion rule in Expression 2. The first line of the rule takes O(1) time, since it involves looking in another entry of Qt and basic computations. The second line of the rule involves going over all derivations μ∈D(x[i]t,Nxi,xi,kT,kS). Namely, going over all derivations of xi with a specific end-point, which have no more than a specific number of deletions from the tree and string, and which are consistent with Nxi (i.e. μ.e=E(x[i],kT,kS), μ.v=xi, μ.delT≤kT, μ.delS≤kS, and μ.pos≤(span(xi)−μ.delT)/2 if xi∈Nxi or μ.pos≥(span(xi)−μ.delT)/2 if xi∉Nxi). The numbers of deletions from the tree and string are bounded by dT and dS, respectively, and the number of values for pos is bounded by m. Afterwards, we go over indices 1≤j≤i−1 and signs s′∈{+,−}, whose number is bounded by O(γ). In addition, we use the procedure BPDelta2, which can be calculated in O(γ) time. Thus, the second line of the rule takes O(dTdSmγ) time. Hence, the second part of the algorithm takes O(γ2dT2dS2m2) time.

Finally, to construct the returned set of derivations, the algorithm goes over every combination kT,kS,pos once, taking the minimum over t∈{ℓ,r}, 1≤i≤γ and s∈{+,−}; so, this takes O(dTdSmγ) time. Recall that we calculate both Qℓ and Qr, but this does not affect the magnitude of the time. In addition, recall that we run the algorithm twice, while in the second run there are modifications that do not increase the time (in fact, they even improve it). Hence, in total, the algorithm takes O(m2ndTdS)+O(γ2dT2dS2m2)+O(dTdSmγ)=O(γ2dT2dS2m2n) time.

Lemma 6.3

The main algorithm, Algorithm 4, takes O(n2γ2dT2dS2m2(mp·5γγ+mq)) time and O(dTdSm(mn+5γ)) space.

Proof

The number of leaves in the PQ-tree T is m, hence there are O(m) nodes in the tree, i.e. the size of the first dimension of the DP table, A, is O(m). The size of the second dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. In the algorithm description (“The main algorithm” section) a bound for the possible start indices of subsequences derived from nodes in T is given (for a node x, the start index i runs between 1 and n-(span(x)-dT)+1). The node with the largest span in T is the root, which has a span of m. The root is mapped to the longest subsequence when there are dS deletions from the string. Hence, the size of the third dimension of A is O(n-(m+dS)+1)=O(n). The fourth and the fifth dimensions of A are of size dT+1 and dS+1, respectively. In total, the DP table A is of size O(dTdSm2n).

In the initialization step, O(dTdSm2n) entries of A are computed. We go over 0kSdS and pos{0,1}. The most time consuming step is the generation of the set Ix,S,i,kS,pos, and it can be generated in O(|children(x)|)=O(γ) time. Thus the initialization takes O(dTdSm2nγ). The P-Mapping algorithm is called for every P-node in T and every possible start index i, so the P-Mapping algorithm is called O(nmp) times. Similarly, the Q-Mapping algorithm is called O(nmq) times. Thus, it takes O(dTdSm2nγ)+O(n·(mp·Time(P-Mapping)+mq·Time(Q-Mapping))) time to fill the DP table.

In the final stage of the algorithm, the minimum over the entries corresponding to every combination of (0≤kT≤dT, 0≤kS≤dS, 1≤i≤n−(span(rootT)−kT)+1, 0≤pos≤span(rootT)−kT) is computed. So, it takes O(dTdSnm) time to find a derivation with minimum score.

From Lemma 6.1, the P-Mapping algorithm takes O(5γm2ndT2dS2γ3) time and O(5γmdTdS) space, and from Lemma 6.2, the Q-Mapping algorithm takes O(γ2dT2dS2m2n) time and O(γdTdSm) space. Thus, in total, our algorithm runs in O(dTdSm2nγ)+O(n·(mp·O(5γm2ndT2dS2γ3)+mq·O(γ2dT2dS2m2n)))=O(n2γ2dT2dS2m2(mp·5γγ+mq)) time. Adding to the space required for the main DP table the space required for the P-Mapping algorithm (the space needed for the Q-Mapping algorithm is negligible compared to that of the P-Mapping algorithm) results in a total space complexity of O(dTdSm2n)+O(5γmdTdS)=O(dTdSm(mn+5γ)).

TTSD: polynomial space complexity

In this section we propose an improved version (in terms of space complexity) of the algorithm presented in the “TreeToStringDivergence: algorithm” section, using the inclusion–exclusion principle. Specifically, the space complexity of the algorithm is polynomial instead of exponential. Our P-Mapping algorithm (see “P-node mapping: the algorithm”, Algorithm 5) uses a DP table whose size is exponential. Thus, we propose a new version of the P-Mapping algorithm, which uses a DP table whose size is polynomial. In what follows, we first describe the inclusion–exclusion principle (“Inclusion–exclusion principle” section). Then, we describe the new version of the P-Mapping algorithm (“P-node mapping: polynomial space complexity” section), which is used in Algorithm 4 instead of Algorithm 5. Note that in this version, we are unable to consider jumps of large units. So, when we calculate Diverge, we assume that ΔjumpP=0.

Inclusion–exclusion principle

Usually, the inclusion–exclusion principle is described as a formula for computing |⋃i=1..n Ai| for a collection of sets A1,…,An. We use the intersection version of inclusion–exclusion, which is described as a formula for computing |⋂i=1..n Ai|. Denote by [n] the set of indices from 1 to n, i.e. [n]={1,…,n}. Let A1,…,An⊆U, where U is a finite set. Denote ⋂i∈∅(U\Ai)=U. Then,

|⋂i∈[n] Ai| = ΣX⊆[n] (−1)|X| · |⋂i∈X (U\Ai)|.   (5)

In typical algorithmic applications of the inclusion–exclusion formula we need to count some objects that belong to a universe U, and doing so directly is in some sense hard. More precisely, we are interested in objects that satisfy n requirements A1,…,An, and each requirement is identified with the set of objects that satisfy it. Thus, the inclusion–exclusion formula translates the problem of computing |⋂i∈[n]Ai| into computing 2^n terms of the form |⋂i∈X(U\Ai)|. In our application, computing these terms is in some sense easy. If, for example, each of the terms |⋂i∈X(U\Ai)| can be computed in polynomial time, then the inclusion–exclusion formula gives an algorithm that performs 2^n·n^O(1) arithmetic operations.
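As a self-contained toy example of this counting pattern (our own illustration, not part of the paper), the snippet below counts the sequences of a given length over γ symbols that use every symbol at least once, which mirrors how derivations using every child of x are counted later; terms are grouped by |X| since all subsets of the same size contribute equally here.

```python
from math import comb

def count_covering_sequences(gamma, length):
    """|A_1 ∩ ... ∩ A_gamma| where A_i is the set of sequences over
    {1..gamma} of the given length that use symbol i at least once."""
    total = 0
    for r in range(gamma + 1):
        # There are comb(gamma, r) subsets X of size r, and for each of them
        # |⋂_{i∈X}(U \ A_i)| counts sequences avoiding those r symbols,
        # i.e. (gamma - r) ** length.
        total += (-1) ** r * comb(gamma, r) * (gamma - r) ** length
    return total

# Example: count_covering_sequences(3, 3) == 6, the permutations of 3 symbols.
```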

P-node mapping: polynomial space complexity

Recall that the input consists of an internal P-node x, a string S, bounds on the number of deletions from the tree T and the string S, dT and dS, respectively, and a set of derivations D (see Eq. 1). The output of the algorithm is the collection of divergences of derivations of x to every possible prefix of S having exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign, for each combination of 0≤kT≤dT, 0≤kS≤dS and 0≤pos≤span(x). Thus, there are O(dT·dS·span(x)) derivations in the output.

Denote by γ the number of children of x, i.e. γ=|children(x)|. First, we define the collection of sets A1,…,Aγ (see “Inclusion–exclusion principle” section). Notice that the derivations we seek take into account all of children(x), using each child exactly once (mapped or deleted). Thus, we define the collection A1,…,Aγ where Ai is the set of derivations that use the i’th child of x at least once (mapped or deleted), and use exactly γ children of x (including repetitions). In particular, note that the same child can be both mapped several times (to different substrings) and deleted several times (even if it has been mapped). As a result, ⋂i∈[γ]Ai is the set of derivations that take into account all of children(x), where each child is used exactly once. We define U as the set of all partial derivations using γ children of x (including repetitions). Therefore, ⋂i∈X(U\Ai) is the set of derivations that do not use the children whose indices are in X, and use exactly γ children of x (including repetitions).

Denote by ScoreSet the set of all possible divergence values of all partial derivations of x to S. We use the collection of sets A1,…,Aγ as described above, for every combination of (kT,kS,pos,score), where the divergence values of the derivations in A1,…,Aγ are at most score. Specifically, Algorithm 7 constructs a 4-dimensional DP table B, which has an entry for every 0≤kT≤dT, 0≤kS≤dS, 0≤pos≤span(x) and score∈ScoreSet. The purpose of an entry B[kT,kS,pos,score] is to hold the number of partial derivations rooted in x to a prefix of S with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in the derivation, whose divergence value is at most score. For each entry in the DP table, we calculate its value using the intersection version of the inclusion–exclusion formula, shown in the “Inclusion–exclusion principle” section.

Lemma 7.1

Assuming that Algorithm 8 is correct, Algorithm 7 is correct; that is, it returns the collection of divergences of derivations of x to every prefix of S having exactly kT deletions from Tx, kS deletions from the prefix of S and pos leaves with positive sign, for each combination of 0≤kT≤dT, 0≤kS≤dS and 0≤pos≤span(x).

Proof

For each combination of 0≤kT≤dT, 0≤kS≤dS, 0≤pos≤span(x) and score∈ScoreSet, we define Û⊆U as the set of all partial derivations using γ children of x (including repetitions) with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in the derivation, whose divergence value is at most score. We define Âi⊆Û as the set of derivations that use the i’th child of x at least once (mapped or deleted), and use exactly γ children of x (including repetitions). For each combination (kT,kS,pos,score), ⋂i∈[γ]Âi is the set of derivations with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in the derivation, whose divergence value is at most score, that take into account all of children(x), where each child is used exactly once. For each combination (kT,kS,pos,score), and given Y⊆[γ], Algorithm 8 returns |⋂i∈Y(Û\Âi)|, which is the number of derivations that do not use the children whose indices are in Y, and use exactly γ children of x (including repetitions). In Algorithm 7 we calculate |⋂i∈[γ]Âi| for each combination (kT,kS,pos,score) according to the inclusion–exclusion principle and store the results in table B, using Eq. 5 as follows.

|⋂i∈[γ] Âi| = ΣY⊆[γ] (−1)|Y| · |⋂i∈Y (Û\Âi)|.

From the correctness of Algorithm 8 and the correctness of Eq. 5, the values in B are correct.

Finally, for each combination of 0≤kT≤dT, 0≤kS≤dS and 0≤pos≤span(x), Algorithm 7 returns the minimum score∈ScoreSet for which B[kT,kS,pos,score]>0, meaning that score is the minimum value of the possible derivations, that is, the desired divergence value to be returned.
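The following sketch illustrates this use of the formula (again an illustration under our own naming: count_excluding plays the role of Algorithm 8, returning the number of partial derivations that avoid the children indexed by Y and have divergence at most score for the given (kT, kS, pos)):

```python
from itertools import combinations
import math

def min_divergence_by_inclusion_exclusion(gamma, count_excluding,
                                          score_set, k_t, k_s, pos):
    """For one (kT, kS, pos) combination: build B[kT,kS,pos,score] as the
    signed sum over all Y ⊆ [gamma] and return the smallest score whose
    count is positive (infinity if no derivation exists)."""
    for score in sorted(score_set):
        b = 0
        for r in range(gamma + 1):
            for Y in combinations(range(gamma), r):
                b += (-1) ** r * count_excluding(set(Y), k_t, k_s, pos, score)
        if b > 0:
            return score
    return math.inf
```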

Internal P-node mapping

In this section, we describe Algorithm 8. The input consists of an internal P-node x, a subset C⊆children(x), a string S, bounds on the number of deletions from the tree T and the string S, dT and dS, respectively, and a set of derivations D (see Eq. 1). The output of the algorithm is, for each combination (kT,kS,pos,score), the number of all partial derivations of x(C) to every prefix of S having exactly kT deletions from the tree, kS deletions from the string, pos leaves with positive sign, whose divergence value is at most score, and that use γ children from C (including repetitions). Thus, there are O(dT·dS·span(x)·|ScoreSet|) values in the output.

Algorithm 8 constructs an 8-dimensional DP table P, which has an entry for every 1≤i≤|children(x)|, y∈C, s∈{+,−}, 1≤j≤|S|, 0≤pos≤span(x), 0≤kT≤dT, 0≤kS≤dS, and for every score∈ScoreSet. The purpose of an entry P[i,y,s,j,pos,kT,kS,score] is to hold the number of all partial derivations rooted in x(C) to a suffix of S[1:j], which use exactly i children from C (not necessarily different children), which map y to the suffix of S[1:j], with exactly kT deletions from the tree, kS deletions from the string, and pos leaves with positive sign in Ty such that s is the sign of y in the derivation and it is consistent with pos, and whose divergence value is at most score.

Every entry P[i,y,s,j,pos,kT,kS,score] for which i=1 is initialized with |D(i,y,s,j,pos,kT,kS,score)|, which is the size of the set of all partial derivations rooted in Ty to a suffix of S[1:j], with exactly kT deletions from the tree, kS deletions from the string, pos leaves with positive sign in Ty such that s is the sign of y in the derivation and it is consistent with pos, and whose divergence value is at most score.

After the initialization, the remaining entries of P are calculated using the recursion rule in Expression 3 ahead. The order of computation is ascending with respect to the number of children that should be used, i, and it is also ascending with respect to the number of deletions from both tree and string. In addition, the order is ascending with respect to the number of positively signed leaves in the derivation, and it is also ascending with respect to the maximum score of the derivations.

In Expression 3 we use the notations Ny and Ny,z, defined as follows. Let y and z be two nodes whose signs in the derivation are considered to be s and s′, respectively. If s=− then y∈Ny and y∈Ny,z, and if s′=− then z∈Ny,z. If s=+ then y∉Ny and y∉Ny,z, and if s′=+ then z∉Ny,z. In addition, we use the notation D′(j,Ny,y,kT,kS) instead of D′(C,Ny,y,kT,kS) (see Definition 22). The two sets are similar, except that the derivations in D′(j,Ny,y,kT,kS) are derivations of y to suffixes of S[1:j].

In the algorithm, P[i,y,s,j,pos,kT,kS,score] is computed by summing the following expressions (we will refer to the sum of the three expressions as Expression 3):

  1. P[i,y,s,j-1,pos,kT,kS-1,score-ρdelS]

    Explanation: For every entry P[i,y,s,j,pos,kT,kS,score], j is the end-point of every partial derivation. Thus, S[j] must be a part of any partial derivation; so, S[j] is either deleted or mapped. The former option is captured by the first case of the recursion rule.

  2. Σz∈C Σs′∈{+,−} P[i−1, z, s′, j−span(y)+μ.delT, pos−μ.pos, kT−μ.delT, kS−μ.delS, score−μ.score−BPDelta(x,y,z,Ny,z)]

    Explanation: If S[j] is mapped, then due to the hierarchical structure of Tx, it must be mapped under some derivation μ of one of the children of x that are in C. Thus, we obtain the second and the third cases of the recursion rule. In these cases we take into account every z∈C to be aligned to the suffix of the derivation’s subsequence S[1:j−span(y)+μ.delT].

  3. Σz∈C s.t. Cy,z⊆C Σs′∈{+,−} P[i−|Cy,z|−1, z, s′, j−span(y)+μ.delT, pos−μ.pos, kT−μ.delT−span(Cy,z), kS−μ.delS, score−μ.score−ρdelT·span(Cy,z)−BPDelta2(x,y,z,Ny,z)]

    Explanation: The third case captures the option of deleting all the nodes between y and z (Cy,z), so that after the deletion we consider y and z as adjacent in Tx(C).

    (3)

Once the entire DP table is filled, for every combination of (pos,kT,kS,score) the algorithm returns the number of all partial derivations of x(C) to every prefix of S. This is calculated by going over the values of the DP table and deleting the ignored nodes, penalizing the deletions accordingly (C′⊆C is the set of children considered to be deleted).

Complexity analysis

In this section we analyse the time and space complexities of the improved algorithm that solves TTSD, Algorithm 4, which uses the improved P-Mapping algorithm. First, we analyse Algorithms 8 and 7, which are used as sub procedures in Algorithm 4 (instead of Algorithm 5).

Lemma 7.2

The Internal P-Mapping algorithm, Algorithm 8, takes O(2γγ4m2n2dT2dS2(dT+dS+m+n)) time and O(γ2nmdTdS(dT+dS+m+n)) space.

Proof

The most space consuming part of the algorithm is the 8-dimensional DP table. The first dimension, 1≤i≤|children(x)|, is of size O(|children(x)|)=O(γ). The second dimension, y, can be any node in C, therefore its size is O(|C|)=O(|children(x)|)=O(γ). The third dimension, s∈{+,−}, is of size 2. The fourth dimension, 1≤j≤|S|, is of size O(n). The size of the fifth dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. The sizes of the sixth and seventh dimensions (i.e. kT and kS) are dT+1 and dS+1, respectively. The eighth dimension, score, can be any score in ScoreSet. For fixed penalty parameters δordQ and δflipQ, in the worst case we pay the maximum number of possible break-points (bounded by m+n), pay the maximum number of deletions from the tree and string, dT+dS, and pay for a flip of all nodes in the tree (which yields m values). Therefore the size of the eighth dimension is O(|ScoreSet|)=O(dT+dS+m+n). Hence, the space of the algorithm is O(γ2nmdTdS(dT+dS+m+n)).

The algorithm has three parts: initialization, filling the DP table, and returning the derivations. The most time consuming calculation required in the initialization is the calculation of D(y,s,j,pos,kT,kS,score). For a given tuple (i,y,s,j,pos,kT,kS,score), it takes O(|D|) time to calculate the set D(y,s,j,pos,kT,kS,score) (by naively going over all derivations in D and checking if they fit the values in the tuple); notice that |D|=O(mndTdS). We calculate this set for each combination of (i,y,s,j,pos,kT,kS,score). Thus, the first part of the algorithm takes O(γn2m2dT2dS2(dT+dS+m+n)) time. The second part of the algorithm is done by calculating the value of every entry among the O(γ2nmdTdS(dT+dS+m+n)) entries of P, using the recursion rule in Expression 3. The first line of the rule takes O(1) time, since it involves looking in another entry of P and basic computations. The second and the third lines of the rule involve going over all derivations μ∈D′(j,Ny,y,kT,kS). Namely, going over all derivations of y with a specific end-point, that have no more than a specific number of deletions from the tree and string, and that are consistent with Ny (i.e. μ.e=j, μ.v=y, μ.delT≤kT, μ.delS≤kS, and μ.pos≤(span(y)−μ.delT)/2 if y∈Ny or μ.pos≥(span(y)−μ.delT)/2 if y∉Ny). The numbers of deletions from the tree and string are bounded by dT and dS, respectively, and the number of values of pos is bounded by m. Afterwards, we go over z∈C, whose number is bounded by γ, and we go over s′∈{+,−}. In addition, we use the procedures BPDelta and BPDelta2, which can be calculated in O(γ) time. Thus the second and third lines of the rule take O(dTdSmγ2) time. Hence, the time to calculate one entry of P is O(dTdSmγ2). In total, the second part of the algorithm takes O(γ4nm2dT2dS2(dT+dS+m+n)) time. Finally, to construct the returned set of derivations, we go over every combination kT,kS,pos,score, and for every combination, we take the maximum over C′⊆C, y∈C, s∈{+,−}, 0≤j≤|S|, i.e. it takes O(2γγdTdSmn(dT+dS+m+n)) time. In total, the algorithm takes O(γn2m2dT2dS2(dT+dS+m+n))+O(γ4nm2dT2dS2(dT+dS+m+n))+O(2γγdTdSmn(dT+dS+m+n))=O(2γγ4m2n2dT2dS2(dT+dS+m+n)) time.

Lemma 7.3

The P-Mapping2 algorithm, Algorithm 7, takes O(22γγ4m2n2dT2dS2(dT+dS+m+n)) time and O(γ2nmdTdS(dT+dS+m+n)) space.

Proof

The most space consuming part of the algorithm is the 4-dimensional table. The size of the first and second dimensions (i.e. kT and kS) are dT+1 and dS+1, respectively. The size of the third dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. The fourth dimension, score, can be any score in ScoreSet, therefore the size of the dimension is O(|ScoreSet|)=O(dT+dS+m+n). In total, the DP table B is of size O(dTdSm(dT+dS+m+n)).

The algorithm has two parts: filling the table, and returning the derivations. In the first part of the algorithm, we go over every subset Y⊆children(x), and for each one we call Algorithm 8. Thus we call Algorithm 8 O(2|children(x)|)=O(2γ) times. In the second part of the algorithm, after the table is full, for every 0≤kT≤dT, 0≤kS≤dS, 0≤pos≤span(x)−kT, we return the minimum over score∈ScoreSet such that the value of the corresponding entry of the table is positive. Hence, the second part of the algorithm takes O(dTdSm(dT+dS+m+n)) time.

From Lemma 7.2, the Internal P-Mapping algorithm takes O(2γγ4m2n2dT2dS2(dT+dS+m+n)) time and O(γ2nmdTdS(dT+dS+m+n)) space. Thus, in total, our algorithm runs in O(2γ·2γγ4m2n2dT2dS2(dT+dS+m+n))=O(22γγ4m2n2dT2dS2(dT+dS+m+n)) time. Adding to the space required for the table B the space required for the Internal P-Mapping algorithm results in a total space complexity of O(dTdSm(dT+dS+m+n))+O(γ2nmdTdS(dT+dS+m+n))=O(γ2nmdTdS(dT+dS+m+n)).

Lemma 7.4

The main algorithm, Algorithm 4, using the improved P-Mapping algorithm, takes O(nγ2dT2dS2m2(mp·22γγ2n(dT+dS+m+n)+mq)) time and O(γ2nm2dTdS(dT+dS+m+n)) space.

Proof

The number of leaves in the PQ-tree T is m, hence there are O(m) nodes in the tree, i.e. the size of the first dimension of the DP table, A, is O(m). The size of the second dimension (i.e. pos), in the worst case (where x is the root), is span(x)=m. In the algorithm description (“The main algorithm” section) a bound for the possible start indices of subsequences derived from nodes in T is given (for a node x, the start index i runs between 1 and n−(span(x)−dT)+1). The node with the largest span in T is the root, which has a span of m. The root is mapped to the longest subsequence when there are dS deletions from the string. Hence, the size of the third dimension of A is O(n−(m+dS)+1)=O(n). The fourth and the fifth dimensions of A are of size dT+1 and dS+1, respectively. In total, the DP table A is of size O(dTdSm2n).

In the initialization step, $O(d_T d_S m^2 n)$ entries of A are computed. We go over $0 \le k_S \le d_S$ and $pos \in \{0,1\}$. The most time-consuming step is the generation of the set $I_{x,S,i,k_S,pos}$, which can be done in $O(|children(x)|)=O(\gamma)$ time. Thus, the initialization takes $O(d_T d_S m^2 n \gamma)$ time. The improved P-Mapping algorithm is called for every P-node in T and every possible start index i, i.e. the P-Mapping algorithm is called $O(n m_p)$ times. Similarly, the Q-Mapping algorithm is called $O(n m_q)$ times. Thus, it takes $O(d_T d_S m^2 n \gamma) + O(n(m_p \cdot \text{Time(P-Mapping)} + m_q \cdot \text{Time(Q-Mapping)}))$ time to fill the DP table. In the final stage of the algorithm, the minimum over the entries corresponding to every combination of ($0 \le k_T \le d_T$, $0 \le k_S \le d_S$, $1 \le i \le n-(span(x)-d_T)+1$, $0 \le pos \le span(root_T)-k_T$) is computed. So, it takes $O(d_T d_S n m)$ time to find a derivation with minimum score.

From Lemma 7.3, the P-Mapping2 algorithm takes $O(2^{2\gamma} \gamma^4 m^2 n^2 d_T^2 d_S^2 (d_T+d_S+m+n))$ time and $O(\gamma^2 n m d_T d_S (d_T+d_S+m+n))$ space, and from Lemma 6.2, the Q-Mapping algorithm takes $O(\gamma^2 d_T^2 d_S^2 m^2)$ time and $O(\gamma d_T d_S m)$ space. Thus, in total, our algorithm runs in $O(d_T d_S m^2 n \gamma) + O(n(m_p \cdot O(2^{2\gamma} \gamma^4 m^2 n^2 d_T^2 d_S^2 (d_T+d_S+m+n)) + m_q \cdot O(\gamma^2 d_T^2 d_S^2 m^2))) = O(n \gamma^2 d_T^2 d_S^2 m^2 (m_p \cdot 2^{2\gamma} \gamma^2 n (d_T+d_S+m+n) + m_q))$ time. Adding the space required for the P-Mapping2 and Q-Mapping algorithms to the space required for the main DP table results in a total space complexity of $O(d_T d_S m^2 n) + O(\gamma^2 n m d_T d_S (d_T+d_S+m+n)) + O(\gamma d_T d_S m) = O(\gamma^2 n m^2 d_T d_S (d_T+d_S+m+n))$.
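The way the individual bounds combine into the stated totals can be traced with simple arithmetic; the sketch below (hypothetical Python, with parameter values chosen only for illustration) evaluates the additive time terms from the proof and confirms that the table-filling term dominates.

```python
def main_algorithm_time_terms(n, m, gamma, d_T, d_S, m_p, m_q):
    """Evaluate, up to constant factors, the additive time terms from the proof above."""
    score_range = d_T + d_S + m + n                         # bound on |ScoreSet|
    init = d_T * d_S * m**2 * n * gamma                     # initialization of A
    p_mapping2 = (2 ** (2 * gamma)) * gamma**4 * m**2 * n**2 \
        * d_T**2 * d_S**2 * score_range                     # one P-Mapping2 call
    q_mapping = gamma**2 * d_T**2 * d_S**2 * m**2           # one Q-Mapping call
    fill = n * (m_p * p_mapping2 + m_q * q_mapping)         # filling the DP table
    extract = d_T * d_S * n * m                             # extracting the minimum
    return init, fill, extract

init, fill, extract = main_algorithm_time_terms(n=12, m=8, gamma=3,
                                                d_T=1, d_S=1, m_p=2, m_q=1)
print(fill >= init and fill >= extract)  # True: the table-filling stage dominates
```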

Methods and datasets

Dataset and gene cluster generation

1487 fully sequenced prokaryotic strains with COG ID annotations were downloaded from GenBank (NCBI; ver 10/2012). The gene clusters were generated from this data using the tool CSBFinder-S [45]. CSBFinder-S was applied to all the genomes in the dataset after removing their plasmids, using parameters q=10 (a colinear gene cluster is required to appear in at least ten genomes) and k=0 (no insertions are allowed in a colinear gene cluster), resulting in 79,017 colinear gene orders. Of these, only gene orders whose number of distinct COGs is between 4 and 9 were kept, leaving 28,537 gene orders. Next, ignoring strand and gene order information, colinear gene orders that contain the exact same COGs were united, forming a generalized set of 91 gene clusters in which each gene cluster contains at least 3 gene orders and each COG appears only once in each gene order. For each gene cluster, the most abundant gene order was designated as the "reference" (centroid) gene order. Based on this, the clusters were further filtered to keep only the 63 gene clusters whose designated reference has instances in at least 30 genomes. Finally, clusters containing one or more gene orders that are identical to the designated reference gene order, in terms of the list of classes in which they have instances, were removed, leaving a benchmark set of 59 gene clusters.
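The filtering pipeline described above can be summarized by the following sketch (Python; a simplification under our own assumptions, not the actual CSBFinder-S post-processing code, and omitting the reference-abundance and class-composition filters).

```python
from typing import Dict, FrozenSet, List

def unite_into_clusters(gene_orders: List[List[str]],
                        min_cogs: int = 4, max_cogs: int = 9,
                        min_orders: int = 3) -> Dict[FrozenSet[str], List[List[str]]]:
    """Keep gene orders with 4-9 distinct COGs in which no COG repeats, then unite
    orders with identical COG content (ignoring strand and order) into clusters,
    and keep only clusters with at least `min_orders` gene orders."""
    clusters: Dict[FrozenSet[str], List[List[str]]] = {}
    for order in gene_orders:
        cogs = frozenset(cog.lstrip("+-") for cog in order)  # ignore strand signs
        if min_cogs <= len(cogs) <= max_cogs and len(cogs) == len(order):
            clusters.setdefault(cogs, []).append(order)
    return {k: v for k, v in clusters.items() if len(v) >= min_orders}

def designate_reference(cluster_orders: List[List[str]],
                        abundance: Dict[tuple, int]) -> List[str]:
    """Pick the most abundant gene order of a cluster as its reference (centroid)."""
    return max(cluster_orders, key=lambda order: abundance.get(tuple(order), 0))
```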

PQ-tree construction

The input PQ-trees for our algorithm were constructed using the tool PQFinder (available on GitHub [46]). PQFinder was applied to each of the gene clusters in the dataset to build the PQ-tree representing each cluster. In addition, each Q-node with exactly two children, whose height in the tree is greater than 1, was changed to a P-node (in this special case, all children of the node were observed in all shuffling options, which in our opinion better fits the syntax of a P-node than that of a Q-node).
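The Q-to-P relabelling step can be expressed as a short post-processing pass over the tree; the sketch below is an illustration in Python over a hypothetical PQ-tree structure, not PQFinder code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PQNode:
    kind: str                                   # "P", "Q", or "LEAF"
    children: List["PQNode"] = field(default_factory=list)

def height(node: PQNode) -> int:
    """Height of a node: 0 for a leaf, 1 + maximal child height otherwise."""
    return 0 if not node.children else 1 + max(height(c) for c in node.children)

def relabel_binary_q_nodes(node: PQNode) -> None:
    """Change every Q-node with exactly two children and height greater than 1
    into a P-node, as in the post-processing step described above."""
    if node.kind == "Q" and len(node.children) == 2 and height(node) > 1:
        node.kind = "P"
    for child in node.children:
        relabel_binary_q_nodes(child)
```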

Parameter settings

In our experiment, we set the parameters of the algorithm as follows: $\delta_{ordQ}=1.5$, $\delta_{flipQ}=0.5$, $d_T=0$, $d_S=0$. The stronger penalty for $\delta_{ordQ}$ versus $\delta_{flipQ}$ is based on the observation that gene clusters in prokaryotes are very strongly colinearly conserved [30, 31], even when the benchmark dataset is large and spans a wide taxonomical range of prokaryotes [32].
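For reference, the parameter setting above corresponds to a configuration of the following shape (the key names are ours and are not necessarily the ones expected by the MEM-Rearrange command line).

```python
# Illustrative configuration of the experiment's parameter setting.
PARAMS = {
    "delta_ord_Q": 1.5,   # the delta_ordQ penalty (set higher, favouring colinear conservation)
    "delta_flip_Q": 0.5,  # the delta_flipQ penalty
    "d_T": 0,             # no deletions allowed from the tree
    "d_S": 0,             # no deletions allowed from the string
}
```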

Strand information

Our approach to the comparison of two gene orders focuses on adaptive fitness (in terms of the order by which the gene products are produced). Therefore, we do not distinguish between two gene orders that appear on distinct strands but are identical in terms of the order and direction of their genes (with respect to the transcription start site). Furthermore, the tool by which the gene orders were identified (CSBFinder-S [45]) groups together instances from distinct strands into the same CSB. To this end, for any two gene orders being compared in our benchmarks, we compute the rearrangement distance twice: in the first computation, both gene orders preserve the original strand, and in the second computation, one of the compared gene orders is modified by reversing both the order and the directions of its genes.
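A minimal sketch of this double comparison is given below (Python; `divergence` stands in for the structure-informed measure computed by MEM-Rearrange, and taking the smaller of the two values is our reading of the protocol, not an explicit statement of the text above).

```python
from typing import Callable, List, Tuple

Gene = Tuple[str, int]          # (COG ID, strand: +1 or -1)
GeneOrder = List[Gene]

def reverse_and_flip(order: GeneOrder) -> GeneOrder:
    """Reverse the gene order and flip the strand (direction) of every gene."""
    return [(cog, -strand) for cog, strand in reversed(order)]

def strand_aware_distance(ref: GeneOrder, target: GeneOrder,
                          divergence: Callable[[GeneOrder, GeneOrder], float]) -> float:
    """Compute the rearrangement distance twice, once against the target as given and
    once against its reversed-and-flipped version, and keep the smaller value."""
    return min(divergence(ref, target), divergence(ref, reverse_and_flip(target)))
```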

Results

Evaluation

In this section we evaluate the accuracy of our approach in measuring the evolutionary divergence between two gene orders that belong to the same gene cluster. To this end, we aim to generate a set of “control” distances, computed from real data, against which the divergence scores computed by our tool can be compared and evaluated.

Recall that in our application, each of the input sequences does not correspond to a specific genomic sequence but rather represents a gene order that occurs in multiple genomes. In addition, abundant gene clusters typically display several paralogous occurrences of distinct gene orders, and possibly several paralogous occurrences of a specific gene order, within the same genome. Furthermore, each distinct occurrence of a specific gene order could differ substantially from another occurrence of the same gene order in terms of its encoding genomic sequence, since each COG represents a cluster of genomic sequences that are not identical but are similar enough (possibly based on local sequence similarity) to be clustered into the same gene orthology group. This poses a challenge in creating a set of "control" distances by which to evaluate the performance of our approach in comparison to extant genome rearrangement distances, since the "ground truth" regarding the evolutionary distance between two gene orders cannot be estimated by computing the sequence alignment between the underlying genomic sequences.

Thus, we chose to represent each gene order by the assemblage of its instances, i.e. the set of genomes in which it occurs, and to employ comparative assemblage analysis as a "control" measure. Several similarity (or overlap) indices based on presence/absence (incidence) data have been proposed in the literature [47, 48]. A classical and widely used index in comparative assemblage analysis is the Jaccard index [48]. In our comparative evaluation, the instance assemblages are used to estimate divergence rather than similarity, and therefore we use the inverse Jaccard index as an estimator of the instance-assemblage-based divergence between two gene orders.
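The exact formula of the inverse Jaccard index is not spelled out in this section; the sketch below assumes the standard Jaccard distance, $1 - |A \cap B|/|A \cup B|$, over the instance assemblages (a minimal illustration in Python).

```python
def inverse_jaccard(assemblage_a: set, assemblage_b: set) -> float:
    """Instance-assemblage divergence between two gene orders, taken here as
    1 minus the Jaccard index of their sets of taxonomic classes (in [0, 1])."""
    union = assemblage_a | assemblage_b
    if not union:
        return 0.0
    return 1.0 - len(assemblage_a & assemblage_b) / len(union)

# Example with two partially overlapping class assemblages: 1 - 1/3 = 0.666...
print(inverse_jaccard({"Bacilli", "Clostridia"}, {"Bacilli", "Gammaproteobacteria"}))
```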

Our proposed divergence measure was evaluated, per cluster, as follows: first, we applied our approach (Algorithm 1) to measure the structure-informed divergence from the cluster's designated reference (explained in the "Methods and datasets" section) to each of the other gene orders. Then, we calculated the inverse Jaccard based distance from the set of instances of the reference gene order to the sets of instances of each of the other gene orders. In order to tolerate the noise due to inter-species and inter-genus horizontal transfer of gene orders, we first converted the assemblages of genomes to the assemblages of (taxonomic) classes to which these genomes belong. This resulted in two series of scores, which were then subjected to the computation of Spearman and Pearson correlations between them. The same evaluation procedure was then repeated twice, once using the signed break-point distance (as in Definition 8) instead of our structure-informed divergence measure, and once using the CREx reversal distance [28].
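Per cluster, the evaluation thus reduces to correlating two series of scores; a minimal sketch (Python, using SciPy, with hypothetical input series) is shown below.

```python
from scipy.stats import pearsonr, spearmanr

def correlate_series(divergence_scores, assemblage_divergences):
    """Correlate the rearrangement-based divergences (reference vs. each other gene
    order of a cluster) with the corresponding inverse-Jaccard divergences."""
    pearson_r, _ = pearsonr(divergence_scores, assemblage_divergences)
    spearman_r, _ = spearmanr(divergence_scores, assemblage_divergences)
    return pearson_r, spearman_r

# Hypothetical per-cluster series, one entry per non-reference gene order.
print(correlate_series([0.5, 1.5, 2.0, 3.5], [0.20, 0.40, 0.55, 0.80]))
```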

To this end, the 59 gene clusters were distributed into three groups according to the number of Q-nodes in their representative PQ-trees, as we consider the number of Q-nodes in a PQ-tree to be a good estimate of the hierarchical complexity of its colinear components. This yielded 8 gene clusters whose representative tree has no Q-nodes, 41 gene clusters whose representative tree has one Q-node, and 10 gene clusters whose representative tree has two or more Q-nodes. For each group, the average Spearman and Pearson correlation scores were computed for each of the three compared measures. The results are summarized in Table 1. Additional details regarding the comparative results per gene cluster, as well as the functional categories of the genes in the cluster, are given in Tables 2 and 3. Note that the rows of Table 2 are sorted by decreasing value of the difference $Diverge - d_{reversals}$, and the first result from the table is interpreted in detail in the "A motivating example" section.

Table 1.

A comparison between our proposed rearrangement measure (Diverge), the signed break-point distance ($d_{SBP}$) as in Definition 8, and the CREx reversals distance ($d_{reversals}$) [28], based on their correlation to a taxonomical instance abundance measure

Num of Q-nodes | Correlation | Diverge | $d_{SBP}$ | $d_{reversals}$
0 | Pearson | 0.767 | 0.777 | 0.824
0 | Spearman | 0.680 | 0.673 | 0.647
1 | Pearson | 0.883 | 0.869 | 0.845
1 | Spearman | 0.775 | 0.753 | 0.697
2 | Pearson | 0.927 | 0.859 | 0.842
2 | Spearman | 0.907 | 0.845 | 0.823

Table 3.

The functional categories for the gene clusters described in Table 2, based on their COG annotations

# Gene Cluster ID Functional category
1 19876 Carbohydrate transport and metabolism—transcription
2 21344 Signal transduction mechanisms—cell wall/membrane/envelope biogenesis—defense mechanisms
3 19877 Carbohydrate transport and metabolism—general function prediction only
4 27180 Carbohydrate transport and metabolism—transcription
5 14602 Inorganic ion transport and metabolism
6 23340 Transcription—inorganic ion transport and metabolism
7 19853 Transcription—carbohydrate transport and metabolism
8 14790 Amino acid transport and metabolism—lipid transport and metabolism
9 26244 Carbohydrate transport and metabolism
10 26243 Carbohydrate transport and metabolism—transcription
11 26476 Inorganic ion transport and metabolism
12 15297 Energy production and conversion—coenzyme transport and metabolism
13 22866 Signal transduction mechanisms—cell wall/membrane/envelope biogenesis
14 26238 Carbohydrate transport and metabolism—transcription
15 18995 Carbohydrate transport and metabolism—transcription
16 28119 Carbohydrate transport and metabolism
17 25371 Coenzyme transport and metabolism—amino acid transport and metabolism
18 26231 Carbohydrate transport and metabolism
19 14796 Amino acid transport and metabolism
20 20007 Amino acid transport and metabolism
21 22299 Lipid transport and metabolism
22 19255 Posttranslational modification, protein turnover, chaperones—amino acid transport and metabolism
23 20764 Carbohydrate transport and metabolism
24 21553 Carbohydrate transport and metabolism
25 27851 Cell wall/membrane/envelope biogenesis—inorganic ion transport and metabolism
26 27427 Transcription—inorganic ion transport and metabolism
27 15467 Energy production and conversion
28 19610 Lipid transport and metabolism—general function prediction only
29 20078 Cell wall/membrane/envelope biogenesis—general function prediction only
30 22181 Energy production and conversion—posttranslational modification, protein turnover, chaperones
31 25612 Amino acid transport and metabolism
32 27177 Carbohydrate transport and metabolism
33 8962 Signal transduction mechanisms—inorganic ion transport and metabolism
34 27530 Energy production and conversion—transcription
35 19256 Amino acid transport and metabolism
36 21317 General function prediction only—posttranslational modification, protein turnover, chaperones
37 19852 Carbohydrate transport and metabolism
38 27250 Energy production and conversion
39 30976 Transcription—cell wall/membrane/envelope biogenesis—secondary metabolites biosynthesis, transport and catabolism
40 28245 Coenzyme transport and metabolism—carbohydrate transport and metabolism
41 26257 Carbohydrate transport and metabolism
42 14839 Amino acid transport and metabolism
43 27207 Cell wall/membrane/envelope biogenesis—defense mechanisms—transcription
44 12685 Posttranslational modification, protein turnover, chaperones—general function prediction only
45 21268 Defense mechanisms—signal transduction mechanisms
46 23083 Carbohydrate transport and metabolism—general function prediction only
47 25597 Amino acid transport and metabolism—posttranslational modification, protein turnover, chaperones

The first 47 gene clusters (out of 59); this table is continued in Table 4

Table 1 indicates that, in general, the rearrangement-based divergence between a reference gene order and its target gene orders correlates well with the divergence in the taxonomic distribution of their corresponding instances, which is a very interesting result on its own, supporting our choice of gene order instance assemblage-based distance as a control measure for our comparative analysis.

As expected, in gene clusters where the conserved structure is very weak or absent, the results are comparable to those of the other measures. However, the table shows that as the conserved structure of the gene cluster increases, so does our advantage over the signed break-point distance and the CREx reversal distance. In Fig. 11 we further analyse the effect of the level of colinear component co-dependency on the correlation results. Interestingly, a Kruskal–Wallis test indicates a significant difference between the correlations computed for our proposed measure Diverge across the three groups of gene clusters ($\chi^2(2)=7.537$, p-value = 0.023), while this is not the case for the other compared measures, $d_{SBP}$ ($\chi^2(2)=2.599$, p-value = 0.273) and $d_{reversals}$ ($\chi^2(2)=0.981$, p-value = 0.612).
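The group comparison reported above is a standard Kruskal–Wallis test over the per-cluster correlations of the three groups; a sketch with made-up correlation values (the real values underlie Table 1 and Fig. 11) is shown below.

```python
from scipy.stats import kruskal

# Hypothetical per-cluster Pearson correlations for the Diverge measure, split by
# the number of Q-nodes in the cluster's PQ-tree (values are illustrative only).
group_no_q_nodes  = [0.62, 0.71, 0.80, 0.75, 0.68, 0.83, 0.79, 0.77]
group_one_q_node  = [0.85, 0.90, 0.88, 0.92, 0.87, 0.91, 0.86, 0.93]
group_two_or_more = [0.94, 0.96, 0.91, 0.95, 0.93, 0.92, 0.90, 0.97]

h_statistic, p_value = kruskal(group_no_q_nodes, group_one_q_node, group_two_or_more)
print(h_statistic, p_value)  # a p-value below 0.05 indicates a significant difference
```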

Fig. 11. Distributions of Pearson correlations computed between a series of genomic rearrangement scores and the corresponding series of instance abundance indexes. A Our proposed rearrangement measure (Diverge). B Signed break-point distance ($d_{SBP}$) as in Definition 8. C CREx reversals distance ($d_{reversals}$) [28]. For each measure in A–C, the distribution is computed and shown separately per gene cluster group, as described above: "corr" (Y-axis) denotes the Pearson correlation, "num of Q-nodes" denotes the number of Q-nodes in the PQ-trees of the gene clusters belonging to the specific group

Our implementation of the first algorithm took 0.72 s to complete the analysis of the full dataset when run on a laptop with an Intel(R) Core(TM) i7-8550U CPU (1.99 GHz) using 8 GB RAM. Our implementation of the second algorithm took 5.4 min to complete the same analysis on the same laptop.

Conclusions

In this paper, we defined two (genome rearrangement-based) problems in comparative genomics, denoted TTSD (TreeToStringDivergence) and CTTSD (Constrained TTSD), where the second problem is a special case of the first one. Both problems take as input two sequences of genes $S_1$ and $S_2$, and a PQ-tree T representing the known gene orders of a gene cluster of interest, with its leaves ordered according to sequence $S_1$. TTSD also takes as input integer arguments $d_T$ and $d_S$ (we assume that in CTTSD, $d_T=d_S=0$). The objective is to reorder T as a subsequence $S_2'$ of $S_2$, allowing up to $d_T$ deletions of leaves from T, and such that $S_2'$ is obtained from $S_2$ by using up to $d_S$ deletions from $S_2$, while calculating a corresponding score that serves as the objective divergence measure.

We proposed an algorithm that solves CTTSD in $O^*(1.381^\gamma)$ time and $O^*(1)$ space, where the parameter $\gamma$ is the maximum degree of a P-node in T and $O^*$ is used to hide polynomial factors in the input size. In the special case where the jump penalty is set to 0, the time complexity of our proposed algorithm is $O^*(1)$. In addition, we proposed a parameterized algorithm that solves TTSD in $O^*(5^\gamma)$ time and $O^*(5^\gamma)$ space. Lastly, we proposed a parameterized algorithm that solves a variant of TTSD, where the jump penalty is set to 0, and that reduces the space complexity of the prior algorithm using the inclusion–exclusion principle. This algorithm takes $O^*(4^\gamma)$ time and $O^*(1)$ space.

The proposed general algorithm was implemented as a software tool and applied to the comparative and evolutionary analysis of 59 chromosomal gene clusters extracted from a dataset of 1487 prokaryotic genomes. Our preliminary results, based on the analysis of the 59 gene clusters, indicate that our proposed measure correlates well with an instance-abundance index that is computed by comparing the class composition of the genomic instances of two compared gene orders. Comparative analysis versus two extant methods (using very preliminary and simple scoring parameter values for our proposed engine) yields equivalent results in terms of the average correlations to the instance-abundance index values obtained for the benchmark dataset. In future work we propose to assemble a large dataset of prokaryotic gene clusters (large enough to train a more sophisticated scoring scheme model) and use it to train our scoring scheme parameters.

Our proposed measure, however, is shown to more sensitively capture and utilize the conserved structural information characterizing a gene cluster: the statistical test we conducted indicates that as the structural information of a gene cluster, as encoded by its PQ-tree, increases, our proposed measure significantly improves in terms of computing rearrangement scenarios between gene orders belonging to the cluster. This is in contrast to the other two extant measures tested in the experiment.

One of the downsides of using PQ-trees to represent gene clusters is that very rare gene orders taken into account in the tree construction could greatly increase the number of allowed rearrangements and thus substantially lower the specificity of the PQ-tree. Thus, a natural continuation of our research would be to increase the specificity of the model by considering a stochastic variation of the algorithms presented in this paper, or alternatively to identify and exclude gene order outliers during PQ-tree construction. In addition, future extensions of this work could also aim to increase the sensitivity of the model by taking into account gene duplications (or at least tandem duplications) and gene-fusion events, which are typical events in gene cluster evolution.

Acknowledgements

Many thanks to the anonymous WABI reviewers for their very helpful comments.

Appendix

See Tables 2, 3, 4.

Table 4.

The functional categories for the gene clusters described in Table 2, based on their COG annotations

# Gene cluster ID Functional category
48 19005 Carbohydrate transport and metabolism
49 15035 Cell motility
50 28547 Cell wall/membrane/envelope biogenesis
51 25375 Coenzyme transport and metabolism
52 20332 Amino acid transport and metabolism—energy production and conversion
53 26364 Inorganic ion transport and metabolism
54 19909 Amino acid transport and metabolism
55 20601 Inorganic ion transport and metabolism—coenzyme transport and metabolism
56 15391 Amino acid transport and metabolism
57 14573 Transcription—carbohydrate transport and metabolism
58 25554 Amino acid transport and metabolism
59 19526 Cell motility

The last 12 gene clusters (out of 59); this table continues Table 3

Author contributions

MZ and MZU initiated and guided the research. EO developed and implemented the presented algorithms, and participated in the application of the algorithms to the experimental benchmarks. EO developed the bioinformatic pipeline and performed the experiments and the data analysis. EO wrote the paper, with guidance by MZ and MZU. All authors read and approved the final manuscript.

Funding

The research of EO was partially supported by the Planning and Budgeting Committee of the Council for Higher Education in Israel and by the Frankel Center for Computer Science at Ben Gurion University. The research of EO, MZ and MZU was partially supported by the Israel Science Foundation and by the European Research Council (ERC) Starting Grant titled PARAPATH.

Availability of data and materials

The code for our tool, the data used in the experiments, and the log file produced by the run of the reported benchmark, can be found on GitHub: http://www.github.com/edenozery/MEM-Rearrange.

Declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Footnotes

1

That is, the character in position i in G is the same as the character in position j in H.

2

S(x) and S(x) are strings of colors (which are unique), thus there is only one gene mapping of S(x) and S(x).

3

The weight of a set is the sum of the weights of its vertices.

4

A vertex cover is a set $S \subseteq V$ such that every edge of G has at least one endpoint in S.

5

Note that uniqueness follows from our use of colors.

6

That is, given a P-node x, a subset $C \subseteq children(x)$, a string S, bounds on the number of deletions from the tree T and the string S, $d_T$ and $d_S$, respectively, and a set of derivations D, it returns the number of all partial derivations of x(C) to every prefix of S for each combination $(k_T, k_S, pos, score)$ having exactly $k_T$ deletions from the tree, $k_S$ deletions from the string, pos leaves with positive sign, whose divergence value is at most score, and that use $\gamma$ children from C (including repetitions).

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Tatusova T, Ciufo S, Fedorov B, O'Neill K, Tolstoy I. Refseq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 2014;42(D1):553–559. doi: 10.1093/nar/gkt1274.
2. Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, Gillespie JJ, Gough R, Hix D, Kenyon R, et al. Patric, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 2014;42(D1):581–591. doi: 10.1093/nar/gkt1099.
3. Gatt YE, Margalit H. Common adaptive strategies underlie within-host evolution of bacterial pathogens. Mol Biol Evol. 2021;38(3):1101–1121. doi: 10.1093/molbev/msaa278.
4. Alm E, Huang K, Arkin A. The evolution of two-component systems in bacteria reveals different strategies for niche adaptation. PLoS Comput Biol. 2006;2(11):143. doi: 10.1371/journal.pcbi.0020143.
5. Booth KS, Lueker GS. Testing for the consecutive ones property, interval graphs, and graph planarity using PQ-tree algorithms. J Comput Syst Sci. 1976;13(3):335–379. doi: 10.1016/S0022-0000(76)80045-1.
6. Bergeron A, Gingras Y, Chauve C. Formal models of gene clusters. In: Bioinformatics algorithms: techniques and applications. John Wiley and Sons, Inc; 2008. p. 177–202.
7. Zimerman GR, Svetlitsky D, Zehavi M, Ziv-Ukelson M. Approximate search for known gene clusters in new genomes using PQ-trees; 2020. arXiv preprint arXiv:2007.03589.
8. Landau GM, Parida L, Weimann O. Gene proximity analysis across whole genomes via PQ trees. J Comput Biol. 2005;12(10):1289–1306. doi: 10.1089/cmb.2005.12.1289.
9. Fondi M, Emiliani G, Fani R. Origin and evolution of operons and metabolic pathways. Res Microbiol. 2009;160(7):502–512. doi: 10.1016/j.resmic.2009.05.001.
10. Wells JN, Bergendahl LT, Marsh JA. Operon gene order is optimized for ordered protein complex assembly. Cell Rep. 2016;14(4):679–685. doi: 10.1016/j.celrep.2015.12.085.
11. Fertin G, Labarre A, Rusu I, Vialette S, Tannier E. Combinatorics of genome rearrangements. MIT Press; 2009.
12. Braga MD, Sagot M-F, Scornavacca C, Tannier E. Exploring the solution space of sorting by reversals, with experiments and an application to evolution. IEEE/ACM Trans Comput Biol Bioinform. 2008;5(3):348–356. doi: 10.1109/TCBB.2008.16.
13. Darling AE, Miklós I, Ragan MA. Dynamics of genome rearrangement in bacterial populations. PLoS Genet. 2008;4(7):1000128. doi: 10.1371/journal.pgen.1000128.
14. Lemaitre C, Braga MD, Gautier C, Sagot M-F, Tannier E, Marais GA. Footprints of inversions at present and past pseudoautosomal boundaries in human sex chromosomes. Genome Biol Evol. 2009;1:56–66. doi: 10.1093/gbe/evp006.
15. Hannenhalli S, Pevzner PA. Transforming cabbage into turnip: polynomial algorithm for sorting signed permutations by reversals. J ACM (JACM). 1999;46(1):1–27. doi: 10.1145/300515.300516.
16. Bergeron A, Mixtacki J, Stoye J. The inversion distance problem. In: Mathematics of evolution and phylogeny. Oxford University Press; 2005.
17. Tannier E, Bergeron A, Sagot M-F. Advances on sorting by reversals. Discr Appl Math. 2007;155(6–7):881–888. doi: 10.1016/j.dam.2005.02.033.
18. Hannenhalli S, Pevzner PA. Transforming men into mice (polynomial algorithm for genomic distance problem). In: Proceedings of IEEE 36th annual foundations of computer science. IEEE; 1995. p. 581–92.
19. Jean G, Nikolski M. Genome rearrangements: a correct algorithm for optimal capping. Inf Process Lett. 2007;104(1):14–20. doi: 10.1016/j.ipl.2007.04.011.
20. Yancopoulos S, Attie O, Friedberg R. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005;21(16):3340–3346. doi: 10.1093/bioinformatics/bti535.
21. Figeac M, Varré J-S. Sorting by reversals with common intervals. In: International workshop on algorithms in bioinformatics. Springer; 2004. p. 26–37.
22. Bérard S, Bergeron A, Chauve C. Conservation of combinatorial structures in evolution scenarios. In: RECOMB workshop on comparative genomics. Springer; 2004. p. 1–14.
23. Bérard S, Bergeron A, Chauve C, Paul C. Perfect sorting by reversals is not always difficult. IEEE/ACM Trans Comput Biol Bioinform. 2007;4:4–16. doi: 10.1145/1229968.1229972.
24. Sagot M-F, Tannier E. Perfect sorting by reversals. In: International computing and combinatorics conference. Springer; 2005. p. 42–51.
25. Diekmann Y, Sagot M-F, Tannier E. Evolution under reversals: parsimony and conservation of common intervals. IEEE/ACM Trans Comput Biol Bioinform. 2007;4(2):301–309. doi: 10.1109/TCBB.2007.1042.
26. Bérard S, Chauve C, Paul C. A more efficient algorithm for perfect sorting by reversals. Inf Process Lett. 2008;106(3):90–95. doi: 10.1016/j.ipl.2007.10.012.
27. Hartmann T, Bernt M, Middendorf M. An exact algorithm for sorting by weighted preserving genome rearrangements. IEEE/ACM Trans Comput Biol Bioinform. 2018;16(1):52–62. doi: 10.1109/TCBB.2018.2831661.
28. Bernt M, Merkle D, Ramsch K, Fritzsch G, Perseke M, Bernhard D, Schlegel M, Stadler PF, Middendorf M. CREx: inferring genomic rearrangements based on common intervals. Bioinformatics. 2007;23(21):2957–2958. doi: 10.1093/bioinformatics/btm468.
29. Bérard S, Chateau A, Chauve C, Paul C, Tannier E. Computation of perfect DCJ rearrangement scenarios with linear and circular chromosomes. J Comput Biol. 2009;16(10):1287–1309. doi: 10.1089/cmb.2009.0088.
30. Ling X, He X, Xin D. Detecting gene clusters under evolutionary constraint in a large number of genomes. Bioinformatics. 2009;25(5):571–577. doi: 10.1093/bioinformatics/btp027.
31. Winter S, Jahn K, Wehner S, Kuchenbecker L, Marz M, Stoye J, Böcker S. Finding approximate gene clusters with Gecko 3. Nucleic Acids Res. 2016;44(20):9600–9610. doi: 10.1093/nar/gkw843.
32. Svetlitsky D, Dagan T, Ziv-Ukelson M. Discovery of multi-operon colinear syntenic blocks in microbial genomes. Bioinformatics. 2020;36(Supplement-1):21–29. doi: 10.1093/bioinformatics/btaa503.
33. Chandravanshi M, Gogoi P, Kanaujia SP. Structural and thermodynamic correlation illuminates the selective transport mechanism of disaccharide α-glycosides through ABC transporter. FEBS J. 2020;287(8):1576–1597. doi: 10.1111/febs.15093.
34. Gopal S, Berg D, Hagen N, Schriefer E-M, Stoll R, Goebel W, Kreft J. Maltose and maltodextrin utilization by Listeria monocytogenes depend on an inducible ABC transporter which is repressed by glucose. PLoS ONE. 2010;5(4):10349. doi: 10.1371/journal.pone.0010349.
35. Marsh JA, Hernández H, Hall Z, Ahnert SE, Perica T, Robinson CV, Teichmann SA. Protein complexes are under evolutionary selection to assemble via ordered pathways. Cell. 2013;153(2):461–470. doi: 10.1016/j.cell.2013.02.044.
36. Jackson EN, Yanofsky C. The region between the operator and first structural gene of the tryptophan operon of Escherichia coli may have a regulatory function. J Mol Biol. 1973;76(1):89–101. doi: 10.1016/0022-2836(73)90082-X.
37. Plumbridge J. Regulation of the utilization of amino sugars by Escherichia coli and Bacillus subtilis: same genes, different control. Microbial Physiology. 2015;25(2–3):154–67. doi: 10.1159/000369583.
38. Uno T, Yagiura M. Fast algorithms to enumerate all common intervals of two permutations. Algorithmica. 2000;26(2):290–309. doi: 10.1007/s004539910014.
39. Heber S, Stoye J. Algorithms for finding gene clusters. In: International workshop on algorithms in bioinformatics. Springer; 2001. p. 252–63.
40. Bergeron A, Corteel S, Raffinot M. The algorithmic of gene teams. In: International workshop on algorithms in bioinformatics. Springer; 2002. p. 464–76.
41. Eres R, Landau GM, Parida L. A combinatorial approach to automatic discovery of cluster-patterns. In: WABI. Springer; 2003. p. 139–50.
42. Schmidt T, Stoye J. Quadratic time algorithms for finding common intervals in two and more sequences. In: Combinatorial pattern matching. Springer; 2004. p. 347–58.
43. Jiang H, Chauve C, Zhu B. Breakpoint distance and PQ-trees. In: Annual symposium on combinatorial pattern matching. Springer; 2010. p. 112–24.
44. Shachnai H, Zehavi M. A multivariate framework for weighted FPT algorithms. J Comput Syst Sci. 2017;89:157–189. doi: 10.1016/j.jcss.2017.05.003.
45. Svetlitsky D, Dagan T, Ziv-Ukelson M. Discovery of multi-operon colinear syntenic blocks in microbial genomes. Bioinformatics. 2020. doi: 10.1093/bioinformatics/btaa503.
46. Zimerman GR. The PQFinder tool. www.github.com/GaliaZim/PQFinder.
47. Magurran AE. Measuring biological diversity. Curr Biol. 2021;31(19):1174–1177. doi: 10.1016/j.cub.2021.07.049.
48. Chao A, Chazdon RL, Colwell RK, Shen T-J. Abundance-based similarity indices and their estimation when there are unseen species in samples. Biometrics. 2006;62(2):361–371. doi: 10.1111/j.1541-0420.2005.00489.x.
