Abstract
Background
The evolutionary distance between two genomes can be estimated by computing a minimum length sequence of operations, called genome rearrangements, that transform one genome into another. Usually, a genome is modeled as an ordered sequence of genes, and most studies in the genome rearrangement literature shape biological scenarios into mathematical models, for instance by allowing different genome rearrangement operations at the same time, by adding constraints to these rearrangements (e.g., each rearrangement can affect at most a given number of genes), or by assuming that a rearrangement has a cost that depends on its length rather than a unit cost. Most of these works, however, have overlooked some important features inside genomes, such as the presence of sequences of nucleotides between genes, called intergenic regions.
Results and conclusions
In this work, we investigate the problem of computing the distance between two genomes, taking into account both gene order and intergenic sizes. The genome rearrangement operations we consider here are constrained types of reversals and transpositions, called super short reversals (SSRs) and super short transpositions (SSTs), which affect up to two (consecutive) genes. We denote by super short operations (SSOs) any SSR or SST. We show 3-approximation algorithms for the cases in which gene orientation is not considered and SSRs, SSTs, or SSOs are allowed, and 5-approximation algorithms for the cases in which gene orientation is considered and either SSRs or SSOs are allowed. We also show that these algorithms improve their approximation factors when the input permutation has a higher number of inversions, where the approximation factor decreases from 3 to either 2 or 1.5, and from 5 to either 3 or 2.
Keywords: Genome rearrangements, Intergenic regions, Super short operations, Approximation algorithms
Background
Given two genomes, one way to estimate their evolutionary distance is to compute the minimum number of large scale events, called genome rearrangements, that are needed to transform one genome into the other. The minimality is required due to the commonly accepted parsimony principle, while the allowed genome rearrangements depend on the model, i.e. on the classes of events that supposedly happen during evolution.
Prior to counting rearrangement events, one needs to model the input genomes. Previous works [1–3] have defined genomes as ordered sequences of elements (genes). Variants within this setting can occur. For instance, each gene may appear either once or several times in a genome. In the latter case, genomes are modeled as strings, while in the former case they are modeled as permutations. Besides, genomes modeled as permutations may be signed or unsigned (the sign of an element represents the orientation of that gene in the DNA strand it lies on).
Concerning genome rearrangements, the most commonly studied events are reversal, which consists in taking a continuous sequence in the genome, reversing it, and putting it back at the same location [4], and transposition, which consists in taking a continuous sequence in the genome and putting it back in a different location [5]. A more recent and general type of genome rearrangement is the DCJ (Double-Cut and Join) [3], that cuts a genome between adjacent genes a and b, and adjacent genes c and d, and joins either a to c and b to d, or a to d and b to c.
Since the mid-nineties, a very large amount of work has been done on computing distances between pairs of genomes, depending on the genome model and the allowed set of rearrangements. We refer the reader to the book by Fertin et al. [6] for a survey of algorithmic aspects.
In populations where rearrangement events that affect a very large portion of the genes are rare, we can restrict events to span at most k genes, for some value of k [7, 8]. In an analysis of closely related pairs of bacterial genomes, the number of short inversions (inversions affecting up to three genes) was found to be very high, especially inversions of a single gene (which we call 1-reversals) [9]. Other works also show the prevalence of short inversions in bacterial genomes [10] and eukaryotic genomes [11, 12].
As previously mentioned, most of the works have assumed that a genome is an ordered sequence of genes. It has been argued that this model could underestimate the “true” evolutionary distance, and that other genome features should be taken into account to circumvent this problem [13, 14]. Indeed, genomes carry more information than just their ordered sequences of genes. In particular, consecutive genes are separated by DNA sequences called intergenic regions, each having different lengths in terms of number of nucleotides. These lengths may be used along with gene order to generate a more realistic model for genomes.
This recently led some authors to model a genome as an ordered sequence of genes, together with an ordered list of its intergenic sizes, and to consider the problem of computing the DCJ distance, either in the case where insertions and deletions of nucleotides are forbidden [15], or allowed [16].
Biller and coauthors [13] used the intergenic regions to define what they called fragile regions, regions where rearrangements are more likely to act. After identifying these fragile regions, practical tests showed that considering rearrangements on non-fragile regions can yield incoherent distance estimations. When using the equiprobable model (i.e., rearrangements can occur in any position with the same probability), practical tests [16] showed that statistical properties of the inferred scenarios for DCJs using intergenic regions are closer to the true ones than scenarios which do not use them.
In this work, we also consider genomes as ordered sequences of genes together with their intergenic sizes, in cases where the gene sequence is an unsigned or signed permutation and the considered rearrangement operations are super short reversal (or SSR, i.e. a reversal of (gene) length at most two), super short transposition (or SST, i.e. a transposition affecting only two genes), or both (super short operation or SSO). In this context, our goal is to determine the minimum number of SSRs/SSTs/SSOs that transform one genome into another.
This paper is organized as follows. In Section 2 we provide the notations that we will use throughout the paper, and we introduce novel ideas that will prove useful for studying the problem. In sections 3-7 we derive lower and upper bounds on the sought distance for five different variants, which help us design an approximation algorithm of constant factor for each of these five problems. Section 8 presents a practical analysis of the five algorithms on simulated instances. Section 9 concludes the paper.
Definitions
We represent a genome with n genes as an instance with (i) an n-tuple and (ii) intergenic regions. If there are no duplicated genes, the n-tuple is a (possibly signed) permutation , with , for , and if, and only if, . If gene orientation is known, each element from has a + or − sign that indicates the orientation of the gene it represents, and we say that is a signed permutation; is an unsigned permutation otherwise.
We denote by the identity permutation, in which all elements are in ascending order and with positive signs. The extended permutation is obtained from by adding two new elements: and .
The intergenic region is located before element from the extended permutation , for . We denote by the length of intergenic region , i.e., the number of nucleotides in , with for . Let . An instance here is then formed by .
A reversal applied over an instance , with , , , and , is an operation that generates by (i) reversing the order and the orientation of the elements in the subset of adjacent elements ; (ii) reversing the order of intergenic regions in the subset of adjacent intergenic regions when ; (iii) cutting two intergenic regions: after x nucleotides and after y nucleotides such that and
A reversal is also called a g-reversal, where . A super short reversal is a 1-reversal or a 2-reversal, i.e. a reversal that affects only one or two elements from .
A transposition applied over an instance , with , , , , and , is an operation that generates by (i) exchanging subsets of adjacent elements and ; (ii) moving subsets of adjacent intergenic regions and to start at positions and , respectively; (iii) cutting three intergenic regions: , , and such that , , and .
A transposition is called a g-transposition, where , and we say that a g-transposition is super short if .
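The cut-and-rejoin effect of these operations on intergenic sizes can be illustrated with a minimal sketch. The formulas in the definitions above are not fully legible here, so the update rules below are an assumption derived from physically cutting each affected region into two pieces and rejoining them; the function names, the list-based representation (`inter[i - 1]` holds the size of the region before the i-th element), and the sample values are hypothetical.

```python
def apply_reversal(perm, inter, i, j, x, y, signed=False):
    """Reverse elements i..j (1-indexed), cutting the region before
    element i after x nucleotides and the region after element j
    after y nucleotides (assumed reading of the definition)."""
    n = len(perm)
    assert len(inter) == n + 1 and 1 <= i <= j <= n
    left, right = inter[i - 1], inter[j]
    assert 0 <= x <= left and 0 <= y <= right
    segment = perm[i - 1:j][::-1]
    if signed:
        segment = [-e for e in segment]  # orientation flips too
    new_perm = perm[:i - 1] + segment + perm[j:]
    inner = inter[i:j][::-1]  # regions strictly inside the segment
    new_inter = (inter[:i - 1] + [x + y] + inner
                 + [(left - x) + (right - y)] + inter[j + 1:])
    return new_perm, new_inter


def apply_2_transposition(perm, inter, i, x, y, z):
    """Exchange elements i and i+1 (1-indexed), cutting the three
    surrounding regions after x, y and z nucleotides."""
    n = len(perm)
    assert len(inter) == n + 1 and 1 <= i <= n - 1
    a, b, c = inter[i - 1], inter[i], inter[i + 1]
    assert 0 <= x <= a and 0 <= y <= b and 0 <= z <= c
    new_perm = perm[:i - 1] + [perm[i], perm[i - 1]] + perm[i + 1:]
    new_inter = (inter[:i - 1]
                 + [x + (b - y), z + (a - x), y + (c - z)]
                 + inter[i + 2:])
    return new_perm, new_inter


# A 2-reversal followed by a 2-transposition on a hypothetical instance.
p1, r1 = apply_reversal([1, 3, 4, 2, 5], [3, 5, 2, 7, 4, 1], 3, 4, 1, 2)
p2, r2 = apply_2_transposition(p1, r1, 2, 2, 1, 4)
```

In this sketch every operation preserves the total number of nucleotides, which is why the source and target instances are required to have equal total intergenic length.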
Figure 1 shows a sequence of two super short reversals and one super short transposition that transforms the permutation with into with .
Fig. 1.
A sequence of two super short reversals and one super short transposition that transforms , with into , with . Intergenic regions are represented by rectangles, whose dimensions vary according to their sizes. The 1-reversal applied in a transforms into , and it cuts after position 2 at and after position 7 at , resulting in , , and . The 2-reversal applied in b transforms into , and it cuts after position 1 at and after position 5 at , resulting in , , and . Finally, the 2-transposition applied in c transforms into , and it cuts in position 0 at , after position 4 at , and after position 1 at , resulting in , , and . as shown in d
Given an instance , a pair of elements from is called an inversion if and , with . We denote the number of inversions in a permutation by . For the example in Fig. 1a, pairs (3, 2) and (4, 2) are the only inversions, thus the permutation has exactly two inversions.
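As a small illustration of this definition, the sketch below lists the inversions of a permutation by comparing every pair of positions. The permutation used is a hypothetical one chosen so that its only inversions are (3, 2) and (4, 2); it is not necessarily the one drawn in Fig. 1a. Absolute values are compared so that the same function also accepts signed permutations, which is an assumption about how inversions extend to the signed case.

```python
def inversions(perm):
    """Return all pairs (perm[i], perm[j]) with i < j whose absolute
    values are out of order."""
    return [
        (perm[i], perm[j])
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if abs(perm[i]) > abs(perm[j])
    ]


example = inversions([1, 3, 4, 2, 5])
```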
Given two instances and representing genomes and respectively such that and have the same number of elements, is the imbalance between intergenic regions and , with .
Given two instances and such that (i) and have the same number of elements and (ii) , let denote the cumulative sum of imbalances between intergenic regions of and from positions 1 to j, with . Since , we have that .
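The imbalances and their cumulative sums can be computed as in the following sketch. The sign convention (source size minus target size) is an assumption, since the formula is not legible above; the opposite convention would only negate every value. The function names and sample sizes are hypothetical.

```python
from itertools import accumulate


def imbalances(inter, target_inter):
    """Per-region imbalance, assumed here to be source minus target."""
    assert len(inter) == len(target_inter)
    return [s - t for s, t in zip(inter, target_inter)]


def cumulative_imbalances(inter, target_inter):
    """Running sum of imbalances from the first region to region j."""
    return list(accumulate(imbalances(inter, target_inter)))


source = [3, 5, 2, 7, 4, 1]
target = [4, 4, 4, 4, 3, 3]
psi = cumulative_imbalances(source, target)
```

Because both instances have the same total intergenic length, the last cumulative sum is zero under either sign convention.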
From now on, we will consider that (i) the target permutation is such that ; (ii) and have the same number of elements; and (iii) . By doing this, we can compute the distance of , denoted by , that consists in finding the minimum number of super short operations that sorts and transforms into .
Let and be two instances such that and have the same number of elements and . The intergenic graph, denoted by , is such that V is composed by two sets of vertices: intergenic vertices (one for each ), and permutation vertices (one for each of the extended permutation ). The set E is composed by inversion edges: an edge if there is a such that or is an inversion, with and .
We divide vertices of an intergenic graph into components. A component starts and ends with permutation vertices. Besides, the first component starts with the permutation vertex , and the last component ends with the permutation vertex . Consecutive components share exactly one permutation vertex, i.e., the last permutation vertex of a component is the first permutation vertex of its adjacent component to the right.
If a component c starts with vertex and ends with vertex , with , then for and for . Besides, any two intergenic vertices that are connected to each other by an inversion edge must belong to the same component. Thus, if c ends with , then .
The idea is that components break according to into smaller pieces, where it is possible to make a local redistribution of intergenic regions and elements from (with no need to exchange them between components) transforming into . This requires that any component c starting with and ending with must have .
Formally, given an intergenic graph , a component c is a minimal set of vertices from V in which: (i) any two intergenic vertices that are connected to each other by an inversion edge must belong to the same component, (ii) if , with , then and for any , and (iii) the sum of imbalances of its intergenic regions from with respect to is equal to zero, i.e. .
A component with one intergenic vertex is called trivial, and is called non-trivial otherwise. The number of intergenic vertices in a component c is denoted by . A component c is odd if is odd, and it is even otherwise. The number of components in an intergenic graph is denoted by , the number of odd components is denoted by , and the number of even components is denoted by . Figure 2 shows three examples of intergenic graphs.
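One way to read the component definition is sketched below. It is an assumption, not the authors' procedure: a permutation vertex at position p is treated as a component boundary when the element at p is already in place, no inversion crosses it (the largest element seen so far equals p), and the cumulative imbalance up to p is zero. The function name, the (first region, last region) output format, and the sign convention for imbalances are hypothetical.

```python
def components(perm, inter, target_inter):
    """Split intergenic regions 1..n+1 into components, returned as
    (first_region, last_region) pairs with 1-indexed region numbers."""
    n = len(perm)
    assert len(inter) == len(target_inter) == n + 1
    boundaries = [0]  # position 0 is the added element 0
    prefix_max = 0
    psi = 0  # cumulative imbalance up to the current region
    for p in range(1, n + 2):
        psi += inter[p - 1] - target_inter[p - 1]
        value = abs(perm[p - 1]) if p <= n else n + 1
        prefix_max = max(prefix_max, value)
        if value == p and prefix_max == p and psi == 0:
            boundaries.append(p)
    return [(a + 1, b) for a, b in zip(boundaries, boundaries[1:])]


def is_odd(component):
    """A component is odd if it has an odd number of regions."""
    first, last = component
    return (last - first + 1) % 2 == 1


with_inversion = components([2, 1, 3], [1, 2, 3, 4], [1, 2, 3, 4])
unbalanced = components([1, 2, 3], [2, 1, 3, 4], [1, 2, 3, 4])
sorted_instance = components([1, 2, 3], [1, 2, 3, 4], [1, 2, 3, 4])
```

Under this reading, an instance equal to the target has only trivial components, one per intergenic region.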
Fig. 2.
Intergenic graphs , , and , with , 6, 4, 12, 8, 13, 9, 2), , , , and . Black squares represent intergenic vertices, and the number inside it indicate their sizes. Rounded rectangles in blue represent components. Note that in (a) there are three edges in , and , and since there are five intergenic vertices in and three intergenic vertices in . We also have in (a) all values for and , with . The instance is the result of applying to . In (b) we can see that compared to , has one more component, and was removed. In (c) we can see that when we reach the target instance the number of components is equal to the number of intergenic regions in (i.e., )
In the next two lemmas, we analyze the impact of applying super short operations on the number of components.
Lemma 1
Given an instance and a target instance , let be the resulting instance after applying a 1-reversal. It follows that .
Proof
Recall that a 1-reversal is applied over intergenic regions and , with . Besides, since 1-reversals do not create nor remove inversions from , intergenic graphs and satisfy .
If and , this 1-reversal is applied over two different components, which means that is the last intergenic region of c, so . If , we have that , as shown in Fig. 3a.
Fig. 3.
Example of intergenic graphs for all possible values of with respect to where is the resulting instance after applying a 1-reversal to . When the 1-reversal is applied over two components at the same time and , , as shown in a. Otherwise, we have that , , as shown in b and c
Consider now that . If and , or and , then . Otherwise, we have two cases to consider: , if (as shown in Fig. 3b); and if (as shown in Fig. 3c).
Lemma 2
Given an instance and a target instance , let be the resulting instance after applying either a 2-reversal or a 2-transposition. It follows that .
Proof
If a 2-reversal or 2-transposition is applied to intergenic regions of two different components in , then we are necessarily creating a new inversion, and the graph has either (as shown in Fig. 4a) or (as shown in Fig. 4b).
Fig. 4.
Example of intergenic graphs for all possible values of with respect to where is the resulting instance after applying a 2-reversal or a 2-transposition to . When the 2-reversal or 2-transposition is applied over two components at the same time , as shown in a and b. Otherwise, we have that , as shown in c–e
Consider now that this operation is applied to intergenic regions of a same component in , and exchanges elements and , with . If the intergenic graph has , then . Otherwise, we have three cases to consider:
, if and (as shown in Fig. 4c);
if either or (as shown in Fig. 4d);
otherwise (as shown in Fig. 4e).
In the following sections, we will explore five different problems concerning super short operations but also considering intergenic regions, namely Sorting by Super Short Reversals (SbSSR), Sorting by Super Short Transpositions (SbSST), Sorting by Super Short Reversals and Super Short Transpositions (SbSSO), Sorting by Signed Super Short Reversals (SbSigSSR), and Sorting by Signed Super Short Reversals and Super Short Transpositions (SbSigSSO). Table 1 summarizes our results concerning general permutations (GP), permutations with inversions for some (IP), permutations with at least n inversions (1IP), and permutations with at least 2n inversions (2IP).
Table 1.
Summary of the approximation factor of the approximation algorithms presented in this manuscript for general permutations (GP), permutations with inversions for some (IP), permutations with at least n inversions (1IP), and permutations with at least 2n inversions (2IP)
| Sorting problem | GP | IP | 1IP | 2IP |
|---|---|---|---|---|
| SbSSR | 3 | | 2 | 1.5 |
| SbSST | 3 | | 2 | 1.5 |
| SbSSO | 3 | | 2 | 1.5 |
| SbSigSSR | 5 | | 3 | 2 |
| SbSigSSO | 5 | | 3 | 2 |
Sorting by Super Short Reversals
In this section, we analyze the version of the problem when only super short reversals (i.e., 1-reversals and 2-reversals) are allowed to transform into . First, we state that if a non-trivial component c of an intergenic graph has no edge (i.e., there is no inversion inside c), then it is always possible to split c into two components with a 1-reversal.
Lemma 3
If a component c of an intergenic graph with contains no edge, then there is always a pair of consecutive intergenic regions to which we can apply a 1-reversal that splits c into two components and such that .
Proof
Let be the index in of the i-th intergenic region inside component c. The last intergenic region of c is at position . By definition of a component, and since c contains no edge, for any we have that . Note that since we have that .
If , let and let k be the index of element from located right after . Apply the reversal .
Otherwise, we have that , and we need to find two intergenic regions and for such that and . Since, by definition of a component, , such pair always exists. Let k be the index of element from located right after . Apply the reversal .
In both cases, the resulting permutation has , , and for any we have that ; thus, as before, all intergenic regions from to must belong to the same component.
This 1-reversal splits c into two components: with all intergenic regions in positions to , and with all intergenic regions in positions to .
Let denote the cumulative sum of imbalances of intergenic regions from and in odd positions only. Using Lemmas 1, 2 and 3, we show in the following two lemmas the minimum and maximum number of super short reversals needed to transform into and into .
Lemma 4
Let be an instance, be the target instance, m the number of intergenic regions in and , and let if and otherwise. It follows that .
Proof
In order to sort , we need to remove all inversions, and since a 2-reversal can remove only one inversion, we have that . Besides, since 2-reversals exchange material between intergenic regions of same parity only, then , with if (in this case we will need at least one 1-reversal to exchange material between an intergenic region located at an odd position and an intergenic region located at an even position), and otherwise.
On the other hand, by Lemmas 1 and 2, we can increase the number of components by at most two with a super short reversal, so to reach m trivial components we need at least super short reversals.
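The inversion part of this argument can be checked on small cases with a minimal sketch: exchanging two adjacent elements that are out of order removes exactly one inversion, so sorting with such exchanges alone uses as many of them as there are inversions. The code ignores intergenic regions entirely and uses hypothetical function names.

```python
from itertools import permutations


def count_inversions(perm):
    return sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )


def sort_by_adjacent_swaps(perm):
    """Sort by exchanging adjacent out-of-order elements; return the
    sorted list and the 1-indexed positions where swaps happened."""
    perm = list(perm)
    swaps = []
    changed = True
    while changed:
        changed = False
        for i in range(len(perm) - 1):
            if perm[i] > perm[i + 1]:
                perm[i], perm[i + 1] = perm[i + 1], perm[i]
                swaps.append(i + 1)
                changed = True
    return perm, swaps


# Exhaustive check over all permutations of five elements.
all_match = all(
    len(sort_by_adjacent_swaps(p)[1]) == count_inversions(p)
    for p in permutations(range(1, 6))
)
```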
Lemma 5
Let be an instance, be the target instance, and let m be the number of intergenic regions in and . We have that .
Proof
While , has at least one pair of consecutive elements that is an inversion. Suppose that we first remove all inversions from using 2-reversals of type i.e., without modifying its intergenic regions lengths. Let be the resulting permutation, that has . The number of components in cannot be smaller than , since any 2-reversal removing an inversion is applied inside a same component. By Lemma 3, we can go from to m components using 1-reversals, which results in no more than 1-reversals.
Finally, using Lemmas 4 and 5, we prove that it is possible to obtain a 3-approximation for this problem.
Theorem 1
Let be an instance, be the target instance, and let be the number of intergenic regions in and . The value of is 3-approximable.
Proof
Let , and let , if , or otherwise. If then, by Lemma 4, , and, by Lemma 5, . Otherwise, , so . By Lemma 4, , and, by Lemma 5, .
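The factor of 3 can be sanity-checked numerically with the sketch below. The exact expressions of Lemmas 4 and 5 are not legible above, so the bounds used here are an assumption reconstructed from the prose of their proofs: at least one operation per inversion and at least one operation per two missing components for the lower bound, and one 2-reversal per inversion plus one 1-reversal per missing component for the upper bound. The parity-dependent term of Lemma 4 is left out, which can only make the lower bound smaller. Here `inv` is the number of inversions, `m` the number of intergenic regions, and `c` the number of components; all names are hypothetical.

```python
from math import ceil


def ssr_bounds(inv, m, c):
    """Return (lower, upper) bounds on the number of super short
    reversals under the assumptions stated in the text above."""
    assert 1 <= c <= m and inv >= 0
    lower = max(inv, ceil((m - c) / 2))
    upper = inv + (m - c)
    return lower, upper


# The upper bound never exceeds three times the lower bound.
within_factor_three = all(
    ssr_bounds(inv, m, c)[1] <= 3 * ssr_bounds(inv, m, c)[0]
    for inv in range(0, 40)
    for m in range(1, 25)
    for c in range(1, m + 1)
)
```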
Algorithm 1 describes a 3-approximation algorithm that transforms an instance into using super short reversals. Computing takes and it takes up to to build . Computing and take O(n), and it takes constant time to update them. The while loop in line 5 (resp. line 14) iterates up to times, so the overall complexity of Algorithm 1 is .

Let denote the set of all permutations with n elements, and let denote the number of all permutations such that . For there are 762,007 permutations in , which corresponds to 0.16% of the 12! permutations from , and for the number of permutations in never corresponds to more than 0.05% of the n! permutations from [17]. Besides, for the number of permutations in never exceeds 0.03% of the n! permutations from [17].
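The count quoted for 12 elements can be reproduced by counting permutations according to their number of inversions. The exact condition on the number of inversions is not legible above; the threshold "at most n inversions" used below is an assumption, chosen because it reproduces the quoted figure of 762,007 for n = 12. The function name is hypothetical.

```python
from math import factorial


def count_by_inversions(n, max_inv):
    """counts[k] = number of permutations of n elements with exactly
    k inversions, for k = 0..max_inv."""
    counts = [1] + [0] * max_inv  # a single element has no inversions
    for size in range(2, n + 1):
        new_counts = [0] * (max_inv + 1)
        for k in range(max_inv + 1):
            # Inserting the largest of `size` elements creates between
            # 0 and size - 1 new inversions.
            new_counts[k] = sum(
                counts[k - t] for t in range(min(size - 1, k) + 1)
            )
        counts = new_counts
    return counts


few_inversions = sum(count_by_inversions(12, 12))
percentage = round(100 * few_inversions / factorial(12), 2)
```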
Algorithm 1 has a better approximation factor when the number of inversions is at least n, as explained in the following theorem.
Theorem 2
Let be an instance, be the target instance, and let be the number of intergenic regions in and . If , Algorithm 1 has an approximation factor of , where .
Proof
Let , and let if and otherwise. Suppose now that for some . Since , by Lemma 4 we have that . Algorithm 1 applies 2-reversals and up to 1-reversals, which results in no more than super short reversals.
Corollary 2.1
Let be an instance, be the target instance, and let be the number of intergenic regions in and . If (resp. ) Algorithm 1 has an approximation factor of at most 2 (resp. 1.5).
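The two values in this corollary are consistent with the following short derivation. It is a sketch under assumptions, since the factor stated in Theorem 2 is not legible above: the lower bound is taken to be at least the number of inversions, written $\mathrm{inv}$ here as a hypothetical notation, and the algorithm is assumed to use at most $n$ 1-reversals after removing the inversions.

```latex
\[
\frac{\text{operations used}}{\text{lower bound}}
\;\le\; \frac{\mathrm{inv} + n}{\mathrm{inv}}
\;=\; 1 + \frac{n}{\mathrm{inv}}
\;\le\; 1 + \frac{1}{k}
\qquad \text{whenever } \mathrm{inv} \ge kn .
\]
```

Taking $k = 1$ gives a factor of at most 2, and $k = 2$ gives a factor of at most 1.5.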
Sorting by Super Short Transpositions
In this section, we analyze the version of the problem when only super short transpositions are allowed. First, we investigate how 2-transpositions split non-trivial components from an intergenic graph .
Lemma 6
If a component c of an intergenic graph with (resp. ) has no edge, then we can apply two 2-transpositions that split c into three components , , and such that (resp. two components and such that ).
Proof
Note that any 2-transposition will increase or decrease the number of inversions by one. By Lemma 2, a 2-transposition that removes an inversion can increase the number of components by at most two units, and a 2-transposition creating an inversion cannot increase the number of components. Since there is no inversion in c, every 2-transposition that removes an inversion from c must be preceded by a 2-transposition that creates that inversion.
Now we explain how to increase the number of components by two units when . Let be the index in of the i-th intergenic region inside component c. If there is no intergenic region inside c in which the cumulative sum is negative, apply in such a way that and . Now apply . These two 2-transpositions split c into three components: with , with and with the remaining intergenic regions from c. Note that and are odd, and has the same parity as c.
Otherwise, we can find a pair of consecutive intergenic regions and inside c such that and , and since , such pair always exists. If is even or if is odd but is even, apply 0) such that and , followed by such that and .
If and are odd, apply such that and , followed by such that and . These two 2-transpositions split c into three components so if is even then we will end up with two odd components and one even component, and if c is odd we will end up with three odd components due to the choice of the position defined above.
If and we apply such that and , followed by such that and . If and we apply followed by . These two transpositions transform c into two trivial components.
The following lemma gives the number of transpositions needed to transform a permutation and its intergenic regions into with its intergenic regions when .
Lemma 7
Let be an instance, be the target instance, and let be the number of intergenic regions in and . If , then .
Proof
If a 2-transposition applied on a component c of increases the number of components by two units, we can assume by the proof of Lemma 6 that it transforms c into three components , , and such that two of them are odd components and the other has the same parity as .
If c is odd, and if we can always increase the number of components by two units, we end up with a component with only one intergenic region, but if c is even, at some point we will have to increase the number of components by one unit, creating two odd components. This means that for each even component we need to apply two 2-transpositions that increase the number of components by one unit only. Since we can always apply pairs of transpositions that do not increase the number of even components, it follows that .
Lemmas 8 and 9 respectively show the lower and upper bounds for finding using super short transpositions.
Lemma 8
Let be an instance, be the target instance, and let be the number of intergenic regions in and . It follows that .
Proof
In order to sort we need to remove all inversions, and since a 2-transposition can remove only one inversion, we necessarily have that . Besides, by Lemma 2, we can increase the number of components by at most two with a super short transposition. Let . To reach m trivial components, and considering also Lemma 7, we need at least super short transpositions. Thus, .
Lemma 9
Let be an instance, be the target instance, and let be the number of intergenic regions in and . We have that .
Proof
Suppose that we first remove all inversions of using 2-transpositions of type , and let be the resulting permutation. The value of cannot be smaller than since any 2-transposition removing an inversion is applied inside a same component. Let and let
Let us analyze the parity of any component that a 2-transposition breaks: (i) if it transforms an odd component into two, then exactly one of them must be odd; (ii) if it transforms an even component into two, then the two components are either both odd or both even; (iii) if it transforms an even component into three, then two components must be odd; (iv) if it transforms an odd component into three, then either two components are even or the three components are odd. This means that .
By Lemma 7, we can go from to m components using 2-transpositions, which results, by the analysis above, in no more than 2-transpositions.
Finally, using Lemmas 8 and 9, we prove that it is possible to obtain a 3-approximation for this problem.
Theorem 3
Let be an instance, be the target instance, and let be the number of intergenic regions in and . The value of is 3-approximable.
Proof
Let . If then, by Lemma 8, , and, by Lemma 9, . Otherwise, , so . By Lemma 8, , and, by Lemma 9, .
Algorithm 2 describes a 3-approximation algorithm that transforms an instance into using super short transpositions. Similarly to Algorithm 1, Algorithm 2 has a time complexity of .

Algorithm 2 has a better approximation factor when the number of inversions is strictly greater than n, as stated in the following theorem.
Theorem 4
Let be an instance, be the target instance, and let be the number of intergenic regions in and . If , Algorithm 2 has an approximation factor of , where .
Proof
Similar to the proof of Theorem 2, given that .
Corollary 4.1
Let be an instance, be the target instance, and let be the number of intergenic regions in and . If (resp. ) Algorithm 2 has an approximation factor of at most 2 (resp. 1.5).
Sorting by Super Short Reversals and Super Short Transpositions
In this section, we analyze the version of the problem when both super short reversals and super short transpositions are allowed to transform any into .
Lemma 10
Let be an instance, be the target instance, and let be the number of intergenic regions in and . It follows that .
Proof
Lemma 11
Let be an instance, be the target instance, and let be the number of intergenic regions in and . We have that .
Proof
Suppose that first we remove all inversions of using 2-reversals of type , and let (resp. ) be the resulting permutation (resp. intergenic regions). Let and let . We have that , since 2-reversals removing inversions are always applied inside a same component.
Analogous to Lemma 9, and assuming that for some , then . We use the procedure described in Lemma 9 on components c with , applying two 2-transpositions that increase the number of components by two units. For components c with , we apply a 1-reversal as described in Lemma 3, breaking them into two odd components.
The above procedure applies 2-reversals, 2-transpositions, and 1-reversals, which results in no more than .
Now we prove that it is possible to obtain a 3-approximation for this problem.
Theorem 5
Let be an instance, be the target instance, and let be the number of intergenic regions in and . The value of is 3-approximable.
Proof
Similar to the proof of Theorem 1, using Lemmas 10 and 11.
Algorithm 3 describes a 3-approximation algorithm that transforms an instance into using both super short reversals and super short transpositions. Like Algorithms 1 and 2, it has a time complexity of .

As in the previous algorithms, Algorithm 3 has a better approximation factor when the number of inversions is at least n, as explained in the following theorem.
Theorem 6
Let be an instance, be the target instance, and let be the number of intergenic regions in and . If Algorithm 3 has an approximation factor of , where .
Proof
Analogous to the proof of Theorem 2.
Corollary 6.1
Let be an instance, be the target instance, and let be the number of intergenic regions in and . If (resp. ) Algorithm 3 has an approximation factor of at most 2 (resp. 1.5).
Sorting by Signed Super Short Reversals
In this section, we analyze the version of the problem when super short reversals are allowed to transform into , where and are signed permutations.
Given a signed permutation , let be the set of elements from such that is even and , and let be the set of elements from such that is odd and . Sets and capture the negative and positive elements from that end with negative signs after any sequence of 2-reversals that puts all elements in their correct positions (i.e., remove all inversions). Let be the number of elements in .
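The role of these two sets can be illustrated with a small simulation. The conditions are not fully legible above, so the reading used here is an assumption: an element ends with a negative sign if it is negative and the difference between its value and its position is even, or if it is positive and that difference is odd. The sketch compares this prediction with the signs obtained by actually sorting with 2-reversals, each of which exchanges two adjacent elements and flips both signs. Function names are hypothetical.

```python
from itertools import permutations, product


def predicted_negative(perm):
    """Elements expected to be negative once all inversions are removed
    by 2-reversals (assumed reading of the two sets)."""
    result = set()
    for pos, e in enumerate(perm, start=1):
        even_shift = (abs(e) - pos) % 2 == 0
        if (e < 0 and even_shift) or (e > 0 and not even_shift):
            result.add(abs(e))
    return result


def simulate_2_reversals(perm):
    """Remove every inversion with 2-reversals and return the elements
    that are negative afterwards."""
    perm = list(perm)
    changed = True
    while changed:
        changed = False
        for i in range(len(perm) - 1):
            if abs(perm[i]) > abs(perm[i + 1]):
                perm[i], perm[i + 1] = -perm[i + 1], -perm[i]
                changed = True
    return {abs(e) for e in perm if e < 0}


# Exhaustive comparison over all signed permutations of four elements.
prediction_holds = all(
    predicted_negative([s * v for s, v in zip(signs, values)])
    == simulate_2_reversals([s * v for s, v in zip(signs, values)])
    for values in permutations(range(1, 5))
    for signs in product((1, -1), repeat=4)
)
```

The check relies on the fact that an element takes part in as many exchanges as there are inversions involving it, and that this number has the same parity as the distance between its current and final positions.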
The following lemma, proved by Galvão et al. [8], gives the exact number of super short reversals needed to transform into .
Lemma 12
Given a signed permutation , .
This lemma helps us to state the following lower bound for our problem.
Lemma 13
Let be an instance, be the target instance, and let be the number of intergenic regions in and . We have that .
Proof
Directly from Lemmas 4 and 12.
The following lemma states an upper bound for this problem.
Lemma 14
Let be an instance, be the target instance, and let be the number of intergenic regions in and . We have that .
Proof
Let and let . Suppose that we first remove all inversions of using 2-reversals of type , and let (resp. ) be the resulting permutation (resp. intergenic regions).
Let . We have that . We apply 1-reversals that split every non-trivial component from into two components according to Lemma 3, and let (resp. ) be the resulting permutation (resp. intergenic regions).
At this point, we have a permutation such that , and has no more than negative elements. We just need to apply up to 1-reversals of type (i.e., without modifying the length of its intergenic regions) to each negative element from , and the lemma follows.
Using Lemmas 13 and 14, we prove that the value of is 5-approximable.
Theorem 7
Let be an instance, be the target instance, and let be the number of intergenic regions in and . The value of is 5-approximable.
Proof
Let , and let . If then, by Lemma 13, , and, by Lemma 14, .
Otherwise, , so . By Lemma 13, , and, by Lemma 14, .
Algorithm 4 describes a 5-approximation algorithm that transforms a signed instance into using signed super short reversals. As in previous algorithms, the time complexity of Algorithm 4 is .

As in the previous algorithms, Algorithm 4 has a better approximation factor when the number of inversions is at least n, as explained in the following theorem.
Theorem 8
Let be an instance, be the target instance, and let be the number of intergenic regions in and . If , Algorithm 4 has an approximation factor of , where .
Proof
Let . Suppose now that for some . Since , by Lemma 4 we have that . Algorithm 4 applies 2-reversals, up to 1-reversals, and up to n 1-reversals to flip the sign of each negative element, which results in no more than super short reversals.
Corollary 8.1
Let be an instance, be the target instance, and let be the number of intergenic regions in and . If (resp. ) Algorithm 4 has an approximation factor of at most 3 (resp. 2).
Sorting by Signed Super Short Reversals and Super Short Transpositions
In this section, we analyze the version of the problem in which both super short reversals and super short transpositions are allowed to sort signed permutations.
Let be the inversion graph [18] of the signed permutation , such that and is formed by pairs of elements from that are inversions. In , a component is defined as a maximal subgraph in which any two vertices are connected to each other by paths. A component from is negative if it contains an odd number of negative elements (vertices), and it is positive otherwise.
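The definitions above can be made concrete with a short sketch that builds the components of the inversion graph and counts the negative ones. It is an illustration only: it assumes that two elements form an inversion when the one appearing first has the larger absolute value, and the function names are hypothetical. Components are found with a simple union-find over all inverted pairs, which takes quadratic time and is not meant to be efficient.

```python
def inversion_graph_components(perm):
    """Return the connected components of the inversion graph of a signed permutation.

    Assumption: positions i < j are adjacent when |perm[i]| > |perm[j]|.
    Each component is returned as a list of signed elements.
    """
    n = len(perm)
    parent = list(range(n))

    def find(x):
        # Find the representative of x, halving paths along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two endpoints of every inversion.
    for i in range(n):
        for j in range(i + 1, n):
            if abs(perm[i]) > abs(perm[j]):
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(perm[i])
    return list(groups.values())


def count_negative_components(perm):
    """Count components containing an odd number of negative elements."""
    return sum(
        1
        for comp in inversion_graph_components(perm)
        if sum(1 for x in comp if x < 0) % 2 == 1
    )


if __name__ == "__main__":
    print(inversion_graph_components([2, -1, 3, -4]))
    print(count_negative_components([2, -1, 3, -4]))
```

For the permutation (+2, -1, +3, -4), the only inversion is the pair (+2, -1), so the components are {+2, -1}, {+3}, and {-4}; the first and the last are negative.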
Let be the number of negative components of . The following lemma, proved by Galvão et al. [8], gives the exact number of super short reversals and super short transpositions needed to transform into , which is a lower bound for our problem.
Lemma 15
Given a signed permutation , super short operations are required to transform into .
Now we state in the following lemma an upper bound for this problem.
Lemma 16
Let be an instance, be the target instance, and let be the number of intergenic regions in and . We have that .
Proof
Suppose that we first remove all inversions of using the polynomial algorithm presented in [8], that uses super short operations such that all 2-reversals are of type , all 2-transpositions are of type , and all the 1-reversals are ignored (i.e., not applied), and let be the resulting permutation.
The number of components in cannot be smaller than , since the 2-reversals and 2-transpositions are applied only inside the same component. Let .
By Lemma 3, we can go from to m components using 1-reversals, which results in no more than 1-reversals. After that, we will have a permutation with up to negative elements, so we can apply up to 1-reversals of type to each negative element of .
Using Lemmas 15 and 16, we prove that a 5-approximation can be obtained for this problem.
Theorem 9
Let be an instance, be the target instance, and let be the number of intergenic regions in and . The value of is 5-approximable.
Proof
Let , and let . If then, by Lemma 15, , and, by Lemma 16, .
Otherwise, , so . By Lemma 15, , and, by Lemma 16, .
Algorithm 5 describes a 5-approximation algorithm that transforms a signed instance into using both signed super short reversals and super short transpositions. Regarding the complexity, from the previous algorithms we know that lines 1-3 and 6-17 take up to , and according to [8] the while loop in line 4 takes , which is then the time complexity of Algorithm 5.

As for previous algorithms, Algorithm 5 also has a better approximation factor when the number of inversions is at least n. This is the purpose of the following theorem.
Theorem 10
Let be an instance, be the target instance, and let be the number of intergenic regions in and . If , Algorithm 5 has an approximation factor of , where .
Proof
Let . Suppose now that . Since , by Lemma 4 we have that . Algorithm 5 applies operations between 2-reversals and 2-transpositions, up to 1-reversals, and up to n 1-reversals to flip the sign of each negative element, which results in no more than super short operations.
Corollary 10.1
Let be an instance, be the target instance, and let be the number of intergenic regions in and . If (resp. ) Algorithm 5 has an approximation factor of at most 3 (resp. 2).
Experimental tests
We implemented the five proposed algorithms and tested them using simulated permutations in order to observe their performance. We generated two different permutation datasets, which we call fully-random instances (FRI) and almost random instances (ARI). Each dataset has 1,000,000 instances , is a permutation with 100 elements and is a sequence of 101 intergenic region sizes.
The dataset FRI was generated in the following way: (i) let be an initial instance, being with 100 elements, and each received a random integer . (ii) Generate by applying w consecutive super short operations to , with randomly generated indices for both positions and intergenic sizes, always respecting the current values. We created 10,000 instances for each value of
For Sorting by (Signed) Super Short Reversals we applied 0.8w 2-reversals and 0.2w 1-reversals, and at each step one of them was chosen at random while both were available. For Sorting by Super Short Transpositions we applied w 2-transpositions. For Sorting by (Signed) Super Short Operations we applied 0.5w 2-transpositions, 0.4w 2-reversals, and 0.1w 1-reversals, and at each step one of them was chosen at random while more than one were available.
The dataset ARI was generated in a similar way as FRI, but when the algorithm had to apply either a 2-reversal or a 2-transposition we randomly chose a pair among all pairs of adjacent elements that were not an inversion. Since , at least one pair always exists.
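The permutation part of the two generation procedures can be sketched as follows. This is a simplified illustration and not the generator used for the experiments: it handles only unsigned permutations, where a 2-reversal and a 2-transposition both exchange two adjacent elements, it omits 1-reversals and intergenic sizes, and the function names are hypothetical. The ARI variant assumes that w does not exceed n(n-1)/2, so that a non-inverted adjacent pair is always available.

```python
import random


def count_inversions(perm):
    """Count pairs of positions i < j with perm[i] > perm[j]."""
    n = len(perm)
    return sum(1 for i in range(n) for j in range(i + 1, n) if perm[i] > perm[j])


def fri_permutation(n, w, rng):
    """FRI-style sketch: apply w adjacent swaps at uniformly random positions."""
    p = list(range(1, n + 1))
    for _ in range(w):
        i = rng.randrange(n - 1)
        p[i], p[i + 1] = p[i + 1], p[i]
    return p


def ari_permutation(n, w, rng):
    """ARI-style sketch: each swap is applied to an adjacent pair that is
    not yet an inversion, so every swap creates exactly one new inversion."""
    p = list(range(1, n + 1))
    for _ in range(w):
        candidates = [i for i in range(n - 1) if p[i] < p[i + 1]]
        i = rng.choice(candidates)
        p[i], p[i + 1] = p[i + 1], p[i]
    return p


if __name__ == "__main__":
    rng = random.Random(0)
    print(count_inversions(fri_permutation(100, 1000, rng)))
    print(count_inversions(ari_permutation(100, 1000, rng)))
```

In the ARI-style procedure the number of inversions of the result is exactly w, whereas in the FRI-style procedure a swap may undo an inversion created earlier, so the result has at most w inversions.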
Given any instance from ARI created using w SSOs, we know exactly how many inversions has: it is the number of 2-reversals and 2-transpositions applied. The number of inversions in instances from FRI, however, is not known, but we can compute the expected number of inversions in a permutation with n elements after k random swaps (i.e., 2-reversals and 2-transpositions) applied to the identity permutation [19]:
where , , , and .
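As a sanity check on such an expectation, the quantity can also be estimated empirically. The sketch below is a Monte Carlo estimate, not the closed formula of [19]: it assumes that each swap exchanges two adjacent elements at a position chosen uniformly at random, and the function name and the default parameters are hypothetical.

```python
import random


def estimate_expected_inversions(n, k, trials=2000, seed=0):
    """Estimate the expected number of inversions after k uniformly random
    adjacent swaps applied to the identity permutation on n elements."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        p = list(range(n))
        for _ in range(k):
            i = rng.randrange(n - 1)
            p[i], p[i + 1] = p[i + 1], p[i]
        # Count inversions of the resulting permutation (quadratic time).
        total += sum(
            1 for a in range(n) for b in range(a + 1, n) if p[a] > p[b]
        )
    return total / trials


if __name__ == "__main__":
    for k in (10, 100, 500):
        print(k, estimate_expected_inversions(20, k, trials=500))
```

The estimate can never exceed k or n(n-1)/2, and for a single swap it is exactly 1, which gives two quick checks on any implementation of the exact formula.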
Figures 5 and 6 show the experimental results for instances of type FRI and ARI, respectively. We show the average distance returned for each algorithm described in this paper, plus the average and maximum approximation factors calculated based on the lower bound of each problem for each instance.
Fig. 5.
(a) Average returned distance, (b) maximum returned approximation, and (c) average returned approximation for instances in FRI. In (a), dotted curves represent the expected number of inversions, dashed curves represent the algorithms for signed permutations, and solid curves represent the algorithms for unsigned permutations. Colors relate problems having the expected number of inversions given by the dotted line of the same color. This means that SbSSR and SbSigSSR are expected to have inversions as in Inv. 1, SbSST is expected to have inversions as in Inv. 2, and SbSSO and SbSigSSO are expected to have inversions as in Inv. 3. All the approximations in (b) and (c) were calculated based on the lower bound of each problem.
Fig. 6.
(a) Average returned distance, (b) maximum returned approximation, and (c) average returned approximation for instances in ARI. In (a), dotted curves represent the exact number of inversions, dashed curves represent the algorithms for signed permutations, and solid curves represent the algorithms for unsigned permutations. Colors relate problems having the number of inversions given by the dotted line of the same color. This means that SbSSR and SbSigSSR have inversions as in Inv. 1, SbSST has inversions as in Inv. 2, and SbSSO and SbSigSSO have inversions as in Inv. 3. All the approximations in (b) and (c) were calculated based on the lower bound of each problem.
In Figs. 5 and 6, Algorithm 1 is denoted by SbSSR, Algorithm 4 is denoted by SbSigSSR, Algorithm 2 is denoted by SbSST, Algorithm 3 is denoted by SbSSO, and Algorithm 5 is denoted by SbSigSSO. In Fig. 5, the curve Inv. 1 represents the expected number of inversions for SbSSR and SbSigSSR, the curve Inv. 2 represents the expected number of inversions for SbSST, and the curve Inv. 3 represents the expected number of inversions for SbSSO and SbSigSSO. These three curves were generated using the formula described above. In Fig. 6, the curves Inv. 1, 2, and 3 follow the same idea as the curves in Fig. 5, but they represent the exact number of inversions instead of the expected number.
As distance is directly related to the number of inversions, in Fig. 5a we see that, although in practice we have applied up to 1000 operations, the distance values were never greater than 300 on average: the average returned distances for each algorithm follow the trend of the dotted line that represents the expected number of inversions of those instances. Algorithms for signed permutations returned slightly higher distances than the same algorithms for unsigned permutations, which is expected given that, in addition to inversions and intergenic sizes, they also need to take care of elements with negative signs.
Concerning the approximation factors in Fig. 5b, c, it can be noted that despite the theoretical approximation factors of 3 and 5, the average approximation factors of instances in FRI were between 1 and 2.2. Furthermore, in our tests, no instance for SbSSR and SbSSO, whose theoretical approximation factors are 3, had approximation factor above 2.5, and no instance for SbSigSSR and SbSigSSO, whose theoretical approximation factors are 5, obtained approximation above 3.3 and 3, respectively.
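The average and maximum approximation factors reported here are ratios between the distance returned by an algorithm and the lower bound of the corresponding problem, taken per instance. A minimal sketch of that computation is given below. The function name and the input lists are hypothetical, and skipping instances whose lower bound is zero is an assumption of this sketch, not something stated in the text.

```python
def approximation_summary(returned, lower_bounds):
    """Return (average, maximum) of the per-instance ratios returned / lower bound.

    Instances with a lower bound of zero are skipped (assumption), since the
    ratio is undefined for them.
    """
    ratios = [d / lb for d, lb in zip(returned, lower_bounds) if lb > 0]
    if not ratios:
        raise ValueError("no instance with a positive lower bound")
    return sum(ratios) / len(ratios), max(ratios)


if __name__ == "__main__":
    # Toy values for illustration only; they are not experimental data.
    avg, worst = approximation_summary([6, 9, 4], [3, 6, 4])
    print(avg, worst)
```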
In Fig. 6a we have another scenario where the returned distance follows the number of applied SSOs, but this behavior is due to our choice of applying only operations that do not destroy previously created inversions. One interesting thing about this figure is that the distances returned by the algorithm for SSOs were very close to the number of inversions, especially when , something that did not happen on FRI. In Fig. 6b, c, we see that this dataset returned maximum and average approximations systematically better than for dataset FRI: no instance had an approximation factor greater than 2.5, and all algorithms have average approximation factors below 1.7. For (resp. ), where we expect to have around n (resp. 2n) inversions, none of the instances had an approximation factor above 2 (resp. 1.5), as we expected given Lemmas 2, 4, 6, 8, and 10.
Conclusion
In this paper, we analyzed the minimum number of super short reversals and/or super short transpositions needed to sort a signed or unsigned permutation and, at the same time, transform its intergenic region lengths according to .
We defined some bounds and a graph structure that allowed us to build five algorithms (one for each considered problem) that guarantee approximation factors of 3 for unsigned permutations (using either SSRs, SSTs, or both) and 5 for signed permutations (using either SSRs or SSOs). These algorithms have better approximation factors for instances in which the number of inversions is at least n or 2n. In the former case, the factor is equal to 2 for unsigned permutations (using either SSRs, SSTs, or both), and to 3 for signed permutations (using SSRs or SSOs); in the latter case it is equal to 1.5 for unsigned permutations, and to 2 for signed permutations. All of these algorithms were tested on simulated instances, showing that, on average, they behave better than their theoretical approximation factors predict.
Some questions remain open. For instance, what is the computational complexity of each of these five problems? Besides, how can we incorporate indels (insertions and deletions) of intergenic regions into these problems, so as to compare two genomes that share the same set of genes but may differ in their total intergenic region length?
Acknowledgements
This work was supported by the National Council for Scientific and Technological Development-CNPq (Grants 400487/2016-0, 425340/2016-3, and 140466/2018-5), the São Paulo Research Foundation-FAPESP (Grants 2013/08293-7, 2015/11937-9, 2017/12646-3, and 2017/16246-0), the Brazilian Federal Agency for the Support and Evaluation of Graduate Education-CAPES, and the CAPES/COFECUB program (Grant 831/15). We also thank the anonymous reviewers for their helpful suggestions.
Authors’ contributions
First draft: ARO. Proofs: ARO, GJ, and GF. Experiments: ARO, UD, and ZD. Final manuscript: ARO, GJ, GF, UD, and ZD. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Contributor Information
Andre R. Oliveira, Email: andrero@ic.unicamp.br
Géraldine Jean, Email: geraldine.jean@univ-nantes.fr
Guillaume Fertin, Email: guillaume.fertin@univ-nantes.fr
Ulisses Dias, Email: ulisses@ft.unicamp.br
Zanoni Dias, Email: zanoni@ic.unicamp.br
References
- 1. Bafna V, Pevzner PA. Sorting by transpositions. SIAM J Discrete Math. 1998;11(2):224–240. doi: 10.1137/S089548019528280X.
- 2. Kececioglu JD, Sankoff D. Exact and approximation algorithms for sorting by reversals, with application to genome rearrangement. Algorithmica. 1995;13:180–210. doi: 10.1007/BF01188586.
- 3. Yancopoulos S, Attie O, Friedberg R. Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics. 2005;21(16):3340–3346. doi: 10.1093/bioinformatics/bti535.
- 4. Hannenhalli S, Pevzner PA. Transforming men into mice (polynomial algorithm for genomic distance problem). In: Proceedings of the 36th annual symposium on foundations of computer science (FOCS’1995). Washington, DC: IEEE Computer Society Press; 1995. p. 581–92. doi: 10.1109/SFCS.1995.492588.
- 5. Elias I, Hartman T. A 1.375-approximation algorithm for sorting by transpositions. IEEE/ACM Trans Comput Biol Bioinform. 2006;3(4):369–379. doi: 10.1109/TCBB.2006.44.
- 6. Fertin G, Labarre A, Rusu I, Tannier E, Vialette S. Combinatorics of genome rearrangements. Computational molecular biology. London: The MIT Press; 2009.
- 7. Chen T, Skiena SS. Sorting with fixed-length reversals. Discrete Appl Math. 1996;71(1–3):269–295. doi: 10.1016/S0166-218X(96)00069-8.
- 8. Galvão GR, Lee O, Dias Z. Sorting signed permutations by short operations. Algor Mol Biol. 2015;10:12. doi: 10.1186/s13015-015-0040-x.
- 9. Lefebvre J-F, El-Mabrouk N, Tillier ERM, Sankoff D. Detection and validation of single gene inversions. Bioinformatics. 2003;19(1):190–196. doi: 10.1093/bioinformatics/btg1025.
- 10. Dalevi DA, Eriksen N, Eriksson K, Andersson SGE. Measuring genome divergence in bacteria: a case study using Chlamydian data. J Mol Evol. 2002;55(1):24–36. doi: 10.1007/s00239-001-0087-9.
- 11. Seoighe C, Federspiel N, Jones T, Hansen N, Bivolarovic V, Surzycki R, Tamse R, Komp C, Huizar L, Davis RW, Scherer S, Tait E, Shaw DJ, Harris D, Murphy L, Oliver K, Taylor K, Rajandream M-A, Barrell BG, Wolfe KH. Prevalence of small inversions in yeast gene order evolution. Proc Natl Acad Sci. 2000;97(26):14433–14437. doi: 10.1073/pnas.240462997.
- 12. McLysaght A, Seoighe C, Wolfe KH. High frequency of inversions during eukaryote gene order evolution. In: Sankoff D, Nadeau JH, editors. Comparative genomics: empirical and analytical approaches to gene order dynamics, map alignment and the evolution of gene families. New York: Springer; 2000. pp. 47–58.
- 13. Biller P, Guéguen L, Knibbe C, Tannier E. Breaking good: accounting for fragility of genomic regions in rearrangement distance estimation. Genome Biol Evol. 2016;8(5):1427–1439. doi: 10.1093/gbe/evw083.
- 14. Biller P, Knibbe C, Beslon G, Tannier E. Comparative genomics on artificial life. In: Beckmann A, Bienvenu L, Jonoska N, editors. Pursuit of the universal. Lecture notes in computer science. Cham: Springer International Publishing; 2016. pp. 35–44.
- 15. Fertin G, Jean G, Tannier E. Algorithms for computing the double cut and join distance on both gene order and intergenic sizes. Algor Mol Biol. 2017;12:16. doi: 10.1186/s13015-017-0107-y.
- 16. Bulteau L, Fertin G, Tannier E. Genome rearrangements with indels in intergenes restrict the scenario space. BMC Bioinform. 2016;17(S14):225–231. doi: 10.1186/s12859-016-1264-6.
- 17. Knuth DE. The art of computer programming, Volume 3: Sorting and searching. Reading: Addison-Wesley Publishing Company; 1998.
- 18. Rotem D, Urrutia J. Circular permutation graphs. Networks. 1982;12(4):429–437. doi: 10.1002/net.3230120407.
- 19. Bousquet-Mélou M. The expected number of inversions after n adjacent transpositions. Discrete Math Theor Comput Sci. 2010;12(2):65–88.






