This is a preprint.

It has not yet been peer reviewed by a journal.

[Preprint]. 2023 Nov 8:arXiv:2305.10577v2. Originally published 2023 May 17. [Version 2]

Revisiting the Complexity of and Algorithms for the Graph Traversal Edit Distance and Its Variants

Yutong Qiu 1,*, Yihang Shen 1,*, Carl Kingsford 1,
PMCID: PMC10246088  PMID: 37292475

Abstract

The graph traversal edit distance (GTED), introduced by Ebrahimpour Boroojeny et al. (2018), is an elegant distance measure defined as the minimum edit distance between strings reconstructed from Eulerian trails in two edge-labeled graphs. GTED can be used to infer evolutionary relationships between species by comparing de Bruijn graphs directly without the computationally costly and error-prone process of genome assembly. Ebrahimpour Boroojeny et al. (2018) propose two ILP formulations for GTED and claim that GTED is polynomially solvable because the linear programming relaxation of one of the ILPs always yields optimal integer solutions. The claim that GTED is polynomially solvable is contradictory to the complexity results of existing string-to-graph matching problems.

We resolve this conflict in complexity results by proving that GTED is NP-complete and showing that the ILPs proposed by Ebrahimpour Boroojeny et al. do not solve GTED but instead solve for a lower bound of GTED and are not solvable in polynomial time. In addition, we provide the first two, correct ILP formulations of GTED and evaluate their empirical efficiency. These results provide solid algorithmic foundations for comparing genome graphs and point to the direction of heuristics.

1. Introduction

Graph traversal edit distance (GTED) [1] is an elegant measure of the similarity between the strings represented by edge-labeled Eulerian graphs. For example, given two de Bruijn assembly graphs [2], computing GTED between them measures the similarity between two genomes without the computationally intensive and possibly error-prone process of assembling the genomes. Using an approximation of GTED between assembly graphs of Hepatitis B viruses, Ebrahimpour Boroojeny et al. [1] group the viruses into clusters consistent with their taxonomy. This can be extended to inferring phylogeny relationships in metagenomic communities or comparing heterogeneous disease samples such as cancer. There are several other methods to compute a similarity measure between strings encoded by two assembly graphs [36]. GTED has the advantage that it does not require prior knowledge on the type of the genome graph or the complete sequence of the input genomes. The input to the GTED problem is two unidirectional, edge-labeled Eulerian graphs, which are defined as:

Definition 1 (Unidirectional, edge-labeled Eulerian Graph). A unidirectional, edge-labeled Eulerian graph is a connected directed graph $G=(V,E,\ell,\Sigma)$, with node set $V$, edge multiset $E$, constant-size alphabet $\Sigma$, and single-character edge labels $\ell: E\to\Sigma$, such that $G$ contains an Eulerian trail that traverses every edge $e\in E$ exactly once. The unidirectional condition means that all edges between the same pair of nodes are in the same direction.

Such graphs arise in genome assembly problems (e.g. the de Bruijn subgraphs). Computing GTED is the problem of computing the minimum edit distance between the two most similar strings represented by Eulerian trails in each input graph.

Problem 1 (Graph Traversal Edit Distance (GTED) [1]). Given two unidirectional, edge-labeled Eulerian graphs G1 and G2, compute

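$$\mathrm{GTED}(G_1,G_2)\triangleq\min_{\substack{t_1\in\mathrm{trails}(G_1)\\ t_2\in\mathrm{trails}(G_2)}}\mathrm{edit}\big(\mathrm{str}(t_1),\mathrm{str}(t_2)\big). \tag{1}$$

Here, $\mathrm{trails}(G)$ is the collection of all Eulerian trails in graph $G$, $\mathrm{str}(t)$ is a string constructed by concatenating labels on the Eulerian trail $t=(e_0,e_1,\dots,e_n)$, and $\mathrm{edit}(s_1,s_2)$ is the edit distance between strings $s_1$ and $s_2$.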
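To make Equation (1) concrete, the following minimal sketch evaluates it by exhaustive enumeration on tiny inputs. The function names and the two example graphs are hypothetical: each graph is assumed to consist of two closed trails sharing node 0, chosen so that the graphs spell circular permutations of TTTGAA and TTTAGA. The enumeration is exponential and only meant to illustrate the definition.

```python
from itertools import product

def trail_strings(edges):
    """Return every string spelled by an Eulerian trail of a small graph.

    edges: list of (tail, head, label) triples. Exponential-time enumeration.
    """
    out_edges = {}
    for idx, (tail, _, _) in enumerate(edges):
        out_edges.setdefault(tail, []).append(idx)
    found = set()

    def extend(node, used, labels):
        if len(labels) == len(edges):        # every edge used exactly once
            found.add("".join(labels))
            return
        for idx in out_edges.get(node, []):
            if idx not in used:
                _, head, label = edges[idx]
                extend(head, used | {idx}, labels + [label])

    for start in {tail for tail, _, _ in edges}:
        extend(start, frozenset(), [])
    return found

def edit(s, t):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def gted_bruteforce(edges1, edges2):
    """Minimum edit distance over all pairs of Eulerian-trail strings."""
    pairs = product(trail_strings(edges1), trail_strings(edges2))
    return min(edit(s, t) for s, t in pairs)

# Hypothetical inputs: two closed trails sharing node 0 in each graph.
G1 = [(0, 1, "T"), (1, 2, "T"), (2, 0, "T"), (0, 3, "G"), (3, 4, "A"), (4, 0, "A")]
G2 = [(0, 1, "T"), (1, 2, "T"), (2, 0, "T"), (0, 3, "A"), (3, 4, "G"), (4, 0, "A")]
print(gted_bruteforce(G1, G2))  # 2
```

Because the two strings have the same multiset of characters but are not rotations of each other, no pair of trails is closer than two substitutions.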

Ebrahimpour Boroojeny et al. [1] claim that GTED is polynomially solvable by proposing an integer linear programming (ILP) formulation of GTED and arguing that the constraints of the ILP make it polynomially solvable. This result, however, conflicts with several complexity results on string-to-graph matching problems. Kupferman and Vardi [7] show that it is NP-complete to determine if a string exactly matches an Eulerian tour in an edge-labeled Eulerian graph. Additionally, Jain et al. [8] show that it is NP-complete to compute an edit distance between a string and strings represented by a labeled graph if edit operations are allowed on the graph. On the other hand, polynomial-time algorithms exist to solve string-to-string alignment [9] and string-to-graph alignment [8] when edit operations on graphs are not allowed.

We resolve the conflict among the results on complexity of graph comparisons by revisiting the complexity of and the proposed solutions to GTED. We prove that computing GTED is NP-complete by reducing from the HAMILTONIAN PATH problem, reaching an agreement with other related results on complexity. Further, we point out with a counter-example that the optimal solution of the ILP formulation proposed by Ebrahimpour Boroojeny et al. [1] does not solve GTED.

We give two ILP formulations for GTED. The first ILP has an exponential number of constraints and can be solved by subtour elimination iteratively [10, 11]. The second ILP has a polynomial number of constraints and shares a similar high-level idea of the global ordering approach [11] in solving the Traveling Salesman problem [12].

In Qiu and Kingsford [13], Flow-GTED (FGTED), a variant of GTED is proposed to compare two sets of strings instead of two strings encoded by graphs. FGTED is equal to the edit distance between the most similar sets of strings spelled by the decomposition of flows between a pair of predetermined source and sink nodes. The similarity between the sets of strings reconstructed from the flow decomposition is measured by the Earth Mover’s Edit Distance [13, 14]. FGTED is used to compare pan-genomes, where both the frequency and content of strings are essential to represent the population of organisms. Qiu and Kingsford [13] reduce FGTED to GTED, and via the claimed polynomial-time algorithm of GTED, argue that FGTED is also polynomially solvable. We show that this claim is false by proving that FGTED is also NP-complete.

While the optimal solution to the ILP proposed in Ebrahimpour Boroojeny et al. [1] does not solve GTED, it does compute a lower bound to GTED. We characterize the cases when GTED is equal to this lower bound. In addition, we point out that solving this ILP formulation finds a minimum-cost matching between closed-trail decompositions in the input graphs, which may be used to compute the similarity between repeats in the genomes. Ebrahimpour Boroojeny et al. [1] claim their proposed ILP formulation is solvable in polynomial time by arguing that the constraint matrix of the linear relaxation of the ILP is always totally unimodular. We show that this claim is false by proving that the constraint matrix is not always totally unimodular and showing that there exist optimal fractional solutions to its linear relaxation.

We evaluate the efficiency of solving ILP formulations for GTED and its lower bound on simulated genomic strings and show that it is impractical to compute GTED on larger genomes.

In summary, we revisit two important problems in genome graph comparisons: Graph Traversal Edit Distance (GTED) and its variant FGTED. We show that both GTED and FGTED are NP-complete, and provide the first correct ILP formulations for GTED. We also show that the optimal objective of the ILP formulation proposed in [1] is a lower bound to GTED. We evaluate the efficiency of the ILPs for GTED and its lower bound on genomic sequences. These results provide solid algorithmic foundations for continued algorithmic innovation on the task of comparing genome graphs and point to the direction of approximation heuristics.

2. GTED and FGTED are NP-complete

2.1. Conflicting results on computational complexity of GTED and string-to-graph matching

The natural decision versions of all of the computational problems described above and below are clearly in NP. Under the assumption that $\mathrm{P}\neq\mathrm{NP}$, the results on the computational complexity of GTED and string-to-graph matching claimed in Ebrahimpour Boroojeny et al. [1] and Kupferman and Vardi [7], respectively, cannot both be true.

Kupferman and Vardi [7] show that the problem of determining if an input string can be spelled by concatenating edge labels in an Eulerian trail in an input graph is NP-complete. We call this problem Eulerian Trail Equaling Word (ETEW). We show in Theorem 1 that we can reduce ETEW to GTED, and therefore if GTED is polynomially solvable, then ETEW is polynomially solvable. The complete proof is in Appendix A.1.

Problem 2 (Eulerian Trail Equaling Word [7]). Given a string $s\in\Sigma^*$ and an edge-labeled Eulerian graph $G$, find an Eulerian trail $t$ of $G$ such that $\mathrm{str}(t)=s$.

Theorem 1. If $\mathrm{GTED}\in\mathrm{P}$ then $\mathrm{ETEW}\in\mathrm{P}$.

Proof sketch. We first convert an input instance $\langle s,G\rangle$ to ETEW into an input instance $\langle G_1,G_2\rangle$ to GTED by (a) creating graph $G_1$ that only contains edges that reconstruct string $s$ and (b) modifying $G$ into $G_2$ by extending the anti-parallel edges so that $G_2$ is unidirectional. We show that if $\mathrm{GTED}(G_1,G_2)=0$, there must be an Eulerian trail in $G$ that spells $s$, and if $\mathrm{GTED}(G_1,G_2)>0$, $G$ must not contain an Eulerian trail that spells $s$. □

Hence, an (assumed) polynomial-time algorithm for GTED solves ETEW in polynomial time. This contradicts Theorem 6 of Kupferman and Vardi [7] on the NP-completeness of ETEW (under $\mathrm{P}\neq\mathrm{NP}$).

2.2. Reduction from Hamiltonian Path to GTED and FGTED

We resolve the contradiction by showing that GTED is NP-complete. The details of the proof are in Appendix A.2.

Theorem 2. GTED is NP-complete.

Proof sketch. We reduce from the Hamiltonian Path problem, which asks whether a directed, simple graph G contains a path that visits every vertex exactly once. Here simple means no self-loops or parallel edges. The reduction is almost identical to that presented in Kupferman and Vardi [7], and from here until noted later in the proof the argument is identical except for the technicalities introduced to force unidirectionality (and another minor change described later).

Let $G=(V,E)$ be an instance of Hamiltonian Path, with $n=|V|$ vertices. We first create the Eulerian closure of $G$, which is defined as $G'=(V',E')$ where

$$V'=\{v^{\mathrm{in}},v^{\mathrm{out}} : v\in V\}\cup\{w\}. \tag{2}$$

Here, each vertex in $V$ is split into $v^{\mathrm{in}}$ and $v^{\mathrm{out}}$, and $w$ is a newly added vertex. $E'$ is the union of the following sets of edges and their labels:

  • $E_1=\{(v^{\mathrm{in}},v^{\mathrm{out}}) : v\in V\}$, labeled a,

  • $E_2=\{(u^{\mathrm{out}},v^{\mathrm{in}}) : (u,v)\in E\}$, labeled b,

  • $E_3=\{(v^{\mathrm{out}},v^{\mathrm{in}}) : v\in V\}$, labeled c,

  • $E_4=\{(v^{\mathrm{in}},u^{\mathrm{out}}) : (u,v)\in E\}$, labeled c,

  • $E_5=\{(u^{\mathrm{in}},w) : u\in V\}$, labeled c,

  • $E_6=\{(w,u^{\mathrm{in}}) : u\in V\}$, labeled b.

$G'$ is an Eulerian graph by construction but contains anti-parallel edges. We further create $G''$ from $G'$ by adding dummy nodes so that each pair of antiparallel edges is split into two parallel, length-2 paths with labels x#, where x is the original label.

We also create a graph $C$ that has the same number of edges as $G''$ and spells out a string

$$q=a\#(b\#a\#)^{n-1}(c\#)^{2n-1}(c\#b\#)^{|E|+1}. \tag{3}$$

We then argue that $G$ has a Hamiltonian path if and only if $G''$ spells out the string $q$, which uses the same line of arguments and graph traversals as in Kupferman and Vardi [7]. We then show that $\mathrm{GTED}(G'',C)=0$ if and only if $G''$ spells $q$. □
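As a sanity check on the construction, the sketch below builds the six edge sets of the Eulerian closure for a small directed graph. It confirms that every node has equal in- and out-degree, and that the target string has twice as many characters as the closure has edges. Every closure edge belongs to an antiparallel pair, so each is split into a length-2 path. The function names and the three-vertex example are hypothetical, and the exponents of the target string are read as n-1, 2n-1 and |E|+1.

```python
from collections import Counter

def eulerian_closure(vertices, edges):
    """Labeled edges E1..E6 of the closure of a simple directed graph."""
    closure = []
    for v in vertices:
        closure.append(((v, "in"), (v, "out"), "a"))   # E1
        closure.append(((v, "out"), (v, "in"), "c"))   # E3
        closure.append(((v, "in"), "w", "c"))          # E5
        closure.append(("w", (v, "in"), "b"))          # E6
    for u, v in edges:
        closure.append(((u, "out"), (v, "in"), "b"))   # E2
        closure.append(((v, "in"), (u, "out"), "c"))   # E4
    return closure

def target_string(n, m):
    """The string q for n vertices and m edges, with '#' after each label."""
    return "a#" + "b#a#" * (n - 1) + "c#" * (2 * n - 1) + "c#b#" * (m + 1)

def is_balanced(labeled_edges):
    """True if every node has as many outgoing as incoming edges."""
    out_deg = Counter(tail for tail, _, _ in labeled_edges)
    in_deg = Counter(head for _, head, _ in labeled_edges)
    return out_deg == in_deg

# Hypothetical instance of Hamiltonian Path.
V = ["x", "y", "z"]
E = [("x", "y"), ("y", "z"), ("x", "z")]
closure = eulerian_closure(V, E)
q = target_string(len(V), len(E))
print(is_balanced(closure), len(closure), len(q))  # True 18 36
```

The closure has 4n + 2|E| edges, which is exactly the number of non-# characters in q, so the edge counts of the two graphs in the reduction agree.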

Following a similar argument, we show that FGTED is also NP-complete, and its proof is in Appendix A.3.

Theorem 3. FGTED is NP-complete.

3. Revisiting the correctness of the proposed ILP solutions to GTED

In this section, we revisit two proposed ILP solutions to GTED by Ebrahimpour Boroojeny et al. [1] and show that the optimal objective value of these ILPs is not always equal to GTED.

3.1. Alignment graph

The previously proposed ILP formulations for GTED are based on the alignment graph constructed from input graphs. The high-level concept of an alignment graph is similar to the dynamic programming matrix for the string-to-string alignment problem [9].

Definition 2 (Alignment graph). Let $G_1,G_2$ be two unidirectional, edge-labeled Eulerian graphs. The alignment graph $\mathcal{A}(G_1,G_2)=(V,E,\delta)$ is a directed graph that has vertex set $V=V_1\times V_2$ and edge multi-set $E$ that equals the union of the following:

  • Vertical edges $[(u_1,u_2),(v_1,u_2)]$ for $(u_1,v_1)\in E_1$ and $u_2\in V_2$,

  • Horizontal edges $[(u_1,u_2),(u_1,v_2)]$ for $u_1\in V_1$ and $(u_2,v_2)\in E_2$,

  • Diagonal edges $[(u_1,u_2),(v_1,v_2)]$ for $(u_1,v_1)\in E_1$ and $(u_2,v_2)\in E_2$.

Each edge is associated with a cost by the cost function $\delta: E\to\mathbb{R}$.

Each diagonal edge $e=[(u_1,u_2),(v_1,v_2)]$ in an alignment graph can be projected to $(u_1,v_1)$ and $(u_2,v_2)$ in $G_1$ and $G_2$, respectively. Similarly, each vertical edge can be projected to one edge in $G_1$, and each horizontal edge can be projected to one edge in $G_2$.

We define the edge projection function $\pi_i$ that projects an edge from the alignment graph to an edge in the input graph $G_i$. We also define the path projection function $\Pi_i$ that projects a trail in the alignment graph to a trail in the input graph $G_i$. For example, if a trail in the alignment graph is $p=(e_1,e_2,\dots,e_m)$, then $\Pi_i(p)=(\pi_i(e_1),\pi_i(e_2),\dots,\pi_i(e_m))$ is a trail in $G_i$.

An example of an alignment graph is shown in Figure 1(b). The horizontal edges correspond to gaps in strings represented by G1, vertical edges correspond to gaps in strings represented by G2, and diagonal edges correspond to the matching between edge labels from the two graphs. In the rest of this paper, we assume that the costs for horizontal and vertical edges are 1, and the costs for the diagonal edges are 1 if the diagonal edge represents a mismatch and 0 if it is a match. The cost function δ can be defined to capture the cost of matching between edge labels or inserting gaps. This definition of alignment graph is also a generalization of the alignment graph used in string-to-graph alignment [8].
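The definition and cost convention above can be sketched in a few lines. The function name `alignment_graph` and the tuple layout are assumptions made for illustration, and the two path-shaped inputs are hypothetical.

```python
def alignment_graph(nodes1, edges1, nodes2, edges2):
    """Edges of the alignment graph as (tail, head, kind, cost) tuples.

    edges1, edges2: lists of (tail, head, label). Gaps and mismatches cost 1.
    """
    aligned = []
    for u1, v1, _ in edges1:              # vertical: advance in G1 only
        for u2 in nodes2:
            aligned.append(((u1, u2), (v1, u2), "vertical", 1))
    for u1 in nodes1:                     # horizontal: advance in G2 only
        for u2, v2, _ in edges2:
            aligned.append(((u1, u2), (u1, v2), "horizontal", 1))
    for u1, v1, a in edges1:              # diagonal: match (0) or mismatch (1)
        for u2, v2, b in edges2:
            aligned.append(((u1, u2), (v1, v2), "diagonal", 0 if a == b else 1))
    return aligned

# Hypothetical inputs: G1 spells "AC" and G2 spells "AG" along a path.
A = alignment_graph([0, 1, 2], [(0, 1, "A"), (1, 2, "C")],
                    [0, 1, 2], [(0, 1, "A"), (1, 2, "G")])
print(len(A))  # 16 = |E1||V2| + |V1||E2| + |E1||E2|
```

The edge count grows with the product of the input sizes, which is one reason the ILPs built on this graph become large quickly.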

Figure 1:

(a) An example of two edge-labeled Eulerian graphs $G_1$ (top) and $G_2$ (bottom). (b) The alignment graph $\mathcal{A}(G_1,G_2)$. The cycle with red edges is the path corresponding to $\mathrm{GTED}(G_1,G_2)$. Red solid edges are matches with cost 0 and the red dashed edge is a mismatch with cost 1.

3.2. The first previously proposed ILP for GTED

Lemma 1 in Ebrahimpour Boroojeny et al. [1] provides a model for computing GTED by finding the minimum-cost trail in the alignment graph. We reiterate it here for completeness.

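Lemma 1 ([1]). For any two edge-labeled Eulerian graphs $G_1$ and $G_2$,

$$\begin{aligned}\mathrm{GTED}(G_1,G_2)=\ &\underset{c}{\text{minimize}} && \delta(c)\\ &\text{subject to} && c \text{ is a trail in } \mathcal{A}(G_1,G_2),\\ & && \Pi_i(c) \text{ is an Eulerian trail in } G_i \text{ for } i=1,2,\end{aligned} \tag{4}$$

where $\delta(c)$ is the total edge cost of $c$, and $\Pi_i(c)$ is the projection from $c$ to $G_i$.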

An example of such a minimum-cost trail is shown in Figure 1(b). Ebrahimpour Boroojeny et al. [1] provide the following ILP formulation and claim that it is a direct translation of Lemma 1:

$$\begin{aligned}
\underset{x\in\mathbb{N}^{|E|}}{\text{minimize}}\quad & \sum_{e\in E}x_e\,\delta(e) && (5)\\
\text{subject to}\quad & Ax=0 && (6)\\
& \sum_{e\in E}x_e\,I_i(e,f)=1 \quad\text{for } i=1,2 \text{ and for all } f\in E_i && (7)\\
& A_{ue}=\begin{cases}-1 & \text{if } e=(u,v)\in E \text{ for some vertex } v\in V,\\ \phantom{-}1 & \text{if } e=(v,u)\in E \text{ for some vertex } v\in V,\\ \phantom{-}0 & \text{otherwise.}\end{cases} && (8)
\end{aligned}$$

Here, $E$ is the edge set of $\mathcal{A}(G_1,G_2)$. $A$ is the negative incidence matrix of size $|V|\times|E|$, and $I_i(e,f)$ is an indicator function that is 1 if edge $e$ in $E$ projects to edge $f$ in the input graph $G_i$ (and 0 otherwise). We define the domain of each $x_e$ to include all non-negative integers. However, due to constraints (7), the values of $x_e$ are limited to either 0 or 1. We describe this ILP formulation with the assumption that both input graphs have closed Eulerian trails, which means that each node has equal numbers of incoming and outgoing edges. We discuss the cases when input graphs contain open Eulerian trails in Section 4.

While the ILP in (5)–(8) allows the solutions to select disjoint cycles in the alignment graph, the projection of edges in these disjoint cycles does not correspond to a single string represented by either of the input graphs. We show that the ILP in (5)–(8) does not solve GTED by giving an example where the objective value of the optimal solution to the ILP in (5)–(8) is not equal to GTED.

Construct two input graphs as shown in Figure 2(a). Specifically, $G_1$ spells circular permutations of TTTGAA and $G_2$ spells circular permutations of TTTAGA. It is clear that $\mathrm{GTED}(G_1,G_2)=2$ under Levenshtein edit distance. On the other hand, as shown in Figure 2(a), an optimal solution in $\mathcal{A}(G_1,G_2)$ contains two disjoint cycles with nonzero $x_e$ values that have a total edge cost equal to 0. This solution is a feasible solution to the ILP in (5)–(8). It is also an optimal solution because the objective value is zero, which is the lower bound on the ILP in (5)–(8). This optimal objective value, however, is smaller than $\mathrm{GTED}(G_1,G_2)$. Therefore, the ILP in (5)–(8) does not solve GTED since it allows the solution to be a set of disjoint components.

Figure 2:

(a) The subgraph in the alignment graph induced by an optimal solution to the ILP in (5)–(8) and the ILP in (11)–(12) with input graphs on the left and top. The red and blue edges in the alignment graph are edges matching labels in red and blue font, respectively, and are part of the optimal solution to the ILP in (5)–(8). The costs of the red and blue edges are zero. (b) The subgraph induced by $x^{\mathrm{init}}$ with $s_1=u_1$ and $s_2=v_1$ according to the ILP in (11)–(12). The rest of the edges in the alignment graph are omitted for simplicity.

3.3. The second previously proposed ILP formulation of GTED

We describe the second proposed ILP formulation of GTED by Ebrahimpour Boroojeny et al. [1]. Following Ebrahimpour Boroojeny et al. [1], we use simplices, a notion from geometry, to generalize the notion of an edge to higher dimensions. A $k$-simplex is a $k$-dimensional polytope which is the convex hull of its $k+1$ vertices. For example, a 1-simplex is an undirected edge, and a 2-simplex is a triangle. We use the orientation of a simplex, which is given by the ordering of the vertex set of a simplex up to an even permutation, to generalize the notion of the edge direction [15, p. 26]. We use square brackets $[\cdot]$ to denote an oriented simplex. For example, $[v_0,v_1]$ denotes a 1-simplex with orientation $v_0\to v_1$, which is a directed edge from $v_0$ to $v_1$, and $[v_0,v_1,v_2]$ denotes a 2-simplex with orientation corresponding to the vertex ordering $v_0\to v_1\to v_2\to v_0$. Each $k$-simplex has two possible unique orientations, and we use the signed coefficient to connect their forms together, e.g. $[v_0,v_1]=-[v_1,v_0]$.

For each pair of graphs $G_1$ and $G_2$ and their alignment graph $\mathcal{A}(G_1,G_2)$, we define an oriented 2-simplex set $T(G_1,G_2)$ which is the union of:

  • $[(u_1,u_2),(v_1,u_2),(v_1,v_2)]$ for all $(u_1,v_1)\in E_1$ and $(u_2,v_2)\in E_2$, and

  • $[(u_1,u_2),(u_1,v_2),(v_1,v_2)]$ for all $(u_1,v_1)\in E_1$ and $(u_2,v_2)\in E_2$.

We use the boundary operator [15, p. 28], denoted by $\partial$, to map an oriented $k$-simplex to a sum of oriented $(k-1)$-simplices with signed coefficients.

$$\partial[v_0,v_1,\dots,v_k]=\sum_{i=0}^{k}(-1)^i[v_0,\dots,\hat{v}_i,\dots,v_k], \tag{9}$$

where $\hat{v}_i$ denotes that the vertex $v_i$ is to be deleted. Intuitively, the boundary operator maps the oriented $k$-simplex to a sum of oriented $(k-1)$-simplices such that their vertices are in the $k$-simplex and their orientations are consistent with the orientation of the $k$-simplex. For example, when $k=2$, we have:

$$\partial[v_0,v_1,v_2]=[v_1,v_2]-[v_0,v_2]+[v_0,v_1]=[v_1,v_2]+[v_2,v_0]+[v_0,v_1]. \tag{10}$$
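The boundary operator is short enough to write out directly. In this minimal sketch an oriented simplex is a tuple of vertex names, and a chain is a list of (sign, simplex) pairs. Both representations and the helper `canonical` are choices made here for illustration.

```python
def boundary(simplex):
    """Boundary of an oriented simplex as a list of (sign, face) pairs."""
    return [((-1) ** i, simplex[:i] + simplex[i + 1:])
            for i in range(len(simplex))]

def canonical(chain):
    """Collect a signed sum of oriented edges, using [a, b] = -[b, a]."""
    coeffs = {}
    for sign, (a, b) in chain:
        if a > b:                      # flip the edge and negate its sign
            a, b, sign = b, a, -sign
        coeffs[(a, b)] = coeffs.get((a, b), 0) + sign
    return {edge: c for edge, c in coeffs.items() if c != 0}

triangle = ("v0", "v1", "v2")
print(boundary(triangle))
# [(1, ('v1', 'v2')), (-1, ('v0', 'v2')), (1, ('v0', 'v1'))]

# The two forms of the k = 2 example agree once orientations are normalized.
cyclic = [(1, ("v1", "v2")), (1, ("v2", "v0")), (1, ("v0", "v1"))]
print(canonical(boundary(triangle)) == canonical(cyclic))  # True
```

Each column of a boundary matrix is one such signed list, written as a vector indexed by the oriented edges.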

We reiterate the second ILP formulation proposed in Ebrahimpour Boroojeny et al. [1]. Given an alignment graph $\mathcal{A}(G_1,G_2)=(V,E,\delta)$ and the oriented 2-simplex set $T(G_1,G_2)$,

$$\underset{x\in\mathbb{N}^{|E|},\ y\in\mathbb{Z}^{|T(G_1,G_2)|}}{\text{minimize}}\ \sum_{e\in E}x_e\,\delta(e)\quad\text{subject to}\quad x=x^{\mathrm{init}}+[\partial]y \tag{11}$$

Entries in $x$ and $y$ correspond to 1-simplices and 2-simplices in $E$ and $T(G_1,G_2)$, respectively. $[\partial]$ is a $|E|\times|T(G_1,G_2)|$ boundary matrix where each entry $[\partial]_{i,j}$ is the signed coefficient of the oriented 1-simplex (the directed edge) in $E$ corresponding to $x_i$ in the boundary of the oriented 2-simplex in $T(G_1,G_2)$ corresponding to $y_j$. The indices $i$ and $j$ are assigned based on an arbitrary ordering of the 1-simplices in $E$ and the 2-simplices in $T(G_1,G_2)$, respectively. An example of the boundary matrix is shown in Figure 3. $\delta(e)$ is the cost of each edge. $x^{\mathrm{init}}\in\mathbb{R}^{|E|}$ is a vector where each entry corresponds to a 1-simplex in $E$, with $|E_1|+|E_2|$ nonzero entries that represent one Eulerian trail in each input graph. $x^{\mathrm{init}}$ is a feasible solution to the ILP. Let $s_1$ be the source of the Eulerian trail in $G_1$, and $s_2$ be the sink of the Eulerian trail in $G_2$. Each entry in $x^{\mathrm{init}}$ is defined by

$$x^{\mathrm{init}}_e=\begin{cases}1 & \text{if } e=[(u_1,s_2),(v_1,s_2)] \text{ or } e=[(s_1,u_2),(s_1,v_2)],\\ 0 & \text{otherwise.}\end{cases} \tag{12}$$

If the Eulerian trail is closed in $G_i$, $s_i$ can be any vertex in $V_i$. An example of $x^{\mathrm{init}}$ is shown in Figure 2(b).

Figure 3:

(a) A graph that contains an unoriented 2-simplex with three unoriented 1-simplices. (b), (c) The same graph with two different ways of orienting the simplices and the corresponding boundary matrices.

We provide a complete proof in Section B of the Appendix that the ILP in (5)–(8) is equivalent to the ILP in (11)–(12). Therefore, the example we provided in Section 3.2 is also an optimal solution to the ILP in (11)–(12) but not a solution to GTED. Thus, the ILP in (11)–(12) does not always solve GTED.

4. New ILP solutions to GTED

To ensure that our new ILP formulations are applicable to input graphs regardless of whether they contain an open or closed Eulerian trail, we add a source node s and a sink node t to the alignment graph. Figure 4 illustrates three possible cases of input graphs.

  1. If only one of the input graphs has closed Eulerian trails, wlog, let $G_1$ be the input graph with open Eulerian trails. Let $a_1$ and $b_1$ be the start and end of the Eulerian trail that have odd degrees. Add edges $[s,(a_1,v_2)]$ and $[(b_1,v_2),t]$ to $E$ for all nodes $v_2\in V_2$ (Figure 4(a)).

  2. If both input graphs have closed Eulerian trails, let $a_1$ and $a_2$ be two arbitrary nodes in $G_1$ and $G_2$, respectively. Add edges $[s,(a_1,v_2)]$, $[s,(v_1,a_2)]$, $[(a_1,v_2),t]$ and $[(v_1,a_2),t]$ for all nodes $v_1\in V_1$ and $v_2\in V_2$ to $E$ (Figure 4(b)).

  3. If both input graphs have open Eulerian trails, add edges $[s,(a_1,a_2)]$ and $[(b_1,b_2),t]$, where $a_i$ and $b_i$ are start and end nodes of the Eulerian trails in $G_i$, respectively (Figure 4(c)).

Figure 4:

Modified alignment graphs based on input types. (a) $G_1$ has open Eulerian trails while $G_2$ has closed Eulerian trails. (b) Both $G_1$ and $G_2$ have closed Eulerian trails. (c) Both $G_1$ and $G_2$ have open Eulerian trails. Solid red and blue nodes are the source and sink nodes of the graphs with open Eulerian trails. “s” and “t” are the added source and sink nodes. Colored edges are added alignment edges directing from and to source and sink nodes, respectively.

According to Lemma 1, we can solve $\mathrm{GTED}(G_1,G_2)$ by finding a trail in $\mathcal{A}(G_1,G_2)$ that satisfies the projection requirements. This is equivalent to finding an $s$-$t$ trail in $\mathcal{A}(G_1,G_2)$ that satisfies constraints:

$$\sum_{(u,v)\in E}x_{uv}\,I_i((u,v),f)=1\quad\text{for all } (u,v)\in E,\ f\in G_i,\ u\neq s,\ v\neq t, \tag{13}$$

where $I_i(e,f)=1$ if the alignment edge $e$ projects to $f$ in $G_i$. An optimal solution to GTED in the alignment graph must start and end with the source and sink node because they are connected to all possible starts and ends of Eulerian trails in the input graphs.

Since a trail in $\mathcal{A}(G_1,G_2)$ is a flow network, we use the following flow constraints to enforce the equality between the number of in- and out-edges for each node in the alignment graph except the source and sink nodes.

$$\begin{aligned}
\sum_{(s,u)\in E}x_{su}&=1 && (14)\\
\sum_{(v,t)\in E}x_{vt}&=1 && (15)\\
\sum_{(u,v)\in E}x_{uv}&=\sum_{(v,w)\in E}x_{vw}\quad\text{for all } v\in V && (16)
\end{aligned}$$

Constraints (13) and (16) are equivalent to constraints (7) and (6), respectively. Therefore, we rewrite the ILP in (5)–(8) in terms of the modified alignment graph.

$$\underset{x\in\mathbb{N}^{|E|}}{\text{minimize}}\ \sum_{e\in E}x_e\,\delta(e)\quad\text{subject to constraints (13)–(16).} \tag{lower bound ILP}$$

As we show in Section 3.2, constraints (13)–(16) do not guarantee that the ILP solution is one trail in $\mathcal{A}(G_1,G_2)$. They allow several disjoint covering trails to be selected in the solution, so the ILP fails to model GTED correctly. We show in Section 5 that the optimal objective value of this ILP is a lower bound to GTED.

According to Lemma 1 in Dias et al. [11], a subgraph of a directed graph $G$ with source node $s$ and sink node $t$ is an $s$-$t$ trail if and only if it is a flow network and every strongly connected component (SCC) of the subgraph has at least one edge outgoing from it. Thus, in order to formulate an ILP for the GTED problem, it is necessary to devise constraints that prevent disjoint SCCs from being selected in the alignment graph. In the following, we describe two approaches for achieving this.

4.1. Enforcing one trail in the alignment graph via constraint generation

Section 3.2 of Dias et al. [11] proposes a method to design linear constraints for eliminating disjoint SCCs, which can be directly adapted to our problem. Let $\mathcal{C}$ be the collection of all strongly connected subgraphs of the alignment graph $\mathcal{A}(G_1,G_2)$. We use the following constraint to enforce that the selected edges form one $s$-$t$ trail in the alignment graph:

$$\text{If } \sum_{(u,v)\in E(C)}x_{uv}=|E(C)|,\ \text{then } \sum_{(u,v)\in\varepsilon^{+}(C)}x_{uv}\geq 1\quad\text{for all } C\in\mathcal{C}, \tag{17}$$

where $E(C)$ is the set of edges in the strongly connected subgraph $C$ and $\varepsilon^{+}(C)$ is the set of edges $(u,v)$ such that $u$ belongs to $C$ and $v$ does not belong to $C$. $\sum_{(u,v)\in E(C)}x_{uv}=|E(C)|$ indicates that $C$ is in the subgraph of $\mathcal{A}(G_1,G_2)$ constructed by all edges $(u,v)$ with positive $x_{uv}$, and $\sum_{(u,v)\in\varepsilon^{+}(C)}x_{uv}\geq 1$ guarantees that there exists an out-going edge of $C$ that is in the subgraph.

We use the same technique as Dias et al. [11] to linearize the “if-then” condition in (17) by introducing a new variable $\beta_C$ for each strongly connected subgraph $C$:

$$\begin{aligned}
\sum_{(u,v)\in E(C)}x_{uv}&\geq|E(C)|\,\beta_C &&\text{for all } C\in\mathcal{C} && (18)\\
\sum_{(u,v)\in E(C)}x_{uv}-|E(C)|+1-|E(C)|\,\beta_C&\leq 0 &&\text{for all } C\in\mathcal{C} && (19)\\
\sum_{(u,v)\in\varepsilon^{+}(C)}x_{uv}&\geq\beta_C &&\text{for all } C\in\mathcal{C} && (20)\\
\beta_C&\in\{0,1\} &&\text{for all } C\in\mathcal{C} && (21)
\end{aligned}$$

To summarize, given any pair of unidirectional, edge-labeled Eulerian graphs $G_1$ and $G_2$ and their alignment graph $\mathcal{A}(G_1,G_2)=(V,E,\delta)$, $\mathrm{GTED}(G_1,G_2)$ is equal to the optimal objective value of the following ILP formulation:

$$\underset{x\in\{0,1\}^{|E|}}{\text{minimize}}\ \sum_{e\in E}x_e\,\delta(e)\quad\text{subject to constraints (13)–(16) and constraints (18)–(21).} \tag{exponential ILP}$$

This ILP has an exponential number of constraints as there is a set of constraints for every strongly connected subgraph in the alignment graph. To solve this ILP more efficiently, we can use the procedure similar to the iterative constraint generation procedure in Dias et al. [11]. Initially, solve the ILP with only constraints (13)–(16). Create a subgraph, G, induced by edges with positive xuv. For each disjoint SCC in G that does not contain the sink node, add constraints (18)–(21) for edges in the SCC and solve the new ILP. Iterate until no disjoint SCCs are found in the solution.
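The separation step of this procedure, finding selected components that need a new constraint, can be sketched without an ILP solver. The sketch below makes a simplification: because the selected edges already satisfy flow conservation, any connected piece that misses the sink is balanced and therefore strongly connected, so it searches undirected components instead of running a full SCC algorithm. The function name and the toy edge lists are hypothetical.

```python
def violated_components(selected_edges, sink="t"):
    """Edge sets of selected components that do not contain the sink.

    selected_edges: list of (tail, head) pairs with positive x values,
    assumed to satisfy flow conservation already.
    """
    adjacency = {}
    for u, v in selected_edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)   # direction is ignored here
    seen, result = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, nodes = [start], set()
        while stack:                            # depth-first search
            node = stack.pop()
            if node in nodes:
                continue
            nodes.add(node)
            stack.extend(adjacency[node] - nodes)
        seen |= nodes
        if sink not in nodes:
            result.append([(u, v) for u, v in selected_edges if u in nodes])
    return result

# Toy solution: one s-t path plus one isolated 2-cycle.
solution = [("s", "a"), ("a", "t"), ("b", "c"), ("c", "b")]
print(violated_components(solution))  # [[('b', 'c'), ('c', 'b')]]
```

In a full implementation each returned edge set would become one block of constraints (18)–(21), and the ILP would be solved again until the list comes back empty.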

4.2. A compact ILP for GTED with polynomial number of constraints

In the worst case, the number of iterations needed to solve (exponential ILP) via constraint generation is exponential. As an alternative, we introduce a compact ILP with only a polynomial number of constraints. The intuition behind this ILP is that we can impose a partially increasing ordering on all the edges so that the selected edges form an $s$-$t$ trail in the alignment graph. This idea is similar to the Miller-Tucker-Zemlin ILP formulation of the Travelling Salesman problem (TSP) [12].

We add variables $d_{uv}$ that are constrained to provide a partial ordering of the edges in the $s$-$t$ trail and set the variables $d_{uv}$ to zero for edges that are not selected in the $s$-$t$ trail. Intuitively, there must exist an ordering of edges in an $s$-$t$ trail such that for each pair of consecutive edges $(u,v)$ and $(v,w)$, the difference between their order variables $d_{uv}$ and $d_{vw}$ is 1. Therefore, for each node $v$ that is not the source or the sink, if we sum up the order variables for the incoming edges and outgoing edges respectively, the difference between the two sums is equal to the number of selected incoming/outgoing edges. Lastly, the order variable for the edge starting at the source is 1, and the order variable for the edge ending at the sink is the number of selected edges. This gives the ordering constraints as follows:

$$\begin{aligned}
&\text{If } x_{uv}=0,\ \text{then } d_{uv}=0 &&\text{for all } (u,v)\in E && (22)\\
&\sum_{(v,w)\in E}d_{vw}-\sum_{(u,v)\in E}d_{uv}=\sum_{(v,w)\in E}x_{vw} &&\text{for all } v\in V\setminus\{s,t\} && (23)\\
&\sum_{(s,u)\in E}d_{su}=1 && && (24)\\
&\sum_{(v,t)\in E}d_{vt}=\sum_{(u,v)\in E}x_{uv} && && (25)
\end{aligned}$$

We enforce that all variables $x_e\in\{0,1\}$ and $d_e\in\mathbb{N}$ for all $e\in E$.

The “if-then” statement in Equation (22) can be linearized by introducing an additional binary variable $y_{uv}$ for each edge [11, 16]:

$$\begin{aligned}
x_{uv}+|E|\,y_{uv}&\geq 1 && (26)\\
d_{uv}-|E|\,(1-y_{uv})&\leq 0 && (27)\\
y_{uv}&\in\{0,1\}. && (28)
\end{aligned}$$

Here, $y_{uv}$ is a binary switch that must equal 1 whenever $x_{uv}=0$. The coefficient $|E|$ is the number of edges in the alignment graph and also an upper bound on the ordering variables. When $y_{uv}=1$, $d_{uv}\leq 0$, and $y_{uv}$ does not impose constraints on $x_{uv}$. When $y_{uv}=0$, $x_{uv}\geq 1$, and $y_{uv}$ does not impose constraints on $d_{uv}$.
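A small enumeration can confirm that this linearization encodes the implication in (22). The sketch reads constraints (26) and (27) as x + |E|y ≥ 1 and d − |E|(1 − y) ≤ 0, which is the reading consistent with the case analysis above. The function name and the bound of 4 are arbitrary choices for illustration.

```python
from itertools import product

def feasible_pairs(num_edges):
    """All (x, d) pairs for which some binary y satisfies (26) and (27)."""
    pairs = set()
    for x, d, y in product((0, 1), range(num_edges + 1), (0, 1)):
        ok26 = x + num_edges * y >= 1          # y = 0 forces x >= 1
        ok27 = d - num_edges * (1 - y) <= 0    # y = 1 forces d <= 0
        if ok26 and ok27:
            pairs.add((x, d))
    return pairs

M = 4  # stands in for |E|
expected = {(0, 0)} | {(1, d) for d in range(M + 1)}
print(feasible_pairs(M) == expected)  # True
```

The feasible pairs are exactly those with x = 1, or with x = 0 and d = 0, so an unselected edge can never carry a nonzero order value.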

4.3. Correctness of (compact ILP) for GTED

To show that the optimal objective value of (compact ILP) is equal to GTED, we show that the optimal solutions to (compact ILP) always form one connected component.

Lemma 2. Let $x_e$ and $d_e$ be ILP variables. Let $G$ be the subgraph of $\mathcal{A}(G_1,G_2)$ that is induced by edges with $x_e=1$. If $x_e$ and $d_e$ satisfy constraints (13)–(25) for all $e\in E$, then $G$ is connected, with one trail from $s$ to $t$ that traverses each edge in $G$ exactly once.

Proof. We prove the lemma in 2 parts: (1) all nodes except s and t in G have an equal number of in- and out-edges, (2) G contains only one connected component.

The first statement holds because the edges of $G$ form a flow from $s$ to $t$, as enforced by constraint (16).

We then show that G does not contain isolated subgraphs that are not reachable from s or t. Due to constraint (16), the only possible scenario is that the isolated subgraph is strongly connected. Suppose for contradiction that there is a strongly connected component, C, in G that is not reachable from s or t.

The sum of the left hand side of constraint (23) over all vertices in C is

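$$\begin{aligned}
\sum_{v\in C}\Big(\sum_{(v,w)\in C}d_{vw}-\sum_{(u,v)\in C}d_{uv}\Big)&=\sum_{v\in C}\sum_{(v,w)\in C}d_{vw}-\sum_{v\in C}\sum_{(u,v)\in C}d_{uv} && (29)\\
&=\sum_{(v,w)\in E(C)}d_{vw}-\sum_{(u,v)\in E(C)}d_{uv}=0. && (30)
\end{aligned}$$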

However, the right-hand side of the same constraints is always positive. Hence we have a contradiction. Therefore, G has only one connected component. □
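The argument can be illustrated numerically. In the minimal sketch below, all listed edges are treated as selected. Order values equal to positions along an s-t trail satisfy constraint (23) at every internal node, even when a node is revisited, while an isolated 3-cycle admits no valid order values at all. The helper name, the toy graphs, and the search range 0 to 9 are assumptions for illustration.

```python
from itertools import product

def satisfies_23(edges, d, skip=("s", "t")):
    """Check constraint (23) at every node other than the source and sink."""
    nodes = {n for e in edges for n in e} - set(skip)
    for v in nodes:
        outgoing = [e for e in edges if e[0] == v]
        incoming = [e for e in edges if e[1] == v]
        lhs = sum(d[e] for e in outgoing) - sum(d[e] for e in incoming)
        if lhs != len(outgoing):       # right-hand side: selected out-edges
            return False
    return True

# An s-t trail that visits node "a" twice; order = position in the trail.
trail = [("s", "a"), ("a", "b"), ("b", "a"), ("a", "t")]
order = {e: k for k, e in enumerate(trail, 1)}
print(satisfies_23(trail, order))  # True

# An isolated 3-cycle: the left-hand sides sum to 0 but the right to 3.
cycle = [("p", "q"), ("q", "r"), ("r", "p")]
any_ok = any(satisfies_23(cycle, dict(zip(cycle, values)))
             for values in product(range(10), repeat=3))
print(any_ok)  # False
```

The trail assignment also meets constraints (24) and (25): the edge leaving the source has order 1 and the edge entering the sink has order 4, the number of selected edges.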

Due to Lemma 1 and Lemma 2, given input graphs $G_1$ and $G_2$ and the alignment graph $\mathcal{A}(G_1,G_2)$, $\mathrm{GTED}(G_1,G_2)$ is equal to the optimal objective of

$$\underset{x\in\{0,1\}^{|E|}}{\text{minimize}}\ \sum_{e\in E}x_e\,\delta(e)\quad\text{subject to constraints (13)–(16), constraints (23)–(25) and constraints (26)–(28).} \tag{compact ILP}$$

5. Closed-trail Cover Traversal Edit Distance

While the (lower bound ILP) and the ILP in (11)–(12) do not solve GTED, the optimal solution to these ILPs is a lower bound of GTED. These ILP formulations also solve an interesting variant of GTED, which is a local similarity measure between two genome graphs. We call this variant Closed-trail Cover Traversal Edit Distance (CCTED). In the following, we provide the formal definition of the CCTED problem and then show that the (lower bound ILP) is the correct ILP formulation for solving CCTED.

We first introduce the min-cost item matching problem between two multi-sets. Let two multi-sets of items be $S_1$ and $S_2$, and, wlog, let $|S_1|\leq|S_2|$. Let $c:(S_1\cup\{\epsilon\})\times S_2\to\mathbb{N}$ be the cost of matching either an empty item $\epsilon$ or an item in $S_1$ with an item in $S_2$. Given $S_1$, $S_2$ and the cost function $c$, the min-cost matching problem finds a matching, $\mathcal{M}_c(S_1,S_2)$, such that each item in $S_1\cup\{\epsilon\}^{|S_2|-|S_1|}$ is matched with exactly one distinct item in $S_2$ and the total cost of the matching, $\sum_{(s_1,s_2)\in\mathcal{M}_c(S_1,S_2)}c(s_1,s_2)$, is minimized.

The min-cost item matching problem is similar to the Earth Mover’s Distance defined in [17], except that only integral units of items can be matched and the cost of matching an empty item with another item is not constant. Similar to the Earth Mover’s Distance, the min-cost item matching problem can be computed using the ILP formulation of the min-cost max-flow problem [13, 14]. When the cost is the edit distance, the cost to match $\epsilon$ with a string is equal to the length of the string.
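For very small multi-sets the matching can be found by trying every assignment, which makes the role of the empty item easy to see. This brute-force sketch is not the min-cost max-flow formulation mentioned above. It models the empty item as the empty string, so that matching it with a string costs the string's length. The function names and example strings are hypothetical.

```python
from itertools import permutations

def edit(s, t):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def min_cost_matching(items1, items2):
    """Brute-force min-cost item matching with edit distance as the cost.

    Assumes len(items1) <= len(items2); the shorter side is padded with
    empty items, modeled as "".
    """
    padded = list(items1) + [""] * (len(items2) - len(items1))
    return min(sum(edit(a, b) for a, b in zip(padded, perm))
               for perm in permutations(items2))

print(min_cost_matching(["TTT"], ["GA", "TTT"]))          # 2
print(min_cost_matching(["AGA", "TTT"], ["TTT", "GAA"]))  # 2
```

In the first call the unmatched string GA is paired with the empty item at cost 2. In the second, pairing TTT with TTT and AGA with GAA is cheaper than the crossed assignment.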

Define the traversal edit distance, $\mathrm{edit}_t(t_1,t_2)$, as the edit distance between the strings constructed from a pair of trails $t_1$ and $t_2$. In other words, $\mathrm{edit}_t(t_1,t_2)=\mathrm{edit}(\mathrm{str}(t_1),\mathrm{str}(t_2))$. CCTED is defined as:

Problem 3 (Closed-Trail Cover Traversal Edit Distance (CCTED)). Given two unidirectional, edge-labeled Eulerian graphs $G_1$ and $G_2$ with closed Eulerian trails, compute

$$\mathrm{CCTED}(G_1,G_2)\triangleq\min_{\substack{C_1\in CC(G_1)\\ C_2\in CC(G_2)}}\ \sum_{(t_1,t_2)\in\mathcal{M}_{\mathrm{edit}_t}(C_1,C_2)}\mathrm{edit}\big(\mathrm{str}(t_1),\mathrm{str}(t_2)\big). \tag{31}$$

Here, $CC(G)$ denotes the collection of all possible sets of edge-disjoint, closed trails in $G$, such that every edge in $G$ belongs to exactly one of these trails. Each element of $CC(G)$ can be interpreted as a cover of $G$ using such trails. $\mathcal{M}_{\mathrm{edit}_t}(C_1,C_2)$ is a min-cost matching between two covers using the traversal edit distance as the cost.

CCTED is likely a more suitable metric for comparing genomes that undergo large-scale rearrangements. The relationship between CCTED and GTED is analogous to the relationship between synteny block comparison [3] and string edit distance computation, where the former is more often used in interspecies comparisons and in detecting segmental duplications [18, 19] and the latter is more often seen in intraspecies comparisons.

Following similar ideas as Lemma 1, we can compute CCTED by finding a set of closed trails in the alignment graph such that the total cost of alignment edges is minimized, and the projection of all edges in the collection of selected trails is equal to the multi-set of input graph edges.

Lemma 3. For any two edge-labeled Eulerian graphs G1 and G2,

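$$\mathrm{CCTED}(G_1, G_2) = \underset{C}{\text{minimize}} \quad \sum_{c \in C} \delta(c) \tag{32}$$
$$\text{subject to} \quad C \text{ is a set of closed trails in } \mathcal{A}(G_1, G_2), \quad \bigcup_{e \in C} \Pi_i(e) = E_i \ \text{ for } i = 1, 2, \tag{33}$$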

where C is a collection of trails and δ(c) is the total cost of edges in trail c.

Proof. Given any pair of covers $C_1 \in CC(G_1)$ and $C_2 \in CC(G_2)$ and their min-cost matching based on the edit distance, $\mathcal{M}_{\mathrm{edit}_t}(C_1, C_2)$, we can project each pair of matched closed trails to a closed trail in the alignment graph. For a matching between a trail and the empty item $\epsilon$, we can project it to a closed trail in the alignment graph with all vertical edges if the trail is from $G_1$ or all horizontal edges if the trail is from $G_2$. The total cost of the projected edges must be greater than or equal to the objective (32). On the other hand, every collection of trails $C$ that satisfies constraint (33) can be projected to a cover in each of the input graphs, and $\sum_{c \in C} \delta(c) \ge \mathrm{CCTED}(G_1, G_2)$. Hence equality holds. □

5.1. The ILP formulation for CCTED

We show that the ILP in (5)–(8) proposed by Ebrahimpour Boroojeny et al. [1] solves CCTED.

Theorem 4. Given two input graphs $G_1$ and $G_2$, the optimal objective value of the ILP in (5)–(8) based on $\mathcal{A}(G_1, G_2)$ is equal to $\mathrm{CCTED}(G_1, G_2)$.

Proof. As shown in the proof of Lemma 3, any pair of edge-disjoint, closed-trail covers in the input graphs can be projected to a set of closed trails in $\mathcal{A}(G_1, G_2)$, which satisfies constraints (6)–(8). The objective of this feasible solution, which is the total cost of the projected closed trails, equals CCTED. Therefore, $\mathrm{CCTED}(G_1, G_2)$ is greater than or equal to the objective of the ILP in (5)–(8).

Conversely, we can transform any feasible solution of the ILP in (5)–(8) to a pair of covers of $G_1$ and $G_2$. We can do this by transforming one closed trail at a time from the subgraph of the alignment graph $\mathcal{A}$ induced by edges with ILP variable $x_{uv} = 1$. Let $c$ be a closed trail in $\mathcal{A}$. Let $c_1 = \Pi_1(c)$ and $c_2 = \Pi_2(c)$ be the two closed trails in $G_1$ and $G_2$ that are projected from $c$. We can construct an alignment between $\mathrm{str}(c_1)$ and $\mathrm{str}(c_2)$ from $c$ by adding a match or insertion/deletion column for each match or insertion/deletion edge in $c$ accordingly. The cost of the alignment is equal to the total cost of edges in $c$ by the construction of the alignment graph. We can then remove the edges in $c$ from the alignment graph and the edges in $c_1$ and $c_2$ from the input graphs, respectively. The remaining edges in $\mathcal{A}$, $G_1$ and $G_2$ still satisfy constraints (6)–(8). Repeating this process, we get a total cost of $\sum_{e \in E} x_e \delta(e)$ that aligns pairs of closed trails that form covers of $G_1$ and $G_2$. This total cost is greater than or equal to $\mathrm{CCTED}(G_1, G_2)$.
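The peeling step in this argument (remove one closed trail at a time from the selected edges) can be sketched as follows. This is a minimal illustration under the assumption that the selected alignment-graph edges are given as a list of (tail, head) node pairs in which every node has equal in-degree and out-degree; the function name is hypothetical.

```python
from collections import defaultdict


def closed_trail_cover(edges):
    """Split a balanced directed multigraph into edge-disjoint closed trails.

    `edges` is a list of (u, v) pairs (duplicates allowed) in which every
    node has equal in-degree and out-degree. Returns a list of trails; each
    trail is a list of consecutive edges that starts and ends at one node.
    """
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)

    trails = []
    for start in list(out):
        # Peel closed trails through `start` until its out-edges are used up.
        while out[start]:
            trail, node = [], start
            while True:
                nxt = out[node].pop()   # balance guarantees an unused edge
                trail.append((node, nxt))
                node = nxt
                if node == start:
                    break
            trails.append(trail)
    return trails
```

Each returned trail uses every selected edge at most once, and together the trails use all of them, which is the cover property the proof relies on.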

5.2. CCTED is a lower bound of GTED

Since the constraints for (lower bound ILP) are a subset of those for (exponential ILP), a feasible solution to (exponential ILP) is always a feasible solution to (lower bound ILP). Since the two ILPs have the same objective function, $\mathrm{CCTED}(G_1, G_2) \le \mathrm{GTED}(G_1, G_2)$ for any pair of graphs. Moreover, when the solution to (lower bound ILP) forms only one connected component, the optimal value of (lower bound ILP) is equal to GTED.

Theorem 5. Let $\mathcal{A}'(G_1, G_2)$ be the subgraph of $\mathcal{A}(G_1, G_2)$ induced by edges $(u, v) \in E$ with $x^{opt}_{uv} = 1$ in the optimal solution to (lower bound ILP). There exists $\mathcal{A}'(G_1, G_2)$ that has exactly one connected component if and only if $c^{opt} = \mathrm{GTED}(G_1, G_2)$.

Proof. We first show that if $c^{opt} = \mathrm{GTED}(G_1, G_2)$, then there exists $\mathcal{A}'(G_1, G_2)$ that has one connected component. A feasible solution to (exponential ILP) is always a feasible solution to (lower bound ILP), and since $c^{opt} = \mathrm{GTED}(G_1, G_2)$, an optimal solution to (exponential ILP) is also an optimal solution to (lower bound ILP), which can induce a subgraph in the alignment graph that contains only one connected component.

Conversely, if $x^{opt}$ induces a subgraph in the alignment graph with only one connected component, it satisfies constraints (18)–(21) and therefore is feasible to the ILP for GTED (exponential ILP). Since $c^{opt} \le \mathrm{GTED}(G_1, G_2)$, this solution must also be optimal for $\mathrm{GTED}(G_1, G_2)$. □

In practice, we may estimate GTED approximately by the solution to (lower bound ILP). As we show in Section 6, the time needed to solve (lower bound ILP) is much less than the time needed to solve GTED. However, in adversarial cases, $c^{opt}$ could be zero but GTED could be arbitrarily large. We can determine if $c^{opt}$ is a lower bound on GTED or exactly equal to GTED by checking if the subgraph induced by the solution to (lower bound ILP) has multiple connected components.
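The connectivity check described above can be sketched with a union-find structure. The sketch assumes the solver's solution has already been reduced to a list of selected alignment-graph edges (pairs of hashable nodes); the function names are hypothetical. It counts weakly connected components, which suffices when the selected edges satisfy flow conservation.

```python
def num_connected_components(selected_edges):
    """Count weakly connected components of the subgraph induced by edges."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for u, v in selected_edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    return len({find(a) for a in list(parent)})


def certifies_gted(selected_edges):
    """True if this solution induces a single component, so c_opt equals GTED.

    False is inconclusive: another optimal solution of the lower bound ILP
    might still induce a single component (see Theorem 5).
    """
    return num_connected_components(selected_edges) == 1
```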

5.3. NP-completeness of CCTED

We prove that the CCTED problem (Problem 3) is NP-complete by reducing from the Eulerian Trail Equaling Word problem [7].

Theorem 6. Computing CCTED is NP-complete.

Proof. Let the Eulerian graph $G = (V, E, \ell, \Sigma)$ and the string $s$ be an instance of the Eulerian Trail Equaling Word problem. Construct two graphs, $G_1$ and $G_2$. If $G$ contains open Eulerian trails, add an edge directed from the sink of the graph to the source of the graph. Let the label of the added edge be #, a character that does not appear in $\Sigma$. Let the modified graph be $G_1$. If $G$ contains closed Eulerian trails, let $G_1$ be the same as $G$. Let $G_2$ be a graph that contains one cycle with $|E_1|$ edges, where $E_1$ is the edge set of $G_1$. Assign labels to the edges in $G_2$ such that the cycle in $G_2$ spells $s$ if $G$ contains closed Eulerian trails, and $s\#$ otherwise.

If $\mathrm{CCTED}(G_1, G_2) = 0$, $G_2$ must contain at least one closed Eulerian trail that spells some circular permutation of $s\#$. If CCTED is not zero, it means that $s$ must not match Eulerian trails in $G$. □

6. Empirical evaluation of the ILP formulations for GTED and its lower bound

6.1. Implementation of the ILP formulations

We implement the algorithms and ILP formulations for (exponential ILP), (compact ILP) and (lower bound ILP). In practice, the multi-set of edges of each input graph may contain many duplicates of edges that have the same start and end vertices due to repeats in the strings. We reduce the number of variables and constraints in the implemented ILPs by merging the edges that share the same start and end nodes and recording the multiplicity of each edge. Each $x$ variable is then no longer binary but a non-negative integer that satisfies the modified projection constraints (13):

$$\sum_{(u,v) \in E} x_{uv}\, I_i((u,v), f) = M_i(f) \quad \text{for all } (u,v) \in E,\ f \in G_i,\ u \ne s,\ v \ne t, \tag{34}$$

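where $M_i(f)$ is the multiplicity of edge $f$ in $G_i$. Let $C$ be a strongly connected component in the subgraph induced by positive $x_{uv}$. Now $\sum_{(u,v) \in E(C)} x_{uv}$ is no longer upper bounded by $|E(C)|$. Therefore, constraint (19) is changed to

$$\sum_{(u,v) \in E(C)} x_{uv} - |E(C)| + 1 - W(C)\,\beta_C \le 0 \quad \text{for all } C \in \mathcal{C}, \qquad W(C) = \sum_{(u,v) \in E(C)} \max\Big\{ \sum_{f \in G_1} M_1(f)\, I_1((u,v), f),\ \sum_{f \in G_2} M_2(f)\, I_2((u,v), f) \Big\}, \tag{35}$$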

where W(C) is the maximum total multiplicities of edges in the strongly connected subgraph in each input graph that is projected from C.

Likewise, constraint (27), which sets the upper bounds on the ordering variables, also needs to be modified, as the ordering variable $d_{uv}$ for each edge no longer represents the order of one edge but the sum of the orders of the copies of $(u, v)$ that are selected, which is at most $|E|^2$. Therefore, constraint (27) is changed to

duv|E|21yuv0. (36)

The rest of the constraints remain unchanged.
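The edge-merging step can be illustrated with a counter over edges. This is a sketch only: the `(start, end, label)` tuple representation and the function name are assumptions made here, not the authors' data structures.

```python
from collections import Counter


def merge_parallel_edges(edges):
    """Collapse duplicate edges into one entry per distinct edge.

    `edges` is an iterable of (start, end, label) tuples. The returned
    Counter maps each distinct edge f to its multiplicity M(f), which is
    the right-hand side used by the modified projection constraints.
    """
    return Counter(edges)


# A repeat in the underlying string yields the same edge twice.
example = [("AC", "CA", "A"), ("CA", "AC", "C"), ("AC", "CA", "A")]
multiplicity = merge_parallel_edges(example)
```

After merging, the number of ILP variables depends on the number of distinct edges rather than on the total edge count.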

We ran all our experiments on a server with 48 cores (96 threads) of Intel(R) Xeon(R) CPU E5–2690 v3 @ 2.60GHz and 378 GB of memory. The system was running Ubuntu 18.04 with Linux kernel 4.15.0. We solve all the ILP formulations and their linear relaxations using the Gurobi solver [20] using 32 threads.

6.2. GTED on simulated TCR sequences

We construct 20 de Bruijn graphs with $k = 4$ using 150-character sequences extracted from the V genes from the IMGT database [21]. We solve (compact ILP), (exponential ILP) and (lower bound ILP) and their linear relaxations on all 190 pairs of graphs. We do not show results for solving (compact ILP) for GTED on this set of graphs as the running time exceeds 30 minutes on most pairs of graphs.
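For readers who want to reproduce the input construction, a de Bruijn edge multi-set can be sketched as below. The labeling convention (each edge carries the last character of its k-mer) and the function name are assumptions for illustration; the text does not specify how the authors label edges.

```python
def de_bruijn_edges(seq, k=4):
    """Return the edge multi-set of the order-k de Bruijn graph of `seq`.

    Each k-mer contributes one edge from its (k-1)-mer prefix to its
    (k-1)-mer suffix. The label is the last character of the k-mer (an
    assumed convention). Traversing the edges in the returned order is an
    Eulerian trail of the resulting multigraph.
    """
    edges = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        edges.append((kmer[:-1], kmer[1:], kmer[-1]))
    return edges
```

Under this construction, a 150-character sequence with $k = 4$ yields 147 edges, counting repeated edges separately.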

To compare the time to solve the ILP formulations when GTED is equal to the optimal objective of (lower bound ILP), we only include 168 out of 190 pairs where GTED is equal to the lower bound (GTED is slightly higher than the lower bound in the remaining 22 pairs). On average, it takes 26 seconds wall-clock time to solve (lower bound ILP), and 71 seconds to solve (exponential ILP) using the iterative algorithm. On average, it takes 9 seconds to solve the LP relaxation of (compact ILP) and 1 second to solve the LP relaxation of (lower bound ILP). The time to construct the alignment graph for all pairs is less than 0.2 seconds. The distribution of wall-clock running time is shown in Figure 5(a). The time to solve (exponential ILP) and (lower bound ILP) is generally positively correlated with the GTED values (Figure 5(b)). On average, it takes 7 iterations for the iterative algorithm to find the optimal solution that induces one strongly connected subgraph (Figure 5(c)).

Figure 5:

(a) The distribution of wall-clock running time for constructing alignment graphs, solving the ILP formulations for GTED and its lower bound, and their linear relaxations on the log scale. (b) The relationship between the time to solve (lower bound ILP), (exponential ILP) iteratively and GTED. (c) The distribution of the number of iterations to solve (exponential ILP). The box plots in each plot show the median (middle line), the first and third quartiles (lower and upper boundaries of the box), the range of data within 1.5 times the interquartile range between Q1 and Q3 (whiskers), and the outlier data points.

In summary, it is fastest to compute the lower bound of GTED. Computing GTED exactly by solving the proposed ILPs on genome graphs of size 150 is already time consuming. When the sizes of the genome graphs are fixed, the time to solve for GTED and its lower bound increases as GTED between the two genome graphs increases. In the case where GTED is equal to its lower bound, the subgraph induced by some optimal solutions of (lower bound ILP) contains more than one strongly connected component. Therefore, in order to reconstruct the strings from each input graph that have the smallest edit distance, we generally need to obtain the optimal solution to the ILP for GTED. In all cases, the time to solve the (exponential ILP) is less than the time to solve the (compact ILP).

6.3. GTED on difficult cases

Repeats, such as segmental duplications and translocations [22, 23], in the genomes increase the complexity of genome comparisons. We simulate such structures with a class of graphs that contain $n$ simple cycles, of which $n - 1$ peripheral cycles are attached to the $n$-th central cycle at either a node or a set of edges (Figure 6(a)). The input graphs in Figure 2 belong to this class of graphs and contain 2 cycles. This class of graphs simulates the complex structural variants in disease genomes or the differences between genomes of different species.

Figure 6:

(a) An example of a 3-cycle graph. Cycle 1 and 2 are attached to cycle 3. (b) The distribution of wall-clock time to solve the compact ILP and the iterative (exponential ILP) on 100 pairs of 3-cycle graphs.

We generate pairs of 3-cycle graphs with varying sizes and randomly assign letters from {A,T,C,G} to edges. We compute the lower bound of GTED and GTED using (lower bound ILP) and (compact ILP), respectively. We denote the lower bound of GTED computed by solving (lower bound ILP) as $\mathrm{GTED}_l$. We group the generated 3-cycle graph pairs based on the value of $\mathrm{GTED} - \mathrm{GTED}_l$ and select 20 pairs of graphs randomly for each value of $\mathrm{GTED} - \mathrm{GTED}_l$ ranging from 1 to 5. The maximum number of edges in all selected graphs is 32.

We show the difficulty of computing GTED using the iterative algorithm on the 100 selected pairs of 3-cycle graphs. We terminate the ILP solver after 20 minutes. As shown in Figure 6, as the difference between GTED and $\mathrm{GTED}_l$ increases, the wall-clock time to solve (exponential ILP) for GTED increases faster than the time to solve (compact ILP) for GTED. For pairs of graphs with $\mathrm{GTED} - \mathrm{GTED}_l = 5$, on average it takes more than 15 minutes to solve (exponential ILP) with more than 500 iterations. On the other hand, it takes an average of 5 seconds to solve (compact ILP) for GTED and no more than 1 second to solve for the lower bound. The average time to solve each ILP is shown in Table S1. In summary, on the class of 3-cycle graphs introduced above, the difficulty of solving GTED via the iterative algorithm increases rapidly as the gap between GTED and $\mathrm{GTED}_l$ increases. Although (exponential ILP) is solved more quickly than (compact ILP) for GTED when the sequences are long and GTED is equal to $\mathrm{GTED}_l$ (Section 6.2), (compact ILP) may be more efficient when the graphs contain overlapping cycles such that the gap between GTED and $\mathrm{GTED}_l$ is larger.

7. Conclusion

We point out the contradictions in the result on the complexity of labeled graph comparison problems and resolve the contradictions by showing that GTED, as opposed to the results in Ebrahimpour Boroojeny et al. [1], is NP-complete. On one hand, this makes GTED a less attractive measure for comparing graphs since it is unlikely that there is an efficient algorithm to compute the measure. On the other hand, this result better explains the difficulty of finding a truly efficient algorithm for computing GTED exactly. In addition, we show that the previously proposed ILP of GTED [1] does not solve GTED and give two new ILP formulations of GTED.

While the previously proposed ILP of GTED does not solve GTED, it solves for a lower bound of GTED, and we show that this lower bound can be interpreted as a more “local” measure, CCTED, of the distance between labeled graphs. Further, we characterize the LP relaxation of the ILP in (11)–(12) and show that, contrary to the results in Ebrahimpour Boroojeny et al. [1], the LP in (11)–(12) does not always yield optimal integer solutions.

As shown previously [1, 13], it takes more than 4 hours to solve (lower bound ILP) for graphs that represent viral genomes that contain ≈ 3000 bases with a multi-threaded LP solver. Likewise, we show that computing GTED using either (exponential ILP) or (compact ILP) is already slow on small genomes, especially on pairs of simulated genomes that differ due to segmental duplications and translocations. The empirical results show that it is currently impossible to solve GTED or its lower bound directly using this approach for bacterial- or eukaryotic-sized genomes on modern hardware. The results here should increase the theoretical interest in GTED along the directions of heuristics or approximation algorithms as justified by the NP-hardness of finding GTED.

Acknowledgements

The authors would like to thank the members of the Kingsford Group for their helpful comments throughout this project, in particular Guillaume Marçais. The authors thank Marina L Knittel, Jacob M Gilbert, and Cenk Sahinalp for their insightful discussion on the NP-completeness of CCTED. This work was supported in part by the US National Science Foundation [DBI-1937540, III-2232121], the US National Institutes of Health [R01HG012470] and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.

Appendix A. Proofs for the NP-completeness of GTED

A.1. Reduction from ETEW to GTED

We provide below the complete proof for Theorem 1.

Theorem 1. If GTED ∈ P then ETEW ∈ P.

Proof. Let $\langle s, G \rangle$ be an instance of ETEW. Construct a directed acyclic graph (DAG), $C$, that has only one path. Let the path in $C$ be $P = (e_1, \dots, e_{|s|})$ and the edge label of $e_i$ be $s[i]$. Clearly, $C$ is a unidirectional, edge-labeled Eulerian graph, $P$ is the only Eulerian trail in $C$, and $\mathrm{str}(P) = s$.

For the graph $G = (V_G, E_G, \ell_G, \Sigma)$ from the ETEW instance, which may not be unidirectional, create another graph $G'$ that contains all of the nodes and edges in $G$ except the anti-parallel edges. Let $\Sigma_{G'} = \Sigma \cup \{\epsilon\}$, where $\epsilon$ is a character that is not in $\Sigma$. For each pair of anti-parallel edges $(u, v)$ and $(v, u)$ in $G$, add four edges $(u, w_1), (w_1, v), (v, w_2), (w_2, u)$ by introducing new vertices $w_1, w_2$ to $G'$. Let $\ell_{G'}(u, w_1) = \ell_G(u, v)$ and $\ell_{G'}(w_2, u) = \ell_G(v, u)$. Let $\ell_{G'}(w_1, v) = \ell_{G'}(v, w_2) = \epsilon$ for every newly introduced vertex. $G'$ has at most twice the number of edges as $G$ and is Eulerian and unidirectional.

Define the cost of changing a character from $a$ to $b$, $\mathrm{cost}(a, b)$ for $a, b \in \Sigma \cup \{-\}$, to be 0 if $a = b$ and 1 otherwise. “−” is the gap character indicating an insertion or a deletion. Define $\mathrm{cost}(a, \epsilon)$ with $a \in \Sigma$ to be 1. Define $\mathrm{cost}(-, \epsilon)$ to be 0.

Use the (assumed) polynomial-time algorithm for GTED to ask whether $\mathrm{GTED}(C, G') \le 0$ under the edit costs defined above. If yes, then let $(s_1, s_2)$ be the 0-cost alignment of the strings spelled out by the trails in $C$ and $G'$, respectively. The non-gap characters of $s_1$ must spell out $s$ since there is only one Eulerian trail in $C$. Because the alignment cost is 0, any − (gap) characters in $s_1$ must be aligned with $\epsilon$ characters in $s_2$ and any non-gap characters in $s_1$ must be aligned to the same character in $s_2$. The trail in $G'$ that spells $s_2$ can be transformed to a trail in $G$ that spells $s_3$ by collapsing the edges with $\epsilon$ character labels, and $s_3 = s_1$.

If $\mathrm{GTED}(C, G') > 0$, $G$ must not contain an Eulerian trail that spells $s$. Otherwise, such a trail could be extended to a trail in $G'$ by introducing some $\epsilon$ characters, and that trail could be aligned to $s$ with zero cost by aligning gaps with $\epsilon$ characters.

Hence, an (assumed) polynomial-time algorithm for GTED solves ETEW in polynomial time. □
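The edge-splitting construction used in this proof can be sketched as follows. The sketch assumes a directed graph without parallel edges, stored as a dictionary from `(u, v)` to its label; the function name, the `EPS` marker and the naming of the new vertices are hypothetical.

```python
EPS = "ε"  # hypothetical marker for the extra character not in the alphabet


def split_antiparallel(edges):
    """Make a labeled directed graph unidirectional by splitting edge pairs.

    `edges` maps (u, v) to a label. For each anti-parallel pair (u, v) and
    (v, u), new vertices w1 and w2 are introduced, giving the four edges
    (u, w1), (w1, v), (v, w2), (w2, u). The edge (u, w1) keeps the label of
    (u, v), the edge (w2, u) keeps the label of (v, u), and the other two
    edges are labeled EPS.
    """
    result, handled = {}, set()
    for (u, v), label in edges.items():
        if u != v and (v, u) in edges:
            pair = frozenset((u, v))
            if pair in handled:
                continue
            handled.add(pair)
            w1, w2 = ("w1", u, v), ("w2", u, v)
            result[(u, w1)] = edges[(u, v)]
            result[(w1, v)] = EPS
            result[(v, w2)] = EPS
            result[(w2, u)] = edges[(v, u)]
        else:
            result[(u, v)] = label
    return result
```

Every split pair turns two edges into four, so the output has at most twice as many edges as the input, in line with the bound stated in the proof.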

A.2. Reduction from Hamiltonian Path to GTED

We provide below the complete proof for Theorem 2.

Theorem 2. GTED is NP-complete.

Proof. We reduce from the Hamiltonian Path problem, which asks whether a directed, simple graph $G$ contains a path that visits every vertex exactly once. Here simple means no self-loops or parallel edges. Let $G = (V, E)$ be an instance of Hamiltonian Path, with $n = |V|$ vertices. The reduction is almost identical to that presented in Kupferman and Vardi [7], and from here until noted later in the proof the argument is identical except for the technicalities introduced to force unidirectionality (and another minor change described later). The first step is to construct the Eulerian closure of $G$, which is defined as $G' = (V', E')$ where

$$V' = \{v^{in}, v^{out} : v \in V\} \cup \{w\}, \tag{37}$$

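and $E'$ is the union of the following sets of edges and their labels:

  • $E_1 = \{(v^{in}, v^{out}) : v \in V\}$, labeled a,

  • $E_2 = \{(u^{out}, v^{in}) : (u, v) \in E\}$, labeled b,

  • $E_3 = \{(v^{out}, v^{in}) : v \in V\}$, labeled c,

  • $E_4 = \{(v^{in}, u^{out}) : (u, v) \in E\}$, labeled c,

  • $E_5 = \{(u^{in}, w) : u \in V\}$, labeled c,

  • $E_6 = \{(w, u^{in}) : u \in V\}$, labeled b.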

Since $G'$ is connected and every outgoing edge in $G'$ has a corresponding antiparallel incoming edge, $G'$ is Eulerian. It is not unidirectional, so we further create $G''$ from $G'$ by adding dummy nodes to each pair of antiparallel edges and labelling the length-2 paths so created with x#, where x is the original label of the split edge (a, b, or c) and # is some new symbol (shared between all the new edges). We call these length-2 paths introduced to achieve unidirectionality “split edges”.

We now argue that $G$ has a Hamiltonian path iff $G''$ has an Eulerian trail that spells out

$$q = a\#(b\#a\#)^{n-1}(c\#)^{2n-1}(c\#b\#)^{|E|+1}. \tag{38}$$

If such an Eulerian trail exists, then the trail starts by spelling the string $a\#(b\#a\#)^{n-1}$, which corresponds to a Hamiltonian path in $G$ since it visits exactly $n$ “vertex split edges” (type $E_1$, labeled a#) and each vertex split edge can be used only once (since it is an Eulerian trail). Further, successively visited vertices must be connected by an edge in $G$ since those are the only b# split edges in $G''$ (except those leaving $w$, but $w$ must not be involved in spelling out $a\#(b\#a\#)^{n-1}$, since entering $w$ requires using a split edge labeled c#).

For the other direction, if $G$ has a Hamiltonian path $v_1, \dots, v_n$, then walking that sequence of vertices in $G''$ will spell out $a\#(b\#a\#)^{n-1}$. This path will cover all $E_1$ edges and the $E_2$ edges that are on the Hamiltonian path. Retracing the path so far in reverse will use $2n-1$ split edges labeled c#, consuming the $(c\#)^{2n-1}$ term in $q$ and covering all nodes’ reverse vertex edges $E_3$ (since the path is Hamiltonian). The reverse path also covers the $E_4$ edges corresponding to reverse Hamiltonian path edges. Our Eulerian trail is now “at” node $v_1^{in}$.

What remains is to complete the Eulerian walk covering (a) edges and their antiparallel counterparts corresponding to edges in $G$ that were not used in the Hamiltonian path, and (b) the edges adjacent to node $w$. To do this, define $\mathrm{pred}(v)$ to be the vertices $u$ in $G$ for which edge $(u, v)$ exists and $u$ is not the predecessor of $v$ along the Hamiltonian path. For each $u \in \mathrm{pred}(v_1)$, traverse the split edge labeled c# to $u^{out}$, then traverse the forward split edge labeled b# back to $v_1^{in}$. This results in a string $(c\#b\#)^{|\mathrm{pred}(v_1)|}$. Once the predecessors of $v_1$ are exhausted, traverse the split edge labeled c# from $v_1^{in}$ into node $w$ and then traverse the split edge labeled b# to $v_2^{in}$. This again generates a c#b# string. Repeat the process, covering the edges of $v_2$’s predecessors and returning to $w$ to move to the next node along the Hamiltonian path for each node $v_3, \dots, v_n$. After covering the predecessors of $v_n^{in}$, go to $v_1^{in}$ through the remaining edges in $E_5$ and $E_6$, namely $(v_n^{in}, w)$ and $(w, v_1^{in})$, which completes the Eulerian tour. This covers all the edges of $G''$. The word spelled out in this last section of the Eulerian trail is a sequence of repetitions of c#b#, with one repetition for each edge that is not in the Hamiltonian path ($|E| - n + 1$) and all of the edges in $E_5$ and $E_6$ for entering and leaving each node ($2n$), with a total of $|E| + 1$ repetitions, which is the final $(c\#b\#)^{|E|+1}$ term in $q$.

This ends the slight modification of the proof in Kupferman and Vardi [7], where the differences are (a) the introduction of the # characters and (b) using the exponent $|E| + 1$ for the final part of $q$ instead of $|E| + n + 1$ as in Kupferman and Vardi [7], since we create $w$-edges only to $v^{in}$ vertices. (This second change has no material effect on the proof, but reduces the length of the string that must be matched.)

Now, given an instance $G = (V, E)$ of Hamiltonian Path, with $n = |V|$ vertices, we construct $G''$ as above (obtaining a unidirectional Eulerian graph) and create a graph $C$ that only represents the string $q$. Note that $|\Sigma| = 4$ and that $G''$ and $C$ can be constructed in polynomial time. $\mathrm{GTED}(G'', C) = 0$ if and only if an Eulerian path in $G''$ spells out $q$, since there can be no indels or mismatches. By the above argument, an Eulerian tour that spells out $q$ exists if and only if $G$ has a Hamiltonian path. □
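As a sanity check on the construction, the following sketch builds the Eulerian closure (before edge splitting) and the target word $q$, so that their sizes can be compared. The node encoding (`(v, "in")`, `(v, "out")`, and the string `"w"`) and the function names are assumptions made for illustration.

```python
from collections import Counter


def eulerian_closure(vertices, edges):
    """Labeled edge list (tail, head, label) of the Eulerian closure of G."""
    def v_in(v):
        return (v, "in")

    def v_out(v):
        return (v, "out")

    w = "w"
    closure = []
    closure += [(v_in(v), v_out(v), "a") for v in vertices]   # E1
    closure += [(v_out(u), v_in(v), "b") for u, v in edges]   # E2
    closure += [(v_out(v), v_in(v), "c") for v in vertices]   # E3
    closure += [(v_in(v), v_out(u), "c") for u, v in edges]   # E4
    closure += [(v_in(u), w, "c") for u in vertices]          # E5
    closure += [(w, v_in(u), "b") for u in vertices]          # E6
    return closure


def target_word(n, m):
    """The word q for a graph with n vertices and m edges, as in (38)."""
    return "a#" + "b#a#" * (n - 1) + "c#" * (2 * n - 1) + "c#b#" * (m + 1)


def is_balanced(labeled_edges):
    """True if every node has equal in-degree and out-degree."""
    degree = Counter()
    for tail, head, _ in labeled_edges:
        degree[tail] += 1
        degree[head] -= 1
    return all(d == 0 for d in degree.values())
```

Since every closure edge becomes a length-2 split edge labeled x#, the length of $q$ should be twice the number of closure edges, that is $2(4n + 2|E|)$.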

A.3. FGTED is NP-complete

Problem 4 (Flow Graph Traversal Edit Distance (FGTED) [13]). Given unidirectional, edge-labeled Eulerian graphs $G_1$ and $G_2$, each of which has distinguished source vertices $s_1, s_2$ and sink vertices $t_1, t_2$, compute

$$\mathrm{FGTED}(G_1, G_2) \triangleq \min_{D_1 \in \mathrm{flow}(G_1, s_1, t_1),\, D_2 \in \mathrm{flow}(G_2, s_2, t_2)} \mathrm{emedit}(\mathrm{strset}(D_1), \mathrm{strset}(D_2)), \tag{39}$$

where $\mathrm{flow}(G_i, s_i, t_i)$ is the collection of all possible sets of $s_i$-$t_i$ trail decompositions of the saturating flow from $s_i$ to $t_i$, and $\mathrm{strset}(D)$ is the multi-set of strings constructed from trails in $D$.

Theorem 3. FGTED is NP-complete.

Proof. Let $G = (V, E)$ be an instance of the Hamiltonian Cycle problem. Let $n = |V|$ be the number of vertices in $G$. Construct the Eulerian closure of $G$ and split the anti-parallel edges. Let the new graph be $G' = (V', E')$. Attach a source node $s$ and a sink node $t$ to an arbitrary node $v_1^{in}$ by adding edges $(s, v_1^{in})$ and $(v_1^{in}, t)$ with labels s and t, respectively.

Construct a string $q$, such that

$$q = s\,a\#(b\#a\#)^{n-1}(c\#)^{2n-1}(c\#b\#)^{|E|+1}\,t. \tag{40}$$

Create a graph $Q$ that only contains one path with labels on the edges of the path that spell the string $q$. The union of the set of trails in any flow decomposition of $G'$ is equal to a set of Eulerian trails, $\mathcal{E}$, that start at $s$ and end at $t$. All Eulerian trails in $\mathcal{E}$ are also closed Eulerian trails of $G' \setminus \{s, t\}$ that start and end at $v_1^{in}$.

Using the same line of argument as in the proof of Theorem 2, an Eulerian trail in $G'$ that spells $q$ is equivalent to a Hamiltonian Cycle in $G$. In addition, $\mathrm{FGTED}(Q, G') = 0$ if and only if all Eulerian trails in $\mathcal{E}$ spell out $q$. Therefore, if $\mathrm{FGTED}(Q, G') = 0$, then there is a Hamiltonian Cycle in $G$. Otherwise, there must not exist a Hamiltonian Cycle in $G$. □

Appendix B. Equivalence between two ILPs proposed by Ebrahimpour Boroojeny et al.

The analysis provided by Ebrahimpour Boroojeny et al. [1] states that the LP relaxation of the ILP in (5)–(8) does not always yield integer solutions, but the LP relaxation of the ILP in (11)–(12) always yields integer solutions. This suggests that the two LP relaxations have different feasible regions for $x$. We show in Theorem 7 that these two LP relaxations are actually equivalent. Further, we show that the ILP in (5)–(8) and the ILP in (11)–(12) are also equivalent. Since the ILP in (5)–(8) does not solve for $\mathrm{GTED}(G_1, G_2)$ as shown in Section 3.2, we conclude that the ILP in (11)–(12) also does not solve $\mathrm{GTED}(G_1, G_2)$.

Theorem 7. Given two unidirectional, edge-labeled Eulerian graphs $G_1, G_2$, the feasible region of $x$ in the LP relaxation of the ILP in (11)–(12) is the same as the feasible region of $x$ in the LP relaxation of the ILP in (5)–(8).

Let $\mathcal{A}(G_1, G_2) = (V, E, \delta)$ be the alignment graph of $G_1 = (V_1, E_1, \ell_1, \Sigma_1)$ and $G_2 = (V_2, E_2, \ell_2, \Sigma_2)$, and let $T(G_1, G_2)$ be its two-simplex set. First, we have the following result:

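Lemma 4. Let $y^i \in \mathbb{R}^{|T(G_1, G_2)|}$ be a vector such that the $j$-th entry of $y^i$, $y^i_j$, is equal to 0 for all $j \ne i$. The vector $x' = x + [\partial] y^i$ satisfies the constraints (6)–(7) if the vector $x$ satisfies the constraints (6)–(7).

Proof. Let $\sigma_i \in T(G_1, G_2)$ be the 2-simplex corresponding to the entry $i$ of $y^i$. Based on the construction of $T(G_1, G_2)$, $\sigma_i$ has one of two forms: $[(u_1, u_2), (v_1, u_2), (v_1, v_2)]$ or $[(u_1, u_2), (u_1, v_2), (v_1, v_2)]$. Without loss of generality, we assume $\sigma_i = [(u_1, u_2), (v_1, u_2), (v_1, v_2)]$. The same argument proves the lemma when $\sigma_i = [(u_1, u_2), (u_1, v_2), (v_1, v_2)]$. Since

$$\partial \sigma_i = [(u_1, u_2), (v_1, u_2)] + [(v_1, u_2), (v_1, v_2)] - [(u_1, u_2), (v_1, v_2)],$$

we have

$$[\partial] y^i = y^i_i x^{e_1} + y^i_i x^{e_2} - y^i_i x^{e_3},$$

where $e_1 = [(u_1, u_2), (v_1, u_2)]$, $e_2 = [(v_1, u_2), (v_1, v_2)]$, $e_3 = [(u_1, u_2), (v_1, v_2)]$, and $x^e \in \mathbb{R}^{|E|}$ is a vector such that all the entries are 0 except that the one corresponding to edge $e$ is 1. We also let $x^{\mathbf{v}} \in \mathbb{R}^{|V|}$ be a vector such that all the entries are 0 except that the one corresponding to vertex $\mathbf{v}$ is 1. Therefore, we have

$$A x' = A x + y^i_i x^{\mathbf{v}_2} - y^i_i x^{\mathbf{v}_1} + y^i_i x^{\mathbf{v}_3} - y^i_i x^{\mathbf{v}_2} - y^i_i x^{\mathbf{v}_3} + y^i_i x^{\mathbf{v}_1} = A x,$$

where $\mathbf{v}_1 = (u_1, u_2)$, $\mathbf{v}_2 = (v_1, u_2)$, and $\mathbf{v}_3 = (v_1, v_2)$. Hence, $x'$ satisfies the constraint (6) if $x$ satisfies the constraint (6).

In addition, since $\sum_{e \in E} x'_e I_i(e, f) = \sum_{e \in E} x_e I_i(e, f) + y^i_i I_i(e_1, f) + y^i_i I_i(e_2, f) - y^i_i I_i(e_3, f)$, and: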

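  • $I_1(e_1, (u_1, v_1)) = 1$ and $I_i(e_1, f) = 0$ for other $f \in G_i$,

  • $I_2(e_2, (u_2, v_2)) = 1$ and $I_i(e_2, f) = 0$ for other $f \in G_i$,

  • $I_1(e_3, (u_1, v_1)) = 1$, $I_2(e_3, (u_2, v_2)) = 1$, and $I_i(e_3, f) = 0$ for other $f \in G_i$,

we have:

  • $y^i_i I_1(e_1, (u_1, v_1)) + y^i_i I_1(e_2, (u_1, v_1)) - y^i_i I_1(e_3, (u_1, v_1)) = y^i_i + 0 - y^i_i = 0$,

  • $y^i_i I_2(e_1, (u_2, v_2)) + y^i_i I_2(e_2, (u_2, v_2)) - y^i_i I_2(e_3, (u_2, v_2)) = 0 + y^i_i - y^i_i = 0$,

  • $y^i_i I_i(e_1, f) + y^i_i I_i(e_2, f) - y^i_i I_i(e_3, f) = 0 + 0 - 0 = 0$ for any other $i = 1, 2$ and $f \in E_i$.

Therefore, $\sum_{e \in E} x'_e I_i(e, f) = \sum_{e \in E} x_e I_i(e, f)$, meaning that $x'$ satisfies the constraint (7) if $x$ satisfies the constraint (7). □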
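The two invariants in this proof can be checked numerically on a single 2-simplex. In this sketch, alignment-graph nodes are pairs, edges are pairs of nodes, and $x$ is a dictionary from edges to values; the function names are hypothetical, and the projection assumes the input graphs have no self-loops.

```python
from fractions import Fraction


def boundary_update(x, simplex, y):
    """Return x + [∂]y for one 2-simplex [p, q, r] with coefficient y.

    Uses ∂[p, q, r] = [p, q] + [q, r] - [p, r].
    """
    p, q, r = simplex
    updated = dict(x)
    for edge, sign in (((p, q), 1), ((q, r), 1), ((p, r), -1)):
        updated[edge] = updated.get(edge, 0) + sign * y
    return updated


def net_flow(x):
    """Inflow minus outflow at each node (the quantity fixed by constraint (6))."""
    balance = {}
    for (tail, head), value in x.items():
        balance[head] = balance.get(head, 0) + value
        balance[tail] = balance.get(tail, 0) - value
    return {node: b for node, b in balance.items() if b != 0}


def projection(x, i):
    """Total value projected onto each edge of input graph i (0 or 1).

    An alignment edge projects onto graph i only if it moves in coordinate i.
    """
    proj = {}
    for (tail, head), value in x.items():
        if tail[i] != head[i]:
            f = (tail[i], head[i])
            proj[f] = proj.get(f, 0) + value
    return {f: v for f, v in proj.items() if v != 0}
```

Applying `boundary_update` with a fractional coefficient moves weight between a diagonal edge and the corresponding vertical and horizontal edges while leaving both `net_flow` and `projection` unchanged, which is the content of Lemma 4.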

With Lemma 4, we prove that any feasible solution of $x$ in (11) is a feasible solution of (5)–(8). First, it is easy to check that $x^{init}$ satisfies the constraints (6)–(7). For each feasible solution of $x$ in (11), since $x = x^{init} + [\partial] y = x^{init} + \sum_i [\partial] y^i$, by iteratively using Lemma 4, we get that $x$ satisfies the constraints (6)–(7). Since $x_e \ge 0$ for all $e \in E$ is a constraint existing in both linear relaxations, $x$ is a feasible solution of (5)–(8).

We now show that any feasible solution of (5)–(8) is a feasible solution of (11). Let $x$ be a feasible solution of (5)–(8). We show that $x$ is also a feasible solution of (11) by proving that $x$ can be converted to $x^{init}$ in (11) via the boundary operator $\partial$. First, if there is a diagonal edge $e = [(u_1, u_2), (v_1, v_2)]$ in $E$ such that $x_e > 0$, then it can be replaced by the horizontal edge $e_h = [(u_1, u_2), (u_1, v_2)]$ followed by the vertical edge $e_v = [(u_1, v_2), (v_1, v_2)]$ by using one boundary operation on the 2-simplex $[(u_1, u_2), (u_1, v_2), (v_1, v_2)]$. Hence, $x$ can be converted to a new vector $x'$ such that $x'_e = 0$, $x'_{e_h} = x_{e_h} + x_e$, $x'_{e_v} = x_{e_v} + x_e$, and all the other entries in $x'$ are the same as those in $x$. It is easy to check that $x'$ is also a feasible solution of (5)–(8). Therefore, without loss of generality, we assume $x$ to be a vector such that all the entries corresponding to diagonal edges in $\mathcal{A}(G_1, G_2)$ are zero.

We then prove that any $x$ can be converted to $x^{init}$ in (11) via the boundary operator. Let the source and the sink node of $x$ in $\mathcal{A}(G_1, G_2)$ be $(s_1^1, s_1^2)$ and $(s_2^1, s_2^2)$, where $s_1^i$ is the source node of $G_i$ and $s_2^i$ is the sink node of $G_i$. When the Eulerian trail is closed (meaning that it is an Eulerian tour) in $G_i$, we let $s_1^i = s_2^i$ be an arbitrary vertex in $V_i$. $x^{init}$ can be seen as a trail (tour) in $\mathcal{A}(G_1, G_2)$ that starts from $(s_1^1, s_1^2)$, walks along an Eulerian trail of $G_2$ via all the horizontal edges $P_h$,

$$P_h = [(s_1^1, s_1^2), (s_1^1, v_1^2)],\ [(s_1^1, v_1^2), (s_1^1, v_2^2)],\ \dots,\ [(s_1^1, v_{i-1}^2), (s_1^1, v_i^2)],\ [(s_1^1, v_i^2), (s_1^1, s_2^2)],$$

and then walks along an Eulerian trail of $G_1$ via all the vertical edges $P_v$,

$$P_v = [(s_1^1, s_2^2), (v_1^1, s_2^2)],\ [(v_1^1, s_2^2), (v_2^1, s_2^2)],\ \dots,\ [(v_{j-1}^1, s_2^2), (v_j^1, s_2^2)],\ [(v_j^1, s_2^2), (s_2^1, s_2^2)],$$

until the sink node $(s_2^1, s_2^2)$. Here $s_1^2, v_1^2, v_2^2, \dots, v_{i-1}^2, v_i^2, s_2^2$ is an Eulerian trail of $G_2$ and $s_1^1, v_1^1, v_2^1, \dots, v_{j-1}^1, v_j^1, s_2^1$ is an Eulerian trail of $G_1$. We use $P_0 = (P_h, P_v)$ to denote the trail from $(s_1^1, s_1^2)$ to $(s_2^1, s_2^2)$ that is the concatenation of $P_h$ and $P_v$. It is easy to see that each edge in $P_0$ is unique.

As shown in Qiu and Kingsford [13], $x$ is a flow of $\mathcal{A}(G_1, G_2)$ with the additional constraint (7). Therefore, according to the flow decomposition theorem [24, p. 80], $x$ can be decomposed into a finite set of weighted paths in $\mathcal{A}(G_1, G_2)$ from $(s_1^1, s_1^2)$ to $(s_2^1, s_2^2)$, which is denoted as $\{(p_1, w_1^p), \dots, (p_n, w_n^p)\}$, and a finite set of weighted cycles in $\mathcal{A}(G_1, G_2)$, which is denoted as $\{(c_1, w_1^c), \dots, (c_m, w_m^c)\}$. Each path or cycle only contains horizontal and vertical edges.

For path $i$, we use a vector $x^{p,i}$ to represent $(p_i, w_i^p)$,

$$x^{p,i}_e = \begin{cases} w_i^p & \text{if } e \in p_i \\ 0 & \text{otherwise.} \end{cases} \tag{41}$$

By using the boundary operator, each path $p_i$ can actually be converted to a new trail $p_i'$ such that each edge in $p_i'$ is also an edge in $P_0$. To prove this, we consider the following two cases:

  • If $p_i$ walks along all the horizontal edges followed by all the vertical edges, then every edge in $p_i$ is an edge in $P_0$. To see this, let $e$ be a horizontal edge in $p_i$. Since $p_i$ starts from $(s_1^1, s_1^2)$, $e$ has the form $[(s_1^1, v), (s_1^1, v')]$ where $(v, v') \in E_2$. Since $P_h$ corresponds to the Eulerian trail of $G_2$, for each $(v, v') \in E_2$ we have $[(s_1^1, v), (s_1^1, v')] \in P_h$. Therefore $e \in P_0$. The same argument shows that $e \in P_0$ when $e$ is a vertical edge. Note that in this case, the number of horizontal edges or vertical edges can be zero.

  • If not, then we let $p_i = (e_1^i, e_2^i, \dots, e_m^i)$, and let $e_t^i$ be the vertical edge with the smallest index $t$. There exists an integer $k$ ($k \ge 1$) such that $e_t^i, e_{t+1}^i, \dots, e_{t+k-1}^i$ are all vertical edges and $e_{t+k}^i$ is a horizontal edge. We denote each vertical edge $e_{t+w}^i \in \{e_t^i, e_{t+1}^i, \dots, e_{t+k-1}^i\}$ as $[(v_w, v_t), (v_{w+1}, v_t)]$ and denote $e_{t+k}^i$ as $[(v_k, v_t), (v_k, v_{t+1})]$. It is easy to see that when $w = 0$, $v_w = s_1^1$. By using the boundary operator, the subpath $e_t^i, e_{t+1}^i, \dots, e_{t+k-1}^i, e_{t+k}^i$ can be replaced by another subpath with one horizontal edge $[(s_1^1, v_t), (s_1^1, v_{t+1})]$ followed by $k$ vertical edges:

$$[(s_1^1, v_{t+1}), (v_1, v_{t+1})],\ [(v_1, v_{t+1}), (v_2, v_{t+1})],\ \dots,\ [(v_{k-1}, v_{t+1}), (v_k, v_{t+1})].$$

Now we have a new path, denoted as $p_i^1$, in which the smallest index of the vertical edges becomes $t + 1$. Figure 7(a) shows an example, in which the blue line represents the subpath of $p_i$ and the red line represents the new subpath in $p_i^1$.

Figure 7:

(a) An example of converting three vertical edges followed by one horizontal edge (blue line) to one horizontal edge followed by three vertical edges (red line). It can be done by doing boundary operations on 2-simplices labeled from 0 to 5. (b) An example of a cycle path (red line) and its auxiliary trail (blue line).

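To create a new vector that represents $p_i^1$, we first create a zero vector $y^{p,i,1} \in \mathbb{R}^{|T(G_1,G_2)|}$, and from $w = 0$ to $w = k-1$, we iteratively update $y^{p,i,1}$ via the following equations:

$$y_\sigma^{p,i,1} = \begin{cases} y_\sigma^{p,i,1} - w_i^p & \text{if } \sigma = [(v_w, v_t), (v_{w+1}, v_t), (v_{w+1}, v_{t+1})], \\ y_\sigma^{p,i,1} + w_i^p & \text{if } \sigma = [(v_w, v_t), (v_w, v_{t+1}), (v_{w+1}, v_{t+1})], \\ 0 & \text{otherwise.} \end{cases} \tag{42}$$

The vector $x^{p,i,1} = x^{p,i} + [\partial] y^{p,i,1}$ is the one that represents $p_i^1$.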
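The effect of one update step in (42) can be checked on a single grid cell. The following Python sketch is illustrative only: the node labels `a`, `b`, `c`, `d`, the helper names `boundary` and `add_boundary`, and the convention that the boundary of an oriented 2-simplex $[a, b, c]$ is $[b, c] - [a, c] + [a, b]$ are assumptions made for this example, not definitions taken from the text.

```python
from collections import defaultdict


def boundary(simplex):
    """Signed boundary of an oriented 2-simplex [a, b, c]: [b,c] - [a,c] + [a,b]."""
    a, b, c = simplex
    return {(b, c): 1, (a, c): -1, (a, b): 1}


def add_boundary(x, simplex, coeff):
    """Return x + coeff * boundary(simplex), dropping entries that become zero."""
    out = defaultdict(int, x)
    for edge, sign in boundary(simplex).items():
        out[edge] += coeff * sign
    return {edge: value for edge, value in out.items() if value != 0}


# Hypothetical node labels: (G1 node index, G2 node index) in the alignment graph.
a, b, c, d = (0, 0), (1, 0), (1, 1), (0, 1)
w = 1  # weight of the path being rerouted

# Vertical edge a->b followed by horizontal edge b->c.
x = {(a, b): w, (b, c): w}

# Subtract the triangle below the diagonal: the flow moves onto the diagonal a->c.
x = add_boundary(x, (a, b, c), -w)
# Add the triangle above the diagonal: the flow moves to horizontal a->d, vertical d->c.
x = add_boundary(x, (a, d, c), +w)
print(x)
```

Under these assumptions the vertical-then-horizontal pair is replaced by a horizontal-then-vertical pair with the same weight, which is the local move that the proof repeats along the vertical run.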

Since the length of $p_i$ is finite, by applying this transformation a finite number of times, we can convert $p_i$ to a new path $p_i'$ that walks along all the horizontal edges first, followed by all the vertical edges; therefore each edge in $p_i'$ is also an edge in $P_0$. We use the vector $\hat{x}^{p,i}$ to represent $p_i'$: $\hat{x}^{p,i} = x^{p,i} + [\partial] \sum_{j=1}^{q} y^{p,i,j}$, where $q$ is the number of transformations. Clearly, $\hat{x}_e^{p,i} = 0$ when $e \notin P_0$. Letting $y^{p,i} = \sum_{j=1}^{q} y^{p,i,j}$, we have $\hat{x}^{p,i} = x^{p,i} + [\partial] y^{p,i}$.
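The termination argument can be illustrated with a small Python sketch. It models a path only as a string of `H` (horizontal) and `V` (vertical) moves, which is a simplifying assumption of this example; the function name `normalize` is hypothetical. Each loop iteration performs one transformation: it finds the first vertical run and moves the horizontal edge that follows it to the front of the run.

```python
def normalize(path):
    """Rewrite V^k H as H V^k until the path has the form H...HV...V.

    Returns the normalized path and the number of transformations applied.
    """
    path = list(path)
    steps = 0
    while "V" in path:
        t = path.index("V")  # smallest index of a vertical edge
        k = 0
        while t + k < len(path) and path[t + k] == "V":
            k += 1  # length of the vertical run starting at t
        if t + k == len(path):
            break  # no horizontal edge follows the run: already normalized
        path[t:t + k + 1] = ["H"] + ["V"] * k  # one transformation
        steps += 1
    return "".join(path), steps


print(normalize("VVHVH"))
```

In this model the number of transformations equals the number of horizontal moves that appear after the first vertical move, so it is bounded by the length of the path.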

For cycle $i$, we also use a vector $x^{c,i}$ to represent $(c_i, w_i^c)$:

$$x_e^{c,i} = \begin{cases} w_i^c & \text{if } e \in c_i, \\ 0 & \text{otherwise.} \end{cases} \tag{43}$$

Let $(v, v')$ be an arbitrarily chosen node in $c_i$. We construct a trail $p_{\mathrm{aux}}^i$ that passes through $(v, v')$ as follows:

  • From $(s_1^1, s_1^2)$, walk along $P_h$ until the node $(s_1^1, v')$. This corresponds to a part of an Eulerian trail of $G_2$.

  • From $(s_1^1, v')$, walk along an Eulerian trail of $G_1$ to $(s_2^1, v')$. This walk must pass through the node $(v, v')$.

  • From $(s_2^1, v')$, walk along the remaining part of the Eulerian trail of $G_2$ to the node $(s_2^1, s_2^2)$.

Figure 7(b) shows an example, in which the blue line represents $p_{\mathrm{aux}}^i$ and the red line represents $c_i$.

We use $x^{\mathrm{aux},i}$ to denote the vector representing $p_{\mathrm{aux}}^i$. The combination of $c_i$ and $p_{\mathrm{aux}}^i$, represented by the vector $x^{c,i} + x^{\mathrm{aux},i}$, creates a new trail (which may have repeated edges) from $(s_1^1, s_1^2)$ to $(s_2^1, s_2^2)$: (1) walk along $p_{\mathrm{aux}}^i$ from $(s_1^1, s_1^2)$ to $(v, v')$, (2) walk along $c_i$ from $(v, v')$ to itself, and (3) walk along the remaining part of $p_{\mathrm{aux}}^i$ from $(v, v')$ to $(s_2^1, s_2^2)$. In the same way as described above, each $c_i + p_{\mathrm{aux}}^i$ or $p_{\mathrm{aux}}^i$ can be converted to a new trail in which each edge is also an edge in $P_0$. We use $\hat{x}^{c,i}$ or $\hat{x}^{\mathrm{aux},i}$ to represent the new trail accordingly; therefore, we have $\hat{x}^{c,i} = x^{c,i} + x^{\mathrm{aux},i} + [\partial] y^{c,i}$ and $\hat{x}^{\mathrm{aux},i} = x^{\mathrm{aux},i} + [\partial] y^{\mathrm{aux},i}$. Likewise, $\hat{x}_e^{c,i} = \hat{x}_e^{\mathrm{aux},i} = 0$ when $e \notin P_0$.

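We define a new vector $\hat{x}$ such that:

$$\begin{aligned} \hat{x} &= \sum_{i=1}^{n} \hat{x}^{p,i} + \sum_{j=1}^{m} \left( \hat{x}^{c,j} - \hat{x}^{\mathrm{aux},j} \right) \\ &= \sum_{i=1}^{n} \left( x^{p,i} + [\partial] y^{p,i} \right) + \sum_{j=1}^{m} \left( x^{c,j} + x^{\mathrm{aux},j} + [\partial] y^{c,j} \right) - \sum_{j=1}^{m} \left( x^{\mathrm{aux},j} + [\partial] y^{\mathrm{aux},j} \right) \\ &= \sum_{i=1}^{n} x^{p,i} + \sum_{j=1}^{m} x^{c,j} + [\partial] \left( \sum_{i=1}^{n} y^{p,i} + \sum_{j=1}^{m} y^{c,j} - \sum_{j=1}^{m} y^{\mathrm{aux},j} \right) \\ &= x + [\partial] \left( \sum_{i=1}^{n} y^{p,i} + \sum_{j=1}^{m} y^{c,j} - \sum_{j=1}^{m} y^{\mathrm{aux},j} \right). \end{aligned}$$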

Therefore, $\hat{x}$ is a vector converted from $x$ via boundary operations. $\hat{x}$ is equal to $x^{\mathrm{init}}$ because:

  1. $\hat{x}_e = 0$ when $e \notin P_0$, since $\hat{x}_e^{p,i} = \hat{x}_e^{c,i} = \hat{x}_e^{\mathrm{aux},i} = 0$ when $e \notin P_0$ for each $i$.

  2. As we have proved above, the boundary operator preserves the constraints (6)–(7). Therefore, $\hat{x}$ satisfies the constraints (6)–(7) since $x$ is a feasible solution of (5)–(8). Combined with the first point, we have that $\hat{x}_e = 1$ if $e \in P_0$ and $\hat{x}_e = 0$ otherwise, meaning that $\hat{x} = x^{\mathrm{init}}$.

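Hence, for each feasible solution $x$ of (5)–(8), we have:

$$x = x^{\mathrm{init}} - [\partial] \left( \sum_{i=1}^{n} y^{p,i} + \sum_{j=1}^{m} y^{c,j} - \sum_{j=1}^{m} y^{\mathrm{aux},j} \right) = x^{\mathrm{init}} + [\partial] \left( -\sum_{i=1}^{n} y^{p,i} - \sum_{j=1}^{m} y^{c,j} + \sum_{j=1}^{m} y^{\mathrm{aux},j} \right),$$

meaning that $x$ is also a feasible solution of (11).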

We proved that the feasibility region of x in (11) is the same as the feasibility region of x in (5)–(8), and since the objective functions of these two linear relaxations are the same, the optimal solutions of them are equal.

By employing the same approach and taking into account that if all edge weights in a flow network are non-negative integers, the flow decomposition theorem guarantees that the network can be decomposed into a finite set of weighted paths and cycles, each with positive integer weight, we can prove that the ILP in (5)–(8) and the ILP in (11)–(12) are also equivalent.

Based on the proof, we can conclude that the way the vertices or edges in the alignment graph, or the 2-simplices in $T(G_1,G_2)$, are indexed does not affect the equivalence result. Additionally, different choices of orientations for the 2-simplices in $T(G_1,G_2)$ also do not affect the equivalence result. This is because for any two sets $T(G_1,G_2)$ and $T'(G_1,G_2)$ containing the same 2-simplices with the same indices but different orientations, if $(x, y)$ is a feasible solution of the ILP in (11)–(12) (or its relaxation) that corresponds to $T(G_1,G_2)$, then $(x, y')$ is a feasible solution of the ILP in (11)–(12) (or its relaxation) that corresponds to $T'(G_1,G_2)$, where $y'_i = y_i$ when $\sigma_i \in T(G_1,G_2)$ has the same orientation as $\sigma'_i \in T'(G_1,G_2)$, and $y'_i = -y_i$ when $\sigma_i \in T(G_1,G_2)$ has the opposite orientation to $\sigma'_i \in T'(G_1,G_2)$. Therefore, it is acceptable to specify a particular orientation for each 2-simplex when defining $T(G_1,G_2)$.

Appendix C. The linear relaxation of the ILP in (11)–(12) does not always yield integer solutions

C.1. $[\partial]$ is not necessarily totally unimodular

A linear programming formulation with an integral right-hand side always yields integer solutions if its constraint matrix is totally unimodular, which means that every square submatrix has a determinant of 0, −1 or 1 [25]. To show that the constraint matrix of the LP relaxation of the ILP in (11)–(12) is not totally unimodular, we first write the LP in standard form.
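The definition of total unimodularity can be tested directly on small matrices. The Python sketch below is a brute-force check that enumerates every square submatrix, so it is only practical for toy inputs; the function names `det` and `is_totally_unimodular` and both example matrices are hypothetical and not taken from the paper.

```python
from itertools import combinations


def det(m):
    """Integer determinant by Laplace expansion along the first row (small inputs only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        if entry:
            minor = [row[:j] + row[j + 1:] for row in m[1:]]
            total += (-1) ** j * entry * det(minor)
    return total


def is_totally_unimodular(a):
    """Check that every square submatrix of a has determinant in {-1, 0, 1}."""
    rows, cols = len(a), len(a[0])
    for size in range(1, min(rows, cols) + 1):
        for r in combinations(range(rows), size):
            for c in combinations(range(cols), size):
                sub = [[a[i][j] for j in c] for i in r]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


# Vertex-edge incidence matrix of a directed 3-cycle: totally unimodular.
directed_cycle = [[1, 0, -1], [-1, 1, 0], [0, -1, 1]]
# Unsigned incidence matrix of an odd cycle: its determinant is 2.
odd_cycle = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]

print(is_totally_unimodular(directed_cycle), is_totally_unimodular(odd_cycle))
```

The second matrix shows the typical obstruction: a single square submatrix with determinant ±2 is enough to rule out total unimodularity.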

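In the standard form of an LP, all variables are greater than or equal to 0. Since the vector $y$ in the LP relaxation of the ILP in (11)–(12) can contain negative entries, we decompose it as $y = y^+ - y^-$. Given the alignment graph $\mathcal{A}(G_1,G_2) = (V, E, \delta)$ and $T(G_1,G_2)$, we can now write the standard form of the LP in (11)–(12) as

$$\begin{aligned} \underset{x \in \mathbb{R}^{|E|},\ y^+, y^- \in \mathbb{R}^{|T(G_1,G_2)|}}{\text{minimize}} \quad & \sum_{e \in E} x_e \delta(e) \\ \text{subject to} \quad & \big[ I, -[\partial], [\partial] \big] \begin{bmatrix} x \\ y^+ \\ y^- \end{bmatrix} = x^{\mathrm{init}}, \\ & x, y^+, y^- \geq 0. \end{aligned} \tag{44}$$

Hence the constraint matrix of the LP relaxation is $A = [I, -[\partial], [\partial]]$. According to the characteristics of a totally unimodular matrix [26, p. 280], $A$ is not totally unimodular if $[\partial]$ is not totally unimodular. We show that $[\partial]$ is not TU when the input graphs satisfy the conditions given in the following theorem.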
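The change of variables behind the standard form can be sanity-checked numerically. The sketch below assumes that constraint (11) reads $x = x^{\mathrm{init}} + [\partial] y$, so that the block matrix is $[I, -[\partial], [\partial]]$; the toy matrix `B`, the vectors, and the helper names are hypothetical and chosen only for illustration.

```python
def split_signs(y):
    """Write y as y_plus - y_minus with both parts non-negative."""
    return [max(v, 0) for v in y], [max(-v, 0) for v in y]


def standard_form_matrix(b):
    """Build the block matrix [I, -B, B] for a matrix B given as a list of rows."""
    n = len(b)
    identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    return [identity[i] + [-v for v in b[i]] + list(b[i]) for i in range(n)]


def matvec(a, v):
    """Multiply matrix a (list of rows) by vector v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in a]


# Hypothetical toy data standing in for the boundary matrix and x_init.
B = [[1, 0], [-1, 1], [0, -1]]
x_init = [1, 1, 0]
y = [-1, 0]  # may contain negative entries

# x = x_init + B y
x = [x_init[i] + sum(B[i][j] * y[j] for j in range(len(y))) for i in range(len(B))]
y_plus, y_minus = split_signs(y)

# [I, -B, B] (x, y_plus, y_minus) should reproduce x_init.
lhs = matvec(standard_form_matrix(B), x + y_plus + y_minus)
print(x, y_plus, y_minus, lhs)
```

All three variable blocks are non-negative in this example, and the equality constraint of the standard form holds exactly.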

Theorem 8. Given two unidirectional, edge-labeled Eulerian graphs $G_1$ and $G_2$ where $|E_1| \geq 2$ and $|E_2| \geq 2$, the boundary matrix $[\partial]$ constructed from $\mathcal{A}(G_1,G_2) = (V, E, \delta)$ and $T(G_1,G_2)$ is not totally unimodular if there is a vertex $v \in V_1$ or $V_2$ such that at least 3 unique edges in $E_1$ or $E_2$ are incident to $v$. Here, unique edges are edges that connect to $v$ at one end but have different endpoints at the other end.

Proof. To prove that the boundary matrix is not TU, we only need to show that it is not TU under one specific choice of orientations for the 1- and 2-simplices, as well as one specific choice of indices for the 1- and 2-simplices. This is because changing the orientations or indices of 1-simplices in $E$ or 2-simplices in $T(G_1,G_2)$ corresponds to permuting rows and columns of $[\partial]$ or multiplying rows and columns of $[\partial]$ by −1, which preserves total unimodularity [26, p. 280].

Without loss of generality, let $v_0 \in V_1$ be a node that is incident to at least 3 unique edges. Since $G_1$ is an Eulerian graph, $v_0$ must be part of a cycle $C$ in $G_1$. Also, there must exist another node $v_k$ and an edge between $v_0$ and $v_k$ in either direction, such that the edge between $v_0$ and $v_k$ is not contained in cycle $C$ (Figure 8(a)). Suppose the number of nodes in the cycle is $k$ ($k \geq 3$ due to the unidirectionality constraint), and let the cycle be $C = (v_0, v_1, \dots, v_{k-1})$. Since a specific choice of 1-simplex orientations does not affect the total unimodularity of the boundary matrix, we assume the edge between $v_0$ and $v_k$ is $(v_k, v_0)$ without loss of generality. We use $G_1^{\mathrm{sub}} = (V_1^{\mathrm{sub}}, E_1^{\mathrm{sub}})$ to denote the subgraph with $V_1^{\mathrm{sub}} = \{v_0, \dots, v_{k-1}, v_k\}$ and $E_1^{\mathrm{sub}} = \{(v_i, v_{i+1}) : i \in \{0, 1, \dots, k-2\}\} \cup \{(v_k, v_0)\}$. Since $|E_2| \geq 2$ and $G_2$ is a connected graph, there exist two consecutive, directed edges in $G_2$. We use $G_2^{\mathrm{sub}} = (V_2^{\mathrm{sub}}, E_2^{\mathrm{sub}})$ to denote the subgraph of $G_2$ with $V_2^{\mathrm{sub}} = \{v_a, v_b, v_c\}$ and $E_2^{\mathrm{sub}} = \{(v_a, v_b), (v_b, v_c)\}$. The alignment graph $\mathcal{A}(G_1^{\mathrm{sub}}, G_2^{\mathrm{sub}})$ is formed from $G_1^{\mathrm{sub}}$ and $G_2^{\mathrm{sub}}$ and is a subgraph of $\mathcal{A}(G_1,G_2)$; therefore, each subgraph of $\mathcal{A}(G_1^{\mathrm{sub}}, G_2^{\mathrm{sub}})$ is also a subgraph of $\mathcal{A}(G_1,G_2)$. Similarly, the 2-simplex set $T(G_1^{\mathrm{sub}}, G_2^{\mathrm{sub}})$ is a subset of $T(G_1,G_2)$.

Figure 8:

(a) Subgraphs $G_1^{\mathrm{sub}}$ and $G_2^{\mathrm{sub}}$ of input graphs $G_1$ and $G_2$. Dots represent a path from node 1 to node $k-1$ with middle nodes omitted. (b) The alignment graph $\mathcal{A}(G_1^{\mathrm{sub}}, G_2^{\mathrm{sub}})$ with different edges labeled with colors. (c) A subgraph of the alignment graph in (b) with edges and triangles numbered. Dots represent horizontal and diagonal edges omitted. The same vertices that are repeated in (c) are marked with yellow and red filling colors.

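We extract a sequence of 2-simplices (Figure 8(c)), $T_c$, from $T(G_1^{\mathrm{sub}}, G_2^{\mathrm{sub}})$ via the following steps:

  1. Extract all oriented 2-simplices $[(v_i, v_a), (v_i, v_b), (v_{i+1}, v_b)]$ and $[(v_i, v_a), (v_{i+1}, v_a), (v_{i+1}, v_b)]$ for $0 \leq i \leq k-2$ from $T(G_1^{\mathrm{sub}}, G_2^{\mathrm{sub}})$. Flip the orientations of $[(v_i, v_a), (v_{i+1}, v_a), (v_{i+1}, v_b)]$ for all $0 \leq i \leq k-2$, obtaining $[(v_i, v_a), (v_{i+1}, v_b), (v_{i+1}, v_a)]$. Use $\sigma_{2i}$ to denote $[(v_i, v_a), (v_i, v_b), (v_{i+1}, v_b)]$, and $\sigma_{2i+1}$ to denote $[(v_i, v_a), (v_{i+1}, v_b), (v_{i+1}, v_a)]$.

  2. Add to the sequence another five oriented 2-simplices from $T(G_1^{\mathrm{sub}}, G_2^{\mathrm{sub}})$ in the order specified: $\sigma_{2k-2} = [(v_{k-1}, v_a), (v_{k-1}, v_b), (v_0, v_b)]$, $\sigma_{2k-1} = [(v_{k-1}, v_b), (v_0, v_b), (v_0, v_c)]$, $\sigma_{2k} = [(v_k, v_b), (v_0, v_b), (v_0, v_c)]$, $\sigma_{2k+1} = [(v_k, v_a), (v_k, v_b), (v_0, v_b)]$, and finally $\sigma_{2k+2} = [(v_k, v_a), (v_0, v_a), (v_0, v_b)]$.

In total, we extract a sequence of $(2k+3)$ oriented 2-simplices, $T_c = (\sigma_0, \sigma_1, \dots, \sigma_{2k+2})$, such that $\sigma_i$ and $\sigma_{(i+1) \bmod (2k+3)}$ share one edge. The extracted 2-simplices and their orientations as well as all shared edges are shown in Figure 8(c). We flip the orientations of $[(v_i, v_a), (v_{i+1}, v_a), (v_{i+1}, v_b)]$ solely to ensure that the submatrix constructed below has a simple form, which makes it easier to compute the determinant.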

Based on $T_c$, we obtain $M_1$, a $(2k+3) \times (2k+3)$ submatrix of $[\partial]$ where each row corresponds to a shared edge and each column corresponds to a 2-simplex in $T_c$. The entry values of $M_1$ are the signed coefficients of each selected 1-simplex from the boundaries of selected 2-simplices.

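$$M_1 = \begin{bmatrix} -1 & 0 & \cdots & 0 & 0 & 0 & 0 & 1 \\ -1 & 1 & & & & & & \\ & \ddots & \ddots & & & & & \\ & & -1 & 1 & & & & \\ & & & 1 & 1 & & & \\ & & & & 1 & 1 & & \\ & & & & & 1 & 1 & \\ & & & & & & -1 & 1 \end{bmatrix}$$

(blank entries are zero). The determinant of $M_1$ is:

$$\det M_1 = \det \begin{bmatrix} -1 & 1 & & & & & \\ & \ddots & \ddots & & & & \\ & & -1 & 1 & & & \\ & & & 1 & 1 & & \\ & & & & 1 & 1 & \\ & & & & & 1 & 1 \\ & & & & & & -1 \end{bmatrix} - \det \begin{bmatrix} 1 & & & & & & \\ -1 & \ddots & & & & & \\ & \ddots & 1 & & & & \\ & & 1 & 1 & & & \\ & & & 1 & 1 & & \\ & & & & 1 & 1 & \\ & & & & & -1 & 1 \end{bmatrix} = (-1)^{2k-2} \times (-1) - 1^{2k+2} = -2.$$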

Since the determinant of $M_1$ is −2, and $M_1$ is a submatrix of $[\partial]$, $[\partial]$ is not totally unimodular. □
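The determinant value can be reproduced numerically for small $k$. The Python sketch below builds a $(2k+3) \times (2k+3)$ matrix with ones on the diagonal below the first row, one entry per row on the subdiagonal, and two entries in the first row. The exact sign pattern used here is an assumption, chosen to be consistent with the stated value $\det M_1 = -2$; it is not guaranteed to match the original matrix entry for entry. The function names are hypothetical.

```python
def det(m):
    """Integer determinant by Laplace expansion along the first row, skipping zeros."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        if entry:
            minor = [row[:j] + row[j + 1:] for row in m[1:]]
            total += (-1) ** j * entry * det(minor)
    return total


def cycle_matrix(k):
    """Assumed sign pattern for a (2k+3) x (2k+3) cycle matrix with determinant -2."""
    n = 2 * k + 3
    m = [[0] * n for _ in range(n)]
    m[0][0], m[0][n - 1] = -1, 1      # first row closes the cycle
    for r in range(1, n):
        m[r][r] = 1                   # diagonal entry
        if r <= 2 * k - 2:
            m[r][r - 1] = -1          # first 2k-2 shared edges
        elif r <= 2 * k + 1:
            m[r][r - 1] = 1           # next three shared edges
        else:
            m[r][r - 1] = -1          # last shared edge
    return m


for k in (3, 4, 5):
    print(k, det(cycle_matrix(k)))
```

Because every row has only two non-zero entries, the expansion stays cheap even though Laplace expansion is exponential in general.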

The minimal pair of input graphs that satisfies the conditions in Theorem 8 consists of a graph with one 3-node cycle and one additional edge incident to the cycle, together with an acyclic, connected graph with three nodes. In practice, most non-trivial edge-labeled Eulerian graphs satisfy these conditions.

According to the definitions in Dey et al. [25], the subgraph used to construct M1 in the above proof (Figure 8(c)) is a Möbius subcomplex, and M1 is a (2k+3)-Möbius cycle matrix (MCM). Theorem 8 also establishes that there may exist a Möbius subcomplex in an alignment graph, which corrects the false claim made in Lemma 2 in [1].

Theorem 2 in Ebrahimpour Boroojeny et al. [1] takes a more algebraic approach, attempting to demonstrate that $[\partial]$ is TU by establishing that the alignment graph is a Möbius-free product space. However, the property of being Möbius-free globally does not imply the absence of Möbius subcomplexes locally. As we show in Theorem 8, although the alignment graph $\mathcal{A}(G_1^{\mathrm{sub}}, G_2^{\mathrm{sub}})$ is homotopically equivalent to the one-dimensional circle, which is Möbius-free, it still contains a Möbius subcomplex.

C.2. The LP yields optimal fractional solutions

The fact that $[\partial]$ is not totally unimodular does not guarantee that the LP in (11)–(12) has a fractional optimal objective value. In this section, we prove that the LP in (11)–(12) does not always yield integer optimal solutions by constructing a specific example with a fractional optimal objective value.

Theorem 9. The LP in (5)–(8) and the LP in (11)–(12) do not always yield optimal integer solutions.

We prove the above theorem by giving an example where the LP in (5)–(8) yields a fractional optimal solution. Since, by Theorem 7, the two LPs are equivalent, it follows that the LP in (11)–(12) also yields the same fractional optimal solution.

Construct $G_1$ and $G_2$ such that their edges and edge labels are equal to the ones specified in Figure 9(a). Let the edge multi-set of $\mathcal{A}(G_1,G_2)$ be $E$. We set the cost of an edge to 0 if the edge matches two equal characters and to 1 otherwise. Construct a vector $x^* \in \mathbb{R}^{|E|}$ and set the entries corresponding to the edges in Figure 9(b) to 0.5, except for the edge $[(v_3, v_c), (v_0, v_f)]$, whose entry is set to 1. Set the rest of the entries of $x^*$ to 0.

Figure 9:

An example of a fractional optimal solution to the LP in (11)–(12) and the LP in (5)–(8). (a) A pair of input graphs to the LP in (11)–(12) and the LP in (5)–(8). Letters in red are edge labels. (b) A subgraph of $\mathcal{A}(G_1,G_2)$ that is induced by alignment edges with non-zero weights (blue font) in an optimal solution to the LPs. The letters in red show the matching between the edge labels or between edge labels and gaps.

Lemma 5. $x^*$ is an optimal solution to the LP in (5)–(8) constructed with $\mathcal{A}(G_1,G_2)$ and $T(G_1,G_2)$.

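Proof. We prove the optimality of $x^*$ via complementary slackness. We first write the LP in (5)–(8) in standard form.

$$\begin{aligned} \underset{x \in \mathbb{R}^{|E|}}{\text{minimize}} \quad & \sum_{e \in E} \delta_e x_e \\ \text{subject to} \quad & Ax = b, \\ & x_e \geq 0 \ \text{ for all } e \in E. \end{aligned} \tag{45}$$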

Here, $\delta$ is a vector of size $|E|$ where each entry $\delta_e$ is the cost of edge $e$. The constraint matrix $A$ of the primal LP (45) has $|E|$ columns and $|V| + |E_1| + |E_2| = m$ rows, where $V$ is the vertex set of $\mathcal{A}(G_1,G_2)$, and $E_1$ and $E_2$ are the edge multi-sets of the input graphs. The first $|V|$ rows correspond to the constraints specified in (6). The rest of the rows correspond to the constraints in (7) that enforce the projected multi-set of edges to be equal to the multi-set of edges in each input graph. Since the input graphs both contain Eulerian tours, the vector $b$ has size $m$, where the first $|V|$ entries are zeros and the rest of the entries are ones.

We write the dual form of LP (45) as follows.

$$\begin{aligned} \underset{y \in \mathbb{R}^{m}}{\text{maximize}} \quad & \sum_{j=1}^{m} b_j y_j \\ \text{subject to} \quad & A^{T} y \leq \delta. \end{aligned} \tag{46}$$

Let $\mathrm{obj}_x^p$ denote the objective value of LP (45) for a given $x$, and let $\mathrm{obj}_y^d$ denote the objective value of LP (46) for a given $y$. To show that $x^*$ is an optimal solution to the LP in (5)–(8), we need to show that there exists a feasible solution $y^*$ to the dual LP that satisfies the complementary slackness conditions and $\mathrm{obj}_{y^*}^d = \mathrm{obj}_{x^*}^p$.
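The optimality certificate used in this argument can be expressed as a short check. The Python sketch below tests primal feasibility, dual feasibility, complementary slackness and equality of the two objective values for an LP of the form (45) with dual (46). The function name `certifies_optimality` and the one-constraint toy LP are hypothetical and are not the instance of Figure 9; exact rational arithmetic is used so that a fractional solution can be checked without rounding.

```python
from fractions import Fraction


def certifies_optimality(a, b, c, x, y):
    """Check that (x, y) is an optimal primal-dual pair for
    min c.x s.t. a x = b, x >= 0   and   max b.y s.t. a^T y <= c."""
    m, n = len(a), len(c)
    primal_feasible = all(v >= 0 for v in x) and all(
        sum(a[i][j] * x[j] for j in range(n)) == b[i] for i in range(m)
    )
    # Slack of each dual constraint: c_j - (a^T y)_j must be non-negative.
    reduced = [c[j] - sum(a[i][j] * y[i] for i in range(m)) for j in range(n)]
    dual_feasible = all(r >= 0 for r in reduced)
    # Complementary slackness: x_j > 0 forces the j-th dual constraint to be tight.
    slackness = all(x[j] == 0 or reduced[j] == 0 for j in range(n))
    same_value = sum(c[j] * x[j] for j in range(n)) == sum(b[i] * y[i] for i in range(m))
    return primal_feasible and dual_feasible and slackness and same_value


half = Fraction(1, 2)
a, b = [[1, 1]], [1]

print(certifies_optimality(a, b, [1, 1], [half, half], [1]))  # fractional optimum
print(certifies_optimality(a, b, [1, 2], [1, 0], [1]))        # integral optimum
print(certifies_optimality(a, b, [1, 2], [0, 1], [1]))        # slackness violated
```

The third call fails because the second variable is positive while its dual constraint has slack, which is exactly the situation that complementary slackness excludes.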

Since each alignment edge has two endpoints and is projected to at most one edge in each graph, there are at most 4 non-zero entries in each column of $A$. The variables in $y$ of the dual form can be interpreted in three parts. Each of the first $|V|$ entries of $y$ is assigned to a vertex in the alignment graph, the next $|E_1|$ entries are assigned to edges in $G_1$, and the last $|E_2|$ entries are assigned to edges in $G_2$. There are $|E|$ constraints in the dual LP, and the $e$-th constraint is assigned to the edge $e$ of the alignment graph, which has cost $\delta_e$. Therefore, each constraint that is assigned to a horizontal or a vertical edge can be written as

$$y_{v_e^{\mathrm{out}}} - y_{v_e^{\mathrm{in}}} + y_{e_i} \leq \delta_e, \tag{47}$$

where $i = 1$ if $e$ is a horizontal edge, and $i = 2$ if $e$ is a vertical edge. $y_{v_e^{\mathrm{in}}}$ and $y_{v_e^{\mathrm{out}}}$ are the $y$ entries assigned to the vertices at the start and end of edge $e$, and $y_{e_i}$ is the $y$ entry assigned to $\pi_i(e)$.

Similarly, each constraint that is assigned to a diagonal edge is

$$y_{v_e^{\mathrm{out}}} - y_{v_e^{\mathrm{in}}} + y_{e_1} + y_{e_2} \leq \delta_e. \tag{48}$$

We can verify that $x^*$ is a feasible solution of the primal form (45) by checking that constraints (6)–(7) are satisfied. The primal objective value can be computed in a straightforward way, and we obtain $\mathrm{obj}_{x^*}^p = 3.5$.

According to the complementary slackness conditions, since $x_e^* > 0$ for the edges shown in Figure 9(b), the corresponding constraints in the dual LP (46) must be tight, meaning that equality must hold in these constraints. The rest of the dual constraints may have slack.

Let the subgraph of $\mathcal{A}(G_1,G_2)$ shown in Figure 9(b) be $\mathcal{A}'$. Let $C'$ denote the cycle that traverses from $[(0,f),(4,a)]$ to $[(3,c),(0,f)]$, and let $C''$ denote the 4-node cycle that traverses $((0,f),(1,a),(2,e),(3,c))$. Let $C$ denote the concatenation of the two cycles. The projected cycle from $C$ to $G_1$ is

$$C_1 = (v_0, v_4, v_5, v_0, v_4, v_5, v_0, v_1, v_2, v_3, v_0, v_1, v_2, v_3, v_0). \tag{49}$$

The projected cycle from $C$ to $G_2$ is

$$C_2 = (v_f, v_a, v_e, v_c, v_d, v_a, v_b, v_c, v_d, v_a, v_b, v_c, v_f, v_a, v_e, v_c, v_f). \tag{50}$$

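Sum up all the constraints that are assigned to edges $e$ with $x_e^* > 0$. Since these edges form a cycle, we get:

$$\begin{aligned} & \sum_{e \in C} \left( y_{v_e^{\mathrm{out}}} - y_{v_e^{\mathrm{in}}} \right) + 2 \left( \sum_{e_1 \in C_1} y_{e_1} + \sum_{e_2 \in C_2} y_{e_2} \right) && (51) \\ &\quad = 0 + 2 \left( \sum_{e_1 \in C_1} y_{e_1} + \sum_{e_2 \in C_2} y_{e_2} \right) && (52) \\ &\quad = \sum_{e \in C} \delta_e = 7, && (53) \\ & \sum_{e_1 \in C_1} y_{e_1} + \sum_{e_2 \in C_2} y_{e_2} = 3.5. && (54) \end{aligned}$$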

The summed edge cost is 7 as there are 7 edges that are either mismatch edges or vertical edges.

All $y$ entries that correspond to vertices are free variables and appear in every constraint. After fixing the $y$ variables that satisfy constraint (54), the rest of the $y$ variables can be set to satisfy the dual constraints. We thus obtain $y^*$, which is a feasible solution to the dual LP.

The only entries in $y^*$ that could have non-zero dual costs are those that correspond to edges in $E_1$ and $E_2$. Since these corresponding dual costs are all 1,

$$\mathrm{obj}_{y^*}^d = \sum_{e_1 \in C_1} y_{e_1} + \sum_{e_2 \in C_2} y_{e_2} = 3.5 = \mathrm{obj}_{x^*}^p.$$

Since the costs of alignment graph edges are all integers, the fact that the LP in (11)–(12) and the LP in (5)–(8) yield fractional optimal objective values means that they must yield fractional solutions and assign fractional values to entries in $x$. Theorem 9 follows. Since the LP in (11)–(12) yields fractional solutions and GTED is always an integer, solving the LP in (11)–(12) does not solve GTED.

Appendix D. The average wall-clock time to solve ILPs on 3-cycle graphs

Table S1:

The average wall-clock time to solve the lower bound ILP, the exponential ILP (solved iteratively) and the compact ILP, and the number of iterations, for pairs of 3-cycle graphs for each value of $\mathrm{GTED} - \mathrm{GTED}_l$.

$\mathrm{GTED} - \mathrm{GTED}_l$ | Lower bound ILP runtime (s) | GTED iterative runtime (s) | Iterations | GTED compact runtime (s)
1.0 | 0.06 | 0.17 | 3.55 | 0.39
2.0 | 0.05 | 0.87 | 13.00 | 0.43
3.0 | 0.08 | 25.41 | 67.60 | 1.24
4.0 | 0.07 | 205.59 | 179.10 | 1.70
5.0 | 0.08 | 943.68 | 502.85 | 5.37

Footnotes

The source code to reproduce experimental results is available at https://github.com/Kingsford-Group/gtednewilp/.

Conflict of Interest: C.K. is a co-founder of Ocean Genomics, Inc.

References

  • [1] Boroojeny Ali Ebrahimpour, Shrestha Akash, Sharifi-Zarchi Ali, Gallagher Suzanne Renick, Sahinalp S. Cenk, and Chitsaz Hamidreza. Graph traversal edit distance and extensions. Journal of Computational Biology, 27(3):317–329, 2020.
  • [2] Pevzner Pavel A, Tang Haixu, and Waterman Michael S. An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of USA, 98(17):9748–9753, 2001.
  • [3] Polevikov Evgeny and Kolmogorov Mikhail. Synteny paths for assembly graphs comparison. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
  • [4] Minkin Ilia and Medvedev Paul. Scalable pairwise whole-genome homology mapping of long genomes with Bubbz. IScience, 23(6):101224, 2020.
  • [5] Mangul Serghei and Koslicki David. Reference-free comparison of microbial communities via de Bruijn graphs. In Proceedings of the 7th ACM international conference on bioinformatics, computational biology, and health informatics, pages 68–77, 2016.
  • [6] Huntsman Steve and Rezaee Arman. De Bruijn entropy and string similarity. arXiv preprint arXiv:1509.02975, 2015.
  • [7] Kupferman Orna and Vardi Gal. Eulerian paths with regular constraints. In Faliszewski Piotr, Muscholl Anca, and Niedermeier Rolf, editors, 41st International Symposium on Mathematical Foundations of Computer Science (MFCS 2016), volume 58 of Leibniz International Proceedings in Informatics (LIPIcs), pages 62:1–62:15, Dagstuhl, Germany, 2016. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. ISBN 978-3-95977-016-3.
  • [8] Jain Chirag, Zhang Haowen, Gao Yu, and Aluru Srinivas. On the complexity of sequence-to-graph alignment. Journal of Computational Biology, 27(4):640–654, 2020.
  • [9] Needleman Saul B and Wunsch Christian D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
  • [10] Dantzig George, Fulkerson Ray, and Johnson Selmer. Solution of a large-scale traveling-salesman problem. Journal of the Operations Research Society of America, 2(4):393–410, 1954.
  • [11] Dias Fernando HC, Williams Lucia, Mumey Brendan, and Tomescu Alexandru I. Minimum flow decomposition in graphs with cycles using integer linear programming. arXiv preprint arXiv:2209.00042, 2022.
  • [12] Miller Clair E, Tucker Albert W, and Zemlin Richard A. Integer programming formulation of traveling salesman problems. Journal of the ACM (JACM), 7(4):326–329, 1960.
  • [13] Qiu Yutong and Kingsford Carl. The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance. Bioinformatics, 38:i404–i412, 2022.
  • [14] Rubner Yossi, Tomasi Carlo, and Guibas Leonidas J. The Earth Mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
  • [15] Munkres James R. Elements of algebraic topology. CRC Press, 2018.
  • [16] Bradley Stephen P, Hax Arnoldo C, and Magnanti Thomas L. Applied mathematical programming. Addison-Wesley, 1977.
  • [17] Pele Ofir and Werman Michael. A linear time histogram metric for improved sift matching. In Computer Vision-ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, October 12–18, 2008, Proceedings, Part III 10, pages 495–508. Springer, 2008.
  • [18] Bourque Guillaume, Pevzner Pavel A, and Tesler Glenn. Reconstructing the genomic architecture of ancestral mammals: lessons from human, mouse, and rat genomes. Genome Research, 14(4):507–516, 2004.
  • [19] Vollger Mitchell R, Guitart Xavi, Dishuck Philip C, Mercuri Ludovica, Harvey William T, Gershman Ariel, Diekhans Mark, Sulovari Arvis, Munson Katherine M, Lewis Alexandra P, et al. Segmental duplications and their variation in a complete human genome. Science, 376(6588):eabj6965, 2022.
  • [20] Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. URL https://www.gurobi.com.
  • [21] Lefranc Marie-Paule. IMGT, the international ImMunoGeneTics information system. Cold Spring Harbor Protocols, 2011(6):595–603, 2011.
  • [22] Li Yilong, Roberts Nicola D, Wala Jeremiah A, Shapira Ofer, Schumacher Steven E, Kumar Kiran, Khurana Ekta, Waszak Sebastian, Korbel Jan O, Haber James E, et al. Patterns of somatic structural variation in human cancer genomes. Nature, 578(7793):112–121, 2020.
  • [23] Darai-Ramqvist Eva, Sandlund Agneta, Müller Stefan, Klein George, Imreh Stefan, and Kost-Alimova Maria. Segmental duplications and evolutionary plasticity at tumor chromosome break-prone regions. Genome Research, 18(3):370–379, 2008.
  • [24] Ahuja Ravindra K, Magnanti Thomas L, and Orlin James B. Network flows: Theory, algorithms and applications. New Jersey: Prentice-Hall, 1993.
  • [25] Dey Tamal K, Hirani Anil N, and Krishnamoorthy Bala. Optimal homologous cycles, total unimodularity, and linear programming. SIAM Journal on Computing, 40(4):1026–1044, 2011.
  • [26] Schrijver Alexander. Theory of linear and integer programming. John Wiley & Sons, 1998.

Articles from ArXiv are provided here courtesy of arXiv
