Abstract
Background
Superbubbles are distinctive subgraphs in direct graphs that play an important role in assembly algorithms for high-throughput sequencing (HTS) data. Their practical importance derives from the fact they are connected to their host graph by a single entrance and a single exit vertex, thus allowing them to be handled independently. Efficient algorithms for the enumeration of superbubbles are therefore of important for the processing of HTS data. Superbubbles can be identified within the strongly connected components of the input digraph after transforming them into directed acyclic graphs. The algorithm by Sung et al. (IEEE ACM Trans Comput Biol Bioinform 12:770–777, 2015) achieves this task in -time. The extraction of superbubbles from the transformed components was later improved to by Brankovic et al. (Theor Comput Sci 609:374–383, 2016) resulting in an overall -time algorithm.
Results
A re-analysis of the mathematical structure of superbubbles showed that the construction of auxiliary DAGs from the strongly connected components in the work of Sung et al. missed some details that can lead to the reporting of false positive superbubbles. We propose an alternative, even simpler auxiliary graph that solved the problem and retains the linear running time for general digraph. Furthermore, we describe a simpler, space-efficient -time algorithm for detecting superbubbles in DAGs that uses only simple data structures.
Implementation
We present a reference implementation of the algorithm that accepts many commonly used formats for the input graph and provides convenient access to the improved algorithm. https://github.com/Fabianexe/Superbubble.
Keywords: Superbubble, de Bruijn graph, Genome assembly, Linear time algorithm
Background
Under idealizing assumption, the genome assembly problem reduces to finding an Eulerian path in the de Bruijn graph [1] that represents the collection of sequencing reads [2]. In real-life data sets, however, sequencing errors and repetitive sequence elements contaminate the de Bruijn graph with additional, false, vertices and edges. Assembly tools therefore employ filtering steps that are based on recognizing local motifs in the de Bruijn graphs that correspond to this kind of noise, see e.g. [3]. Superbubbles also appear naturally in the multigraphs in the context of supergenome coordinatization [4], i.e., the problem of finding good common coordinate systems for multiple genomes.
The simplest such motif is a bubble, comprising two or more isolated paths connecting a source s to a target t, see [5] for a formal analysis. While bubbles are easily recognized, most other motives are much more difficult to find. Superbubbles are a complex generalization of bubbles that were proposed in [6] as an important class of subgraphs in the context of HTS assembly. It will be convenient for the presentation in this paper to first consider a more general class of structure which are obtained by omitting the minimality criterion:
Definition 1
(Superbubbloid) Let be a directed multi-graph and let (s, t) be an ordered pair of distinct vertices. Denote by the set of vertices reachable from s without passing through t and write for the set of vertices from which t is reachable without passing through s. Then the subgraph induced by is a superbubbloid in G if the following three conditions are satisfied:
, i.e., t is reachable from s (reachability condition).
(matching condition).
is acyclic (acyclicity condition).
We call s, t, and the entrance, exit, and interior of the superbubbloid. We denote the induced subgraph by if it is a superbubbloid with entrance s and exit t.
A superbubble is a superbubbloid that is minimal in the following sense:
Definition 2
A superbubbloid is a superbubble if there is no such that is a superbubbloid.
We note that Definition 2 is a simple rephrasing of the language used in [6], where a simple -time algorithm was proposed that, for each candidate entrance s, explicitly retrieves all superbubbles . Since the definition is entirely based on reachability, multiple edges are irrelevant and can be omitted altogether. Hence we only consider simple digraphs throughout.
The vertex set of every digraph G(V, E) can be partitioned into its strongly connected components. Denote by the set of singletons, i.e., the strongly connected components without edges. One easily checks that the induced subgraph is acyclic. Furthermore, denote by the partition of V comprising the non-singleton connected components of G and the union of the singleton. The key observation of [7] can stated as
Proposition 1
Every superbubble in G (V , E ) is an induced subgraph of G [C] for some.
It ensures that it is sufficient to search separately for superbubbles within G[C] for . However, these induced subgraphs may contain additional superbubbles that are created by omitting the edges between different components. In order to preserve this information the individual components C are augmented by artificial vertices [7]. The augmented component C is then converted into a directed acyclic graph (DAG). Within each DAG the superbubbles can be enumerated efficiently. With the approach of [7], this yields an overall -time algorithm, the complexity of which is determined by the extraction of the superbubbles from the component DAGs. The partitioning of G(V, E) into the components G[C] for and the transformation into DAGs can be achieved in -time. Recently, Brankovic et al. [8] showed that superbubbles can be found in linear time within a DAG. Their improvement uses the fact that the DAG can always be topological ordered in such a way that superbubbles appears as a contiguous blocks. In this ordering, furthermore, the candidates for entrance and exit vertices can be narrowed down considerably. For each pair of entrance and exit candidates (s, t), it can then be decided in constant time whether is indeed a superbubble. Using additional properties of superbubbles to further prune the candidate list of (s, t) pairs results in -time complexity.
The combination of the work of [7] with the improvement of [8] results in the state of the art algorithm. The concept of a superbubble was extended to bi-directed and bi-edged graphs, called ultrabubble in [9–11]. The enumeration algorithm for ultrabubbles in [9] has a worst case complexity of , and hence does not provide an alternative for directed graphs.
A careful analysis showed that sometimes false-positive superbubbles are reported, see Fig. 1. These do not constitute a fatal problem because they can be recognized easily in linear total time simply by checking the tail of incoming and head of outgoing edges. It is nevertheless worth while to analyse the issue and to seek a direct remedy. As we shall see below, the false positive subgraphs are a consequence of the way in which a strongly connected component C is transformed into a two DAGs that are augmented by either the source or target vertices.
Theory
In the first part of this section we revisit the theory of superbubbles in digraphs in some more detail. Although some of the statements below have appeared at least in similar for in the literature [6–8] we give concise proofs and take care to disentangle properties that depend on minimality from those that hold more generally. This refined mathematical analysis sets the stage in the second part for identifying the reason for the problems with the auxiliary graph constructed in [7] shows how the problem can be solved efficiently in these cases using an even simpler auxiliary graph. In the third part we elaborate on the linear time algorithm on [8] for DAGs. We derive a variant that has the same asymptotic running time but seems easier to explain.
Weak superbubbloids
Although we do not intend to compute superbubbloids in practice, they feature several convenient mathematical properties that will simplify the analysis of superbubbles considerably. The main aim of this section is to prove moderate generalizations of the main results of [6, 7]. To this end, it will be convenient to rephrase the reachability and matching conditions (S1) and (S2) for the vertex set U of superbubbloid with entrance s and exit t in the following, a more expanded form.
Lemma 1
Let G be a digraph, and Then (S1) and (S2) holds for if and only if the following four conditions are satisfied
- (S.i)
Every is reachable from s.
- (S.ii)
t is reachable from every .
- (S.iii)
If and then every path contains s.
- (S.iv)
and then every path contains t.
Proof
Suppose (S1) and (S2) are true. Then and implies, by definition, that u is reachable from s, i.e. (S.i) and (S.ii) holds. By (S2) we have . If it is not reachable from s without passing through t. Since every u is reachable from s without passing through t, we would have if w was reachable from any on a path not containing t, hence (S.iv) holds. Similarly, since t is reachable from u without passing through s, we would have if v could be reached from w along a path that does not contain s, i.e. (S.iii) holds.
Now suppose (S.i), (S.ii), (S.iii), and (S.iv) holds. Clear, both (S.i) and (S.ii) already imply (S1). Since is reachable from s by (S.ii) and every path reaching pases through t by (S.iii), we have . By (S.i), t is reachable from every and by (S.iv) t can by reached from only by passing through s, i.e., , i.e., .
Corollary 1
Suppose U, s, and t satisfies (S.i), (S.ii), (S.iii), and (S.iv). Then every path connecting s to and u to t is contained within U.
Proof
Assume, for contradiction, that there path containing a vertex By definition of the set is not reachable from without passing through t first, i.e., w cannot be part of a path.
Corollary 1 shows that subgraphs satisfying (S1) and (S2) related to reachability structures explored in some detail in [12, 13]. In the following it will be useful to consider
- (S.v)
If (u, v) is an edge in U then every path in G contains both t and s.
In the following we shall see that (S.v) acts a slight relaxation of the acyclicity axiom ((S3)).
Lemma 2
Let G(V, E) be a digraph, and
If U is the vertex set of the superbubbloid it satisfies (S.v).
If U satisfies (S.i), (S.ii), (S.iii), (S.iv), and (S.v), then the subgraph induced by U without the potential edge from t to s, is acyclic.
Proof
Suppose U is the vertex set of a superbubbloid with entrance s and exit t. If (u, v) is an edge in U, then by (S3). Since v is reachable from s within U, no path can exist within U, since otherwise there would be a cycle, contradicting (S3), that any path passes through t. There are two cases: If there , any path containing this edge trivially contains both s and t. The existence of the edge (t, s) contradicts (S3). Otherwise, any path contain at least one vertex . By (S.iii) and (S.iv) every path contains t and every path contain s and t, respectively. Hence the first statement holds.
Conversely, suppose (S.v) holds, i.e., every directed cycle Z within U contains s and t. Suppose (t, s) is not contained Z, i.e., there is vertex such that . By (S.ii), t is reachable from u without passing through s, and every path is contained in U by Corollary 1. Thus there is a directed cycle within U that contains u and t but not s, contradicting (S.v). Removing the edge (t, s) thus cuts every directed cycle within U, and hence is acyclic.
Although the definition of [6] (our Definition 2) is also used in [7], the notion of a superbubble is tacitly relaxed in [7] by allowing an edge (t, s) from exit to entrance, although this contradicts the acyclicity condition (S3). This suggests
Definition 3
(Weak Superbubbloid) Let G(V, E) be a digraph, and . The subgraph G[U] induced by U is a weak superbubbloid if U satisfies (S.i), (S.ii), (S.iii), (S.iv), and (S.v).
We denote a weak superbubbloids with entrance s and exit t by and write for its vertex set. As an immediate consequence of Definition 3 and Lemma 2 we have
Corollary 2
A weak superbubbloid is a superbubbloid in G(V, E) if and only if .
The possibility of an edge connecting t to s will play a role below, hence we will focus on weak superbubbloids in this contribution.
First we observe that a weak superbubbloids contained within another weak superbubbloid must be a superbubbloid because the existence of an edge from exit to entrance contradicts (S.v) for the surrounding weak superbubbloid. We record this fact as
Lemma 3
If and are weak superbubbloids with and then is a superbubbloid.
The result will be important in the context of minimal (weak) superbubbloids below.
Another immediate consequence of Lemma 2 is
Corollary 3
Let be a weak superbubbloid in G. If there is an edge (u, v) in that is contained in a cycle, then every edges in is contained in cycle containing s and t.
Proof
By (S.v) there is cycle running though s and t. Let (u, v) be an edge in . Since u is reachable from s and v reaches t within U, there is a cycle containing s, t, and the edge (u, v).
Theorem 1
Every weak superbubbloid in G(V, E) is an induced subgraph of G[C] for some
Proof
First assume that contains an edge (u, v) that is contained in cycle. Then by (S.v), there is cycle through s and t and thus in particular a (t, s) path. For every , there is a path within U from s to t through u by (S.i), (S.ii), and Lemma 1. Thus is contained as an induced subgraph in a strongly connected component G[C] of G. If there is no edge in that is contained in a cycle, then every vertex in is a strongly connected component on its own. is therefore an induced subgraph of .
Theorem 1 establishes Proposition 1, the key result of [7], in sufficient generality for our purposes. Next we derive a few technical results that set the stage for considering minimality among weak superbubbloids.
Lemma 4
Assume that is a weak superbubbloid and let u be an interior vertex of Then is a superbubbloid if and only if is a superbubbloid.
Proof
Suppose is a superbubbloid. Set and consider The subgraph induced by is an induced subgraph of Hence it is acyclic and in particular i.e., (S.v) and even (S3) holds. Since every path from s to t runs through u. Since w is reachable from s there is a path from s through u to w, i.e., w is reachable from u. Thus (S.i) holds. (S.ii) holds by assumption since t is reachable from w. Now suppose and If then every path passes through s and then through u, the exit of before reaching w. If then and thus every path passes through u as the exit of Hence satisfied (S.iii). If then and thus every path passes through s. By (S.v) there is no path within and thus any includes (t, s) or a vertex By construction, all paths contain t, and thus all paths also pass through t and also satisfies (S.iv).
Conversely suppose is a superbubbloid. We have to show that induces a superbubbloid. The proof strategy is very similar. As above we observe that (S.v), (S.i), and (S.ii) are satisfied. Now consider and If then every path contains s; otherwise and passes through t and thus also through s by Corollary 1, thus (S.iii) holds. If then in which case every path passes through u. Otherwise then every runs through and thus in particular also through u. Hence (S.iv) holds.
Lemma 5
Let and be two weak superbubbloids such that u is an interior vertex of s is an interior vertex of w is not contained in and t is not contained in Then the intersection is also a superbubbloid.
Proof
First consider the intersection is reachable from s, hence (S1) holds. Furthermore is an induced subgraph of and hence again acyclic (S3). Set and consider First we note that v is reachable from s by definition of and u is reachable from v by definition of Let and If then every path passes through s; if then (and ) and thus every path passes through w. Since we know that every path contains s.
If then every path passes through u; otherwise but thus every path passes through and hence also through u. Thus is a superbubbloid.
We include the following result for completeness, although it is irrelevant for the algorithmic considerations below.
Lemma 6
Let and be defined as in Lemma 5 . Then the union is superbubbloid if and only if the induced subgraph satisfies (S.v).
Proof
Since are superbubbloids, t is reachable from w, i.e., (S1) holds. By the same token, every is reachable from w or s and reaches u or t. Since s is reachable from w and t is reachable from u, every is reachable from w and reaches t. Now consider and . If every path passed through w; if , it passes through and thus also through w. If , then every path passed through t. If it passes through and thus also through t. Thus satisfies (S2). Thus is a weak superbubbloid if and only if (S.v) holds.
Lemma 7
Let be a weak superbubbloid in G with vertex set Then is a weak superbubbloid in the induced subgraph G[W] whenever
Proof
Conditions (S.i), (S.ii), and (S.v) are trivially conserved when G is restricted to G[W]. Since every and path with and within W is also such a path in V, we conclude that (S.iii) and (S.iv) are satisfied w.r.t. W whenever they are true w.r.t. the larger set V.
The converse is not true. The restriction to induced subgraphs thus can introduce additional (weak) superbubbloids. As the examples in Fig. 1 show, it is also possible to generate additional superbubbles.
Finally we turn our attention to the minimality condition.
Definition 4
A weak superbubbloid is a weak superbubble if there is no interior vertex in such that is a weak superbubbloid.
The “non-symmetric” phrasing of the minimality condition in Definitions 2 and 4 [6–8] is justified by Lemma 4: If and with are superbubbloids, then is also a superbubbloid, and thus is not a superbubble. As a direct consequence of Lemma 3, furthermore, we have
Corollary 4
Every superbubble is also a weak superbubble.
Lemma 4 also implies that every weak superbubbloid, which is not a superbubble itself, can be decomposed into consecutive superbubbles:
Corollary 5
If is a weak superbubbloid, then it is either a weak superbubble or there is a sequence of vertices with such that is a superbubble for all
A useful consequence of Lemma 5, furthermore, is that superbubbles cannot overlap at interior vertices since their intersection is again a superbubbloid and thus neither of them could have been minimal. Furthermore, Lemma 4 immediately implies that and are also superbubbloids, i.e., neither nor is a superbubble in the situation of Lemma 5. Figure 2 shows a graph in which all (weak) superbubbloids and superbubbles are indicated.
Reduction to auperbubble finding in DAGs
Theorem 1 guarantees that every weak superbubbloid and thus every superbubble in G(V, E) is completely contained within one of induced subgraphs G[C], . It does not guarantee, however, that a superbubble in G[C] is also a superbubble in G. This was already noted in [7]. This fact suggests to augment the induced subgraph G[C] of G by an artificial source a and an artificial sink b.
Definition 5
The augmented graph is constructed from G[C] by adding the artificial source a and the artificial sink b. There is an edge (a, x) in whenever has an incoming edge from another component in G and there is an edge (x, b) whenever has an outgoing edge to another component of G.
Since is acyclic, a has only outgoing edges and b only incoming ones, it follows that the augmented graph is also acyclic.
Lemma 8
is a weak superbubbloid in G if and only if it is a weak superbubbloid of or a superbubbloid in that does not contain an axiliary source a or an auxiliary sink b.
Proof
First assume that is an induced subgraph of the strongly connected component G[C] of G. By construction, G[C] is also a strongly connected component of . Thus reachability within C is the same w.r.t. G and . Also by construction, a vertex is reachable from in G if an only of b is reachable from x in . Similarly, a vertex is reachable from if and only if x is reachable from a. Hence is a (weak) superbubbloid w.r.t. G if and only if it a weak superbubbloid w.r.t. . For the special case that is an induced subgraph of the acyclic graph we can argue in exactly the same manner.
For strongly connected components C, the graph contains exactly 3 strongly connected components whose vertex sets are C and the singletons and . Since (a, b) is not an edge in , every weak superbubbloid in is contained in G[C] and hence contains neither a nor b. Superbubbloids containing a or b cannot be excluded for the acyclic component , however.
It is possible, therefore, to find the weak superbubbloids of G by computing the weak superbubbloids not containing an artificial source or sink vertex in the augmented graphs. In the remainder of this section we show how this can be done efficiently.
The presentation below depends strongly on the properties of depth first search (DFS) trees and vertex orders associated with them. We thus briefly recall their relevant features. A vertex order is a bijection . We write is the vertex at the i-th position of the -ordered vertex list. Later we will also need vertex sets that form intervals w.r.t. . These will be denoted by for a -interval of vertices.
DFS on a strongly connected digraph G (exploring only along directed edges) is well known to enumerate all vertices starting from an arbitrary root [14]. The corresponding DFS tree consists entirely of edges of G pointing away from the root. In the following we will reserve the symbol for the reverse postorder of the DFS tree T in a strongly connected graph. Edges of G can be classified relative to a given DFS tree T with root x. By definition, all tree edges (u, v) are considered to be oriented away from the root w; hence . An edge is a forward edge if v is reachable from u along a path consisting of tree edges, hence it satisfied . The edge (u, v) is a backward edge if u is reachable from v along a path of consisting of tree edges, hence . For remaining, so-called cross edges have no well-defined behavior w.r.t. . We refer to [14, 15] for more details on depth first search, DFS trees, and the associated vertex orders.
A topological sorting of a directed graph order of V such that holds for every directed (u, v) [16]. Equivalently, is a topological sorting if there are no backward edges. A directed graph admits a topological sorting if and only if it is a DAG. In particular, if v is reachable from u then must hold. In a DAG, a topological sorting can be obtained as the reverse postorder of an arbitrary DFS tree that is constructed without considering the edge directions in G [15].
Lemma 9
Let G be a strongly connected digraph, be a weak superbubbloid in G, and the inverse postorder of a DFS tree T rooted at w. Then the induced subgraph of G contains no backward edge w.r.t. except possibly (t, s).
Proof
Let T be a DFS tree rooted in T and let denote the preordering of T. First we rule out Since t cannot be reached from anywhere along a path that does not contain s, this is only possible if , i.e., if t is the root of DFS tree T. This contradicts the assumption that for some w outside . Hence . The DFS tree T therefore contains a directed path from s to t. Since interior vertices of are only reachable through s and reach outside only through t, it follows that the subtree of T induced by is a tree and only s and t are incident to edges of T outside of . In the DFS reverse postorder we therefore have for every vertex u interior to , and either or for all w outside of . The graph obtained from by removing the possible (t, s) edge is a DAG, the subtree is a DFS tree on , whose reverse postorder is collinear with rho, i.e., holds whenever . Therefore, there are no back-edges in .
Lemma 9 is the key prerequisite for constructing an acyclic graph that contains all weak superbubbles of . Similar to the arguments above, however, we cannot simply ignore the backward edges. Instead, we will again add edges to the artificial source and sink vertices.
Definition 6
Given a DFS tree T with a root that is neither an interior vertex nor the exit of a weak superbubbloid of , the auxiliary graph is obtained from by replacing every backward edge (v, u) with respect to in with both an edge (a, u) and an edge (v, b).
Note that Definition 6 implies that all backward edges (u, v) of are removed in . As a consequence, is acyclic. The construction of is illustrated in Fig. 3.
Lemma 10
Let C be a strongly connected component of G and let T be a DFS tree on with a root that is neither an interior vertex nor the exit of a weak superbubbloid of G. Then with is a weak superbubble of G contained in if and only if is a superbubble in that does not contain the auxiliary source a or the auxiliary sink b.
Proof
Assume that is a weak superbubble in that does not contain a or b. Lemma 8 ensures that this is equivalent to being a weak superbubble of G. By Lemma 9, contains no backward edges in , with the possible exception of the edge (t, s). Since and by construction differ only in the backward edges, the only difference affecting is the possible insertion of edges from a to s or from t to b. Neither affects a weak superbubble, however, and hence is a superbubble in .
Now assume that is a superbubble in with vertex set and . Since the restriction of to C is by construction a subgraph of , we know that reachability within C w.r.t. to implies reachability w.r.t. . Therefore satisfies (S.i) and (S.ii) also w.r.t. . Therefore, if is not a weak superbubble in then there must be a backwards edge (x, v) or a backward edge (v, x) with v in the interior of . The construction of , however, ensures that then contains an edge (a, v) or (v, b), respectively, which would contradict (S.iii), (S.iv), or acyclicity (in case ) and hence (S.v). Therefore is a superbubble in .
The remaining difficulty is to find a vertex w that can safely be used a root for the DFS tree T. In most cases, one can simply set since Lemma 8 ensures that a is not part of a weak superbubbloid of G. However, there is no guarantee that an edge of the form (a, w) exists, in which case is not connected. Thus another root for the DFS tree must be chosen. A closer inspection shows that three cases have to be distinguished:
-
A.
a has an out-edge. In this case we can choose a as the root of the DFS tree, i.e., .
-
B.
a has no edge, but there b has an in-edge. In this case we have to identify vertices that can only be entrances of a superbubble. These can then be connected with the artificial source vertex without destroying a superbubble.
-
C.
Neither a nor b have edges. The case requires special treatment.
In order to handle case (B), we use the following
Lemma 11
Let a and b be the artificial source and sink of Let and be a successor of a and a predecessor of b, respectively. Then
-
i)
is neither an interior vertex nor the exit of a superbubble.
-
ii)
A predecessor of is neither an interior vertex nor an entrance of a superbubble.
-
iii)
is neither an interior vertex nor the entrance of a superbubble.
-
iv)
A successor of is neither an interior vertex nor an exit of a superbubble.
Proof
If is contained in a superbubble, it must be the entrance, since otherwise its predecessor, the artificial vertex a would belong to the same superbubble. If is in the interior of an entrance, the would be an interior vertex of a superbubble, which is impossible by (i). The statements for b follow analogously.
Corollary 6
If b has an inedge in then every successor of every predecessor of b can be used a root of the DFS search tree. At least one such vertex exists.
Proof
By assumption, b has at least one predecessor . Since G[C] is strongly connected, has at least one successor , which by Lemma 11(iv) is either not contained in a superbubble or is the entrance of a superbubble.
The approach sketched above fails in case (C) because there does not seem to be an efficient way to find a root for DFS tree that is guaranteed not to be an interior vertex or the exit of a (weak) superbubbloid. Sung et al. [7] proposed the construction of a more complex auxiliary DAG H that not only retains the superbubbles of G[C] but also introduces additional ones. Then all weak superbubbles in H(G) are identified and tested whether they also appeared in G[C].
Definition 7
(Sung graphs) Let G be a strongly connected graph with a DFS tree T with root x. The vertex set consists of two copies and of each vertex , a source a, and a sink b. The edge set of H comprises four classes of edges: (i) edges and whenever (u, v) is a forward edge in G w.r.t. T. (ii) edges whenever (u, v) is a backward edge in G. (iii) edges whenever (a, v) is a edge in G and (iv) edges whenever (v, b) is a edge in G.
The graph H is a connected DAG since a topological sorting on H is obtained by using the reverse postorder of T within each copy of V(G) and placing the first copy entirely before the second. We refer to [7] for further details.
The graph H contains two types of weak superbubbloids: those that contain no backward edges w.r.t. T, and those that contain backward edges. Members of the first class do not contain the root of T by Lemma 9 and hence are also superbubbles in G. Every weak superbubble of this type is present (and will be detected) in both and . A weak superbubble with backward edge has a “front part” in and a “back part” in and appears exactly once in H. The vertex sets and are disjoint. It is possible that H contains superbubbles that have duplicated vertices, i.e., vertices and deriving from the same vertex in V. These candidates are removed together with one of the copies of superbubbles appearing in both and . We refer to this filtering step as Sung filtering as it was proposed in [7].
This construction is correct in case (C) if there are no other edges connecting G[C] within G. The additional connections to a and b introduced to account for edges that connect G[C] to other vertices in G, may fail. To see this, consider an interior vertex in a superbubble with a backward edge. It is possible that its original has an external out edge and thus b should be connected to . This is not accounted for in the construction of H, which required that is connected to a only, and is connected to b only. These ”missing” edges may introduce false positive superbubbles as shown in Fig. 1.
This is not a dramatic problem because it is easy to identify the false positives: it suffices to check whether there is an edge (x, w) or (w, y) with , and . Clearly, this can be achieved in linear total time for all superbubble candidates , providing a easy completion for the algorithm of Sung et al. [7]. Our alternative construction eliminates the need for this additional filtering step.
Lemma 12
The (weak) superbubbles in a digraph G(V, E) can be identified in time using Algorithm 1 provided the (weak) superbubbles in a DAG can be found in linear time.
Proof
The correctness of Algorithm 1 is an immediate consequence of the discussion above. Let us briefly consider its running time. The strongly connected components of G can be computed in linear, i.e., time [14, 17, 18]. The cycle-free part as well as its connected components [19] are also obtained in linear time. The construction of directed (to construct T) or undirected DFS search (to construct in a DAG) also require only linear time [14, 15], as does the classification of forward and backward edges. The construction of the auxiliary DAGs and H(C) and the determination of the root for the DFS searches is then also linear in time. Since the vertex sets considered in the auxiliary DAGs are disjoint in G, we conclude that the superbubbles can be identified in linear time in arbitrary digraph if the problem can be solved in linear time in a DAG.
The algorithm of Brankovic et al. [8] shows that this is indeed the case.
Corollary 7
The (weak) superbubbles in a digraph G(V, E) can be identified in time using Algorithm 1.
In the following section we give a somewhat different account of a linear time algorithm for superbubble finding that may be more straightforward than the approach in [8], which heavily relies on range queries. An example graph as the different auxiliary graphs are shown in Fig. 4.
Detecting superbubbles in a DAG
The identification of (weak) superbubbles is drastically simplified in DAGs since acyclicity, i.e., (S3), and thus (S.v), can be taken for granted. In particular, therefore, every weak superbubbloid is a superbubbloid. A key result of [8] is the fact that there are vertex orders for DAGs in which all superbubbles appear as intervals. The proof of Proposition 2 does not make use the minimality condition hence we can state the result here more generally for superbubbloids and arbitrary DFS trees on G:
Proposition 2
([8]) Let G(V, E) be a DAG and let be the reverse postorder of a DFS tree of G. Suppose is a superbubbloid in G. Then
-
i)
Every interior vertex u of satisfied
-
ii)
If then either or
The following two functions were also introduced in [8]:
1 |
We slightly modify the definition here to assign values also to the sink and source vertices of the DAG G. The functions return the predecessor and successor of v that is furthest away from v in terms of the DFS order . It is convenient to extend this definition to intervals by setting
2 |
A main result of this contribution is that superbubbles are characterized completely by these two functions, resulting in an alternative linear-time algorithm for recognizing superbubbles in DAGs that also admits a simple proof of correctness. To this end we will need a few simple properties of the and functions for intervals. First we observe that implies the inequalities
3 |
A key observation for our purposes is the following
Lemma 13
If then
-
i)
is the only successor of
-
ii)
is reachable from every vertex
-
iii)
every path from some to a vertex contains
Proof
-
(i)
By definition has at least one successor. On the other hand, all successor of after are by definition not later than j. Hence is uniquely defined.
-
(ii)
We proceed by induction w.r.t. the length of the interval . If , i.e., a single vertex, the assertion (ii) is obviously true. Now assume that the assertion is true for . By definition of , i has a successor in , from which is reachable.
-
(iii)
Again, we proceed by induction. The assertion holds trivially for single vertices. Assume that the assertion is true for . By definition of , every successor u of is contained in . By induction hypothesis, every path from u to a vertex contains , and also all path from to run through .
It is important to notice that Lemma 13 depends crucially on the fact that , by construction, is a reverse postorder of a DFS tree. It does not generalize to arbitrary topological sortings.
Replacing successor by predecessor in the proof of Lemma 13 we obtain
Lemma 14
If then
-
i)
is the only predecessor of
-
ii)
Every vertex is reachable from
-
iii)
Every path from to a vertex contains
Let us now return to the superbubbloids. We first need two simple properties of the and function for individual vertices:
Lemma 15
Let is a superbubbloid in a DAG G. Then
-
i)
v is an interior vertex of implies and .
-
ii)
and .
-
iii)
If then or and or
Proof
-
(i)
The matching property (S2) implies that for every successor x and predecessor y of an interior vertex v there is a path within the superbubble from s to x and from y to t, respectively. The statement now follows directly from the definition.
-
(ii)
The argument of (i) applies to the successors of s and the predecessors of t.
-
(iii)
Assume, for contradiction, that or . Then Proposition 2 implies that w has a predecessor or successor in the interior of the superbubble. But then has a successor (namely w) outside the superbubble, or has a predecessor (namely w) inside the superbubble. This contradicts the matching condition (S2).
Theorem 2
Let G be a DAG and let be the reverse postorder of a DFS tree on G. Then is a superbubbloid if and only if the following conditions are satisfied:
(predecessor property)
(successor property)
Proof
Suppose and satisfy (F1) and (F2). By (F1) and Lemma 13(ii) we known that t is reachable from every vertex in v with . Thus the reachability condition (S1) is satisfied. Lemma 13(iii) implies that any vertex w with or is reachable from v only through a path that runs through t. The topological sorting then implies that w with is not reachable from at all since w is not reachable from t. Hence . By (F2) and Lemma 14(ii) every vertex v with , i.e., is reachable from s. Lemma 14(ii) implies that v is reachable from a vertex w with or only through paths that contain s. The latter are not reachable at all since s is not reachable from w with in a DAG. Thus , i.e., the matching condition (S2) is satisfied.
Now suppose (S1) and (S2) holds. Lemma 15 implies that . Since some vertex must have s as its predecessor we have , i.e., (F1) holds. Analogously, Lemma 15 implies . Since there must be some that has t as its successor, we must have , i.e. (F2) holds.
We now proceed to showing that the possible superbubbloids and superbubbles can be found efficiently, i.e., in linear time using only the reserve postorder of the DFS tree and the corresponding functions and . As an immediate consequence of (F2) and Lemma 13, we have the following necessary condition for exits:
Corollary 8
The exit t of superbubbloid satisfies
We now use the minimality condition of Definition 2 to identify the superbubbles among the superbubbloids.
Lemma 16
If t is the exit of a superbubbloid, then there is also the exit of a superbubble whose entrance s is vertex with the largest value of such that (F1) and (F2) is satisfied.
Proof
Let be a superbubbloid. According to Definition 2, is a superbubble if there is no superbubbloid with , i.e., there is no vertex with such that (F1) and (F2) is satisfied.
Lemma 17
Suppose and Then there is no superbubbloid with entrance s and exit t.
Proof
Suppose is a superbubbloid. By construction, , contradicting (F2).
Corollary 9
If is a superbubble, then there is no superbubbloid with exit and entrance with
Proof
This is an immediate consequence of Lemma 5, which shows that the intersection would be a superbubbloid, contradicting minimality of .
Corollary 10
If and are two superbubbles with then either or
Thus superbubbles are either nested or placed next to each other, as already noted in [6]. Finally, we show that it is not too difficult to identify false exit candidates, i.e., vertices that satisfy the condition of Corollary 8 but have no matching entrance s.
Lemma 18
Let be a superbubble and suppose is an interior vertex of Then there is a vertex v with such that
Proof
Suppose, for contradiction, that no such vertex v exists. Since is superbubble by assumption, it follows that is correct and so (F1) satisfied for . After no such v exists also is correct and so (F2) is satisfied. Thus is superbubbloid. By Lemma 4 is also a superbubbloid, contradicting the assumption.
Taken together, these observations suggest to organize the search by scanning the vertex set for candidate exit vertices t in reverse order. For every such t, one would then search for the corresponding entrance s such that the pair s, t fulfills (F1) and (F2). Using eq.(3) one can test (F2) independently for each v by checking whether . Checking for (F1) requires that the interval is considered. The value of its function can be obtained incrementally as the minimum of and the interval of the previous step:
4 |
By Lemma 16, the nearest entrance s to the exit t completes the superbubble. The tricky part is to identify all superbubbles in a single scan. Lemma 17 ensures that no valid entrance can be found for exit if a vertex v with is encountered. In this case can be discarded. Lemma 18 ensures that a false exit candidate within a superbubble candidate cannot “mask” the entrance s belonging to t, i.e., there is necessarily a vertex v satisfying with .
It is natural therefore to use a stack to hold the exit candidates. Since the interval explicitly refers to an exit candidate t, it must be re-initialized whenever a superbubble is completed or the candidate exit is rejected. More precisely, the interval of the previous exit candidate t must be updated. This is achieved by computing
5 |
To this end, the value is associated with t when is pushed onto the stack. The values of intervals are not required for arbitrary intervals. Instead, we only need with consecutive exit candidates and t. Hence a single integer associated with each candidate t suffices. This integer initialized with and is then advanced as described above to .
Algorithm 2 presents this idea in a more formal way.
Lemma 19
Algorithm 2 identifies the superbubbles in a DAG G.
Proof
Every reported candidate satisfied (F1) since is used to identify the entrance for the current t. Since is checked for every , (F2) holds due to equ.(3) since by Lemma 13 this is equal to test the interval. Hence every reported candidate is a superbubbloid. By Lemma 16 is minimal and thus a superbubble. Lemma 18 ensures that the corresponding entrance is identified for every valid exit t, i.e., that all false candidate exits are rejected before the next valid entrance in encountered.
Lemma 20
The Algorithm 2 has time complexity
Proof
Given the reverse DFS postorder , the for loop processes every vertex exactly once. All computations except , , and the while loop take constant time. This includes explicit the calculation of the minimum of two integer values that are needed to update of the intervals. The values of and are obtained by iterating over the outgoing or incoming edges of v, respectively, hence the total effort is . Every iteration of the while loop removes one vertex from the stack . Since each vertex is pushed only at most once, the total effort for the while loop is O(|V|). The total running time therefore is .
Recalling the reverse DFS postorder can also be obtained in we have
Corollary 11
([8]) The superbubbles in a DAG can be identified in a linear time.
Some example DAGs together with the values of and are shown in Fig. 5.
Implementation
Algorithms 1 and 2 were implemented in Python and are available as Linear Superbubble Detector, LSD for short. LSD can be installed with pip.1 The source is available on GitHub.2 It is intended as a reference implementation emphasizing easy understanding rather than as a performance-optimized production tool. The underlying graph structures make use of NetworkX [20], which has the benefit that many input formats can be parsed easily.
To our knowlege, SUPBUB3 [8] is the only other publicly available implementation of a superbubble detector. Unfortunately, it has some bugs e.g., in the handling of successors in the DFS tree that leads to problems with superbubble with a backward edge. An analysis of the code shows, furthermore, that the construction of the auxiliary graphs strictly follows [7]. Hence it cannot serve as a reference implementation.
In order to compare our approach to the state of the art algorithm we re-implemented the workflow on Sung et al. [7] and Brankovic et al. [8] using the same python libraries. This allows a direct comparison that focusses on the algorithms rather than the differences between programming languages and compilers. The workflow can be subdivided into two separate tasks: (1) the construction of the DAGs, and (2) the recognition of superbubbles within the DAG. For the first task, we compare our approach and the algorithm of Sung et al. [7] augmented by a simple linear-time filter to detect the false positives. For the second part, we compare our stack-based approach with the range-query method of Brankovic et al. [8].
Table 1 summarized the empirical results for test data of different sizes taken from our recent work on supergenome coordinatization and the Stanford Large Network Dataset Collection [21]. Although the running times are comparable, we find that LSD consistently performs better than the alternative for both tasks. The combined improvement of LSD is a least a factor of 2 in the examples tested here. All results and methods are available in the git repository.4
Table 1.
Data | N | M | S | Running times [s] | |||
---|---|---|---|---|---|---|---|
LSD | S + LSD+ f | LSD + B | S + B + f | ||||
Yeast | 49,795 | 130,993 | 325 | 3 | 4 | 6 | 9 |
EU mail | 265,214 | 420,045 | 13285 | 15 | 16 | 31 | 34 |
Slashdot | 82,168 | 948,464 | 0 | 17 | 27 | 22 | 37 |
Amazon | 403,394 | 3,387,388 | 3 | 60 | 86 | 87 | 158 |
875,713 | 5,105,039 | 6477 | 94 | 127 | 144 | 254 | |
Wikipedia | 2,394,385 | 5,021,410 | 4737 | 147 | 171 | 385 | 418 |
The for combinations of algorithms compared here are: LSD (using the auxiliary graphs and the stack-based superbubble detector), S+LSD using Sung graphs with our stack-based detector plus a post-filter for the false positives, LSD+B using our graph construction with the range-query-based detector of [8], and S+B using the re-implementation of the state of the art method with the post-filter. All computations were performed on a 2.5GHz quad-core Intel Core i7 processor (Turbo Boost up to 3.7GHz) with 6MB shared L3 cache and 16GB of 1600MHz DDR3L onboard memory. Test data sets are taken from [4] and from the Stanford Large Network Dataset Collection [21]. The table lists their number N of vertices, M of edges and S of superbubbles
Conclusion
We have re-investigated the mathematical properties of superbubbles and their obvious generalization, the weak superbubbloids. We not only re-derive foundational results, in particular Propositions 1 and 2 in a more concise way, we also identified a problems with auxiliary graphs proposed in [7] that lead to false positive superbubbles. Although these are not a fatal problem and can be recognized in a post-processing step without affecting the overall time-complexity, we have shown here that the issue can be avoided by using a different, in fact simpler, auxiliary graph that is already acyclic. Capitalizing on the fact that the superbubbles in a DAG can be listed in linear time [8], we show that problem of listing all superbubbles in an arbitrary digraph can indeed be solved in linear time. For the DAG case we proposed a conceptually simpler replacement for the algorithm of [8] that also has linear running time. With LSD we provide a reference implementation in python.
The mathematical analysis of superbubbles suggests to consider generalizations that allow possibly restricted sets of cycles within the “bubble” but retain the idea of an induced subgraph that cannot be transversed without passing through the entrance the exit. For instance, one might relax (S.v) an require only that an interior vertex v cannot be reached from t without passing through s and cannot reach s without passing through t. The false positives generated by the approach of Sung et al. [7] may also be considered a the prototype of a broader class of superbubble-like structures. It does not seem obvious, however, to characterize them beyond being induced acyclic subgraphs with a single source and a single sink vertex. An related structure that also generalizes superbubbles are maximal connected convex acyclic induced subgraphs [22]. Here, the vertex U set has the property that no two vertices are connected by path that is not entirely contained in U.
Authors’ contributions
All authors contributed to the design of the study and the writing of the manuscript. FG and PFS derived the mathematical results, FG produced the references implementation. All authors read and approved the final manuscript.
Acknowledgements
PFS gratefully acknowledges the hospitality of the Instituto de Matemáticas, UNAM Juriquilla, Santiago de Querétaro, in June/July 2018. This work was funded by the German Federal Ministry of Education and Research within the project Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig (BMBF 01IS14014B).
The authors acknowledge support from the German Research Foundation (DFG) and Universität Leipzig within the program of Open Access Publishing.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Footnotes
Contributor Information
Fabian Gärtner, Email: fabian@bioinf.uni-leipzig.de.
Lydia Müller, Email: lydia@bioinf.uni-leipzig.de.
Peter F. Stadler, Email: studla@bioinf.uni-leipzig.de
References
- 1.De Bruijn NG. A combinatorial problem. Koninklijke Nederlandse Akademie v. Wetenschappen. 1946;49:758–764. [Google Scholar]
- 2.Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA. 2001;98(17):9748–9753. doi: 10.1073/pnas.171285098. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–829. doi: 10.1101/gr.074492.107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Gärtner F, Höner zu Siederdissen C, Müller L, Stadler PF. Coordinate systems for supergenomes. Algorithms Mol Biol. 2018;13:15. doi: 10.1186/s13015-018-0133-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Acuña V, Grossi R, Italiano GF, Lima L, Rizzi R, Sacomoto G, Sagot MF, Sinaimeri B. On bubble generators in directed graphs. In: Bodlaender HL, Woeginer GJ, editors. Graph-theoretic concepts in computer science, 43rd WG. Lecture notes in computer science. Heidelberg: Springer. 2017;10520:18–31. 10.1007/978-3-319-68705-6_2.
- 6.Onodera T, Sadakane K, Shibuya T. Detecting superbubbles in assembly graphs. In: Darling A, Stoye J, editors. International workshop on algorithms in bioinformatics. Berlin: Springer. 2013;8126:338–48. 10.1007/978-3-642-40453-5_26.
- 7.Sung W-K, Sadakane K, Shibuya T, Belorkar A, Pyrogova I. An -time algorithm for detecting superbubbles. IEEE ACM Trans Comput Biol Bioinform. 2015;12:770–777. doi: 10.1109/TCBB.2014.2385696. [DOI] [PubMed] [Google Scholar]
- 8.Brankovic L, Iliopoulos CS, Kundu R, Mohamed M, Pissis SP, Vayani F. Linear-time superbubble identification algorithm for genome assembly. Theor Comput Sci. 2016;609:374–383. doi: 10.1016/j.tcs.2015.10.021. [DOI] [Google Scholar]
- 9.Paten B, Novak AM, Garrison E, Hickey G. Superbubbles, ultrabubbles and cacti. In: International conference on research in computational molecular biology (RECOMB), Cham: Springer. 2017;10229:173–89. 10.1007/978-3-319-56970-3_11.
- 10.Rosen Y, Eizenga J, Paten B. Describing the local structure of sequence graphs. In: Figueiredo D, Martín-Vide C, Pratas D, Vega-Rodríguez MA, editors. Algorithms for computational biology—4th AlCoB. Lecture notes in computer science. Heidelberg: Springer. 2017;10252:24–46. 10.1007/978-3-319-58163-7_2.
- 11.Paten B, Eizenga JM, Rosen YM, Novak AM, Garrison E, Hickey G. Superbubbles, ultrabubbles, and cacti. J Comput Biol. 2018;25:649–663. doi: 10.1089/cmb.2017.0251. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Tankyevych O, Talbot H, Passat N. Semi-connections and hierarchies. In: Luengo Hendriks CL, Borgefors G, Strand R, editors. Mathematical morphology and its applications to signal and image processing. Lecture notes in computer science. Berlin: Springer. 2013;7883:159–70. 10.1007/978-3-642-38294-9_14.
- 13.Ronse C. Axiomatics for oriented connectivity. Pattern Recogn Lett. 2014;47:120–128. doi: 10.1016/j.patrec.2014.03.020. [DOI] [Google Scholar]
- 14.Tarjan RE. Depth-first search and linear graph algorithms. SIAM J Comput. 1972;1:146–160. doi: 10.1137/0201010. [DOI] [Google Scholar]
- 15.Tarjan RE. Edge-disjoint spanning trees and depth-first search. Acta Inform. 1976;6:171–185. doi: 10.1007/BF00268499. [DOI] [Google Scholar]
- 16.Kahn AB. Topological sorting of large networks. Commun ACM. 1962;5:558–562. doi: 10.1145/368996.369025. [DOI] [Google Scholar]
- 17.Nuutila E, Soisalon-Soininen E. On finding the strongly connected components in a directed graph. Inf Process Lett. 1994;49:9–14. doi: 10.1016/0020-0190(94)90047-7. [DOI] [Google Scholar]
- 18.Pearce DJ. A space-efficient algorithm for finding strongly connected components. Inf Process Lett. 2016;116:47–52. doi: 10.1016/j.ipl.2015.08.010. [DOI] [Google Scholar]
- 19.Hopcroft J, Tarjan R. Algorithm 447: efficient algorithms for graph manipulation. Commun ACM. 1973;16:372–378. doi: 10.1145/362248.362272. [DOI] [Google Scholar]
- 20.Hagberg AA, Schult DA, Swart P. Exploring network structure, dynamics, and function using NetworkX. In: Varoquaux G, Vaught T, Millman J, editors. Proceedings of the 7th python in science conference (SciPy 2008). Pasadena, CA; 2008. p. 11–6. http://conference.scipy.org/proceedings/SciPy2008/paper_2/
- 21.Leskovec J, Krevl A. SNAP datasets: stanford large network dataset collection; 2014. http://snap.stanford.edu/data
- 22.Balister P, Gerke S, Gutin G, Johnstone A, Reddington J, Scott E, Soleimanfallah A, Yeo A. Algorithms for generating convex sets in acyclic digraphs. J Discrete Algorithms. 2009;7:509–518. doi: 10.1016/j.jda.2008.07.008. [DOI] [Google Scholar]