PLOS One. 2020 Sep 18;15(9):e0227842. doi: 10.1371/journal.pone.0227842

Netcombin: An algorithm for constructing optimal phylogenetic network from rooted triplets

Hadi Poormohammadi 1,2,*, Mohsen Sardari Zarchi 1
Editor: Hocine Cherifi
PMCID: PMC7500971  PMID: 32947609

Abstract

Phylogenetic network construction is one of the most important challenges in phylogenetics. These networks can represent complex non-treelike events such as gene flow, horizontal gene transfer, recombination, or hybridization. Among phylogenetic networks, rooted structures are commonly used to represent the evolutionary history of a set of species explicitly. Triplets are a well-known input for constructing rooted networks. Obtaining an optimal rooted network that contains all given triplets is the main problem in network construction. The optimality criteria include minimizing the level or the number of reticulation nodes. This problem is known to be NP-hard. In this research, a new algorithm called Netcombin is introduced that constructs an approximately optimal network consistent with the input triplets. The innovation of this algorithm is based on binarization and expansion processes. The binarization process innovatively uses a measure to construct a binary rooted tree T consistent with approximately the maximum number of input triplets. T is then expanded using a heuristic function, adding a minimum number of edges to obtain a final network with approximately the minimum number of reticulation nodes. In order to evaluate the proposed algorithm, Netcombin is compared with four state-of-the-art algorithms: RPNCH, NCHB, TripNet, and SIMPLISTIC. The experimental results on simulated data obtained from biologically generated sequence data indicate that, considering the trade-off between speed and precision, Netcombin outperforms the others.

Introduction

Phylogenetics is a branch of bioinformatics that studies and models the evolutionary relations among a set of species or organisms (formally called taxa) [1, 2]. The tree is the basic model, and it can appropriately show the history of tree-like events such as mutation, insertion, and deletion [1-5]. The main disadvantage of the tree model is its inability to show non-treelike events (more abstractly, reticulate events) such as recombination, hybridization, and horizontal gene transfer [2, 6]. To overcome this weakness, phylogenetic networks were introduced to generalize phylogenetic trees and represent reticulate events [1, 2, 7-13].

The structures of trees and networks can be divided into two groups: rooted and unrooted. Rooted structures can show reticulate events explicitly; hence, they have recently received more attention for constructing networks. Rooted structures are either rooted trees or rooted networks. These structures contain a unique vertex, called the root, with in-degree 0 and out-degree at least 2 [1, 2, 7]. Fig 1a shows an example of a rooted tree.

Fig 1.


(a) A rooted tree in which the out-degree of the root is 3. (b) A rooted binary tree.

Usually, rooted structures are represented in binary form, i.e., the out-degree of each vertex is at most 2. In a rooted binary tree, the out-degree of every vertex except the leaves is 2, the out-degree of each leaf is 0, and the in-degree of every vertex except the root is 1 (Fig 1b). Formally, in rooted structures a common ancestor of the given set of taxa is considered the root [1, 2, 7].

One famous approach to building phylogenetic networks is to construct them from small trees or networks [1, 2]. A triplet is the smallest tree structure that shows the evolutionary relation among three taxa. The symbol ij|k denotes a triplet t with i and j on one side of t and k on the other side (Fig 2) [2, 7].

Fig 2. A triplet ij|k.


Triplets are commonly considered a standard input for building rooted structures [2, 7]. These small tree structures are usually obtained from biological sequence data using standard methods such as Maximum Likelihood (ML) and Maximum Parsimony (MP) [5, 6]. In some cases, the output of biological experiments is directly in the form of triplets [14]. Moreover, in some experiments, triplets are generated randomly to evaluate a model [5].

A network is called level-k if the maximum number of reticulation nodes (nodes with in-degree 2 and out-degree 1) in each of its biconnected components is k [15] (Fig 4b). The optimal network is defined based on two optimality criteria: minimizing the number of reticulation nodes or the level of the final network [2, 6, 7]. For a given set of triplets as input, the main challenge is to construct an optimal rooted structure (tree or network) that contains all triplets or, equivalently, with which all triplets are consistent [2, 6, 7, 12]. In other words, all input triplets have to be consistent with the output network, and either the level or the number of reticulation nodes of the output network is to be minimized. Formally, a triplet is consistent with a rooted structure when the triplet is a subgraph of that structure [2, 5-7, 12]. The preferred outcome is a rooted tree structure. However, as mentioned before, a tree cannot represent reticulate events, so when such events are present, network construction must be considered.

Fig 4.


(a) A rooted phylogenetic network. r is the root; t1, t2, …, t8 are tree nodes; r1, r2, r3 are reticulation nodes; and l1, l2, …, l7 are leaves. (b) The network of Fig 4a has two biconnected components: one contains r1 and the other contains r2 and r3. So the network is a level-2 network.

In order to build a rooted network from a set of triplets, several algorithms have been introduced recently [2, 6-8, 12, 13, 16]. The well-known algorithms are TripNet [7], SIMPLISTIC [6], NCHB [16], and RPNCH [8]. These algorithms find a semi-optimal rooted phylogenetic network consistent with a given set of triplets. Because they are heuristic, the result is not necessarily optimal; the resulting network is near-optimal, which is called semi-optimal. Formally, a rooted phylogenetic network N (network for short) is a connected directed acyclic graph (DAG) (Fig 3) in which each vertex belongs to one of the following four categories: (i) a unique root node with in-degree 0 and out-degree 2; (ii) tree nodes with in-degree 1 and out-degree 2; (iii) reticulation nodes with in-degree 2 and out-degree 1; (iv) leaves with in-degree 1 and out-degree 0 (Fig 4a). A network is called a network on X if the set of its leaves is X. For example, the network of Fig 4a is a network on X = {l1, l2, …, l7}.

Fig 3.


(a) A directed graph containing a cycle. (b) A directed acyclic graph (DAG) obtained from Fig 3a by removing the edge (j, d).

Generally, the problem of constructing an optimal rooted phylogenetic network consistent with a given set of triplets is known to be NP-hard [17, 18]. When the set of input triplets is dense, this problem can be solved in polynomial time [18]. A set of triplets τ is called dense if for each subset of three taxa there is at least one triplet on them in the input set [7, 18]. More precisely, a set of triplets τ on a given set of taxa X is called dense if for each subset of three taxa {i, j, k}, at least one of the triplets ij|k, ik|j, or jk|i belongs to τ [7, 18]. For example, for the set of taxa X = {a, b, c, d, e}, the set of triplets τ = {ab|c, ad|b, be|a, ac|d, ae|c, de|a, bd|c, bc|e, be|d, de|c} is dense.
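The density condition can be checked mechanically. The sketch below encodes a triplet ij|k as the tuple (i, j, k) (an encoding chosen for this sketch, not prescribed by the paper) and verifies the dense example above:

```python
from itertools import combinations

def is_dense(triplets):
    """A triplet set is dense if every 3-subset of the taxa is covered
    by at least one triplet (the triplet's own three taxa)."""
    taxa = sorted({x for t in triplets for x in t})
    covered = {frozenset(t) for t in triplets}
    return all(frozenset(s) in covered for s in combinations(taxa, 3))

# The dense example from the text on X = {a, b, c, d, e}
tau = [("a","b","c"), ("a","d","b"), ("b","e","a"), ("a","c","d"), ("a","e","c"),
       ("d","e","a"), ("b","d","c"), ("b","c","e"), ("b","e","d"), ("d","e","c")]
print(is_dense(tau))        # True
print(is_dense(tau[1:]))    # False: {a, b, c} is no longer covered
```

Note that the ten triplets above cover exactly the ten 3-subsets of X, so removing any one of them breaks density.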

As mentioned above, density is a critical constraint for constructing a rooted phylogenetic network that contains all given triplets. However, usually there is no constraint on the input triplets, and in most cases they are not dense. So introducing efficient heuristic methods for this problem is necessary. The ideal outcome is a rooted network with no reticulate events, i.e., a rooted tree. BUILD is an algorithm that obtains a tree structure from a given set of triplets if such a tree exists [19]. In fact, the BUILD algorithm decides in polynomial time whether there is a rooted phylogenetic tree that contains all given triplets and produces it if it exists. Fig 5 shows an example of the BUILD algorithm steps for the given τ = {cd|b, cd|a, cd|e, cd|f, ef|a, ef|b, ef|c, ef|d, db|a, db|e, db|f, da|e, da|f, cb|a, cb|e, cb|f, ab|e, ab|f, ac|e, ac|f}.

Fig 5. The Aho graph AG(τ) is defined based on τ.


The set of nodes is X = L(τ) = {a, b, c, d, e, f}, and two nodes i, j ∈ X are adjacent iff there is a node x ∈ X such that ij|x ∈ τ. Also, AG(τ|A), in which A ⊆ X, is defined in a similar way: the induced subgraph of AG(τ) on the node set A ⊆ X is considered, i.e., the set of nodes is A and i, j ∈ A are adjacent iff there is a node x ∈ A such that ij|x ∈ τ. (a) AG(τ). (b) Based on AG(τ) the resulting tree is obtained. (c) AG(τ|e, f). (d) AG(τ|a, b, c, d). (e) Based on AG(τ|e, f) the resulting tree is obtained. (f) Based on AG(τ|a, b, c, d) the resulting tree is obtained. (g) AG(τ|b, c, d). (h) Based on AG(τ|b, c, d) the resulting tree is obtained. (i) AG(τ|c, d). (j) Based on AG(τ|c, d) the resulting tree is obtained. (k) The final tree consistent with the given τ, obtained from the BUILD algorithm by reversing its steps.

In the tree construction process for a given τ, if BUILD stops, there is no tree structure for the given set of triplets. Fig 6 shows an example with the set of triplets τ = {bc|a, bd|a, cd|a, bc|d, cd|b}, on which the BUILD algorithm stops. In this case, the main goal is to construct a network structure as similar to a tree as possible. In other words, constructing a rooted phylogenetic network with the minimum number of reticulate events is the main challenge.
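The stopping behavior above can be reproduced with a compact sketch of BUILD (again encoding ij|k as the tuple (i, j, k); this is an illustrative sketch, not the paper's implementation). The Aho graph is recomputed on each recursion; if it stays connected on more than one taxon, no consistent tree exists:

```python
def aho_components(taxa, triplets):
    """Connected components of the Aho graph AG(tau|taxa): i, j adjacent
    iff some triplet ij|x has all of i, j, x inside `taxa`."""
    adj = {t: set() for t in taxa}
    for i, j, k in triplets:
        if i in adj and j in adj and k in adj:
            adj[i].add(j)
            adj[j].add(i)
    comps, seen = [], set()
    for s in taxa:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def build(taxa, triplets):
    """BUILD: a nested-tuple tree consistent with all triplets,
    or None if the Aho graph stays connected at some step."""
    if len(taxa) == 1:
        return next(iter(taxa))
    comps = aho_components(taxa, triplets)
    if len(comps) == 1:
        return None                      # AG connected: no tree exists
    subtrees = [build(c, triplets) for c in comps]
    return None if any(s is None for s in subtrees) else tuple(subtrees)

# tau = {bc|a, bd|a, cd|a, bc|d, cd|b} from Fig 6: BUILD stops
tau = [("b","c","a"), ("b","d","a"), ("c","d","a"), ("b","c","d"), ("c","d","b")]
print(build({"a", "b", "c", "d"}, tau))  # None: no consistent tree
```

On the Fig 6 triplets, the first split isolates {a}, but AG(τ|b, c, d) remains connected, matching panels (d) and (e) of the figure.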

Fig 6.


(a) The simplest possible network consistent with the given τ = {bc|a, bd|a, cd|a, bc|d, cd|b}. (b) AG(τ). (c) Based on AG(τ) the resulting tree is obtained. (d) AG(τ|b, c, d). The graph is still connected. (e) BUILD algorithm stops based on step d.

The simplest possible non-treelike structure is the level-1 rooted phylogenetic network, also known as a galled tree [20]. Fig 7 shows an example of a galled tree. If level-1 networks cannot represent all input triplets, more complex (higher-level) networks are considered to achieve consistency. LEV1ATHAN is a well-known algorithm for constructing level-1 networks [21]. In [6] an algorithm is introduced that produces at most a level-2 network (Fig 4).

Fig 7. A level-1 network (galled tree) with two reticulation nodes.


When more complex networks are needed (e.g., Fig 8), unrestricted algorithms such as NCHB, TripNet, RPNCH, and SIMPLISTIC are applicable; they try to construct a consistent network while optimizing the level or the number of reticulation nodes [6-8, 16]. Among these four algorithms, SIMPLISTIC is not exact and works only for dense sets of triplets, while the other three methods impose no constraint on the input triplets. This is one of SIMPLISTIC's disadvantages. Moreover, for complex networks SIMPLISTIC is very time consuming and cannot return an output in an appropriate time [7].

Fig 8. Two different level-3 networks with three reticulation nodes r1, r2, r3.


TripNet has three speed options: slow, normal, and fast. The slow option returns a network near an optimal network. The normal option works faster than the slow option, but its network is more complex. The slow and normal options return an output in an appropriate time for input triplets consistent with simple, low-level networks. However, these two options are not appropriate for large data, because as the number of taxa increases, the corresponding sets of triplets are consistent with high-level networks. The fast option usually outputs a network in an appropriate time, but its network is more complex than those of the other two options. This option is used when the slow and normal options cannot return a network in an appropriate time; it only tries to output some network and does not consider the optimality criteria. In summary, TripNet cannot return an optimal network in an appropriate time when the input data is large [7]. NCHB is an improvement of TripNet that tries to reduce the complexity of the TripNet networks, but like TripNet it cannot return an optimal network in an appropriate time for large data [16].

RPNCH is a fast method for constructing a network consistent with a given set of triplets, but its output is usually more complex, with respect to the two optimality criteria, than the SIMPLISTIC and TripNet networks. In other words, although RPNCH is fast, on average its networks are far from the optimality criteria [8].

Generally, none of the above four methods can return a near-optimal network consistent with a given set of input triplets in an appropriate time. So the focus of this paper is to introduce a new method called Netcombin (Network construction method based on binarization) for constructing a semi-optimal (near-optimal) network in an appropriate time without any constraint on the input triplets. Our innovation is based on the binarization and expansion processes. In the binarization process, nine measures are used innovatively to construct binary rooted trees consistent with the maximum number of input triplets. These measures are computed based on the structure of the tree and the relations among the input triplets. The expansion process, which converts the obtained binary tree into a consistent network, uses a heuristic procedure in which a minimum number of edges is added to obtain the final network with the minimum number of reticulation nodes.

The structure of this paper is as follows. Section 2 presents the basic notations and definitions. In Section 3, our proposed algorithm (Netcombin) is introduced and its time complexity is investigated. In Section 4, the new algorithm is compared with NCHB, TripNet, RPNCH, and SIMPLISTIC, and the results are presented. Finally, in Section 5, the experimental results are discussed.

Definitions and notations

In this section, the basic definitions used in the proposed algorithm are presented formally. From here on, a set of triplets and a network are denoted by τ and N, respectively.

A rooted phylogenetic tree (tree for short) on a given set of taxa X is a rooted directed tree that contains a unique node r (the root) with in-degree 0 and out-degree at least 2. In a tree, leaves have in-degree 1 and out-degree 0 and are distinctly labeled by X. Inner nodes, i.e., nodes other than the root and leaves, have in-degree 1 and out-degree at least 2 [2, 7]. Fig 1 shows an example of a tree on X = {a, b, c, d, e, f}.

The symbol LN denotes the set of all leaf labels of N. N is a network on X if LN = X. A triplet ij|k is consistent with N (or equivalently, N is consistent with ij|k) if {i, j, k} ⊆ LN and N contains two distinct nodes u and v and pairwise internally node-disjoint paths u → i, u → j, v → u, and v → k. For example, Fig 9 shows that the triplets ij|k and jk|i are consistent with the given network, but ik|j is not. A set of triplets τ is consistent with a network N (or equivalently, N is consistent with τ) if all triplets in τ are consistent with N. τ(N) denotes the set of all triplets consistent with N. Let L(τ) = ∪t∈τ Lt. τ is a set of triplets on X if L(τ) = X [7].

Fig 9.


(a) Triplet ij|k is consistent with the network. (b) Triplet jk|i is consistent with the network.

Binarization is a basic concept, defined as follows. Let T be a rooted tree and x a node with children x1, x2, …, xk, k ≥ 3. These k children are partitioned into two disjoint subsets Xl and Xr. Let Xl = {x′1, x′2, …, x′i} and Xr = {x′i+1, …, x′k}, in which x′1, x′2, …, x′k is an arbitrary relabeling of x1, x2, …, xk. If |Xl| > 1, then create a new node xl, remove the edges (x, x′1), (x, x′2), …, (x, x′i), and create the edges (x, xl) and (xl, x′1), (xl, x′2), …, (xl, x′i). Do the same if |Xr| > 1. Continue the process until the out-degree of every node except the leaves is 2. The new tree is called a binarization of T. Fig 10 shows an example of a non-binary tree and two samples of its binarizations; note that more binarizations exist, of which two are illustrated. If T2 is a binarization of T1, then τ(T1) ⊆ τ(T2) [7].
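One concrete binarization (among the many the definition allows) can be sketched on nested-tuple trees; the particular partition used below, merging the last two children repeatedly, is an arbitrary illustrative choice:

```python
def binarize(tree):
    """Return one binarization of a nested-tuple tree: any node with more
    than two children is resolved by repeatedly merging its last two
    children into a new internal node (one arbitrary partition choice)."""
    if isinstance(tree, str):            # a leaf
        return tree
    children = [binarize(c) for c in tree]
    while len(children) > 2:
        children = children[:-2] + [(children[-2], children[-1])]
    return tuple(children) if len(children) > 1 else children[0]

print(binarize(("a", "b", "c", "d")))    # ('a', ('b', ('c', 'd')))
```

A different partition at each step would yield a different binarization, but every one of them is consistent with at least the triplets of the original tree.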

Fig 10.


(a) A non-binary tree. (b,c) Two samples of binarizations of Fig 10a.

Gτ, the directed graph related to τ, is defined by V(Gτ) = {{i, j} : i, j ∈ L(τ), i ≠ j} ({i, j} is denoted by ij for short) and E(Gτ) = {(ij, ik) : ij|k ∈ τ} ∪ {(ij, jk) : ij|k ∈ τ} [7] (e.g., Figs 13b and 15b). The height function of a tree or a network is defined as follows. A function h from the set of all 2-element subsets of X to ℕ is called a height function on X [7]. Let T be a tree on X with root r, let cij be the lowest common ancestor of i, j ∈ X, and let lT denote the length of the longest directed path in T. For two arbitrary nodes x, y of T, dT(x, y) is the length of the path between x and y. For any i, j ∈ X, the height function of T, hT, is defined by hT(i, j) = lT − dT(r, cij); see Fig 11 for an example.

Fig 13. The HBUILD process for τ = {cd|b, cd|a, bd|a, bc|a}.


(a) First, a tree T is considered, and the set τ of all triplets consistent with T is obtained. (b) Gτ is obtained from τ. lGτ = 2 and h(a, b) = h(a, c) = h(a, d) = 3, since the out-degree of the nodes ab, ac, ad is zero. (c) The nodes with out-degree zero and their related edges are removed (removed edges are dashed). h(b, c) = h(b, d) = 2. (d) The nodes with out-degree zero and their related edges are removed. h(c, d) = 1. (e) The weighted complete graph (G, h) related to the obtained height function, in which the edge {i, j} has weight h(i, j). (f) Removing the edges with maximum weight from (G, h). (g) The tree obtained based on the graph of Fig 13f. (h) Removing the edges with maximum weight from the resulting graph on the node set {b, c, d}. (i) The tree obtained based on the graph of Fig 13h. (j) Removing the edges with maximum weight from the resulting graph on the node set {c, d}. (k) The tree obtained based on the graph of Fig 13j. (l) The final tree consistent with the given τ, obtained from the HBUILD algorithm by reversing its steps.

Fig 15. The basic tree structure construction process for a given τ = {bc|a, bd|a, cd|a, bc|d, cd|b}.


τ is not consistent with a tree, and HBUILD stops at some step. (a) τ is obtained from the given network, and the network is the optimal network consistent with τ. (b) Gτ related to τ. (c) Gτ is not a DAG and contains a cycle. The edge (cd, bc) is removed to obtain a DAG; for simplicity, the new DAG is again called Gτ. lGτ = 3. (d) The nodes ab, ac, ad with out-degree zero and their related edges are removed. h(a, b) = h(a, c) = h(a, d) = 4. (e) The node bd with out-degree zero and its related edges are removed. h(b, d) = 3. (f) The node cd with out-degree zero and its related edges are removed. h(c, d) = 2. Finally, the node bc with out-degree zero is removed. h(b, c) = 1. (g) The graph (G, h) based on the obtained hGτ related to Gτ. (h) The edges with maximum weight are removed from (G, h). The resulting graph is disconnected, so HBUILD continues. (i) The tree structure related to Fig 15h. (j) The edges with maximum weight are removed from the graph of Fig 15h. The resulting graph is still connected, so HBUILD stops. From here, three criteria are applied to disconnect the connected subgraphs. (k) To disconnect the connected component of Fig 15j, the process of removing the edges with maximum weight (the edges with weight 2) continues, and the resulting graph becomes disconnected. In the remaining steps, HBUILD is applied, and finally a tree structure is obtained by reversing the steps of the algorithm. (l) To disconnect the connected component of Fig 15j, Min-Cut is applied: the graph nodes are partitioned into the two parts {c, d} and {b}, and the edge cb is removed. The resulting graph becomes disconnected; in the remaining steps, HBUILD is applied, and finally a tree structure is obtained by reversing the steps of the algorithm. (m) To disconnect the connected component of Fig 15j, Max-Cut is applied: the graph nodes are partitioned into the two parts {c, b} and {d}, and the edge cd is removed. The resulting graph becomes disconnected; in the remaining steps, HBUILD is applied, and finally a tree structure is obtained by reversing the steps of the algorithm. (n) The tree structures that are finally obtained (three structures are obtained, two of which are the same). (o) The two obtained tree structures of Fig 15n are spanning tree structures of the network of Fig 15a; the algorithm finally obtains these two tree structures.

Fig 11. For the given T, lT = 3.


hT(a, d) = hT(a, e) = hT(b, d) = hT(b, e) = hT(c, d) = hT(c, e) = 3, hT(a, c) = hT(a, b) = hT(d, e) = 2, hT(b, c) = 1.
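These values can be reproduced from the definition hT(i, j) = lT − dT(r, cij). The nested-tuple tree below is a reconstruction consistent with the listed Fig 11 values and lT = 3 (an assumption of this sketch; the figure itself is not reproduced here):

```python
from itertools import combinations

# A reconstruction of the Fig 11 tree (assumption): cherry bc under one
# subtree with a, and cherry de as the other subtree.
T = (("a", ("b", "c")), ("d", "e"))

def leaf_paths(tree, path=()):
    """Map each leaf label to its root-to-leaf child-index path."""
    if isinstance(tree, str):
        return {tree: path}
    out = {}
    for idx, child in enumerate(tree):
        out.update(leaf_paths(child, path + (idx,)))
    return out

def height_function(tree):
    """hT(i, j) = lT - dT(r, c_ij): longest root path minus the lca depth."""
    paths = leaf_paths(tree)
    l_t = max(len(p) for p in paths.values())
    h = {}
    for i, j in combinations(sorted(paths), 2):
        p, q = paths[i], paths[j]
        # lca depth = length of the shared prefix of the two paths
        lca_depth = next((d for d, (a, b) in enumerate(zip(p, q)) if a != b),
                         min(len(p), len(q)))
        h[(i, j)] = l_t - lca_depth
    return h

h = height_function(T)
print(h[("a", "d")], h[("a", "b")], h[("b", "c")])   # 3 2 1
```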

Let Gτ be a DAG and lGτ the length of the longest directed path in Gτ. Assign lGτ + 1 to the nodes with out-degree 0 and remove them. Assign lGτ to the nodes with out-degree 0 in the resulting graph, and continue this procedure until all nodes are removed. Define hGτ(a, b), for a, b ∈ L(τ) and a ≠ b, as the value assigned to the node ab ∈ V(Gτ), and call it the height function related to Gτ [7] (for example, Fig 13a to 13d). If τ is consistent with a tree, then Gτ is a DAG and hGτ is well defined [7].
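The sink-peeling procedure above is straightforward to sketch. The code below builds Gτ from triplets encoded as (i, j, k) for ij|k (an assumption of this sketch, with pairs as frozensets) and reproduces the Fig 13 heights:

```python
def pair(x, y):
    return frozenset((x, y))

def g_tau(triplets):
    """G_tau: nodes are taxon pairs; each triplet ij|k contributes
    the directed edges (ij, ik) and (ij, jk)."""
    nodes, edges = set(), set()
    for i, j, k in triplets:
        ij, ik, jk = pair(i, j), pair(i, k), pair(j, k)
        nodes |= {ij, ik, jk}
        edges |= {(ij, ik), (ij, jk)}
    return nodes, edges

def height_from_dag(nodes, edges):
    """Peel the DAG from its sinks: out-degree-0 nodes get l_G + 1, are
    removed, and each later sink layer gets the next smaller value."""
    succ = {n: set() for n in nodes}
    for u, v in edges:
        succ[u].add(v)
    remaining, rounds = set(nodes), []
    while remaining:
        sinks = {n for n in remaining if not (succ[n] & remaining)}
        if not sinks:
            raise ValueError("G_tau is not a DAG")
        rounds.append(sinks)
        remaining -= sinks
    l_g = len(rounds) - 1            # length of the longest directed path
    return {n: l_g + 1 - d for d, layer in enumerate(rounds) for n in layer}

# tau = {cd|b, cd|a, bd|a, bc|a} from Fig 13
tau = [("c", "d", "b"), ("c", "d", "a"), ("b", "d", "a"), ("b", "c", "a")]
h = height_from_dag(*g_tau(tau))
print(h[pair("c", "d")], h[pair("b", "c")], h[pair("a", "b")])   # 1 2 3
```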

Let r be the root of a given network N and lN the length of the longest directed path in N. For each node a, let d(r, a) be the length of the longest directed path from r to a. For any two nodes a and b, a is an ancestor of b if there is a directed path from a to b; in this case, b is lower than a. For two nodes a, b ∈ LN, a node c is called a lowest common ancestor of a and b if c is a common ancestor of a and b and no common ancestor of a and b is lower than c. For a, b ∈ LN, a ≠ b, let Cab denote the set of all lowest common ancestors of a and b. For each a, b ∈ LN, define hN(a, b) = min{lN − d(r, c) : c ∈ Cab} and call it the height function of N [7]. For example, for the network of Fig 6a, lN = 4 and hN(a, b) = hN(a, c) = hN(a, d) = 4, hN(b, d) = 3, and hN(b, c) = hN(c, d) = 2.

A quartet is an unrooted binary tree with four leaves. The symbol ij|kl is used to show a quartet in which i, j and k, l are its two pairs. Each quartet contains a unique edge whose two endpoints are not leaves; this edge is called the inner edge of the quartet (see Fig 12) [7].

Fig 12. A quartet ij|kl with the leaves {i, j, k, l}.


Method

In order to build a network N consistent with a given set of triplets τ, a height function related to τ is defined [7]. The height function is a measure used to obtain a basic structure of the final network N [7]; this basic structure is a rooted tree. The height function enforces that the obtained rooted tree be consistent with approximately the maximum number of triplets of τ. In this research, Netcombin first assigns a height function h on L(τ) for the given τ. Then three (not necessarily binary) trees are constructed based on h. Next, nine binarizations of each constructed tree are obtained (i.e., 27 binary trees in total). Finally, 27 networks consistent with the given τ are obtained by adding edges to each of the 27 binary trees, and the optimal network is reported as the output, as follows [7]:

Assigning height Function

Let T be a tree with its unique height function hT and i, j ∈ LT. The triplet ij|k is consistent with T iff hT(i, j) < hT(i, k) or hT(i, j) < hT(j, k) [7]. Moreover, for a given network N and i, j, k ∈ LN with height function hN, if hN(i, j) < hN(i, k) or hN(i, j) < hN(j, k), then the triplet ij|k is consistent with N [7]. These two facts imply that the following Integer Program (IP), IP(τ, s), can be established for a given set of triplets τ with |L(τ)| = n [7].

Maximize ∑1≤i,j≤n h(i, j)
Subject to h(i, k) − h(i, j) > 0 for all ij|k ∈ τ
      h(j, k) − h(i, j) > 0 for all ij|k ∈ τ
      0 < h(i, j) ≤ s for all 1 ≤ i, j ≤ n.

The solution of the above IP provides a criterion for obtaining the basic tree structure. Ideally, the above IP has a feasible solution, i.e., a solution that satisfies all its constraints. If there is a tree consistent with a given τ, then the IP has a feasible solution, and the solution that maximizes the IP is the height function of a tree consistent with τ. More precisely, in this case hTτ is the unique optimal solution of IP(τ, lGτ + 1), in which Tτ is the unique tree constructed by BUILD [7]. If the set of triplets τ is consistent with a tree, HBUILD also gives the same tree. So in this case, by using HBUILD, the desired tree consistent with τ can be constructed in polynomial time based on the optimal solution [7]. Fig 13 shows an example of the HBUILD process for the given τ = {cd|b, cd|a, bd|a, bc|a}.
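The consistency criterion behind the IP constraints can be read off a height function directly. The sketch below uses the Fig 11 height values (stored under frozenset pair keys, an encoding chosen for this sketch):

```python
def consistent(h, i, j, k):
    """ij|k is consistent with the tree iff h(i,j) < h(i,k) or h(i,j) < h(j,k)."""
    def H(x, y):
        return h[frozenset((x, y))]
    return H(i, j) < H(i, k) or H(i, j) < H(j, k)

# The Fig 11 heights (lT = 3); the cherry bc has the smallest height.
h = {frozenset(p): v for p, v in [
    (("a", "b"), 2), (("a", "c"), 2), (("b", "c"), 1),
    (("a", "d"), 3), (("a", "e"), 3), (("b", "d"), 3), (("b", "e"), 3),
    (("c", "d"), 3), (("c", "e"), 3), (("d", "e"), 2)]}
print(consistent(h, "b", "c", "a"))   # True:  bc|a holds in the tree
print(consistent(h, "a", "b", "c"))   # False: ab|c conflicts with the cherry bc
```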

Generally, the above IP has a feasible solution iff the graph Gτ is a DAG, and in this case the minimum s that gives a feasible solution for IP(τ, s) is lGτ + 1 [7]. So for a given τ, the IP might have a feasible solution although there is no tree consistent with τ. In the worst case, there is neither a tree consistent with the given τ nor a feasible solution for the IP, i.e., the graph Gτ is not a DAG. To overcome this flaw, the goal is to remove the minimum number of edges from Gτ (the minimum number of constraints from the IP) so as to lose the minimum information. The problem of removing the minimum number of edges from a directed graph to obtain a DAG is known as the Minimum Feedback Arc Set problem, MFAS for short, and MFAS is NP-hard [22]. The heuristic method introduced in [16] is used to obtain a DAG from Gτ, as follows:

The nodes with in-degree zero cannot participate in any directed cycle, so these nodes are removed, and this process is continued in the remaining graph until there is no node with in-degree zero. Similarly, the process is performed for the nodes with out-degree zero in the remaining graph [16].
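This trimming step alone does not produce a DAG; it only strips away nodes that provably lie on no cycle, leaving the coloring heuristic described next to break the remaining cycles. A minimal sketch of the trimming:

```python
def trim_acyclic_nodes(nodes, edges):
    """Repeatedly delete nodes with in-degree 0 or out-degree 0 (with their
    incident edges): such nodes cannot lie on any directed cycle."""
    nodes, edges = set(nodes), set(edges)
    while True:
        heads = {v for _, v in edges}          # nodes with in-degree  > 0
        tails = {u for u, _ in edges}          # nodes with out-degree > 0
        removable = {n for n in nodes if n not in heads or n not in tails}
        if not removable:
            return nodes, edges
        nodes -= removable
        edges = {(u, v) for u, v in edges if u in nodes and v in nodes}

# a -> b -> c -> a is a cycle; c -> d -> e is an acyclic tail (toy example)
nodes = {"a", "b", "c", "d", "e"}
edges = {("a","b"), ("b","c"), ("c","a"), ("c","d"), ("d","e")}
print(trim_acyclic_nodes(nodes, edges))   # only the 3-cycle {a, b, c} survives
```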

In the resulting graph, which contains no node with in-degree zero or out-degree zero, the color white is first assigned to each node. Then for each node v ∈ V(G) the following is done. Suppose that the out-degree of v is m and vv1, vv2, …, vvm are the m directed edges with v as their tail. These m edges are removed from the resulting graph, and the colors of v and v1, v2, …, vm are converted to black. For each vi, 1 ≤ i ≤ m, every edge vi u whose head u is white is removed, and the color of each such u is then converted to black. The process of removing edges is continued until the color of every node is black. The value of v is defined as the number of edges remaining in the resulting graph. The edges remaining for the node with the minimum value are removed from G, and the resulting graph is a DAG. Fig 14 shows an example of this process [16].

Fig 14. The process of assigning black and white color to the nodes of a graph with no node with in-degree zero or out-degree zero.


(a) The first step of the node coloring, with node b as the starting point. (b-d) The next three steps of the node coloring. (e) A DAG is obtained from Fig 14a by removing the edges determined in the graph of Fig 14d.

For simplicity, the new graph, which is a DAG, is again called Gτ. Now the height function hGτ related to Gτ is the desired solution.

Obtaining tree

In the following, the goal is to obtain a tree structure from the obtained hGτ. In the initial step, HBUILD is applied to hGτ. The ideal situation is when HBUILD continues until a tree structure is obtained; however, HBUILD may stop at one of its subsequent steps. More precisely, let (G, h) be the weighted complete graph related to hGτ. The HBUILD algorithm removes the edges with maximum weight from (G, h). If removing the maximum-weight edges from each connected component disconnects it, this process continues iteratively until each connected component contains only one node. The basic tree structure is obtained by reversing this disconnection process in HBUILD (see Fig 13).

If, by removing the edges with maximum weight from a connected component C, the resulting graph C′ remains connected, then HBUILD halts. Hence, the goal is to disconnect the obtained connected component C′. In order to disconnect C′, similarly to RPNCH [8], three different processes can be performed as follows (see Fig 15):

  • I

    The process of removing the edges with maximum weight from C′ is continued until C′ becomes disconnected.

  • II

    The Min-Cut method is applied to C′. Min-Cut removes a set of edges with minimum total weight such that the resulting graph is split into two connected components [23].

  • III
    Let w be the maximum weight of all edges in C′. The new weights are computed from the current weights and w: for each edge with weight m, its new weight is assigned as
    mnew = w − m + 1.

Then the Min-Cut method is applied to the updated graph.

In this research, for each connected component the above three processes are applied, and then, using HBUILD, three possible tree structures are obtained. From here on, without loss of generality, the symbol Tint is used to denote the tree structure obtained from HBUILD or any of the three tree structures gained from the above processes.
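Process I is the simplest of the three to sketch (II and III additionally need a min-cut routine). On the connected component of Fig 15j, with the heights obtained earlier as edge weights, it reproduces the split of panel (k):

```python
def components(nodes, edges):
    """Connected components of an undirected graph given by its edges."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    comps, seen = [], set()
    for s in nodes:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def disconnect_by_max_weight(nodes, weighted_edges):
    """Process I: repeatedly delete every edge of the current maximum
    weight until the component falls apart."""
    edges = dict(weighted_edges)                  # {(u, v): weight}
    while edges and len(components(nodes, edges)) == 1:
        w_max = max(edges.values())
        edges = {e: w for e, w in edges.items() if w != w_max}
    return components(nodes, edges)

# The connected component of Fig 15j: nodes {b, c, d} with weights h(b,d)=3,
# h(c,d)=2, h(b,c)=1; deleting weights 3 then 2 separates {b, c} from {d}.
comps = disconnect_by_max_weight({"b", "c", "d"},
                                 {("b", "d"): 3, ("c", "d"): 2, ("b", "c"): 1})
print(sorted(sorted(c) for c in comps))   # [['b', 'c'], ['d']]
```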

Binarization

Let T be a rooted tree and τ(T) the set of triplets consistent with T. Also let Tbinary be a binarization of T and τ(Tbinary) the set of triplets consistent with Tbinary. Then τ(T) ⊆ τ(Tbinary). This means that binarization is an effective tool for making the tree structure consistent with more of the given triplets. To perform binarization on Tint, the following heuristic algorithm is proposed.

For a given set of triplets τ and Tint, a binary tree structure TintBin is demanded. Binarization can be performed simply with a random approach [7, 8]. To make binarization more efficient, a new heuristic algorithm is introduced innovatively in this research. This algorithm is based on three parameters, w, t, and p [16, 24].

Let τ be a set of triplets, Vi, Vj ⊆ L(τ), and Vi ∩ Vj = ∅. Let W(Vi, Vj) = {vi vj|v ∈ τ : vi ∈ Vi, vj ∈ Vj, v ∉ Vi ∪ Vj}, P(Vi, Vj) = {vi v|vj ∈ τ or vj v|vi ∈ τ : vi ∈ Vi, vj ∈ Vj, v ∉ Vi ∪ Vj}, and T(Vi, Vj) = {vi vj|v ∈ τ : vi ∈ Vi, vj ∈ Vj}. Also let w(Vi, Vj), p(Vi, Vj), and t(Vi, Vj) be the cardinalities of W(Vi, Vj), P(Vi, Vj), and T(Vi, Vj), respectively [16, 24]. Based on these three parameters (w, t, and p), nine different measures are defined [5]. The measures M = {m1, m2, …, m9} are defined as: m1 = t(Vi, Vj), m2 = w(Vi, Vj), m3 = (w − p)(Vi, Vj), m4 = (w/(w + p))(Vi, Vj), m5 = (w/t)(Vi, Vj), m6 = ((w − p)/(w + p))(Vi, Vj), m7 = ((w − p)/t)(Vi, Vj), m8 = (w − p + w/t)(Vi, Vj), m9 = ((w − p)/t + w/(w + p))(Vi, Vj). Using these measures, nine binary tree structures (TintBin) are built from Tint.
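The three counts can be computed directly from a triplet list. The sketch below (again encoding ij|k as the tuple (i, j, k), an assumption of this sketch) derives a few of the nine measures, guarding the divisions against empty denominators:

```python
def wpt(triplets, Vi, Vj):
    """w: # triplets vi vj | v with vi in Vi, vj in Vj, v outside both sets;
    p: # triplets vi v | vj (or vj v | vi) with v outside both sets;
    t: # triplets vi vj | v with vi in Vi, vj in Vj (v unrestricted)."""
    Vi, Vj = set(Vi), set(Vj)
    U = Vi | Vj
    w = p = t = 0
    for a, b, c in triplets:                         # the triplet ab|c
        if (a in Vi and b in Vj) or (a in Vj and b in Vi):
            t += 1                                   # cross cherry: T(Vi, Vj)
            if c not in U:
                w += 1                               # outsider third: W(Vi, Vj)
        elif c in U and len({a, b} & U) == 1:
            inside = a if a in U else b              # cherry member inside Vi or Vj
            if (inside in Vi and c in Vj) or (inside in Vj and c in Vi):
                p += 1                               # P(Vi, Vj)
    return w, p, t

tau = [("a", "b", "c"), ("a", "c", "b"), ("b", "c", "a")]
w, p, t = wpt(tau, {"a"}, {"b"})
print(w, p, t)                        # 1 2 1
m2, m3 = w, w - p                     # m2 = w, m3 = w - p
m4 = w / (w + p) if w + p else 0.0    # m4, division guarded
m5 = w / t if t else 0.0              # m5, division guarded
```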

The binarization process is performed as follow:

Binarization Pseudocode

1: Input: Tint

2:  TintBin = Tint

3:  If TintBin is binary

4:   Do nothing

5:  else

6:   for each vertex v of TintBin with children c1, c2, …, cn, n > 2

7:    Initialize a set C with {c1, c2, …, cn}.

8:    while |C| > 1 do

9:     Find and remove two vertices ci, cj ∈ C with the maximum measure value.

10:      Merge ci and cj to obtain cnew0.

11:      Generate 6 new structures using SPR, with roots cnew1, cnew2, …, cnew6.

12:      Among the 6 + 1 structures, select the most consistent structure and add its root to C.

13:      Update TintBin with respect to the selected structure.

14:     end while

15:    end for

16:   Output: TintBin

The binarization process is performed using the nine defined measures and Subtree-Pruning-Regrafting (SPR) [25, 26]. SPR is a method for tree topology search [25]; in the binarization process, SPR helps to obtain from Tint a tree that is more consistent with the input triplets. If Tint is binary, there is nothing to do; otherwise, there is at least one vertex v in Tint with children c1, c2, …, cn, n > 2. In this case, the goal is to replace this part of the tree with a binary structure (a binary subtree). For this purpose, in the first step there are n sets, each containing one ci, 1 ≤ in. Then, iteratively in each step, the two sets with the maximum measure value (according to one of the nine defined measures) are selected. Let ci and cj, 1 ≤ i, jn, be two nodes with the maximum measure value. By merging ci and cj, a new vertex cnew is created (see Fig 16). Here, SPR is used to improve the consistency of the merge.

Fig 16.

Fig 16

(a) A non-binary tree. The node v has four children c1, c2, c3, c4. (b) The nodes ci, cj, ck, cl are an arbitrary relabeling of the nodes c1, c2, c3, c4; first, two nodes are merged. (c) Second, two structures can be obtained from the structure of Fig 16b. (d) Finally, for each structure of Fig 16c, a binary structure is obtained. These binary structures replace the non-binary part of the tree of Fig 16a.
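The greedy merging loop (lines 8–13 of the pseudocode) can be sketched as follows. This is an illustration of the idea only, not the authors' implementation: it uses w as the measure and omits the SPR rearrangement step.

```python
from itertools import combinations

def count_w(triplets, Vi, Vj):
    """w(Vi, Vj): triplets xy|z with the cherry split across Vi and Vj
    and the outgroup z outside both sets."""
    return sum(1 for x, y, z in triplets
               if ((x in Vi and y in Vj) or (x in Vj and y in Vi))
               and z not in Vi | Vj)

def binarize_children(subtrees, triplets):
    """Greedily refine a multifurcating node into a binary subtree.

    subtrees: list of (nested-tuple tree, frozenset of its leaves).
    Repeatedly merges the pair of clusters with maximum w; the SPR
    rearrangement of each merge (pseudocode lines 11-12) is omitted.
    """
    C = list(subtrees)
    while len(C) > 1:
        # pick the pair of clusters with the maximum measure value
        i, j = max(combinations(range(len(C)), 2),
                   key=lambda ij: count_w(triplets, C[ij[0]][1], C[ij[1]][1]))
        (ti, Li), (tj, Lj) = C[i], C[j]
        C = [C[k] for k in range(len(C)) if k not in (i, j)]
        C.append(((ti, tj), Li | Lj))  # merged cluster replaces the pair
    return C[0][0]
```

For a node with children a, b, c, d and triplets {ab|c, ab|d, cd|a}, the loop first merges a with b (w = 2) and then c with d (w = 1), yielding the binary subtree ((a, b), (c, d)).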

Suppose that clk and crk are the roots of the left and right subtrees of ck, k ∈ {i, j}. The idea behind SPR is to replace subtrees in order to obtain a new binary tree structure with higher consistency. In this work, the potential replacements are defined in six different ways (see Fig 17): i) ciclj, ii) cicrj, iii) cjcli, iv) cjcri, v) cliclj, vi) clicrj. By applying these SPRs, six new structures are obtained. Among these six structures and the structure without replacement, the structure consistent with the most input triplets is selected.

Fig 17. SPR is used to obtain six different structures from a given tree structure.

Fig 17

(a) The structure that is obtained by merging ci, cj and connecting them to a new node cnew. (b to g) Six different tree structures that are obtained from Fig 17a and by using SPR with replacing ciclj, cicrj, cjcli, cjcri, cliclj, clicrj, respectively.
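Reading Fig 17 literally, the six rearrangements can be enumerated on nested-tuple subtrees. The sketch below is our interpretation of the figure (not the authors' code) and assumes both ci and cj are internal nodes with two children each:

```python
def leaves(t):
    """Flatten a nested-tuple tree to its list of leaves."""
    return [t] if not isinstance(t, tuple) else [x for c in t for x in leaves(c)]

def spr_candidates(ci, cj):
    """The plain merge (ci, cj) plus the six SPR swaps of Fig 17.
    Assumes ci = (cli, cri) and cj = (clj, crj)."""
    (cli, cri), (clj, crj) = ci, cj
    return [
        (ci, cj),                   # (a) no replacement
        (clj, (ci, crj)),           # (b) ci <-> clj
        (crj, (clj, ci)),           # (c) ci <-> crj
        ((cj, cri), cli),           # (d) cj <-> cli
        ((cli, cj), cri),           # (e) cj <-> cri
        ((clj, cri), (cli, crj)),   # (f) cli <-> clj
        ((crj, cri), (clj, cli)),   # (g) cli <-> crj
    ]
```

Each of the 6+1 candidates is a tree on the same leaf set; the one consistent with the most input triplets is kept, as in line 12 of the pseudocode.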

Network construction

Let τ′ ⊆ τ be the set of triplets that are not consistent with TintBin. Here, the goal is to add edges to TintBin in order to construct a network consistent with τ′. In the network construction process, edges are added incrementally to obtain the final network consistent with τ. To add edges, we use a heuristic criterion to select them rather than a random selection. The heuristic criterion depends on the current non-consistent triplets in τ′ and the current network structure. To this end, a value is assigned to each pair of edges of the current network structure. To compute the value of a pair {e, f} of edges, a new edge is added by connecting e and f via two new nodes ne and nf (see Fig 18). The value is the number of triplets in τ′ that are consistent with the new network structure. After each edge addition, the set of triplets τ′ is updated by removing the newly consistent triplets.

Fig 18.

Fig 18

(a) TintBin for τ′ = {ij|k, ij|l, ij|m, lm|k, lm|j, lm|i, mk|i, mk|j, lk|i, lk|j} and τ = τ′∪{lk|m}. (b) Two edges e and f are selected to obtain a network consistent with τ. (c) Final network consistent with τ.
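The edge-pair scoring can be sketched by brute force on small instances. Here a triplet is taken to be consistent with a network if it is consistent with some tree displayed by the network (one parent kept per reticulation node) — a common definition, used as an assumption; this is an illustrative reimplementation, not the authors' code, and cycle checks are omitted. The node names 'ne' and 'nf' are the two subdivision nodes of Fig 18:

```python
from itertools import product, combinations

def displayed_trees(edges):
    """Edge sets of the trees displayed by the network: keep one
    incoming edge for every reticulation (in-degree >= 2) node."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    retics = [v for v, ps in parents.items() if len(ps) > 1]
    for choice in product(*(parents[r] for r in retics)):
        keep = dict(zip(retics, choice))
        yield [(u, v) for u, v in edges if v not in keep or keep[v] == u]

def descendants(edges, root):
    """All nodes reachable from root along directed edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    seen, stack = {root}, [root]
    while stack:
        x = stack.pop()
        for y in adj.get(x, []):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def tree_consistent(edges, trip):
    """ab|c holds in a rooted tree iff some node has a and b,
    but not c, among its descendants."""
    a, b, c = trip
    nodes = {u for e in edges for u in e}
    return any(a in d and b in d and c not in d
               for d in (descendants(edges, v) for v in nodes))

def net_consistent(edges, trip):
    return any(tree_consistent(t, trip) for t in displayed_trees(edges))

def connect(edges, e, f):
    """Subdivide e with node 'ne' and f with node 'nf', then add the
    reticulation edge ne -> nf (as in Fig 18b)."""
    (a, b), (c, d) = e, f
    rest = [x for x in edges if x not in (e, f)]
    return rest + [(a, 'ne'), ('ne', b), (c, 'nf'), ('nf', d), ('ne', 'nf')]

def best_edge_pair(edges, bad_triplets):
    """Pick the ordered edge pair whose connection makes the most
    currently non-consistent triplets consistent."""
    def score(ef):
        return sum(net_consistent(connect(edges, *ef), t) for t in bad_triplets)
    return max(((e, f) for e in edges for f in edges if e != f), key=score)
```

For the tree ((a, b), c) and the non-consistent triplet ac|b, connecting the pendant edge of a to the pendant edge of c yields a network displaying ac|b, and the scoring loop finds such a pair.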

Time complexity

In this section, we investigate the time complexity of Netcombin. For the input triplets τ, let |L(τ)| = n and |τ| = m. First, Gτ should be computed, which takes O(m). Then, if Gτ is not a DAG, the heuristic algorithm is applied to make it a DAG. For each node, the computation of Value is performed in O(m); therefore, for all nodes this takes m × O(m) = O(m2).

To obtain a tree in method I, each step of removing the edges with maximum weight is done for each connected component in O(m). Also, in each step the number of connected components should be compared with that of the previous step; thus, the Depth-First Search (DFS) algorithm is performed in O(n). The overall runtime of one step is O(mn), and since there are n nodes, the total runtime is O(mn2).

For method II, each step takes O(mn) to remove the edges with maximum weight; then Min-Cut is performed in O(mn + n2 logn). The overall runtime of one step is O(mn + n2 logn), and since there are n nodes, the total runtime is O(mn2 + n3 logn).

The runtime of method III is the same as that of method II. Thus, obtaining the tree Tint takes O(mn2 + n3 logn) + O(m2).

The binarization process to obtain each TintBin is computed in O(mn3) [4, 24]. The time complexity of obtaining all 27 binary trees TintBin is 27 × O(mn3), which is O(mn3).

Finally, the network construction runtime is as follows. The number of edges of TintBin is O(n), and at most O(m) edges are added to obtain the final network, so the number of edges of the final network is O(m + n). Checking the consistency (with the remaining triplets) of the new network obtained from the previous network by adding one edge is done in O((m + n)2). Since there are O(m + n) edges, in each step the runtime of adding a new edge is O((m + n)3). This process is done at most m times, so the total runtime of this step is O(m(m + n)3).

Finally the Netcombin runtime is O(mn2 + n3 logn) + O(m2) + O(mn3) + O(m(m + n)3) ∈ O(m(m + n)3 + n3 logn).

Experiments

RPNCH, NCHB, SIMPLISTIC, and TripNet are well-known algorithms for constructing phylogenetic networks from given triplets. The SIMPLISTIC algorithm only works for dense triplet sets [6], while there are no constraints on the NCHB, TripNet, and RPNCH inputs [7, 8, 16]. In order to evaluate the performance of Netcombin, the following scenario is designed.

Data generation

There are two standard approaches to generating triplet data. The first, and simplest, is to generate triplets randomly. The second is to obtain triplets from sequence data; biological sequences can be obtained from real species or from simulation software that generates such sequences under biological assumptions. In this research, the second approach is used with a simulation software. There are standard methods for converting sequences into triplets; Maximum Likelihood (ML) is the well-known method for constructing trees from sequence data [5, 6]. Here, TREEVOLVE, a software package for generating biological sequences, is used [27]. TREEVOLVE has several parameters that can be adjusted manually. In this research we set the number of samples, the number of sequences, and the length of sequences; for the other parameters, the default values are used. The number of sequences (number of leaf labels) is set to 10, 20, 30, and 40, and the length of sequences is set to 100, 200, 300, and 400. For each case, the number of samples is 10, so in total 160 different sets of sequences are generated. Then the PhyML software, which works based on the Maximum Likelihood (ML) criterion, is used. For each set of sequences, all subsets of three sequences are considered and each is assigned an outgroup. Each subset of three sequences plus its outgroup is given as input to PhyML, whose output is a quartet. Finally, by removing the outgroup from each quartet, the set of triplets is obtained. Each triplet whose corresponding quartet has a unique inner edge of weight zero is removed, because such triplets carry no information (they are stars). This procedure may produce non-dense sets of triplets. Since SIMPLISTIC is used for comparison and its input must be dense, each non-dense set is converted to a dense set by adding a random triplet corresponding to each star, and this dense set is used as input.
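The outgroup-removal step can be illustrated as follows. This is a sketch of the conversion only (the function name is ours), with each quartet encoded by its nontrivial split:

```python
def quartet_to_triplet(quartet, outgroup):
    """quartet ((p, q), (r, s)) encodes the unrooted split pq|rs.

    Rooting the quartet on the outgroup and then deleting the outgroup
    leaf yields a rooted triplet: the cherry on the far side of the
    inner edge survives, and the outgroup's neighbor becomes the
    triplet's outgroup. Star quartets (inner edge of weight zero) are
    assumed to have been filtered out beforehand, as in the paper.
    """
    (p, q), (r, s) = quartet
    if outgroup in (p, q):               # put the outgroup on the (r, s) side
        (p, q), (r, s) = (r, s), (p, q)
    z = r if s == outgroup else s        # the non-outgroup leaf next to it
    return (p, q, z)                     # encodes pq|z
```

For instance, the quartet ab|co with outgroup o yields the rooted triplet ab|c regardless of which side of the split the outgroup is written on.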

Experimental results

In order to show the performance of Netcombin, we compare it with TripNet, SIMPLISTIC, NCHB, and RPNCH on the data generated in the previous subsection. Since SIMPLISTIC cannot return a network for large inputs in a reasonable time, a time limit of 6 hours is imposed. Let Nfinite be the set of networks for which the running time of the method is at most 6 hours, and let Ssequence denote the number of sequences, Ssequence ∈ {10, 20, 30, 40}. The output of TripNet, SIMPLISTIC, NCHB, and RPNCH is a unique network, while Netcombin outputs 27 networks, of which the best is reported. Since the construction of these 27 Netcombin networks is independent, we run Netcombin in parallel to obtain the 27 networks simultaneously. For the implementation, we used a PC with a Core i7 CPU and ran the algorithm on its cores in parallel.

The results of comparing these methods on the two optimality criteria and the running time are given in Tables 1 to 4.

Table 1. The number of Netcombin, TripNet, NCHB, SIMPLISTIC, and RPNCH networks that belong to Nfinite.

Number of sequences (ssequence) 10 20 30 40
Number of the Netcombin networks ∈ Nfinite 40 40 40 40
Number of the TripNet networks ∈ Nfinite 40 40 40 40
Number of the NCHB networks ∈ Nfinite 40 40 40 40
Number of the SIMPLISTIC networks ∈ Nfinite 40 38 13 0
Number of the RPNCH networks ∈ Nfinite 40 40 40 40

Table 4. The average level results for the networks that belong to Nfinite for Netcombin, TripNet, NCHB, SIMPLISTIC, and RPNCH.

Number of sequences (ssequence) 10 20 30 40
Netcombin avg level for networks ∈ Nfinite 2 3 7.4 15
TripNet avg level for networks ∈ Nfinite 0.9 2.3 6.9 16
NCHB avg level for networks ∈ Nfinite 0.7 1.5 6 15
SIMPLISTIC avg level for networks ∈ Nfinite 2.05 4.2 6.95 -
RPNCH avg level for networks ∈ Nfinite 2.8 6.4 10.5 19

Tables 1 and 2 report the number of networks that belong to Nfinite and the average running time for those networks. When the number of taxa is 10, all methods on average return an output in at most 2 seconds. When the number of taxa is 20, SIMPLISTIC fails to return a network within 6 hours in 5% of the cases; for the remaining 95% of the cases, it gives an output in 310 seconds on average, while the other four methods construct a network in less than 4 seconds on average. When the number of taxa is 30, SIMPLISTIC returns a network in 32.5% of the cases, taking 2600 seconds on average, and fails to return a network within 6 hours in the remaining 67.5% of the cases. For these data, Netcombin and RPNCH on average output a network in at most 15 seconds, while NCHB and TripNet take 203 and 210 seconds on average, respectively. When the number of input taxa is 40, SIMPLISTIC never returns an output within the 6-hour limit; Netcombin and RPNCH on average output a network in at most 44 seconds, while NCHB and TripNet take at least 740 seconds.

Table 2. The average running time results for the networks that belong to Nfinite for Netcombin, TripNet, NCHB, SIMPLISTIC, and RPNCH.

Number of sequences (ssequence) 10 20 30 40
Netcombin avg running time for networks ∈ Nfinite (Sec) 2 4 15 44
TripNet avg running time for networks ∈ Nfinite (Sec) 1 1.7 210 740
NCHB avg running time for networks ∈ Nfinite (Sec) 1 1.8 203 745
SIMPLISTIC avg running time for networks ∈ Nfinite (Sec) 1 310 2600 -
RPNCH avg running time for networks ∈ Nfinite (Sec) 1 2 10 30

Tables 3 and 4 report the results for the two optimality criteria, i.e., the number of reticulation nodes and the level, for the networks in Nfinite. When the number of taxa is 10, the average number of reticulation nodes for the TripNet and NCHB networks is at most 0.9, while for Netcombin, RPNCH, and SIMPLISTIC it is between 2 and 3. For these data, the average level of the NCHB and TripNet networks is at most 0.9, while the average level of the Netcombin, SIMPLISTIC, and RPNCH networks is between 2 and 2.8. When the number of input taxa is 20, the average numbers of reticulation nodes for TripNet and NCHB are 2.6 and 1.8, respectively; for Netcombin it is 4, while for SIMPLISTIC and RPNCH it is 6.95 and 9, respectively. The average levels of the NCHB and TripNet networks are 1.5 and 2.3, respectively; for Netcombin it is 3, while for SIMPLISTIC and RPNCH it is 4.2 and 6.4, respectively. When the number of taxa is 30, the average numbers of reticulation nodes for NCHB, TripNet, and Netcombin lie between 7.2 and 9, while for SIMPLISTIC and RPNCH they are 11.275 and 13, respectively. The average level of the NCHB, TripNet, Netcombin, and SIMPLISTIC networks lies between 6 and 7.4, while the average level of the RPNCH networks is 10.5. When the number of taxa is 40, the average numbers of reticulation nodes for NCHB, Netcombin, and TripNet are 15.2, 15.5, and 16.3, respectively, while the RPNCH networks contain 20 reticulation nodes on average. The average levels of the Netcombin, NCHB, and TripNet networks are 15, 15, and 16, respectively, while the average level of the RPNCH networks is 19.

Table 3. The average number of reticulation nodes (rets for short in table) results for the networks that belong to Nfinite for Netcombin, TripNet, NCHB, SIMPLISTIC, and RPNCH.

Number of sequences (ssequence) 10 20 30 40
Netcombin avg number of rets for networks ∈ Nfinite 2 4 9 15.5
TripNet avg number of rets for networks ∈ Nfinite 0.9 2.6 8 16.3
NCHB avg number of rets for networks ∈ Nfinite 0.7 1.8 7.2 15.2
SIMPLISTIC avg number of rets for networks ∈ Nfinite 2.325 6.95 11.275 -
RPNCH avg number of rets for networks ∈ Nfinite 3 9 13 20

Discussion

In this paper we investigated the problem of constructing an optimal network consistent with a given set of triplets, where minimizing the level and minimizing the number of reticulation nodes are the two optimality criteria. This problem is known to be NP-hard [17, 18]. Analyzing existing research, the solutions for constructing networks from triplets can be divided into two approaches. In the first approach, the reticulation nodes are identified and removed from the set of taxa, a tree structure is obtained for the remaining taxa, and the network consistent with all given triplets is finally obtained by adding the reticulation nodes to the tree structure. In the second approach, a tree structure is obtained first, and the final network consistent with all triplets is then obtained by adding new edges to it. SIMPLISTIC [6], TripNet [7], and NCHB [16] belong to the first approach, and RPNCH [8] belongs to the second. To the best of our knowledge, all research on this problem falls into one of these two approaches, so recent papers try to improve them gradually; each improvement is valuable because it can reduce time and costs effectively. In this paper we introduced Netcombin, a method for producing an optimal network consistent with a given set of triplets. To show its performance, we compared it with NCHB, TripNet, SIMPLISTIC, and RPNCH on the 160 different sets of triplets generated by the process introduced in subsection 4-1.

The results show that although RPNCH is on average the fastest method, the level and the number of reticulation nodes of its networks are the highest. Moreover, the average differences between the Netcombin, NCHB, and TripNet results and the RPNCH results for the two optimality criteria are significant.

The results show that SIMPLISTIC is on average appropriate for small inputs. However, as the number of taxa grows, it cannot return a network in a reasonable time, and its running time is the highest. Also, in all cases the average SIMPLISTIC number of reticulation nodes and level are better only than those of RPNCH. Note that SIMPLISTIC only works for dense sets of input triplets. The results show that the running time of SIMPLISTIC increases exponentially with the number of taxa: when the number of taxa is 40, it does not return any network within 6 hours, while the other four methods output a network in at most 745 seconds.

Also, the results show that the NCHB and TripNet running times are nearly the same on average, but the two optimality criteria are on average better for the NCHB results than for TripNet. Note that the differences between the TripNet and NCHB results for the optimality criteria are not significant.

The results show that TripNet and NCHB are appropriate for small inputs, where their results for the optimality criteria and the running time are on average the best. However, as the number of taxa increases, their running times grow significantly compared to Netcombin, while the two optimality criteria for their networks remain nearly the same as for the Netcombin networks.

Overall, considering the running time, the level, and the number of reticulation nodes of the final networks, the results show that Netcombin is a valuable method that returns a reasonable network in an appropriate time.

Supporting information

S1 File

(ZIP)

S2 File

(ZIP)

S3 File

(ZIP)

S4 File

(ZIP)

Acknowledgments

The first author would like to thank the Institute for Research in Fundamental Sciences (IPM), Tehran, Iran. The authors declare no conflict of interest.

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

This research was partially supported by a grant (number BS-1396-01-06) from the Institute for Research in Fundamental Sciences (IPM), Tehran, Iran. No additional external funding was received for this study.

References

  • 1. Felsenstein J. Inferring phylogenies. Sinauer associates; Sunderland, MA; 2004. [Google Scholar]
  • 2. Huson DH, Rupp R, Scornavacca C. Phylogenetic networks: concepts, algorithms and applications. Cambridge University Press; 2010. [Google Scholar]
  • 3. Poormohammadi H. A New Heuristic Algorithm for MRTC Problem. Journal of Emerging Trends in Computing and Information Sciences. 2012;3(7). [Google Scholar]
  • 4. Jahangiri S, Hashemi SN, Poormohammadi H. New heuristics for rooted triplet consistency. Algorithms. 2013;6(3):396–406. 10.3390/a6030396 [DOI] [Google Scholar]
  • 5. Poormohammadi H, Sardari Zarchi M. CBTH: a new algorithm for MRTC problem. Iranian Journal of Biotechnology (IJB). 2020, Accepted. [Google Scholar]
  • 6. Van Iersel L, Kelk S. Constructing the simplest possible phylogenetic network from triplets. Algorithmica. 2011;60(2):207–235. 10.1007/s00453-009-9333-0 [DOI] [Google Scholar]
  • 7. Poormohammadi H, Eslahchi C, Tusserkani R. TripNet: a method for constructing rooted phylogenetic networks from rooted triplets. PloS One. 2014;6(6):e106531 10.1371/journal.pone.0106531 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Reyhani MH, Poormohammadi H. RPNCH: A method for constructing rooted phylogenetic networks from rooted triplets based on height function. Journal of Paramedical Sciences. 2017;8(4):14–20. [Google Scholar]
  • 9.Linder CR, Moret BM, Nakhleh L, Warnow T. Network (reticulate) evolution: biology, models, and algorithms. The Ninth Pacific Symposium on Biocomputing (PSB). 2004.
  • 10. Bordewich M, Semple C. Computing the minimum number of hybridization events for a consistent evolutionary history. Discrete Applied Mathematics. 2007;155(8):914–928. 10.1016/j.dam.2006.08.008 [DOI] [Google Scholar]
  • 11. Gambette P, Huber KT, Kelk S. On the challenge of reconstructing level-1 phylogenetic networks from triplets and clusters. Journal of mathematical biology. 2017;74(7):1729–1751. 10.1007/s00285-016-1068-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Tusserkani R, Poormohammadi H, Azadi A, Eslahchi C. Inferring phylogenies from minimal information. Third International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB). 2017; 202–206.
  • 13. Habib M, To TH. Constructing a minimum phylogenetic network from a dense triplet set. Journal of bioinformatics and computational biology. 2012;10(05):1250013 10.1142/S0219720012500138 [DOI] [PubMed] [Google Scholar]
  • 14. Kannan SK, Lawler EL, Warnow TJ. Determining the evolutionary tree using experiments. Journal of Algorithms. 1996;21(1):26–50. 10.1006/jagm.1996.0035 [DOI] [Google Scholar]
  • 15. Choy C, Jansson J, Sadakane K, Sung WK. Computing the maximum agreement of phylogenetic networks. Electronic Notes in Theoretical Computer Science. 2004;91:134–147. 10.1016/j.entcs.2003.12.009 [DOI] [Google Scholar]
  • 16. Poormohammadi H, Sardari Zarchi M, Ghaneai H. NCHB: A Method for Constructing Rooted Phylogenetic Networks from Rooted Triplets based on Height Function and Binarization. Journal of Theoretical Biology. 2020;489:110–144. [DOI] [PubMed] [Google Scholar]
  • 17. Jansson J, Nguyen NB, Sung WK. Algorithms for combining rooted triplets into a galled phylogenetic network. SIAM Journal on Computing. 2006;35(5):1098–1121. 10.1137/S0097539704446529 [DOI] [Google Scholar]
  • 18. Jansson J, Sung WK. Inferring a level-1 phylogenetic network from a dense set of rooted triplets. Theoretical Computer Science. 2006;363(1):60–68. 10.1016/j.tcs.2006.06.022 [DOI] [Google Scholar]
  • 19. Aho AV, Sagiv Y, Szymanski TG, Ullman JD. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing. 1981;10(3):405–421. 10.1137/0210030 [DOI] [Google Scholar]
  • 20. Gusfield D, Eddhu S, Langley C. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. Journal of bioinformatics and computational biology. 2004;2(01):173–213. 10.1142/S0219720004000521 [DOI] [PubMed] [Google Scholar]
  • 21. Huber KT, Van Iersel L, Kelk S, Suchecki R. A practical algorithm for reconstructing level-1 phylogenetic networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB). 2010;8(3):635–649. 10.1109/TCBB.2010.17 [DOI] [PubMed] [Google Scholar]
  • 22. Karp M. Reducibility among combinatorial problems. Complexity of computer computations. 1972;85–103. 10.1007/978-1-4684-2001-2_9 [DOI] [Google Scholar]
  • 23. Gasieniec L, Jansson J, Lingas A, Östlin A. On the complexity of constructing evolutionary trees. Journal of Combinatorial Optimization. 1999;3(2-3):183–197. 10.1023/A:1009833626004 [DOI] [Google Scholar]
  • 24. Wu BY. Constructing the maximum consensus tree from rooted triples. Journal of Combinatorial Optimization. 2004;8(1):29–39. 10.1023/B:JOCO.0000021936.04215.68 [DOI] [Google Scholar]
  • 25. Wu Y. A practical method for exact computation of subtree prune and regraft distance. Bioinformatics. 2009;25(2):190–196. 10.1093/bioinformatics/btn606 [DOI] [PubMed] [Google Scholar]
  • 26. Hein J. Reconstructing evolution of sequences subject to recombination using parsimony. Mathematical biosciences. 1990;98(2):185–200. 10.1016/0025-5564(90)90123-G [DOI] [PubMed] [Google Scholar]
  • 27. Grassly NC, Rambaut A. Treevolve: a program to simulate the evolution of DNA sequences under different population dynamic scenarios. Oxford: Department of Zoology, Wellcome Centre for Infectious Disease; 1997.

Decision Letter 0

Hocine Cherifi

12 Feb 2020

PONE-D-19-34247

NetCombin: An algorithm for optimal level-k network construction from triplets

PLOS ONE

Dear Dr. Poormohammadi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Mar 28 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Hocine Cherifi

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We suggest you thoroughly copyedit your manuscript for language usage, spelling, and grammar. If you do not know anyone who can help you do this, you may wish to consider employing a professional scientific editing service.  

Whilst you may use any professional scientific editing service of your choice, PLOS has partnered with both American Journal Experts (AJE) and Editage to provide discounted services to PLOS authors. Both organizations have experience helping authors meet PLOS guidelines and can provide language editing, translation, manuscript formatting, and figure formatting to ensure your manuscript meets our submission guidelines. To take advantage of our partnership with AJE, visit the AJE website (http://learn.aje.com/plos/) for a 15% discount off AJE services. To take advantage of our partnership with Editage, visit the Editage website (www.editage.com) and enter referral code PLOSEDIT for a 15% discount off Editage services.  If the PLOS editorial team finds any language issues in text that either AJE or Editage has edited, the service provider will re-edit the text for free.

Upon resubmission, please provide the following:

  • The name of the colleague or the details of the professional service that edited your manuscript

  • A copy of your manuscript showing your changes by either highlighting them or using track changes (uploaded as a *supporting information* file)

  • A clean copy of the edited manuscript (uploaded as the new *manuscript* file)

3. Thank you for stating in your Funding Statement:

 

"No, the funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

Please provide an amended statement that declares *all* the funding or sources of support (whether external or internal to your organization) received during this study, as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now.  Please also include the statement “There was no additional external funding received for this study.” in your updated Funding Statement.

Please include your amended Funding Statement within your cover letter. We will change the online submission form on your behalf.

4. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

5. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: A heuristic method for constructing a phylogenetic network from a set of rooted triplets is presented. The problem objective is to output a simplest possible phylogenetic network that contains all of the specified rooted triplets. Here, simplest means having the fewest reticulation nodes or the smallest level.

There are three serious issues with this paper:

1. lack of originality,

2. technical weakness,

3. poor presentation.

For this reason, my recommendation is reject. More detailed comments are given below.

Issue 1: Lack of originality.

The presented heuristic is very similar to other heuristics previously published by the authors that first build a phylogenetic tree and then greedily insert edges into it (thereby turning it into a phylogenetic network) until all triplets are satisfied. As far as I can see, there are only two new contributions in this paper. The first one is to apply Wu's technique to select a good binary phylogenetic tree to use as the base, instead of just refining non-binary nodes to make them binary in some arbitrary way. The second one is to consider every pair of existing edges when deciding the insertion point of each new edge in the last step, instead of inserting these new edges randomly. Both of these contributions are rather trivial adjustments. It seems like the authors are trying to publish a new paper every time they come up with some minor improvement to their own algorithm.

Issue 2: Technical weakness.

The resulting method ("NetCombin") is not analyzed formally. This is no problem, though, since the authors evaluated its performance by comparing it experimentally against some other methods on simulated data. However, these comparisons do not seem fair because in the experiments, they applied NetCombin in parallel to construct 27 different networks and then output the best one found. For each of the other methods, they only constructed one network and output it. Thus, the running time of 44 seconds for NetCombin should be reported as 44*27 = 1188 seconds, etc., which means that it's actually slower than the NCHB method while outputting solutions that are worse than NCHB's. Furthermore, the experiments only go up to 40 sequences so it's difficult to draw any conclusions regarding the relative performance of NetCombin, TripNet, NCHB, and RPNCH.

Issue 3: Poor presentation.

The authors don't even bother to explain all the steps of the method; in many places, they just refer to some of their old papers. It would have been better to list the complete algorithm. Throughout the paper, there is lots of vague phrasing, mistakes, grammatical errors, and missing references to the literature. For example:

- The title is very strange. The parameter k is never defined or used in the paper. What is k? Does "optimal level-k network construction" mean that one wants to find a network that minimizes the value of k?

- Abstract and introduction: It's better to say "or" instead of "and" when talking about minimizing the level and minimizing the number of reticulation nodes since they are not the same thing.

- The abstract claims "The binarization process innovatively uses a measure to construct a binary rooted tree T consistent with the maximum number of input triplets.", but this is misleading. To find such a T is another NP-hard optimization problem and applying Wu's technique just gives an approximation of "a binary rooted tree consistent with the maximum number of input triplets", not always an optimal one. Similar phrasing occurs at the bottom of page 4.

- The abstract proudly declares that T is "expanded in an intellectual process". Actually, it's just a straightforward greedy algorithm.

- The abstract says "The experimental results on real data indicate that". This is not true; only simulated data was used.

- To make the problem definition easier to read, say clearly that all of the input triplets have to be consistent with the output network and that either the level or the number of reticulation nodes in the output network is to be minimized.

- When describing previous related results, it's important to explain how your old algorithms work since the new algorithm is a modification of them. My impression is that there is almost no difference, making this paper weak.

- The section "Definitions and notations" starts with the promise "In this section the basic definitions that are used in the proposed algorithm, are presented formally.". A few sentences later, the text begins referring to something called "N", which has never been introduced.

- The "Method" section says "The height function enforce that the obtained rooted tree be consistent with maximum number of triplets of tau.". In case no tree is consistent with all the triplets in tau, the IP won't have any valid solution, so I don't see how using the height function can yield a tree that is consistent with the maximum number of triplets in tau.

- How does NetCombin solve the IPs that were set up?

- What is the difference between BUILD and HBUILD? Line 207 suggests that they are "equivalent", but what does that mean?

- In the experimental results, is the "number of sequences" the same thing as the "number of leaf labels"?

- Lines 366-404 seem pointless. The reader can just look at the tables to get that information.

- Does Table 4 list the number of reticulations or the level? The caption says one thing but the table itself says another thing.

- The example in Figure 5 is wrong. The input set of triplets contains a triplet de|f, which means that the BUILD algorithm will join d and e by an edge so that all of the leaf labels end up in the same connected component.

- The caption of Figure 5 is misleading. It says "two nodes i,j in X are connected iff" but usually two nodes are said to be "connected" if there is a path (of any length) between them. It would be better to say "two nodes i,j in X are adjacent iff" or "two nodes i,j in X are connected by an edge iff".

- The caption of Figure 5 refers to the "Aho graph". However, the BUILD algorithm was published in a paper written by four people. To be fair, call it "the Aho-Sagiv-Szymanski-Ullman graph" or something like that if you insist on using family names.

- The example in Figure 10 is wrong. The two trees in (b) and (c) are not binary. In both trees, the node lca(f,g) has three children. Also, the text refers to "its two different binarizations" but there are many more possible binarizations.

- The caption of Figure 14 uses Max-Cut in part (m). Should it be Min-Cut?

- Lines 321-322 state some strange-looking time complexity (without any proof) for one of the steps in the algorithm. It doesn't seem relevant as the time complexity of all the other steps has been completely ignored.

- More than 30% of the references in the bibliography are self-references. This is too much. On the other hand, many historically important bibliographic references are missing, such as the following.

* Previous work on the problem should also mention: M. Habib, T.-H. To: "Constructing a minimum phylogenetic network from a dense triplet set", Journal of Bioinformatics and Computational Biology, 10(5): 1250013, 2012.

* The concept of the level of phylogenetic networks (lines 50-51) comes from: C. Choy, J. Jansson, K. Sadakane, W.-K. Sung: "Computing the Maximum Agreement of Phylogenetic Networks", Theoretical Computer Science, 335(1): 93-107, 2005.

* Galled trees were not invented by [15,16] as claimed on line 80 but by: D. Gusfield, S. Eddhu, C. Langley: "Optimal, efficient reconstruction of phylogenetic networks with constrained recombination", Journal of Bioinformatics and Computational Biology, 2(1): 173-213, 2004.

* The Min-Cut method applied to Aho et al.'s BUILD algorithm is from: L. Gasieniec, J. Jansson, A. Lingas, A. Ostlin: "On the complexity of constructing evolutionary trees", Journal of Combinatorial Optimization, 3(2-3): 183-197, 1999.

* The SPR technique is much older than reference [21]; see, e.g.: J. Hein: "Reconstructing evolution of sequences subject to recombination using parsimony", Mathematical Biosciences, 98: 185-200, 1990.

Reviewer #2: Revision of the paper: NetCombin: An algorithm for optimal level-k network construction from triplets

The authors propose a novel phylogenetic network construction algorithm called NetCombin, introduced to construct an optimal network which is consistent with input triplets. The authors compare their method with state-of-the-art methods such as RPNCH and NCHB and show competitive performance. The paper is overall well written and easy to follow; however, some critical errors need to be corrected in order for it to be publishable (I would consider this a major revision).

1.) The authors, e.g., in the paragraph from lines 10-14, discuss the importance of constructing graphs instead of trees. However, the first examples are just trees. Further, the terminology is not consistent. Even though each tree is a graph, not each graph is a tree, thus the authors should stick to the notation of directed graphs in my opinion. The comment is related to the following text:

“The rooted structures are directed graphs which contains a unique vertex called root with in-degree 0 and out-degree at least 2 .. Figure 1a shows an example of a rooted tree.”

2.) The quality of the second figure is very poor and cut off. Please correct this.

3.) line 48: please define the notion of the level of a network. Even though this is a basic concept, it appears here for the first time without any clarification.

4.) Line 64: So, introducing efficient heuristic methods to solve this problem is demanding.

Perhaps “necessary” instead of “demanding” is meant?

5.) The name of the proposed algorithm is NetCombin, however this is not consistent throughout the paper, please unify.

6.) Line 160. Is N the set of natural numbers with 0 or without?

Comments related to the method:

1.) Line 218, please discuss in more detail the MFAS heuristic used. This is highly relevant to the proposed approach.

2.) Line 230, what are the three “???”? Explain.

3.) Line 282 onwards a couple of lines. The authors provide the pseudocode of the proposed algorithm; however, it is not clear why all 9 different measures for the construction of binary trees (as apparently used in [5]) are needed. This computational step appears somewhat redundant, and it seems like everything that can be computed, is computed. In line 301, the authors claim innovative use. Could this be elaborated in more detail? I don't see how exhaustive enumeration of 9 arbitrary measures is any more innovative than doing this for more measures, unless the 9 offer some form of theoretical grounding that makes this traversal/scoring feasible? Furthermore, how can one resolve possible ties? What if the score distribution is ~uniform? Is the subtree selected at random? Please elaborate on these details.

4.) Line 309 onwards. It seems that the network construction step is rather expensive (n^3, m^2). To what extent can this be made to run in parallel? Could you discuss this aspect, at least theoretically?

Results:

I don’t quite understand the reasoning behind: “but NetCombin 360 outputs 27 networks and the best network is reported. Since the process of constructing 361 these 27 NetCombin networks is indepe”

Why is only the best one reported? Isn't this somewhat non-representative of real-life situations? Could you elaborate on that? Further, do other methods also output multiple networks? How did you select the ones compared against there?

Further, if you report the average running time, please repeat the experiments at least 5 times and report also the standard deviations.

I am not entirely sure what the measure of success is in T - it shows the number of outputted networks? Is more better? Why?

Further, low runtime is meaningless if the quality of results is low. Comment on the runtime w.r.t T3 and T4.

Discussion:

There is not enough discussion on the results (see previous section). Further, the last line reads as “NetCombin is a 434 valuable method that returns an appropriate network in an appropriate time”. What does that mean? What is appropriate? Do you mean reasonable?

Why would the user prefer NetCombin, if it performs on par with existing state-of-the-art?

Please comment on this issue.

The authors should further discuss why NCHB is a worse alternative, as it seems to perform very similarly. Is the runtime main benefit of NetCombin?

Statistical evaluation:

The authors propose a tabular comparison of results; however, such results do not offer statistically sound insights into the algorithm's inner workings. I would suggest the authors consider at least critical distance diagrams if possible.

Availability:

Please make the code and data simulators (or datasets) publicly available via, e.g., GitHub.

Final remarks:

All images are of terrible quality. Please, render the plots as vector graphics if possible, otherwise use >400 dpi. Further, the authors should specify what exactly are the main adopted novelties at the end of the introductory section.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Sep 18;15(9):e0227842. doi: 10.1371/journal.pone.0227842.r002

Author response to Decision Letter 0


20 Aug 2020

Reviewer 1:

"The presented heuristic is very similar to other heuristics previously published by the authors that first build a phylogenetic tree and then greedily insert edges into it (thereby turning it into a phylogenetic network) until all triplets are satisfied. As far as I can see, there are only two new contributions in this paper. The first one is to apply Wu's technique to select a good binary phylogenetic tree to use as the base, instead of just refining non-binary nodes to make them binary in some arbitrary way. The second one is to consider every pair of existing edges when deciding the insertion point of each new edge in the last step, instead of inserting these new edges randomly. Both of these contributions are rather trivial adjustments. It seems like the authors are trying to publish a new paper every time they come up with some minor improvement to their own algorithm."

Answer: As you know, the problem of constructing a phylogenetic network consistent with all given input triplets is NP-hard. The paper "Constructing the simplest possible phylogenetic network from triplets" first described an exact exponential method to solve the problem [1]. Based on this exact method, the heuristic SIMPLISTIC algorithm was introduced. A drawback of SIMPLISTIC is its inability to return a network for complex sets of triplets in an acceptable time. This scenario can occur when the number of taxa grows, which increases the complexity of the relations among the obtained triplets. Moreover, this scenario can also happen for small sets of taxa with complex relations among their triplets.

We can conclude that working on phylogenetic networks requires considering more details about the assumptions, the data size, and the complexity of the relations among input triplets. Therefore, introducing an efficient method remains an open challenge.

By analyzing existing research, we can divide the solutions for constructing networks from triplets into two approaches. In the first approach, the reticulation nodes are recognized and removed from the set of taxa, and a tree structure is obtained for the remaining taxa. Finally, the network consistent with all given triplets is obtained by adding the reticulation nodes back to the tree structure. In the second approach, a tree structure is obtained first, and then, by adding new edges to it, the final network consistent with all triplets is obtained.

To the best of our knowledge, all research on this problem falls into one of these two approaches. Therefore, in recent papers researchers try to improve these approaches gradually. Each improvement is valuable because it can reduce time and costs effectively.

For example, in the MRTC problem, which is simpler than network construction, the BPMF algorithm was first proposed [2]. Then BPMR was introduced by improving BPMF [3], and further heuristics for the same problem followed [4]. As an example concerning the first approach to constructing networks, the TripNet algorithm was introduced [5]. Then NCHB was introduced, which improves TripNet [6].

According to the above explanations, we think that our proposed algorithm (NetCombin), an efficient algorithm following the second approach to constructing networks, is valuable and sufficiently novel.

[1] Van Iersel Leo, Kelk Steven. Constructing the simplest possible phylogenetic network from triplets. Algorithmica. 2011;60(2):207-235.

[2] Wu B.Y. Constructing the maximum consensus tree from rooted triples. Journal of Combinatorial Optimization. 2004;8(1):29-39.

[3] Maemura K, Jansson J, Ono H, Sadakane K, Yamashita M. Approximation algorithms for constructing evolutionary trees from rooted triplets. 10th Korea-Japan Joint Workshop on Algorithms and Computation. 2007.

[4] Jahangiri S, Hashemi SN, Poormohammadi H. New heuristics for rooted triplet consistency. Algorithms. 2013;6(3):396-406.

[5] Poormohammadi Hadi, Sardari Zarchi Mohsen. TripNet: a method for constructing rooted phylogenetic networks from rooted triplets. PLoS ONE. 2014;9(9):e106531.

[6] Poormohammadi Hadi, Sardari Zarchi Mohsen, Ghaneai Hossein. NCHB: A method for constructing rooted phylogenetic networks from rooted triplets based on height function and binarization. Journal of Theoretical Biology. 2020;489:110144.

"The resulting method ("NetCombin") is not analyzed formally. This is no problem, though, since the authors evaluated its performance by comparing it experimentally against some other methods on simulated data. However, these comparisons do not seem fair because in the experiments, they applied NetCombin in parallel to construct 27 different networks and then output the best one found. For each of the other methods, they only constructed one network and output it. Thus, the running time of 44 seconds for NetCombin should be reported as 27*44=1188 seconds, etc., which means that it's actually slower than the NCHB method while outputting solutions that are worse than NCHB's. Furthermore, the experiments only go up to 40 sequences so it's difficult to draw any conclusions regarding the relative performance of NetCombin, TripNet, NCHB, and RPNCH."

Answer: We added a paragraph that formally analyzes the time complexity in the "Time complexity" section of the manuscript. Experimentally, we used a PC with an Intel Core i7 CPU and ran our algorithm on the CPU cores in parallel.

Concerning the simulated data, these triplet data are obtained from biological data using standard methods. The triplet data for yeast are available [cite yeast]. The optimal network for the yeast data is a level-2 network, and NetCombin's output for the yeast data is a level-2 network, which is optimal. There is no other standard biological triplet set with a corresponding known optimal network. To obtain biological data, we use standard methods to derive triplets from biologically generated sequence data. This process is explained in detail in the subsection "Data generation" of the "Experiments" section.

Most existing methods are based on the first approach (explained in the first comment and answer). In this approach, to reduce the size of the problem, the SN-sets are recognized first. Then a network is constructed for each SN-set; in the final network, the network corresponding to an SN-set is connected to the rest of the network via a cut edge. Note that each SN-set yields a network itself. For simple SN-sets there is no way to divide the problem into sub-problems, so in general, when simple SN-sets are the inputs, the problem cannot be solved in parallel. For example, TripNet, NCHB, and SIMPLISTIC belong to the first approach and cannot be parallelized. Our proposed method (NetCombin), which is based on the second approach, can be run in parallel. We mentioned this fact in the revised version.
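As a shape sketch of this parallelism only: the 27 runs are independent, so a thread or process pool can launch them and keep the best result. Here `run_netcombin_once` is a hypothetical stand-in for one randomized NetCombin run, returning a fake reticulation count; it is not the manuscript's code.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_netcombin_once(seed):
    """Hypothetical stand-in for one randomized NetCombin run.
    It only returns a fake reticulation count derived from the seed."""
    return random.Random(seed).randint(5, 15)

def best_network(n_runs=27):
    """Launch the independent runs in parallel and keep the best one
    (fewest reticulation nodes), as in the best-of-27 scheme."""
    with ThreadPoolExecutor() as pool:
        counts = list(pool.map(run_netcombin_once, range(n_runs)))
    return min(counts)
```

Because each run is seeded independently, the scheme scales to however many cores are available without any coordination between runs.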

Previous methods like TripNet and NCHB used at most 40 sequences to analyze their performance, so in order to perform a fair comparison we also used at most 40 sequences. However, larger numbers of sequences can be used, and our method can still construct the network in an acceptable time.

"The authors don't even bother to explain all the steps of the method; in many places, they just refer to some of their old papers. It would have been better to list the complete algorithm. Throughout the paper. "

Answer: Thanks for your comments. As you suggested, we added extra explanations about the steps of our algorithm to the revised manuscript. In the first part of the "Method" section, we briefly outline the steps of NetCombin. Also, the part "Constructing tree using height function" of the method section was divided into two parts, "Assigning height function" and "Obtaining tree". Moreover, the method we used to convert G_τ into a DAG is explained in detail in the revised version.

"There is lots of vague phrasing, mistakes, grammatical errors, and missing references to the literature.For example:

-The title is very strange.

The parameter k is never defined or used in the paper.

What is k?

Does "optimal level-k network construction" mean that one wants to find a network that minimizes the value of k?"

Answer: Thanks for your careful attention. We reviewed the manuscript to remove grammatical mistakes and typographical errors. For example, as you suggested:

- The title is changed to: "NetCombin: An algorithm for constructing optimal phylogenetic network from rooted triplets"

-There are two optimality criteria: first, minimizing the number of reticulation nodes, and second, minimizing the level of the network, which is denoted by k in this paper. A network is called level-k if the maximum number of reticulation nodes in each of its biconnected components is k. This explanation is added to the revised version.

"Abstract and introduction:

It's better to say "or" instead of "and" when talking about minimizing the level and minimizing the number of reticulation nodes since they are not the same thing."

Answer: Logically, your comment is right. Hence, we now use "or" instead of "and".

The abstract claims "The binarization process innovatively uses a measure to construct a binary rooted tree T consistent with the maximum number of input triplets.", but this is misleading.To find such a T is another NP-hard optimization problem and applying Wu's technique just gives an approximation of "a binary rooted tree consistent with the maximum number of input triplets", not always an optimal one. Similar phrasing occurs at the bottom of page 4.

Answer: Thanks for your careful comment. As you mentioned, the problem of constructing a rooted tree consistent with the maximum number of rooted triplets is NP-hard; it is known as Maximum Rooted Triplets Consistency (MRTC). The goal is to find an acceptable (near-optimal) solution for MRTC. To remove this ambiguity, we modified the paragraph based on your comment and explain the issue in detail in the introduction section.
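For illustration of the MRTC objective: a triplet xy|z is consistent with a rooted tree exactly when lca(x,y) is a proper descendant of lca(x,z). The following is a minimal sketch assuming a hypothetical parent-map representation of the tree (both leaves are assumed to lie in the same rooted tree); it is not the manuscript's actual code.

```python
def lca(parent, u, v):
    """Lowest common ancestor in a rooted tree given a child-to-parent map
    (the root maps to nothing, i.e. parent.get(root) is None)."""
    anc = set()
    while u is not None:
        anc.add(u)
        u = parent.get(u)
    while v not in anc:
        v = parent.get(v)
    return v

def consistent(parent, triplet):
    """True iff triplet xy|z holds in the tree, i.e. lca(x,y) is a
    proper descendant of lca(x,z)."""
    x, y, z = triplet
    a, b = lca(parent, x, y), lca(parent, x, z)
    if a == b:
        return False
    node = a
    while node is not None:       # walk up from a looking for b
        if node == b:
            return True
        node = parent.get(node)
    return False

def score(parent, triplets):
    """Number of input triplets the tree satisfies: the MRTC objective."""
    return sum(consistent(parent, t) for t in triplets)
```

For the tree ((x,y),z), the triplet xy|z is consistent while xz|y is not, so a tree maximizing `score` is exactly what an MRTC heuristic approximates.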

The abstract proudly declares that T is "expanded in an intellectual process".

Actually, it's just a straightforward greedy algorithm.

Answer: Although greedy algorithms seem straightforward, the most important aspect is how the heuristic function is defined. One of our novelties is introducing an efficient heuristic function for constructing a phylogenetic network from the obtained binary tree. However, we changed the sentence to "Then T is expanded using a heuristic function by adding minimum number of edges to obtain final network with the approximately minimum number of reticulation nodes".

The abstract says "The experimental results on real data indicate that". This is not true; only simulated data was used.

Answer: According to your valuable comment, we adjusted the sentence as follows: "The experimental results on simulated data obtained from biologically generated sequences data indicate that by considering the trade-off between speed and precision, the NetCombin outperforms the others."

We also modified it in the experimental section as follows: "Secondly, triplets can be obtained from sequences data. Sequences data usually are in the form of biological sequences. Biological sequences can be obtained from species or from simulation software that can generate these kinds of sequences under biological assumptions. In this research we used the second approach using simulation software."

"To make the problem definition easier to read, say clearly that all of the input triplets have to be consistent with the output network and that either the level or the number of reticulation nodes in the output network is to be minimized."

Answer: Thanks for your helpful comment. We added your suggested definition to the introduction part of the manuscript.

"When describing previous related results, it's important to explain how your old algorithms work since the new algorithm is a modification of them. My impression is that there is almost no difference, making this paper weak".

Answer: We explained the novelty and the differences between NetCombin and the old methods in the answer to the first comment. If you think some parts of that answer should be added to the manuscript, please inform us.

"The section "Definitions and notations" starts with the promise "In this section the basic definitions that are used in the proposed algorithm, are presented formally." A few sentences later, the text begins referring to something called "N", which has never been introduced."

Answer: In the "Introduction" section we defined the network N, and in the "Definitions and Notations" section we use it as N. However, we modified the definition by adding the symbol N as follows: "Formally, a rooted phylogenetic network N (network for short) is a ...."

"The "Method" section says "The height function enforce that the obtained rooted tree be consistent with maximum number of triplets of tau." In case no tree is consistent with all the triplets in tau, the IP won't have any valid solution, so I don't see how using the height function can yield a tree that is consistent with the maximum number of triplets in tau - How does NetCombin solve the IPs that were set up?"

Answer: We think there may be a misunderstanding of the process. The height function finds an approximate solution for the IP, which may ignore some triplets. If the input triplets are consistent with a tree, then the IP has a feasible solution; more precisely, in this case the IP has a unique optimal solution, corresponding to the height function of the tree obtained by the BUILD algorithm. If no tree is consistent with the set of input triplets, then two cases can happen. In the first case, the IP still has a feasible solution; there is nothing to do, and the algorithm goes to the next step. In the second case, the IP has no feasible solution, and we have an optimization (maximization) problem that must be solved heuristically. In this case the goal is to remove the minimum amount of information from the constraints of the IP. Corresponding to each triplet there are two constraints in the IP. The maximization problem corresponds to the Minimum Feedback Arc Set (MFAS) problem: we assign a directed graph to the IP, and this graph is acyclic if and only if the IP has a feasible solution. In the second case the IP has no feasible solution, so the directed graph related to the input triplets is not acyclic. To obtain a feasible solution for the IP, we remove some constraints from it; that is, we solve the IP heuristically. Since the goal is to remove the minimum amount of information (constraints of the IP), equivalently the goal is to remove the minimum number of edges from the directed graph related to the input triplets so as to make it acyclic. As mentioned, this problem is MFAS, which is NP-hard. We used a heuristic method introduced in reference [14] of the manuscript. The resulting height function is then used to obtain a tree by applying the HBUILD algorithm.
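To make the MFAS step concrete, here is a minimal sketch of a standard degree-ordering heuristic (not necessarily the exact method of reference [14]): order the vertices once by out-degree minus in-degree and drop every edge that points backwards in that order. The kept edges all follow a linear order, so they necessarily form a DAG.

```python
def make_acyclic(vertices, edges):
    """Greedy MFAS-style heuristic: order vertices by out-degree minus
    in-degree, then drop the edges that point backwards in that order.
    Returns (kept_edges, removed_edges); kept_edges always form a DAG."""
    outdeg = {v: 0 for v in vertices}
    indeg = {v: 0 for v in vertices}
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    # Vertices that mostly "emit" edges come first; sort is stable on ties.
    order = sorted(vertices, key=lambda v: outdeg[v] - indeg[v], reverse=True)
    pos = {v: i for i, v in enumerate(order)}
    kept = [(u, v) for (u, v) in edges if pos[u] < pos[v]]
    removed = [(u, v) for (u, v) in edges if pos[u] >= pos[v]]
    return kept, removed
```

On the 3-cycle a→b→c→a this removes a single edge, which matches the intuition in the answer: discard the minimum amount of triplet information needed to make the constraint graph acyclic.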

"What is the difference between BUILD and HBUILD? Line 207 suggests that they are "equivalent", but what does that mean?"

Answer: If the set of input triplets is consistent with a tree, then BUILD and HBUILD produce the same result. Otherwise, BUILD stops at some step and gives no solution, whereas HBUILD can keep proceeding until it produces a tree. We mentioned this in the revised version.
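To illustrate where BUILD gets stuck, here is a stand-alone sketch of the classic BUILD recursion of Aho et al. (an illustrative version, not the manuscript's implementation): each triplet xy|z whose labels all survive joins x and y in the Aho graph, the recursion descends into the connected components, and the algorithm fails exactly when the graph stays in one component, which is the point where HBUILD, guided by the height function, continues.

```python
def components(labels, triplets):
    """Connected components of the Aho graph: for each triplet xy|z whose
    three labels all lie in `labels`, join x and y (union-find)."""
    parent = {x: x for x in labels}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for x, y, z in triplets:
        if x in parent and y in parent and z in parent:
            parent[find(x)] = find(y)
    groups = {}
    for x in labels:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

def build(labels, triplets):
    """BUILD: a nested-tuple tree consistent with all triplets, or None
    when the Aho graph does not split (BUILD is stuck)."""
    if len(labels) == 1:
        return labels[0]
    comps = components(labels, triplets)
    if len(comps) == 1:
        return None  # stuck: no tree is consistent with all the triplets
    subtrees = []
    for comp in comps:
        sub = build(comp, triplets)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)
```

For the single triplet ab|c this yields the tree ((a,b),c), while the cyclic set {ab|c, bc|a, ca|b} keeps the Aho graph connected and BUILD returns no tree.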

"In the experimental results, is the "number of sequences" the same thing as the "number of leaf labels"?"

Answer: Yes, you are right. We added this clarification in the revision.

"Lines 366-404 seem pointless. The reader can just look at the tables to get that information."

Answer: The result tables are presented concisely. To make them more understandable for the reader, we explained them in more detail in the manuscript. However, if you think some sentences should be adjusted or removed, please let us know your suggestions.

"Does Table 4 list the number of reticulations or the level? The caption says one thing but the table itself says another thing."

Answer: Thanks for your careful attention. We corrected Table 4.

"The example in Figure 5 is wrong. The input set of triplets contains a triplet de|f, which means that the BUILD algorithm will join d and e by an edge so that all of the leaf labels end up in the same connected component."

Answer: Thanks for your comment. As Figure 5 shows, in the input triplet set τ, de|f should be da|f. We corrected this in the manuscript.

"The caption of Figure 5 is misleading. It says "two nodes i,j in X are connected iff" but usually two nodes are said to be "connected" if there is a path (of any length) between them.

It would be better to say "two nodes i,j in X are adjacent iff" or "two nodes i,j in X are connected by an edge iff". "

Answer: Thanks for your helpful comment. We adjusted the caption of Figure 5 by replacing "connected" with "adjacent".

"The caption of Figure 5 refers to the "Aho graph". However, the BUILD algorithm was published in a paper written by four people. To be fair, call it "the Aho-Sagiv-Szymanski-Ullman graph" or something like that if you insist on using family names."

Answer: You are right. However, it is very common to use the term "Aho graph". For example, the book cited as reference [2] of the manuscript (a standard reference for phylogenetics) uses the term "Aho graph" frequently.

"The example in Figure 10 is wrong. The two trees in (b) and (c) are not binary.

In both trees, the node lca(f,g) has three children. Also, the text refers to "its two different binarizations" but there are many more possible binarizations."

Answer: Thanks for your critical point. We corrected Figure 10. Additionally, we mentioned in the text that there are many more possible binarizations.

"The caption of Figure 14 uses Max-Cut in part (m). Should it be Min-Cut?"

Answer: In part (l) of Figure 14, Min-Cut is used, and in part (m), Max-Cut is used.

"Lines 321-322 state some strange-looking time complexity (without any

proof) for one of the steps in the algorithm. It doesn't seem relevant as the time complexity of all the other steps has been completely ignored."

Answer: Thanks for your comment. As you suggested, we added a "Time complexity" section to the revised version to explain the complexity of Netcombin in more detail.

"More than 30% of the references in the bibliography are self-references. This is too much. On the other hand, many historically important bibliographic references are missing, such as the following."

Answer: As you know, this research builds on our previous work, so we have to cite it. However, as you suggested, we added the new references you mentioned in the comment.

Reviewer 2:

"1.) The authors e.g., in the paragraph from lines 10-14 discuss the importance of constructing graphs instead of trees. However, the first example are just trees. Further, the terminology is not consistent. Even though each tree is a graph, not each graph is a tree, thus the authors should stick to notation of directed graphs in my opinion. The comment is related to the following text:"

“The rooted structures 12 are directed graphs which contains a unique vertex called root with in-degree 0 and 13 out-degree at least 2 .. Figure 1a shows an example of a rooted tree.”

Answer: Thanks for your comment. We changed the sentences to "The rooted structures are always rooted trees or rooted networks. These structures contain a unique vertex called root with in-degree 0 and out-degree at least two. "

"2.) The quality of the second figure is very poor and cut off. Please correct this."

Answer: We adjusted the second figure.

"3.) line 48: please define the notion of the level of a network. Even though this is a basic concept, it appears here for the first time without any clarification."

Answer: As you suggested, we moved the definition of the level of a network to before the sixth paragraph of the introduction.

"4.) Line 64: So, introducing efficient heuristic methods to solve this problem is demanding. Perhaps “necessary” instead of “demanding” is meant?"

Answer: Thanks for your point. We replaced it.

"5.) The name of the proposed algorithm is NetCombin, however this is not consistent throughout the paper, please unify."

Answer: The name Netcombin stands for "network construction method based on binarization". Based on your comment, we unified the name as Netcombin throughout the manuscript.

"6.) Line 160. Is N the set of natural numbers with 0 or without?"

Answer: N is the set of natural numbers without 0.

Comments related to the method:

"1.) Line 218, please discuss in more detail on MFAS heuristic used. This is highly relevant to the proposed approach."

Answer: We added more details about the MFAS problem and the heuristic method used to solve it in the subsection "Assigning height function".

"2.) Line 230, what are the three “???”? Explain."

Answer: It was redundant and we removed it.

"3.) Line 282 onwards a couple of lines. The authors provide the pseudocode of the proposed algorithm, however, it is not clear why all 9 different measures for construction of binary trees (as apparently used in [5]). This computational step appears somewhat redundant and seems like everything that can be, is computed. In line 301, authors claim innovative use. Could this be elaborated in more detail? I don’t see how exhaustive enumeration of arbitrary 9 measures is any more innovative than doing this for more measures, unless the 9 offer some form of theoretical grounding that makes this traversal/scoring feasible? Furthermore, how can one resolve possible ties? What if the score distribution is ~uniform? Is the subtree selected at random? Please elaborate on these details."

Answer: The 9 measures correspond to 9 heuristic functions that give better output compared to other heuristic functions. In defining these 9 heuristic functions, we used the parameters W, P and T, which are important factors in tree-construction methods based on rooted triplets. These parameters were introduced by Wuo …..

The subtrees are selected based on the SPR method, which is described in detail in the manuscript. Regarding the term "uniform": if you mean that the best tree is uniformly distributed among these 9 functions, the answer is no.

"4.) Line 309 onwards. It seems that the network construction step is rather expensive (n^3, m^2). To what extent can this be made run in parallel. Could you discuss this aspect, at least theoretically?"

Answer: We added the new "Time complexity" section and described the time complexity in detail. In the previously submitted version, we only mentioned the time complexity of the tree construction; in the revised version, the time complexity is explained for all parts of Netcombin rather than only the tree-construction step.

"I don’t quite understand the reasoning behind: “but NetCombin 360 outputs 27 networks and the best network is reported. Since the process of constructing 361 these 27 NetCombin networks is indepe”

Why is only the best one reported? Isn’t this somewhat non-representative for real-life situations. Could you elaborate on that? Further, do other methods also output multiple networks? How did you select the compared against there?

Further, if you report the average running time, please repreat the experiments at least 5 times and report also the standard deviations. "

Answer: We analyzed the problem based on the two optimality criteria, and the best resulting network is reported. Since our method is based on optimizing the number of reticulation nodes, in almost all cases the optimal network is unique.

Biologically, some suboptimal results may be more informative than the optimal result. We did not consider this aspect biologically, since our data are semi-real and generated under biological constraints. We ran our algorithm in parallel on the cores of a Core i7 CPU, and the time of the whole process was measured. There is no averaging in our experiments, so we do not report standard deviations. The resulting network of each of NCHB, TripNet, SIMPLISTIC and RPNCH is unique.

"I am not entirely sure what the measure of success is in T - it shows the number of outputted networks? Is more better? Why?

Further, low runtime is meaningless if the quality of results is low. Comment on the runtime w.r.t T3 and T4."

Answer: As mentioned before, the problem of constructing an optimal network consistent with all given input triplets is NP-hard, and different methods have been designed to solve it heuristically. To study the performance of Netcombin, we generated data sets and compared our method with TripNet, SIMPLISTIC, NCHB and RPNCH. The comparison is based on three parameters: level, number of reticulation nodes, and running time. Considering all three parameters, the results show that our method outperforms the others. For example, RPNCH is very fast, but the optimality of its results is very low. SIMPLISTIC gives good results for small data sets, but as the number of taxa grows to 20 or more, its runtime grows exponentially, so it is not applicable to large data sets. NCHB and TripNet work well for small data sets, but their running times also grow as the number of input taxa increases. Note that, on average, TripNet and NCHB perform better than SIMPLISTIC with respect to both running time and the optimality criteria. For large data sets, as the number of taxa increases, the running times of TripNet and NCHB grow much faster than that of Netcombin, whose performance is nearly the same as NCHB and TripNet while its running time is much lower. This means that, considering all three parameters, Netcombin is on average a reasonable method for solving the problem: for large data sets, low runtime, low level, and a low number of reticulation nodes occur simultaneously in Netcombin's results. Netcombin first constructs 27 trees and then, by adding edges to these trees, constructs 27 networks; the optimal network is reported as output. Both steps are performed innovatively: approximately optimal trees are first constructed using the 9 measures and SPR with respect to consistency with the maximum number of input triplets, and then an approximately optimal network is constructed from each tree by adding edges in a way that reduces the number of added reticulation nodes.

"There is not enough discussion on the results (see previous section). Further, the last line reads as “NetCombin is a 434 valuable method that returns an appropriate network in an appropriate time”. What does that mean? What is appropriate? Do you mean reasonable?

Why would the user prefer NetCombin, if it performs on par with existing state-of-the-art?

Please comment on this issue.

The authors should further discuss why NCHB is a worse alternative, as it seems to perform very similarly. Is the runtime main benefit of NetCombin?"

Answer: TripNet has three speed options: slow, normal, and fast. The results of the slow option are better than those of the normal option, and the normal option's results are better than those of the fast option. As the number of taxa increases, the slow and normal options become time consuming. In almost all cases, and when comparing TripNet with other methods, the normal speed option is used; in this manuscript, TripNet results are based on the normal speed option.

We added some sentences to the beginning of the discussion to explain our method in more detail. Also, thanks for your comment; we replaced the word "appropriate" with "reasonable".

As we mentioned in response to the previous comments, for large data sets and as the number of taxa increases, Netcombin outputs a network in much less time than NCHB, while its optimality is nearly the same. For small data sets, one can use any of Netcombin, TripNet or NCHB; but for large data sets, which are very common in practice, Netcombin is recommended when both running time and the optimality criteria are considered.

"Statistical evaluation:

The authors propose tabelaric comparison of results, however, such results do not offer statistically sound insights into the algorithm's inner workings. I would suggest the authors to consider at least critical distance diagrams if possible."

Answer: We did not perform a statistical comparison. Previous methods for constructing trees and networks from input triplets were not analyzed statistically; their performance was compared only in terms of running time and the optimality criteria. If the reviewer can suggest a specific statistical evaluation, please let us know.

"Availability:

Please, make the code and data simulators (or datasets) publicly available via .e.g, github."

Answer: Netcombin is not public and is partially supported under a grant. Some parts of the method can be released publicly or sent to the reviewer; we have sent the part related to tree construction.

Final remarks:

All images are of terrible quality. Please, render the plots as vector graphics if possible, otherwise use >400 dpi. Further, the authors should specify what exactly are the main adopted novelties at the end of the introductory section.

Answer: All images were designed in Visio and then converted to JPG format; Visio is vector based, but JPG is not, and we can send the original Visio sources. We also used >400 dpi for all figures in the revised version. At the end of the introduction, we briefly mention the novelties of our method, and in the new version we added some sentences to the discussion to highlight the novelty and importance of the method.

Attachment

Submitted filename: plos one reviewr answer.docx

Decision Letter 1

Hocine Cherifi

27 Aug 2020

PONE-D-19-34247R1

Netcombin‎: ‎An algorithm for constructing optimal phylogenetic network from rooted triplets

PLOS ONE

Dear Dr. Poormohammadi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 11 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Hocine Cherifi

Academic Editor

PLOS ONE

Additional Editor Comments (if provided):

Please update the figure


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: The authors have addressed the most of my concerns. The only thing still bothering me is the low resolution of the figure fig14e.jpg (I'm not sure this is >400 dpi).

I'd suggest you improve the quality prior to publication.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Sep 18;15(9):e0227842. doi: 10.1371/journal.pone.0227842.r004

Author response to Decision Letter 1


7 Sep 2020

As you suggested, the final version of manuscript is uploaded.

Decision Letter 2

Hocine Cherifi

9 Sep 2020

Netcombin‎: ‎An algorithm for constructing optimal phylogenetic network from rooted triplets

PONE-D-19-34247R2

Dear Dr. Poormohammadi,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Hocine Cherifi

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Hocine Cherifi

11 Sep 2020

PONE-D-19-34247R2

Netcombin‎: ‎An algorithm for constructing optimal phylogenetic network from rooted triplets

Dear Dr. Poormohammadi:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Hocine Cherifi

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File

    (ZIP)

    S2 File

    (ZIP)

    S3 File

    (ZIP)

    S4 File

    (ZIP)

    Attachment

    Submitted filename: plos one reviewr answer.docx

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.

