PLOS One. 2023 Jul 13;18(7):e0287483. doi: 10.1371/journal.pone.0287483

On closing the inopportune gap with consistency transformation and iterative refinement

Mario João Jr 1,2,*, Alexandre C Sena 3, Vinod E F Rebello 2
Editor: Yang Zhang4
PMCID: PMC10343097  PMID: 37440507

Abstract

The problem of aligning multiple biological sequences has fascinated scientists for a long time. Over the last four decades, dozens of heuristic-based Multiple Sequence Alignment (MSA) tools have been proposed, the vast majority built on the concept of Progressive Alignment. It is known, however, that this approach suffers from an inherent drawback: the inadvertent insertion of gaps when aligning sequences. Two well-known corrective solutions have frequently been adopted to help mitigate this: Consistency Transformation and Iterative Refinement. This paper takes a tool-independent, technique-oriented look at the alignment quality benefits of these two strategies using problem instances from the HOMSTRAD and BAliBASE benchmarks. Eighty MSA aligners have been used to compare 4 classes of heuristics: Progressive Alignments, Iterative Alignments, Consistency-based Alignments, and Consistency-based Progressive Alignments with Iterative Refinement. Statistically, while both Consistency-based classes are better for alignments with low similarity, for sequences with higher similarity the differences between the classes are less clear. Iterative Refinement has drawbacks of its own, and there is statistically little advantage for Progressive Aligners in adopting this technique, with or without Consistency Transformation. Nevertheless, all 4 classes are capable of bettering each other, depending on the problem instance. This further motivates the development of MSA frameworks, such as the one being developed for this research, which simultaneously contemplate multiple classes and techniques in their attempt to uncover better solutions.

1 Introduction

The process of aligning three or more biological sequences (typically protein amino acids, DNA or RNA base pairs) to identify the similarities or disparities that reflect the evolutionary, functional, or structural relationship [1] between them is generally referred to as Multiple Sequence Alignment (MSA). MSA is an essential step to solve problems in different areas of molecular and computational biology such as evolutionary-based studies [2], domain analysis [2], co-evolutionary analyses [3], phylogeny inference [4], and the prediction of protein functions [5] or their three-dimensional structures [6], among others.

Note that this problem is computationally different, for example, from the process to find homologous sequences, which consists of obtaining multiple pairwise sequence alignments of protein or RNA sequences generally against a single reference sequence [7]. Multiple sequence alignments tend to provide more information than pairwise alignments by highlighting conserved regions of aligned residues that may be of structural and functional relevance. The focus of this paper is thus on the former, the traditional MSA problem.

Given the average length l of the biological sequences being aligned, the optimal alignment of two sequences has an O(l^2) polynomial complexity [8], while aligning n such sequences optimally is an NP-complete problem with complexity O(l^n) [9, 10]. As a result, the need to obtain even approximate MSA solutions for scientifically relevant sequence lengths in acceptable time frames (e.g., aligning dozens of sequences with lengths of more than a few hundred residues [11]) has long motivated the development of heuristics that adopt strategies based primarily on pairwise sequence alignment.

Gaps are the spaces that are required when trying to align DNA or protein sequences so that homologous residues are matched. Over evolutionary time, a sequence may lose some residues (referred to as deletion) or have some extra residues inserted (insertion). When aligned with the original sequence, the newer sequence would require the insertion of characters to represent gaps in the locations where deletions occurred, while the original sequence would require gaps where the newer sequence has extra residues. When aligning multiple sequences, difficulties arise with regard to the positioning of these gaps. Given the complexity of determining whether the resulting gaps are caused by insertions or deletions, traditional heuristics typically do not make this determination and instead align sequences using gaps in the positions that obtain the highest overall score. While Needleman and Wunsch first described a dynamic programming algorithm for the pairwise alignment of biological sequences [8], Gotoh extended this approach with affine gap scores [12] to address the assumption that a single large insertion/deletion (indel) event is biologically more likely to occur than many smaller ones with the same combined total length.
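To make the dynamic programming idea concrete, the following is a minimal sketch of the Needleman-Wunsch recurrence for scoring a global pairwise alignment, assuming a simple linear gap penalty and illustrative match/mismatch scores rather than a real substitution matrix (Gotoh's affine-gap extension would track gap openings and extensions in separate matrices). The function name and parameters are ours, for illustration only, and are not taken from any particular tool.

```python
def nw_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment score via Needleman-Wunsch with a linear gap penalty.

    dp[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j];
    filling the full table gives the O(l^2) cost mentioned above.
    """
    n, m = len(seq_a), len(seq_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap          # prefix of seq_a aligned against leading gaps
    for j in range(1, m + 1):
        dp[0][j] = j * gap          # prefix of seq_b aligned against leading gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1][j - 1] + (match if seq_a[i - 1] == seq_b[j - 1] else mismatch)
            dp[i][j] = max(diag,                 # align the two residues
                           dp[i - 1][j] + gap,   # gap inserted in seq_b
                           dp[i][j - 1] + gap)   # gap inserted in seq_a
    return dp[n][m]


print(nw_score("HEAGAWGHEE", "PAWHEAE"))
```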

Numerous multiple sequence alignment tools have since been proposed over the last four decades [11], some of which have continued to evolve over the years [13, 14]. Interestingly, the majority of contemporary heuristic-based tools are still based on Progressive algorithms that aggregate multiple pairwise alignments. This Progressive method is considered to be one of the first practical MSA heuristics [15] and to date has maintained its three-stage workflow. In the first stage, the n sequences are compared with each other, in pairs, generating a triangular matrix with the result of each comparison represented by a Similarity Score. The next stage creates a binary Guide Tree that identifies the order in which the sequences should be aligned in the subsequent stage. The Guide Tree is generated using a hierarchical clustering algorithm, such as UPGMA [16] or Neighbor-Joining (NJ) [17]. During the final Profile Alignment stage, the Guide Tree is traversed recursively bottom-up, generating a profile [18] for each non-leaf node by aligning the two sets of previously aligned sequences from the node's children, until the complete alignment is produced.

While highly effective, one of the characteristics of solutions produced by the Progressive method can be summed up in the words of its authors: “once a gap, always a gap” [15]. In the third stage, if a gap is inserted during the creation of a profile to optimize the pairwise alignment, this gap will be propagated to all subsequent profiles generated from this node, including the final alignment, which could hinder the method’s ability to find the optimal alignment [11].

The quality of an MSA heuristic is generally evaluated by comparing the accuracy of the alignments produced for a set of benchmark MSA problem instances against their corresponding reference alignments, hand-crafted by scientists, which are believed to be the most biologically appropriate for the set of sequences. One key factor that significantly influences the accuracy of progressive alignments is the degree of similarity between the sequences being aligned [19]. The more dissimilar the sequences are, the harder it appears for MSA heuristics to achieve high accuracy. The term Identity is often used to refer to the degree of similarity between sequences as a percentage of the positions in the pairwise alignment that have matching residues. In particular, according to Pei in [20], progressive aligners may not produce satisfactory results when Identity values fall inside the twilight zone [21], i.e., when the similarity between sequences is less than 30%.

To mitigate the problem of these inopportune gaps and improve the alignments produced by Progressive methods, researchers widely employ two other classes of heuristics based respectively on the techniques Iterative Refinement [22] and Consistency Transformation [23]. Iterative refinement-based alignments rely on post-processing an MSA solution to revise the order in which the profiles were aligned [24]. A heuristic process starts with an initial MSA, typically provided by a progressive alignment, and reaches a new one through successive iterative refinements that try to remove these gaps. These refinements are applied until a given objective function cannot be improved or a user-defined limit of attempts has been made. Alternatively, Consistency-based heuristics try to improve the profile alignment stage of the Progressive method using additional information, extracted from the sequences at an earlier stage, to reinforce the alignment of specific regions and avoid the inadvertent insertion of gaps [23]. This technique is also particularly effective for twilight zone alignments [19].

Consequently, the majority of MSA tools currently employ progressive aligners with either Iterative Refinement or Consistency Transformation. Which tool to choose and which parameters and techniques to configure are not easy decisions for the inexperienced [2, 25]. This choice becomes even harder given that alignment quality will be influenced by the quantity, lengths, and, in particular, the similarity of the biological sequences being aligned [1]. In this context, this work aims to examine the relative benefits of using combinations of different techniques in each stage of the Progressive method and to identify novel or perhaps overlooked combinations that could help scientists find higher-quality alignments.

Different from previous research that typically compares specific MSA tools [2, 23, 25, 26], this paper focuses on the common structures of such tools and the impact alternative techniques, methods, or heuristics can have on the quality of the alignments produced. The framework used in [19], which combines distinct techniques from existing tools including a variety of scoring techniques, Guide Tree clustering algorithms, and Consistency Transformation, has been extended to incorporate Iterative Refinement. This paper aims to investigate whether progressive alignments can effectively benefit from the two classic techniques used to address the issue of inappropriate gaps in the final alignment.

A thorough evaluation using protein sequences from the BAliBASE (Benchmark Alignment dataBASE) [27] and HOMSTRAD (HOMologous STRucture Alignment Database) [28] benchmarks is presented, with results showing that, overall, consistency-based progressive alignments have greater accuracies than those found by both iterative refinement heuristics and even a novel combination of the two approaches. Even though the Wilcoxon signed-rank tests [29] show that the results have statistical significance, when analyzing the alignments for each problem instance on a case-by-case basis, one of the other 3 classes of heuristic (Progressive Alignment alone, Progressive Alignment with Iterative Refinement, and Consistency-based Progressive Alignment with Iterative Refinement) frequently produces a solution with better accuracy, indicating that these approaches should perhaps not be ignored or discarded.

The rest of this paper is divided into four sections. The following one covers some background information on the MSA problem, while Section 3 briefly describes our MSA framework [19]. Section 4 presents a comparison of the alignment qualities of 80 different Progressive Alignment-based heuristics, while the final section closes the paper with some conclusions.

2 Background

The modus operandi of the class of Iterative Alignment tools [12, 24] is one approach to improving the results of a Progressive Alignment. These tools are generally composed of the following steps: (a) an initial multiple sequence alignment is generated using Progressive Alignment; (b) this alignment is then split into two smaller profiles (i.e., two sets of two or more aligned sequences) so that any column in either profile containing only gaps can be removed; (c) the two profiles are then realigned, generally using the technique employed in the earlier Profile Alignment stage; and (d) if the resulting MSA obtains a better evaluation score, it becomes the MSA to be improved. Steps (b) to (d) are then repeated until a specified number of iterations has been completed or no further improvements to the evaluation score are obtained.
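As a sketch of steps (a) to (d), the loop below assumes that the surrounding tool supplies the four callables (progressive_alignment, split_profiles, realign_profiles, and an objective score such as WSP); these names are placeholders for illustration, not a real API. Real tools typically cycle through a whole set of partitions before declaring that no further improvement is possible, so the early break used here is a simplification.

```python
def iterative_refinement(sequences, progressive_alignment, split_profiles,
                         realign_profiles, score, max_iters=100):
    """Generic iterative refinement loop following steps (a)-(d)."""
    best = progressive_alignment(sequences)                 # (a) initial MSA
    best_score = score(best)
    for _ in range(max_iters):
        profile_a, profile_b = split_profiles(best)         # (b) split; gap-only columns removed
        candidate = realign_profiles(profile_a, profile_b)  # (c) realign the two profiles
        candidate_score = score(candidate)
        if candidate_score > best_score:                    # (d) keep only improvements
            best, best_score = candidate, candidate_score
        else:
            break                                           # no improvement: stop
    return best
```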

While the method for dividing an alignment into two profiles in step (b) varies according to each tool, they all require a given criterion to be adopted to determine the allocation of each sequence to one of the respective profiles. The criteria most commonly used range from being simple (e.g., a random choice [30]), to something more elaborate (e.g., Tree Restricted Partitioning [31] presented in Section 3.5).

An important aspect of Iterative Alignments is how to evaluate the MSA at each iteration. The most widely used functions are the sum-of-pairs (SP) [9] and, its extension, the weighted sum-of-pairs (WSP) [32]. The WSP of an alignment A of n sequences and length L is defined by:

WSP(A) = \sum_{1 \leq l \leq L} \sum_{1 \leq i < j \leq n} w_{i,j} \times S(a_{i,l}, a_{j,l})    (1)

where a_k, for 1 ≤ k ≤ n, is the kth aligned sequence, a_{k,l} is the residue of the aligned sequence a_k at column l, w_{i,j} is the weight associated with the sequence pair a_i, a_j (w_{i,j} = 1 for SP), and S(x, y) is the residue substitution score of x by y. The residue substitution matrix S encapsulates the rates at which each type of residue would be substituted by other residues over time. One of the most widely adopted examples for S is Blocks of Amino Acid Substitution (BLOSUM) [33].
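A direct transcription of Eq 1 can help fix the notation. The sketch below assumes the alignment is given as a list of equal-length strings, the weights as a nested list, and the substitution matrix as a dictionary containing both orderings of each residue pair; columns where either residue is a gap are simply skipped here, whereas real implementations apply gap penalties instead.

```python
def wsp(alignment, weights, subst, gap_char="-"):
    """Weighted sum-of-pairs (Eq 1) of an alignment given as equal-length strings.

    weights[i][j] is w_{i,j} (use 1 everywhere for plain SP);
    subst[(x, y)] is S(x, y) and is assumed to contain both orderings.
    """
    n, length = len(alignment), len(alignment[0])
    total = 0.0
    for col in range(length):                    # 1 <= l <= L
        for i in range(n - 1):                   # 1 <= i < j <= n
            for j in range(i + 1, n):
                x, y = alignment[i][col], alignment[j][col]
                if x != gap_char and y != gap_char:
                    total += weights[i][j] * subst[(x, y)]
    return total
```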

Two of the best known Iterative Alignment tools are MAFFT [34] and MUSCLE [22]. Both follow the steps described earlier, but each has its own particularities. MUSCLE generates an initial MSA by passing through the progressive method stages twice and then iteratively attempts to improve it. First, it employs the k-mer distance [35] to determine the first similarity score matrix, which is used by UPGMA [16] to generate the Guide Tree and, after that, to find an initial MSA. With this first MSA, MUSCLE generates a second similarity score matrix using the Kimura distance [36]. With the second similarity score matrix, MUSCLE again uses UPGMA to create a second Guide Tree, and the MSA is then iteratively improved. MUSCLE's iteration process uses Tree Restricted Partitioning [31] to determine the two profiles to be aligned, employing WSP [32] as the objective function.

MAFFT version 5 [37] has five options that use iterative refinement: NW-NS-i, FFT-NS-i, F-NS-i, G-NS-i, and H-NS-i. The fundamental difference between these variations is the technique used in the first stage to calculate the similarity scores. While NW-NS-i and FFT-NS-i both use the number of 6-tuples of residues shared by the two sequences as the score, the latter first transforms the sequence representation using a Fast Fourier Transform (FFT) before the calculation [34]. G-NS-i also uses an FFT but employs a global alignment algorithm, such as Needleman-Wunsch [8], to generate the similarity scores. H-NS-i and F-NS-i use variations of the alignment tool FASTA [38]. In Stage 2, MAFFT also uses UPGMA and the objective function WSP to generate the Guide Tree, which is then optimized iteratively using Tree Restricted Partitioning in the following stage. The principal difference in relation to MUSCLE is how the similarity scores are calculated.

An alternative approach to address the drawback of the Progressive method is through the use of Consistency Transformation. Consistency-based Alignment assumes that multiple sequence alignments are consistent, i.e., if three sequences a_a, a_b, and a_c are aligned, and residue a_{a,i} is aligned with residue a_{b,j} and residue a_{b,j} is aligned with a_{c,k}, then a_{a,i} should be aligned with a_{c,k}. Thus, applying this transitive principle in reverse, when aligning two sequences a_a and a_c, consistency-based MSA methods look for evidence of previous pairwise alignments involving a_a and a_c with other sequences (e.g., a_b) to reinforce the alignment of specific residue pairs. Gotoh [39] refers to these consistent alignment regions as anchor points and has shown how they can be used to support multiple alignments. Two of the most well-known consistency-based alignment tools are T-COFFEE [23] and ProbCons [30].
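The transitive principle can be illustrated with a toy version of the library extension used by consistency-based tools: given pairwise residue-match libraries for the pairs (a, b) and (b, c), every residue j of the intermediate sequence b that links residue i of a to residue k of c contributes extra weight to aligning i with k. The dictionary-based representation and the use of the minimum weight loosely follow T-COFFEE's general idea but are a simplification, not its actual implementation.

```python
from collections import defaultdict


def extend_library(lib_ab, lib_bc, len_b):
    """Toy consistency transformation.

    lib_ab[(i, j)] and lib_bc[(j, k)] map residue-index pairs to weights.
    Returns extra support for aligning residue i of sequence a with residue k
    of sequence c, accumulated over every intermediate residue j of sequence b.
    """
    extended = defaultdict(float)
    for j in range(len_b):
        partners_a = [(i, w) for (i, jj), w in lib_ab.items() if jj == j]
        partners_c = [(k, w) for (jj, k), w in lib_bc.items() if jj == j]
        for i, w_ab in partners_a:
            for k, w_bc in partners_c:
                # a_i ~ b_j and b_j ~ c_k together support aligning a_i with c_k
                extended[(i, k)] += min(w_ab, w_bc)
    return extended
```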

3 Towards a common framework for MSA

While Section 2 cited a few examples, a plethora of tools has been proposed that fall into one of the three aforementioned heuristic classes. Some even combine characteristics of more than one class (e.g., iterative and consistency [30]). As yet, there does not seem to be a single MSA tool that always produces the best alignment; thus, the appropriate choice is left to the user's experience. Given the different classes and implementations within each class, it is often necessary to employ various tools, or one with multiple configuration options, and compare their alignment results manually [1]. This, however, requires various MSA processing steps to be unnecessarily repeated. For example, using Progressive Alignment, a scientist might want to compare the alignments generated by two different techniques for constructing the Guide Tree. In this case, the information necessary to calculate the similarity scores might be the same, but would be calculated twice, once for each technique.

To provide an environment where the user can efficiently evaluate different combinations of techniques and classes of MSA heuristics without having to learn to execute a suite of existing MSA tools, the work in [19] presented an initial version of an integrated framework of common MSA techniques. Furthermore, the framework is able to evaluate novel combinations of techniques, many of which have apparently not been investigated previously, in a coordinated manner and on a level playing field. Techniques from different state-of-the-art MSA tools have been integrated into the framework. The Clustal W [26] source code is used as the base implementation for Progressive Alignments, while the Consistency Transformation implemented by T-COFFEE [23] has also been incorporated. For this paper, an iterative strategy has been added to the framework based on MUSCLE's [22] Tree Restricted Partitioning [31].

Instead of structuring the framework into three different classes, existing approaches are mapped to a unified architecture composed of 5 stages, with each having a number of alternative implementation options. An overview of the framework stages as well as the techniques currently available in each are described in the following subsections. The final subsection presents the nomenclature adopted in the following section to identify the variations and the heuristic classes that can be designed by the current version of this common framework.

3.1 Stage 1: Similarity scoring

The first step of any progressive alignment is to determine the degree of similarity between pairs of sequences and calculate each pair’s Similarity Score. In general, these scores are held in a Similarity Matrix: a triangular matrix of dimension n, where n is the number of sequences to be aligned. Each element Mi,j, with i < j, of this matrix is the similarity score between sequence i and sequence j, calculated in this stage. In addition to the two techniques provided by Clustal W [26], known as FULL (and referred to as F in the tables of Subsection 4.3) and QUICK (Q), another three forms of calculating similarity scores, LCS (L), KMERS (K), and PROBA (P), have also been implemented. Their differences are summarized as follows:

  • FULL: The similarity score is the Identity percentage of the alignment obtained using the Myers and Miller global alignment algorithm [40];

  • QUICK calculates the scores using a fast approximation method [41] that simply counts the number of tuples that match between the two sequences;

  • LCS defines the scores to be the length of the Longest Common Subsequence divided by the length of the smallest of the two sequences. LCS is generated using an optimized implementation [42] of Hirschberg’s Algorithm [43];

  • KMERS calculates the score by dividing the number of k-mers (the term k-mer refers to all of the subsequences of length k) that the two sequences have in common by the length of the smaller sequence, as used by MUSCLE (see the sketch at the end of this subsection);

  • PROBA is an alternative approach used by ProbCons [30] that searches for alignments with maximum expected accuracy, using pair-hidden Markov model-derived posterior probability matrices that contain the probability of a pair of residues from the two sequences being aligned.

Note that each scoring mechanism inherently tends to incorporate a chosen scheme to “align” each pair of sequences. The framework thus opts to consider each unique alignment-scoring combination as a separate technique.
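As a concrete example of one of the simpler options above, the KMERS score can be sketched as follows; shared k-mers are counted here as a set intersection for brevity, and the value of k as well as the treatment of repeated k-mers vary between implementations.

```python
def kmer_similarity(seq_a, seq_b, k=3):
    """KMERS-style similarity: the number of k-mers shared by the two sequences
    divided by the length of the shorter one (set-based count, for simplicity)."""
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    return len(kmers_a & kmers_b) / min(len(seq_a), len(seq_b))


print(kmer_similarity("HEAGAWGHEE", "HEAGAWGHEA"))
```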

3.2 Stage 2: Guide tree construction

In addition to the two most commonly used hierarchical clustering algorithms to generate Guide Trees, UPGMA [16] and Neighbor-Joining (NJ) [17], both already available in Clustal W [26], two other variations of SLINK [44], called SLMIN and SLMAX, have been implemented in the framework. One motivation to include these two techniques was the fact that the authors in [45] claim that SLMIN is the best choice.

With the degree of similarity between pairs of sequences held in a similarity matrix, the agglomerative UPGMA begins to construct its guide tree by representing each sequence as a single node and repeatedly joining the two nodes of the pair of sequences with the highest score in the similarity matrix to form a rooted tree. The similarity matrix is then altered to reflect the grouping of the two nodes into one: the two rows and columns representing the two grouped nodes are replaced by a single row and column for the new node, with the scores between the new node and the remaining nodes calculated as the average of the two scores being removed. The algorithm continues until all the nodes have been combined into a single cluster, representing the guide tree. SLMIN and SLMAX follow the same algorithm; however, the new scores are no longer the average of the two scores but rather the minimum (in the case of SLMAX) or the maximum (for SLMIN) score between pairs.

Similarly, Neighbor-Joining (NJ) is also a bottom-up agglomerative hierarchical clustering approach that uses a similarity matrix. Starting with a central node connected to all other nodes (sequences), the two least distant nodes (most similar sequences) are grouped, forming a tree that is connected to the central node. The distances between groups are calculated based on the path from the central node to each group, and the matrix is updated to incorporate the new group. This process is repeated until the completion of the guide tree.
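A minimal sketch of the UPGMA-style construction described above may help; it follows the simple averaging rule given in the text (production implementations typically weight the average by cluster sizes and also record branch lengths). The similarity matrix is assumed to be a symmetric dict-of-dicts keyed by sequence label, and the guide tree is returned as nested tuples.

```python
def build_guide_tree(similarity, labels):
    """UPGMA-style clustering: repeatedly join the most similar pair of nodes,
    replacing their rows/columns by a single node whose scores are the average
    of the two removed scores."""
    sim = {a: dict(row) for a, row in similarity.items()}
    nodes = list(labels)
    while len(nodes) > 1:
        a, b = max(((x, y) for x in nodes for y in nodes if x != y),
                   key=lambda pair: sim[pair[0]][pair[1]])
        joined = (a, b)
        nodes = [n for n in nodes if n not in (a, b)] + [joined]
        sim[joined] = {}
        for n in nodes[:-1]:
            # SLMAX would take the minimum and SLMIN the maximum of the two scores instead
            new_score = (sim[a][n] + sim[b][n]) / 2.0
            sim[joined][n] = new_score
            sim[n][joined] = new_score
    return nodes[0]


sim = {"A": {"B": 0.9, "C": 0.2}, "B": {"A": 0.9, "C": 0.3}, "C": {"A": 0.2, "B": 0.3}}
print(build_guide_tree(sim, ["A", "B", "C"]))   # e.g. ('C', ('A', 'B'))
```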

3.3 Stage 3: Consistency transformation

Consistency Transformation was added to the framework through the concept of constraint lists, as implemented in the widely adopted tool T-COFFEE [23], which incorporates the posterior probabilities used earlier by the ProbCons [30] tool. The constraint list is used in the profile alignment stage to enforce the alignment of certain residues (anchor points) by adding constraint list values to the substitution matrix used by Clustal W.

3.4 Stage 4: Profile alignment

The technique used to align profiles in our framework is almost identical to the one adopted by Clustal W [26]. Default features like sequence weighting, position-specific gap penalties, and adaptive weight matrix choices were maintained and, to implement Consistency Transformation, support for the integration with constraint lists has been incorporated. However, the technique for delayed alignments, which overrides the Guide Tree order, has been disabled in order to be able to identify the impact of techniques chosen in earlier stages [19].

3.5 Stage 5: Iterative refinement

Each iteration of the Tree Restricted Partitioning [31] refinement algorithm applies the following steps to each edge of the Guide Tree, starting with the edges farthest from the root and moving up towards the root: cut an edge of the Guide Tree, forming two sub-trees; considering the set of aligned sequences in the profile of each sub-tree, remove any column made up entirely of gaps; realign the two resulting profiles to create an intermediate MSA; and calculate the WSP score of this intermediate MSA. If this score is higher than the WSP of the current best MSA, the intermediate MSA becomes the new current best MSA solution. The iterative refinement process ends when an iteration fails to find an MSA that improves the WSP or when a certain number of iterations, defined by the user, has been reached.

While ProbCons [30] adopts a simpler approach where, during a user-defined number of iterations, sequences are randomly divided into two profiles, the MUSCLE implementation was preferred here in order to investigate the scope of improvement that can be achieved by this technique. It is important to emphasize that Iterative Refinement seeks to improve the WSP score of the generated alignment, which does not necessarily imply improving the quality of this alignment.

3.6 Framework variations and strategies

Currently, our framework has 20 variations of progressive alignment, given that the first stage has a choice of 5 techniques to calculate the similarity score (Section 3.1) and 4 guide tree generation algorithms in the second stage (Section 3.2). The names of the combinations, referenced extensively in Section 4.3, are illustrated in Table 1.

Table 1. The 20 progressive alignment variations currently in the common framework.

FULL KMERS LCS PROBA QUICK
NJ F.NJ K.NJ L.NJ P.NJ Q.NJ
SLMAX F.SLMAX K.SLMAX L.SLMAX P.SLMAX Q.SLMAX
SLMIN F.SLMIN K.SLMIN L.SLMIN P.SLMIN Q.SLMIN
UPGMA F.UPGMA K.UPGMA L.UPGMA P.UPGMA Q.UPGMA

The following section refers to the combination of a variation (defined by Stages 1 and 2) with a heuristic class as a strategy. Each of the four MSA heuristic classes is identified by a different combination of the remaining three stages of the framework: Progressive Alignment only employs Stage 4; Consistency-based heuristics, which apply an additional Consistency Transformation step (Section 3.3) before the final alignment, require Stages 3 and 4; Iterative heuristics append an iterative refinement step (Section 3.5) in Stage 5 after the profile alignment of Stage 4; finally, the combination of all three stages (Stages 3, 4 and 5) produces the somewhat less well-known class of Consistency-based Progressive Alignments with Iterative Refinement. Given the variations presented in Table 1 and these four classes, Section 4 evaluates a total of 80 different MSA strategies.
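The 80 strategies are simply the Cartesian product of the five Stage 1 scoring techniques, the four Stage 2 guide tree algorithms, and the four heuristic classes. The few lines below make the enumeration explicit; the "+class" suffix is only an illustrative notation introduced here, not the naming used in the tables.

```python
from itertools import product

scorers = ["F", "K", "L", "P", "Q"]         # FULL, KMERS, LCS, PROBA, QUICK (Stage 1)
trees = ["NJ", "SLMAX", "SLMIN", "UPGMA"]   # Stage 2 guide tree algorithms
classes = ["P", "C", "I", "CI"]             # heuristic classes (Stages 3-5)

strategies = [f"{s}.{t}+{c}" for s, t, c in product(scorers, trees, classes)]
print(len(strategies))   # 80
print(strategies[:4])    # ['F.NJ+P', 'F.NJ+C', 'F.NJ+I', 'F.NJ+CI']
```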

4 Experimental analysis

Instead of treating MSA tools as black boxes and comparing the quality of the alignments they produce, as has already been extensively covered in the literature, the aim of this section is to evaluate design alternatives combining different classical techniques or methods. For example, do Progressive Alignments benefit most from the use of Iterative Refinement, Consistency Transformation, or a combination of both? The quality of the alignments produced depends very much on the characteristics of the sequences themselves as well as on the combination of techniques adopted by an MSA tool; thus it is common to find that a given strategy can generate better, similar, or worse alignments than other approaches depending on the problem instance [19]. This section therefore also provides an analysis to reveal the statistical significance of the obtained results. An open question for researchers is whether there is a dominant strategy or heuristic class that could be used for the majority of problem instances.

As part of a thorough and detailed assessment, the 80 different MSA strategies within the framework were evaluated using two of the most frequently referenced protein sequence benchmarks: BAliBASE [27] and HOMSTRAD [28]. The accuracies of the alignments generated for both benchmarks were measured using a score, calculated with BAliBASE's own tool, which is referred to here as the Developer Score (DS) [4] to avoid any confusion with the earlier SP score definition. This tool counts the number of times the position of each pair of aligned residues in an MSA matches the corresponding residue pair and position in the correctly aligned reference MSA. To obtain a normalized value, the count is divided by the total number of aligned residue pairs in the reference alignment. Note that the DS only considers residue pairs of the columns in the reference alignment where the number of gaps is less than 20%.
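A sketch of the Developer Score as just described may clarify the metric. It assumes alignments are given as lists of equal-length strings whose rows are in the same sequence order as the reference, with each aligned residue pair identified by the original (ungapped) positions of its two residues; the filter that discards reference columns with 20% or more gaps is omitted for brevity.

```python
def aligned_pairs(alignment, gap_char="-"):
    """Return the set of aligned residue pairs, each identified by
    (seq_i, pos_i, seq_j, pos_j), where pos_* are positions in the
    original (ungapped) sequences."""
    n, length = len(alignment), len(alignment[0])
    counters = [0] * n            # next ungapped position of each sequence
    pairs = set()
    for col in range(length):
        # ungapped position of every sequence in this column (None for a gap)
        col_pos = []
        for s in range(n):
            if alignment[s][col] == gap_char:
                col_pos.append(None)
            else:
                col_pos.append(counters[s])
                counters[s] += 1
        for i in range(n - 1):
            for j in range(i + 1, n):
                if col_pos[i] is not None and col_pos[j] is not None:
                    pairs.add((i, col_pos[i], j, col_pos[j]))
    return pairs


def developer_score(test, reference):
    """Fraction of the reference's aligned residue pairs reproduced by the
    test alignment (the <20% gap column filter is omitted here)."""
    ref_pairs = aligned_pairs(reference)
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)
```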

As noted previously in [19, 20, 23], the qualities of alignments, as measured by the DS, tend to be higher for sequences with a high degree of similarity. With this in mind, the benchmark problem instances have been subdivided into three groups according to their Identity values.

The remainder of this section first describes some of the characteristics of the two chosen benchmarks, in Subsections 4.1 and 4.2, respectively. This is followed by two statistical evaluations: Subsection 4.3 presents the benefits of Consistency Transformation and Iterative Refinement for Progressive Alignments, while Subsection 4.4 compares the different techniques within each stage of Progressive Alignment heuristics.

Finally, while the aforementioned evaluation results might indicate which class of heuristics is likely to obtain superior alignments, Subsection 4.5 takes a different perspective when analyzing the performance of the different design options. It attempts to identify whether a given class is always dominated by another and how frequently each class finds the highest scoring alignment between them.

4.1 The BAliBASE benchmark

One of our motives is to evaluate the performance of MSA heuristics in relation to sequences with different degrees of Identity. While Reference Set 9 from the BAliBASE benchmark [27] consists of 4 subsets of alignments, only the first one, which is comprised of sequences with true positive linear motifs (important regions of proteins), comes separated into groups according to sequence Identity. A total of 84 separate MSA instances are divided into 3 groups:

  1. RV911 contains 29 distinct multiple alignment instances or BOXes. Each BOX is composed, on average, of 15 sequences, with less than 20% Identity between pairs. For this group, the average length of the sequences in a BOX ranges from 227 to 1474 residues.

  2. RV912 contains 28 distinct multiple alignment instances, each composed of, on average, 8 sequences, with identities of between 20% and 40% between pairs. For this group, the average sequence length in a BOX ranges from 98 to 978 residues.

  3. RV913 contains 27 distinct multiple alignment instances, each composed of, on average, 10 sequences, with identities of between 40% and 80% between pairs. For this group, the average length of the sequences in a BOX ranges from 99 to 1374 residues.

4.2 The HOMSTRAD benchmark

HOMSTRAD [28] is a curated database of structure-based alignments for homologous protein families. Although HOMSTRAD is not a benchmark per se, as it contains numerous alignments in its database, it is used as a benchmark by several authors to test the accuracy of their tools. HOMSTRAD is composed of sequence alignments using structure information extracted from the Protein Data Bank (PDB) [46] and sequence alignments from other databases such as Pfam [47] and SCOP [48]. While HOMSTRAD contains 1,031 instances, most are comprised of only two sequences. In this paper, all of the 400 MSA instances with more than 2 sequences were used.

Unlike the selected BAliBASE set, which is already divided according to sequence Identity, HOMSTRAD does not employ such a classification; we have therefore grouped the instances according to the sequence Identity of their reference MSA, calculated using Eq 2.

Id(A) = \left( \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{\sum_{k=1}^{L} M(a_{ik}, a_{jk})}{L - G(i,j)} \right) \times \frac{100}{\binom{n}{2}}    (2)

where A is the MSA, L is the length of the aligned sequences, n is the number of sequences, and G(i, j) is the number of gap-only columns in the alignment of sequences i and j. M(a_{ik}, a_{jk}) is defined to be 1 if residue k of sequence i (a_{ik}) is equal to residue k of sequence j (a_{jk}) and neither is a gap; otherwise, M(a_{ik}, a_{jk}) is 0.
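A direct transcription of Eq 2 is given below, assuming the reference MSA is given as a list of equal-length strings and taking G(i, j) to be the number of columns in which both sequence i and sequence j have a gap.

```python
from itertools import combinations
from math import comb


def identity(alignment, gap_char="-"):
    """Average pairwise Identity of an MSA (Eq 2), with the alignment given
    as a list of equal-length strings."""
    n, length = len(alignment), len(alignment[0])
    total = 0.0
    for i, j in combinations(range(n), 2):
        matches = gap_only = 0
        for k in range(length):
            x, y = alignment[i][k], alignment[j][k]
            if x == gap_char and y == gap_char:
                gap_only += 1                     # contributes to G(i, j)
            elif x == y:
                matches += 1                      # M(a_ik, a_jk) = 1
        total += matches / (length - gap_only)
    return total * 100 / comb(n, 2)
```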

Based on the Identity values, the corresponding HOMSTRAD problem instances were also divided into 3 groups, using the same criteria adopted for BAliBASE, with the following characteristics:

  1. G1 contains 73 distinct MSA instances. Each instance is composed, on average, of 6 sequences, and has a sequence Identity percentage of less than 20%. For this group, the average length of the sequences in each problem instance ranged from 36 to 723 residues.

  2. G2 contains 209 distinct MSA instances, each composed of, on average, 6 sequences, and with a sequence Identity percentage between 20% and 40%. For this group, the average sequence length in each instance ranged from 14 to 859 residues.

  3. G3 contains 118 distinct MSA instances, each composed of, on average, 5 sequences, with a sequence Identity percentage between 40% and 80%. For this group, the average sequence length in each instance ranged from 23 to 861 residues.

4.3 Impact of consistency transformation and iterative refinement on progressive alignments

The principal aim of this paper is to evaluate the impacts that Consistency Transformation and Iterative Refinement have on the qualities of alignments produced by Progressive Alignment schemes. The relative qualities of the alignments produced by the 80 different MSA strategies (Subsection 3.6) were compared based on their respective Developer Scores for each of the 484 problem instances taken from the BAliBASE and HOMSTRAD benchmarks. Given that one MSA tool may produce a higher quality alignment than another tool for one instance and a lower one for another, it is common to compare tools based on their average accuracy (DS values) [22, 23, 34, 37] or determine the statistical significance of the differences between scores obtained by the strategies being compared [30].

To determine the latter, it was first necessary to verify whether the Developer Scores obtained by each of the 80 MSA strategies for a given benchmark group of problem instances followed a normal distribution, in order to identify whether a parametric or non-parametric test should be used. Given that the two benchmarks were each divided into 3 groups, and that there are 80 strategies, a total of 480 samples (2 benchmarks × 3 groups × 80 strategies) were evaluated. The frequencies of the p-values obtained by a Shapiro test for each sample are shown in Fig 1. The results show that 259 of the 480 samples (≈53.9%) have p-values lower than 0.05; for these, the hypothesis of normality can be rejected at the 95% confidence level. Given the high likelihood that at least one sample in any comparison does not follow a normal distribution, the non-parametric Wilcoxon signed-rank test [29] was chosen to compare the different heuristic classes.

Fig 1. The distribution of the Shapiro test p-values for all 480 samples.

In the context of this Wilcoxon test, to determine whether a given variant A obtains alignments with better Developer Scores than another variant B, it is necessary to refute the null hypothesis that the scores of variant B are greater than or equal to the corresponding ones of variant A, i.e., the opposite of our alternative hypothesis. The p-value derived from the test is the probability of observing differences at least as extreme as those in the samples if the null hypothesis were true. Should this p-value be higher than a predetermined threshold (by convention, 0.05), one concludes that the samples do not provide sufficient evidence that variant A has higher scores than B. Otherwise, one can refute the null hypothesis, as the chance of variant A obtaining a higher score than variant B is considered to be statistically significant [29].
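For readers wishing to reproduce this methodology, both tests are available in SciPy. The sketch below uses small hypothetical Developer Score samples (not data from this study), and the one-sided alternative "greater" corresponds to the alternative hypothesis that variant A scores higher than variant B.

```python
import numpy as np
from scipy.stats import shapiro, wilcoxon

# hypothetical Developer Scores for variants A and B on the same problem instances
scores_a = np.array([0.71, 0.64, 0.80, 0.55, 0.68, 0.73, 0.59, 0.77])
scores_b = np.array([0.66, 0.61, 0.79, 0.50, 0.69, 0.70, 0.52, 0.74])

# normality check, used to choose between parametric and non-parametric tests
_, p_norm_a = shapiro(scores_a)
_, p_norm_b = shapiro(scores_b)
print(f"Shapiro p-values: {p_norm_a:.3f}, {p_norm_b:.3f}")

# paired, one-sided Wilcoxon signed-rank test with alternative hypothesis "A > B"
_, p = wilcoxon(scores_a, scores_b, alternative="greater")
print(f"Wilcoxon p-value = {p:.4f}")   # p < 0.05 would reject H0 (scores of B >= scores of A)
```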

The experimental analysis focuses on comparing the following four heuristic classes, with 20 implementation variations for each: Progressive Alignment (P), composed of the 3 stages of this classic strategy; Consistency-based Alignment (C) that uses additional information to improve the final stage of the progressive strategy; Iterative Alignment (I) that, in an additional stage, makes a number of attempts (in these experiments, limited to a maximum of 100 iterations) to refine the alignment produced by the corresponding progressive strategy, and; the Consistency-based Progressive Alignment with Iterative Refinement (CI), where the iterative stage follows the generation of a Consistency-based progressive alignment.

Our evaluation is first summarized over three pairs of tables, Tables 2 to 7, with each table containing p-values for Wilcoxon signed-rank tests related to one of the six problem instance groups taken from the two benchmarks, HOMSTRAD and BAliBASE. Initially, Tables 2 and 3 present comparative results for the respective benchmarks groups with low Identity (i.e., the groups containing the problem instances with Identity percentages lower than 20%), Tables 4 and 5 show results for the groups where the instances have Identity percentages between 20% and 40%, and the last two (Tables 6 and 7) for the groups with instances that have Identity percentages greater than 40%.

Table 2. Wilcoxon test applied to the Developer Scores for G1.

C > P I > P C > I CI > P CI > I CI > C
F.NJ 0.00114 0.29929 0.00599 0.00185 0.00691 0.42319
F.SLMAX 0.00304 0.28054 0.01291 0.00207 0.01239 0.50390
F.SLMIN 0.00315 0.35647 0.00835 0.00257 0.00943 0.50234
F.UPGMA 0.00304 0.28584 0.01736 0.00254 0.01822 0.48907
K.NJ 8.30e-5 0.08205 0.00619 2.70e-5 0.00171 0.68628
K.SLMAX 5.10e-5 0.06642 0.00442 2.90e-5 0.00164 0.67298
K.SLMIN 8.10e-5 0.09690 0.00378 4.30e-5 0.00149 0.63032
K.UPGMA 6.90e-5 0.07600 0.00530 4.40e-5 0.00190 0.64790
L.NJ 2.60e-5 0.08234 0.00236 2.70e-5 0.00138 0.57987
L.SLMAX 1.80e-5 0.15443 0.00082 2.30e-5 0.00069 0.52730
L.SLMIN 0.00012 0.22677 0.00119 4.60e-5 0.00047 0.60945
L.UPGMA 0.00011 0.25856 0.00096 7.20e-5 0.00055 0.57067
P.NJ 9.60e-8 0.00762 0.00169 1.70e-8 0.00025 0.69456
P.SLMAX 6.90e-8 0.01686 0.00023 1.20e-8 3.20e-5 0.73507
P.SLMIN 1.00e-7 0.08659 3.10e-5 3.70e-8 7.90e-6 0.63548
P.UPGMA 7.00e-8 0.03483 5.40e-5 2.00e-8 1.50e-5 0.67369
Q.NJ 0.00053 0.15073 0.01507 0.00085 0.01606 0.47114
Q.SLMAX 0.00373 0.26751 0.01839 0.00342 0.01559 0.49688
Q.SLMIN 0.00592 0.28518 0.01686 0.00434 0.01208 0.51327
Q.UPGMA 0.00599 0.26494 0.02111 0.00579 0.02131 0.49141

Table 7. Wilcoxon test applied to the Developer Scores for RV913.

C > P I > P C > I CI > P CI > I CI > C
F.NJ 0.34214 0.53793 0.32018 0.33581 0.31095 0.49655
F.SLMAX 0.35495 0.60243 0.28693 0.36466 0.29884 0.51726
F.SLMIN 0.37771 0.57550 0.30182 0.37443 0.29881 0.50000
F.UPGMA 0.39760 0.57551 0.33265 0.41437 0.34531 0.51382
K.NJ 0.30488 0.56191 0.28400 0.29286 0.26951 0.48619
K.SLMAX 0.18644 0.51381 0.17956 0.18877 0.18413 0.50345
K.SLMIN 0.23903 0.51381 0.24173 0.26100 0.26384 0.50691
K.UPGMA 0.25265 0.54480 0.24173 0.28105 0.27239 0.53105
L.NJ 0.41438 0.51381 0.39094 0.38432 0.36139 0.46551
L.SLMAX 0.35173 0.58229 0.25263 0.35817 0.25818 0.51726
L.SLMIN 0.38763 0.57551 0.32639 0.38432 0.32328 0.50346
L.UPGMA 0.34852 0.55848 0.28692 0.34852 0.28692 0.50345
P.NJ 0.02190 0.47929 0.02283 0.02283 0.02378 0.51727
P.SLMAX 0.01369 0.47240 0.01430 0.01369 0.01494 0.50345
P.SLMIN 0.03332 0.48964 0.03463 0.03397 0.03598 0.50691
P.UPGMA 0.01595 0.50000 0.01737 0.01595 0.01737 0.50345
Q.NJ 0.34851 0.45862 0.34851 0.34214 0.34214 0.48619
Q.SLMAX 0.40430 0.61239 0.29286 0.40766 0.29584 0.50345
Q.SLMIN 0.41101 0.54823 0.33898 0.40765 0.33582 0.49655
Q.UPGMA 0.38102 0.60907 0.26952 0.38432 0.27526 0.50346

Table 3. Wilcoxon test applied to the Developer Scores for RV911.

C > P I > P C > I CI > P CI > I CI > C
F.NJ 0.00293 0.34298 0.00628 0.00383 0.00670 0.50621
F.SLMAX 0.00700 0.44742 0.00943 0.00692 0.00983 0.49069
F.SLMIN 0.00867 0.45974 0.00924 0.00757 0.00796 0.47519
F.UPGMA 0.02160 0.53409 0.02160 0.02042 0.02042 0.48449
K.NJ 8.12e-5 0.15050 0.00096 0.00011 0.00128 0.48139
K.SLMAX 5.38e-6 0.26951 4.03e-5 8.84e-6 5.92e-5 0.47519
K.SLMIN 7.87e-5 0.34870 0.00029 6.52e-5 0.00021 0.45357
K.UPGMA 2.80e-6 0.29309 1.95e-5 4.18e-6 2.91e-5 0.46591
L.NJ 0.00128 0.25930 0.00429 0.00098 0.00302 0.44433
L.SLMAX 0.00021 0.28246 0.00089 0.00013 0.00055 0.41988
L.SLMIN 0.00115 0.36611 0.00262 0.00086 0.00269 0.46591
L.UPGMA 0.00227 0.37786 0.00460 0.00177 0.00374 0.44741
P.NJ 5.86e-7 0.19190 7.67e-6 5.65e-7 5.99e-6 0.39571
P.SLMAX 1.62e-7 0.12329 3.36e-6 1.32e-7 2.16e-6 0.39573
P.SLMIN 8.67e-7 0.35159 5.18e-7 5.01e-7 1.05e-6 0.43820
P.UPGMA 6.10e-7 0.41685 1.48e-6 5.22e-7 1.18e-6 0.46901
Q.NJ 0.00091 0.30118 0.00310 0.00064 0.00295 0.47829
Q.SLMAX 0.00256 0.43208 0.00401 0.00211 0.00340 0.47519
Q.SLMIN 0.00239 0.50931 0.00282 0.00244 0.00250 0.49690
Q.UPGMA 0.00562 0.41079 0.00963 0.00439 0.00747 0.45973

Table 4. Wilcoxon test applied to the Developer Scores for G2.

C > P I > P C > I CI > P CI > I CI > C
F.NJ 0.03341 0.21291 0.14962 0.03866 0.15189 0.50485
F.SLMAX 0.04377 0.36248 0.09070 0.06118 0.10620 0.53983
F.SLMIN 0.04251 0.33346 0.10088 0.05972 0.12503 0.54705
F.UPGMA 0.04916 0.36097 0.10038 0.06545 0.11696 0.53356
K.NJ 0.00445 0.12737 0.06291 0.00498 0.07176 0.50953
K.SLMAX 0.00553 0.16280 0.05368 0.00724 0.05963 0.52405
K.SLMIN 0.00360 0.14551 0.04491 0.00573 0.06157 0.54914
K.UPGMA 0.00331 0.12881 0.04769 0.00499 0.06152 0.54063
L.NJ 0.01553 0.26941 0.06392 0.01810 0.06919 0.50194
L.SLMAX 0.01652 0.23630 0.07830 0.02010 0.08345 0.51114
L.SLMIN 0.01879 0.29792 0.06201 0.02440 0.07083 0.53001
L.UPGMA 0.01697 0.27994 0.06098 0.02090 0.06979 0.52131
P.NJ 1.10e-5 0.00101 0.12519 3.73e-6 0.07648 0.37777
P.SLMAX 2.00e-6 0.00179 0.04067 5.52e-7 0.02103 0.38440
P.SLMIN 5.60e-6 0.02673 0.00700 3.71e-6 0.00530 0.44478
P.UPGMA 1.50e-6 0.00556 0.01665 5.17e-7 0.00976 0.41473
Q.NJ 0.02736 0.17556 0.17400 0.03234 0.17851 0.50711
Q.SLMAX 0.03335 0.28008 0.10196 0.04721 0.12411 0.55634
Q.SLMIN 0.03593 0.32395 0.08886 0.05285 0.11072 0.56815
Q.UPGMA 0.03427 0.30314 0.09603 0.04638 0.11057 0.54401

Table 5. Wilcoxon test applied to the Developer Scores for RV912.

C > P I > P C > I CI > P CI > I CI > C
F.NJ 0.03837 0.39025 0.05879 0.03444 0.05148 0.47386
F.SLMAX 0.02759 0.46408 0.03202 0.03086 0.03508 0.51308
F.SLMIN 0.02148 0.43812 0.02759 0.02812 0.03445 0.54243
F.UPGMA 0.02974 0.47060 0.03321 0.03382 0.03572 0.53267
K.NJ 0.00894 0.54568 0.00782 0.00875 0.00782 0.49673
K.SLMAX 0.01691 0.49019 0.01691 0.01656 0.01725 0.50000
K.SLMIN 0.02415 0.48366 0.02811 0.02509 0.02973 0.51961
K.UPGMA 0.01434 0.52287 0.01319 0.01464 0.01464 0.51308
L.NJ 0.01590 0.49346 0.01796 0.01870 0.02107 0.52289
L.SLMAX 0.02149 0.50981 0.02279 0.02147 0.02368 0.48692
L.SLMIN 0.03906 0.48039 0.04978 0.03382 0.04339 0.49673
L.UPGMA 0.02323 0.50000 0.02558 0.02191 0.02557 0.50981
P.NJ 0.00235 0.39973 0.00247 0.00212 0.00218 0.47060
P.SLMAX 0.00030 0.36844 0.00048 0.00019 0.00038 0.46408
P.SLMIN 0.00098 0.32608 0.00122 0.00066 0.00092 0.49346
P.UPGMA 0.00030 0.33797 0.00044 0.00024 0.00041 0.48692
Q.NJ 0.02279 0.50000 0.02368 0.02149 0.02192 0.46734
Q.SLMAX 0.01066 0.43813 0.01465 0.01265 0.01833 0.52940
Q.SLMIN 0.03262 0.53266 0.03573 0.03384 0.03572 0.52940
Q.UPGMA 0.01908 0.51634 0.01833 0.02025 0.02025 0.51962

Table 6. Wilcoxon test applied to the Developer Scores for G3.

C > P I > P C > I CI > P CI > I CI > C
F.NJ 0.30899 0.50114 0.30430 0.32705 0.32327 0.52286
F.SLMAX 0.32327 0.41677 0.39748 0.35900 0.42910 0.53237
F.SLMIN 0.32705 0.40822 0.41490 0.34799 0.42685 0.51258
F.UPGMA 0.31745 0.41193 0.38903 0.34940 0.41788 0.52857
K.NJ 0.28060 0.45019 0.31102 0.29296 0.31882 0.50534
K.SLMAX 0.27292 0.44755 0.29500 0.29001 0.31034 0.51258
K.SLMIN 0.28545 0.45208 0.31610 0.29428 0.32190 0.49962
K.UPGMA 0.27547 0.45625 0.29698 0.27547 0.29500 0.49009
L.NJ 0.32153 0.51144 0.30160 0.38244 0.35720 0.57203
L.SLMAX 0.29728 0.45095 0.33430 0.36903 0.40819 0.59069
L.SLMIN 0.32736 0.45738 0.36471 0.37698 0.41675 0.56866
L.UPGMA 0.31267 0.45057 0.35044 0.37880 0.41526 0.58362
P.NJ 0.05057 0.25374 0.16091 0.05522 0.16212 0.51372
P.SLMAX 0.03383 0.25623 0.09584 0.03166 0.09013 0.49428
P.SLMIN 0.03153 0.23696 0.10474 0.03000 0.09930 0.50534
P.UPGMA 0.03276 0.24586 0.09663 0.02822 0.08302 0.48742
Q.NJ 0.28098 0.48133 0.28747 0.30731 0.31814 0.53237
Q.SLMAX 0.30362 0.45929 0.33850 0.34940 0.38465 0.55283
Q.SLMIN 0.31677 0.45285 0.35472 0.35118 0.39160 0.53996
Q.UPGMA 0.28195 0.44265 0.32326 0.32843 0.37483 0.56038

Each table is composed of 7 columns, the first being a list of the 20 variations of techniques used in the first (Subsection 3.1) and second (Subsection 3.2) stages of a Progressive Alignment. Each combination has been incorporated into a representative MSA strategy for each of the four heuristic classes (P, C, I, and CI). The corresponding variants of each heuristic class are then compared in a pairwise fashion using a Wilcoxon signed-rank test, based on the Developer Scores obtained for the set of problem instances in the benchmark group of that table. Thus, each table presents a further six columns with headings that identify the alternative hypothesis of the pairwise test. P-values less than 0.05 (highlighted in bold) mean the null hypothesis can be rejected, affirming the condition of the column heading: C > P indicates which Consistency-based variants are statistically better than their Progressive-only ones; I > P, which Iterative variants are better than their Progressive-only ones; C > I, which Consistency-based variants are better than Iterative ones; and similarly for the columns CI > P, CI > I, and CI > C, which Consistency-based Progressive Alignment with Iterative Refinement variants are better than their respective Progressive, Iterative, and Consistency-based approaches.

When considering alignments for the problem instances with low Identity (groups RV911 and G1), the results presented in Tables 2 and 3 clearly show that MSA heuristics that employ Consistency Transformation (C and CI) are, in general, better than both Progressive Alignment (C > P and CI > P) and Iterative Alignment (C > I and CI > I), given that the p-values in the four corresponding columns, in both tables, are all less than 0.05. In contrast to the comparisons with the other heuristics, the test results for the heuristics combining consistency with Iterative Refinement (CI) were similar to those obtained by the Consistency-based heuristics (C), so much so that the differences between the respective pairs were not statistically significant, as seen in the columns CI > C. The qualities of the alignments varied depending on the variant strategy and problem instance, without revealing any specific trend as to which of these two heuristic classes is better.

Furthermore, Iterative Refinement (I), an alternative approach to address the issue of inopportune gaps in progressive alignments, also does not appear to produce variants whose improvements over their P counterparts are statistically significant, i.e., the distributions of scores do not support the hypothesis that I > P. In fact, only 3 of the four PROBA variants of the iterative strategy (those using the Guide Tree construction algorithms NJ, SLMAX, and UPGMA, respectively) can claim statistically significant improvements over their corresponding Progressive Alignment counterparts, and this only occurred for group G1.

Tables 4 and 5 present the Wilcoxon test results for the groups, RV912 and G2, with Identity percentages between 20% and 40%. The p-values already indicate a change in tendencies with respect to the previous groups, in particular G2 (note the fewer bold numbers compared with Tables 2 and 3). While all 20 Consistency-based variants continue to be statistically better than their progressive alignment ones for both instance groups, the quality of the alignments they produce when compared to those of the Iterative refinement heuristics (see column C > I) were statistically significant for only 5 of 20 variants, for the HOMSTRAD group, and 19 of 20 for BAliBASE. The stark difference between these two groups may in part be due to the distribution of Identity values within each group. While 92.86% of instances in RV912 have Identity values below 30%, in G2 only 47.85% have Identities below the same threshold.

The heuristics that combine Consistency Transformation and Iterative Refinement (CI) exhibit a somewhat similar behavior to the Consistency-based ones when compared to both their respective Progressive Alignment (CI > P) and Iterative Refinement (CI > I) equivalents. However, the number of variants for which the differences in alignment quality were statistically significant drops slightly to 3 for G2 and stays the same, 19, for RV912. Comparing CI with C gives rise to p-values around 0.5; therefore, again, statistically there appears to be little benefit in combining these two techniques. Finally, only in the case of G2 can it be said that the technique of Iterative Refinement improved the qualities of Progressive Alignments that employ the scoring technique PROBA.

The last two groups of Wilcoxon test results (Tables 6 and 7) are for instances with Identity percentages greater than 40%. For these types of sequences, no particular heuristic class is better than the others, with statistical significance, for any variation. As indicated by the relatively few p-values in bold, the tests show that statistically significant improvements were only seen for Consistency-based variants that use PROBA (except with NJ) over their corresponding progressive aligners in the case of group G3, and over both their corresponding progressive and iterative aligners in the case of group RV913.

In summary, based on pairwise Wilcoxon signed-rank tests to identify a dominant heuristic class, we find (with 95% confidence) that, for Identity values between 40% and 80%, there is in general no statistically significant difference between corresponding implementations of the four heuristic classes: Progressive Alignment (P); Consistency-based Alignment (C); Iterative Alignment (I); and Consistency-based Progressive Alignment with Iterative Refinement (CI). As the Identity values decrease, Consistency-based progressive alignments (C and CI) begin to achieve statistically significant improvements over both Progressive (P) and Iterative (I) Alignments. However, independent of Identity values, the addition of an Iterative stage to the MSA pipeline of both the P and C classes of heuristics does not appear to be statistically beneficial. As mentioned earlier, one reason for this is the difference between the WSP function being optimized by the Iterative Refinement technique and the DS used to measure alignment quality: we have observed several instances where improvements in the WSP score during the iterative stage even led to worse Developer Scores.

4.4 The impact of similarity scoring and guide tree generation techniques on Progressive-based Alignments

Following a similar line of investigation as the previous section, one might be interested in knowing which are the best techniques for calculating similarity scores or for generating guide trees, in general. A preliminary analysis of these techniques was presented in [19] and to complement that work with a statistical analysis, this section uses the same methodology presented in Section 4.3.

To compare the 5 similarity scoring techniques, the Developer Scores (DS) obtained by each technique, for each of the 6 benchmark Identity groups, were gathered to form 6 sets of samples for that technique irrespective of the combination of guide tree technique and remaining stages used by the strategy.

Table 8 presents the Wilcoxon test results comparing the 5 techniques (FULL, KMERS, LCS, QUICK and PROBA), represented by their initial letter. As shown previously in Subsection 4.3, values in bold indicate the cases where the null hypothesis was rejected. In these cases, the statement described in the column header is true for the given benchmark group. For example, the first p-value of 0.00239, being less than 0.05, indicates that FULL is better than KMERS (F > K) for HOMSTRAD G1.

Table 8. Wilcoxon test to compare similarity scoring techniques.

F > K F > L F > P F > Q K > P
G1 0.00239 0.01106 4.70e-11 0.44092 1.28e-5
G2 0.03151 0.21076 1.31e-11 0.39444 7.68e-7
G3 0.45689 0.59746 0.00246 0.49723 0.00373
RV911 1.73e-6 0.0043 6.99e-13 0.10243 0.00142
RV912 0.12054 0.26368 0.00013 0.42194 0.00533
RV913 0.24945 0.44931 0.00004 0.45232 0.00023
L > K L > P Q > K Q > L Q > P
G1 0.33175 3.46e-6 0.00511 0.01819 2.13e-10
G2 0.11721 3.04e-10 0.05470 0.31272 8.88e-11
G3 0.37968 0.00117 0.47667 0.59894 0.00287
RV911 0.04065 2.38e-6 0.00037 0.07181 7.68e-10
RV912 0.26469 0.00069 0.17995 0.33926 0.00051
RV913 0.26848 0.00007 0.27607 0.50261 0.00008

The first observation is that PROBA is worse, with statistical significance, than the other four techniques, regardless of the benchmark group. For alignments with Identities of less than 20%, FULL is statistically better than three of the other techniques, the exception being QUICK, for both HOMSTRAD and BAliBASE. The same can be said about QUICK being better than KMERS and LCS for HOMSTRAD G1, and better than KMERS for BAliBASE RV911. Also, for RV911, LCS was better than KMERS and PROBA. For the other groups of the two benchmarks, the only other conclusion that can be drawn is that FULL is also better than KMERS for HOMSTRAD G2.

Analogous to Table 8, Table 9 presents the Wilcoxon test results for the pairwise comparison of the 4 guide tree generation techniques, Neighbor-Joining, UPGMA, SLMIN and SLMAX, represented in the table by NJ, UPG, MIN and MAX, respectively. As all the p-values in the table are above 0.05, none of the guide tree generation techniques is statistically superior to any of the others, indicating that choosing the best guide tree generation technique for a given MSA problem instance might be a difficult task.

Table 9. Wilcoxon test to compare guide tree generation techniques.

NJ > MAX NJ > MIN NJ > UPG MAX > MIN MAX > UPG MIN > UPG
G1 0.62064 0.74468 0.73032 0.63280 0.62597 0.49377
G2 0.41524 0.55356 0.42957 0.63601 0.51379 0.37595
G3 0.44523 0.46304 0.43036 0.51789 0.48532 0.46719
RV911 0.37448 0.71471 0.75173 0.80100 0.82343 0.54210
RV912 0.47616 0.68403 0.57262 0.70955 0.60335 0.37129
RV913 0.48073 0.55148 0.54623 0.58000 0.56133 0.49627

In summary, with the exception of PROBA, which is statistically a worse performer than the other 4 similarity scoring techniques, a given technique stood out over the others only in specific scenarios. FULL and QUICK, both from the original Clustal W implementation, were better only for alignments with low Identities. No such conclusions could be drawn for the guide tree generation techniques. These results corroborate the difficulty of choosing a single strategy to obtain the best alignments and suggest that the use of a tool capable of testing different techniques, variations, and strategies might be beneficial to the user.

4.5 Analyzing the top performances by heuristic class

As with any heuristic, there is no guarantee that it will always find the optimal solution. As the previous subsection suggests, although based on perhaps a relatively small set of instances, identifying a unique heuristic that consistently finds the best alignments is not easy. Even when Identity values are below 40%, although Consistency-based heuristics might obtain statistically better quality alignments for most problem instances, this does not mean that they find the best alignment for every instance. Rather than present average Developer Scores, this subsection analyses the frequency with which each class of heuristic obtained the best score that was found.

The percentage of times the best DS was obtained by each of the four heuristic classes (P, C, I, and CI), regardless of which variant was used, can be seen in Figs 2 to 7, in relation to the 6 respective benchmark groups. The blue-colored portion represents the percentage of times that a given class obtained this best score but was equaled by a variant of another class. The red portion, however, represents the percentage of times that this class was the only one to obtain the highest score. The percentage values in black are the sum of these two values.

Fig 2. G1: Percentage of unique best and tied best alignments per class.

Fig 7. RV913: Percentage of unique best and tied best alignments per class.

The frequencies of top performers for the groups of instances in the lowest Identity range are shown in Figs 2 and 3. First, notice that the distributions are different for the two benchmarks. As expected, C and CI obtain the best Developer Scores most of the time. But of note for G1 (and true for G2 and G3 too) is the high frequency with which only one class is able to find the best alignment, almost 80%, with all four classes contributing to this statistic. Furthermore, the relatively novel and less explored class CI found improved alignments in a quite significant 26% of the instances in G1 and 34% of RV911 when compared with more traditional approaches. While C found the best alignment in 54.79% of the G1 cases, it was bettered by P and I a combined, and not insignificant, 16.44% of the time. In the case of RV911, CI was the best overall performer, finding the best score 75.86% of the time, while tying with C in 12 of the 29 instances. Both P and I performed poorly in terms of this metric: while no Progressive Alignment obtained the best score, Iterative Alignment did uniquely find the best alignment for one of the problem instances.

Fig 3. RV911: Percentage of unique best and tied best alignments per class.

A broadly similar pattern occurs for the two groups with Identity percentages between 20% and 40%. For G2 (Fig 4), there is a smaller disparity between all 4 classes, both in terms of overall top scores and unique top scores, although C and CI again have an advantage. Still, to be sure of finding the best alignments, these results indicate the need to use more than one tool, or a tool that implements all four classes. For RV912 (Fig 5), the heuristics that use Consistency Transformation (C and CI) again dominate, losing out in only two instances to the base progressive approach. Iterative Refinement has mixed success, again due to its scoring optimization: CI improved 4 instances and worsened 4; I was unable to match or improve on the two top Progressive Alignments.

Fig 4. G2: Percentage of unique best and tied best alignments per class.

Fig 5. RV912: Percentage of unique best and tied best alignments per class.

Finally, Figs 6 and 7 refer to the higher Identity value groups G3 and RV913, where there is now a closer equilibrium between all four classes, with a difference of just 15.25% for G3 and 11.11% for RV913. This is further emphasized by the significant number of tied top scores. In the case of G3, the surprising fact is that the base progressive approach obtained the highest number of best scores. With respect to tied scores, when Identity values are above the twilight zone, both the C and CI approaches tend to be less effective, meaning simpler strategies can be equally successful.

Fig 6. G3: Percentage of unique best and tied best alignments per class.

5 Conclusions

Choosing the best tool to perform MSA is often difficult due to the diverse range of heuristics available and the variety of configuration options. Efforts to overcome a disadvantage of the classical Progressive Alignment approach have led to the emergence of tools that use one of two techniques: Iterative Refinement or Consistency Transformation. Rather than propose a new tool or compare existing tools, this work focuses on understanding the impact of the techniques employed by successful tools on alignment quality. Using an integrated framework [19], this paper presented a statistical evaluation of the alignment accuracy obtained by 80 combinations of existing techniques in order to evaluate the heuristic classes: Progressive Alignment; Progressive Alignment with Iterative Refinement; Consistency-based Alignment; and the lesser-known Consistency-based Alignment with Iterative Refinement. For sequences with a reasonable degree of similarity (above 40% Identity), all four heuristic classes are statistically comparable. Below this threshold, the two consistency-based heuristic classes stand out, while the benefits of adding an Iterative Refinement stage to either Progressive or Consistency-based Alignments are significantly smaller in comparison.

Given that no single class of heuristic appears to be a silver bullet, our proposed framework aims to explore a larger solution space by evaluating various, possibly novel, combinations of techniques simultaneously, while re-utilizing common computations, in order to find better solutions with less effort for a specific sequence alignment instance. While the focus of this work has not been to compare specific MSA tools, the potential of such a framework can be indicated by its competitiveness with respect to 3 commonly adopted solutions: MUSCLE (which uses Iterative Refinement), and ProbCons and T-COFFEE (which employ Consistency Transformation). Across the HOMSTRAD benchmark groups, the three tools respectively bettered the framework in only 4.5%, 16%, and 16.75% of the 400 instances, and in 0%, 14.29%, and 8.33% of the 84 BAliBASE instances.

Ongoing work continues to integrate other techniques used by MSA tools and novel variants, in addition to a fifth class of MSA heuristic (regressive heuristics), while placing an increased focus on optimizing the MSA processing workflow and reducing execution times through the adoption of parallel and distributed computing. The use of information regarding both the structural and functional importance of each amino acid in proteins is seen as the gold standard for evaluating alignments. In the longer term, we plan to investigate how to extend and adapt our framework for the multiple structural alignment problem and to study its applicability in comparison with structural alignment tools such as TM-align [49].

Data Availability

All Developer Scores for all instances are available at URL: https://doi.org/10.6084/m9.figshare.23518842.v2.

Funding Statement

This work has received financial support from the Brazilian funding agency CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico - https://www.gov.br/cnpq) through the project Universal, process number 404087/2021-3. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Edgar RC, Batzoglou S. Multiple sequence alignment. Current Opinion in Structural Biology. 2006;16(3):368–373. doi: 10.1016/j.sbi.2006.04.004 [DOI] [PubMed] [Google Scholar]
  • 2. Thompson JD, Linard B, Lecompte O, Poch O. A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives. PLoS ONE. 2011;6(3). doi: 10.1371/journal.pone.0018093 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Goh CS, Cohen FE. Co-evolutionary Analysis Reveals Insights into Protein–Protein Interactions. Journal of Molecular Biology. 2002;324(1):177–192. doi: 10.1016/S0022-2836(02)01038-0 [DOI] [PubMed] [Google Scholar]
  • 4. Mirarab S, Warnow T. FastSP: linear time calculation of alignment accuracy. Bioinformatics. 2011;27(23):3250–3258. doi: 10.1093/bioinformatics/btr553 [DOI] [PubMed] [Google Scholar]
  • 5. Kemena C, Notredame C. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics. 2009;25(19):2455–2465. doi: 10.1093/bioinformatics/btp452 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Przybylski D, Rost B. Alignments grow, secondary structure prediction improves. Proteins. 2002;46(2):197–205. doi: 10.1002/prot.10029 [DOI] [PubMed] [Google Scholar]
  • 7. Li D, Becchi M. Abstract: Multiple Pairwise Sequence Alignments with the Needleman-Wunsch Algorithm on GPU. In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis; 2012. p. 1471–1472. [Google Scholar]
  • 8. Needleman SB, Wunsch CD. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of two Proteins. Journal of Molecular Biology. 1970;48(3):443–453. doi: 10.1016/0022-2836(70)90057-4 [DOI] [PubMed] [Google Scholar]
  • 9. Carrillo H, Lipman D. The Multiple Sequence Alignment Problem in Biology. SIAM J Appl Math. 1988;48(5):1073–1082. doi: 10.1137/0148063 [DOI] [Google Scholar]
  • 10. Wang L, Jiang T. On the Complexity of Multiple Sequence Alignment. J Computational Biology. 1994;1(4):337–348. doi: 10.1089/cmb.1994.1.337 [DOI] [PubMed] [Google Scholar]
  • 11. Gotoh O. Heuristic Alignment Methods. In: Russell DJ, editor. Totowa, NJ: Humana Press; 2014. p. 29–43. [DOI] [PubMed] [Google Scholar]
  • 12. Gotoh O. Optimal alignment between groups of sequences and its application to multiple sequence alignment. Bioinformatics. 1993;9(3):361–370. doi: 10.1093/bioinformatics/9.3.361 [DOI] [PubMed] [Google Scholar]
  • 13. Katoh K, Toh H. Recent developments in the MAFFT multiple sequence alignment program. Briefings in Bioinformatics. 2008;9(4):286–298. doi: 10.1093/bib/bbn013 [DOI] [PubMed] [Google Scholar]
  • 14. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high‐quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology. 2011;7(1):539. doi: 10.1038/msb.2011.75 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Feng DF, Doolittle RF. Progressive Sequence Alignment as a Prerequisite to Correct Phylogenetic Trees. Journal of Molecular Evolution. 1987;25:351–360. doi: 10.1007/BF02603120 [DOI] [PubMed] [Google Scholar]
  • 16. Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. The University of Kansas Science Bulletin. 1958;38(22):1409–1438. [Google Scholar]
  • 17. Saitou N, Nei M. The Neighbor-joining Method: A New Method for Reconstructing Phylogenetic Trees. Molecular Biology and Evolution. 1987;4(4):406–425. [DOI] [PubMed] [Google Scholar]
  • 18. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14(9):755–763. doi: 10.1093/bioinformatics/14.9.755 [DOI] [PubMed] [Google Scholar]
  • 19. João M, Sena AC, Rebello VEF. On Using Consistency Consistently in Multiple Sequence Alignments. In: 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW); 2022. p. 152–161.
  • 20. Pei J. Multiple protein sequence alignment. Current Opinion in Structural Biology. 2008;18(3):382–386. doi: 10.1016/j.sbi.2008.03.007 [DOI] [PubMed] [Google Scholar]
  • 21. Rost B. Twilight zone of protein sequence alignments. Protein Engineering, Design and Selection. 1999;12(2):85–94. doi: 10.1093/protein/12.2.85 [DOI] [PubMed] [Google Scholar]
  • 22. Edgar RC. MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004;5(1):113. doi: 10.1186/1471-2105-5-113 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Notredame C, Higgins DG, Heringa J. T-Coffee: A Novel Method for Fast and Accurate Multiple Sequence Alignment. Journal of Molecular Biology. 2000;302(1):205–217. doi: 10.1006/jmbi.2000.4042 [DOI] [PubMed] [Google Scholar]
  • 24. Barton GJ, Sternberg MJ. A Strategy for the Rapid Multiple Alignment of Protein Sequences. Confidence Levels from Tertiary Structure Comparisons. Journal of Molecular Biology. 1987;198(2):327–337. doi: 10.1016/0022-2836(87)90316-0 [DOI] [PubMed] [Google Scholar]
  • 25. Notredame C. Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics. 2002;3(1):1. doi: 10.1517/14622416.3.1.131 [DOI] [PubMed] [Google Scholar]
  • 26. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22(22):4673–4680. doi: 10.1093/nar/22.22.4673 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: Latest Developments of the Multiple Sequence Alignment Benchmark. Proteins: Structure, Function and Genetics. 2005;61(1):127–136. doi: 10.1002/prot.20527 [DOI] [PubMed] [Google Scholar]
  • 28. Mizuguchi K, Deane CM, Blundell TL, Overington JP. HOMSTRAD: A database of protein structure alignments for homologous families. Protein Science. 1998;7(11):2469–2471. doi: 10.1002/pro.5560071126 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Wilcoxon F. Individual Comparisons by Ranking Methods. Biometrics Bulletin. 1945;1(6):80–83. doi: 10.2307/3001968 [DOI] [Google Scholar]
  • 30. Do CB, Mahabhashyam MSP, Brudno M, Batzoglou S. ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome research. 2005;15(2):330–40. doi: 10.1101/gr.2821705 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Hirosawa M, Totoki Y, Hoshida M, Ishikawa M. Comprehensive study on iterative algorithms of multiple sequence alignment. Bioinformatics. 1995;11(1):13–18. doi: 10.1093/bioinformatics/11.1.13 [DOI] [PubMed] [Google Scholar]
  • 32. Altschul SF, Carroll RJ, Lipman DJ. Weights for Data Related by a Tree. J of Molecular Biology. 1989;207(4):647–653. doi: 10.1016/0022-2836(89)90234-9 [DOI] [PubMed] [Google Scholar]
  • 33. Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences. 1992;89(22):10915–10919. doi: 10.1073/pnas.89.22.10915 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Katoh K, Misawa K, Kuma Ki, Miyata T. MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic acids research. 2002;30(14):3059–3066. doi: 10.1093/nar/gkf436 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Edgar RC. Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research. 2004;32(1):380–385. doi: 10.1093/nar/gkh180 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Kimura M. The Neutral Theory of Molecular Evolution. Cambridge University Press; 1983. [Google Scholar]
  • 37. Katoh K, Kuma Ki, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Research. 2005;33(2):511–518. doi: 10.1093/nar/gki198 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences. 1988;85(8):2444–2448. doi: 10.1073/pnas.85.8.2444 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Gotoh O. Consistency of optimal sequence alignments. Bulletin of Mathematical Biology. 1990;52(4):509–525. doi: 10.1007/BF02462264 [DOI] [PubMed] [Google Scholar]
  • 40. Myers EW, Miller W. Optimal alignments in linear space. Bioinformatics. 1988;4(1):11–17. doi: 10.1093/bioinformatics/4.1.11 [DOI] [PubMed] [Google Scholar]
  • 41. Bashford D, Chothia C, Lesk AM. Determinants of a protein fold: Unique features of the globin amino acid sequences. Journal of Molecular Biology. 1987;196(1):199–216. doi: 10.1016/0022-2836(87)90521-3 [DOI] [PubMed] [Google Scholar]
  • 42. João M Jr, Sena AC, Rebello VEF. On the parallelization of Hirschberg’s algorithm for multi-core and many-core systems. Concurrency and Computation: Practice and Experience. 2019;31(18):e5174. [Google Scholar]
  • 43. Hirschberg DS. A Linear Space Algorithm for Computing Maximal Common Subsequences. Communications of the ACM. 1975;18(6):341–343. doi: 10.1145/360825.360861 [DOI] [Google Scholar]
  • 44. Sibson R. SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal. 1973;16(1):30–34. doi: 10.1093/comjnl/16.1.30 [DOI] [Google Scholar]
  • 45. Plyusnin I, Holm L. Comprehensive comparison of graph based multiple protein sequence alignment strategies. BMC Bioinformatics. 2012;13(1):1–11. doi: 10.1186/1471-2105-13-64 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Berman H, Henrick K, Nakamura H. Announcing the worldwide Protein Data Bank. Nature Structural & Molecular Biology. 2003;10(12):980. doi: 10.1038/nsb1203-980 [DOI] [PubMed] [Google Scholar]
  • 47. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic Acids Research. 2014;42(Database issue). doi: 10.1093/nar/gkt1223 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Andreeva A, Kulesha E, Gough J, Murzin AG. The SCOP database in 2020: Expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Research. 2020;48(D1):D376–D382. doi: 10.1093/nar/gkz1064 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33(7):2302–2309. doi: 10.1093/nar/gki524 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Yang Zhang

10 Apr 2023

PONE-D-23-07871
On closing the inopportune gap with consistency transformation and iterative refinement
PLOS ONE

Dear Dr. João Junior,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by May 25 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Yang Zhang

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Multiple sequence alignment (MSA) construction is actually a very important problem in bioinformatics and biology research. MSA has two slightly different meanings in recent studies. One is the traditional MSA definition that tries to align a set of sequences all against all to make the entire alignment reach an optimized global quality. The other is more like the definition of homologous sequences collection, which tries to detect and align the homologous sequences to a query sequence. In this paper, Mario Joao Jr. et. al., tried to focus on the first kind of MSA construction problem and assess the performance of different MSA construction methods, i.e., Progressive Alignments, Iterative Alignments, Consistency-based Alignments, and Consistency-based Progressive Alignments with Iterative Refinement, on two datasets and their six corresponding sub-datasets. Overall, the results shown in this paper are reasonable and logical, and the statistical analysis is appropriately made. However, there are still several issues that exist in the paper. Moreover, more comprehensive data analyses need to be done before the manuscript can be published.

Major:

1. In first paragraph of “Background section”, when describe the flow of Iterative Alignment, please describe how to split the whole alignments into two clearly, and how to combine the two split group profiles together as a new MSA/alignments.

2. In “Section 3.2”, please give a brief methodology description of UPGMA, NJ, SLMAX and SLMIN, especially the latter two, which are not commonly used in the field.

3. In “Section 4.3”, the authors mentioned that “The results of Shapiro tests for all of the 480 samples found that more than half of the samples did not follow a normal distribution, thus the non-parametric Wilcoxon signed-rank test was used to compare different heuristic classes.” Please show the data. For example, a histogram of the p-values from the Shapiro test. In addition, I think the number should be ‘484’, please recheck it.

4. In result analysis section, I think authors could show some comparison results between the 20 variations in the first column of each table, I think the readers are also curious about the performance of NJ vs UPGMA etc.

5. HOMSTRAD is a structure-based MSA database. Each sequence in a MSA has the PDB structure in HOMSTRAD database. Thus a structure-based analysis should be added to the paper, so that the readers could know how the performances of those 80 MSA construction methods are related to the structure alignment, since protein structures could directly reflect the protein evolution. For example, instead of just calculating the DS score for G1, G2 and G3, the authors could calculate the average TM-score of the pairwise proteins in a MSA based on the MSA alignment using TM-align/TM-score tool. Then show the comparisons of the average TM-scores for those 80 MSA construction methods.

Minor:

1. Multiple sequence alignments actually have two meanings in recent studies. One is its traditional definition, aligning the multiple sequences with each other, inserting reasonable gaps, and letting all sequences align well. This one is mainly used for phylogeny inference, evolution analysis, etc. The other one recently used in protein/RNA structure and function prediction is more like homologous sequence detection, which aligns all homologous to a 'query' sequence. This first one is the NP-hard problem, as the authors mentioned, while the second is mostly N times of pairwise alignments (N is the number of homologous sequences in MSA). In the “Introduction section”, please clearly mention the first one is what the authors focus on.

2. Recent co-evolutionary analyses are also based on MSA, like contact/distance prediction. In the “Introduction section”, please talk about this point.

3. In the first paragraph of “Section 4.3”, when you first mention “20 variations”, please cite “Section 3.1 and 3.2”. In the end of “Section 3”, please clearly say how you get 20 variations by combining 5 methods in “Section 3.1” and 4 methods in “Section 3.2”. Since when I first read the word “80 combinations” and “20 variations”, I am slightly confused where those methods came from.

4. Text in figure 4 is overlapping.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Jul 13;18(7):e0287483. doi: 10.1371/journal.pone.0287483.r002

Author response to Decision Letter 0


30 May 2023

Major Points

1. In first paragraph of “Background section”, when describe the flow of Iterative Alignment, please describe how to split the whole alignments into two clearly, and how to combine the two split group profiles together as a new MSA/alignments.

R.: To describe how the alignments are split, the following paragraph was added as the second paragraph of Section 2.

“While the method for dividing an alignment into two profiles in step (b) varies according to each tool, they all require a given criterion to be adopted to determine the allocation of each sequence to one of the respective profiles. The criteria most commonly used range from being simple (e.g., a random choice [30]), to something more elaborate (e.g., Tree Restricted Partitioning [31] presented in Section 3.5).”
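For illustration only, the snippet below sketches step (b)'s simplest criterion, the random choice: it bipartitions a set of sequence identifiers into two non-empty profiles. The sequence names are made up, and the subsequent profile-profile re-alignment is not shown.

```python
# Purely illustrative: a random bipartition of an alignment's sequence
# identifiers into two non-empty profiles, the simplest split criterion
# mentioned above. Tool-specific criteria such as Tree Restricted
# Partitioning replace the random choice with a guide-tree-based one.
import random

def random_split(sequence_ids, rng=random):
    ids = list(sequence_ids)
    rng.shuffle(ids)
    cut = rng.randint(1, len(ids) - 1)  # keep both profiles non-empty
    return ids[:cut], ids[cut:]

profile_a, profile_b = random_split(["seq1", "seq2", "seq3", "seq4", "seq5"])
print(profile_a, profile_b)
```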

2. In “Section 3.2”, please give a brief methodology description of UPGMA, NJ, SLMAX and SLMIN, especially the latter two, which are not commonly used in the field.

R.: A brief description of UPGMA, NJ, SLMAX, and SLMIN has been added at the end of Section 3.2. The following text added was:

“With the degree of similarity between pairs of sequences being held in a similarity matrix, the agglomerative UPGMA begins to construct its guide tree by representing each sequence as a single node and repeatedly joining the two nodes of the pair of sequences with the highest score in the similarity matrix to form a rooted tree. The similarity matrix is then altered to reflect the grouping of the two nodes into one. The two lines and columns of the similarity matrix, representing the two grouped nodes are substituted by a single line and a single column for the new root, with the scores between the remaining nodes being calculated as the average of the two scores being removed. The algorithm continues until all the nodes have

been combined into a single cluster, representing the guide tree. SLMIN and SLMAX follow the same algorithm, however, the new scores are no longer the average of the two scores but rather the minimum (in the case of SLMAX) or the maximum (for SLMIN) score between pairs. Similarly, Neighbor-Joining (NJ) is also a bottom-up agglomerative hierarchical clustering approach that uses a similarity matrix. Starting with a central node connected to all other nodes (sequences), the two least distant nodes (most similar sequences) are grouped, forming a tree that is connected to the central node. The distances between groups are calculated based on the path from the central node to each group, and the matrix is updated to incorporate the new group. This process is repeated until the completion of the guide tree.”
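As a purely illustrative toy (not code from the framework described in the manuscript), the sketch below implements the agglomerative scheme shared by UPGMA, SLMIN, and SLMAX: it repeatedly merges the most similar pair of nodes in an assumed symmetric dict-of-dicts similarity matrix and rescores the merged node against the rest using the average (UPGMA), minimum (SLMAX), or maximum (SLMIN) rule described above. NJ's central-node distance recomputation is not covered.

```python
# Toy sketch only: agglomerative guide-tree construction on a similarity
# matrix. `sim` is assumed to be a symmetric dict-of-dicts without
# self-entries, e.g. sim["A"]["B"] == sim["B"]["A"]. `linkage` selects how
# the merged node is rescored against the remaining nodes.
def build_guide_tree(sim, linkage="average"):
    combine = {"average": lambda a, b: (a + b) / 2.0,
               "min": min, "max": max}[linkage]
    trees = {name: name for name in sim}                  # label -> nested tuple
    sim = {row: dict(cols) for row, cols in sim.items()}  # mutable copy
    while len(trees) > 1:
        labels = list(trees)
        # join the pair of current nodes with the highest similarity
        a, b = max(((x, y) for i, x in enumerate(labels) for y in labels[i + 1:]),
                   key=lambda pair: sim[pair[0]][pair[1]])
        merged = (trees.pop(a), trees.pop(b))
        new_label = str(merged)
        # rescore the merged node against every remaining node
        sim[new_label] = {}
        for other in trees:
            score = combine(sim[a][other], sim[b][other])
            sim[new_label][other] = score
            sim[other][new_label] = score
        trees[new_label] = merged
    return next(iter(trees.values()))

sim = {"A": {"B": 0.9, "C": 0.2, "D": 0.1},
       "B": {"A": 0.9, "C": 0.3, "D": 0.2},
       "C": {"A": 0.2, "B": 0.3, "D": 0.8},
       "D": {"A": 0.1, "B": 0.2, "C": 0.8}}
print(build_guide_tree(sim))          # (('A', 'B'), ('C', 'D')) with UPGMA linkage
print(build_guide_tree(sim, "min"))   # SLMAX-style rescoring
```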

3. In “Section 4.3”, the authors mentioned that “The results of Shapiro tests for all of the 480 samples found that more than half of the samples did not follow a normal distribution, thus the non-parametric Wilcoxon signed-rank test was used to compare different heuristic classes.” Please show the data. For example, a histogram of the p-values from the Shapiro test. In addition, I think the number should be ‘484’, please recheck it.

R.: A histogram of the p-values frequencies from the Shapiro test was added as Figure 1. To clarify the doubt with regard to the number of samples, the second paragraph of Section 4.3 now reads as follows:

“To determine the latter, it was first necessary to verify whether the Developer Scores obtained by each of the 80 MSA strategies for a given benchmark group of problem instances followed a normal distribution in order to identify whether a parametric or non-parametric test should be used. Given the two benchmarks were each divided into 3 groups, and the 80 strategies, a total of 480 samples (2 benchmarks × 3 groups × 80 strategies) were evaluated. The frequencies of the p-values obtained by a Shapiro test for each sample are shown in Fig 1. The results show that 259 of the 480 samples (≈ 53.9%) have p-values lower than 0.05 and thus, with 95% probability, these don’t follow a normal distribution. Thus, given the high probability that at least one sample does not follow a normal distribution, the non-parametric Wilcoxon signed-rank test [29] was chosen to compare the different heuristic classes.”
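A minimal sketch of this testing procedure, assuming SciPy is available (the manuscript does not state which statistical software was used) and a hypothetical mapping from each (benchmark group, strategy) pair to its list of Developer Scores, could look as follows.

```python
# Illustrative sketch, under the assumptions stated above.
from scipy import stats

def count_non_normal(samples, alpha=0.05):
    # Shapiro-Wilk test per sample: a p-value below alpha suggests the
    # sample does not follow a normal distribution.
    return sum(1 for values in samples.values()
               if stats.shapiro(values).pvalue < alpha)

def classes_differ(ds_class_a, ds_class_b, alpha=0.05):
    # Paired per-instance Developer Scores for two heuristic classes,
    # compared with the non-parametric Wilcoxon signed-rank test.
    _, p_value = stats.wilcoxon(ds_class_a, ds_class_b)
    return p_value < alpha
```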

4. In result analysis section, I think authors could show some comparison results between the 20 variations in the first column of each table, I think the readers are also curious about the performance of NJ vs UPGMA etc.

R.: Section 4.4 was added to the paper to present two statistical comparisons, similar to that shown in Section 4.3. The first is between similarity scoring techniques and the second is between guide tree generation techniques. The penultimate paragraph of the introduction in Section 4 has also been modified to incorporate the new section.

5. HOMSTRAD is a structure-based MSA database. Each sequence in a MSA has the PDB structure in HOMSTRAD database. Thus a structure-based analysis should be added to the paper, so that the readers could know how the performances of those 80 MSA construction methods are related to the structure alignment, since protein structures could directly reflect the protein evolution. For example, instead of just calculating the DS score for G1, G2 and G3, the authors could calculate the average TM-score of the pairwise proteins in a MSA based on the MSA alignment using TM-align/TM-score tool. Then show the comparisons of the average TM-scores for those 80 MSA construction methods.

R.: This is an interesting observation. We are already aware that many consider structural alignment as the gold standard for sequence alignment, and thus have future plans on our roadmap to incorporate structure-aware techniques into our framework. However, given its current stage of development, in this paper, we have opted to only focus on trying to understand the extent to which widely adopted techniques, methods, or heuristics in existing MSA tools impact the quality of the alignments produced. In this classic alignment scenario, none of the techniques evaluated make use of information about protein structures, only the nucleotide or amino acid sequence. Thus, we thought it would be unfair to draw conclusions based on the comparison suggested. We do however believe that such a comparison would be a good starting point to evaluate enhancements to classic alignment techniques to support structural alignment. Our initial plan was to start with generating a similarity matrix based also on structural information, which will allow us to evaluate the techniques in the rest of the framework and compare the alignments obtained with TM-align/US-align. In parallel, we are working on techniques that try to predict protein structure, based on NMR-obtained atomic distance data, so that we might still be able to use the framework in situations where the structural information is not available.

But replying specifically to the reviewer’s suggestion, given the time constraints for this review, our lack of experience in this area of structural alignments, and our unfamiliarity with the TM-score and how to provide modified structural information for the alignments produced by techniques designed for classic MSA alignments, we prefer to leave this for future work so that we can take a careful and more considered approach to making this comparison. Therefore, the following sentence was added to the conclusions in Section 5:

“The use of information regarding both the structural and functional importance of each amino acid in proteins is seen as the gold standard to evaluate alignments. In the longer term, we plan to investigate how to extend and adapt our framework for the multiple structural alignment problem and to study its applicability in comparison with structural alignment tools such as TM-align [45].”

Minor Points

1. Multiple sequence alignments actually have two meanings in recent studies. One is its traditional definition, aligning the multiple sequences with each other, inserting reasonable gaps, and letting all sequences align well. This one is mainly used for phylogeny inference, evolution analysis, etc. The other one recently used in protein/RNA structure and function prediction is more like homologous sequence detection, which aligns all homologous to a ’query’ sequence. This first one is the NP-hard problem, as the authors mentioned, while the second is mostly N times of pairwise alignments (N is the number of homologous sequences in MSA). In the “Introduction section”, please clearly mention the first one is what the authors focus on.

R.: The following paragraph was added as the second paragraph of Introduction. “Note that this problem is computationally different, for example, from the process to find homologous sequences, which consists of obtaining multiple pairwise sequence alignments of protein or RNA sequences generally against a single reference sequence [7]. Multiple sequence alignments tend to provide more information than pairwise alignments by highlighting conserved regions of aligned residues that may be of structural and functional relevance. The focus of this paper is thus on the former, the traditional MSA problem.”

2. Recent co-evolutionary analyses are also based on MSA, like contact/distance prediction. In the “Introduction section”, please talk about this point.

R.: We have included this term and a reference in the first paragraph of the Introduction.

3. In the first paragraph of “Section 4.3”, when you first mention “20 variations”, please cite “Section 3.1 and 3.2”. In the end of “Section 3”, please clearly say how you get 20 variations by combining 5 methods in “Section 3.1” and 4 methods in “Section 3.2”. Since when I first read the word “80 combinations” and “20 variations”, I am slightly confused where those methods came from.

R.: Section 3.6 has been added to clarify the meaning of the terms "variations" and "strategies". A table has also been inserted in this Section to explain how the 20 variations were obtained and their nomenclature.

4. Text in figure 4 is overlapping.

R.: Figure 5 (the old Figure 4) has been adjusted.

Attachment

Submitted filename: Rebuttal_letter.pdf

Decision Letter 1

Yang Zhang

6 Jun 2023

On closing the inopportune gap with consistency transformation and iterative refinement

PONE-D-23-07871R1

Dear Dr. João Junior,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Yang Zhang

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I think the authors have addressed all my concerns, and I do not have any further comments. Thank you, and congratulations!

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

Acceptance letter

Yang Zhang

4 Jul 2023

PONE-D-23-07871R1

On closing the inopportune gap with consistency transformation and iterative refinement

Dear Dr. João Jr.:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Yang Zhang

Academic Editor

PLOS ONE

