ReAlign-N: an integrated realignment approach for multiple nucleic acid sequence alignment, combining global and local realignments

Yixiao Zhai; Tong Zhou; Yanming Wei; Quan Zou; Yansu Wang

doi:10.1093/nargab/lqae170

2024 Dec 18;6(4):lqae170.


PMCID: PMC11655299 PMID: 39703429

Abstract

Ensuring accurate multiple sequence alignment (MSA) is essential for comprehensive biological sequence analysis. However, the complexity of evolutionary relationships often results in variations that generic alignment tools may not adequately address. Realignment is crucial to remedy this issue. Currently, there is a lack of realignment methods tailored for nucleic acid sequences, particularly for lengthy sequences. Thus, there is an urgent need for realignment methods better suited to address these challenges. This study presents ReAlign-N, a realignment method explicitly designed for multiple nucleic acid sequence alignment. ReAlign-N integrates both global and local realignment strategies for improved accuracy. In the global realignment phase, ReAlign-N incorporates K-Band and innovative memory-saving technology into the dynamic programming approach, ensuring high efficiency and minimal memory requirements for large-scale realignment tasks. The local realignment stage employs full matching and entropy scoring methods to identify low-quality regions and conducts realignment through MAFFT. Experimental results demonstrate that ReAlign-N consistently outperforms initial alignments on simulated and real datasets. Furthermore, compared to ReformAlign, the only existing multiple nucleic acid sequence realignment tool, ReAlign-N exhibits shorter running times and occupies less memory space. The source code and test data for ReAlign-N are available on GitHub (https://github.com/malabz/ReAlign-N).

Introduction

Multiple sequence alignment (MSA) is a fundamental technique for comparing and analyzing the arrangement of biological sequences, facilitating the assessment of their similarities and differences. The outcomes of MSA play a pivotal role in various biological applications, such as inferring species’ evolutionary relationships, predicting protein structure and function, and identifying genes. Ensuring precise MSA results becomes paramount for subsequent bioinformatics analyses (1,2). However, the substantial diversity in biological sequences and intricate evolutionary relationships often introduce notable variations, leading to suboptimal outcomes when utilizing conventional MSA tools. MSA methods are commonly categorized into star alignment and tree alignment. The reliance on a single center sequence selection method restricts the applicability of star alignment tools to datasets with varying degrees of similarity. Furthermore, while tree alignment employs heuristic methods to optimize operational time complexity, it may persist in incorrect insertions due to its adherence to the ‘once a gap, always a gap’ principle (3). Given the limitations of existing tools, many researchers resort to realignment methods as a solution to enhance the overall quality of MSA outcomes.

Current realignment tools are grouped based on their approach to partitioning initial alignments, falling into three categories: horizontal, vertical and a combination of both. Horizontal partitioning divides sequences within the initial alignment, while vertical partitioning segments it into blocks of sequence fragments. Among realignment methods employing horizontal partitioning, such as ReAligner (4), remove first method (RF) (5), REFINER (6) and ReformAlign (7), iterative optimization approaches are adopted to enhance their effectiveness. In iterative optimization, a subset of sequences is chosen for realignment in each iteration. If the new alignment improves, it serves as input for the subsequent iteration. This process continues until the objective function converges or the iteration limit is reached, yielding the final result. ReAligner distinguishes itself by comprehensively processing all sequences in each iteration, while RF and REFINER adopt random sequence selection for alignment with the profile or position-specific scoring matrix (PSSM), respectively. In contrast, ReformAlign constructs a summarized profile from the initial alignment, comparing sequences with this profile in each iteration and forming new alignment results by merging all pairwise sequence alignments. TreeRefiner (8), another horizontal partitioning realigner, utilizes 3D alignment without relying on iterative optimization. Nonetheless, horizontal partitioning does not guarantee the effective removal of erroneous gaps, leading to the potential persistence of numerous inaccuracies in the two generated profiles. These erroneous gaps endure even after the realignment process, introducing a notable level of uncertainty to the improvement of the initial alignment quality using this realignment method.

Refin-Align (9), SpliVert (10) and RPfam (11) leverage vertical partitioning for realignment, with Refin-Align and RPfam incorporating iterative optimization. Refin-Align initiates by partitioning the initial alignment into blocks based on columns with identical bases. Subsequently, it eliminates gaps within each block and conducts realignment using Promalign (12). Higher sum-of-pair (SP) scored blocks are selected and restored, concluding one iteration. In contrast, RPfam employs the simulated annealing algorithm during its iterations. In each cycle, a badly aligned block is randomly chosen, and the worst fragment undergoes realignment using dynamic programming. SpliVert vertically divides the alignment into three, realigns the middle section and splices it with the others. RASCAL (13) employs both horizontal and vertical partitioning, initially dividing into subfamilies using Secator (14). It identifies global core blocks through the NorMD objective function (15) and local core blocks for each subfamily. RASCAL then realigns poorly aligned regions using a ClustalW-like algorithm. However, among the aforementioned tools, ReformAlign is the sole tool specifically designed for nucleic acid sequences, while the other tools primarily cater to protein sequence datasets.

In this research, we have developed ReAlign-N, a specialized realignment method designed for aligning multiple nucleic acid sequences. This innovative approach seamlessly blends both global and local realignment strategies to enhance alignment accuracy. During the global realignment phase, ReAlign-N incorporates K-Band and memory-saving technology, integrating them into the dynamic programming method. This integration ensures not only heightened efficiency but also minimal memory requirements, which are particularly beneficial for large-scale realignment tasks. In the local realignment stage, two methods—fully matching and entropy scoring—are employed to identify the low-quality regions, which are then realigned through the application of MAFFT. The realignment result with a superior SP score is selected as the outcome of this stage. Moreover, ReAlign-N offers flexibility with two realignment schemes: one that initiates local realignment followed by global realignment, and another that begins with global realignment followed by local realignment. To validate their performance, ReAlign-N and ReformAlign were rigorously evaluated on four simulated and three real nucleotide datasets. The empirical findings consistently demonstrate that ReAlign-N outperforms the initial alignment in terms of quality across the majority of datasets. Notably, in comparison to ReformAlign, ReAlign-N not only exhibited shorter running times but also occupied less memory space. These advantages became more pronounced as the sequence length and quantity increased.

Materials and methods

The workflow of ReAlign-N

ReAlign-N demands two inputs: the original unaligned sequences and the initial alignment to be re-aligned, obtained through aligning the original sequences with an MSA tool. The tool features two realignment modes. The first involves realigning local low-quality areas and refining the results through a subsequent global realignment. The second mode starts with a global realignment, followed by a subsequent local realignment of low-quality areas in the results. Notably, the internal methods for both global and local realignment remain consistent across both modes (Figure 1A).

Global realignment

ReAlign-N incorporates a global re-alignment method inspired by ReformAlign (Figure 1B). It initiates by creating a summarized profile from the initial alignment, capturing the type and proportion of each base in each column of the initial alignment. Each unaligned sequence then undergoes individual realignment using this profile. During each realignment iteration, if a new insertion is detected in the profile, the algorithm automatically switches to fine-tuning mode (Supplementary Note S1) and re-aligns with the updated profile. After realigning all sequences, pairwise alignments are merged, and columns with only gaps are removed to produce the ultimate global realignment results. In the realignment process, we utilize a dynamic programming algorithm with affine penalties, incorporating K-Band and innovative memory-saving technology. The parameter settings related to affine penalties and the profile fine-tuning update method align with those of ReformAlign. Consequently, we will delve into the detailed implementation of K-Band and memory-saving technology next.
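The summarized profile and the final clean-up step described above can be illustrated with a minimal Python sketch. It is not the ReAlign-N or ReformAlign implementation: the function names `build_profile` and `drop_all_gap_columns` are hypothetical, and counting the gap symbol as one of the profile entries is an assumption made for illustration.

```python
from collections import Counter

GAP = "-"

def build_profile(alignment):
    """Return one dict per column mapping each symbol to its proportion.

    Assumption: the gap symbol is counted like any other symbol.
    """
    n_seqs = len(alignment)
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for j in range(width):
        counts = Counter(row[j] for row in alignment)
        profile.append({sym: c / n_seqs for sym, c in counts.items()})
    return profile

def drop_all_gap_columns(alignment):
    """Remove columns that contain only gaps, as done after merging."""
    width = len(alignment[0])
    keep = [j for j in range(width) if any(row[j] != GAP for row in alignment)]
    return ["".join(row[j] for j in keep) for row in alignment]
```

For the toy alignment `["AC-", "AG-", "AC-"]`, the second column of the profile holds C with proportion 2/3 and G with proportion 1/3, and the third column is removed by `drop_all_gap_columns`.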

Unlike linear penalties, which assign a fixed penalty value to base insertions or deletions, affine penalties approach gaps (insertions or deletions) as continuous regions rather than isolated events. In this scheme, the penalty value gradually escalates with an increasing number of insertions or deletions (Equation 1), mirroring real-world biological scenarios more accurately.

$$P(g) = d + (g - 1)\,e \tag{1}$$

Consider $g$ as the total length of continuous gaps, where a fixed penalty value $d$ is applied to the initial position of these gaps, and a penalty value $e$ is assigned to each subsequent position within these gaps. As the length of continuous gaps increases, the penalty value proportionally escalates. Pairwise sequence alignments employing affine penalties often rely on dynamic programming algorithms, which not only handle traditional matching, insertion and deletion states (Figure 2A) but also address continuous gaps. This necessitates comprehensive consideration of the preceding state, the current gap’s length and its associated penalty value during state transitions. Consequently, conventional methods like Needleman–Wunsch and Smith–Waterman utilize three dynamic programming tables to store the score matrices for the match, transverse and longitudinal states, respectively (Figure 2D(i)). As sequence length increases, the dimensions of these tables expand accordingly, resulting in significant memory overhead and reduced practicality.
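The difference between linear and affine penalties can be made concrete with a short Python sketch. It assumes the convention stated in the text, namely one fixed value for the first gap position and a second value for every subsequent position; the function names and the numeric values in the usage note are hypothetical and are not the parameters used by ReAlign-N or ReformAlign.

```python
def affine_gap_penalty(gap_length, gap_open, gap_extend):
    """Penalty of one continuous gap: gap_open for the first position,
    gap_extend for each of the remaining (gap_length - 1) positions."""
    if gap_length <= 0:
        return 0.0
    return gap_open + (gap_length - 1) * gap_extend

def linear_gap_penalty(gap_length, per_position):
    """Linear scheme for comparison: every gap position costs the same."""
    return max(gap_length, 0) * per_position
```

With illustrative values `gap_open = 10` and `gap_extend = 1`, a single gap of length 5 costs 14, whereas five isolated gaps of length 1 cost 50, so the affine scheme favors one contiguous indel over many scattered ones.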

Figure 2. Schematic diagram illustrating K-Band and memory-saving technology. (A) State transition diagram illustrating three states of affine penalty. The three nodes symbolize distinct states, with the directed line segments between them representing the respective scores. (B) An illustration depicting the encoding of the three states associated with the affine penalty and their storage in a single byte. The match state is represented as 01, the transverse state as 10, and the longitudinal state as 11. Bits 0–5 within a single byte are allocated for storing the three states of the current position. (C) An illustrative example showcasing the integration of dynamic programming with K-Band. The dark area indicates the regions to be computed when K = 0, while the thick-framed area represents those to be computed when K = 1. (D) A specific example illustrates the utilization of a character matrix for storing the backtracking paths of three score matrices in the context of affine penalty dynamic programming. (i) Conventional dynamic programming, employing an affine penalty model, necessitates the computation of three score matrices: M for the match state, T for the transverse state, and L for the longitudinal state. The solid line signifies that the current position's score originates from the match state, the loosely dashed line denotes that the score is derived from the transverse state, and the tightly dashed line indicates that the score is derived from the longitudinal state. (ii) The character matrix retains the backtracking paths of the three scoring matrices. The illustration displays the storage outcome for bits 0–5 of a byte, with bits 6 and 7 consistently set to 0. (iii) Hexadecimal notation represents the outcome of the encoded backtracking paths matrix. The matrix is based on the binary encoding of the backtracking path. For example, the binary encoding for the top-left corner (an empty position) is 00010000, which is 0x10 in hexadecimal. The position to its right is encoded as 00000100, which is 0x04 in hexadecimal. The same conversion method applies to other positions.

In pairwise sequence alignment, the key is not the individual position scores but the backtracking path derived from the score matrices for the match, insertion and deletion states. To optimize memory usage, we store only these backtracking paths. Specifically, consider two sequences, $A$ and $B$, where $A$ has length $m$ and $B$ has length $n$. In affine gap penalty alignment, two main data structures are employed to conserve memory: a byte matrix ($bt$) and a floating-point matrix ($pm$). The $bt$, sized $(m+1) \times (n+1)$, stores the encoded backtracking paths instead of the scores. The $pm$, sized $3 \times (n+1)$, holds the current row scores for the match, insertion (transverse) and deletion (longitudinal) states.

During the alignment process, $pm$ stores the scores for the current row in each iteration. When moving to the next row, it is replaced by another floating-point matrix, $pm'$ (also sized $3 \times (n+1)$). These two matrices alternate to minimize memory usage, storing only the scores for the current and next rows, rather than the entire score matrix. The dynamic programming algorithm then uses the scores from $pm$ and $pm'$, along with the path information in $bt$, to iteratively determine the optimal path and complete the alignment.

To efficiently record path information, binary encoding is used to represent the three states: match is encoded as 01, insertion (transverse) as 10, and deletion (longitudinal) as 11. The six low-order bits of each byte are used to store the source of these states for the current position. Specifically, bits 0 and 1 represent the longitudinal path, bits 2 and 3 represent the transverse path, and bits 4 and 5 represent the match path, with bits 6 and 7 set to 0 (Figure 2B). This binary-encoded backtracking information is stored in $bt$ (Figure 2D(ii)). For clarity, we convert this binary encoding to hexadecimal to display the path information stored in each byte (Figure 2D(iii)). Once the alignment is complete, backtracking begins from the bottom-right corner of $bt$. Using bitwise operations based on the stored path information, the optimal path is traced back, either upward or leftward, until it returns to the starting point, producing the final alignment result.
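The bit layout described above can be sketched in a few lines of Python. The state codes and bit positions follow the text and Figure 2B; the helper names `pack_cell` and `unpack_cell`, and the use of code 00 to mean "no predecessor recorded", are assumptions made for illustration and are not taken from the ReAlign-N source.

```python
# State codes from the text: match = 01, transverse = 10, longitudinal = 11.
MATCH, TRANSVERSE, LONGITUDINAL = 0b01, 0b10, 0b11

# Bit positions from the text: bits 4-5 hold the source of the match state,
# bits 2-3 the source of the transverse state, bits 0-1 the longitudinal one.
MATCH_SHIFT, TRANSVERSE_SHIFT, LONGITUDINAL_SHIFT = 4, 2, 0

def pack_cell(match_src=0, transverse_src=0, longitudinal_src=0):
    """Encode the three predecessor states of one cell into a single byte.

    Assumption: code 0 means that no predecessor is recorded for that state.
    """
    for code in (match_src, transverse_src, longitudinal_src):
        if not 0 <= code <= 0b11:
            raise ValueError("state codes must fit in two bits")
    return ((match_src << MATCH_SHIFT)
            | (transverse_src << TRANSVERSE_SHIFT)
            | (longitudinal_src << LONGITUDINAL_SHIFT))

def unpack_cell(byte):
    """Recover (match_src, transverse_src, longitudinal_src) with bit masks."""
    return ((byte >> MATCH_SHIFT) & 0b11,
            (byte >> TRANSVERSE_SHIFT) & 0b11,
            (byte >> LONGITUDINAL_SHIFT) & 0b11)

def new_backtrack_matrix(m, n):
    """One byte per dynamic programming cell instead of three float scores."""
    return [bytearray(n + 1) for _ in range(m + 1)]
```

With this layout, `pack_cell(match_src=MATCH)` gives 0x10 and `pack_cell(transverse_src=MATCH)` gives 0x04, the two values quoted in the caption of Figure 2D(iii).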

Furthermore, we utilize the K-Band algorithm to refine the dynamic programming approach, because the size of the dynamic programming table is directly proportional to the sequence length. The K-Band algorithm boosts space efficiency by presuming that the optimal solution likely resides within a limited section of the dynamic programming table, selectively computing scores within a designated ‘banded area’ (Band) (16). This approach significantly reduces the required memory space while preserving accuracy, thereby enhancing the algorithm’s efficiency and scalability, especially when handling lengthy sequences or multiple sequences within large databases (Figure 2C). In the end, we employed a combination of both methods, and the pseudocode is shown in Algorithm 1.

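[Algorithm 1: pseudocode of the dynamic programming procedure combining K-Band with the memory-saving technology; provided as an image in the original.]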

Compared to traditional dynamic programming, the K-Band algorithm reduces the time complexity of global alignment from $O(mn)$ to $O(kn)$, where $m$ and $n$ are the lengths of the two sequences and $k$ is the bandwidth. This reduction in time complexity is particularly significant when $k$ is much smaller than $m$. Regarding space complexity, traditional dynamic programming requires storing three full score matrices, leading to a space complexity of $O(mn)$. With the K-Band constraint, only elements near the main diagonal need to be stored, reducing space complexity to $O(kn)$. When $k$ is much smaller than $m$, memory usage is significantly reduced. Additionally, while traditional score matrices use floating-point storage, applying memory-saving techniques allows the backtracking paths to be stored in a byte matrix, further optimizing space usage.
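The banding idea and the two alternating score rows can be sketched as follows. This is a deliberately simplified illustration and not Algorithm 1: it uses a linear gap penalty instead of the affine model, returns only the score without backtracking, and takes a fixed half-width `k` rather than adjusting it. The function name and the scoring values are hypothetical.

```python
NEG_INF = float("-inf")

def banded_global_score(a, b, k, match=1.0, mismatch=-1.0, gap=-2.0):
    """Global alignment score restricted to a diagonal band.

    Cell (i, j) is inside the band when -k <= j - i <= (n - m) + k.
    Each row is stored in band coordinates t = j - i + k, so a row
    needs only (n - m) + 2k + 1 entries instead of n + 1.
    """
    if len(a) > len(b):
        a, b = b, a                      # keep the shorter sequence on the rows
    m, n = len(a), len(b)
    diff = n - m
    width = diff + 2 * k + 1
    prev = [NEG_INF] * width
    for t in range(k, width):            # row 0: leading gaps only
        j = t - k
        if j <= n:
            prev[t] = j * gap
    for i in range(1, m + 1):
        cur = [NEG_INF] * width          # second row buffer, swapped below
        for t in range(width):
            j = i - k + t
            if j < 0 or j > n:
                continue                 # outside the table
            if j == 0:
                cur[t] = i * gap
                continue
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best = prev[t] + sub                        # diagonal move
            if t + 1 < width:
                best = max(best, prev[t + 1] + gap)     # move from above
            if t > 0:
                best = max(best, cur[t - 1] + gap)      # move from the left
            cur[t] = best
        prev = cur
    return prev[diff + k]                # cell (m, n) in band coordinates
```

When `k` is at least the length of the shorter sequence, the band covers the whole table and the result equals the unbanded score; for similar sequences a much smaller `k` already contains the optimal path, which is where the savings in time and memory come from.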

Local realignment

The local realignment method draws inspiration from the technique of vertical partition realignment. It begins by identifying the more conserved columns in the initial alignment and designates them as the target columns. A low-quality block is defined as the region between adjacent target columns that exceeds the minimum split distance (MSD). This distance is determined based on the number of sequences and their average similarity within the initial alignment. The average similarity is calculated as the total pairwise sequence similarity divided by the number of sequence pairs. Pairwise sequence similarity is assessed by dividing the number of correctly aligned base pairs by the sequence length. The MSDs corresponding to various levels of similarity and the number of sequences are detailed in Supplementary Table S1. For each low-quality block, a new block is generated by removing the gaps from the sequence fragments within the block and realigning them with MAFFT (17). Subsequently, the SP scores of the new block and the original block are computed (using Equation 2), and the block with the higher score is chosen as the result of local realignment. This process is repeated until all low-quality blocks are re-aligned, resulting in the final realignment outcome.
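The block extraction step can be sketched in Python as below. The sketch rests on two assumptions that the text does not settle: "exceeds" is read as strictly longer than the MSD, and the regions before the first and after the last target column are treated like any other region. The function names are hypothetical, and the realignment with MAFFT itself is not shown.

```python
def low_quality_blocks(target_columns, alignment_width, msd):
    """Return half-open (start, end) column ranges between adjacent target
    columns whose length is strictly greater than msd.

    Assumption: the flanks before the first and after the last target
    column are also candidates.
    """
    bounds = [-1] + sorted(target_columns) + [alignment_width]
    blocks = []
    for left, right in zip(bounds, bounds[1:]):
        start, end = left + 1, right
        if end - start > msd:
            blocks.append((start, end))
    return blocks

def degap_block(alignment, start, end, gap="-"):
    """Cut the block out of every row and remove its gaps, giving the
    unaligned fragments that would be handed to the realigner."""
    return [row[start:end].replace(gap, "") for row in alignment]
```

For target columns 0, 1, 2, 10, 11 and 30 in an alignment of width 31 and an MSD of 5, the sketch returns the ranges (3, 10) and (12, 30); raising the MSD to 7 leaves only (12, 30).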

$$SP = \sum_{j=1}^{L} S_j \tag{2}$$

The SP score is computed by summing the scores of $L$ columns, where $L$ represents the total number of columns in the block, and $S_j$ denotes the score of the $j$-th column (Equation 3).

$$S_j = \sum_{a=1}^{N-1} \sum_{b=a+1}^{N} s_j(a, b) \tag{3}$$

Here, $N$ is the total number of unaligned sequences, and $s_j(a, b)$ indicates the score of the aligned pair in the $a$-th row and $b$-th row of column $j$. For matched pairs (both characters are letters and identical), a score of 1 is assigned, while mismatched pairs (both characters are letters but different) receive a score of -1. If one character is a letter and the other is a gap, a score of -2 is allocated, and a score of 0 is assigned for all other cases.
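The SP scoring rules above translate directly into a short Python sketch. The pair scores (1, -1, -2 and 0) are those given in the text; the function names are hypothetical, characters are compared as written (so upper and lower case are assumed to be consistent), and the quadratic loop over sequence pairs is chosen for readability rather than speed.

```python
from itertools import combinations

GAP = "-"

def pair_score(x, y):
    """Score of one aligned character pair."""
    if x.isalpha() and y.isalpha():
        return 1 if x == y else -1       # match / mismatch
    if (x.isalpha() and y == GAP) or (y.isalpha() and x == GAP):
        return -2                        # letter against gap
    return 0                             # all other cases, e.g. gap against gap

def column_score(column):
    """Sum of pair scores over all unordered pairs of rows in one column."""
    return sum(pair_score(x, y) for x, y in combinations(column, 2))

def sp_score(block):
    """Sum of the column scores over all columns of the block."""
    width = len(block[0])
    return sum(column_score([row[j] for row in block]) for j in range(width))
```

For the block `["AC-", "AG-", "A-T"]` the three column scores are 3, -5 and -4, so the SP score is -6.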

ReAlign-N enhances the accuracy of this method by employing two distinct strategies for targeting columns: one based on fully matched columns and the other on columns filtered by entropy scoring (18). These two methods yield two results, and the final local realignment method selects the one with a higher SP score as the conclusive outcome (Figure 1C). The calculation method and parameters of the SP score remain consistent with those mentioned above. Fully matched columns refer to columns that contain only one type of base and no gaps. The entropy score-based screening method requires that the gap content in the column is <30% (the number of gaps divided by the total number of sequences) and that the entropy score of the column is > −0.5. The scoring formula for each column of entropy is presented in Equation 4.

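$$E_j = \sum_{i=1}^{n} p_{i,j} \log p_{i,j} \tag{4}$$

$$p_{i,j} = \frac{c_{i,j}}{N - G_j} \tag{5}$$

where $n$ represents the number of base types in the sequence, and $p_{i,j}$ signifies the proportion of a specific base $i$ relative to the total number of bases in column $j$ (Equation 5), with $c_{i,j}$ the number of occurrences of base $i$ in that column. $N$ and $G_j$ denote the total number of unaligned sequences and the number of gaps contained in column $j$, respectively.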
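The two target-column criteria can be sketched in Python as follows. The thresholds (gap content below 30% and entropy score above -0.5) are those stated in the text. The sign convention (a non-positive score that equals 0 for a perfectly conserved column) is inferred from the -0.5 threshold, and the base-2 logarithm is an assumption; the function names are hypothetical.

```python
import math
from collections import Counter

GAP = "-"

def is_fully_matched(column):
    """True when the column holds a single base type and no gaps."""
    return GAP not in column and len(set(column)) == 1

def entropy_score(column):
    """Sum of p * log2(p) over the base types in the column (gaps excluded).

    Assumptions: base-2 logarithm; a column without bases scores 0.0.
    """
    bases = [c for c in column if c != GAP]
    if not bases:
        return 0.0
    total = len(bases)
    return sum((c / total) * math.log2(c / total)
               for c in Counter(bases).values())

def is_entropy_target(column, max_gap_fraction=0.30, min_entropy=-0.5):
    """Apply the gap-content filter first, then the entropy threshold."""
    gap_fraction = column.count(GAP) / len(column)
    if gap_fraction >= max_gap_fraction:
        return False
    return entropy_score(column) > min_entropy
```

Under these assumptions, a column with nine A and one C scores about -0.47 and is accepted, whereas a column with three A and one C scores about -0.81 and is rejected.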

Implementation and experimental environment

ReAlign-N, implemented in C++, is compatible with any UNIX-type platform. Its global realignment methodology is based on ReformAlign. It is freely available as open-source software distributed under the GNU Public License and can be accessed from https://github.com/malabz/ReAlign-N. The experiments were conducted on a server operating the Ubuntu Linux system, equipped with an Intel(R) Xeon(R) Platinum 8168 CPU running at 2.7 GHz and boasting 1 TB of RAM.

Datasets and measurement

Seven nucleotide datasets were utilized in the experiment, comprising three real datasets and four simulated ones. Some MSA tools, such as Clustal Omega (19), encounter difficulties when aligning datasets with long sequences or a large number of sequences. To address this challenge, we performed multiple random samplings to reduce the dataset size, generating several sub-datasets for testing. This approach minimizes randomness and variability in the experimental results, thereby enhancing their credibility.

The three real datasets include two DNA datasets and one RNA dataset. The 16S ribosomal RNA (rRNA) dataset includes 108 413 DNA sequences encoding RNA in bacteria and archaea, each ∼1.5 kb in length and consisting solely of ATCG bases (20). Ten sub-datasets, each containing 1000 sequences randomly sampled without replacement, were created for this dataset. The human mitochondrial (mt) genome dataset comprises 672 genomes with lengths ranging from 16 556 to 16 579 bp (21). After removing sequences containing characters other than ATCG (such as degenerate bases), 594 sequences remain. From these, 200 sequences are randomly selected without replacement to create sub-datasets, and this process is repeated ten times. Lastly, the 23S rRNA dataset contains 641 mycobacterial 23S rRNA sequences from the SILVA rRNA database (http://www.arb-silva.de/), ranging in length from 1909 to 3485 bp. For this dataset, ten sub-datasets were generated, each containing 500 sequences randomly sampled without replacement and composed only of AGCU bases. Detailed information on the ten sub-datasets of each of the three datasets is provided in Supplementary Tables S3–S5.

We also utilized four sets of simulated nucleic acid datasets for comprehensive testing. Three simulated datasets were generated using hierarchical tree simulation to obtain 16S-like simulated rRNA, human mitochondrial-like and SARS-CoV-2-like simulated genome datasets. The simulation was conducted with INDELible v1.03 (22), and the substitution models were derived from estimates obtained from 3000 16S rRNA, 672 human mitochondrial genome and 500 SARS-CoV-2 genome alignments, utilizing IQ-TREE v2.2.0-beta (23). One hundred randomly selected 16S rRNA sequences, mitochondrial genome sequences and SARS-CoV-2 genome sequences were aligned to construct the simulation trees. Within these datasets, the 16S rRNA and human mitochondrial genomes are the same as those introduced above, and the SARS-CoV-2 genome sequence is sourced from the GISAID website (https://www.gisaid.org). Subsequently, the process of generating the simulated 16S-like rRNA, mt-like genome and SARS-CoV-2-like genome datasets was based on these three simulation trees. Each tree’s branch length was randomly assigned a value from 0 to 1 (non-ultrametric). The simulated sequence lengths were set to 1.5 kb for 16S-like rRNA, 16 kb for the mt-like genome and 30 kb for the SARS-CoV-2-like genome. The indel model parameters used were LAV 5 50, with insertion and deletion rates set at 0.01 and 0.1, respectively. To simulate datasets with varying average similarities, we adjusted the tree length (sum of branch lengths) to achieve average similarities of 99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 75% and 70%. Detailed information about these three sets of 14 sub-datasets can be found in Supplementary Table S2. Each sub-dataset includes nine replicates, with each replicate containing 100 simulated sequences. The fourth set consists of CIPRES-simulated rRNA datasets evolving from a common root rRNA sequence across trees with 128, 256, 512 and 1024 taxa, respectively. These datasets were retrieved from trials 1 to 9 on the CIPRES SIMULATION DATA website (https://kim.bio.upenn.edu/software/csd.shtml). Table 1 provides additional specific details.

Table 1. Summary of datasets

Dataset	Number of replicate sets	Number of sequences	Average length (bp)	Average similarity
16S rRNA	10	1000	1440	∼72%
Human mitochondrial genome	10	200	16 570	∼99%
Mycobacterium 23S rRNA	10	500	3120	∼92%
16S-like simulated rRNA with various mean similarities	9	100	1550	14 sets of data with different similarities (99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 75%, 70%)
Mitochondrial-like simulated genomes with various mean similarities	9	100	16 000	14 sets of data with different similarities (99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 75%, 70%)
SARS-CoV-2-like simulated genomes with various mean similarities	9	100	29 000	14 sets of data with different similarities (99%, 98%, 97%, 96%, 95%, 94%, 93%, 92%, 91%, 90%, 85%, 80%, 75%, 70%)
CIPRES-255	9	255	1527	∼80%
CIPRES-511	9	511	1528	∼80%
CIPRES-1023	9	1023	1527	∼80%
CIPRES-2047	9	2047	1527	∼80%


To assess the alignment quality of the three real datasets, we employed the average sum-of-pairs (aSP) score. This metric is calculated by dividing the SP score by the number of sequence pairs, using the same penalty parameters as in the local realignment section. In addition to the aSP score, we evaluated the alignment quality of the four simulated datasets using the Q score (calculated as the number of correctly aligned base pairs divided by the number of aligned pairs in the reference) and the TC score (computed as the number of correctly aligned columns divided by the number of aligned columns in the reference). Both scores were derived using the qscore program (http://www.drive5.com/bench/) to assess discrepancies between the alignment results and the reference alignment.
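The Q and TC definitions above can be illustrated with a small Python sketch. It is not the qscore program and may differ from it in edge cases: residues are identified by (sequence index, ungapped position), a reference column is treated as "aligned" when it contains at least two residues, and a column counts as correct only when the test alignment contains a column with exactly the same residues. These choices and the function names are assumptions. Both alignments must list the same sequences in the same order.

```python
from itertools import combinations

GAP = "-"

def columns_as_residue_sets(alignment):
    """Describe each column by the set of (sequence, residue index) it holds."""
    next_residue = [0] * len(alignment)
    columns = []
    for j in range(len(alignment[0])):
        members = []
        for s, row in enumerate(alignment):
            if row[j] != GAP:
                members.append((s, next_residue[s]))
                next_residue[s] += 1
        columns.append(frozenset(members))
    return columns

def aligned_pairs(columns):
    """All residue pairs that share a column."""
    return {pair for col in columns for pair in combinations(sorted(col), 2)}

def q_and_tc(test, reference):
    """Return (Q, TC) of a test alignment against a reference alignment.

    The reference must contain at least one aligned pair.
    """
    test_cols = columns_as_residue_sets(test)
    ref_cols = columns_as_residue_sets(reference)
    ref_pairs = aligned_pairs(ref_cols)
    q = len(aligned_pairs(test_cols) & ref_pairs) / len(ref_pairs)
    ref_aligned = [col for col in ref_cols if len(col) >= 2]
    test_set = set(test_cols)
    tc = sum(col in test_set for col in ref_aligned) / len(ref_aligned)
    return q, tc
```

For the reference `["ACG", "A-G"]` and the test alignment `["ACG", "AG-"]`, one of the two reference pairs and one of the two aligned reference columns are reproduced, so both scores are 0.5.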

MSA tools employed in the experimental analysis

The experiment incorporated initial alignments generated by seven MSA tools: Clustal Omega, FMAlign2 (24), HAlign3 (25), Kalign3 (26), MAFFT, MUSCLE3 (27) and WMSA2 (28). The corresponding software versions and execution commands for acquiring these initial alignments are detailed in Supplementary Table S6.

Results

Global realignment demonstrates efficient running time and memory consumption

ReformAlign requires inputting both the original unaligned sequences and the initial alignment to be optimized, after which it generates the refined result. This experiment used seven MSA tools (ClustalW2, FMAlign2, HAlign3, Kalign3, MAFFT, MUSCLE3 and WMSA2) to generate the initial alignments on all datasets. For details on these tools’ versions and operating parameters, see Supplementary Table S6. Subsequently, the initial alignments and the original files were input into ReformAlign and into the global realignment of ReAlign-N. The number of iterations was kept the same for both processes. In addition, we disabled the statements using ‘pragma omp’ in the source code of both methods and recompiled them to guarantee execution on a single core. We preceded each running command with ‘/usr/bin/time -v’ and recorded the output parameters ‘Elapsed (wall clock) time’ and ‘Maximum resident set size’ to capture the program’s running time and peak memory usage.
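The measurement procedure can be sketched in Python as below. The sketch assumes GNU time is installed at /usr/bin/time and that its verbose report uses the field labels quoted above; the helper names are hypothetical, and the command to be measured is passed in by the caller rather than assumed here.

```python
import re
import subprocess

ELAPSED_RE = re.compile(
    r"Elapsed \(wall clock\) time \(h:mm:ss or m:ss\): (\S+)")
MAX_RSS_RE = re.compile(r"Maximum resident set size \(kbytes\): (\d+)")

def parse_gnu_time(report):
    """Extract (elapsed seconds, peak resident set size in kB) from the
    verbose report that GNU time writes to standard error."""
    elapsed = ELAPSED_RE.search(report)
    max_rss = MAX_RSS_RE.search(report)
    if elapsed is None or max_rss is None:
        raise ValueError("report does not look like '/usr/bin/time -v' output")
    seconds = 0.0
    for part in elapsed.group(1).split(":"):   # h:mm:ss or m:ss
        seconds = seconds * 60 + float(part)
    return seconds, int(max_rss.group(1))

def measure(command):
    """Run a command (list of arguments) under GNU time and parse the report."""
    result = subprocess.run(["/usr/bin/time", "-v", *command],
                            capture_output=True, text=True, check=True)
    return parse_gnu_time(result.stderr)
```

Because the program's own messages may also appear on standard error, the parser searches for the two labelled fields instead of relying on line positions.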

Compared to ReformAlign, global realignment demonstrates reduced running time and memory requirements across all simulated datasets. Analysis of the 16S-like rRNA, mt-like and SARS-CoV-2-like datasets revealed that, with the same number of sequences, running time increased as sequence similarity decreased (Figure 3C, F and I). Additionally, the results from the CIPRES-simulated RNA datasets indicated that with similar sequence similarity, a larger number of sequences resulted in longer running times (Figure 3L). However, memory consumption is influenced solely by the length and number of sequences, not their similarity. Longer sequences and larger quantities lead to increased memory usage. In the three real datasets, global realignment consistently requires less memory and significantly reduces runtime for most MSA results. Its advantages are particularly clear in the high-similarity Mycobacterium 23S rRNA (sequence similarity: 92%) and human mitochondrial genome (sequence similarity: 99%) datasets (Figure 4F and I). Additionally, on the 16S rRNA dataset (sequence similarity: 72%), global realignment runs faster than ReformAlign when optimizing most MSA results while maintaining lower overall memory consumption (Figure 4C).

Figure 3. Performance comparison based on simulated datasets. Comparison of aSP scores (A), Q scores and TC scores (B) between ReAlign-N and ReformAlign on the 16S-like rRNA dataset. (C) Comparison of runtime and memory usage between global realignment, ReAlign-N and ReformAlign on the 16S-like rRNA dataset. (D–L) Corresponding analysis of mitochondrial-like (mt-like) genome (D–F), SARS-CoV-2-like genome (G–I) datasets and CIPRES-simulated RNA datasets (J–L). The performance of the 16S-like rRNA, mt-like genome and SARS-CoV-2-like genome datasets varies with different similarities, while the CIPRES-simulated RNA datasets vary with different numbers of sequences. Each subset, categorized by similarity or number of sequences, comprises nine replicates. We computed the average aSP score, Q score and TC score for each MSA tool across these replicates. Subsequently, using the results from seven different MSA tools, we constructed box plots to illustrate the aSP scores and line graphs with error bars to depict the average Q (solid line) and TC (dashed line) scores. Error bars were calculated based on the standard error derived from the seven results. The bubble chart shows the average running time and memory consumption of each subset categorized by similarity or number of sequences across the seven MSA tools. Here, bubble height represents running time, and bubble size represents memory consumption.

Figure 4. Comparative analysis based on real datasets. (A) Histogram depicting the similarity distribution (left) among sequence pairs from 1000 16S rRNA sequences and the length distribution (right) of these sequences. The data are based on the file 16s_1000_0.fasta. (B) For the 16S rRNA dataset, the average (above) and distribution (below) of the aSP scores of the realignment results after local realignment using various MSDs. (C) Comparison of running time and memory usage for global realignment, ReAlign-N and ReformAlign on the 16S rRNA dataset (above), and comparison of aSP scores between ReAlign-N and ReformAlign (below). (D–I) Corresponding analysis on the Mycobacterium 23S rRNA dataset (D–F) and the human mitochondrial genome dataset (G–I). The histogram for Mycobacterium 23S rRNA is based on 500 sequences from the file 23s_500_0.fasta (D), while the histogram for the human mitochondrial genome is based on 200 sequences from the file mt_200_0.fasta (G). Each dataset contains 10 replicates. The histogram of the mean aSP score displays the average of all alignments (7 × 10), with error bars representing the standard error of the seven MSA results, calculated based on the mean of each MSA result. The violin plot illustrates the distribution of aSP scores for all alignments (7 × 10) within each dataset. The bubble chart illustrates the average running time and memory consumption, with bubble height representing running time and bubble size indicating memory consumption. The box plot displays the aSP scores for the seven MSA results.

Selecting the size of the low-quality block for local realignment based on initial alignment similarity

The seven MSA tools mentioned earlier were employed to generate the initial alignments. Conserved columns in each alignment were then identified using two criteria: the ‘match’ method, which selects columns that contain identical bases, and the ‘entropy’ method, which scores columns with entropy-based formulas. We tested MSD values of 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50 to segment the low-quality blocks. For each low-quality block, gaps were removed and the block was realigned with MAFFT. SP scores were calculated for the block before and after realignment, and the version with the higher SP score was retained. This process continued until all low-quality blocks had been processed, marking the completion of local realignment. Finally, we counted the number of low-quality blocks identified in the simulated datasets. The aSP score was used to evaluate accuracy on the real datasets, while the aSP, Q and TC scores were used to assess accuracy on the simulated datasets.
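The local realignment loop described above can be illustrated with a minimal Python sketch. It is not the ReAlign-N implementation. It assumes that a low-quality block is a run of columns between two consecutive conserved columns whose length is at least the MSD, and it uses hypothetical SP scoring parameters (match = 1, mismatch = -1, gap = -2). The function names are illustrative, and the MAFFT call is omitted: the realigned block is simply passed in.

```python
from itertools import combinations

def conserved_columns(msa):
    """Indices of columns in which every sequence has the same non-gap base
    (an illustrative reading of the 'match' criterion)."""
    return [c for c, col in enumerate(zip(*msa))
            if col[0] != "-" and len(set(col)) == 1]

def low_quality_blocks(msa, msd):
    """Half-open (start, end) column ranges lying between consecutive conserved
    columns and spanning at least `msd` columns (msd >= 1).
    Assumption: this is how the MSD threshold is applied."""
    ncols = len(msa[0])
    anchors = [-1] + conserved_columns(msa) + [ncols]
    blocks = []
    for left, right in zip(anchors, anchors[1:]):
        start, end = left + 1, right
        if end - start >= msd:
            blocks.append((start, end))
    return blocks

def sp_score(block, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an aligned block; the scoring values are hypothetical."""
    total = 0
    for col in zip(*block):
        for x, y in combinations(col, 2):
            if x == "-" and y == "-":
                continue  # gap-gap pairs contribute nothing
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total

def keep_better(old_block, new_block):
    """Retain the realigned block only if its SP score is strictly higher."""
    return new_block if sp_score(new_block) > sp_score(old_block) else old_block

msa = ["AAC-GTT", "AACG-TT", "AACGGTT"]
for start, end in low_quality_blocks(msa, msd=2):
    old = [row[start:end] for row in msa]
    print(start, end, old, sp_score(old))
```

Under this reading, raising `msd` can only shrink the list of blocks, because short gaps between conserved columns no longer qualify.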

In the 16S-like rRNA dataset, the number of low-quality blocks identified by the match method is highest at low similarity levels (70%) and decreases as similarity increases. When sequence similarity is 99%, local realignment with an MSD of 5 achieves the highest Q and TC scores. As sequence similarity decreases, a higher MSD is needed to obtain higher Q and TC scores. When the average similarity is <90%, the results for different MSD values show little variation. With an MSD of 10, the average aSP score of local realignment is the highest; however, the score gradually decreases as the MSD value increases (Figure 5A). Similarly, in the mt-like genome (see Figure 5D) and SARS-CoV-2-like genome (see Figure 5G) datasets, the number of low-quality blocks identified by the match method is high when similarity is low and decreases as similarity increases. In high-similarity datasets, selecting a smaller MSD value for local realignment results in higher Q and TC scores. As similarity decreases, a higher MSD value is needed. The average aSP scores for these two datasets also decrease as the MSD value increases.

Figure 5. — The impact of various MSD values on the quality of local realignment was evaluated using simulated datasets. (A) For the 16S-like rRNA dataset, the average aSP score (upper left), Q score and TC score (lower left) after local realignment using the match method with different MSD values, as well as the number of low-quality blocks in different similarity sub-datasets (lower right). (B) Results of local realignment using the entropy method with different MSD values on the 16S-like rRNA dataset. The layout and content are the same as in (A). (C) Comparison of the average aSP score (top), Q score (middle) and TC score (bottom) after local realignment of the 16S-like rRNA dataset using different MSD values with the match and entropy methods. (D–L) Corresponding analysis of mitochondrial-like (mt-like) genome (D–F), SARS-CoV-2-like genome (G–I) datasets and CIPRES-simulated RNA datasets (J–L). The quality across the 16S-like rRNA, mt-like genome and SARS-CoV-2-like genome datasets varies with differing levels of similarity, while the CIPRES-simulated RNA datasets vary with different sequence counts. Each subset, categorized by either similarity or sequence count, consists of nine replicates. For each bar graph representing average aSP scores, we calculated the average aSP score for each MSA tool across all replicates (9 × 14). We then used the results from the seven different MSA tools to create a bar graph with error bars. These error bars represent the standard errors calculated from the seven results. For each Q-score and TC-score heatmap, we calculated the average scores across all replicates for each similarity subset of the seven MSA tools. The data were then row-normalized to produce the final heatmap. In the heatmap, the upper left triangle represents the Q scores, while the lower right triangle represents the TC scores. 
In addition, for each similarity subset, we calculated the average number of low-quality blocks identified by each MSA tool across all replicates (using 10 different MSD values). We then constructed box plots to show the distribution of the number of low-quality blocks for the seven different MSA tools. Finally, we plotted line graphs of aSP, Q and TC scores based on the average values across all replicates from the seven MSA tools (9 × 14 × 7).

The number of low-quality blocks identified by the entropy method is highest when the average similarity of the dataset is low and decreases as similarity increases. In the 16S-like rRNA dataset, when the average similarity exceeds 98%, local realignment results show no significant changes across different MSD values. However, when the average similarity is >90%, a smaller MSD yields a higher Q score. Conversely, when the average similarity drops below 90%, a larger MSD achieves higher Q and TC scores. The average aSP score is highest with an MSD of 40 or 45, remaining similar in other cases (Figure 5B). In the mt-like genome dataset, when the average similarity is <90%, the Q and TC scores for local realignment are highest with an MSD of 5. As the average similarity increases, using a larger MSD yields higher Q and TC scores. When the average similarity is 97% or higher, the choice of MSD has no significant effect on local realignment. The average aSP score decreases as the MSD increases (Figure 5E). In the SARS-CoV-2-like genome dataset, when the average similarity is 80% or lower, the Q and TC scores for local realignment are highest with an MSD of 5. When the average similarity exceeds 80%, a larger MSD used in local realignment results in higher Q and TC scores. The average aSP score also decreases as the MSD increases (Figure 5H). Additionally, we compared the aSP, Q and TC scores obtained with the match and entropy methods across the three datasets: 16S-like rRNA, mt-like genome and SARS-CoV-2-like genome. High aSP scores coincide with high Q and TC scores, which indicates that the aSP score is a reliable criterion for selection (see Figure 5C, F and I).
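As a generic illustration of an entropy-based column criterion, the following Python sketch computes the Shannon entropy of each alignment column and flags columns at or below a cutoff. This is an assumption for illustration only: the exact formulas used by ReAlign-N are not reproduced here, the cutoff of 0.5 bits is hypothetical, and gaps are treated as an ordinary fifth symbol.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column.
    Gaps are counted as a regular symbol (an assumption of this sketch)."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def low_entropy_columns(msa, threshold=0.5):
    """Indices of columns whose entropy is at or below a hypothetical cutoff.
    Note that an all-gap column has entropy 0 and would also be reported."""
    return [c for c, col in enumerate(zip(*msa))
            if column_entropy(col) <= threshold]

msa = ["AAC-GTT", "AACG-TT", "AACGGTT"]
print([round(column_entropy(col), 3) for col in zip(*msa)])
print(low_entropy_columns(msa))
```

Unlike the match criterion, a cutoff above zero tolerates a small fraction of deviating bases in a column, which matters more as the number of sequences grows.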

The average similarity of the CIPRES-simulated RNA dataset remains consistent. As the number of sequences increases, the number of low-quality blocks identified by both methods decreases (Figure 5J and K). For the match method, larger MSD values achieve the highest Q and TC scores, while the aSP score remains relatively unchanged across different MSD values (Figure 5J). For the entropy method, the Q and TC scores are highest when the MSD is 15. Although the aSP score is high at an MSD of 10, it decreases as the MSD value increases (Figure 5K). Additionally, a comprehensive comparison of the results obtained by the match and entropy methods shows that when the aSP score is high, the Q and TC scores are also correspondingly high (Figure 5L). In the real datasets, the 16S rRNA sequences have an average similarity of 72% with 1000 sequences per replicate (Figure 4A), while the 23S rRNA sequences have an average similarity of 92% with 500 sequences per replicate (Figure 4D). In both datasets, regardless of whether the match or entropy method is used, the results within each method are consistent across different MSD values. In the 16S rRNA dataset, the match method yielded more stable results and generally higher aSP scores than the entropy method (Figure 4B). Similarly, in the 23S rRNA dataset, the match method also produced higher aSP scores than the entropy method (Figure 4E). In the human mitochondrial genome dataset, the average similarity is 99% and each replicate contains 200 sequences (Figure 4G). When using an MSD of 5 with the match method, the aSP score for local realignment is slightly higher than in other cases (Figure 4H).

Sequence similarity is a crucial factor in determining the optimal MSD for local realignment. When using the match method, a smaller MSD generally provides better results at high similarity levels. Conversely, for low similarity levels, increasing the MSD enhances the quality of local realignment results. With the entropy method, a smaller MSD tends to be more effective at extremely high or low similarity levels, while a larger MSD may improve results in other cases. Furthermore, when dealing with a large number of sequences, both methods achieve higher local realignment quality with smaller MSD values for datasets with high similarity, and larger MSD values for datasets with low similarity. Overall, the match method excels with datasets that have a high sequence count or similarity, whereas the entropy method is more effective for datasets with lower similarity or fewer sequences. Supplementary Table S1 provides details on the MSD corresponding to various similarity levels and sequence counts.

ReAlign-N enhances initial alignment accuracy and surpasses ReformAlign in speed and memory efficiency

Initial alignments were generated using the seven MSA tools from the aforementioned experiments. These initial alignments and the original unaligned sequences were fed into ReAlign-N and ReformAlign. ReformAlign was run with default parameters in single-core mode, which included five iterations. ReAlign-N was run in two single-core modes: ReAlign-N1 performed local optimization followed by global optimization to produce the final alignment, whereas ReAlign-N2 applied the two steps in the opposite order. Notably, global optimization in both cases involved a single iteration. Running time and peak memory usage were taken from the ‘Elapsed (wall clock) time’ and ‘Maximum resident set size’ fields reported by ‘/usr/bin/time -v’. We also used the aSP score to evaluate the realignment quality on the real datasets, and the aSP score, Q score and TC score to assess the realignment quality on the simulated datasets.
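For readers who want to collect the same two measurements, the following Python sketch parses the report that GNU `/usr/bin/time -v` writes to standard error. The helper names `parse_time_v` and `run_and_measure` are illustrative and are not part of ReAlign-N; the command passed to `run_and_measure` is whatever aligner invocation is being benchmarked.

```python
import re
import subprocess

ELAPSED_RE = re.compile(r"Elapsed \(wall clock\) time \(h:mm:ss or m:ss\): (\S+)")
MAXRSS_RE = re.compile(r"Maximum resident set size \(kbytes\): (\d+)")

def parse_time_v(stderr_text):
    """Return (elapsed_seconds, peak_rss_kbytes) from GNU time -v output."""
    elapsed = ELAPSED_RE.search(stderr_text)
    rss = MAXRSS_RE.search(stderr_text)
    if elapsed is None or rss is None:
        raise ValueError("input does not look like '/usr/bin/time -v' output")
    seconds = 0.0
    # The field is either m:ss.xx or h:mm:ss; fold the parts from left to right.
    for part in elapsed.group(1).split(":"):
        seconds = seconds * 60 + float(part)
    return seconds, int(rss.group(1))

def run_and_measure(cmd):
    """Run `cmd` (a list of strings) under /usr/bin/time -v and parse the report.
    Requires GNU time to be installed at /usr/bin/time."""
    proc = subprocess.run(["/usr/bin/time", "-v", *cmd],
                          capture_output=True, text=True)
    return parse_time_v(proc.stderr)

sample = (
    "\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:01.50\n"
    "\tMaximum resident set size (kbytes): 20480\n"
)
print(parse_time_v(sample))
```

Because the report is appended to the program's own standard error, the parser searches for the two fields rather than assuming fixed line positions.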

In the 16S-like rRNA dataset, ReAlign-N and ReformAlign significantly improved the initial alignment’s aSP score (Figure 3A), Q score and TC score (Figure 3B), with greater enhancements observed as sequence similarity decreased. Compared to ReformAlign, ReAlign-N ran faster and used less memory, particularly on high-similarity datasets (Figure 3C). Similarly, in the mt-like genome (Figure 3D and E) and SARS-CoV-2-like genome (Figure 3G and H) datasets, both tools enhanced the quality of the initial alignment, and ReAlign-N was again faster and more memory-efficient (Figure 3F and I). Because these datasets include 14 sub-datasets with varying similarities, leading to significant differences in aSP scores, we plotted the aSP scores for each similarity separately (Supplementary Figures S1–S3). The results demonstrate that even at high similarity, ReAlign-N effectively enhances the aSP score of the initial alignment. Furthermore, a comprehensive analysis of the three datasets highlighted that the realignment process is crucial for improving overall alignment quality, particularly in scenarios of low sequence similarity and long sequence length. In the CIPRES-simulated RNA dataset, where the datasets maintained the same similarity but increased in the number of sequences, both ReAlign-N and ReformAlign improved the initial alignment’s aSP (Figure 3J) and Q scores (Figure 3K). As the number of sequences grew, the improvements became more pronounced, with ReAlign-N significantly outperforming ReformAlign. Additionally, when the number of sequences is small, both tools effectively enhance the TC score, with ReAlign-N again demonstrating superior performance (Figure 3K). Therefore, when sequence similarity is low and the number of sequences is large, realignment can substantially boost the aSP and Q scores of the initial alignment. Conversely, realignment effectively improves the TC score in scenarios of low sequence similarity and small sequence numbers. Notably, ReAlign-N significantly reduces both running time and memory consumption across all datasets, particularly when handling longer sequences, larger datasets and higher sequence similarity (Figure 3L).

We observed similar results in real datasets. For the 16S rRNA dataset with an average similarity of ∼72% (Figure 4A), both ReAlign-N and ReformAlign improved the aSP score of the initial alignment. This improvement was particularly significant for the poor-quality WMSA2 initial alignment, with ReAlign-N1 showing the best realignment quality. Additionally, ReAlign-N consistently maintained high speed and low memory consumption compared to ReformAlign (Figure 4C). In the high-similarity Mycobacterium 23S rRNA (Figure 4D) and human mitochondrial genome datasets (Figure 4G), these tools also enhanced the aSP score of the initial alignment. ReAlign-N demonstrated greater advantages in running time and memory consumption, significantly reducing processing time and resource use (Figure 4F and I).

Discussion

In this study, we introduced ReAlign-N, a specialized method for aligning multiple nucleic acid sequences that integrates both global and local realignment strategies. ReAlign-N offers two realignment schemes: one that begins with local realignment followed by global realignment, and another that starts with global realignment before local realignment. We evaluated ReAlign-N’s performance through comprehensive testing on three real datasets and four simulated datasets.

For the global realignment strategy, we combined the K-Band algorithm with memory-saving technology and compared its performance to ReformAlign across all datasets. To keep the comparison accurate and reliable, we used the same number of iterations for both methods. The experimental results demonstrate that global realignment is generally faster than ReformAlign on most datasets and consistently uses less memory. The exception is the 16S rRNA dataset: because of the relatively low similarity of these sequences (∼72%), the initial alignments produced by HAlign3 and Kalign3 are of low quality, and for these alignments global realignment takes longer to complete than ReformAlign. This is because, during the global realignment process, the K-Band algorithm must gradually increase the K value from a small starting point to its maximum (i.e., the full sequence length) to determine the optimal path through dynamic programming. In contrast, ReformAlign directly constructs a dynamic programming matrix that spans the entire sequence length, allowing it to find the optimal path more efficiently under these conditions. While global realignment is slower than ReformAlign in this scenario, it still uses significantly less memory. As sequence length increases and similarity remains high (with a small K value in the K-Band algorithm), global realignment becomes even more efficient due to its reduced memory requirements, highlighting its advantages.
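The trade-off described above can be seen in a small pairwise example. The following Python sketch is not the ReAlign-N global realignment, which works on multiple alignments with its own scoring and memory-saving scheme. It assumes unit-cost edit distance between two sequences and a doubling schedule for K, both chosen for illustration. Only two rows of width 2K + 1 are kept in memory, and a band is accepted once the distance found inside it is at most K, since any alignment with cost at most K cannot leave the band.

```python
def banded_edit_distance(a, b, k):
    """Edit distance restricted to cells with |i - j| <= k.
    Returns None when the end cell lies outside the band."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    width = 2 * k + 1                      # band offset d = j - i + k
    prev = [inf] * width
    for d in range(k, k + min(k, m) + 1):  # row 0: cost of j insertions
        prev[d] = d - k
    for i in range(1, n + 1):
        cur = [inf] * width
        for d in range(width):
            j = i + d - k
            if j < 0 or j > m:
                continue
            if j == 0:
                best = i                   # i deletions
            else:
                best = prev[d] + (a[i - 1] != b[j - 1])   # diagonal move
                if d > 0:
                    best = min(best, cur[d - 1] + 1)      # move from (i, j-1)
            if d < width - 1:
                best = min(best, prev[d + 1] + 1)         # move from (i-1, j)
            cur[d] = best
        prev = cur
    return prev[m - n + k]

def kband_edit_distance(a, b, k0=1):
    """Grow the band until the result is provably optimal.
    Returns (distance, final_k). Doubling is an assumption of this sketch."""
    limit = max(len(a), len(b), 1)         # a band this wide covers the full matrix
    k = max(k0, abs(len(a) - len(b)), 1)
    while True:
        dist = banded_edit_distance(a, b, k)
        if dist <= k or k >= limit:
            return dist, k
        k = min(2 * k, limit)

print(kband_edit_distance("kitten", "sitting"))
```

For similar sequences the loop stops at a small K, so both time and memory stay close to linear in the sequence length. For dissimilar sequences the band is rebuilt several times before it reaches the full matrix, which is the situation in which a single full-matrix pass can be faster.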

Additionally, we selected the MSD value for cutting low-quality blocks based on the number of sequences and the average similarity of the initial alignment to ensure stable results. We tested the appropriate MSD values for both the match and entropy methods across all datasets and counted the number of low-quality blocks in the simulated datasets. We did not count low-quality blocks in the real datasets. Because the average similarity within each real dataset is consistent, a larger MSD value reduces the number of cut blocks and a smaller MSD value increases it, provided the identified target columns are the same. Finally, we compared ReAlign-N with ReformAlign across all datasets. The experimental results consistently demonstrate that ReAlign-N delivers superior alignment quality in most cases. Notably, ReAlign-N is not only faster but also more memory-efficient than ReformAlign. By integrating local realignment, ReAlign-N significantly reduces the number of global realignment iterations needed while maintaining or exceeding the alignment quality achieved by ReformAlign.

Acknowledgements

We acknowledge the help from the other group members, Jiannan Chao, Yi Liu, Yizheng Wang and Qinzhong Tian, for providing critical opinions during the preparation.

Y.Z.: Conceptualization, formal analysis, methodology, validation, Writing—original draft, review and editing. T.Z.: Formal analysis, methodology. Y.W.: Methodology. Q.Z.: Writing—review and editing. Y.W.: Writing—review and editing.

Contributor Information

Yixiao Zhai, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.2006, Xiyuan Avenue, Pidu Zone, Chengdu 610054, China; Institute of Digital Health, Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No.1, Chengdian Road, Kecheng Zone, Quzhou 324003, China.

Tong Zhou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.2006, Xiyuan Avenue, Pidu Zone, Chengdu 610054, China; Institute of Digital Health, Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No.1, Chengdian Road, Kecheng Zone, Quzhou 324003, China.

Yanming Wei, Institute of Digital Health, Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No.1, Chengdian Road, Kecheng Zone, Quzhou 324003, China; School of Computer Science and Technology, Xidian University, No.266, Xifeng Road, Chang'an Zone, Xi’an 710071, China.

Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.2006, Xiyuan Avenue, Pidu Zone, Chengdu 610054, China; Institute of Digital Health, Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No.1, Chengdian Road, Kecheng Zone, Quzhou 324003, China.

Yansu Wang, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.2006, Xiyuan Avenue, Pidu Zone, Chengdu 610054, China; Institute of Digital Health, Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, No.1, Chengdian Road, Kecheng Zone, Quzhou 324003, China.

Data availability

We utilized three real and four simulated nucleic acid datasets for comprehensive testing. For more details, please refer to the datasets and measurement sections. The code has been deposited in Figshare, https://doi.org/10.6084/m9.figshare.25801384.v1.

Supplementary data

Supplementary Data are available at NARGAB Online.

Funding

National Natural Science Foundation of China [62425107, 62450002, 62271353, 62373080]; National Key R&D Program of China [2022ZD0117700]; Municipal Government of Quzhou [2023D036]; Fellowship of China Postdoctoral Science Foundation [2023M731984, GZB20230365, 2024T170498].

Conflict of interest statement. None declared.

References

1. Liu H., Zou Q., Xu Y. A novel fast multiple nucleotide sequence alignment method based on FM-index. Briefings Bioinf. 2022; 23:bbab519.
2. Wang Y., Zhai Y., Ding Y., Zou Q. SBSM-Pro: support bio-sequence machine for proteins. Sci. China Inf. Sci. 2024; 67:212106.
3. Chao J., Tang F., Xu L. Developments in algorithms for sequence alignment: a review. Biomolecules. 2022; 12:546.
4. Anson E.L., Myers E.W. Proceedings of the First Annual International Conference on Computational Molecular Biology. 1997; NY: Association for Computing Machinery; 9–16.
5. Wallace I.M., Higgins D.G. Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics. 2005; 21:1408–1414.
6. Chakrabarti S., Lanczycki C.J., Panchenko A.R., Przytycka T.M., Thiessen P.A., Bryant S.H. Refining multiple sequence alignments with conserved core regions. Nucleic Acids Res. 2006; 34:2598–2606.
7. Lyras D.P., Metzler D. ReformAlign: improved multiple sequence alignments using a profile-based meta-alignment approach. BMC Bioinf. 2014; 15:265.
8. Manohar A., Batzoglou S. 2005 IEEE Computational Systems Bioinformatics Conference (CSB'05). 2005; CA, Stanford: IEEE; 111–119.
9. Mokaddem A., Hadj A.B., Elloumi M. Refin-Align: new refinement algorithm for multiple sequence alignment. Informatica. 2019; 43:4.
10. Zhan Q., Fu Y., Jiang Q., Liu B., Peng J., Wang Y. SpliVert: a protein multiple sequence alignment refinement method based on splitting-splicing vertically. Protein Pept. Lett. 2020; 27:295–302.
11. Wei Q., Zou H., Zhong C., Xu J. RPfam: a refiner towards curated-like multiple sequence alignments of the Pfam protein families. J. Bioinform. Comput. Biol. 2022; 20:2240002.
12. Mokaddem A., Haj A.B., Elloumi M. Pro-malign: multiple Sequence alignment algorithm using approached profile. J. Softw. 2018; 13:57–65.
13. Thompson J.D., Thierry J.-C., Poch O. RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics. 2003; 19:1155–1161.
14. Wicker N., Perrin G.R., Thierry J.C., Poch O. Secator: a program for inferring protein subfamilies from phylogenetic trees. Mol. Biol. Evol. 2001; 18:1435–1441.
15. Thompson J.D., Plewniak F., Ripp R., Thierry J.-C., Poch O. Towards a reliable objective function for multiple sequence alignments. J. Mol. Biol. 2001; 314:937–951.
16. Wei Y., Zou Q., Tang F., Yu L. WMSA: a novel method for multiple sequence alignment of DNA sequences. Bioinformatics. 2022; 38:5019–5025.
17. Katoh K., Standley D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013; 30:772–780.
18. Pei J., Grishin N.V. AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics. 2001; 17:700–712.
19. Sievers F., Higgins D.G. Clustal omega. Curr. Protoc. Bioinform. 2014; 48:3.13.11–13.13.16.
20. DeSantis T.Z., Hugenholtz P., Keller K., Brodie E.L., Larsen N., Piceno Y.M., Phan R., Andersen G.L. NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res. 2006; 34:W394–W399.
21. Tanaka M., Cabrera V.M., González A.M., Larruga J.M., Takeyasu T., Fuku N., Guo L.-J., Hirose R., Fujita Y., Kurata M. Mitochondrial genome variation in eastern Asia and the peopling of Japan. Genome Res. 2004; 14:1832–1850.
22. Fletcher W., Yang Z. INDELible: a flexible simulator of biological sequence evolution. Mol. Biol. Evol. 2009; 26:1879–1888.
23. Minh B.Q., Schmidt H.A., Chernomor O., Schrempf D., Woodhams M.D., Von Haeseler A., Lanfear R. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol. Biol. Evol. 2020; 37:1530–1534.
24. Zhang P., Liu H., Wei Y., Zhai Y., Tian Q., Zou Q. FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets. Bioinformatics. 2024; 40:btae014.
25. Tang F., Chao J., Wei Y., Yang F., Zhai Y., Xu L., Zou Q. HAlign 3: fast multiple alignment of ultra-large numbers of similar DNA/RNA sequences. Mol. Biol. Evol. 2022; 39:msac166.
26. Lassmann T. Kalign 3: multiple sequence alignment of large datasets. Bioinformatics. 2019; 36:1928–1929.
27. Edgar R.C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004; 32:1792–1797.
28. Chen J., Chao J., Liu H., Yang F., Zou Q., Tang F. WMSA 2: a multiple DNA/RNA sequence alignment tool implemented with accurate progressive mode and a fast win-win mode combining the center star and progressive strategies. Brief. Bioinf. 2023; 24:bbad190.
