Abstract
Multiple sequence alignment tools struggle to keep pace with rapidly growing sequence data, as few methods can handle large datasets while maintaining alignment accuracy. We recently introduced MAGUS, a new state-of-the-art method for aligning large numbers of sequences. In this paper, we present a comprehensive set of enhancements that allow MAGUS to align vastly larger datasets with greater speed. We compare MAGUS to other leading alignment methods on datasets of up to one million sequences. Our results demonstrate the advantages of MAGUS over other alignment software in both accuracy and speed. MAGUS is freely available in open-source form at https://github.com/vlasmirnov/MAGUS.
Author summary
Many tasks in computational biology depend on solving the problem of multiple sequence alignment (MSA), which entails arranging a set of genetic sequences so that letters with common ancestry are stacked in the same column. This is a computationally difficult problem, particularly on large datasets; current MSA software is able to accurately align up to a few thousand sequences at a time. Unfortunately, growing biological datasets are rapidly outpacing these capabilities. We present a new version of our MAGUS alignment tool, which has been massively scaled up to handle datasets of up to one million sequences, and demonstrate MAGUS’s excellent performance in aligning ultra-large datasets. The MAGUS software is open-source and can be found at https://github.com/vlasmirnov/MAGUS.
This is a PLOS Computational Biology Software paper.
Introduction
One of the principal problems in computational biology is multiple sequence alignment (MSA), which is necessary for a wide range of downstream applications. This challenge is well studied, and a number of strong methods have been developed [1–8]. Most of these leading methods follow the paradigm of “progressive alignment” and show reasonable accuracy and speed on datasets of modest size (a few hundred to a few thousand sequences).
Unfortunately, datasets with more sequences and greater evolutionary diameters require a different approach. Accurate progressive alignment methods rely on heuristics whose runtimes scale very poorly, and early mistakes are compounded over large numbers of pairwise alignments. As a consequence, a family of divide-and-conquer methods was developed to meet the demands of larger datasets [9–11].
MAGUS (Multiple Sequence Alignment using Graph Clustering) was recently introduced [12] as a new evolution of this family. MAGUS uses the GCM (Graph Clustering Merger) technique to combine an arbitrary number of subalignments, which allows MAGUS to align large numbers of sequences with highly competitive accuracy and speed. In its original form, MAGUS is able to align up to around 40,000 sequences.
In this paper, we extend MAGUS to handle datasets of much greater size, demonstrating alignments of up to one million sequences. The next section briefly explains how MAGUS operates, and presents our extensions to enable scalability. Next, we describe our experimental study and show our results, comparing MAGUS to other methods with regard to alignment accuracy and speed over ultra-large datasets. Finally, we discuss our findings and future work.
Design and implementation
Overview of MAGUS
MAGUS is a recently developed divide-and-conquer alignment method that inherits the basic structure of the earlier PASTA [11] algorithm: MAGUS decomposes the dataset into subsets, aligns them piecewise, and merges these subalignments together. The basic algorithm is outlined in Fig 1 and itemized below.
1. Input: a set of unaligned sequences.
2. Construct a guide tree over the unaligned sequences. (Our default way of doing this is explained below.)
3. Use the guide tree to break the dataset into subsets. This is done by “centroid edge decomposition” [11], deleting edges to break the tree into sufficiently small, balanced pieces.
4. Align each subset with MAFFT -linsi [3].
5. Construct a set of backbone alignments spanning our subsets. Each backbone is composed of equal-sized random samples drawn from each subset and is aligned with MAFFT -linsi.
6. Compile the backbones into an alignment graph. Each node represents a subalignment column, and each edge is weighted by how often its two columns are matched by the backbone alignments.
7. Cluster the alignment graph with MCL [13].
8. Order the clusters into a valid alignment. We use a heuristic search to resolve conflicts with minimal changes.
9. Output the full alignment.
Please refer to the original paper [12] for more information. Steps 5–8 comprise GCM (Graph Clustering Merger, Fig A in S1 Text), the method by which MAGUS merges subalignments and its biggest departure from previous divide-and-conquer methods. The pipeline was built to be flexible: the user can supply their own subalignments in lieu of steps 1–4, their own guide tree for step 2, and their own backbones for step 5. The number and size of subsets and backbones can also be controlled.
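To orient the reader, the pipeline can be summarized in a few lines of Python-style pseudocode. This is a structural sketch only: every function name and default value below is a hypothetical placeholder for the corresponding step above, not the actual MAGUS API.

def magus_align(sequences, subset_size=200, num_backbones=10):
    # Steps 2-4: decompose the input and align each piece independently.
    tree = build_guide_tree(sequences)                        # step 2
    subsets = centroid_decompose(tree, subset_size)           # step 3
    subalignments = [mafft_linsi(s) for s in subsets]         # step 4
    # Steps 5-8 (GCM): merge the subalignments via graph clustering.
    backbones = [mafft_linsi(sample_from_each(subalignments))
                 for _ in range(num_backbones)]               # step 5
    graph = build_alignment_graph(subalignments, backbones)   # step 6
    clusters = mcl(graph)                                     # step 7
    ordered = order_clusters(clusters)                        # step 8
    return assemble_alignment(ordered, subalignments)         # step 9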
Motivation for MAGUS enhancement
Despite its advantages, the original version of MAGUS (“MAGUS 1”) suffers from a number of constraints on its scalability. We motivate the need for improvement by glancing ahead to our experimental study, where MAGUS 1 struggles with increasing dataset sizes: it takes over 20 hours to align 50,000 sequences and fails on larger datasets due to memory issues. In the next section, we explain the limitations of MAGUS 1 and present the improvements that constitute this paper’s contribution.
MAGUS improvements
Recursion
First, there is a soft limit on how many sequences MAGUS 1 can reasonably align. MAFFT -linsi [3], which is used for building subset and backbone alignments, slows down sharply beyond around 200 sequences. Additionally, the cluster ordering step (step 8 above) tends to struggle with more than about 200 subsets. Therefore, assuming a practical limit of about 200 subsets of 200 sequences each, unmodified MAGUS can be expected to handle up to around 200 × 200 = 40,000 sequences.
We parry this limitation with a fairly straightforward recursive structure, shown in Fig 2. Instead of automatically aligning our subsets with MAFFT, subsets larger than a threshold are recursively aligned with MAGUS. This threshold can be set by the user and is, by default, the greater of the backbone size and the target subset size used for decomposition. Our subalignments are merged with GCM just as before, regardless of whether each subalignment was estimated with MAFFT or MAGUS.
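Continuing the placeholder sketch above, the only structural change is that the unconditional MAFFT call in step 4 becomes a size-gated dispatch; the threshold argument mirrors the user-settable default described in the text (the greater of the backbone size and the target subset size).

def align_subset(seqs, threshold):
    # Small subsets go to the base aligner; larger ones re-enter
    # MAGUS, which decomposes them again. GCM then merges the
    # resulting subalignments identically in either case.
    if len(seqs) <= threshold:
        return mafft_linsi(seqs)
    return magus_align(seqs)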
Parallelism
The next issue is parallelism. MAGUS 1 already implements thread-parallelism: it runs on a single compute node and uses all available threads on that node to run MAFFT tasks in parallel. This suffices for a few tens of thousands of sequences on a capable machine; ultra-large datasets, however, call for node-parallelism, in which multiple compute nodes collaborate. We implement node-parallelism by extending MAGUS 1’s task management code. MAGUS 1 maintains task files listing the MAFFT alignments and other self-contained tasks that are pending or running, which allows worker threads to divide the jobs and lets MAGUS easily resume after a failure. Reworking this system so that multiple compute nodes share the same set of task files permits any number of nodes to join and claim tasks to work on.
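The mechanism can be illustrated with a brief sketch of our own (not the actual MAGUS task code), assuming the nodes share a filesystem whose rename operation is atomic: each pending task is a file, and a worker claims a task by renaming its file, so exactly one of several competing workers succeeds.

import os, time

def claim_task(pending_dir, running_dir):
    # Claim one pending task by atomically renaming its file into the
    # running directory. If several workers race for the same file,
    # os.rename succeeds for exactly one; the others see an OSError
    # and simply try the next file.
    for name in sorted(os.listdir(pending_dir)):
        try:
            os.rename(os.path.join(pending_dir, name),
                      os.path.join(running_dir, name))
            return os.path.join(running_dir, name)  # the task is now ours
        except OSError:
            continue                                # lost the race
    return None                                     # nothing left to claim

def worker_loop(pending_dir, running_dir, run_task):
    # Any number of workers, on any number of nodes, can run this loop
    # against the same two directories.
    while True:
        task = claim_task(pending_dir, running_dir)
        if task is None:
            time.sleep(5)   # queue is empty; poll again (or exit)
            continue
        run_task(task)      # e.g., launch a MAFFT subprocess
        os.remove(task)     # mark the task as finished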
Guide tree
MAGUS decomposes the dataset into subsets by estimating a rough guide tree with FastTree [14], a fast maximum likelihood tree estimation method. Since FastTree requires an alignment, we first compile a rough alignment by aligning 300 random sequences with MAFFT and adding the remaining sequences with HMMER [15]. The guide tree is recursively broken apart until the subsets are small enough. This is the same strategy used in PASTA, and seems very difficult to improve upon. On very large alignments, however, even FastTree becomes painfully slow (around 5 days on a million sequences, as will be shown below).
The new version of MAGUS presents a wider range of guide tree options, intended for situations where FastTree might not be fast enough or fails due to numerical issues. The guide tree can now also be generated with Clustal Omega’s [2] initial tree method, MAFFT’s PartTree [16] initial tree method, and FastTree’s minimum evolution tree (i.e. limited to distance-based calculations without maximum likelihood). In extremis, the dataset can be decomposed randomly for maximum speed.
Memory management and alignment compression
Memory management becomes a salient problem when handling very large datasets. For example, without modifications, MAGUS alignments on the full million-sequence RNASim dataset fall between 1 and 3 terabytes (Fig B in S1 Text). Moreover, simply having too many subalignments loaded into memory at the same time can overrun the available RAM at such dataset sizes.
We solve the latter problem by reworking the code to ensure that at most one subalignment may be fully loaded into memory at any time. With large dataset sizes, this limits the memory complexity of MAGUS to the size of the largest subalignment.
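A minimal sketch of this discipline (ours, with a hypothetical read_fasta reader supplied by the caller): subalignments stay on disk and are surfaced one at a time through a generator.

def subalignments_one_at_a_time(paths, read_fasta):
    # Yield each subalignment in turn; provided the caller drops its
    # reference before advancing, at most one subalignment is ever
    # resident in memory.
    for path in paths:
        yield read_fasta(path)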
The problem of excessively large alignments is addressed by introducing a method of conservative lossy compression. If MAGUS calculates that the size of the uncompressed alignment will exceed a threshold (100GB by default; this may be set by the user), MAGUS will compress the alignment to the threshold size. The compression scheme is fairly straightforward and works by “dissolving” columns: the letters are set to lower-case and shunted to neighboring columns. If the neighboring columns already contain lower-case letters from the same sequences, these are also shunted away in a recursive domino effect. (If the neighboring columns already contain upper-case letters from the same sequences, then the move is invalid.) Columns are dissolved one at a time, starting with those containing the fewest letters, until the threshold is reached or no more valid moves remain. Please refer to Fig 3 for an example.
Note that if we “dissolve” a column with only one upper-case letter, then no homologous pairs are lost. Thus, the compression procedure remains lossless for as long as we are only dissolving such columns, and MAGUS allows the user to request lossless compression.
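To make the mechanics concrete, here is a small self-contained Python sketch of the lossless case; it is our illustration rather than MAGUS’s implementation, and it deliberately omits the recursive “domino” shunting: a column may be dissolved only if it holds at most one letter, and that letter can move directly into a gap in an adjacent column.

def dissolve_column(rows, j):
    # Attempt to losslessly dissolve column j of an alignment given as
    # a list of equal-length strings; return the new alignment, or None
    # if the move is invalid under this simplified rule.
    grid = [list(r) for r in rows]
    occupied = [i for i, r in enumerate(grid) if r[j] != '-']
    if len(occupied) > 1:
        return None    # several letters share column j: pairs would be lost
    for i in occupied:
        for k in (j - 1, j + 1):                 # try both neighboring columns
            if 0 <= k < len(grid[i]) and grid[i][k] == '-':
                grid[i][k] = grid[i][j].lower()  # lower-case marks "unaligned"
                grid[i][j] = '-'
                break
        else:
            return None                          # no adjacent gap to shunt into
    return [''.join(r[:j] + r[j + 1:]) for r in grid]

# Column 2 holds a single letter, so it dissolves with no pairs lost:
print(dissolve_column(["AC-G", "A-TG", "A--G"], 2))  # ['ACG', 'AtG', 'A-G']

Lossy compression follows the same mechanics but also dissolves columns holding several letters, surrendering the homologous pairs those columns asserted; as described above, MAGUS dissolves the sparsest columns first.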
Table B in S1 Text shows the effect of compression on MAGUS’s RNASim alignments at various sizes. At one million sequences, for example, the uncompressed alignment is about 1037GB, which can be reduced to 591GB with lossless compression, and reduced further to 25GB with lossy compression. Similarly, the uncompressed alignment over 500,000 sequences is 366GB, falling to 193GB with lossless compression and 10GB with lossy compression. Lossy compression increases the SP error by less than one millionth on these datasets, so it is generally safe to use.
Results
Experimental design
Our experimental design is outlined below.
The preliminary portion of our study explores the effects of the MAGUS extensions described above, using MAGUS 1 as our baseline. We test the impact of compression on alignment error, the use of different guide trees, and the benefit of node-parallelism. Due to space limitations, these results are available in the Supplementary Materials.
Our subsequent experiments compare MAGUS against a range of competing methods across all of our datasets. This is the most important part of our study, intended to exercise the current state-of-the-art in the alignment of ultra-large nucleotide and protein datasets. We present our results below.
Datasets
Our study uses a number of simulated and biological datasets from previous publications [4, 11]. Please see Table 1 for dataset statistics. These datasets were selected to provide suitably large and varied alignment problems with reference alignments, containing both nucleotide and amino acid sequences.
Table 1. Dataset properties.
Dataset | # Seqs | Avg. p-dist. | Max p-dist. | % gaps | Align. length | Type |
---|---|---|---|---|---|---|
RNASim | 10,000–1,000,000 | 0.41 | 0.61 | 93 | 18,268 | sim NT |
16S | ||||||
- 16S.3 | 6,323 | 0.32 | 0.83 | 82 | 8,716 | bio NT |
- 16S.T | 7,350 | 0.35 | 0.90 | 87 | 11,856 | bio NT |
- 16S.B.ALL | 27,643 | 0.21 | 0.77 | 80 | 6,857 | bio NT |
HomFam | ||||||
- gluts | 10,099 | 0.60 | 0.81 | 8 | 235 | bio AA |
- myb-DNA-binding | 10,398 | 0.59 | 0.77 | 12 | 61 | bio AA |
- tRNA-synt-2b | 11,293 | 0.81 | 0.88 | 34 | 467 | bio AA |
- biotin-lipoyl | 11,833 | 0.71 | 0.84 | 26 | 112 | bio AA |
- hom | 12,037 | 0.64 | 0.84 | 35 | 98 | bio AA |
- ghf13 | 12,607 | 0.72 | 0.84 | 25 | 626 | bio AA |
- aldosered | 13,277 | 0.57 | 0.79 | 19 | 386 | bio AA |
- hla | 13,465 | 0.24 | 0.33 | 0 | 178 | bio AA |
- Rhodanese | 14,049 | 0.76 | 0.89 | 31 | 216 | bio AA |
- PDZ | 14,950 | 0.69 | 0.84 | 15 | 110 | bio AA |
- blmb | 17,200 | 0.79 | 0.90 | 30 | 344 | bio AA |
- p450 | 21,013 | 0.79 | 0.87 | 20 | 512 | bio AA |
- adh | 21,331 | 0.36 | 0.47 | 0 | 375 | bio AA |
- aat | 25,100 | 0.71 | 0.87 | 15 | 476 | bio AA |
- rrm | 27,610 | 0.77 | 0.91 | 45 | 157 | bio AA |
- Acetyltransf | 46,285 | 0.75 | 0.87 | 29 | 229 | bio AA |
- sdr | 50,157 | 0.77 | 0.89 | 28 | 361 | bio AA |
- zf-CCHH | 88,345 | 0.65 | 0.85 | 25 | 39 | bio AA |
- rvp | 93,681 | 0.63 | 0.76 | 19 | 132 | bio AA |
RNASim: [11] This is a simulated RNA dataset, generated under a non-homogeneous model of evolution that does not conform to the usual GTR model assumptions. We use subsamples ranging from 10,000 to the full one million sequences, with one replicate per size.
16S: [17] We use three large biological nucleotide datasets from the Comparative Ribosomal Website: 16S.3, 16S.T, and 16S.B.ALL, with 6,323, 7,350, and 27,643 sequences, respectively.
HomFam: [2] Finally, we include 19 amino acid HomFam datasets, which have small Homstrad reference alignments of 5–20 sequences each. These datasets range from 10,099 to 93,681 sequences and allow us to evaluate our methods on large protein datasets. (Following the PASTA paper, we exclude the “rhv” dataset due to its weak reference alignment.)
Methods
We compare the following methods in our study, taken from previous publications [4, 11]. To the best of our knowledge, these methods are presently the best-equipped to tackle very large multiple sequence alignments. Regressive T-Coffee [18] is another recent development, but we were unable to run it on Blue Waters.
MAGUS 1 We use the original MAGUS as a baseline. This version does not use recursion or compression, uses a FastTree decomposition, and can only run on a single node.
MAGUS The latest version takes advantage of the new features detailed above. We enable recursion and compress alignments above 100GB. In addition to the default FastTree decomposition, we explore other guide trees: FastTree without Maximum Likelihood, MAFFT’s PartTree, Clustal Omega’s initial tree method, and a random decomposition. Henceforth, we indicate the guide tree and use of recursion in parentheses. For example, MAGUS(Recurse, Clustal) denotes MAGUS using Clustal Omega’s guide tree and with recursion enabled.
PASTA [11]
UPP [4]
UPP(Fast) We use the “Fast” mode described in the UPP paper.
Muscle [1]
Clustal Omega [2]
MAFFT -auto [3]. The “auto” mode directs MAFFT to choose an appropriate alignment strategy based on the input dataset.
Error metrics
We evaluate alignment accuracy using SPFP/SPFN (Sum-of-Pairs False Positive and False Negative) rates, computed using FastSP [19]. These values represent, respectively, the fractions of incorrectly present and missing homologous pairs in the estimated alignment. For convenience, we show the average of SPFP and SPFN as a single “SP error” in the main paper; SPFP and SPFN are shown separately in the Supplementary Materials. Our estimated alignments are compared against the true alignment on RNASim and the curated reference alignments on 16S. The HomFam datasets provide reference alignments over a small number of included sequences; we compute our alignment error over just these reference sequences.
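For illustration, the metrics can be computed directly as follows; this sketch of ours is quadratic per column (unlike FastSP’s linear-time algorithm), treats every non-gap character as a residue, and assumes both alignments list the sequences in the same order.

def homologous_pairs(rows):
    # The set of residue pairs an alignment asserts: two residues are
    # homologous if they occupy the same column. A residue is identified
    # by (row index, position within its ungapped sequence).
    pairs = set()
    seen = [0] * len(rows)               # residues consumed per sequence
    for j in range(len(rows[0])):
        column = []
        for i, row in enumerate(rows):
            if row[j] != '-':
                column.append((i, seen[i]))
                seen[i] += 1
        pairs.update((a, b) for x, a in enumerate(column)
                            for b in column[x + 1:])
    return pairs

def sp_error(reference, estimated):
    # Average of SPFN (reference pairs missing from the estimate)
    # and SPFP (estimated pairs absent from the reference).
    ref, est = homologous_pairs(reference), homologous_pairs(estimated)
    spfn = len(ref - est) / len(ref)
    spfp = len(est - ref) / len(est)
    return (spfn + spfp) / 2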
Computing resources
We used the NCSA Blue Waters supercomputer for our experiments. Our jobs were run on nodes with 32 cores, 64GB of RAM, and a maximum wall time of 7 days.
Experimental results
The preliminary part of our study, which investigates the impacts of compression, guide tree selection, and node-parallelism, is available in the Supplementary Materials (due to space constraints). These results provide us with two natural guide tree choices for MAGUS: using FastTree (the default, described above) is the most accurate, while using Clustal Omega’s initial tree is the faster alternative. Here, we present the principal part of our study, where we compare MAGUS to our other methods across all of our datasets.
HomFam
Our first set of results concerns the HomFam protein datasets. The error rates are averaged in Fig 4, and the complete results for all datasets are available in Table C in S1 Text. These results show more variability than the other datasets, but the general trends are as follows. Muscle and Clustal trail the others, averaging 46.6% and 27.2% error, respectively. MAFFT, UPP, and PASTA are all on par, averaging about 21–23% error. The MAGUS versions perform markedly better: MAGUS(Recurse, Clustal) yields 17.9% error, MAGUS(Recurse, FastTree) shows 16.5%, and MAGUS 1 leads with 15.5%. Furthermore, MAGUS 1 achieves the best result on 12 of the 19 datasets. Recursive MAGUS (both versions) accounts for 2 of the others, while Clustal and UPP each do best on 2.
The HomFam runtime results are shown in Fig 5 and Fig N in S1 Text. PASTA is visibly the slowest, taking about 2–5 hours on the smaller datasets and up to 20 hours on the larger ones. MAFFT, UPP(Fast), Clustal Omega, and MAGUS(Recurse, Clustal) are the fastest, generally finishing in a few minutes to an hour. Notably, we see MAGUS 1 begin to dramatically slow down without recursion, running longer than MAGUS(Recurse, FastTree) on the largest datasets.
16S
The next set of results pertains to the biological 16S datasets, shown in Figs 6 and 7. As above, Muscle and Clustal trail the other methods in accuracy. On the smallest dataset, 16S.3, the results are fairly close: UPP(Fast), PASTA, and all versions of MAGUS are at about 19% SP error. There is a larger difference on 16S.T, with PASTA at around 23%, UPP and UPP(Fast) around 21%, and all versions of MAGUS at about 20%. Lastly, UPP, PASTA, and MAGUS are again fairly close on 16S.B.ALL; PASTA shows about 11% error, while both versions of UPP and MAGUS have about 10.5% error.
In terms of runtime, we see that UPP, PASTA, and both versions of recursive MAGUS are the slowest methods on 16S.3 and 16S.T, running around 4–5 hours. The fastest method is MAFFT(auto) at about 2 minutes, while Muscle and UPP(Fast) take about half an hour. The picture is a little different on 16S.B.ALL, where Muscle, UPP, and PASTA seem to drastically slow down; they take about 11, 14, and 17 hours, respectively. MAGUS 1 also falters here, taking 18 hours, while recursive MAGUS with FastTree and Clustal only increases to 8 and 4 hours, respectively. MAFFT and UPP(Fast) remain the fastest, only taking 1–2 hours.
RNASim
In the final part of our study, we probe the limits of scalability on the RNASim datasets. Figs 8 and 9 show us the error and runtime results, while Table 2 summarizes all method failures. Muscle is the worst performer here, with 65–70% error and segfaulting after 50,000 sequences. Clustal Omega does better, with errors between about 30% and 60%, running out of time after 200,000 sequences. Then comes MAFFT -auto, with a steady error of 25–30% up to 100,000 sequences. Oddly, even though it is one of the fastest methods at 100,000 sequences (about 3.6 hours), it runs out of time at 200,000 sequences.
Table 2. Method failures on RNASim.
Method | Highest # Aligned | Failure |
---|---|---|
Muscle | 50,000 | “segmentation fault” |
Clustal Omega | 200,000 | Max runtime elapsed (7 days) |
MAFFT(auto) | 100,000 | Max runtime elapsed (7 days) |
UPP | 200,000 | Max runtime elapsed (7 days) |
PASTA | 50,000 | “Error detected during page fault processing. Process terminated via bus error.” |
MAGUS 1 | 50,000 | “OOM killer terminated this process.” |
The accuracy of our remaining methods is shown more clearly in Fig 10. UPP(Fast) trails the other methods in accuracy, with about 2% higher error than PASTA and UPP. PASTA and UPP are about the same at around 10% error. MAGUS 1 and recursive MAGUS (both versions) have the best accuracy. MAGUS 1 is the most accurate at 10,000–50,000 sequences (8.2–7.8% error), but can’t proceed beyond that. MAGUS(Recurse, FastTree) is second-best at about 8.5–8%. MAGUS(Recurse, Clustal) consistently trails MAGUS(Recurse, FastTree) by about 0.5% below 200,000 sequences, and declines to about 8.3% on 1,000,000 sequences.
Aside from MAGUS(Recurse, Clustal), UPP(Fast) is the only other method that aligned all 1,000,000 sequences within a week; UPP(Fast) took about 77 hours to align all 1,000,000 sequences, while MAGUS(Recurse, Clustal) took about 128 hours. PASTA encountered memory issues, while UPP and MAGUS(Recurse, FastTree) ran out of time. Notably, UPP, Clustal Omega, and MAGUS(Recurse, FastTree) showed comparable runtime scaling, all three just meeting the 1-week time limit at 200,000 sequences. MAGUS 1 initially scales better than recursive MAGUS on a single node, but only reaches 50,000 sequences.
Discussion
The accuracy of MAGUS convincingly exceeds the other methods we tried on the datasets in our study. As shown in Figs 4, 6 and 8, this is true regardless of whether recursion is used, and whether FastTree or Clustal is used for decomposition. The more difficult question we need to tease apart concerns the different ways of running MAGUS, and how they affect scalability and accuracy. We do this by considering recursion, guide tree, and node-parallelism in turn.
On one hand, recursion actually slows MAGUS down on smaller datasets. On the other hand, this is rapidly reversed as MAGUS chokes on larger datasets without recursion. This can be seen from our 16S results, where MAGUS is much faster without recursion on 6,000–7,000 sequences, but much slower on 27,000. This reversal can also be seen on the HomFam datasets. On RNASim, MAGUS without recursion is faster on 10,000–50,000 sequences, but simply fails after that.
The nature of this limitation is fairly clear: given N sequences and S subsets, MAGUS without recursion must run MAFFT -linsi on chunks of roughly N/S sequences. Thus, MAGUS without recursion is only viable for as long as MAFFT -linsi can handle these chunks. Our results suggest that subsets approaching around 1,000 sequences become a real problem: this is about where RNASim fails and 16S.B.ALL takes an inordinate amount of time. There is less of a problem on HomFam, where the amino acid sequences are much shorter.
Moreover, recursion does not improve accuracy; MAGUS without recursion is noticeably more accurate on HomFam, about the same on 16S, and slightly better on 10,000 sequences of RNASim. These observations suggest that recursion should be avoided if possible, and only engaged when the dataset becomes too large for the subsets to be reasonably aligned with the base method.
As far as decomposition strategy is concerned, the FastTree method remains the most accurate. The runtime becomes an issue on the largest datasets, where the tree takes about 5 days to compute on 1 million sequences. The best alternative, as suggested by our results, is to use the Clustal Omega guide tree. This gives the best compromise between accuracy and runtime, and only takes 14 hours on 1 million sequences.
Taking advantage of our newfound node-parallelism has a considerable impact on runtime. If we exclude the FastTree computation from the MAGUS runtime on 1 million sequences, the actual alignment stage takes about 9 days on a single node, but only about 17 hours on 10 nodes and 2.5 hours on 100 nodes. Thus, given enough compute nodes, the total runtime is mostly dominated by the guide tree method, rather than the alignment itself; this is the motivation for considering Clustal as a FastTree alternative.
Conclusions
We presented a powerful set of improvements to our MAGUS method, allowing it to scale from 50,000 to a full million sequences. Moreover, MAGUS is able to align such vast datasets more accurately than the other methods we compared against.
UPP(Fast) remains the fastest way to effectively align a million sequences on a single compute node, but suffers from consistently worse alignment accuracy. Other methods are able to finish quickly on smaller datasets, but struggle to complete on larger numbers of sequences, while also trailing MAGUS in accuracy.
We conclude by distilling our results into a number of concrete recommendations for interested practitioners.
Recursion is harmful on smaller datasets, but necessary on larger datasets
If the dataset is small enough, MAGUS will run considerably faster without recursion and might have slightly better accuracy. On larger datasets, MAGUS will rapidly grind to a halt without recursion. Thus, it is advised to avoid recursion if the dataset permits this. The threshold is dictated by the subset size (roughly N/S, as discussed above). Given our data, we found the “threshold” subset size to be somewhere around 1,000 sequences of a few thousand nucleotides, or somewhere above 4,000 sequences of a few hundred amino acids.
The importance of node-parallelism and guide tree
The default FastTree-based subset decomposition gives the best accuracy, and is fast enough for most purposes. For huge datasets of half a million or more, the Clustal Omega-based decomposition runs much faster and is nearly as accurate. As one might expect, using as many compute nodes as possible will improve the runtime. However, using more nodes than subsets will decrease the added gains from node-parallelism.
Running MAGUS
Putting all of the above together, the most accurate way of running MAGUS is to use the default FastTree-based decomposition without recursion, preferably on as many compute nodes as are available. If the dataset is too large to allow the subsets to align in a reasonable amount of time, recursion should be enabled. Finally, if the dataset is too large to allow FastTree to finish in a reasonable amount of time, the Clustal-based decomposition should be used.
Future directions
We plan to explore several future directions towards further improving MAGUS. The first is to comprehensively investigate the performance of MAGUS on fragmentary data. Fragmentary sequences can potentially confound effective methods, and we will extend MAGUS to reliably handle such scenarios.
The second avenue of improvement is to consider alternative procedures for assembling backbone alignments, and is intended to further increase alignment accuracy. Currently, MAGUS uses the simple expedient of building backbones with equal, random samples from each subset. We will develop and evaluate ways to build more compact (and, thus, more accurate) backbone sets that still sufficiently span the subsets.
Thirdly, we have mostly developed MAGUS to be able to align vast numbers of sequences accurately. In the future, we hope to also extend MAGUS “in the other direction”—to handle datasets with arbitrarily long, even genome-scale sequences.
A final issue to explore is the utility and management of extra-large alignments for downstream applications. In the context of large-scale tree estimation in particular, is it better to compile a single MSA (probably with some loss from compression) and use it to estimate the entire tree in one operation, or would it be more effective to estimate smaller alignments and use them for piecewise tree estimation? There has been some recent work comparing unitary and piecewise maximum likelihood tree estimation strategies on large datasets [20], showing that divide-and-conquer methods are much faster and nearly as accurate, but more investigation will be needed, particularly at the higher scales we explored here.
Commands used
- MAGUS:
python3 magus.py -d tempdir -o result.txt -i unalign.txt -t <guide tree option or path> --recurse <true|false> --maxnumsubsets <25|100>
- PASTA 1.8.3:
python3 run_pasta.py -i unalign.txt -o result.txt --temporaries tempdir -d <dna|rna|protein> --keeptemp
- UPP 4.3.10:
python3 run_upp.py -s unalign.txt -p result.txt -m rna
- UPP(Fast) 4.3.10:
python3 run_upp.py -s unalign.txt -p result.txt -B 100 -m rna
- Muscle 3.8.425:
muscle -maxiters 2 -in unalign.txt -out result.txt
- Clustal Omega 1.2.4:
clustalo -i unalign.txt -o result.txt --threads=32
- MAFFT 7.450 (--auto):
mafft --auto --ep 0.123 --quiet --thread 32 --anysymbol unalign.txt > result.txt
- FastSP 1.6.0 (computing alignment error):
java -Xmx256G -jar FastSP_1.6.0.jar -r reference_align.txt -e estimated_align.txt -ml
Supporting information
Data Availability
MAGUS is open-source and freely available at https://github.com/vlasmirnov/MAGUS. The datasets used in this study can be downloaded from the Illinois Data Bank at https://doi.org/10.13012/B2IDB-1048258_V1.
Funding Statement
This work was funded by the Ira & Debra Cohen Graduate Fellowship to VS. VS was also funded by a research assistantship with Dr. Tandy Warnow, which was funded by NSF grant ABI-1458652. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004;5(1):113. doi: 10.1186/1471-2105-5-113
- 2. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology. 2011;7(1):539. doi: 10.1038/msb.2011.75
- 3. Katoh K, Kuma Ki, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Research. 2005;33(2):511–518. doi: 10.1093/nar/gki198
- 4. Nguyen NpD, Mirarab S, Kumar K, Warnow T. Ultra-large alignments using phylogeny-aware profiles. Genome Biology. 2015;16(1):124. doi: 10.1186/s13059-015-0688-z
- 5. Lassmann T. Kalign 3: multiple sequence alignment of large datasets. Bioinformatics. 2019;36(6):1928–1929.
- 6. Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology. 2000;302(1):205–217. doi: 10.1006/jmbi.2000.4042
- 7. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Research. 2005;15(2):330–340. doi: 10.1101/gr.2821705
- 8. Pei J, Grishin NV. PROMALS: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics. 2007;23(7):802–808. doi: 10.1093/bioinformatics/btm017
- 9. Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 2009;324(5934):1561–1564. doi: 10.1126/science.1171243
- 10. Liu K, Warnow TJ, Holder MT, Nelesen SM, Yu J, Stamatakis AP, et al. SATe-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Systematic Biology. 2012;61(1):90. doi: 10.1093/sysbio/syr095
- 11. Mirarab S, Nguyen N, Guo S, Wang LS, Kim J, Warnow T. PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. Journal of Computational Biology. 2015;22(5):377–386. doi: 10.1089/cmb.2014.0156
- 12. Smirnov V, Warnow T. MAGUS: Multiple Sequence Alignment using Graph Clustering. Bioinformatics. 2020.
- 13. Van Dongen SM. A cluster algorithm for graphs. Amsterdam: National Research Institute for Mathematics and Computer Science in the Netherlands; 2000. Available from: https://ir.cwi.nl/pub/4463.
- 14. Price MN, Dehal PS, Arkin AP. FastTree 2: approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5(3):e9490. doi: 10.1371/journal.pone.0009490
- 15. Eddy SR. HMMER website; 2020. Available from: http://hmmer.org.
- 16. Katoh K, Toh H. Recent developments in the MAFFT multiple sequence alignment program. Briefings in Bioinformatics. 2008;9(4):286–298. doi: 10.1093/bib/bbn013
- 17. Cannone JJ, Subramanian S, Schnare MN, Collett JR, D’Souza LM, Du Y, et al. The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics. 2002;3(1):2. doi: 10.1186/1471-2105-3-2
- 18. Garriga E, Di Tommaso P, Magis C, Erb I, Mansouri L, Baltzis A, et al. Large multiple sequence alignments with a root-to-leaf regressive method. Nature Biotechnology. 2019;37(12):1466–1470. doi: 10.1038/s41587-019-0333-6
- 19. Mirarab S, Warnow T. FastSP: linear time calculation of alignment accuracy. Bioinformatics. 2011;27(23):3250–3258. doi: 10.1093/bioinformatics/btr553
- 20. Park M, Zaharias P, Warnow T. Disjoint Tree Mergers for Large-Scale Maximum Likelihood Tree Estimation. Algorithms. 2021;14(5):148. doi: 10.3390/a14050148