Abstract
Multiple sequence alignment (MSA) is an increasingly important task in bioinformatics, as we have to deal with constantly growing gene and protein sequence databases. MSA is applied in phylogenetic analysis, in discovering conserved protein domains, in the assignment of secondary and tertiary structural features in proteins, and in metagenomic sample analysis and gene discovery. Usually, the focus is on the MSA of long sequences, since these tasks appear most frequently in practice. However, the strict analysis of the optimal MSA of short sequences is a neglected area, and findings there may contribute to better and faster algorithms for the multiple alignment of long sequences. In the present contribution, we examine length-1 sequences using an arbitrary metric and length-2 sequences using the unit metric, and we show that the optimum of the MSA problem is achieved by the trivial alignment in both cases.
1. INTRODUCTION
Multiple sequence alignment (MSA) is one of the central problems in classical bioinformatics. While the exact optimal global and local alignments of two sequences can be computed in quadratic time with the Needleman-Wunsch 1 and the Smith-Waterman 2 , 3 algorithms, respectively, exact multiple alignment is proven to be NP-hard in general, 4 and therefore it is very unlikely to be computable by an $O(n^c)$-time algorithm for any constant $c$, where $n$ denotes the input size.
Heuristic MSA algorithms include the different versions of CLUSTAL, 5 , 6 , 7 , 8 , 9 , 10 , 11 MSACompro, 12 PRALINE, 13 TCS, 14 PASTASpark, 15 and numerous others.
Multiple sequence alignment algorithms have a range of applications in bioinformatics, for example, in HMM profile building in the famous HMMER suite of programs, 16 , 17 , 18 , 19 in identifying conserved protein and genome sequences, in phylogenetic tree building and analysis, 20 , 21 , 22 , 23 motif discovery and gene identification in metagenomic samples, 19 secondary protein structure prediction, 24 , 25 and solvent accessibility computation. 26
In general, the MSA problem is computationally hard (i.e., NP-hard). 4 An interesting question is for which values of the parameters the problem becomes hard.
It is known that for a constant number of sequences, the MSA problem is solvable in polynomial time by dynamic programming algorithms (Needleman-Wunsch and Smith-Waterman generalizations). Therefore, the MSA problem is “easy”, that is, polynomial time computable, if the number of the sequences is small (i.e., it is a constant). To the best of our knowledge, no one has examined the complexity of MSA when the number of the sequences is not small, but their length is.
One can assume that for length‐1 (and perhaps even for length‐2) sequences, it may not be that hard to find an optimal alignment. Furthermore, if an optimal alignment for short sequences can be determined in polynomial time, then it could also help to develop faster or more accurate heuristic algorithms. In this work, some new results regarding the alignment of short sequences are presented.
1.1. Definitions and notations
Definition 1
Let $\Sigma$ be a finite alphabet; a string over $\Sigma$ is called a sequence. The pair of sequences $(S_1', S_2')$ is an alignment of sequences $S_1$ and $S_2$ if, for $i = 1, 2$, $S_i'$ is obtained from $S_i$ by inserting gaps (spaces, denoted by –) into or at either end of $S_i$, and after that, $S_1'$ and $S_2'$ have the same length. It is assumed that "–" is not an element of the alphabet $\Sigma$.
The alignment of Definition 1 consists of two sequences of the same length. Consequently, every character of $S_1'$ corresponds uniquely to a character of $S_2'$, namely the one located at the same position.
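Let $l$ be the common length of $S_1'$ and $S_2'$. The cost of this alignment is
$$\mathrm{cost}(S_1', S_2') = \sum_{i=1}^{l} d\left(S_1'[i], S_2'[i]\right) \qquad (1)$$
where $d$ is a score scheme over $\Sigma \cup \{\text{–}\}$, and $S[i]$ is the $i$th character of $S$. The score scheme is usually required to be a metric on the set $\Sigma \cup \{\text{–}\}$, that is, it needs to satisfy $d(x,y) \ge 0$ with $d(x,y) = 0$ exactly when $x = y$; $d(x,y) = d(y,x)$; and the triangle inequality: $d(x,z) \le d(x,y) + d(y,z)$, for all $x, y, z \in \Sigma \cup \{\text{–}\}$. A frequently used score scheme is the unit metric, where $d(x,y) = 0$ if $x = y$ and 1 otherwise. We call an alignment optimal for two sequences if its cost is minimal among all possible alignments.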
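The optimal pairwise cost defined above can be computed with the Needleman-Wunsch recurrence mentioned in the Introduction. The following is only a minimal sketch under stated assumptions: the names `GAP`, `unit_metric`, and `pairwise_optimal_cost` are hypothetical, the gap is written as `-`, and the score scheme is passed in as an ordinary function.

```python
from typing import Callable

GAP = "-"  # gap symbol of this sketch; assumed not to belong to the alphabet


def unit_metric(x: str, y: str) -> int:
    """Unit metric: 0 for identical symbols, 1 otherwise (the gap included)."""
    return 0 if x == y else 1


def pairwise_optimal_cost(s: str, t: str,
                          d: Callable[[str, str], float] = unit_metric) -> float:
    """Minimum cost of aligning s and t under the score scheme d."""
    n, m = len(s), len(t)
    # best[i][j] = minimum cost of aligning the prefixes s[:i] and t[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] + d(s[i - 1], GAP)
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] + d(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = min(
                best[i - 1][j - 1] + d(s[i - 1], t[j - 1]),  # character vs character
                best[i - 1][j] + d(s[i - 1], GAP),           # character of s vs gap
                best[i][j - 1] + d(GAP, t[j - 1]),           # gap vs character of t
            )
    return best[n][m]


print(pairwise_optimal_cost("CCG", "CGC"))  # prints 2
```

The table has $(|S_1|+1)(|S_2|+1)$ cells and each cell is filled in constant time, which is the quadratic running time referred to in the Introduction.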
The definition of aligning two sequences can easily be generalized to more strings: let $k$ be a positive integer, and suppose that we want to align the sequences $S_1, \dots, S_k$. Let us insert gaps into or at either end of the strings $S_1, \dots, S_k$, so that they have the same length $l$, and, in the proper order, write the resulting sequences $S_1', \dots, S_k'$, each of length $l$, under one another. This table can be considered a matrix of size $k \times l$, and it is called a multiple alignment of the sequences $S_1, \dots, S_k$. Different scoring methods can be applied for multiple alignments; perhaps the most often used one is the sum of pairs method, where the cost is the sum of the costs of the alignments of the pairs from the aligned sequences. More exactly, if $S_1', \dots, S_k'$ are the aligned sequences forming the multiple alignment $A$, then their sum of pairs cost 27 is
$$\mathrm{cost}(A) = \sum_{1 \le i < j \le k} \mathrm{cost}(S_i', S_j') = \sum_{1 \le i < j \le k} \sum_{p=1}^{l} d\left(S_i'[p], S_j'[p]\right) \qquad (2)$$
Examples. (i) Let $S = \{\mathrm{CCG}, \mathrm{GCG}, \mathrm{CGC}\}$. The following set of sequences is a multiple alignment $A$ of $S$:
C | C | G | – |
G | C | G | – |
– | C | G | C |
Using the unit metric and computing the costs of the columns, $\mathrm{cost}(A) = 3 + 0 + 0 + 2 = 5$.
(ii) Let $\Sigma$ now contain only two characters (C and G) with the following metric:
  | C | G | – |
C | 0 | 2 | 1 |
G | 2 | 0 | 1 |
– | 1 | 1 | 0 |
Let $S = \{\mathrm{CG}, \mathrm{GC}, \mathrm{GG}\}$. A multiple alignment $A$ of $S$:
– | C | G |
G | C | – |
G | – | G |
Using the given metric, $\mathrm{cost}(A)$ is equal to $2 + 2 + 2 = 6$.
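The sum of pairs costs of the two example alignments can be recomputed mechanically. The sketch below is illustrative only: the function names are hypothetical, the gap is written as `-`, and the two alignments are copied row by row from the tables of Examples (i) and (ii).

```python
from itertools import combinations

GAP = "-"  # gap symbol of this sketch


def unit_metric(x, y):
    """Unit metric: 0 for identical symbols, 1 otherwise."""
    return 0 if x == y else 1


def example_ii_metric(x, y):
    """Score scheme of Example (ii): d(C,G) = 2, d(C,-) = d(G,-) = 1."""
    if x == y:
        return 0
    return 1 if GAP in (x, y) else 2


def sum_of_pairs_cost(rows, d):
    """Sum of pairs cost of a multiple alignment given as equal-length rows."""
    assert len({len(r) for r in rows}) == 1, "rows must have the same length"
    return sum(d(a, b)
               for r1, r2 in combinations(rows, 2)  # every pair of rows
               for a, b in zip(r1, r2))             # every column of the pair


print(sum_of_pairs_cost(["CCG-", "GCG-", "-CGC"], unit_metric))    # prints 5
print(sum_of_pairs_cost(["-CG", "GC-", "G-G"], example_ii_metric))  # prints 6
```

Summing over pairs of rows and summing over columns give the same total, so the column-wise computation used in the examples and the pair-wise definition agree.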
2. MULTIPLE SEQUENCE ALIGNMENT FOR LENGTH‐1 SEQUENCES
In this section, we focus on aligning length-1 sequences (equivalently, characters of $\Sigma$). An important earlier result needs to be quoted here 28 :
Theorem 2 (Lemma 3)
Let $U$ be a subset of a set $S$ of sequences over $\Sigma$, such that $U$ contains only identical sequences, and let $A$ be an optimal alignment of $S$. Let $A_U$ denote the restriction of $A$ to the rows of $U$. Then $\mathrm{cost}(A_U) = 0$.
An important corollary of this theorem is the following one: it is enough to examine the sets of pairwise different sequences because in each optimal alignment, every instance of a given sequence is aligned identically.
The next definition will be used frequently throughout this work:
Definition 3
Let $S$ be a set of sequences that have the same length. $A_{\mathrm{triv}}$ is called the trivial alignment of $S$ if $A_{\mathrm{triv}}$ is constructed by writing the sequences under one another, without using any gaps.
2.1. Multiple sequence alignment for length‐1 sequences using unit metric
The main result of this subsection is the next theorem:
Theorem 4
Using the unit metric, there cannot be a multiple sequence alignment for length-1 sequences that has cost less than the cost of their trivial alignment. Additionally, if we align $k$ pairwise different length-1 sequences, then the cost of an optimal alignment is $\binom{k}{2}$.
By Theorem 2, we may assume that the characters to be aligned are pairwise different. It is easy to see that the trivial alignment of $k$ different characters has a cost of $\binom{k}{2}$: there are $\binom{k}{2}$ pairs among these characters and in every pair, there are two different sequences, so the cost of an aligned pair is always 1.
Let us suppose that this alignment is not optimal; then the length of every aligned sequence must be at least 2 in an optimal alignment. If this common length of the aligned sequences is $l$, then the general structure of the matrix of this multiple alignment is as follows: for every $1 \le i \le l$, there are $k_i$ characters in the $i$th column, where $\sum_{i=1}^{l} k_i = k$, and they are placed so that in each row, there is only one character and $l - 1$ gaps (see Table 1).
TABLE 1.
A multiple alignment for $k$ length-1 sequences on $l$ columns
$s_1$ | – | … | – |
… | … | … | … |
$s_{k_1}$ | – | … | – |
– | $s_{k_1+1}$ | … | – |
… | … | … | … |
– | $s_{k_1+k_2}$ | … | – |
… | … | … | … |
– | – | … | $s_{k-k_l+1}$ |
… | … | … | … |
– | – | … | $s_k$ |
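Obviously, the cost of the first column is
$$\binom{k_1}{2} + k_1 (k - k_1),$$
since there are $k_1$ different characters with cost of $\binom{k_1}{2}$, and besides that, each of the $k - k_1$ gaps increases the cost by one with every alphabetical character. A similar statement is true for every column, so the cost of this alignment is:
$$\sum_{i=1}^{l} \left[ \binom{k_i}{2} + k_i (k - k_i) \right] = k^2 - \frac{k}{2} - \frac{1}{2} \sum_{i=1}^{l} k_i^2.$$
Consequently, the cost above is minimized when $\sum_{i=1}^{l} k_i^2$ is maximized. Since
$$\sum_{i=1}^{l} k_i^2 \le \left( \sum_{i=1}^{l} k_i \right)^2 = k^2$$
holds, it is clear that $\frac{1}{2}\sum_{i=1}^{l} k_i^2 \le \frac{k^2}{2}$, and the cost of this alignment cannot be less than $k^2 - \frac{k}{2} - \frac{k^2}{2} = \binom{k}{2}$, that is, the cost of the trivial alignment.
Note: From the proof, it is also clear (by minimizing $\sum_{i=1}^{l} k_i^2$) that a multiple alignment for $k$ different length-1 sequences cannot have a higher cost than $k^2 - \frac{k}{2} - \frac{k^2}{2l}$, if the length of the aligned sequences is $l$. Since $l \le k$, the cost can be at most $k^2 - k$, and this limit is reached if there is only one character in every column and in every row; then the cost is $k(k-1)$.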
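The column-cost argument and the two extreme values can be checked exhaustively for small $k$. The sketch below is an illustration, not part of the proof: the helper names are hypothetical, the characters are taken as the first $k$ capital letters, and every placement of the $k$ characters into at most $k$ columns is enumerated (all-gap columns are allowed and cost nothing).

```python
from itertools import combinations, product
from math import comb

GAP = "-"  # gap symbol of this sketch


def sp_cost_unit(rows):
    """Sum of pairs cost under the unit metric."""
    return sum(a != b
               for r1, r2 in combinations(rows, 2)
               for a, b in zip(r1, r2))


def cost_range(k):
    """Return (min, max) cost over all alignments of k distinct characters."""
    chars = [chr(ord("A") + i) for i in range(k)]
    costs = []
    # cols[r] is the column that holds the single character of row r
    for cols in product(range(k), repeat=k):
        rows = [GAP * c + ch + GAP * (k - c - 1) for ch, c in zip(chars, cols)]
        cost = sp_cost_unit(rows)
        sizes = [cols.count(c) for c in range(k)]  # the k_i of the proof
        # column i costs C(k_i, 2) + k_i * (k - k_i), as argued in the proof
        assert cost == sum(comb(ki, 2) + ki * (k - ki) for ki in sizes)
        costs.append(cost)
    return min(costs), max(costs)


print(cost_range(4))  # prints (6, 12)
```

For $k = 4$ the minimum $6 = \binom{4}{2}$ is the cost of the trivial alignment and the maximum $12 = k(k-1)$ is reached when every character has its own column.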
3. MULTIPLE SEQUENCE ALIGNMENT FOR LENGTH‐1 SEQUENCES USING ARBITRARY METRIC
In this section, it will be shown that for length-1 sequences, we can use any metric as a score scheme, and the MSA problem still remains as easy as in the case of the unit metric.
Theorem 5
Using an arbitrary metric $d$, the minimum cost of the multiple sequence alignment for length-1 sequences is attained by the trivial alignment, and if $k$ different sequences $s_1, \dots, s_k$ are aligned, then the optimal cost is equal to $C = \sum_{1 \le i < j \le k} d(s_i, s_j)$.
Because of Theorem 2, it can be assumed again that every sequence has exactly one instance in the set $S$ of sequences to be aligned. If we consider the trivial alignment of $S$, it is easy to see that its cost is equal to $C$. Induction on the number of columns in an MSA will be used to show that no alignment can have lower cost than $C$.
Let us assume that the trivial alignment is not optimal, and let $A$ denote an optimal alignment. Since $A$ is not the trivial alignment, $A$ has $l$ columns where $l \ge 2$. It can be shown that $A$ cannot have exactly two columns, because in this case the trivial alignment would not cost more than $A$.
Let us assume to the contrary that $A$ has exactly two columns; so there are $k_1$ sequences in the first column and $k_2$ in the second column, where $k_1 + k_2 = k$, and there is exactly one character in each row (since our sequences to be aligned have length equal to 1, see Table 2).
TABLE 2.
A multiple alignment for $k$ length-1 sequences in two columns
$s_1$ | – |
$s_2$ | – |
… | … |
$s_{k_1}$ | – |
– | $s_{k_1+1}$ |
– | $s_{k_1+2}$ |
… | … |
– | $s_k$ |
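We assume, without loss of generality, that the sequences in the first column are $s_1, \dots, s_{k_1}$ and every other sequence is placed in the second column. If the cost of the first column of $A$ is denoted by $C_1$, then
$$C_1 = \sum_{1 \le i < j \le k_1} d(s_i, s_j) + k_2 \sum_{i=1}^{k_1} d(s_i, \text{–}).$$
Similarly, the cost of the second column is
$$C_2 = \sum_{k_1 < i < j \le k} d(s_i, s_j) + k_1 \sum_{j=k_1+1}^{k} d(\text{–}, s_j),$$
and $\mathrm{cost}(A) = C_1 + C_2$.
A lower bound for $\mathrm{cost}(A)$ can be determined by pairing the character-gap summands in $C_1$ with the summands of the same form in $C_2$ and using the triangle inequality. For example, for a fixed $i$ ($1 \le i \le k_1$) and $k_1 < j \le k$, it is true that $d(s_i, \text{–}) + d(\text{–}, s_j) \ge d(s_i, s_j)$, so
$$k_2\, d(s_i, \text{–}) + \sum_{j=k_1+1}^{k} d(\text{–}, s_j) \ge \sum_{j=k_1+1}^{k} d(s_i, s_j).$$
It is useful to notice that the summands on the right side of this inequality are exactly those that are not included in $\mathrm{cost}(A)$ when we consider summands of the form $d(s_i, s_j)$ for this fixed $i$.
By considering this inequality for every $1 \le i \le k_1$, the following lower bound can be given:
$$k_2 \sum_{i=1}^{k_1} d(s_i, \text{–}) + k_1 \sum_{j=k_1+1}^{k} d(\text{–}, s_j) \ge \sum_{i=1}^{k_1} \sum_{j=k_1+1}^{k} d(s_i, s_j).$$
This implies that
$$\mathrm{cost}(A) = C_1 + C_2 \ge \sum_{1 \le i < j \le k} d(s_i, s_j) = C.$$
It is assumed that the trivial alignment with cost $C$ is not optimal; therefore, $A$ cannot be an optimal alignment of $S$. By this contradiction, it is proved that an optimal alignment of $S$ cannot have exactly 2 columns.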
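Using induction, we assume that it is shown that an optimal alignment cannot have exactly $i$ columns, and let $A$ be an optimal alignment with $i + 1$ columns. Considering the first two columns of $A$, there are $k_1$ sequences in the first column and $k_2$ sequences in the second one. It is enough to prove that by merging these two columns, the cost of the new alignment is not higher than the cost of $A$. The cost of these columns (see Table 3) in $A$ is equal to
$$\sum_{1 \le p < q \le k_1} d(s_p, s_q) + (k - k_1) \sum_{p=1}^{k_1} d(s_p, \text{–}) + \sum_{k_1 < p < q \le k_1 + k_2} d(s_p, s_q) + (k - k_2) \sum_{q=k_1+1}^{k_1+k_2} d(\text{–}, s_q).$$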
TABLE 3.
The first two columns of $A$
$s_1$ | – |
$s_2$ | – |
… | … |
$s_{k_1}$ | – |
– | $s_{k_1+1}$ |
– | $s_{k_1+2}$ |
… | … |
– | $s_{k_1+k_2}$ |
– | – |
… | … |
– | – |
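Let us focus on the first $k_1 + k_2$ rows of these columns. They form an alignment of $s_1, \dots, s_{k_1+k_2}$ on two columns, and it was shown that if these sequences are aligned trivially instead of using two columns, then the cost of the alignment cannot be higher. Adding to both sides the cost of aligning these characters with the gaps in the remaining $k - k_1 - k_2$ rows, this means the following:
$$\sum_{1 \le p < q \le k_1} d(s_p, s_q) + (k - k_1) \sum_{p=1}^{k_1} d(s_p, \text{–}) + \sum_{k_1 < p < q \le k_1 + k_2} d(s_p, s_q) + (k - k_2) \sum_{q=k_1+1}^{k_1+k_2} d(\text{–}, s_q) \ge \sum_{1 \le p < q \le k_1 + k_2} d(s_p, s_q) + (k - k_1 - k_2) \sum_{p=1}^{k_1+k_2} d(s_p, \text{–}).$$
On the left side of this inequality, there is the cost of the first two columns of $A$, while on the right side, there is the cost of the column that is constructed by merging the first two columns of $A$. Therefore, a lower bound for $\mathrm{cost}(A)$ is given by the cost of an alignment that has $i$ columns, which cannot be optimal by the induction hypothesis, implying that $A$ is not optimal either. □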
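Theorem 5 can also be checked numerically on small instances. The following sketch is only an illustration under stated assumptions: the score schemes are random test metrics (random positive integer weights closed under shortest paths, so that the triangle inequality holds), they are not taken from the paper, and all names are hypothetical. For four characters, every placement into at most four columns is enumerated.

```python
import random
from itertools import combinations, product

GAP = "-"  # gap symbol of this sketch


def random_metric(symbols, seed):
    """Random integer metric: shortest-path distances of random positive weights."""
    rng = random.Random(seed)
    dist = {(x, x): 0 for x in symbols}
    for x, y in combinations(symbols, 2):
        dist[x, y] = dist[y, x] = rng.randint(1, 10)
    for m in symbols:  # Floyd-Warshall closure enforces the triangle inequality
        for x in symbols:
            for y in symbols:
                dist[x, y] = min(dist[x, y], dist[x, m] + dist[m, y])
    return dist


def sp_cost(rows, d):
    """Sum of pairs cost of equal-length rows under the metric table d."""
    return sum(d[a, b]
               for r1, r2 in combinations(rows, 2)
               for a, b in zip(r1, r2))


def min_cost_length1(chars, d):
    """Minimum cost over all alignments of the given length-1 sequences."""
    k = len(chars)
    return min(
        sp_cost([GAP * c + ch + GAP * (k - c - 1) for ch, c in zip(chars, cols)], d)
        for cols in product(range(k), repeat=k)
    )


chars = list("ACGT")
for seed in range(20):
    d = random_metric(chars + [GAP], seed)
    trivial = sum(d[x, y] for x, y in combinations(chars, 2))  # the value C
    assert min_cost_length1(chars, d) == trivial
```

Integer weights are used so that the comparison is exact; with floating-point weights, ties between the trivial alignment and a gapped one could be broken by rounding.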
4. MULTIPLE SEQUENCE ALIGNMENT FOR LENGTH‐2 SEQUENCES
In this section, it will be shown that using the unit metric, a set of length‐2 sequences cannot be aligned with less cost than their trivial alignment; however, this statement does not hold for using arbitrary metric.
Theorem 6
Using the unit metric $d$, no multiple sequence alignment for length-2 sequences has less cost than their trivial alignment. If we align $k$ different sequences $s_1 = a_1 b_1, \dots, s_k = a_k b_k$, then the cost of the optimal alignment is $\sum_{1 \le i < j \le k} \left( d(a_i, a_j) + d(b_i, b_j) \right)$.
Let $S$ denote the set of sequences that need to be aligned. It is clear that the trivial alignment of $S$ has the cost written above, so this lower bound is attainable. In other words, it is enough to prove that for any $l > 2$, a non-trivial alignment on $l$ columns cannot have less cost than the trivial one.
Let $A$ be an alignment of $S$ on $l$ columns where $l > 2$. Let the rows of $A$ be permuted, so that those aligned sequences in which the column indices of the two non-gap characters are the same are placed under each other, forming a block of sequences. This operation does not change the cost of $A$. In every row of $A$, there are exactly two characters and $l - 2$ gaps, so there can be $\binom{l}{2}$ types of aligned sequences in $A$, considering only the positions of the non-gap characters in a row. This implies that there will be $\binom{l}{2}$ (not necessarily non-empty) blocks after permuting the rows of $A$ (e.g., if $l = 4$, then there are 6 blocks after the permutation of the rows, see Table 4).
TABLE 4.
The structure of $A$ after permuting its rows and making its block setting with $l = 4$. Number 1 denotes the first characters, and number 2 the second characters. During the proof, an upper bound is given for the cost of aligning characters of the same order that are not aligned in $A$, by using character-gap alignment costs that are included in $\mathrm{cost}(A)$
1 | 2 | – | – |
… | … | … | … |
1 | 2 | – | – |
1 | – | 2 | – |
… | … | … | … |
1 | – | 2 | – |
1 | – | – | 2 |
… | … | … | … |
1 | – | – | 2 |
– | 1 | 2 | – |
… | … | … | … |
– | 1 | 2 | – |
– | 1 | – | 2 |
… | … | … | … |
– | 1 | – | 2 |
– | – | 1 | 2 |
… | … | … | … |
– | – | 1 | 2 |
After making this block setting, it is clear that there are six types of aligned character pairs in :
(i) first characters of some sequences aligned with other sequences’ first characters;
(ii) first characters of some sequences aligned with other sequences’ second characters;
(iii) first characters of some sequences aligned with gaps;
(iv) second characters of some sequences aligned with other sequences’ second characters;
(v) second characters of some sequences aligned with gaps;
(vi) gaps aligned with gaps.
In the trivial alignment $A_{\mathrm{triv}}$, there are only pairs of types (i) and (iv); moreover, the first characters of all sequences are aligned with one another in $A_{\mathrm{triv}}$ (and the same holds for the second characters of the sequences of $S$). Nevertheless, in a non-trivial alignment $A$, there are aligned sequences whose first or second characters are not aligned with each other in $A$. This implies that it is enough to give an upper bound for the cost of those character pairs that are aligned with each other in $A_{\mathrm{triv}}$ but are not aligned with each other in $A$, using parts of $\mathrm{cost}(A)$ for this bound. (Because every part of $\mathrm{cost}(A)$ is non-negative, if a bijection can be given between the letter-letter alignments in $A_{\mathrm{triv}}$ that are not aligned in $A$ and some other alignments of characters of $A$ (not excluding character-gap alignments), so that the latter alignments always have at least as much cost as the former ones, then it means that $\mathrm{cost}(A_{\mathrm{triv}}) \le \mathrm{cost}(A)$.)
If $d$ denotes the unit metric, then the following inequality holds for every pair of sets $X, Y$ on an arbitrary alphabet (where $X$ and $Y$ can contain a letter more than once):
$$\sum_{x \in X} \sum_{y \in Y} d(x, y) \le |X| \cdot |Y|.$$
Using this inequality, a bijection mentioned above can be given: first, let us consider two sequences $s_i$ and $s_j$ whose first characters ($a_i$ and $a_j$) are not aligned in $A$ (it can be assumed that $a_j$ has the bigger column index). This implies that the element in the intersection of the row of $a_j$ and the column of $a_i$ must be a gap. Since $d(a_i, a_j) \le 1 = d(a_i, \text{–})$, the cost of the alignment of $a_i$ and $a_j$ in $A_{\mathrm{triv}}$ can be estimated by the cost of the alignment of a character and a gap in $A$.
Similarly, if two sequences are considered whose second characters $b_i$ and $b_j$ are not aligned in $A$, then (assuming that $b_j$ has the bigger column index) the element in the intersection of the row of $b_i$ and the column of $b_j$ must be a gap. The same estimation can be given as before, meaning that the cost of the alignment of $b_i$ and $b_j$ in $A_{\mathrm{triv}}$ is less than or equal to the cost of a character-gap alignment in $A$.
Considering the block setting (Table 4) of $A$, let $B_1$ and $B_2$ be two blocks whose sequences’ first characters are not aligned in $A$. Assuming that the first characters of the sequences in $B_2$ have the bigger column index, there must be gaps in the intersection of the column of the first characters of the sequences in $B_1$ and the rows of $B_2$. If we denote the first letters of the sequences of $B_1$ and $B_2$ by $X$ and $Y$, respectively, then (because of the statements of the previous two paragraphs) the following holds:
$$\sum_{x \in X} \sum_{y \in Y} d(x, y) \le |X| \cdot |Y| = \sum_{x \in X} |Y|\, d(x, \text{–}).$$
Besides that, a similar result can be established if we consider two blocks whose sequences’ second characters are not aligned, using the gaps of the block that has the column with the smaller column index (see Table 5). By these estimations, it is clear that this assignment between the character-character alignments in $A_{\mathrm{triv}}$ that are not present in $A$ and character-gap alignments in $A$ leads to the result that the latter costs in $A$ cannot be less than the corresponding costs in $A_{\mathrm{triv}}$. We also need to show that this assignment is a bijection, that is, there are no character-gap alignments that are used more than once.
TABLE 5.
The block setting of $A$ if $l = 4$, denoting only that an element is the first/second character of its aligned sequence or a gap. For example, the first element of the first row in the block setting and the second element of the fourth row (which denote the first characters of some sequences) are not aligned in $A$, so the cost of their alignment with each other, which is a part of $\mathrm{cost}(A_{\mathrm{triv}})$ but not a part of $\mathrm{cost}(A)$, must be estimated from above with a part of $\mathrm{cost}(A)$, namely, with the cost of aligning the first element of the first row of the block setting with the gaps in the first element of the fourth row
1 | 2 | – | – |
1 | – | 2 | – |
1 | – | – | 2 |
– | 1 | 2 | – |
– | 1 | – | 2 |
– | – | 1 | 2 |
A set of gaps in the block setting is considered in an estimation if and only if some characters in the block that contains these gaps and some characters from another block that are aligned in the same column must be aligned in $A_{\mathrm{triv}}$ but are not aligned in $A$. This implies that these gaps are not used in estimations like the ones above more times than the alignment of this gap set with the rest of the given column. Therefore, the former assignment is a bijection, implying that $\mathrm{cost}(A_{\mathrm{triv}}) \le \mathrm{cost}(A)$. □
In the proof, only the following property of the unit metric has been used: $d(x, y) \le d(x, \text{–})$ for all $x, y \in \Sigma$. It follows that Theorem 6 remains valid for any metric satisfying this property.
As the next example shows, the trivial alignment will not always be optimal for length-2 sequences if an arbitrary metric is used. Let $\Sigma$ contain two characters C and G with the same metric on $\Sigma \cup \{\text{–}\}$ as in Example (ii) at the end of the Introduction. Let $S$ also be the same as in Example (ii): $S = \{\mathrm{CG}, \mathrm{GC}, \mathrm{GG}\}$. The trivial alignment of $S$ has a cost of $8$, but as Table 6 shows, there is an alignment of $S$ that has a cost of only $6$.
TABLE 6.
The trivial (left) and an optimal (right) alignment of S
C | G |   | – | C | G |
G | C |   | G | C | – |
G | G |   | G | – | G |
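The contrast between Theorem 6 and this counterexample can be reproduced by exhaustive search. The sketch below is a small illustration, not the authors' method: all names are hypothetical, the gap is written as `-`, and every alignment of the three sequences on at most $2k = 6$ columns is enumerated under both the unit metric and the metric of Example (ii).

```python
from itertools import combinations, product

GAP = "-"  # gap symbol of this sketch


def unit_metric(x, y):
    """Unit metric: 0 for identical symbols, 1 otherwise."""
    return 0 if x == y else 1


def example_ii_metric(x, y):
    """Score scheme of Example (ii): d(C,G) = 2, d(C,-) = d(G,-) = 1."""
    if x == y:
        return 0
    return 1 if GAP in (x, y) else 2


def sp_cost(rows, d):
    """Sum of pairs cost of equal-length rows."""
    return sum(d(a, b)
               for r1, r2 in combinations(rows, 2)
               for a, b in zip(r1, r2))


def placements(seq, n_cols):
    """All ways to write a length-2 sequence into n_cols columns, order kept."""
    rows = []
    for p, q in combinations(range(n_cols), 2):  # p < q
        row = [GAP] * n_cols
        row[p], row[q] = seq[0], seq[1]
        rows.append("".join(row))
    return rows


def min_cost_length2(seqs, d, n_cols):
    """Minimum cost over all alignments of length-2 sequences on n_cols columns."""
    choices = [placements(s, n_cols) for s in seqs]
    return min(sp_cost(rows, d) for rows in product(*choices))


S = ["CG", "GC", "GG"]
N_COLS = 2 * len(S)  # more columns would only add all-gap columns
for name, d in [("unit", unit_metric), ("example (ii)", example_ii_metric)]:
    print(name, sp_cost(S, d), min_cost_length2(S, d, N_COLS))
# prints: unit 4 4
#         example (ii) 8 6
```

Under the unit metric the trivial cost and the optimum coincide, as Theorem 6 states, while under the metric of Example (ii) the gapped alignment of Table 6 is strictly cheaper.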
In the previous section, it was shown that we can easily determine the minimum cost of a set to be aligned if it includes only length‐1 sequences; moreover, we also can construct an optimal alignment in the most trivial way using any metric. We have also seen that for length‐2 sequences, the trivial alignment is optimal if the unit metric is used but it is not optimal for arbitrary metric. Besides that, it is also known that the trivial alignment is not always optimal for length‐3 sequences even using unit metric.
As in Example (i) at the end of the Introduction, let $S$ be as follows: $S = \{\mathrm{CCG}, \mathrm{GCG}, \mathrm{CGC}\}$.
Using the unit metric, the cost of the trivial alignment is $6$, but it is not optimal: as we have seen, there is a non-trivial alignment $A$ of $S$ such that $\mathrm{cost}(A)$ is only $5$ (see Table 7).
TABLE 7.
The trivial (left) and an optimal (right) alignment of S
C | C | G |   | C | C | G | – |
G | C | G |   | G | C | G | – |
C | G | C |   | – | C | G | C |
5. CONCLUSIONS
In this work, it was shown that the MSA problem is “easy” for length-1 sequences and also for length-2 sequences in special cases. While the MSA problem is well examined for a small number of long sequences, to the best of our knowledge this is the first work addressing the special case of a large number of very short sequences.
Since we know that the general problem is NP-hard, 4 it remains an interesting question at which sequence length the MSA problem starts to become difficult. Another open problem is how to characterize, in the case of length-2 sequences, those metrics for which the trivial alignment is always optimal for an arbitrary alphabet.
CONFLICT OF INTEREST
The authors declare no conflict of interest.
AUTHOR CONTRIBUTIONS
KT proved the theorems, wrote the first version of this manuscript, and prepared figures. VG initiated the study, finalized the manuscript, and secured funding.
ACKNOWLEDGMENT
KT was partially funded by the VEKOP‐2.3.2‐16‐2017‐00014 program, supported by the European Union and the State of Hungary, co‐financed by the European Regional Development Fund, and by the European Union, co‐financed by the European Social Fund (EFOP‐3.6.3‐VEKOP‐16‐2017‐00002). VG was partially funded by the NKFI‐127909 and the 2017‐1.3.1‐VKE‐2017‐00013 and the VEKOP‐2.3.2‐16‐2017‐00014 grants of the National Research, Development and Innovation Office of Hungary.
REFERENCES
- 1. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48(3):443-453.
- 2. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195-197.
- 3. Ivan G, Banky D, Grolmusz V. Fast and exact sequence alignment with the Smith-Waterman algorithm: the SwissAlign webserver. Gene Rep. 2016;4:26-28.
- 4. Elias I. Settling the intractability of multiple alignment. J Computational Biol. 2006;13:1323-1339.
- 5. Higgins DG, Sharp PM. Clustal: a package for performing multiple sequence alignment on a microcomputer. Gene. 1988;73:237-244.
- 6. Higgins DG, Bleasby AJ, Fuchs R. Clustal v: improved software for multiple sequence alignment. Computer Appl Biosci. 1992;8:189-191.
- 7. Higgins DG. Clustal v: multiple alignment of DNA and protein sequences. Methods Molecular Biol (Clifton, N.J.). 1994;25:307-318.
- 8. Higgins DG, Thompson JD, Gibson TJ. Using clustal for multiple sequence alignments. Methods Enzymol. 1996;266:383-402.
- 9. Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 1997;25:4876-4882.
- 10. Sievers F, Higgins DG. Clustal omega, accurate alignment of very large numbers of sequences. Methods Molecular Biol (Clifton, N.J.). 2014;1079:105-116.
- 11. Sievers F, Higgins DG. Clustal omega for making accurate alignments of many protein sequences. Protein Sci. 2018;27:135-145.
- 12. Deng X, Cheng J. Msacompro: improving multiple protein sequence alignment by predicted structural features. Methods Molecular Biol (Clifton, N.J.). 2014;1079:273-283.
- 13. Bawono P, Heringa J. Praline: a versatile multiple sequence alignment toolkit. Methods Molecular Biol (Clifton, N.J.). 2014;1079:245-262.
- 14. Chang J-M, Di Tommaso P, Notredame C. Tcs: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Mol Biol Evol. 2014;31:1625-1637.
- 15. Abuin JM, Pena TF, Pichel JC. PASTASpark: multiple sequence alignment meets big data. Bioinformatics (Oxford, England). 2017;33:2948-2950.
- 16. Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009;23(1):205-211.
- 17. Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011;7(10):e1002195.
- 18. Szalkai B, Scheer I, Nagy K, Vertessy BG, Grolmusz V. The metagenomic telescope. PLoS One. 2014;9:e101605.
- 19. Szalkai B, Grolmusz V. MetaHMM: A webserver for identifying novel genes with specified functions in metagenomic samples. Genomics. 2019;111(4):883-885. doi:10.1016/j.ygeno.2018.05.016.
- 20. Feng DF, Doolittle RF. Progressive alignment and phylogenetic tree construction of protein sequences. Methods Enzymol. 1990;183:375-387.
- 21. Reizer A, Reizer J. Progressive multiple alignment of protein sequences and the construction of phylogenetic trees. Methods Molecular Biol (Clifton, N.J.). 1994;25:319-325.
- 22. Metcalf V, Brennan S, George P. Using serum albumin to infer vertebrate phylogenies. Appl Bioinform. 2003;2:S97-107.
- 23. Hagopian R, Davidson JR, Datta RS, Samad B, Jarvis GR, Sjolander K. SATCHMO-JS: a webserver for simultaneous protein multiple sequence alignment and phylogenetic tree construction. Nucleic Acids Res. 2010;38:W29-W34.
- 24. Cuff JA, Barton GJ. Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins. 2000;40:502-511.
- 25. Al-Lazikani B, Sheinerman FB, Honig B. Combining multiple structure and sequence alignments to improve sequence detection and alignment: application to the sh2 domains of janus kinases. Proc Natl Acad Sci USA. 2001;98:14796-14801.
- 26. Garg A, Kaur H, Raghava GPS. Real value prediction of solvent accessibility in proteins using multiple sequence alignment and secondary structure. Proteins. 2005;61:318-324.
- 27. Wang L, Jiang T. On the complexity of multiple sequence alignment. J Comput Biol. 1994;1(4):337-348.
- 28. Bonizzoni P, Vedova GD. The complexity of multiple sequence alignment with SP-score that is a metric. Theoret Comput Sci. 2001;259(1-2):63-79.