Bioinformatics. 2004 Aug 12;21(1):20–30. doi: 10.1093/bioinformatics/bth468

A memory-efficient algorithm for multiple sequence alignment with constraints

Chin Lung Lu 1,*, Yen Pin Huang 1
PMCID: PMC7109922  PMID: 15374876

Abstract

Motivation: Recently, the concept of constrained sequence alignment was proposed to incorporate biologists' knowledge about the structures/functionalities/consensuses of their datasets into sequence alignment, such that the user-specified residues/nucleotides are aligned together in the computed alignment. The currently developed programs use the so-called progressive approach to efficiently obtain a constrained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms for computing an optimal constrained alignment between two sequences, require 𝒪(γn²) memory, where γ is the number of constraints and n is the maximum of the lengths of the sequences. As a result, such a high memory requirement limits these programs to aligning short sequences only.

Results: We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences, which greatly reduces the memory requirement of the dynamic programming approaches at the expense of a small constant factor in CPU time. This new algorithm consumes only 𝒪(αn) space, where α is the sum of the lengths of constraints and usually α ≪ n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints.

Availability: http://genome.life.nctu.edu.tw/MUSICME

Contact: cllu@mail.nctu.edu.tw

1 INTRODUCTION

Multiple sequence alignment (MSA) is one of the fundamental problems in computational molecular biology that have been studied extensively, because it is a useful tool in the phylogenetic analyses among various organisms, the identification of conserved motifs and domains in a group of related proteins, the secondary and tertiary structure prediction of a protein (or RNA), and so on (Carrillo and Lipman, 1988; Chan et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame, 2002). Moreover, MSA is one of the most challenging problems in computational molecular biology because it has been shown to be NP-complete under the consideration of sum-of-pairs scoring criteria (Kececioglu, 1993; Wang and Jiang, 1994; Bonizzoni and Vedova, 2001), which means that it seems to be hard to design an efficient algorithm for finding the mathematically optimal alignment. Hence, some approximate methods (Gusfield, 1993; Pevzner, 1992; Bafna et al., 1997; Li et al., 2000) and heuristic methods (Feng and Doolittle, 1987; Taylor, 1987; Corpet, 1988; Higgins and Sharpe, 1988; Thompson et al., 1994) were introduced to overcome this problem.

Recently, the concept of constrained sequence alignment was proposed to incorporate the knowledge of biologists regarding the structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together in the computed alignment (Tang et al., 2003). Tang et al. (2003) first designed a dynamic programming algorithm for finding an optimal constrained alignment of two sequences and then used it as a kernel to develop a constrained multiple sequence alignment (CMSA) tool based on the progressive approach, where each constraint considered by Tang et al. is a single residue/nucleotide only. Their proposed algorithm for two sequences runs in 𝒪(γn⁴) time and consumes 𝒪(n⁴) space, where γ is the number of constrained residues and n is the maximum length of the sequences. Later, this result was improved independently by two groups of researchers to 𝒪(γn²) time and 𝒪(γn²) space using the same dynamic programming approach (Yu, 2003; Chin et al., 2003). In fact, each constraint requested to be aligned together can represent a conserved site of a protein/DNA/RNA family, and each conserved site may consist of a short segment of residues/nucleotides instead of a single residue/nucleotide. In other words, the constraint specified by the biologists can be a fragment of several residues/nucleotides. For some applications, biologists may further expect that some mismatches are allowed among the residues/nucleotides of the columns requested to be aligned. Hence, Tsai et al. (2004) studied this kind of constrained sequence alignment and designed an algorithm of 𝒪(γn²) time and 𝒪(γn²) space for two sequences. The improvements and extension above greatly increase the performance and practical usefulness of the CMSA tools developed using the progressive approach.
However, the requirement of 𝒪(γn²) memory still limits the existing CMSA tools to aligning sets of short sequences of at most several hundred residues/nucleotides. To align large genomic sequences of at least several thousand residues/nucleotides, there is a need to design a memory-efficient algorithm for the constrained pairwise sequence alignment (CPSA) problem, which is the key factor limiting the applicability of the progressive CMSA tools. Hence, in this paper, we adopt the so-called divide-and-conquer approach to design a memory-efficient algorithm for solving the CPSA problem, which runs in 𝒪(γn²) time, but consumes only 𝒪(αn) space, where α is the sum of the lengths of constraints and usually α ≪ n in practical applications. Based on this algorithm, we have developed a memory-efficient CMSA tool using the progressive approach. Note that applying the divide-and-conquer approach to memory-efficiently align two or more sequences without any constraints has been studied extensively (Myers and Miller, 1988; Chao et al., 1994; Tönges et al., 1996; Stoye et al., 1997a; Stoye et al., 1997b; Stoye, 1998). In contrast to the progressive approach used here, the divide-and-conquer algorithms proposed by Stoye et al. (Tönges et al., 1996; Stoye et al., 1997a; Stoye et al., 1997b; Stoye, 1998) consider the input sequences simultaneously and heuristically compute good, but not necessarily optimal, dividing positions so that the resulting total MSA is close to an optimal MSA of the original sequences. In fact, many other CMSA methods have been proposed from various perspectives, even using different approaches (Schuler et al., 1991; Depiereux and Feytmans, 1992; Taylor, 1994; Myers et al., 1996; Notredame et al., 2000; Thompson et al., 2000; Sammeth et al., 2003).
Of these various CMSA methods, it is worth mentioning that Myers et al. (1996) obtained their CMSA by performing progressive multiple alignment under position-based constraints that are given by users, whereas Sammeth et al. (2003) obtained their CMSA by performing simultaneous multiple alignment under segment-based constraints (the same kind studied here) that are pre-computed via a local segment-based algorithm (Morgenstern, 1999). We refer the reader to their papers for details.

2 PROBLEM FORMULATION

Let 𝒪 = {S1, S2, …, Sχ} be the set of χ sequences over the alphabet Σ. Then an MSA of 𝒪 is a rectangular matrix consisting of χ rows of characters of Σ ∪ {−} such that no column consists entirely of dashes and removing dashes from row i leaves Si for any 1 ≤ i ≤ χ. The sum-of-pairs score (SP score) of an MSA is defined to be the sum of the scores of all columns, where the score of each column is the sum of the scores of all distinct pairs of characters in the column. In practice, the score of the pair of two dashes is usually set to zero. Then the problem of finding an MSA of 𝒪 with the optimal SP score is the so-called sum-of-pairs MSA problem (Carrillo and Lipman, 1988; Chan et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame, 2002).
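To make the SP score concrete, the following Python sketch sums pair scores column by column. It is only an illustration: the function names and the `toy_score` values are assumptions made for this example, and gaps are scored per column here rather than with the affine gap penalty introduced in Section 3.

```python
from itertools import combinations


def sp_score(msa, pair_score):
    """Sum-of-pairs score of an MSA given as a list of equal-length rows."""
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*msa):                    # walk the alignment column by column
        for x, y in combinations(column, 2):    # all distinct pairs in the column
            total += pair_score(x, y)
    return total


def toy_score(x, y):
    """Hypothetical pair score: dash/dash = 0, dash/residue = -2, match = 1, mismatch = -1."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


msa = ["AC-GT", "ACTGT", "A--GT"]
print(sp_score(msa, toy_score))
```

Setting the dash/dash score to zero follows the convention stated above; every other value in `toy_score` is arbitrary.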

Let δ(T1, T2) denote the Hamming distance between two subsequences T1 and T2 of equal length, which is equal to the number of mismatched pairs in the alignment of T1 and T2 without any gap. Given an alignment ℒ of 𝒪, a band is defined as a block of consecutive columns in ℒ (i.e. a submatrix of ℒ). For any band ℒ′ of ℒ, let subseq(Si, ℒ′) denote the subsequence of Si whose residues/nucleotides are all in the band ℒ′, where 1 ≤ i ≤ χ. A subsequence T = t1t2…tλ is said to appear in ℒ if ℒ contains a band ℒ′ of λ columns, say π1, π2, …, πλ, such that the characters of column πj, 1 ≤ j ≤ λ, are all equal to tj, or equivalently, subseq(Si, ℒ′) = T for each 1 ≤ i ≤ χ. If δ[subseq(Si, ℒ′), T] ≤ λ × ε for a given error ratio 0 ≤ ε < 1 [i.e. some mismatches are allowed between subseq(Si, ℒ′) and T], then T is said to approximately appear in ℒ. From the biological viewpoint, T can be considered as the consensus among the subsequences in ℒ′ and hence T is also called the consensus induced by the band ℒ′. For any two subsequences T1 and T2, T1 ≺ T2 is used to denote that T1 (approximately) appears strictly before T2 in ℒ (i.e. their corresponding bands do not overlap). Let Ω = (C1, C2, …, Cγ) be an ordered set of γ constraints (i.e. subsequences), each constraint Ci having length λi, where 1 ≤ i ≤ γ. Then the CMSA of 𝒪 with respect to Ω is defined as an alignment ℒ of 𝒪 in which all the constraints of Ω approximately appear in the order C1 ≺ C2 ≺ ··· ≺ Cγ such that δ(subseq(Si, ℒ′j), Cj) ≤ λj × ε for all 1 ≤ i ≤ χ and 1 ≤ j ≤ γ, where ℒ′j is the band of ℒ whose induced consensus is Cj. Given a set 𝒪 of χ sequences along with an ordered set Ω of γ constraints and an error ratio ε, the so-called CMSA problem is to find a CMSA w.r.t. Ω with the optimal SP score. When the number of sequences in 𝒪 is restricted to two (i.e. χ = 2), the CMSA problem is called the CPSA problem.
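The definition of approximate appearance can be checked mechanically. The sketch below is a minimal illustration, assuming the alignment is given as a list of equal-length strings and the band is identified by its first column; the function names are hypothetical. It also assumes that a band containing a dash cannot induce a consensus, since δ is only defined for subsequences of equal length.

```python
def hamming(t1, t2):
    """Hamming distance between two equal-length strings."""
    if len(t1) != len(t2):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(t1, t2))


def approximately_appears(rows, start, constraint, eps):
    """True if `constraint` approximately appears in the band of columns
    start .. start + len(constraint) - 1 of the alignment `rows`."""
    lam = len(constraint)
    for row in rows:
        band = row[start:start + lam]
        # Assumption: the band must be gap-free and fully inside the alignment.
        if len(band) != lam or "-" in band:
            return False
        if hamming(band, constraint) > lam * eps:
            return False
    return True


rows = ["ACGTAC", "ACCTAC"]
print(approximately_appears(rows, 1, "CGT", eps=0.5))  # one mismatch allowed
print(approximately_appears(rows, 1, "CGT", eps=0.0))  # exact appearance required
```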

3 ALGORITHM

In this section, we shall first design a memory-efficient algorithm for solving the CPSA problem with two given sequences A = a1a2…am and B = b1b2…bn, a given ordered set Ω = (C1, C2, …, Cγ) of γ constraints, each constraint Ci having length λi, 1 ≤ i ≤ γ, and a given error threshold ε. After that, we shall use it as the kernel to heuristically solve the CMSA problem.

For any sequence T, let pref(T, l) [respectively, suff(T, l)] denote the prefix (respectively, suffix) of T with length l. For any two characters a, b ∈ Σ, let σ(a, b) denote the score of aligning a with b. The gap penalty adopted here is the so-called affine gap penalty that penalizes a gap of length l with wo + l × we, where wo > 0 is the gap-open penalty and we > 0 is the gap-extension penalty. For convenience, let Ai = pref(A, i) = a1a2…ai, Bj = pref(B, j) = b1b2…bj and Ωk = (C1, C2, …, Ck), where 1 ≤ i ≤ m, 1 ≤ j ≤ n and 1 ≤ k ≤ γ. Let ℳk(i, j) denote the score of an optimal constrained alignment of Ai and Bj w.r.t. Ωk. Clearly, ℳγ(m, n) is the score of an optimal constrained alignment of A and B w.r.t. Ω. An alignment ℒ is called a semi-constrained alignment of Ai and Bj w.r.t. Ωk if it is a constrained alignment of Ai and Bj w.r.t. Ωk−1 and also ends (or begins) with a band whose induced consensus is equal to a prefix of Ck (or a suffix of C1). 𝒩k(i, j, h) is defined to be the score of an optimal semi-constrained alignment of Ai and Bj w.r.t. Ωk that ends with an induced consensus equal to pref(Ck, h). Let Inline graphic [respectively, Inline graphic] be the maximum score of all constrained alignments of Ai and Bj w.r.t. Ωk that end with a deletion pair (ai, −) [respectively, an insertion pair (−, bj)]. By definition, it is not hard to derive the recurrence of ℳk(i, j), 1 ≤ i ≤ m and 1 ≤ j ≤ n, as follows. If k = 0, then Inline graphic. If 1 ≤ k ≤ γ, then Inline graphic. Clearly, Inline graphic, if δ(suff(Ai, λk), Ck) ≤ λk × ε and δ(suff(Bj, λk), Ck) ≤ λk × ε; otherwise, 𝒩k(i, j, λk) = −∞. To simplify the description of the computation of Inline graphic and Inline graphic, we introduce another notation Inline graphic, which is defined to be the maximum score of all constrained alignments of Ai and Bj w.r.t. Ωk that end with a substitution pair (ai, bj).
Let Inline graphic denote the alignment of Ai and Bj with score Inline graphic that ends with a deletion pair (ai, −). Let ℒ′ be the portion of Inline graphic before the last aligned pair (ai, −). Then there are three possibilities when we consider the last aligned pair of ℒ′.

Case 1: The last aligned pair of ℒ′ is a substitution pair. Then the score of ℒ′ is Inline graphic and (ai, −) is charged by a gap-open penalty and a gap-extension penalty in Inline graphic. Hence, Inline graphic.

Case 2: The last aligned pair of ℒ′ is a deletion pair. Then the score of ℒ′ is Inline graphic and (ai, −) is charged by only one gap-extension penalty in Inline graphic. Hence, Inline graphic.

Case 3: The last aligned pair of ℒ′ is an insertion pair. Then the score of ℒ′ is Inline graphic and (ai, −) is charged by a gap-open penalty and a gap-extension penalty in Inline graphic. Hence, Inline graphic.

In summary, Inline graphic. However, by including an extra Inline graphic into the right-hand side of the above recurrence, we can reformulate the above recurrence as Inline graphic. Similar to the discussion above, the recurrence of Inline graphic can be derived as Inline graphic.

According to the recurrences above, we designed an algorithm to compute ℳγ(m, n) and its corresponding constrained alignment using the technique of dynamic programming as follows. For convenience, we depict the recurrences of matrices ℳk, Inline graphic, Inline graphic and 𝒩k for all 0 ≤ k ≤ γ by a three-dimensional (3D) grid graph 𝒢, which consists of (m + 1) × (n + 1) × (γ + 1) entries; each entry (i, j, k) consists of four nodes ℳk, Inline graphic, Inline graphic and 𝒩k corresponding to ℳk(i, j), Inline graphic, Inline graphic and 𝒩k(i, j, λk), respectively. Figure 1 shows the relationship of four adjacent entries (i, j, k), (i − 1, j, k), (i, j − 1, k) and (i − 1, j − 1, k) of 𝒢 for each fixed k.

Fig. 1.


The schematic diagram of four adjacent entries of 𝒢, where entry (i, j, k) consists of four nodes ℳk, Inline graphic, Inline graphic and 𝒩k corresponding to ℳk(i, j), Inline graphic, Inline graphic and 𝒩k(i, j, λk), respectively.

Note that there is a directed edge, which is not shown in Figure 1, with weight Inline graphic from the ℳk−1 node of the entry (i − λk, j − λk, k − 1) to the 𝒩k node of the entry (i, j, k). Then each path from the ℳ0(0, 0) node of entry (0, 0, 0) to the ℳγ(m, n) node of entry (m, n, γ) corresponds to a constrained alignment of A and B w.r.t. Ω. As a result, an optimal constrained alignment of A and B can be obtained by backtracking an optimal (maximum-score) path from ℳγ(m, n) to ℳ0(0, 0) in 𝒢. It is not hard to see that the algorithm requires both time and memory in the order of 𝒪(γmn). We call the above algorithm, which is based on the dynamic programming approach, the CPSA-DP algorithm.
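The dynamic programming scheme described above can be sketched in Python as follows. This is a score-only illustration written from the prose definitions, not the authors' implementation: the names `M`, `D`, `I` and `cpsa_dp_score` are assumptions, the scoring function and penalties in the usage example are arbitrary, and constraints are assumed to be non-empty. The three full tables make the 𝒪(γmn) memory cost explicit.

```python
NEG = float("-inf")


def hamming(x, y):
    return sum(p != q for p, q in zip(x, y))


def cpsa_dp_score(a, b, constraints, eps, sigma, wo, we):
    """Score of an optimal constrained alignment of a and b (affine gaps).

    M[k][i][j]: best score of aligning a[:i] with b[:j] w.r.t. the first k constraints.
    D / I: same, restricted to alignments ending with a deletion / insertion pair.
    """
    m, n, g = len(a), len(b), len(constraints)
    def table():
        return [[[NEG] * (n + 1) for _ in range(m + 1)] for _ in range(g + 1)]
    M, D, I = table(), table(), table()
    M[0][0][0] = 0.0
    for k in range(g + 1):
        c = constraints[k - 1] if k else ""
        lam = len(c)
        for i in range(m + 1):
            for j in range(n + 1):
                best = M[k][i][j]            # 0 at the origin of level 0, else -inf
                if i > 0:                    # end with a deletion pair (a_i, -)
                    D[k][i][j] = max(M[k][i - 1][j] - wo - we, D[k][i - 1][j] - we)
                    best = max(best, D[k][i][j])
                if j > 0:                    # end with an insertion pair (-, b_j)
                    I[k][i][j] = max(M[k][i][j - 1] - wo - we, I[k][i][j - 1] - we)
                    best = max(best, I[k][i][j])
                if i > 0 and j > 0:          # end with a substitution pair (a_i, b_j)
                    best = max(best, M[k][i - 1][j - 1] + sigma(a[i - 1], b[j - 1]))
                if k > 0 and i >= lam and j >= lam:
                    sa, sb = a[i - lam:i], b[j - lam:j]
                    # end with a gap-free band whose induced consensus is C_k
                    if hamming(sa, c) <= lam * eps and hamming(sb, c) <= lam * eps:
                        band = sum(sigma(x, y) for x, y in zip(sa, sb))
                        best = max(best, M[k - 1][i - lam][j - lam] + band)
                M[k][i][j] = best
    return M[g][m][n]


def sigma(x, y):
    """Hypothetical substitution score: +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


print(cpsa_dp_score("GTAA", "AAGT", [], 0.0, sigma, wo=2, we=1))
print(cpsa_dp_score("GTAA", "AAGT", ["GT"], 0.0, sigma, wo=2, we=1))
```

In the second call the constraint forces the two occurrences of GT into the same band, which costs two gaps and lowers the score relative to the unconstrained alignment.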

Hirschberg (1975) developed a linear-space algorithm for solving the longest common subsequence problem based on the divide-and-conquer technique. Since then, this strategy has been extended to yield a number of memory-efficient algorithms for aligning biological sequences (Myers and Miller, 1988; Chao et al., 1994). In this paper, we generalize Hirschberg's algorithm so that it is capable of dealing with the CPSA problem. Compared with the others, our generalization is more complicated because the grid graph 𝒢 dealt with here is 3D instead of 2D, and the input sequences are accompanied by several constraints that need to be considered carefully. The central idea of our memory-efficient algorithm is to determine a middle position (imid, jmid, kmid) on an optimal path from ℳ0(0, 0) to ℳγ(m, n) in 𝒢 so that we can divide the constrained alignment problem into two smaller constrained alignment problems; these smaller problems are then divided further in the same manner, and finally the complete optimal constrained alignment is obtained by merging the series of calculated mid-points (Fig. 2).

Fig. 2.


Schematic diagram of divide-and-conquer approach: two light gray areas are the reduced subproblems after middle position (imid, jmid, kmid) is determined, each of which will be further divided into two subproblems of dark gray areas.
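For readers unfamiliar with the divide step, the following Python sketch shows the classical two-dimensional case on which the generalization is built: an unconstrained global alignment with a linear (not affine) gap penalty, in the style of Hirschberg's method. It is not the CPSA-DC algorithm; the function names and the default scores are assumptions made for the illustration. A forward pass over the first half of `a` and a reverse pass over the second half each keep a single row, and the column that maximizes their sum is the dividing position.

```python
def last_row(a, b, match, mismatch, gap):
    """Scores of aligning all of a with every prefix of b, using O(len(b)) space."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev


def hirschberg(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment of a and b in linear space (linear gap penalty)."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        # Base case: pair the single character with its best partner, if that pays off.
        scores = [match if a == c else mismatch for c in b]
        j = max(range(len(b)), key=scores.__getitem__)
        if scores[j] >= 2 * gap:
            return "-" * j + a + "-" * (len(b) - j - 1), b
        return a + "-" * len(b), "-" + b
    mid = len(a) // 2
    fwd = last_row(a[:mid], b, match, mismatch, gap)
    rev = last_row(a[mid:][::-1], b[::-1], match, mismatch, gap)
    # Dividing position: the column where forward and reverse scores sum to the optimum.
    split = max(range(len(b) + 1), key=lambda j: fwd[j] + rev[len(b) - j])
    left = hirschberg(a[:mid], b[:split], match, mismatch, gap)
    right = hirschberg(a[mid:], b[split:], match, mismatch, gap)
    return left[0] + right[0], left[1] + right[1]


def alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score of a finished alignment under the same linear-gap scheme."""
    return sum(gap if "-" in (p, q) else (match if p == q else mismatch)
               for p, q in zip(x, y))


top, bottom = hirschberg("AGTACGCA", "TATGC")
print(top)
print(bottom)
```

In the constrained setting described in this paper, the search for the dividing position additionally ranges over the constraint index k and, when the cut falls inside a band, over the offset h within the constraint.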

Before describing our algorithm, some notation must be introduced as follows. Let Inline graphic and Inline graphic denote the suffixes ai+1ai+2…am and bj+1bj+2…bn of A and B, respectively, for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Let Inline graphic denote the ordered subset (Ck+1, Ck+2, …, Cγ) for 1 ≤ k ≤ γ. Define Inline graphic to be the score of an optimal constrained alignment of Inline graphic and Inline graphic w.r.t. Inline graphic, and define Inline graphic (Inline graphic and Inline graphic, respectively) to be the maximum score of all constrained alignments of Inline graphic and Inline graphic w.r.t. Inline graphic that begin with a substitution [deletion and insertion, respectively] pair (ai+1, bj+1) [(ai+1, −) and (−, bj+1), respectively]. Let Ωk(h) = [C1, C2, …, Ck−1, pref(Ck, h)] and Inline graphic, where 1 ≤ h ≤ λk. Let Inline graphic denote the score of an optimal semi-constrained alignment Inline graphic of Inline graphic and Inline graphic w.r.t. Inline graphic that begins with a band whose induced consensus is equal to suff(Ck, λk − h). Note that the recurrences for computing matrices Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic can be developed similarly to those for computing ℳk, Inline graphic, Inline graphic, Inline graphic and 𝒩k, respectively. Clearly, Inline graphic. If δ[suff(Ai, λk), Ck] ≤ λk × ε and δ[suff(Bj, λk), Ck] ≤ λk × ε, then we can reformulate the recurrence of 𝒩k as follows: 𝒩k(i, j, 1) = ℳk−1(i − 1, j − 1) + σ(ai, bj) and 𝒩k(i, j, h) = 𝒩k(i − 1, j − 1, h − 1) + σ(ai, bj) for each 1 < h ≤ λk.

Next, we describe our divide-and-conquer algorithm, termed the CPSA-DC algorithm, for computing an optimal constrained alignment between A and B w.r.t. Ω as follows. The key point is to determine the middle position (imid, jmid, kmid) of the optimal path in 𝒢 to divide the problem into two subproblems, each of which is recursively divided into two smaller subproblems in the same way. Given an alignment ℒ, we use score(ℒ) to denote the score of ℒ. Let ℒγ(A, B) be an optimal constrained alignment of A and B w.r.t. Ω; clearly, score[ℒγ(A, B)] = ℳγ(m, n). Let Inline graphic. Then, we partition ℒγ(A, B) into two parts by cutting it at the position immediately after Inline graphic and we let Inline graphic denote the part containing Inline graphic and Inline graphic denote the remaining part, where Inline graphic denotes the last character in Inline graphic from B, and kmid denotes the largest index so that Inline graphic (approximately) appears in Inline graphic. Then there are two possibilities when we consider the last aligned pair of Inline graphic.

Case 1: The last aligned pair of Inline graphic is a substitution pair [i.e. Inline graphic, Inline graphic]. In this case, we have Inline graphic. If (Inline graphic, Inline graphic) is not a constrained column in ℒγ(A, B), then Inline graphic is an optimal constrained alignment of Aimid and Bjmid w.r.t. Inline graphic ending with a substitution pair (Inline graphic, Inline graphic), and Inline graphic is an optimal constrained alignment of Inline graphic and Inline graphic w.r.t. Inline graphic. Hence, Inline graphic. If (Inline graphic, Inline graphic) is a constrained column in Inline graphic, then Inline graphic is an optimal semi-constrained alignment of Inline graphic and Inline graphic w.r.t. Inline graphic ending with a band ℒ′ whose induced consensus is equal to Inline graphic. If Inline graphic, then Inline graphic is an optimal semi-constrained alignment of Inline graphic and Inline graphic w.r.t. Inline graphic beginning with a band Inline graphic whose induced consensus is equal to Inline graphic. Moreover, the induced consensus of the merge of ℒ′ and Inline graphic has to be equal to Inline graphic. In this case, we have Inline graphic. If Inline graphic, then Inline graphic is an optimal constrained alignment of Inline graphic and Inline graphic w.r.t. Inline graphic, and hence Inline graphic.

Case 2: The last aligned pair of Inline graphic is a deletion pair [i.e. (Inline graphic,−)]. If the first aligned pair in Inline graphic is not a deletion pair, then Inline graphic. If the first aligned pair in Inline graphic is a deletion pair, then Inline graphic. We need to compensate it by adding wo because the open penalty of the gap containing Inline graphic and Inline graphic in ℒγ(A, B) is charged twice by Inline graphic and Inline graphic.

In summary, the recurrence of ℳγ(m, n) is derived as follows:

graphic file with name M122.gif

When Inline graphic is added to the right-hand side, the above recurrence is not changed, but can be reformulated as follows:

graphic file with name M124.gif

In other words, jmid, kmid and hmid are the indices j, k and h, where 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h < λk, that maximize the following expression.

graphic file with name M125.gif

Now, we show how to use 𝒪(αn), instead of 𝒪(γmn), memory to determine jmid, kmid and hmid, where α = ∑1 ≤ k ≤ γ λk and α ≤ min{m, n} intrinsically. In fact, a single matrix E of size (γ + 1) × (n + 1), with each entry E(k, j) occupying λk + 4 space, is enough to compute ℳk(imid, j), Inline graphic, Inline graphic and 𝒩k(imid, j, h), for 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h ≤ λk. When reaching the entry (i, j, k) of the 3D grid graph 𝒢, we use entry E(k, j) of E to hold the most recently computed values of ℳk(i, j), Inline graphic, Inline graphic and 𝒩k(i, j, h), which clearly needs a total of λk + 4 space. Note that the old values in entry E(k, j) are moved into an extra entry, termed Vk, whose space is equal to that of E(k, j), before they are overwritten by the newly computed values. Before moving the old values in E(k, j) into Vk, however, we need to first move ℳk(i − 1, j − 1) in Vk into a space named vk,k+1, where 1 ≤ i ≤ m. The mechanism above enables us to compute 𝒩k(i, j, 1), which needs to refer to ℳk−1(i − 1, j − 1) that is kept in vk−1,k; to compute 𝒩k(i, j, h) for each 2 ≤ h ≤ λk, which needs to refer to 𝒩k(i − 1, j − 1, h − 1) that is kept in Vk; to compute Inline graphic, which needs to refer to ℳk(i − 1, j − 1) that is kept in Vk; and finally to compute ℳk(i, j). Figure 3 shows the grid locations of E(k − 1), E(k) and the values in Vk−1 and Vk when we reach the entry (i, j, k) of 𝒢 for the computation, where E(k) denotes the k-th row of E. Hence, the total space needed for computing and storing all ℳk(imid, j), Inline graphic, Inline graphic and 𝒩k(imid, j, h) is the sum of the space of matrix E, the space of all Vk and the space of all vk,k+1, where 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h ≤ λk, which is equal to 𝒪(αn). Similarly, the required matrix, denoted by Ē, for computing all Inline graphic, Inline graphic, Inline graphic and Inline graphic still needs 𝒪(αn) space. Hence, the determination of jmid, kmid and hmid can be performed in 𝒪(αn) space.
The details of CPSA-DC algorithm are described as follows. Note that the program code of BestScoreRev is similar to that of BestScore and hence is omitted here. In the codes, the variable E(ℳk(imid, j)) is used to denote the value of ℳk(imid, j) in E(k, j) and others are analogous. The global variables Inline graphic, ℋB(j, k, h) = δ(suff(Bj, h), pref(Ck, h)), and Inline graphic are computed in Algorithm BestScore so that they can be used directly in Algorithm CPSA-DC.

Fig. 3.


The grid locations of E(k − 1), E(k) and the values in Vk−1 and Vk when the entry (i, j, k) of 𝒢, marked with ‘?’, is reached for the computation.
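The gap between the two memory bounds can be made tangible by counting stored values. The sketch below is a rough illustration only: it counts the four nodes per entry of the full grid 𝒢 against the λk + 4 values per entry of the matrix E, ignores the small auxiliary buffers Vk and vk,k+1 and the reverse matrix Ē, and uses hypothetical constraint lengths (ten constraints of length 3) together with the longest sequence length reported for the Kinase family in Section 4.

```python
def grid_cells(m, n, lambdas):
    """Values held by the full 3D grid: (m+1)(n+1)(gamma+1) entries, four nodes each."""
    gamma = len(lambdas)
    return 4 * (m + 1) * (n + 1) * (gamma + 1)


def row_matrix_cells(n, lambdas):
    """Values held by matrix E: (gamma+1) x (n+1) entries, entry (k, j) storing
    lambda_k + 4 values, with lambda_0 taken as 0 for the constraint-free level."""
    return (n + 1) * sum(lam + 4 for lam in [0] + list(lambdas))


# Hypothetical instance: two sequences of length 353 and ten constraints of length 3.
lambdas = [3] * 10
full = grid_cells(353, 353, lambdas)
rows = row_matrix_cells(353, lambdas)
print(full, rows, round(full / rows))
```

The ratio grows linearly with the sequence length, which is why the saving matters most for long sequences.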

Algorithm CPSA-DC(istart, iend, jstart, jend, kstart, kend)

Input: Sequences Inline graphic and Inline graphic with constraints (Inline graphic)

1: if (istart > iend) or (jstart > jend) then

Align the nonempty sequence with spaces;

else

Inline graphic

BestScore(istart,imid,jstart,jend,kstart,kend);

BestScoreRev(imid + 1,iend,jstart,jend,kstart,kend);

end if

2: max = −∞;

for j = jstart − 1 to jend do

for k = kstart − 1 to kend do

if Inline graphic then

Inline graphic

jmid = j; kmid = k; type = case 1;

end if

if Inline graphic then

Inline graphic;

jmid = j; kmid = k; type = case 2;

end if

if Inline graphic then

Inline graphic;

jmid = j; kmid = k; type = case 3;

end if

if k ≥ 1 then

for h = 1 to λk − 1 do

if Inline graphic and

Inline graphic then

if Inline graphic

max then

Inline graphic;

jmid = j; kmid = k; hmid = h; type = case 4;

end if

end if

end for

if Inline graphic and Inline graphic then

if Inline graphic

max then

Inline graphic;

jmid = j; kmid = k; hmid = h; type = case 5;

end if

end if

end if

end for

end for

3: if type = case 1 then

CPSA-DC(istart,imid − 1,jstart,jmid,kstart,kmid);

Align Inline graphic with a space;

CPSA-DC(imid + 1,iend,jmid + 1,jend,kmid+1,kend);

end if

if type = case 2 then

CPSA-DC(istart,imid − 1,jstart,jmid,kstart,kmid);

Align Inline graphic with two spaces;

CPSA-DC(imid + 2,iend,jmid+1,jend,kmid+1,kend);

end if

if type = case 3 then

CPSA-DC(istart,imid − 1,jstart,jmid − 1,kstart,kmid);

Align Inline graphic with Inline graphic;

CPSA-DC(imid+1,iend,jmid+1,jend,kmid+1,kend);

end if

if type = case 4 then

CPSA-DC(istart, imid − hmid, jstart, jmid − hmid, kstart, kmid − 1);

Align Inline graphic with Inline graphic;

CPSA-DC(imid + λk − hmid + 1, iend, jmid + λk − hmid + 1, jend, kmid + 1, kend);

end if

if type = case 5 then

CPSA-DC(istart,imid − λk,jstart, jmid − λk,kstart, kmid − 1);

Align Inline graphic with Inline graphic;

CPSA-DC(imid+1,iend,jmid+1,jend,kmid+1,kend);

end if

Algorithm BestScore(istart, iend, jstart, jend, kstart, kend)

Input: Sequences Inline graphic and Inline graphic

with constraints (Inline graphic)

1: /* Reindex */

m = iend − istart + 1; n = jend − jstart + 1;

γ = kend − kstart + 1;

2: /* Initialization */

for j = 0 to n do

for k = 0 to γ do

Inline graphic;

if (j = 0) or (k > 0) then Inline graphic;

else Inline graphic;

if (j = 0) and (k = 0) then E(ℳk(imid, j)) = 0;

else E(ℳk(imid, j)) = −∞;

if k ≥ 1 then

for h = 1 to λk do E(𝒩k(imid, j, h)) = −∞;

end if

end for

end for

3: /* Computation */

for i = 1 to m do

for k = 0 to γ do /* For the case of j = 0 */

Vk(ℳk(imid, 0)) = E(ℳk(imid, 0));

if k ≥ 1 then

for h = 1 to λk do Vk(𝒩k(imid, 0, h)) = E(𝒩k(imid, 0, h));

end if

Inline graphic;

Inline graphic;

end for

for j = 1 to n do /* For the case of j > 0 */

for k = 0 to γ do

tempk(ℳk(imid,j)) = E(ℳk(imid,j));

if k ≥ 1 then

for h = 1 to λk do tempk(𝒩k(imid, j, h)) = E(𝒩k(imid, j, h));

end if

Inline graphic;

Inline graphic;

Inline graphic;

if k ≥ 1 then

for h = 1 to λk do

if h = 1 then

Inline graphic;

else

Inline graphic;

end if

end for

end if

Inline graphic;

vk,k+1 = Vk(ℳk(imid, j));

Vk(ℳk(imid, j)) = tempk(ℳk(imid, j));

if k ≥ 1 then

for h = 1 to λk do

Vk(𝒩k(imid, j, h)) = tempk(𝒩k(imid, j, h));

end for

end if

if i = m and k ≥ 1 then

for h = 1 to λk do

if h = 1 then

if j = 1 and Inline graphic then

ℋA(k, h) = 1; else ℋA(k, h) = 0;

if Inline graphic then

ℋB(j, k, h) = 1; else ℋB(j, k, h) = 0;

else

if j = 1 and Inline graphic then

ℋA(k, h) = ℋA(k, h − 1) + 1;

if Inline graphic then ℋB(j, k, h) = ℋB(j, k, h − 1) + 1;

end if

end for

end if

end for

end for

end for

Now, we analyze the time complexity of our CPSA-DC algorithm for solving the CPSA problem. As shown in Figure 2, after determining the middle position (imid, jmid, kmid) of the optimal path in 𝒢, we can divide the original problem into two subproblems, each of which can be recursively divided into two smaller subproblems in the same way. Note that regardless of where the optimal path passes through (imid, jmid, kmid), the total size of the two reduced subproblems is just half the size of the original problem, where the size is measured by the number of entries in 𝒢. It is not hard to see that the time needed to determine the middle position of each subproblem at each recursive stage is proportional to the size of the subproblem. Let Ψ denote the size of the original problem (i.e. Ψ = γmn). Then the total time complexity of our CPSA-DC algorithm is proportional to Ψ + Ψ/2 + Ψ/4 + ··· ≤ 2Ψ, which is twice that of the CPSA-DP algorithm and hence still 𝒪(γmn). Using the CPSA-DC algorithm as a kernel, we designed a memory-efficient algorithm, termed CMSA-DC, for progressively aligning multiple input sequences into a CMSA according to the branching order of a guide tree. The progressive method we adopted was proposed by Tang et al. (2003). Owing to space limitations, we refer the reader to their paper for the details of its implementation.

4 EXPERIMENTAL RESULTS

We used Java to implement the CMSA-DC algorithm as a web server called MuSiC-ME (Memory-Efficient tool for Multiple Sequence Alignment with Constraints). The input of the MuSiC-ME system consists of a set of protein/DNA/RNA sequences and a set of user-specified constraints, each of which is a fragment of residues/nucleotides that (approximately) appears in all input sequences. The output of MuSiC-ME is a CMSA in which the fragments of the input sequences whose residues/nucleotides exhibit a given degree of similarity to a constraint are aligned together. For its biological applications, we refer the reader to other related papers (Tang et al., 2003; Tsai et al., 2004).

In the following, we evaluate our memory-efficient MuSiC-ME system and compare its running time and memory usage with those of the original MuSiC system (Tsai et al., 2004), whose kernel CPSA algorithm was implemented by the dynamic programming approach. We chose five families of protein/RNA sequences as our testing datasets, each of which has been shown to contain an ordered series of conserved motifs related to the structures/functionalities/consensuses of the family (McClure et al., 1994; Chin et al., 2003; Tang et al., 2003; Tsai et al., 2004): (1) the aspartic acid protease family (Protease), (2) the hemoglobin family (Globin), (3) the ribonuclease family (RNase), (4) the kinase family (Kinase) and (5) the 3′-untranslated region of the coronaviruses (CoV-3′-UTR). From each family, we selected a representative set of sequences and adopted the ordered series of conserved motifs as the constraints. Table 1 lists the information of the tested families and their constraints. All tests were run with default parameters on an IBM PC with a 1.26 GHz processor and 512 MB of RAM under Linux. Table 2 lists the CPU time and memory usage of our experiments using MuSiC and MuSiC-ME. It shows that the memory usage of MuSiC-ME is much smaller than that of MuSiC for large-scale sequences, and that the CPU time required by MuSiC-ME is smaller than that required by MuSiC for short sequences, since we have simplified the recurrences of the dynamic programming here.

Table 1.

The information of the tested families and their constraints

Family        #SEQ   MAXSEQ   #CON   MAXCON
Protease      6      123      4      1
Globin        6      146      7      2
RNase         6      185      3      1
Kinase        6      353      10     3
CoV-3′-UTR    6      422      12     2

#SEQ is the number of sequences, MAXSEQ is the maximum length of sequences, #CON is the number of constraints and MAXCON is the maximum length of constraints.

Table 2.

The comparison of CPU time and memory usage between MuSiC and MuSiC-ME

Family        MuSiC                         MuSiC-ME
              CPU time (s)   Memory (MB)    CPU time (s)   Memory (MB)
Protease      6              25.4           6              15.5
Globin        23             42.0           18             15.5
RNase         11             32.0           8              15.5
Kinase        131            160.8          96             15.9
CoV-3′-UTR    –              –              165            17.4

The memory usage includes JVM (Java Virtual Machine), code (MuSiC/MuSiC-ME) and data, and MuSiC cannot deal with the case of CoV-3′-UTR due to running out of memory.

It is worth mentioning that in the MuSiC-ME system, the letters representing the constraints are not just the individual residues/nucleotides, but also the IUPAC (International Union of Pure and Applied Chemistry) codes. For example, the codes N and R stand for any nucleotide and for a purine (i.e. A or G), respectively. This enhancement enables the user to define more flexible constraints or to combine several small constraints separated by fixed distances into a large one. For example, consider our fifth experiment above related to the 3′-UTRs of the coronavirus sequences, including HCV-229E (human coronavirus), PEDV (porcine epidemic diarrhea virus), TGEV (porcine transmissible gastroenteritis virus), BCV (bovine coronavirus), MHV (mouse hepatitis virus) and SARS-TW1 (severe acute respiratory syndrome virus). All 12 adopted constraints appear in the fragments that can fold into a stable pseudoknot structure (Williams et al., 1999; Tsai et al., 2004). However, these constraints are too short to correctly align the truly conserved motifs of the sequences together, since short constraints occur frequently in large genomic sequences, which makes it difficult to identify their true occurrences. In fact, four pairs of consecutive constraints appear in the stem regions (containing no loops) of the pseudoknots, and the two constraints of each pair are separated by a non-conserved subsequence of fixed length. Hence, we can combine each pair of constraints into a new and larger constraint by representing the non-conserved part with N. Consequently, we obtained eight new constraints in the order (CUNNNNC, A, AA, G, C, UNNNA, GNNNNAG, UNNNA) for this dataset. After running MuSiC-ME, a satisfactory CMSA was found (Figure 4), where the band of the resulting CMSA corresponding to a constraint is shown in black and its corresponding constraint is displayed beneath it.
This resulting CMSA implies that the fragment of SARS-TW1 between the first band and the last band may fold into a pseudoknot structure that is possibly involved in replicating SARS viruses (Pleij, 1994; Deiman and Pleij, 1997). In fact, this fragment is the pseudoknot sequence of SART-TW1 that was found by (Tsai et al. 2004) using MuSiC to align the 3′-UTR of SARS-TW1 with the pseudoknot sequences, instead of 3′-UTRs, of other coronaviruses. The input sequences of the above experiment were also tested by Clustal W 1.82, the most commonly used MSA tool. According to its resulting MSA as shown in Figure 5, the fragments of all pseudoknots, including our detected pseudoknot for SARS-TW1, were not able to align well so that it is difficult for us to identify the exact fragment of the SARS-TW1 pseudoknot from this MSA.
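The merging of two short constraints into one IUPAC-coded constraint can be illustrated with a minimal Python sketch. It is not the MuSiC-ME implementation (which runs on the JVM). The helper names `combine`, `matches_at` and `occurrences`, the toy sequence, and the split of CUNNNNC into the parts CU and C are assumptions made only for illustration. The sketch shows why a merged constraint is more selective: the short pattern matches at many positions, whereas the merged pattern with N placeholders matches at far fewer.

```python
# IUPAC nucleotide codes mapped to the concrete bases each one allows.
# The RNA alphabet (U instead of T) is used because the example is a 3'-UTR.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "AG", "Y": "CU", "S": "CG", "W": "AU",
    "K": "GU", "M": "AC",
    "B": "CGU", "D": "AGU", "H": "ACU", "V": "ACG",
    "N": "ACGU",
}


def combine(left, right, gap):
    """Merge two constraints separated by `gap` non-conserved positions."""
    return left + "N" * gap + right


def matches_at(seq, pos, constraint):
    """Return True if `constraint` matches `seq` starting at index `pos`."""
    if pos + len(constraint) > len(seq):
        return False
    return all(seq[pos + k] in IUPAC[code] for k, code in enumerate(constraint))


def occurrences(seq, constraint):
    """Return all 0-based start positions where `constraint` matches `seq`."""
    last_start = len(seq) - len(constraint)
    return [i for i in range(last_start + 1) if matches_at(seq, i, constraint)]


if __name__ == "__main__":
    toy = "ACUGACUUAGCCUAGCU"          # hypothetical sequence, not real data
    merged = combine("CU", "C", 4)     # hypothetical split of CUNNNNC
    print(merged)                      # CUNNNNC
    print(occurrences(toy, "CU"))      # [1, 5, 11, 15]
    print(occurrences(toy, merged))    # [5]
```

In this toy case the two-letter constraint has four candidate positions, while the merged seven-letter constraint has one, which mirrors the reason given above for merging paired stem constraints.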

Fig. 4.

Partial display of the CMSA produced by MuSiC-ME when aligning the 3′-UTR sequence of SARS-TW1 with those of the other five coronaviruses.

Fig. 5.

Partial display of the MSA produced by Clustal W 1.82 when aligning the 3′-UTR sequences of the six coronaviruses, where the bases not in the pseudoknots are marked with dots.

5 CONCLUSIONS

In this paper, we designed a memory-efficient program for performing CMSA, which can incorporate the knowledge of biologists about the structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together. We first used the divide-and-conquer approach to design a memory-efficient algorithm for optimally aligning two sequences with constraints. Based on this algorithm, we then used the progressive method to develop a memory-efficient tool, called MuSiC-ME, for heuristically aligning multiple sequences with constraints. The proposed MuSiC-ME system makes it possible to align several large-scale protein/DNA/RNA sequences with constraints on a desktop PC with limited memory. Moreover, the letters allowed to represent the constraints in this system are the IUPAC codes, which enable the user to define more flexible constraints or to combine several small constraints separated by fixed distances into a larger one. It is worth mentioning that the A* algorithm, a heuristic search method from Artificial Intelligence, has been used extensively to solve the general MSA problem without constraints in a time- and/or memory-efficient manner (Ikeda and Imai, 1994, 1999; Kobayashi and Imai, 1999; Lermen and Reinert, 2000). Hence, it would be interesting to study whether the A* algorithm can also be applied to the CMSA problem.

Acknowledgments

The authors would like to thank the anonymous referees for their constructive comments on the presentation of this paper. This work was supported in part by the National Science Council of the Republic of China under grant NSC93-2213-E-009-113.

REFERENCES

  1. Bafna, V., Lawler, E.L., Pevzner, P.A. 1997. Approximation algorithms for multiple sequence alignment. Theoret. Comput. Sci. 182, 233–244
  2. Bonizzoni, P. and Vedova, G.D. 2001. The complexity of multiple sequence alignment with SP-score that is a metric. Theoret. Comput. Sci. 259, 63–79
  3. Carrillo, H. and Lipman, D. 1988. The multiple sequence alignment problem in biology. SIAM J. Appl. Math. 48, 1073–1082
  4. Chan, S.C., Wong, A.K.C., Chiu, D.K.Y. 1992. A survey of multiple sequence comparison methods. Bull. Math. Biol. 54, 563–598
  5. Chao, K.M., Hardison, R.C., Miller, W. 1994. Recent developments in linear-space alignment methods: a survey. J. Comput. Biol. 1, 271–291
  6. Chin, F.Y.L., Ho, N.L., Lam, T.W., Wong, P.W.H., Chan, M.Y. 2003. Efficient constrained multiple sequence alignment with performance guarantee. Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB 2003), Los Alamitos, CA, IEEE, pp. 337–346
  7. Corpet, F. 1988. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 16, 10881–10890
  8. Deiman, B. and Pleij, C.W.A. 1997. Pseudoknots: a vital feature in viral RNA. Semin. Virol. 8, 166–175
  9. Depiereux, E. and Feytmans, E. 1992. MATCH-BOX: a fundamentally new algorithm for the simultaneous alignment of several protein sequences. Comput. Appl. Biosci. 8, 501–509
  10. Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 351–360
  11. Gusfield, D. 1993. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bull. Math. Biol. 55, 141–154
  12. Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, NY
  13. Higgins, D. and Sharpe, P. 1988. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237–244
  14. Hirschberg, D.S. 1975. A linear space algorithm for computing maximal common subsequences. Commun. ACM 18, 341–343
  15. Ikeda, T. and Imai, H. 1994. Fast A* algorithms for multiple sequence alignment. Proceedings of the Genome Informatics Workshop, Tokyo, Universal Academy Press, pp. 90–99
  16. Ikeda, T. and Imai, H. 1999. Enhanced A* algorithms for multiple alignments: optimal alignments for several sequences and k-opt approximate alignments for large cases. Theoret. Comput. Sci. 210, 341–374
  17. Kececioglu, J.D. 1993. The maximum weight trace problem in multiple sequence alignment. Proceedings of the Fourth Annual Symposium on Combinatorial Pattern Matching (CPM 2004), LNCS 684, Springer-Verlag, Heidelberg, Germany, pp. 106–119
  18. Kobayashi, H. and Imai, H. 1999. Improvement of the A* algorithm for multiple sequence alignment. Proceedings of the Genome Informatics Workshop, Tokyo, Universal Academy Press, pp. 120–130
  19. Lermen, M. and Reinert, K. 2000. The practical use of the A* algorithm for exact multiple sequence alignment. J. Comput. Biol. 7, 655–672
  20. Li, M., Ma, B., Wang, L. 2000. Near optimal multiple alignment within a band in polynomial time. Proceedings of the Thirty Second Annual ACM Symposium on Theory of Computing (STOC 2000), Portland, OR, ACM Press, pp. 425–434
  21. McClure, M.A., Vasi, T.K., Fitch, W.M. 1994. Comparative analysis of multiple protein-sequence alignment methods. Mol. Biol. Evol. 11, 571–592
  22. Morgenstern, B. 1999. DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211–218
  23. Myers, E.W. and Miller, W. 1988. Optimal alignment in linear space. Comput. Appl. Biosci. 4, 11–17
  24. Myers, G., Selznick, S., Zhang, Z., Miller, W. 1996. Progressive multiple alignment with constraints. J. Comput. Biol. 3, 563–572
  25. Nicholas, H.B., Ropelewski, A.J., Deerfield, D.W. 2002. Strategies for multiple sequence alignment. Biotechniques 32, 592–603
  26. Notredame, C. 2002. Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics 3, 131–144
  27. Notredame, C., Higgins, D.G., Heringa, J. 2000. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217
  28. Pevzner, P.A. 1992. Multiple alignment, communication cost, and graph matching. SIAM J. Appl. Math. 52, 1763–1779
  29. Pleij, C.W.A. 1994. RNA pseudoknots. Curr. Opin. Struct. Biol. 4, 337–344
  30. Sammeth, M., Morgenstern, B., Stoye, J. 2003. Divide-and-conquer multiple alignment with segment-based constraints. Bioinformatics 19, ii189–ii195
  31. Schuler, G.D., Altschul, S.F., Lipman, D.J. 1991. A workbench for multiple alignment construction and analysis. Proteins 9, 180–190
  32. Stoye, J. 1998. Multiple sequence alignment with the divide-and-conquer method. Gene 211, GC45–GC56
  33. Stoye, J., Moulton, V., Dress, A.W.M. 1997. DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment. Comput. Appl. Biosci. 13, 625–626
  34. Stoye, J., Perrey, S.W., Dress, A.W.M. 1997. Improving the divide-and-conquer approach to sum-of-pairs multiple sequence alignment. Appl. Math. Lett. 10, 67–73
  35. Tang, C.Y., Lu, C.L., Chang, M.D.T., Tsai, Y.T., Sun, Y.J., Chao, K.M., Chang, J.M., Chiou, Y.H., Wu, C.M., Chang, H.T., Chou, W.I. 2003. Constrained multiple sequence alignment tool development and its application to RNase family alignment. J. Bioinform. Comput. Biol. 1, 267–287
  36. Taylor, W.R. 1987. Multiple sequence alignment by a pairwise algorithm. Comput. Appl. Biosci. 3, 81–87
  37. Taylor, W.R. 1994. Motif-biased protein sequence alignment. J. Comput. Biol. 1, 297–310
  38. Thompson, J.D., Higgins, D.G., Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties, and weight matrix choice. Nucleic Acids Res. 22, 4673–4680
  39. Thompson, J.D., Plewniak, F., Thierry, J.-C., Poch, O. 2000. DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. 28, 2919–2926
  40. Tönges, U., Perrey, S.W., Stoye, J., Dress, A.W.M. 1996. A general method for fast multiple sequence alignment. Gene 172, GC33–GC41
  41. Tsai, Y.T., Huang, Y.P., Yu, C.T., Lu, C.L. 2004. MuSiC: a tool for multiple sequence alignment with constraints. Bioinformatics (in press)
  42. Wang, L. and Jiang, T. 1994. On the complexity of multiple sequence alignment. J. Comput. Biol. 1, 337–348
  43. Williams, G.D., Chang, R.-Y., Brian, D.A. 1999. A phylogenetically conserved hairpin-type 3′ untranslated region pseudoknot functions in coronavirus RNA replication. J. Virol. 73, 8349–8355
  44. Yu, C.T. 2003. Efficient algorithms for constrained sequence alignment problems. Master's Thesis, Department of Computer Science and Information Management, Providence University

Articles from Bioinformatics are provided here courtesy of Oxford University Press
