A memory-efficient algorithm for multiple sequence alignment with constraints

Chin Lung Lu; Yen Pin Huang

doi:10.1093/bioinformatics/bth468

2004 Aug 12;21(1):20–30. doi: 10.1093/bioinformatics/bth468


PMCID: PMC7109922 PMID: 15374876

Abstract

Motivation: Recently, the concept of constrained sequence alignment was proposed to incorporate biologists' knowledge about the structures/functionalities/consensuses of their datasets into sequence alignment, such that the user-specified residues/nucleotides are aligned together in the computed alignment. The currently developed programs use the so-called progressive approach to efficiently obtain a constrained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms for computing an optimal constrained alignment between two sequences, require 𝒪(γn²) memory, where γ is the number of constraints and n is the maximum of the lengths of the sequences. As a result, this high memory requirement restricts these programs to aligning short sequences only.

Results: We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences, which greatly reduces the memory requirement of the dynamic programming approaches at the expense of a small constant factor in CPU time. This new algorithm consumes only 𝒪(αn) space, where α is the sum of the lengths of constraints and usually α ≪ n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints.

Availability: http://genome.life.nctu.edu.tw/MUSICME

Contact: cllu@mail.nctu.edu.tw

1 INTRODUCTION

Multiple sequence alignment (MSA) is one of the fundamental problems in computational molecular biology that have been studied extensively, because it is a useful tool in the phylogenetic analyses among various organisms, the identification of conserved motifs and domains in a group of related proteins, the secondary and tertiary structure prediction of a protein (or RNA), and so on (Carrillo and Lipman, 1988; Chan et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame, 2002). Moreover, MSA is one of the most challenging problems in computational molecular biology because it has been shown to be NP-complete under the consideration of sum-of-pairs scoring criteria (Kececioglu, 1993; Wang and Jiang, 1994; Bonizzoni and Vedova, 2001), which means that it seems to be hard to design an efficient algorithm for finding the mathematically optimal alignment. Hence, some approximate methods (Gusfield, 1993; Pevzner, 1992; Bafna et al., 1997; Li et al., 2000) and heuristic methods (Feng and Doolittle, 1987; Taylor, 1987; Corpet, 1988; Higgins and Sharpe, 1988; Thompson et al., 1994) were introduced to overcome this problem.

Recently, the concept of constrained sequence alignment was proposed to incorporate the knowledge of biologists regarding the structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together in the computed alignment (Tang et al., 2003). Tang et al. (2003) first designed a dynamic programming algorithm for finding an optimal constrained alignment of two sequences and then used it as a kernel to develop a constrained multiple sequence alignment (CMSA) tool based on the progressive approach, where each constraint considered by Tang et al. is a single residue/nucleotide only. Their proposed algorithm for two sequences runs in 𝒪(γn⁴) time and consumes 𝒪(n⁴) space, where γ is the number of constrained residues and n is the maximum length of the sequences. Later, this result was improved independently by two groups of researchers to 𝒪(γn²) time and 𝒪(γn²) space using the same dynamic programming approach (Yu, 2003; Chin et al., 2003). In fact, each constraint requested to be aligned together can represent a conserved site of a protein/DNA/RNA family, and each conserved site may consist of a short segment of residues/nucleotides instead of a single residue/nucleotide. In other words, a constraint specified by biologists can be a fragment of several residues/nucleotides. For some applications, biologists may further expect that some mismatches are allowed among the residues/nucleotides of the columns requested to be aligned. Hence, Tsai et al. (2004) studied this kind of constrained sequence alignment and designed an algorithm of 𝒪(γn²) time and 𝒪(γn²) space for two sequences. The improvements and extension above greatly increase the performance and practical usability of the CMSA tools developed using the progressive approach.
However, the requirement of 𝒪(γn²) memory still limits the existing CMSA tools to align a set of short sequences, at most several hundreds of residues/nucleotides. To align large genomic sequences of at least several thousands of residues/nucleotides, there is a need to design a memory-efficient algorithm for the constrained pairwise sequence alignment (CPSA) problem, which is the key limiting factor relating to the applicable extent of the progressive CMSA tools. Hence, in this paper, we adopt the so-called divide-and-conquer approach to design a memory-efficient algorithm for solving the CPSA problem, which runs in 𝒪(γn²) time, but consumes only 𝒪(αn) space, where α is the sum of the lengths of constraints and usually α ≪ n in practical applications. Based on this algorithm, we have finally developed a memory-efficient CMSA tool using the progressive approach. Note that applying the divide-and-conquer approach to memory-efficiently align two or more sequences without any constraints has been studied extensively (Myers and Miller, 1988; Chao et al., 1994; Tönges et al., 1996; Stoye et al., 1997a; Stoye et al., 1997b; Stoye, 1998). In contrast to the progressive approach used here, the divide-and-conquer algorithms proposed by Stoye et al. (Tönges et al., 1996; Stoye et al., 1997a; Stoye et al., 1997b; Stoye, 1998) considered the input sequences simultaneously and heuristically compute the good, but not necessarily optimal, dividing positions so that the resulting total MSA is close to an optimal MSA of the original sequences. In fact, many other CMSAs have been proposed from various perspectives, even using different approaches (Schuler et al., 1991; Depiereux and Feytmans, 1992; Taylor, 1994; Myers et al., 1996; Notredame et al., 2000; Thompson et al., 2000; Sammeth et al., 2003). 
Of these various CMSAs, it is worth mentioning that Myers et al. (1996) obtained their CMSA by performing progressive multiple alignment under position-based constraints given by users, whereas Sammeth et al. (2003) obtained their CMSA by performing simultaneous multiple alignment under segment-based constraints (the same kind as studied here) that are pre-computed via a local segment-based algorithm (Morgenstern, 1999). We refer the reader to their papers for details.

2 PROBLEM FORMULATION

Let 𝒪 = {S₁, S₂, …, S_χ} be the set of χ sequences over the alphabet Σ. Then an MSA of 𝒪 is a rectangular matrix consisting of χ rows of characters of Σ ∪ {−} such that no column consists entirely of dashes and removing dashes from row i leaves S_i for any 1 ≤ i ≤ χ. The sum-of-pairs score (SP score) of an MSA is defined to be the sum of the scores of all columns, where the score of each column is the sum of the scores of all distinct pairs of characters in the column. In practice, the score of the pair of two dashes is usually set to zero. Then the problem of finding an MSA of 𝒪 with the optimal SP score is the so-called sum-of-pairs MSA problem (Carrillo and Lipman, 1988; Chan et al., 1992; Gusfield, 1997; Nicholas et al., 2002; Notredame, 2002).
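The SP score defined above can be illustrated with a short Python sketch. It follows the column-wise definition only (no affine gap handling), and the scoring function `toy_score` is a hypothetical choice made for this example rather than a scheme taken from the paper.

```python
from itertools import combinations


def sp_score(msa, pair_score, gap="-"):
    """Sum-of-pairs score of an alignment given as a list of equal-length rows."""
    if len({len(row) for row in msa}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*msa):
        # Sum the scores of all distinct pairs of characters in the column.
        for x, y in combinations(column, 2):
            if x == gap and y == gap:
                continue  # the pair of two dashes is scored as zero
            total += pair_score(x, y)
    return total


def toy_score(x, y):
    # Hypothetical scoring: +1 match, -1 mismatch, -2 residue against a dash.
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


example = ["AC-", "ACG", "A-G"]
print(sp_score(example, toy_score))
```

For three rows, every column contributes three pair scores, so the cost of evaluating an alignment grows quadratically in the number of sequences χ.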

Let δ(T₁, T₂) denote the Hamming distance between two subsequences T₁ and T₂ of equal length, which is equal to the number of mismatched pairs in the alignment of T₁ and T₂ without any gap. Given an alignment ℒ of 𝒪, a band is defined as a block of consecutive columns in ℒ (i.e. a submatrix of ℒ). For any band ℒ′ of ℒ, let subseq(S_i, ℒ′) denote the subsequence of S_i whose residues/nucleotides are all in the band ℒ′, where 1 ≤ i ≤ χ. A subsequence T = t₁t₂ … t_λ is said to appear in ℒ if ℒ contains a band ℒ′ of λ columns, say π₁, π₂, …, π_λ, such that the characters of column π_j, 1 ≤ j ≤ λ, are all equal to t_j, or equivalently, subseq(S_i, ℒ′) = T for each 1 ≤ i ≤ χ. If δ[subseq(S_i, ℒ′), T] ≤ λ × ε for a given error ratio 0 ≤ ε < 1 [i.e. some mismatches are allowed between subseq(S_i, ℒ′) and T], then T is said to approximately appear in ℒ. From the biological viewpoint, T can be considered as the consensus among the subsequences in ℒ′, and hence T is also called the consensus induced by the band ℒ′. For any two subsequences T₁ and T₂, T₁ ≺ T₂ is used to denote that T₁ (approximately) appears strictly before T₂ in ℒ (i.e. their corresponding bands do not overlap). Let Ω = (C₁, C₂, …, C_γ) be an ordered set of γ constraints (i.e. subsequences), each C_i of length λ_i, where 1 ≤ i ≤ γ. Then the CMSA of 𝒪 with respect to Ω is defined as an alignment ℒ of 𝒪 in which all the constraints of Ω approximately appear in the order C₁ ≺ C₂ ≺ ··· ≺ C_γ such that δ(subseq(S_i, ℒ′_j), C_j) ≤ λ_j × ε for all 1 ≤ i ≤ χ and 1 ≤ j ≤ γ, where ℒ′_j is the band of ℒ whose induced consensus is C_j. Given a set 𝒪 of χ sequences along with an ordered set Ω of γ constraints and an error ratio ε, the so-called CMSA problem is to find a CMSA w.r.t. Ω with the optimal SP score. When the number of sequences in 𝒪 is restricted to two (i.e. χ = 2), the CMSA problem is called the CPSA problem.
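As a small illustration of the definitions above, the following Python sketch checks whether a constraint approximately appears in a band. The function names are hypothetical, and the band is assumed to be passed as the list of gap-free subsequences subseq(S_i, ℒ′), one per input sequence.

```python
def hamming(t1, t2):
    """Number of mismatched positions between two equal-length strings."""
    if len(t1) != len(t2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for x, y in zip(t1, t2) if x != y)


def approximately_appears(band_subseqs, constraint, eps):
    """True if every subsequence in the band is within len(constraint) * eps
    mismatches of the constraint (eps is the error ratio, 0 <= eps < 1)."""
    lam = len(constraint)
    return all(
        len(s) == lam and hamming(s, constraint) <= lam * eps
        for s in band_subseqs
    )


# With eps = 0.25 and a constraint of length 4, one mismatch is tolerated.
print(approximately_appears(["ACGT", "ACGA"], "ACGT", 0.25))
print(approximately_appears(["ACGT", "AGGA"], "ACGT", 0.25))
```

Setting `eps` to 0 reduces the test to exact appearance of the constraint in every row of the band.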

3 ALGORITHM

In this section, we shall first design a memory-efficient algorithm for solving the CPSA problem with two given sequences A = a₁a₂ … a_m and B = b₁b₂ … b_n, a given ordered set Ω = (C₁, C₂, …, C_γ) of γ constraints, each C_i of length λ_i, 1 ≤ i ≤ γ, and a given error ratio ε. After that, we shall use it as the kernel to heuristically solve the CMSA problem.

For any sequence T, let pref(T, l) [respectively, suff(T, l)] denote the prefix (respectively, suffix) of T with length l. For any two characters a, b ∈ Σ, let σ(a, b) denote the score of aligning a with b. The gap penalty adopted here is the so-called affine gap penalty, which penalizes a gap of length l with w_o + l × w_e, where w_o > 0 is the gap-open penalty and w_e > 0 is the gap-extension penalty. For convenience, let A_i = pref(A, i) = a₁a₂ … a_i, B_j = pref(B, j) = b₁b₂ … b_j and Ω_k = (C₁, C₂, …, C_k), where 1 ≤ i ≤ m, 1 ≤ j ≤ n and 1 ≤ k ≤ γ. Let ℳ_k(i, j) denote the score of an optimal constrained alignment of A_i and B_j w.r.t. Ω_k. Clearly, ℳ_γ(m, n) is the score of an optimal constrained alignment of A and B w.r.t. Ω. An alignment ℒ is called a semi-constrained alignment of A_i and B_j w.r.t. Ω_k if it is a constrained alignment of A_i and B_j w.r.t. Ω_{k−1} and also ends (or begins) with a band whose induced consensus is equal to a prefix of C_k (or a suffix of C₁). 𝒩_k(i, j, h) is defined to be the score of an optimal semi-constrained alignment of A_i and B_j w.r.t. Ω_k that ends with an induced consensus equal to pref(C_k, h). Let 𝒟_k(i, j) [respectively, ℐ_k(i, j)] be the maximum score of all constrained alignments of A_i and B_j w.r.t. Ω_k that end with a deletion pair (a_i, −) [respectively, an insertion pair (−, b_j)]. By definition, it is not hard to derive the recurrence of ℳ_k(i, j), 1 ≤ i ≤ m and 1 ≤ j ≤ n, as follows. If k = 0, then ℳ_k(i, j) = max{ℳ_k(i − 1, j − 1) + σ(a_i, b_j), 𝒟_k(i, j), ℐ_k(i, j)}. If 1 ≤ k ≤ γ, then ℳ_k(i, j) = max{ℳ_k(i − 1, j − 1) + σ(a_i, b_j), 𝒟_k(i, j), ℐ_k(i, j), 𝒩_k(i, j, λ_k)}. Clearly, 𝒩_k(i, j, λ_k) = ℳ_{k−1}(i − λ_k, j − λ_k) + ∑_{0 ≤ h < λ_k} σ(a_{i−h}, b_{j−h}) if δ(suff(A_i, λ_k), C_k) ≤ λ_k × ε and δ(suff(B_j, λ_k), C_k) ≤ λ_k × ε; otherwise, 𝒩_k(i, j, λ_k) = −∞. To simplify the description of the computation of 𝒟_k(i, j) and ℐ_k(i, j), we introduce another notation 𝒮_k(i, j), which is defined to be the maximum score of all constrained alignments of A_i and B_j w.r.t. Ω_k that end with a substitution pair (a_i, b_j). Let ℒ_𝒟 denote the alignment of A_i and B_j with score 𝒟_k(i, j) that ends with a deletion pair (a_i, −). Let ℒ′ be the portion of ℒ_𝒟 before the last aligned pair (a_i, −). Then there are three possibilities when we consider the last aligned pair of ℒ′.

Case 1: The last aligned pair of ℒ′ is a substitution pair. Then the score of ℒ′ is 𝒮_k(i − 1, j), and (a_i, −) is charged a gap-open penalty and a gap-extension penalty in ℒ_𝒟. Hence, 𝒟_k(i, j) = 𝒮_k(i − 1, j) − w_o − w_e.

Case 2: The last aligned pair of ℒ′ is a deletion pair. Then the score of ℒ′ is 𝒟_k(i − 1, j), and (a_i, −) is charged only one gap-extension penalty in ℒ_𝒟. Hence, 𝒟_k(i, j) = 𝒟_k(i − 1, j) − w_e.

Case 3: The last aligned pair of ℒ′ is an insertion pair. Then the score of ℒ′ is ℐ_k(i − 1, j), and (a_i, −) is charged a gap-open penalty and a gap-extension penalty in ℒ_𝒟. Hence, 𝒟_k(i, j) = ℐ_k(i − 1, j) − w_o − w_e.

In summary, 𝒟_k(i, j) = max{𝒮_k(i − 1, j) − w_o − w_e, 𝒟_k(i − 1, j) − w_e, ℐ_k(i − 1, j) − w_o − w_e}. However, by including an extra term 𝒟_k(i − 1, j) − w_o − w_e, which never exceeds 𝒟_k(i − 1, j) − w_e, in the right-hand side of the above recurrence, we can reformulate it as 𝒟_k(i, j) = max{ℳ_k(i − 1, j) − w_o − w_e, 𝒟_k(i − 1, j) − w_e}. Similarly, the recurrence of ℐ_k(i, j) can be derived as ℐ_k(i, j) = max{ℳ_k(i, j − 1) − w_o − w_e, ℐ_k(i, j − 1) − w_e}.

According to the recurrences above, we designed an algorithm to compute ℳ_γ(m, n) and its corresponding constrained alignment using dynamic programming as follows. For convenience, we depict the recurrences of the matrices ℳ_k, 𝒟_k, ℐ_k and 𝒩_k for all 0 ≤ k ≤ γ, where 𝒟_k and ℐ_k hold the scores of the alignments ending with a deletion pair and an insertion pair, respectively, by a three-dimensional (3D) grid graph 𝒢. This graph consists of (m + 1) × (n + 1) × (γ + 1) entries, and each entry (i, j, k) consists of four nodes ℳ_k, 𝒟_k, ℐ_k and 𝒩_k corresponding to ℳ_k(i, j), 𝒟_k(i, j), ℐ_k(i, j) and 𝒩_k(i, j, λ_k), respectively. Figure 1 shows the relationship of four adjacent entries (i, j, k), (i − 1, j, k), (i, j − 1, k) and (i − 1, j − 1, k) of 𝒢 for each fixed k.

Fig. 1 — The schematic diagram of four adjacent entries of 𝒢, where entry (i, j, k) consists of four nodes ℳ_k, 𝒟_k, ℐ_k and 𝒩_k corresponding to ℳ_k(i, j), 𝒟_k(i, j), ℐ_k(i, j) and 𝒩_k(i, j, λ_k), respectively.

Note that there is a directed edge, which is not shown in Figure 1, with weight ∑_{0 ≤ h < λ_k} σ(a_{i−h}, b_{j−h}) from the ℳ_{k−1} node of the entry (i − λ_k, j − λ_k, k − 1) to the 𝒩_k node of the entry (i, j, k). Then each path from the ℳ₀ node of entry (0, 0, 0) to the ℳ_γ node of entry (m, n, γ) corresponds to a constrained alignment of A and B w.r.t. Ω. As a result, an optimal constrained alignment of A and B can be obtained by backtracking an optimal (maximum-score) path from ℳ_γ(m, n) to ℳ₀(0, 0) in 𝒢. It is not hard to see that the algorithm requires both time and memory in the order of 𝒪(γmn). We call the above dynamic programming algorithm the CPSA-DP algorithm.
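The full-memory dynamic programming described above can be sketched in Python as follows. This is a minimal illustration that computes only the optimal score, not the authors' implementation: the function name, the argument order and the toy scoring function are assumptions, the tables `D` and `I` hold the scores of alignments ending with a deletion pair and an insertion pair, and all (γ + 1)(m + 1)(n + 1) entries are stored, which is exactly the memory cost that the divide-and-conquer algorithm avoids.

```python
NEG_INF = float("-inf")


def hamming(s, t):
    return sum(x != y for x, y in zip(s, t))


def cpsa_dp_score(a, b, constraints, eps, sigma, w_open, w_ext):
    """Score of an optimal constrained alignment of a and b (full tables)."""
    m, n, g = len(a), len(b), len(constraints)

    def table():
        return [[[NEG_INF] * (n + 1) for _ in range(m + 1)] for _ in range(g + 1)]

    M, D, I = table(), table(), table()
    M[0][0][0] = 0  # empty alignment, no constraint consumed yet
    for k in range(g + 1):
        c = constraints[k - 1] if k else ""
        lam = len(c)
        for i in range(m + 1):
            for j in range(n + 1):
                if i == 0 and j == 0:
                    continue
                best = NEG_INF
                if i >= 1:  # alignment ends with the deletion pair (a_i, -)
                    D[k][i][j] = max(M[k][i - 1][j] - w_open - w_ext,
                                     D[k][i - 1][j] - w_ext)
                    best = max(best, D[k][i][j])
                if j >= 1:  # alignment ends with the insertion pair (-, b_j)
                    I[k][i][j] = max(M[k][i][j - 1] - w_open - w_ext,
                                     I[k][i][j - 1] - w_ext)
                    best = max(best, I[k][i][j])
                if i >= 1 and j >= 1:  # unconstrained substitution column
                    best = max(best, M[k][i - 1][j - 1] + sigma(a[i - 1], b[j - 1]))
                if k >= 1 and i >= lam and j >= lam:  # band for constraint C_k
                    sa, sb = a[i - lam:i], b[j - lam:j]
                    if hamming(sa, c) <= lam * eps and hamming(sb, c) <= lam * eps:
                        band = sum(sigma(x, y) for x, y in zip(sa, sb))
                        best = max(best, M[k - 1][i - lam][j - lam] + band)
                M[k][i][j] = best
    return M[g][m][n]


def toy_sigma(x, y):
    # Hypothetical substitution scores: +2 match, -1 mismatch.
    return 2 if x == y else -1


print(cpsa_dp_score("GAT", "TAG", [], 0.0, toy_sigma, 2, 1))
print(cpsa_dp_score("GAT", "TAG", ["T"], 0.0, toy_sigma, 2, 1))
```

In the second call the single constraint "T" forces the two T residues into one column, so the best score drops from the unconstrained diagonal alignment to an alignment with two gaps of length two.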

Hirschberg (1975) developed a linear-space algorithm for solving the longest common subsequence problem based on the divide-and-conquer technique. Since then, this strategy has been extended to yield a number of memory-efficient algorithms for aligning biological sequences (Myers and Miller, 1988; Chao et al., 1994). In this paper, we generalize Hirschberg's algorithm so that it is capable of dealing with the CPSA problem. Compared with the earlier extensions, our generalization is more complicated because the grid graph 𝒢 dealt with here is 3D, instead of 2D, and the input sequences are accompanied by several constraints that need to be considered carefully. The central idea of our memory-efficient algorithm is to determine a middle position (i_mid, j_mid, k_mid) on an optimal path from ℳ₀(0, 0) to ℳ_γ(m, n) in 𝒢 so that we can divide the constrained alignment problem into two smaller constrained alignment problems; these smaller problems are then divided further in the same manner, and finally the optimal constrained alignment is obtained by merging the series of calculated mid-points (Fig. 2).

Fig. 2 — Schematic diagram of divide-and-conquer approach: two light gray areas are the reduced subproblems after middle position (i_mid, j_mid, k_mid) is determined, each of which will be further divided into two subproblems of dark gray areas.
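To make the divide-and-conquer idea concrete, here is a minimal Python sketch of Hirschberg's linear-space strategy for the plain longest common subsequence problem, that is, the 2D case without constraints or gap penalties. It is not the CPSA-DC algorithm, and the function names are chosen for this example only.

```python
def lcs_last_row(a, b):
    """LCS lengths of a against every prefix of b, keeping one row in memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def hirschberg_lcs(a, b):
    """One longest common subsequence of a and b in linear working space."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    # Forward pass on the first half, backward pass on the reversed second half.
    left = lcs_last_row(a[:mid], b)
    right = lcs_last_row(a[mid:][::-1], b[::-1])
    n = len(b)
    # The middle position is the column where the two partial scores add up best.
    split = max(range(n + 1), key=lambda j: left[j] + right[n - j])
    return hirschberg_lcs(a[:mid], b[:split]) + hirschberg_lcs(a[mid:], b[split:])


print(hirschberg_lcs("AGGTAB", "GXTXAYB"))
```

The same pattern (a forward pass, a backward pass, and a scan for the best meeting point on the middle row) is what the constrained algorithm has to generalize to a 3D grid with bands that may straddle the cut.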

Before describing our algorithm, some notation must be introduced as follows. Let Ā_i and B̄_j denote the suffixes a_{i+1}a_{i+2} … a_m and b_{j+1}b_{j+2} … b_n of A and B, respectively, for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Let Ω̄_k denote the ordered subset (C_{k+1}, C_{k+2}, …, C_γ) for 1 ≤ k ≤ γ. Define ℳ̄_k(i, j) to be the score of an optimal constrained alignment of Ā_i and B̄_j w.r.t. Ω̄_k, and define 𝒮̄_k(i, j) [𝒟̄_k(i, j) and ℐ̄_k(i, j), respectively] to be the maximum score of all constrained alignments of Ā_i and B̄_j w.r.t. Ω̄_k that begin with a substitution [deletion and insertion, respectively] pair (a_{i+1}, b_{j+1}) [(a_{i+1}, −) and (−, b_{j+1}), respectively]. Let Ω_k(h) = [C₁, C₂, …, C_{k−1}, pref(C_k, h)] and Ω̄_k(h) = [suff(C_k, λ_k − h), C_{k+1}, …, C_γ], where 1 ≤ h ≤ λ_k. Let 𝒩̄_k(i, j, h) denote the score of an optimal semi-constrained alignment of Ā_i and B̄_j w.r.t. Ω̄_k(h) that begins with a band whose induced consensus is equal to suff(C_k, λ_k − h). Note that the recurrences for computing the matrices ℳ̄_k, 𝒟̄_k, ℐ̄_k and 𝒩̄_k can be developed similarly to those for computing ℳ_k, 𝒟_k, ℐ_k and 𝒩_k, respectively, where 𝒟_k and ℐ_k denote the scores of the alignments ending with a deletion pair and an insertion pair. If δ[suff(A_i, λ_k), C_k] ≤ λ_k × ε and δ[suff(B_j, λ_k), C_k] ≤ λ_k × ε, then we can reformulate the recurrence of 𝒩_k as follows: 𝒩_k(i, j, 1) = ℳ_{k−1}(i − 1, j − 1) + σ(a_i, b_j) and 𝒩_k(i, j, h) = 𝒩_k(i − 1, j − 1, h − 1) + σ(a_i, b_j) for each 1 < h ≤ λ_k.

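Next, we describe our divide-and-conquer algorithm, termed the CPSA-DC algorithm, for computing an optimal constrained alignment between A and B w.r.t. Ω. The key point is to determine the middle position (i_mid, j_mid, k_mid) of the optimal path in 𝒢 to divide the problem into two subproblems, each of which is recursively divided into two smaller subproblems in the same way. Given an alignment ℒ, we use score(ℒ) to denote the score of ℒ. Let ℒ_γ(A, B) be an optimal constrained alignment of A and B w.r.t. Ω; clearly, score[ℒ_γ(A, B)] = ℳ_γ(m, n). Let i_mid = ⌊m/2⌋. Then, we partition ℒ_γ(A, B) into two parts by cutting it at the position immediately after a_{i_mid}, and we let ℒ₁ denote the part containing a_{i_mid} and ℒ₂ denote the remaining part, where b_{j_mid} denotes the last character in ℒ₁ from B, and k_mid denotes the largest index so that C_{k_mid} (approximately) appears in ℒ₁. In what follows, 𝒮_k(i, j) and 𝒟_k(i, j) denote the maximum scores of the constrained alignments of A_i and B_j w.r.t. Ω_k that end with a substitution pair and a deletion pair, respectively, and barred symbols denote the corresponding quantities for the suffixes Ā_i = a_{i+1} … a_m and B̄_j = b_{j+1} … b_n and the remaining constraints Ω̄_k = (C_{k+1}, …, C_γ). Then there are two possibilities when we consider the last aligned pair of ℒ₁.

Case 1: The last aligned pair of ℒ₁ is a substitution pair [i.e. (a_{i_mid}, b_{j_mid})]. In this case, we have score[ℒ_γ(A, B)] = score(ℒ₁) + score(ℒ₂). If (a_{i_mid}, b_{j_mid}) is not a constrained column in ℒ_γ(A, B), then ℒ₁ is an optimal constrained alignment of A_{i_mid} and B_{j_mid} w.r.t. Ω_{k_mid} ending with a substitution pair (a_{i_mid}, b_{j_mid}), and ℒ₂ is an optimal constrained alignment of Ā_{i_mid} and B̄_{j_mid} w.r.t. Ω̄_{k_mid}. Hence, ℳ_γ(m, n) = 𝒮_{k_mid}(i_mid, j_mid) + ℳ̄_{k_mid}(i_mid, j_mid). If (a_{i_mid}, b_{j_mid}) is a constrained column in ℒ_γ(A, B), then ℒ₁ is an optimal semi-constrained alignment of A_{i_mid} and B_{j_mid} w.r.t. Ω_{k_mid}(h_mid) ending with a band ℒ′ whose induced consensus is equal to pref(C_{k_mid}, h_mid). If h_mid < λ_{k_mid}, then ℒ₂ is an optimal semi-constrained alignment of Ā_{i_mid} and B̄_{j_mid} w.r.t. Ω̄_{k_mid}(h_mid) beginning with a band ℒ″ whose induced consensus is equal to suff(C_{k_mid}, λ_{k_mid} − h_mid). Moreover, the induced consensus of the merge of ℒ′ and ℒ″ has to be equal to C_{k_mid}. In this case, we have ℳ_γ(m, n) = 𝒩_{k_mid}(i_mid, j_mid, h_mid) + 𝒩̄_{k_mid}(i_mid, j_mid, h_mid). If h_mid = λ_{k_mid}, then ℒ₂ is an optimal constrained alignment of Ā_{i_mid} and B̄_{j_mid} w.r.t. Ω̄_{k_mid}, and hence ℳ_γ(m, n) = 𝒩_{k_mid}(i_mid, j_mid, λ_{k_mid}) + ℳ̄_{k_mid}(i_mid, j_mid).

Case 2: The last aligned pair of ℒ₁ is a deletion pair [i.e. (a_{i_mid}, −)]. If the first aligned pair in ℒ₂ is not a deletion pair, then ℳ_γ(m, n) is the sum of 𝒟_{k_mid}(i_mid, j_mid) and the best score of ℒ₂ over all alignments that do not begin with a deletion pair. If the first aligned pair in ℒ₂ is a deletion pair, then ℳ_γ(m, n) = 𝒟_{k_mid}(i_mid, j_mid) + 𝒟̄_{k_mid}(i_mid, j_mid) + w_o. We need to compensate by adding w_o because the open penalty of the gap containing a_{i_mid} and a_{i_mid+1} in ℒ_γ(A, B) is charged twice, once by 𝒟_{k_mid}(i_mid, j_mid) and once by 𝒟̄_{k_mid}(i_mid, j_mid).

In summary, ℳ_γ(m, n) is the maximum, over all candidate positions j and k (and h where a band is cut), of the values derived in the two cases above.

When 𝒟_k(i_mid, j) + 𝒟̄_k(i_mid, j), which never exceeds 𝒟_k(i_mid, j) + 𝒟̄_k(i_mid, j) + w_o, is added to the right-hand side, the above recurrence is not changed, but can be reformulated as follows:

ℳ_γ(m, n) = max over j and k of max{ 𝒟_k(i_mid, j) + ℳ̄_k(i_mid, j), 𝒟_k(i_mid, j) + 𝒟̄_k(i_mid, j) + w_o, 𝒮_k(i_mid, j) + ℳ̄_k(i_mid, j), max_{1 ≤ h < λ_k} [𝒩_k(i_mid, j, h) + 𝒩̄_k(i_mid, j, h)], 𝒩_k(i_mid, j, λ_k) + ℳ̄_k(i_mid, j) }, where the last two terms apply only when k ≥ 1 and the corresponding band satisfies the error ratio ε.

In other words, j_mid, k_mid and h_mid are the indices j, k and h, where 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h < λ_k, at which this maximum is attained.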

Now, we show how to use 𝒪(αn), instead of 𝒪(γmn), memory to determine j_mid, k_mid and h_mid, where α = ∑_{1 ≤ k ≤ γ} λ_k and α ≤ min{m, n} intrinsically. In fact, a single matrix E of size (γ + 1) × (n + 1), with each entry E(k, j) of λ_k + 4 space, is enough to compute ℳ_k(i_mid, j), 𝒟_k(i_mid, j), ℐ_k(i_mid, j), 𝒮_k(i_mid, j) (the deletion-, insertion- and substitution-ending scores) and 𝒩_k(i_mid, j, h), for 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h ≤ λ_k. When reaching the entry (i, j, k) of the 3D grid graph 𝒢, we use entry E(k, j) of E to hold the most recently computed values of ℳ_k(i, j), 𝒟_k(i, j), ℐ_k(i, j), 𝒮_k(i, j) and 𝒩_k(i, j, h), which clearly needs a total of λ_k + 4 space. Note that the old values in entry E(k, j) are moved into an extra entry, termed V_k, whose space is equal to that of E(k, j), before they are overwritten by their newly computed values. Before moving the old values in E(k, j) into V_k, however, we need to first move ℳ_k(i − 1, j − 1) in V_k into a space named v_{k, k+1}, where 1 ≤ i ≤ m. The mechanism above enables us to compute 𝒩_k(i, j, 1), which needs to refer to ℳ_{k−1}(i − 1, j − 1) kept in v_{k−1, k}; to compute 𝒩_k(i, j, h) for each 2 ≤ h ≤ λ_k, which needs to refer to 𝒩_k(i − 1, j − 1, h − 1) kept in V_k; to compute 𝒮_k(i, j), which needs to refer to ℳ_k(i − 1, j − 1) kept in V_k; and finally to compute ℳ_k(i, j). Figure 3 shows the grid locations of E(k − 1), E(k) and the values in V_{k−1} and V_k when we reach the entry (i, j, k) of 𝒢 for the computation, where E(k) denotes the k-th row of E. Hence, the total space needed for computing and storing all ℳ_k(i_mid, j), 𝒟_k(i_mid, j), ℐ_k(i_mid, j), 𝒮_k(i_mid, j) and 𝒩_k(i_mid, j, h) is the sum of the space of matrix E, the space of all V_k and the space of all v_{k, k+1}, where 1 ≤ j ≤ n, 0 ≤ k ≤ γ and 1 ≤ h ≤ λ_k, which is equal to 𝒪(αn). Similarly, the matrix Ē required for computing the corresponding values of the reverse pass (the scores of the alignments of the suffixes a_{i_mid+1} … a_m and b_{j+1} … b_n) still needs 𝒪(αn) space. Hence, the determination of j_mid, k_mid and h_mid can be performed in 𝒪(αn) space. The details of the CPSA-DC algorithm are described as follows.
Note that the program code of BestScoreRev is similar to that of BestScore and hence is omitted here. In the code, the variable E(ℳ_k(i_mid, j)) is used to denote the value of ℳ_k(i_mid, j) in E(k, j), and the others are analogous. The global variables ℋ_A(k, h) = δ(suff(A_{i_mid}, h), pref(C_k, h)) and ℋ_B(j, k, h) = δ(suff(B_j, h), pref(C_k, h)), together with their counterparts for the reverse pass, are computed in Algorithm BestScore so that they can be used directly in Algorithm CPSA-DC.

Fig. 3 — The grid locations of E(k − 1), E(k) and the values in V_k−1 and V_k when the entry (i, j, k) of 𝒢, marked with ‘?’, is reached for the computation.

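Algorithm CPSA-DC(i_start, i_end, j_start, j_end, k_start, k_end)
Input: Sequences a_{i_start} … a_{i_end} and b_{j_start} … b_{j_end} with constraints (C_{k_start}, …, C_{k_end})
/* 𝒟_k, 𝒮_k: scores of the alignments ending with a deletion pair and a substitution pair; E holds the values computed by BestScore, Ē the corresponding (barred) values computed by BestScoreRev */
1: if (i_start > i_end) or (j_start > j_end) then
      Align the nonempty sequence with spaces;
   else
      i_mid = ⌊(i_start + i_end)/2⌋;
      BestScore(i_start, i_mid, j_start, j_end, k_start, k_end);
      BestScoreRev(i_mid + 1, i_end, j_start, j_end, k_start, k_end);
   end if
2: max = −∞;
   for j = j_start − 1 to j_end do
      for k = k_start − 1 to k_end do
         if E(𝒟_k(i_mid, j)) + Ē(ℳ̄_k(i_mid, j)) > max then
            max = E(𝒟_k(i_mid, j)) + Ē(ℳ̄_k(i_mid, j));
            j_mid = j; k_mid = k; type = case 1;
         end if
         if E(𝒟_k(i_mid, j)) + Ē(𝒟̄_k(i_mid, j)) + w_o > max then
            max = E(𝒟_k(i_mid, j)) + Ē(𝒟̄_k(i_mid, j)) + w_o;
            j_mid = j; k_mid = k; type = case 2;
         end if
         if E(𝒮_k(i_mid, j)) + Ē(ℳ̄_k(i_mid, j)) > max then
            max = E(𝒮_k(i_mid, j)) + Ē(ℳ̄_k(i_mid, j));
            j_mid = j; k_mid = k; type = case 3;
         end if
         if k ≥ 1 then
            for h = 1 to λ_k − 1 do
               if (the λ_k characters of A placed in the band cut after its h-th column have at most λ_k × ε mismatches with C_k) and (the same holds for the λ_k characters of B) then
                  if E(𝒩_k(i_mid, j, h)) + Ē(𝒩̄_k(i_mid, j, h)) > max then
                     max = E(𝒩_k(i_mid, j, h)) + Ē(𝒩̄_k(i_mid, j, h));
                     j_mid = j; k_mid = k; h_mid = h; type = case 4;
                  end if
               end if
            end for
            if ℋ_A(k, λ_k) ≤ λ_k × ε and ℋ_B(j, k, λ_k) ≤ λ_k × ε then
               if E(𝒩_k(i_mid, j, λ_k)) + Ē(ℳ̄_k(i_mid, j)) > max then
                  max = E(𝒩_k(i_mid, j, λ_k)) + Ē(ℳ̄_k(i_mid, j));
                  j_mid = j; k_mid = k; h_mid = λ_k; type = case 5;
               end if
            end if
         end if
      end for
   end for
3: if type = case 1 then
      CPSA-DC(i_start, i_mid − 1, j_start, j_mid, k_start, k_mid);
      Align a_{i_mid} with a space;
      CPSA-DC(i_mid + 1, i_end, j_mid + 1, j_end, k_mid + 1, k_end);
   end if
   if type = case 2 then
      CPSA-DC(i_start, i_mid − 1, j_start, j_mid, k_start, k_mid);
      Align a_{i_mid}a_{i_mid+1} with two spaces;
      CPSA-DC(i_mid + 2, i_end, j_mid + 1, j_end, k_mid + 1, k_end);
   end if
   if type = case 3 then
      CPSA-DC(i_start, i_mid − 1, j_start, j_mid − 1, k_start, k_mid);
      Align a_{i_mid} with b_{j_mid};
      CPSA-DC(i_mid + 1, i_end, j_mid + 1, j_end, k_mid + 1, k_end);
   end if
   if type = case 4 then
      CPSA-DC(i_start, i_mid − h_mid, j_start, j_mid − h_mid, k_start, k_mid − 1);
      Align the λ_{k_mid} characters of A starting at a_{i_mid − h_mid + 1} with the λ_{k_mid} characters of B starting at b_{j_mid − h_mid + 1};
      CPSA-DC(i_mid + λ_{k_mid} − h_mid + 1, i_end, j_mid + λ_{k_mid} − h_mid + 1, j_end, k_mid + 1, k_end);
   end if
   if type = case 5 then
      CPSA-DC(i_start, i_mid − λ_{k_mid}, j_start, j_mid − λ_{k_mid}, k_start, k_mid − 1);
      Align the last λ_{k_mid} characters of A ending at a_{i_mid} with the last λ_{k_mid} characters of B ending at b_{j_mid};
      CPSA-DC(i_mid + 1, i_end, j_mid + 1, j_end, k_mid + 1, k_end);
   end if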

Algorithm BestScore(i_start, i_end, j_start, j_end, k_start, k_end)
Input: Sequences a_{i_start} … a_{i_end} and b_{j_start} … b_{j_end} with constraints (C_{k_start}, …, C_{k_end})
/* 𝒟_k, ℐ_k, 𝒮_k: scores of the alignments ending with a deletion, an insertion and a substitution pair */
1: /* Reindex */
   m = i_end − i_start + 1; n = j_end − j_start + 1;
   γ = k_end − k_start + 1;
2: /* Initialization */
   for j = 0 to n do
      for k = 0 to γ do
         E(𝒟_k(i_mid, j)) = −∞;
         if (j = 0) or (k > 0) then E(ℐ_k(i_mid, j)) = −∞;
         else E(ℐ_k(i_mid, j)) = −w_o − j × w_e;
         if (j = 0) and (k = 0) then E(ℳ_k(i_mid, j)) = 0;
         else E(ℳ_k(i_mid, j)) = −∞;
         if k ≥ 1 then
            for h = 1 to λ_k do E(𝒩_k(i_mid, j, h)) = −∞;
         end if
      end for
   end for
3: /* Computation */
   for i = 1 to m do
      for k = 0 to γ do /* For the case of j = 0 */
         V_k(ℳ_k(i_mid, 0)) = E(ℳ_k(i_mid, 0));
         if k ≥ 1 then
            for h = 1 to λ_k do V_k(𝒩_k(i_mid, 0, h)) = E(𝒩_k(i_mid, 0, h));
         end if
         Update the values in E(k, 0) for row i by their recurrences;
      end for
      for j = 1 to n do /* For the case of j > 0 */
         for k = 0 to γ do
            temp_k(ℳ_k(i_mid, j)) = E(ℳ_k(i_mid, j));
            if k ≥ 1 then
               for h = 1 to λ_k do temp_k(𝒩_k(i_mid, j, h)) = E(𝒩_k(i_mid, j, h));
            end if
            Compute E(𝒟_k(i_mid, j)) and E(ℐ_k(i_mid, j)) by their recurrences;
            if k ≥ 1 then
               for h = 1 to λ_k do
                  if h = 1 then
                     E(𝒩_k(i_mid, j, 1)) = v_{k−1,k} + σ(a_i, b_j);
                  else
                     E(𝒩_k(i_mid, j, h)) = (the value of 𝒩_k(i − 1, j − 1, h − 1) kept in V_k) + σ(a_i, b_j);
                  end if
               end for
            end if
            Compute E(𝒮_k(i_mid, j)) from V_k(ℳ_k(i_mid, j)) and σ(a_i, b_j), and then E(ℳ_k(i_mid, j)) by its recurrence;
            v_{k,k+1} = V_k(ℳ_k(i_mid, j));
            V_k(ℳ_k(i_mid, j)) = temp_k(ℳ_k(i_mid, j));
            if k ≥ 1 then
               for h = 1 to λ_k do
                  V_k(𝒩_k(i_mid, j, h)) = temp_k(𝒩_k(i_mid, j, h));
               end for
            end if
            if i = m and k ≥ 1 then /* Mismatch counts ℋ_A and ℋ_B */
               for h = 1 to λ_k do
                  if h = 1 then
                     if j = 1 and (the compared characters of A and C_k differ) then
                        ℋ_A(k, h) = 1; else ℋ_A(k, h) = 0;
                     if (the compared characters of B and C_k differ) then
                        ℋ_B(j, k, h) = 1; else ℋ_B(j, k, h) = 0;
                  else
                     if j = 1 and (the compared characters of A and C_k differ) then
                        ℋ_A(k, h) = ℋ_A(k, h − 1) + 1;
                     if (the compared characters of B and C_k differ) then
                        ℋ_B(j, k, h) = ℋ_B(j, k, h − 1) + 1;
                  end if
               end for
            end if
         end for
      end for
   end for

Now, we analyze the time-complexity of our CPSA-DC algorithm for solving the CPSA problem. As shown in Figure 2, after determining the middle position (i_mid, j_mid, k_mid) of the optimal path in 𝒢, we can divide the original problem into two subproblems, each of which can be further divided recursively into two smaller subproblems in the same way. Note that regardless of where the optimal path passes through (i_mid, j_mid, k_mid), the total size of the two reduced subproblems is just half the size of the original problem, where the size is measured by the number of entries in 𝒢. It is not hard to see that the time needed to determine the middle position of each subproblem at each recursive stage is proportional to the size of the subproblem. Let Ψ denote the size of the original problem (i.e. Ψ = γmn). Then the total time-complexity of our CPSA-DC algorithm is proportional to Ψ + Ψ/2 + Ψ/4 + ⋯ = 2Ψ, which is twice that of the CPSA-DP algorithm. Using the CPSA-DC algorithm as a kernel, we designed a memory-efficient algorithm, termed CMSA-DC, for progressively aligning multiple input sequences into a CMSA according to the branching order of a guide tree. The progressive method we adopted was proposed by Tang et al. (2003). Owing to space limitations, we refer the reader to their paper for the details of its implementation.
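The factor of two in the running time follows from a geometric series. The derivation below is a sketch under the assumption, not stated explicitly in the text, that locating the middle position costs at most c times the number of entries of the subproblem for some constant c.

```latex
\begin{align*}
T(\Psi) &\le c\,\Psi + c\,\frac{\Psi}{2} + c\,\frac{\Psi}{4} + \cdots \\
        &= c\,\Psi \sum_{t \ge 0} 2^{-t} \\
        &= 2c\,\Psi = \mathcal{O}(\gamma m n), \qquad \Psi = \gamma m n .
\end{align*}
```

Each recursion level works on subproblems whose total size is half that of the previous level, so the total work is bounded by twice the work of the first level.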

4 EXPERIMENTAL RESULTS

We implemented the CMSA-DC algorithm in Java as a web server called MuSiC-ME (Memory-Efficient tool for Multiple Sequence Alignment with Constraints). The input of the MuSiC-ME system consists of a set of protein/DNA/RNA sequences and a set of user-specified constraints, each of which is a fragment of residues/nucleotides that (approximately) appears in all input sequences. The output of MuSiC-ME is a CMSA in which the fragments of the input sequences whose residues/nucleotides exhibit a given degree of similarity to a constraint are aligned together. For its biological applications, we refer the reader to other related papers (Tang et al., 2003; Tsai et al., 2004).

In the following, we evaluate our memory-efficient MuSiC-ME system and compare its running time and memory usage with those of the original MuSiC system (Tsai et al., 2004), whose kernel CPSA algorithm was implemented by the dynamic programming approach. We chose five families of protein/RNA sequences as our testing datasets, each of which has been shown to contain an ordered series of conserved motifs related to the structures/functionalities/consensuses of the family (McClure et al., 1994; Chin et al., 2003; Tang et al., 2003; Tsai et al., 2004): (1) the aspartic acid protease family (Protease), (2) the hemoglobin family (Globin), (3) the ribonuclease family (RNase), (4) the kinase family (Kinase) and (5) the 3′-untranslated region of the coronaviruses (CoV-3′-UTR). From each family, we selected a representative set of sequences and adopted the ordered series of conserved motifs as the constraints. Table 1 lists the information of the tested families and their constraints. All tests were run with default parameters on an IBM PC with a 1.26 GHz processor and 512 MB RAM under Linux. Table 2 lists the CPU time and memory usage of our experiments using MuSiC and MuSiC-ME. It shows that the memory usage of MuSiC-ME is much smaller than that of MuSiC, especially for the longer sequences, and that the CPU time required by MuSiC-ME does not exceed that required by MuSiC on the tested families, since we have simplified the recurrences of the dynamic programming here.

Table 1.

The information of the tested families and their constraints

Family	#SEQ	MAXSEQ	#CON	MAXCON
Protease	6	123	4	1
Globin	6	146	7	2
RNase	6	185	3	1
Kinase	6	353	10	3
CoV-3′-UTR	6	422	12	2


#SEQ is the number of sequences, MAXSEQ is the maximum length of sequences, #CON is the number of constraints and MAXCON is the maximum length of constraints.
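To give a feeling for the gap between the 𝒪(γn²) and 𝒪(αn) space bounds on these datasets, the Python sketch below counts table entries from the values in Table 1. It is only a rough illustration: constant factors are ignored, n is taken as MAXSEQ, and because the table reports only the maximum constraint length, α is replaced by the upper bound #CON × MAXCON, which is an assumption of this example.

```python
# (family, #SEQ, MAXSEQ, #CON, MAXCON), copied from Table 1.
TABLE_1 = [
    ("Protease", 6, 123, 4, 1),
    ("Globin", 6, 146, 7, 2),
    ("RNase", 6, 185, 3, 1),
    ("Kinase", 6, 353, 10, 3),
    ("CoV-3'-UTR", 6, 422, 12, 2),
]


def entry_counts(max_seq, num_con, max_con):
    """Entries suggested by the two bounds: gamma * n^2 versus alpha * n."""
    dp_entries = num_con * max_seq * max_seq
    alpha_upper = num_con * max_con  # upper bound on the total constraint length
    dc_entries = alpha_upper * max_seq
    return dp_entries, dc_entries


for family, _, max_seq, num_con, max_con in TABLE_1:
    dp_entries, dc_entries = entry_counts(max_seq, num_con, max_con)
    print(f"{family}: gamma*n^2 = {dp_entries}, alpha*n <= {dc_entries}")
```

These counts are not memory measurements; the measured figures, which also include the JVM and program code, are the ones reported in Table 2.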

Table 2.

The comparison of CPU time and memory usage between MuSiC and MuSiC-ME

Family	MuSiC CPU time (s)	MuSiC memory (MB)	MuSiC-ME CPU time (s)	MuSiC-ME memory (MB)
Protease	6	25.4	6	15.5
Globin	23	42.0	18	15.5
RNase	11	32.0	8	15.5
Kinase	131	160.8	96	15.9
CoV-3′-UTR	—	—	165	17.4


The memory usage includes JVM (Java Virtual Machine), code (MuSiC/MuSiC-ME) and data, and MuSiC cannot deal with the case of CoV-3′-UTR due to running out of memory.

It is worth mentioning that in the MuSiC-ME system, the letters representing the constraints can be not only individual residues/nucleotides but also IUPAC (International Union of Pure and Applied Chemistry) codes. For example, the nucleotide codes N and R denote any nucleotide and a purine (i.e. A or G), respectively. This enhancement enables the user to define more flexible constraints or to combine several small constraints separated by fixed distances into a larger one. For example, consider our fifth experiment above, on the 3′-UTRs of the coronavirus sequences HCV-229E (human coronavirus), PEDV (porcine epidemic diarrhea virus), TGEV (porcine transmissible gastroenteritis virus), BCV (bovine coronavirus), MHV (mouse hepatitis virus) and SARS-TW1 (severe acute respiratory syndrome virus). All 12 adopted constraints appear in the sequence fragments that are able to fold into a stable pseudoknot structure (Williams et al., 1999; Tsai et al., 2004). However, these constraints are too short to align the truly conserved motifs of the sequences correctly: short constraints occur frequently in large genomic sequences, which makes it difficult to identify their true occurrences. In fact, four pairs of consecutive constraints appear in the stem regions (containing no loops) of the pseudoknots, and the two constraints of each pair are separated by a non-conserved subsequence of fixed length. Hence, we can combine each pair of constraints into a new, larger constraint by representing the non-conserved part with N. Consequently, we obtained eight new constraints, in the order (CUNNNNC, A, AA, G, C, UNNNA, GNNNNAG, UNNNA), for this dataset. After running MuSiC-ME, a satisfactory CMSA was found (Figure 4), where each band of the resulting CMSA corresponding to a constraint is shown in black with the constraint displayed beneath it.

This CMSA implies that the fragment of SARS-TW1 between the first band and the last band may fold into a pseudoknot structure that is possibly involved in the replication of SARS viruses (Pleij, 1994; Deiman and Pleij, 1997). In fact, this fragment is the pseudoknot sequence of SARS-TW1 that was found by Tsai et al. (2004), who used MuSiC to align the 3′-UTR of SARS-TW1 with the pseudoknot sequences, instead of the 3′-UTRs, of the other coronaviruses. The input sequences of the above experiment were also tested with Clustal W 1.82, the most commonly used MSA tool. In its resulting MSA, shown in Figure 5, the fragments of the pseudoknots, including our detected pseudoknot for SARS-TW1, were not well aligned, so it is difficult to identify the exact fragment of the SARS-TW1 pseudoknot from this MSA.
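The effect of merging two short constraints with N can be illustrated with a minimal Python sketch. It is not taken from MuSiC-ME: the function names, the toy sequence and the split of CUNNNNC into "CU" and "C" are assumptions made only for illustration, and T and U are treated as interchangeable so that the same table serves DNA and RNA. The sketch only counts where a constraint could match a single sequence; it does not perform any alignment.

```python
# IUPAC nucleotide codes mapped to the concrete bases they stand for.
# Assumption of this sketch: T and U are treated as the same base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "TU", "U": "TU",
    "R": "AG", "Y": "CTU", "S": "CG", "W": "ATU",
    "K": "GTU", "M": "AC", "B": "CGTU", "D": "AGTU",
    "H": "ACTU", "V": "ACG", "N": "ACGTU",
}


def code_matches(code, base):
    """Return True if the IUPAC letter `code` allows the concrete `base`."""
    return base in IUPAC[code]


def combine(left, right, gap):
    """Join two short constraints separated by `gap` non-conserved positions."""
    return left + "N" * gap + right


def occurrences(constraint, sequence):
    """Return every 0-based start index where `constraint` matches `sequence`."""
    width = len(constraint)
    return [
        start
        for start in range(len(sequence) - width + 1)
        if all(
            code_matches(code, base)
            for code, base in zip(constraint, sequence[start:start + width])
        )
    ]


# Hypothetical toy data, not one of the coronavirus sequences.
toy = "ACUGGAUCAUCUAAGGCUUC"
merged = combine("CU", "C", 4)

print(merged, occurrences(merged, toy))  # candidate sites of the merged constraint
print("C", occurrences("C", toy))        # candidate sites of a one-letter constraint
```

On the toy sequence the one-letter constraint has more candidate positions than the merged one, which is the ambiguity that the longer, N-padded constraints are meant to reduce.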

Fig. 4. Partial display of the CMSA produced by MuSiC-ME when aligning the 3′-UTR sequence of SARS-TW1 with those of the five other coronaviruses.

Fig. 5. Partial display of the MSA produced by Clustal W 1.82 when aligning the 3′-UTR sequences of the six coronaviruses, where the bases not in the pseudoknots are marked with dots.

5 CONCLUSIONS

In this paper, we designed a memory-efficient program for performing CMSA, which can incorporate biologists' knowledge about the structures/functionalities/consensuses of their datasets into sequence alignment such that the user-specified residues/nucleotides are aligned together. We first used the divide-and-conquer approach to design a memory-efficient algorithm for optimally aligning two sequences with constraints and then, based on this algorithm, used the progressive method to develop a memory-efficient tool, called MuSiC-ME, for heuristically aligning multiple sequences with constraints. MuSiC-ME makes it possible to align several large-scale protein/DNA/RNA sequences with constraints on a desktop PC with limited memory. Moreover, the letters allowed to represent the constraints in this system are the IUPAC codes, which enables the user to define more flexible constraints or to combine several small constraints separated by fixed distances into a larger one. It is worth mentioning that the A* algorithm, a heuristic search method from artificial intelligence, has been used extensively to solve the general MSA problem without constraints in a time- and/or memory-efficient manner (Ikeda and Imai, 1994, 1999; Kobayashi and Imai, 1999; Lermen and Reinert, 2000). Hence, it would be interesting to study whether the A* algorithm can also be applied to the CMSA problem.

Acknowledgments

The authors would like to thank the anonymous referees for their constructive comments on the presentation of this paper. This work was supported in part by the National Science Council of the Republic of China under grant NSC93-2213-E-009-113.

REFERENCES

Bafna, V., Lawler, E.L., Pevzner, P.A. 1997. Approximation algorithms for multiple sequence alignment. Theoret. Comput. Sci. 182, 233–244.
Bonizzoni, P. and Vedova, G.D. 2001. The complexity of multiple sequence alignment with SP-score that is a metric. Theoret. Comput. Sci. 259, 63–79.
Carrillo, H. and Lipman, D. 1988. The multiple sequence alignment problem in biology. SIAM J. Appl. Math. 48, 1073–1082.
Chan, S.C., Wong, A.K.C., Chiu, D.K.Y. 1992. A survey of multiple sequence comparison methods. Bull. Math. Biol. 54, 563–598.
Chao, K.M., Hardison, R.C., Miller, W. 1994. Recent developments in linear-space alignment methods: a survey. J. Comput. Biol. 1, 271–291.
Chin, F.Y.L., Ho, N.L., Lam, T.W., Wong, P.W.H., Chan, M.Y. 2003. Efficient constrained multiple sequence alignment with performance guarantee. Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB 2003), Los Alamitos, CA: IEEE, pp. 337–346.
Corpet, F. 1988. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 16, 10881–10890.
Deiman, B. and Pleij, C.W.A. 1997. Pseudoknots: a vital feature in viral RNA. Semin. Virol. 8, 166–175.
Depiereux, E. and Feytmans, E. 1992. MATCH-BOX: a fundamentally new algorithm for the simultaneous alignment of several protein sequences. Comput. Appl. Biosci. 8, 501–509.
Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 351–360.
Gusfield, D. 1993. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bull. Math. Biol. 55, 141–154.
Gusfield, D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. NY: Cambridge University Press.
Higgins, D. and Sharpe, P. 1988. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73, 237–244.
Hirschberg, D.S. 1975. A linear space algorithm for computing maximal common subsequences. Commun. ACM 18, 341–343.
Ikeda, T. and Imai, H. 1994. Fast A* algorithms for multiple sequence alignment. Proceedings of the Genome Informatics Workshop, Tokyo: Universal Academy Press, pp. 90–99.
Ikeda, T. and Imai, H. 1999. Enhanced A* algorithms for multiple alignments: optimal alignments for several sequences and k-opt approximate alignments for large cases. Theoret. Comput. Sci. 210, 341–374.
Kececioglu, J.D. 1993. The maximum weight trace problem in multiple sequence alignment. Proceedings of the Fourth Annual Symposium on Combinatorial Pattern Matching (CPM 1993), LNCS 684, Heidelberg, Germany: Springer-Verlag, pp. 106–119.
Kobayashi, H. and Imai, H. 1999. Improvement of the A* algorithm for multiple sequence alignment. Proceedings of the Genome Informatics Workshop, Tokyo: Universal Academy Press, pp. 120–130.
Lermen, M. and Reinert, K. 2000. The practical use of the A* algorithm for exact multiple sequence alignment. J. Comput. Biol. 7, 655–672.
Li, M., Ma, B., Wang, L. 2000. Near optimal multiple alignment within a band in polynomial time. Proceedings of the Thirty Second Annual ACM Symposium on Theory of Computing (STOC 2000), Portland, OR: ACM Press, pp. 425–434.
McClure, M.A., Vasi, T.K., Fitch, W.M. 1994. Comparative analysis of multiple protein-sequence alignment methods. Mol. Biol. Evol. 11, 571–592.
Morgenstern, B. 1999. DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211–218.
Myers, E.W. and Miller, W. 1988. Optimal alignment in linear space. Comput. Appl. Biosci. 4, 11–17.
Myers, G., Selznick, S., Zhang, Z., Miller, W. 1996. Progressive multiple alignment with constraints. J. Comput. Biol. 3, 563–572.
Nicholas, H.B., Ropelewski, A.J., Deerfield, D.W. 2002. Strategies for multiple sequence alignment. Biotechniques 32, 592–603.
Notredame, C. 2002. Recent progresses in multiple sequence alignment: a survey. Pharmacogenomics 3, 131–144.
Notredame, C., Higgins, D.G., Heringa, J. 2000. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217.
Pevzner, P.A. 1992. Multiple alignment, communication cost, and graph matching. SIAM J. Appl. Math. 52, 1763–1779.
Pleij, C.W.A. 1994. RNA pseudoknots. Curr. Opin. Struct. Biol. 4, 337–344.
Sammeth, M., Morgenstern, B., Stoye, J. 2003. Divide-and-conquer multiple alignment with segment-based constraints. Bioinformatics 19, ii189–ii195.
Schuler, G.D., Altschul, S.F., Lipman, D.J. 1991. A workbench for multiple alignment construction and analysis. Proteins 9, 180–190.
Stoye, J. 1998. Multiple sequence alignment with the divide-and-conquer method. Gene 211, GC45–GC56.
Stoye, J., Moulton, V., Dress, A.W.M. 1997. DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment. Comput. Appl. Biosci. 13, 625–626.
Stoye, J., Perrey, S.W., Dress, A.W.M. 1997. Improving the divide-and-conquer approach to sum-of-pairs multiple sequence alignment. Appl. Math. Lett. 10, 67–73.
Tang, C.Y., Lu, C.L., Chang, M.D.T., Tsai, Y.T., Sun, Y.J., Chao, K.M., Chang, J.M., Chiou, Y.H., Wu, C.M., Chang, H.T., Chou, W.I. 2003. Constrained multiple sequence alignment tool development and its application to RNase family alignment. J. Bioinform. Comput. Biol. 1, 267–287.
Taylor, W.R. 1987. Multiple sequence alignment by a pairwise algorithm. Comput. Appl. Biosci. 3, 81–87.
Taylor, W.R. 1994. Motif-biased protein sequence alignment. J. Comput. Biol. 1, 297–310.
Thompson, J.D., Higgins, D.G., Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties, and weight matrix choice. Nucleic Acids Res. 22, 4673–4680.
Thompson, J.D., Plewniak, F., Thierry, J.-C., Poch, O. 2000. DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. 28, 2919–2926.
Tönges, U., Perrey, S.W., Stoye, J., Dress, A.W.M. 1996. A general method for fast multiple sequence alignment. Gene 172, GC33–GC41.
Tsai, Y.T., Huang, Y.P., Yu, C.T., Lu, C.L. 2004. MuSiC: a tool for multiple sequence alignment with constraints. Bioinformatics (in press).
Wang, L. and Jiang, T. 1994. On the complexity of multiple sequence alignment. J. Comput. Biol. 1, 337–348.
Williams, G.D., Chang, R.-Y., Brian, D.A. 1999. A phylogenetically conserved hairpin-type 3′ untranslated region pseudoknot functions in coronavirus RNA replication. J. Virol. 73, 8349–8355.
Yu, C.T. 2003. Efficient algorithms for constrained sequence alignment problems. Master's Thesis, Department of Computer Science and Information Management, Providence University.

