Supporting Text

The algorithm is tested in the alignment of 20 mitochondrial control region sequences from primates (D-loop) and of 15 genomic sequences around the CAV2 gene from mammals and chicken (CAV2). In the first example (D-loop), the sequences are <400 bases long and contain only short indels, and we compare alignments performed with different settings to assess the effect of (i) our modification to the algorithm; (ii) the modeling of the substitution process; and, when the modification was enabled, (iii) the different ways of handling the insertions. As a reference, we align the same data set using a traditional algorithm implemented in the software CLUSTALW v.1.83 (1). In the second example (CAV2), the sequences are 5,200 to 7,000 bases long and contain long insertion elements and nonhomologous regions. Here, we test whether our progressive global aligner can infer reasonable alignments for genomic sequences and, because long insertion elements are typical in genomic sequences, compare the two different approaches for handling the inserted sites. The software used in the analyses, data sets analyzed, and detailed instructions to repeat the analyses are available at www.ebi.ac.uk/goldman or upon request from A.L.

D-Loop.

The guide tree for the D-loop alignments is based on the tree produced with CLUSTALW (Fig. 3 Left), but we have corrected the placement of C. aethiops and rooted the tree by defining E. fulvus as an outgroup (Fig. 3 Right). The sequences are aligned with CLUSTALW by using the software default parameters (analysis CLUSTALW) and with our probabilistic algorithm by using either the Jukes–Cantor (JC) or Hasegawa–Kishino–Yano (HKY) model (2, 3) and either disabling (–) or enabling (+) the correction for insertion sites. When the correction is enabled, characters are either allowed to be matched to sites inferred earlier as insertions (i.e., insertions may be closed, denoted +) or these sites are forced to stay as unmatched insertions (i.e., insertions open forever, +F). In all probabilistic alignments, parameters are r = 0.025, e = 0.5, and g = 0; in the HKY model, empirical base frequencies (0.342/0.309/0.104/0.245 for A/C/G/T) are used, and k, the transition–transversion ratio, is set to 2. The resulting alignments are shown in Figs. 5–11 as follows: CLUSTALW (Fig. 5), JC (Fig. 6), JC+ (Fig. 7), JC+F (Fig. 8), HKY (Fig. 9), HKY+ (Fig. 10), and HKY+F (Fig. 11).
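The substitution settings quoted above can be made concrete with a minimal sketch of an HKY rate matrix. The sketch assumes one common parameterization, in which the rate of change to base j is proportional to its frequency and is multiplied by k for transitions; the text does not state which definition of the transition–transversion ratio the software uses, so this is an illustration rather than the authors' implementation. The function names are hypothetical.

```python
# Illustrative sketch only: one common HKY parameterization (unnormalized),
# not the code used for the analyses described in the text.
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return x != y and ((x in PURINES) == (y in PURINES))


def hky_rate_matrix(freqs, kappa):
    """Return an unnormalized HKY rate matrix as a dict of dicts.

    Off-diagonal entry q[x][y] is freqs[y], multiplied by kappa when
    x -> y is a transition; each diagonal entry makes its row sum to zero.
    """
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                q[x][y] = freqs[y] * (kappa if is_transition(x, y) else 1.0)
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    return q


# Empirical D-loop base frequencies and k = 2, as quoted in the text.
dloop_freqs = {"A": 0.342, "C": 0.309, "G": 0.104, "T": 0.245}
Q = hky_rate_matrix(dloop_freqs, kappa=2.0)

# Jukes-Cantor arises as the special case with equal base frequencies
# and no transition bias.
jc_freqs = {b: 0.25 for b in BASES}
Q_jc = hky_rate_matrix(jc_freqs, kappa=1.0)
```

Under this parameterization the two models differ only in their inputs, which is why the JC and HKY analyses can share the rest of the alignment procedure.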

A comparison of the alignments performed with and without the correction for insertion sites (JC+ vs. JC and HKY+ vs. HKY) shows that our method works as intended: when enabled, the algorithm preferentially places gaps at the same sites and is more likely to create gaps that can be explained by a single insertion event. The heuristics implemented in CLUSTALW have a partly similar effect, and the CLUSTALW alignment has in total fewer gapped columns than JC and HKY. However, in comparison with JC+ and HKY+, the gaps inferred by CLUSTALW are less consistent with the phylogeny; one should remember that two insertions are never evolutionarily homologous and that gaps at the same site in different parts of the tree require multiple independent deletions. When the gaps inferred as insertions are forced to be skipped over in all subsequent alignments (i.e., JC+F and HKY+F; Figs. 8 and 11), all indel events are strictly consistent with the phylogeny. Some sequences have truncated terminal regions, however, and in intermediate alignments other (nontruncated) terminal regions are incorrectly inferred as insertions. By disabling their matching at a later stage, the strict "insertions open forever" rule spreads out the end of the alignment with multiple long gaps. The differences in the alignments inferred using different evolutionary substitution models (JC vs. HKY) suggest that subsequent analyses (e.g., phylogenetic inference) estimating the same parameters may, at least partly, depend on the initial choices made (or accepted) for the sequence alignment.
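The notion of a gap pattern being consistent with the phylogeny can be illustrated with a small sketch. It assumes a rooted tree written as nested tuples and treats one alignment column at a time: a column is explainable by a single event if the gapped sequences form a clade (one deletion on the branch leading to it) or if the sequences carrying a character form a clade (one insertion on that branch). The tree, the taxon labels s1 to s5, and the function names are hypothetical, and the check is a simplification that ignores multi-column events.

```python
def clades(tree):
    """Return (all leaves of tree, list of leaf sets of every subtree)."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, [leaves]
    all_leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        all_leaves |= child_leaves
        found.extend(child_clades)
    found.append(all_leaves)
    return all_leaves, found


def single_event_explains(tree, gapped):
    """True if one insertion or one deletion can explain the gap pattern.

    `gapped` is the set of sequences that have a gap in the column.
    """
    all_leaves, found = clades(tree)
    gapped = frozenset(gapped)
    present = all_leaves - gapped
    # Gapped set is a clade: single deletion. Present set is a clade:
    # single insertion.
    return gapped in found or present in found


# Hypothetical rooted guide tree with five sequences.
toy_tree = ((("s1", "s2"), "s3"), ("s4", "s5"))

deletion_like = single_event_explains(toy_tree, {"s1", "s2"})
insertion_like = single_event_explains(toy_tree, {"s3", "s4", "s5"})
scattered = single_event_explains(toy_tree, {"s1", "s4"})
```

Here the first two patterns can each be explained by one event, whereas gaps shared by s1 and s4 alone would need at least two independent events on this tree.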

Our method can be vulnerable to an incorrect alignment order, however: when the alignments were performed by using the original (i.e., wrong) guide tree, the algorithm attempted to explain the inconsistency in the data by inferring additional gaps (data not shown).

CAV2.

For the CAV2 alignments, a guide tree inferred with a maximum likelihood approach is used, although the placement of rodent and rabbit sequences in the tree is controversial (Fig. 4). The sequences are aligned with CLUSTALW both using the software default parameters (analysis CLUSTALW) and, because the high penalty for long gaps seems to cause problems, using the default parameters except a gap extension penalty of 0 (CLUSTALW_0). The probabilistic alignments, either allowing for insertions to be closed or forcing the algorithm to keep them open (HKY+ and HKY+F, respectively), are performed by using parameters r = 0.025, e = 0.9, g = 0, and the HKY model with empirical base frequencies (0.166/0.314/0.402/0.117 for A/C/G/T) and k = 2. The resulting alignments are shown in Figs. 12–15 as follows: CLUSTALW (Fig. 12), CLUSTALW_0 (Fig. 13), HKY+ (Fig. 14), and HKY+F (Fig. 15).
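Why a gap extension penalty penalizes long gaps so heavily can be seen from a minimal sketch of an affine gap cost. The penalty values below are illustrative assumptions, not the CLUSTALW defaults, and the sketch ignores the heuristics that CLUSTALW applies to its penalties; conventions for counting the first gapped site also vary between programs.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one gap: a fixed opening cost plus a cost per gapped site."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


# Illustrative penalties only (assumed values, not software defaults).
GAP_OPEN = 10.0
GAP_EXTEND = 0.5

# With a nonzero extension penalty, the cost grows with gap length.
long_gap_default = affine_gap_cost(300, GAP_OPEN, GAP_EXTEND)
short_gap_default = affine_gap_cost(3, GAP_OPEN, GAP_EXTEND)

# With the extension penalty set to 0, every gap costs the same to open,
# however long it is.
long_gap_zero = affine_gap_cost(300, GAP_OPEN, 0.0)
short_gap_zero = affine_gap_cost(3, GAP_OPEN, 0.0)
```

With these assumed values, a 300-site gap costs 160 units against 11.5 for a 3-site gap, whereas with a zero extension penalty both cost 10.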

The CLUSTALW alignment is very compact, only 9.7 kb in comparison with 12.0, 13.6, and 14.3 kb for CLUSTALW_0, HKY+, and HKY+F, respectively, and some regions are clearly overmatched because insertions in single or only a few sequences are discouraged (Fig. 12). The first and third exons are correctly aligned among the mammals and between mammals and chicken (sites 565–714 and 9,007–9,157, respectively), whereas the second exon of chicken is misaligned in comparison with mammals (sites 1,179–1,367 in mammals but 809–996 in chicken). The alignment CLUSTALW_0 is less compact, mostly because of the fragmentation due to many short gaps (Fig. 13). The first and third exons are again correctly aligned (sites 654–803 and 11,333–11,483, respectively), and the second exon is not (sites 1,376–1,630 in mammals but 910–1,176 in chicken). The alignment HKY+ is even longer than CLUSTALW_0, but the most dramatic difference is in the structure of the gaps: the difference in sequence lengths is distributed among fewer and longer indel events, and in most cases these gaps are consistent with the phylogeny; i.e., they can be explained by a single insertion or deletion event and the shared history of the sequences (Fig. 14). All three exons are correctly aligned among all of the sequences (sites 1,095–1,244, 1,825–2,012, and 12,608–12,758), although the alignment of nonexonic regions of the evolutionarily very distant chicken sequence to mammals is locally incorrect, and in some places the chicken sequence is matched against sites earlier inferred as insertions. The problem is largely resolved by forcing the algorithm to skip over all of the insertions (HKY+F): exons are correctly aligned (sites 1,159–1,308, 1,948–2,135, and 13,500–13,650), and, as expected, large proportions of the alignment consist of long insertions in a single or only a few sequences (Fig. 15).
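The exon coordinates reported above can be compared more easily as column spans. The sketch below only restates the site ranges from the text and computes the number of alignment columns that each range covers; a span counts every column between the first and last site, so it includes any gap columns inside the exon. The dictionary layout and names are hypothetical.

```python
# Site ranges (first, last) in alignment columns, copied from the text.
exon_ranges = {
    "CLUSTALW": {
        "exon 1": (565, 714),
        "exon 2, mammals": (1179, 1367),
        "exon 2, chicken": (809, 996),
        "exon 3": (9007, 9157),
    },
    "CLUSTALW_0": {
        "exon 1": (654, 803),
        "exon 2, mammals": (1376, 1630),
        "exon 2, chicken": (910, 1176),
        "exon 3": (11333, 11483),
    },
    "HKY+": {
        "exon 1": (1095, 1244),
        "exon 2": (1825, 2012),
        "exon 3": (12608, 12758),
    },
    "HKY+F": {
        "exon 1": (1159, 1308),
        "exon 2": (1948, 2135),
        "exon 3": (13500, 13650),
    },
}


def span(first, last):
    """Number of alignment columns covered by an inclusive site range."""
    return last - first + 1


spans = {
    analysis: {name: span(*sites) for name, sites in exons.items()}
    for analysis, exons in exon_ranges.items()
}

for analysis, exons in spans.items():
    for name, width in exons.items():
        print(f"{analysis:10s} {name:16s} {width:4d} columns")
```

In both probabilistic alignments the second exon covers 188 columns, whereas in CLUSTALW_0 the mammalian and chicken copies cover 255 and 267 columns at different positions.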

1. Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994) Nucleic Acids Res. 22, 4673–4680.

2. Jukes, T. H. & Cantor, C. (1969) in Mammalian Protein Metabolism, ed. Munro, H. N. (Academic, New York), pp. 21–132.

3. Hasegawa, M., Kishino, H. & Yano, T. (1985) J. Mol. Evol. 22, 160–174.