Abstract
The Clustal series of programs are widely used in molecular biology for the multiple alignment of both nucleic acid and protein sequences and for preparing phylogenetic trees. The popularity of the programs depends on a number of factors, including not only the accuracy of the results, but also the robustness, portability and user-friendliness of the programs. New features include NEXUS and FASTA format output, printing range numbers and faster tree calculation. Although, Clustal was originally developed to run on a local computer, numerous Web servers have been set up, notably at the EBI (European Bioinformatics Institute) (http://www.ebi.ac.uk/clustalw/).
INTRODUCTION
One of the cornerstones of modern bioinformatics is the comparison or alignment of protein sequences. With the aid of multiple sequence alignments, biologists are able to study the sequence patterns conserved through evolution and the ancestral relationships between different organisms. Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). The most widely used programs for global multiple sequence alignment are from the Clustal series of programs. The first Clustal program was written by Des Higgins in 1988 (1) and was designed specifically to work efficiently on personal computers, which at that time, had feeble computing power by today's standards. It combined a memory-efficient dynamic programming algorithm (2) with the progressive alignment strategy developed by Feng and Doolittle (3) and Willie Taylor (4). The multiple alignment is built up progressively by a series of pairwise alignments, following the branching order in a guide tree. The initial pre-comparison used a rapid word-based alignment algorithm (5) and the guide tree was constructed using the UPGMA method (6). In 1992, a new release was made, called ClustalV (7,8), which incorporated profile alignments (alignments of existing alignments) and the facility to generate trees from the multiple alignment using the Neighbour-Joining (NJ) method (9). The third generation of the series, ClustalW (10), released in 1994, incorporated a number of improvements to the alignment algorithm, including sequence weighting, position-specific gap penalties and the automatic choice of a suitable residue comparison matrix at each stage in the multiple alignment. In addition, the approximate word search used for the pre-comparison step was replaced by a more sensitive dynamic programming algorithm, and the dendogram construction by UPGMA was replaced by NJ. The ClustalW program looked very similar to ClustalV, with simple text menus for interactive use and the possibility of running the program in batch mode by specifying the input file and the parameter options on the command line.
The rationale behind the development of the Clustal series has been to provide robust, portable programs that are capable of providing good, biologically accurate alignments within a reasonable time limit. A close collaboration between biologists and computer scientists is probably one of the main reasons for the success and continued widespread use of the Clustal programs. ClustalW has given rise to a number of developments, including the latest member of the family, ClustalX (11). Although the alignments produced are the same as those produced by the current release of ClustalW, the user can better evaluate alignments in ClustalX. The program displays the multiple alignment in a scrollable window and all parameters are available using pull-down menus. Within alignments, conserved columns are highlighted using a customizable colour scheme and quality analysis tools are available to highlight potentially misaligned regions. ClustalX is easy to install, is user-friendly and maintains the portability of the previous generations through the NCBI Vibrant toolkit (ftp://ncbi.nlm.nih.gov/toolbox/ncbitools/). Numerous options are provided, such as the realignment of selected sequences or selected blocks of the alignment and the possibility of building up difficult alignments piecemeal, making ClustalX an ideal tool for working interactively on alignments.
Parallel versions of ClustalW and ClustalX have been developed by SGI (http://www.sgi.com/industries/sciences/chembio/resources/clustalw/parallel_clustalw.html), which show increased speeds of up to 10× when running ClustalW/X on 16 CPUs and significantly reduce the time required for data analysis. A number of other significant developments have been based on the ClustalW program. For example, ClustalNet (12) is a Clustal alignment CORBA server and DbClustal (13) is a program for aligning sequences detected by database searches, which uses local alignment information to anchor the global multiple alignment. DbClustal is available on the Web at http://www-igbmc.u-strasbg.fr/BioInfo/DbClustal and forms part of the WU-Blast2 (Washington University BLAST version 2.0) server at the EBI (http://www.ebi.ac.uk/blast2/).
ClustalWWW WEB SERVER
Numerous Web servers have exploited the command line interface of ClustalW, notably the EBI's ClustalWWW Web server, which currently runs between 2000–10 000 jobs/day and the SRS server at the same site (http://srs.ebi.ac.uk/), which has ClustalW built in. The EBI ClustalWWW interface provides extensive help, ranging from an introduction to multiple alignments for new users to detailed descriptions of each alignment option. An important factor in obtaining a high-quality alignment is the ability to change the numerous alignment parameters available in ClustalW. While the default values of the parameters have been optimised to work in the majority of cases, they are not necessarily optimal for any given alignment problem. In the ClustalWWW interface, all the options are easily accessible on the top page.
Sequences can be entered by either pasting them or by uploading a file from the user's local computer. In both cases, the sequences should be in one of seven different formats (GCG, FASTA, EMBL, GenBank, PIR, NBRF, Phylip or SWISS-PROT). Although users are encouraged to submit large numbers of sequences, there is no guarantee that the alignment will be completed within the job run limits. Therefore, users who experience problems when attempting to make very large alignments are advised to download the software and run it locally. In addition to the input format, the user can also specify the preferred output format for the multiple sequence alignment. The options are currently ALN, GCG, PHYLIP, PIR and GDE. It is also possible to configure the browser to automatically load the results files from ClustalW into a suitable external application. A list of some example URLs for obtaining such applications for MS-Windows, Macintosh and UNIX systems is provided in Table 1. Many commercial packages, e.g. the GCG package (Wisconsin Package, Genetics Computer Group, Madison, WI) and its X Window graphical user interface, SeqLab, can also accept ClustalW alignments.
Table 1. Example URLs of some external applications compatible with the output from ClustalW and ClustalX.
The resulting multiple alignments can be displayed as either black and white or colour coded text. An example of the colour coded display is shown in Figure 1. The alignment consists of four oxidoreductase NAD binding domains. The colouring of residues takes place according to physicochemical criteria highlighting conserved positions in the sequences. A consensus line is also displayed below the alignment with the following symbols denoting the degree of conservation observed in each column: ‘*’ (identical residues in all sequences), ‘:’ (highly conserved column), ‘.’ (weakly conserved column).
A recent enhancement to the ClustalWWW interface has been the addition of an option that allows the user to upload the results of ClustalW into an alignment editor, using a Java Applet called JalView (http://www.compbio.dundee.ac.uk/). JalView is a fully featured multiple sequence alignment editor which allows the user to perform further alignment analysis. Special features include the definition of sequence sub-groups, links to the SRS server at the EBI and an option to output the alignment as a colour postscript file for printing purposes.
As well as constructing multiple alignments, ClustalWWW can also calculate trees from a multiple alignment using the NJ method, a widely used and relatively fast algorithm that clusters sequences by minimising the sum of branch lengths. The resulting evolutionary relationships can be viewed either as cladograms or phylograms, with the option to display branch lengths (or ‘tree graph distances’).
NEW FEATURES
Both ClustalW and ClustalX are being actively maintained and updated. Recent enhancements have included the possibility of saving both alignments and phylogenetic trees in the NEXUS format (14) for compatibility with a number of phylogeny programs. Some work has also been done to optimise the alignment parameters, for example the Gonnet series of residue comparison matrices (15) is now used by default for protein sequence alignments. The latest version of the programs (version 1.83), which was released early this year, contained four main enhancements. The first modification is the facility to save the multiple alignment result as a FASTA format file, for compatibility with a number of other software packages. Another is to provide a percent identity matrix, which some users have asked for. A third new option is the possibility of saving the residue range in the output file when saving a user-specified range of the alignment. This is particularly useful when extracting a single domain from the alignment of multi-domain proteins. For example, in Figure 1 the NAD binding domain was extracted from a multiple alignment of the full-length oxidoreductase protein sequences and the residue range was automatically appended to the sequence names. Perhaps the most important enhancement in the latest version, however, is the incorporation of a faster implementation of the NJ algorithm used to construct guide trees during the multiple alignment process and also to construct phylogenetic trees based on the final alignment. Table 2 contains examples of the time required by the NJ algorithm for the construction of a phylogenetic tree from alignments containing different numbers of sequences. The increased speeds obtained mean that it is now possible to construct phylogenetic trees for very large sets of sequences, which were previously only feasible on very large computer systems. As an example, Figure 2 shows a phylogenetic tree constructed from an alignment of more than 1100 ring finger domain sequences taken from the PFAM database (16) entry PF00097. The new NJ implementation was written by T. Koike. An independent acceleration of the NJ algorithm has been published and is freely available as the QuickTree program (17). Though coding details differ, both implementations addressed the major slow points of the original code and so will not produce combinatorial improvement.
Table 2. A comparison of execution times.
Number of sequences | Original NJ | New NJ | ||
---|---|---|---|---|
NJ algorithm only | Complete multiple alignment | NJ algorithm only | Complete multiple alignment | |
200 | 0′ 6″ | 0′ 11″ | 0.1″ | 0′ 5″ |
500 | 6′ 55″ | 7′ 27″ | 1.1″ | 0′ 33″ |
1000 | XXX | XXX | 16″ | 2′ 18″ |
A comparison of two different implementations of the NJ algorithm (not including the time taken for the calculation of the distance matrix) for different sizes of alignments. The time required for the NJ algorithm depends only on the number of sequences, while the complete multiple alignment depends also on the lengths of the sequences. The timings reported here were all performed for sequences of approximately ∼40 residues. XXX denotes the algorithm did not complete. The timings were performed on a Compaq Alpha EV67 running True64 UNIX.
Acknowledgments
ACKNOWLEDGEMENTS
We thank the many Clustal users who have provided feedback, bug reports and feature requests. T.K. would like to express his thanks to the Life Science Systems Division at Fujitsu Ltd for allowing him to participate in the bioinformatics research and development at the National Institute of Genetics. J.D.T. was supported by institute funds from the Institut National de la Santé et de la Recherche Médicale, the Centre National de la Recherche Scientifique, the Hôpital Universitaire de Strasbourg and the Fond National de la Science (GENOPOLE).
REFERENCES
- 1.Higgins D.G. and Sharp,P.M. (1988) CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244. [DOI] [PubMed] [Google Scholar]
- 2.Myers E.W. and Miller,W. (1988) Optimal alignments in linear space. Comput. Applic. Biosci., 4, 11–17. [DOI] [PubMed] [Google Scholar]
- 3.Feng D.F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360. [DOI] [PubMed] [Google Scholar]
- 4.Taylor W.R. (1988) A flexible method to align large numbers of biological sequences. J. Mol. Evol., 28, 161–169. [DOI] [PubMed] [Google Scholar]
- 5.Wilbur W.J. and Lipman,D.J. (1983) Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl Acad. Sci. USA, 80, 726–730. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Sneath P.H.A. and Sokal,R.R. (1973) Numerical Taxonomy. WH Freeman, San Francisco, CA, pp. 230–234. [Google Scholar]
- 7.Higgins D.G. (1994) CLUSTAL V: multiple alignment of DNA and protein sequences. Methods Mol. Biol., 25, 307–318. [DOI] [PubMed] [Google Scholar]
- 8.Higgins D.G., Bleasby,A.J. and Fuchs,R. (1992) CLUSTAL V: improved software for multiple sequence alignment. Comput. Appl. Biosci., 8, 189–191. [DOI] [PubMed] [Google Scholar]
- 9.Saitou N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425. [DOI] [PubMed] [Google Scholar]
- 10.Thompson J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Thompson J.D., Gibson,T.J., Plewniak,F., Jeanmougin,F. and Higgins,D.G. (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res., 25, 4876–4882. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Compagne F. (2000) Clustalnet: the joining of Clustal and CORBA. Bioinformatics, 16, 606–612. [DOI] [PubMed] [Google Scholar]
- 13.Thompson J.D., Plewniak,F., Thierry,J. and Poch,O. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res., 28, 2919–2926. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Maddison D.R., Swofford,D.L. and Maddison,W.P. (1997) NEXUS: an extensible file format for systematic information, Syst. Biol., 46, 590–621. [DOI] [PubMed] [Google Scholar]
- 15.Benner S.A., Cohen,M.A. and Gonnet,G.H. (1994) Amino acid substitution during functionally constrained divergent evolution of protein sequences. Protein Eng., 7, 1323–1332. [DOI] [PubMed] [Google Scholar]
- 16.Bateman A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R., Griffiths-Jones,S., Howe,K.L., Marshall,M. and Sonnhammer,E.L. (2002) The Pfam protein families database. Nucleic Acids Res., 1, 276–280. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Howe K., Bateman,A. and Durbin,R. (2002) Quick Tree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics, 11, 1546–1547. [DOI] [PubMed] [Google Scholar]
- 18.Perrière G. and Gouy,M. (1996) WWW-Query: an on-line retrieval system for biological sequence banks. Biochimie, 78, 364–369. [DOI] [PubMed] [Google Scholar]