Nucleic Acids Research. 2003 Jul 1;31(13):3795–3798. doi: 10.1093/nar/gkg573

WU-Blast2 server at the European Bioinformatics Institute

Rodrigo Lopez, Ville Silventoinen, Stephen Robinson, Asif Kibria, Warren Gish
PMCID: PMC168979  PMID: 12824421

Abstract

Since 1995, the WU-BLAST programs (http://blast.wustl.edu) have provided a fast, flexible and reliable method for similarity searching of biological sequence databases. The software is in use at many locales and web sites. The European Bioinformatics Institute's WU-Blast2 (http://www.ebi.ac.uk/blast2/) server has been providing free access to these search services since 1997 and today supports many features that both enhance the usability and expand on the scope of the software.

INTRODUCTION

The European Bioinformatics Institute (EBI) offers a wide range of methods for searching primary and secondary protein and nucleotide sequence databases (1). The EBI is committed to providing free and unfettered access to its databases, as well as to tools for exploiting them. Many of the databases are updated on a daily basis. In the case of the EMBL nucleotide sequence database (2) and SWALL [SWISS-PROT (3) combined with TrEMBL], the current growth rates are respectively ∼80% and 50% per year. Fast, flexible and reliable tools to search these data are therefore of paramount importance and software such as WU-BLAST (4) is at the centre of our effort. WU-BLAST was chosen for its sensitivity, speed, robust nature and unique combination of options governing its operation.

About WU-BLAST

Washington University BLAST version 2.0 was made publicly available in 1996, as the first suite of BLAST programs to provide gapped alignments with statistical significance estimates. This highly optimized software uses constrained dynamic programming to produce locally optimal gapped alignments by a method similar to that of Chao et al. (4) but with the inclusion of a drop-off score to further improve speed. The gapped alignments are themselves seeded by ungapped alignments found by the classical ungapped BLAST algorithm (5). As described below, the software offers a unique variety of statistical evaluation methods, as well as numerous output options and several ways to increase speed, sensitivity and selectivity. Historically, the software has been subject to frequent change, albeit with an effort to maintain reproducibility of results and backward compatibility upon each revision, but operational characteristics may still change. Users are therefore advised to consult the http://blast.wustl.edu site for the latest details about the search engine.

The EBI WU-Blast2 server

Numerous web servers use implementations of the BLAST algorithm to perform sequence similarity searches. The EBI's WU-Blast2 server has been in operation since 1997 and has been maintained under a strict update regimen to remain in synchrony with the latest databases, software developments and demands of users. Currently, one can search 12 protein sequence databases and 34 nucleotide sequence libraries, including the entirety of the EMBL nucleotide sequence database and various subsets—all major taxonomic divisions and HTG, EST and GSS special sections separately. Tables 1 and 2 indicate the databases that are currently available, as well as the WU-BLAST programs appropriate to search them.

Table 1. Protein sequence databases searchable by BLASTP and BLASTX on the EBI WU-Blast2 server.

Protein sequence database Description
swall SWALL Non-Redundant Protein Sequence Database: SWISS-PROT+SPTrEMBL+TrEMBLNew (2)
swissprot SWISS-PROT Protein Database (3)
swnew Updates to SWISS-PROT
sptrembl SPTrEMBL
tremblnew TrEMBLNew (Translated EMBL Updates)
IPI International Protein Index, provides a top level guide to the main databases that describe the human and mouse proteomes (http://www.ensembl.org/IPI)
sgt Structural Genomics Targets Database (http://www.ebi.ac.uk/msd-srv/msdtarget/)
pdb Brookhaven Database Sequences, The Protein Data Bank, the single worldwide repository for the processing and distribution of 3D biological macromolecular structure data (18)
prints PRINTS protein fingerprint sequences, a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family (19)
IMGT/HLAprot Provides specialist databases for sequences of the human major histocompatibility complex (HLA) (20)
gpcrdb G protein-coupled receptors (21)
gpcrdbsup G protein-coupled receptors supplement (21)

Table 2. Nucleotide sequence databases searchable by BLASTN, TBLASTN and TBLASTX on the EBI WU-Blast2 server.

Nucleotide sequence database Description
embl EMBL Nucleotide Sequence Database, Europe's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects and patent applications (2)
est_* EST divisions from the EMBL databank, contain sequence and mapping data on ‘single-pass’ cDNA sequences or Expressed Sequence Tags from a number of organisms
gss_* Genome Survey Sequence divisions from the EMBL databank. Contains random single-pass read genome survey sequences, cosmid/BAC/YAC end sequences, exon-trapped genomic sequences and Alu PCR sequences
htg_* High Throughput Genomes divisions from the EMBL databank. Contains unfinished genomic sequence data generated by the high-throughput sequencing centers
htg0_* High Throughput Genomes Phase 0 divisions from the EMBL databank. Low-coverage shotgun sequencing reads of individual genomic clones
STS Sequence Tagged Sites divisions from the EMBL databank. Contains sequence and mapping data on short genomic landmark sequences or Sequence Tagged Sites
emblnew EMBL daily updates (cumulative)
emvec EMVEC: an extraction of sequences from the SYNthetic division of EMBL containing >2000 sequences commonly used in cloning and sequencing experiments
igvec IGVEC: Intelligenetics Vector Database, a database of cloning vector sequences and annotations
IMGT/LIGM-DB Immunoglobulin and T cell receptor DNA sequences derived from IMGT (22)
IMGT/HLAnuc HLA DNA sequences derived from IMGT (20)

The WU-Blast2 server provides a simple user input form with the most important options to run the programs effectively to suit different users' needs. Most of the options are presented with defaults, which are equivalent to running the programs from the command line without options.

BLAST searches may be performed in two different ways at the EBI: interactively and via email. The latter approach uses the mail server located at blast@ebi.ac.uk. While interactive is the default, the RESULTS pop-up menu on the WWW form provides the choice of email for return of search results. For longer searches, typically of nucleotide sequence databases, it is recommended that the email mode be used. In this mode, a valid email address must be provided when the job request is submitted; the server will then display a page with the chosen settings for the search. The job request is then transferred to the mail server, at which time an email message is sent to inform the user that their job is running. When the search is completed, the results are returned in another email message in plain text format. To obtain further details on the use of the mail server, users may send an email message to blast@ebi.ac.uk with the word ‘HELP’ in the message body. The options for the mail service are identical to those offered by the EBI WU-Blast2 web server.

Interactive job submissions are more often used for protein searches, historically representing ∼70% of all requests. Both protein and nucleotide sequence database searches are available in interactive mode. Once a job is submitted in interactive mode, a web page is returned that contains the URL where the results will be available upon job completion. The URL is generated by combining client and server information that is unique to each job request. The URL can be book-marked in order to access the results at a more convenient time.

When an interactive job completes, the results are displayed in the format generated by the WU-BLAST search software. No attempt is made to embellish this output, as this could interfere with scripted parsing of the results. Raw results are available by job identifier at the server-provided URL after replacing the ‘.html’ extension with ‘.{programName}.res’ (for example, 865968.537525-32523.html becomes 865968.537525-32523.WU-blastp.res). Results may be retrieved for up to 24 h.
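The renaming rule can be sketched in a few lines of Python. This is only an illustration that follows the example given above, where the program name is joined to the job identifier with dots; the function name `raw_results_name` is hypothetical and not part of the server.

```python
def raw_results_name(results_page: str, program_name: str) -> str:
    """Swap the trailing '.html' of a results page name for '.<program_name>.res'."""
    suffix = ".html"
    if not results_page.endswith(suffix):
        raise ValueError(f"expected a name ending in {suffix!r}: {results_page!r}")
    # Drop '.html' and append the program name and the '.res' extension.
    return f"{results_page[:-len(suffix)]}.{program_name}.res"


if __name__ == "__main__":
    print(raw_results_name("865968.537525-32523.html", "WU-blastp"))
```

The same function can be applied to the full bookmarked URL, since only the trailing extension is changed.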

To aid in the interpretation of results, two additional tools are provided: MView (6) and DbClustal (7). MView allows users to visualize the results of a search as if they were a multiple sequence alignment, with amino acids color-coded according to their physicochemical properties. DbClustal generates global multiple sequence alignments from a BLAST search (protein versus protein only).

Using the server

The user must choose the program or search mode to run. All five search modes in the WU-BLAST suite are available: blastp, blastn, blastx, tblastn and tblastx. Depending on the choice of program, the DATABASE list will display only the databases (protein or nucleotide) available for use with that program. This prevents the user from submitting invalid input to the server. Other options, such as MATRIX and DNA STRAND, also change automatically and show correct lists of options for each of the programs.
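The pairing of programs with database types, as given in the captions of Tables 1 and 2, can be expressed as a small lookup. The sketch below is illustrative only: the dictionaries, the function names and the abbreviated database lists are assumptions made for the example, not the server's actual code.

```python
# Database type searched by each program (from the captions of Tables 1 and 2).
PROGRAM_DATABASE_TYPE = {
    "blastp": "protein",
    "blastx": "protein",
    "blastn": "nucleotide",
    "tblastn": "nucleotide",
    "tblastx": "nucleotide",
}

# A few database names from Tables 1 and 2 (abbreviated for the example).
DATABASES = {
    "protein": ["swall", "swissprot", "swnew", "sptrembl", "tremblnew"],
    "nucleotide": ["embl", "emblnew", "emvec", "igvec"],
}


def databases_for(program: str) -> list:
    """Return the databases that may be offered for the chosen program."""
    return list(DATABASES[PROGRAM_DATABASE_TYPE[program.lower()]])


def is_valid_choice(program: str, database: str) -> bool:
    """True if the database is of the type the program can search."""
    return database in databases_for(program)


if __name__ == "__main__":
    print(databases_for("tblastn"))
    print(is_valid_choice("blastp", "embl"))
```

Restricting the list up front, rather than rejecting a bad combination after submission, mirrors the behaviour of the form described above.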

Options for controlling the output of the programs include the Expect threshold, which can be changed within the range E=1.0 to E=1000. Sequence complexity filters (8), such as seg (9,10) and xnu (11) for peptide and dust (12, R. Tatusov and D. J. Lipman, ftp://ftp.ncbi.nih.gov/pub/tatusov/dust/version1) for nucleotide sequences, may also be chosen. By default, no complexity filtering is performed by WU-BLAST or the EBI WU-Blast2 server. As a result, many spurious hits may be obtained and the P-values reported may be unduly low (8). The software does, however, issue a WARNING when numerous hits are obtained and filtering has not been used. This contrasts with the NCBI BLAST server, which performs complexity filtering by default, but as a result may report no hits for some queries and suboptimal alignments for others.

A unique feature of the EBI WU-Blast2 server is the SENSITIVITY option that allows users to choose from predetermined sets of BLAST command line options for achieving different levels of sensitivity. For reference, Table 3 displays these options in the current implementation. Typical of the BLAST algorithm, increasing sensitivity correlates with slower speed and vice versa. The default, ‘normal’ sensitivity level will execute searches much faster and be more responsive than when ‘high’ sensitivity is selected. For quick searches, the user may choose ‘very low’ sensitivity but is reminded that the programs might miss important similarities. Very low sensitivity is a good choice for determining whether or not an identical sequence exists in a database.

Table 3. Sensitivity settings of the EBI WU-Blast2 server and their corresponding WU-BLAST command line options.

Search sensitivity Associated command line options
Protein
 Very low W=5 T=1000 wink=2 hitdist=30 altscore="*,any -12" altscore="any,* -12"
 Low T=1000 hitdist=40 altscore="*,any -12" altscore="any,* -12"
 Medium hitdist=40 altscore="*,any -12" altscore="any,* -12"
 Normal altscore="*,any -12" altscore="any,* -12" (W=3 T=11-to-13 wink=1 hitdist=0 hspmax=1000 gspmax=1000)
 High hspmax=0 gspmax=0
BLASTN
 Very low W=13 M=1 N=-3 Q=3 R=3 wink=2 hitdist=30
 Low W=12 M=1 N=-3 Q=3 R=3 hitdist=40
 Medium M=1 N=-3 Q=3 R=3
 Normal (W=11 gapW=16 M=5 N=-4 Q=10 R=10 hitdist=0 hspmax=1000 gspmax=1000)
 High W=9 gapW=32 hspmax=0 gspmax=0
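As an illustration of how such presets might be applied, the following Python sketch stores the Table 3 options in a dictionary and assembles a command line from them. Several points are assumptions rather than facts from the server: the parenthesized values in Table 3 are read as program defaults that need not be passed; the ‘Protein’ presets are applied to every program other than blastn; the argument order (program, database, query file, options) is assumed; and the names `SENSITIVITY_OPTIONS`, `options_for` and `command_line` are hypothetical.

```python
import shlex

# Stop-codon penalties shared by most protein presets in Table 3.
ALTSCORE = ["altscore=*,any -12", "altscore=any,* -12"]

# Presets transcribed from Table 3; parenthesized defaults are not passed.
SENSITIVITY_OPTIONS = {
    "protein": {
        "very low": ["W=5", "T=1000", "wink=2", "hitdist=30"] + ALTSCORE,
        "low": ["T=1000", "hitdist=40"] + ALTSCORE,
        "medium": ["hitdist=40"] + ALTSCORE,
        "normal": list(ALTSCORE),
        "high": ["hspmax=0", "gspmax=0"],
    },
    "blastn": {
        "very low": ["W=13", "M=1", "N=-3", "Q=3", "R=3", "wink=2", "hitdist=30"],
        "low": ["W=12", "M=1", "N=-3", "Q=3", "R=3", "hitdist=40"],
        "medium": ["M=1", "N=-3", "Q=3", "R=3"],
        "normal": [],
        "high": ["W=9", "gapW=32", "hspmax=0", "gspmax=0"],
    },
}


def options_for(group: str, level: str) -> list:
    """Return a copy of the option list for a preset group and sensitivity level."""
    return list(SENSITIVITY_OPTIONS[group][level.lower()])


def command_line(program: str, database: str, query_file: str, level: str) -> str:
    """Build a shell-quoted command line (argument order is an assumption)."""
    group = "blastn" if program == "blastn" else "protein"  # assumption
    args = [program, database, query_file] + options_for(group, level)
    return shlex.join(args)  # requires Python 3.8 or later


if __name__ == "__main__":
    print(command_line("blastn", "embl", "query.fa", "high"))
    print(command_line("blastp", "swall", "query.fa", "normal"))
```

Keeping each option as a separate list element and quoting only at the end avoids problems with the space inside the altscore values.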

Users can choose an alternative method for estimating the statistical significance of results from the STATS pop-up menu. Three methods of evaluation are available: ‘sump’, ‘poissonp’ and ‘kap’. Sump is the default method in all search modes; it employs Karlin-Altschul ‘Sum’ statistics (13), which involve a joint probability calculation on the scores of multiple, consistent alignments. Similarly, the ‘poissonp’ method performs joint probability calculations on consistent sets but uses Poisson statistics (13) instead. Sum statistics tend to work best when the scores among a consistent set of alignments vary considerably; Poisson statistics tend to work best when the scores in a set vary little. The third method of evaluation, ‘kap’, requires that the alignment scores be individually capable of satisfying the Expect significance threshold for reporting; the kap method is best applied when searching for similarity to a short motif or oligomer, when similarity across multiple regions is not expected or not of interest.
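The difference between evaluating alignments individually and jointly can be illustrated with the textbook Karlin-Altschul formulas. The sketch below is not the WU-BLAST implementation: it assumes the standard expectation E = K·m·n·exp(−λS) for a single score, and the Poisson probability of seeing at least r alignments that all reach the lowest score in the set. Sum statistics are more involved and are not shown. All function names and the parameter values in the usage lines are illustrative assumptions.

```python
import math


def expect(score: float, m: int, n: int, lam: float, k: float) -> float:
    """Expected number of chance alignments scoring at least `score`."""
    return k * m * n * math.exp(-lam * score)


def p_single(e: float) -> float:
    """Probability of at least one chance alignment, given expectation e."""
    return -math.expm1(-e)  # 1 - exp(-e), computed accurately for small e


def p_poisson(e_lowest: float, r: int) -> float:
    """Probability of at least r chance alignments, each reaching the lowest
    score in the set, where e_lowest is the expectation for that score."""
    fewer = sum(math.exp(-e_lowest) * e_lowest ** i / math.factorial(i)
                for i in range(r))
    return 1.0 - fewer


def passes_individually(scores, m, n, lam, k, threshold):
    """Scores that satisfy the Expect threshold on their own (kap-like test)."""
    return [s for s in scores if expect(s, m, n, lam, k) <= threshold]


if __name__ == "__main__":
    # Illustrative values only: query length, database length, lambda, K.
    m, n, lam, k = 300, 1_000_000, 0.3, 0.1
    print(expect(100, m, n, lam, k))
    print(p_poisson(0.5, 1), p_poisson(0.5, 3))
    print(passes_individually([100, 30], m, n, lam, k, threshold=10.0))
```

With r = 1 the Poisson probability reduces to the single-alignment case, and it falls as more alignments of the same score are required, which is why a set of modest scores can be jointly significant even when none passes the threshold alone.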

The default sort order for reporting the search results can be manipulated with the SORT option, which has the following possible values:

  • P-value: Database hits are sorted in order of increasing P-value (probability). When Sum or Poisson statistics are used (see the STATS option), the ascribed P-values are a function of the values reported in the column labeled N, as well.

  • count: Sorts the database hits from highest to lowest by the number of alignments found for each database sequence, independent of the scores of these alignments.

  • high score: Database sequences are sorted by the highest alignment score found for each one.

  • total score: Database hits are sorted by the sum of all alignment scores for each database sequence.
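The four sort orders can be mimicked with ordinary sort keys. In this sketch the `Hit` record, the key names and the example values are hypothetical; descending order for ‘high score’ and ‘total score’ is an assumption, since only the ‘count’ and ‘P-value’ directions are stated explicitly above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    name: str
    p_value: float
    scores: List[int]  # scores of all alignments found for this sequence


# One key function per SORT value; negation gives highest-first ordering.
SORT_KEYS = {
    "pvalue": lambda h: h.p_value,
    "count": lambda h: -len(h.scores),
    "highscore": lambda h: -max(h.scores),
    "totalscore": lambda h: -sum(h.scores),
}


def sort_hits(hits: List[Hit], order: str) -> List[str]:
    """Return hit names in the requested order."""
    return [h.name for h in sorted(hits, key=SORT_KEYS[order])]


if __name__ == "__main__":
    hits = [
        Hit("a", 1e-30, [200]),
        Hit("b", 1e-10, [90, 80, 70]),
        Hit("c", 1e-20, [150, 20]),
    ]
    for order in SORT_KEYS:
        print(order, sort_hits(hits, order))
```

The example hits are chosen so that the orders differ: a sequence with many modest alignments ranks first by count and total score but last by P-value and high score.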

‘Topcombo’ processing is also available as an option on the EBI server. Topcombo is a unique and useful feature of WU-BLAST that causes alignments to be reported in consistent sets (‘Groups’), with the condition that no alignment is allowed to be a member of more than one set. Topcombo helps eliminate spurious alignments from the output that are insignificant on their own and that would otherwise be reported in conjunction with other regions of similarity. Users may often want to see just the best set of consistent alignments, which is equivalent to setting the topcombo N parameter to a value of 1.

The last two options the user can change on the EBI server are the limits on the number of database sequences for which one-line descriptions (‘scores’) and alignments may be displayed. A maximum of 500 scores and 500 alignments may be chosen.

All homology and similarity services at the EBI share a sequence input form that permits the user to copy and paste sequence in a variety of popular formats, such as fasta, nbrf, gcg, embl, swiss, phylip, raw, etc. Sequences may also be uploaded or referred to as described in Table 4. The sequence input portion of the form also contains a small text dialog that indicates to the user which type of sequence is expected to be used for each of the five BLAST program choices.

Table 4. Special input syntax for retrieving sequences from internal databases for the sequence input form used in various EBI WWW services.

Database identifier syntax for queries in the sequence input form of the EBI WU-Blast2 Server
For SWALL (SPTR): swall:{swissprotId | swissprotAccession}
Examples: swall:wap_rat or swall:P01174
Note: SWALL (SPTR) includes SWISS-PROT, TrEMBL and TrEMBL Updates.
For EMBL: embl:{emblId | emblAccession}
Examples: embl:hscfos or embl:v01512
Note: EMBL includes the official release of EMBL and EMBL Updates.
Note: For the above, an identifier must NOT be followed by a carriage return; and multiple identifiers are NOT supported.
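A client-side check of this syntax might look like the following sketch. It is an assumption-laden illustration, not the server's parser: the function name `parse_db_query` is hypothetical, only the two database prefixes shown in Table 4 are recognized, and whitespace directly after the colon is tolerated as a convenience.

```python
from typing import Optional, Tuple

KNOWN_PREFIXES = {"swall", "embl"}  # the two prefixes listed in Table 4


def parse_db_query(text: str) -> Optional[Tuple[str, str]]:
    """Return (database, identifier) for 'db:identifier' input.

    Returns None when the text does not use the identifier syntax (it is then
    presumably a pasted sequence). Raises ValueError for input that uses the
    syntax but breaks the rules in Table 4.
    """
    prefix, colon, identifier = text.partition(":")
    if not colon or prefix.strip().lower() not in KNOWN_PREFIXES:
        return None
    if "\n" in text or "\r" in text:
        raise ValueError("identifier must not be followed by a carriage return")
    identifier = identifier.strip()
    if not identifier or any(ch.isspace() or ch == "," for ch in identifier):
        raise ValueError("exactly one identifier is supported")
    return prefix.strip().lower(), identifier


if __name__ == "__main__":
    print(parse_db_query("swall:wap_rat"))
    print(parse_db_query("embl:v01512"))
    print(parse_db_query("MKTAYIAKQR"))
```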

Comprehensive help is available by clicking the HELP button on the sequence input form, which opens a small window that provides thematic navigation of various subjects concerning WU-BLAST. It includes a detailed introduction to the software package, as well as examples and instructions on interpreting results. Generalities about sequence formats and references are also covered.

Differences from other BLAST servers

Aside from the unique and extensive variety of databases available to search on the EBI WU-Blast2 server, characteristics inherent to the underlying WU-BLAST software and the server's configuration may cause different results to be obtained from this server than from others. Potential sources of these differences include:

  • different default scoring systems (scoring matrix+gap penalties), including a more severe penalty of −12 for substitutions involving stop codons (‘*’; Table 3), compared to the −4 penalty specified in the standard BLOSUM62 matrix (14);

  • default use of the classical one-hit BLAST algorithm (5) in all search modes, which is somewhat slower but more sensitive than the two-hit BLAST algorithm (15) for the same neighborhood word score threshold; with the option of selecting the two-hit algorithm via the SENSITIVITY setting;

  • default search for gapped alignments in all search modes;

  • default use of Sum statistics (13,16) in all search modes, with the option of using Poisson statistics (17) or basic Karlin–Altschul statistics (17);

  • use of a gapped alignment procedure that constrains alignments to reside within a band;

  • use of different pre-computed values for the parameters λ, K and H (17) when evaluating the significance of gapped alignment scores (16);

  • when complexity filters are used (see above), subtle differences in the output from the filters used by WU-BLAST may arise, as compared to the filters used by other programs; the VIEW FILTER option allows the precise result of filtering to be observed;

  • support for selenocysteine residues (U) in protein queries and protein sequence databases.

It is noteworthy that the banded alignment procedure used by WU-BLAST 2.0 can cause long regions of similarity to be reported as multiple, shorter regions interrupted by the band boundaries. Breaks tend to arise in regions where the two sequences clearly do not align, but will also occur where the distribution of insertions (or deletions) is significantly biased toward one sequence or the other. Sensitivity may be reduced if the resultant, shorter alignments have a low probability of detection by the heuristic software or if the shorter alignments are detected but their scores fall below the established threshold. Assuming the shorter alignments are detected with scores above the threshold, joint probability calculations, like the default ‘sump’ method that is used, can boost their statistical significance well beyond that of a longer alignment, and thus improve sensitivity.
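The effect of a band on a gapped alignment can be demonstrated with a toy dynamic programming routine. The sketch below is not the WU-BLAST algorithm: it assumes a fixed-width band around a chosen diagonal, a linear gap penalty and no drop-off score, and the function name, scoring values and test sequences are all made up for the illustration.

```python
def banded_local_score(a: str, b: str, band: int, center: int = 0,
                       match: int = 5, mismatch: int = -4, gap: int = -2) -> int:
    """Best local alignment score using only cells with |(j - i) - center| <= band.

    Cells outside the band keep the value 0; because gap penalties are
    negative and local scores are floored at 0, this is equivalent to
    forbidding any path through them.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        lo = max(1, i + center - band)
        hi = min(len(b), i + center + band)
        for j in range(lo, hi + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,
                          prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                          prev[j] + gap,       # gap in b
                          curr[j - 1] + gap)   # gap in a
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    a = "ABCDEFGH" + "IJKLMNOP"
    b = "ABCDEFGH" + "xxxxx" + "IJKLMNOP"  # 5-residue insertion in b
    print(banded_local_score(a, b, band=len(b)))        # unconstrained
    print(banded_local_score(a, b, band=2))             # band on diagonal 0
    print(banded_local_score(a, b, band=2, center=5))   # band on diagonal 5
```

In this example the unconstrained search joins both matching regions across the insertion into a single alignment, whereas each narrow band sees only the region lying on its own diagonal, so the similarity would be reported as two shorter alignments to be combined statistically.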

CONCLUSION

We describe here the current state of the EBI WU-Blast2 server, along with much of its unique flexibility, which is freely available to the public. Questions, comments and suggestions from users are encouraged and may be addressed to support@ebi.ac.uk.

REFERENCES

  1. Brooksbank,C., Camon,E., Harris,M.A., Magrane,M., Martin,M.J., Mulder,N., O'Donovan,C., Parkinson,H., Tuli,M.A., Apweiler,R. et al. (2003) The European Bioinformatics Institute's data resources. Nucleic Acids Res., 31, 43–50.
  2. Stoesser,G., Baker,W., Van Den Broek,A., Garcia-Pastor,M., Kanz,C., Kulikova,T., Leinonen,R., Lin,Q., Lombard,V., Lopez,R. et al. (2003) The EMBL nucleotide sequence database. Nucleic Acids Res., 31, 17–22.
  3. Boeckmann,B., Bairoch,A., Apweiler,R., Blatter,M.-C., Estreicher,A., Gasteiger,E., Martin,M.J., Michoud,K., O'Donovan,C., Phan,I. et al. (2003) The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
  4. Chao,K.M., Pearson,W.R. and Miller,W. (1992) Aligning two sequences within a specified diagonal band. Comput. Appl. Biosci., 8, 481–487.
  5. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
  6. Brown,N.P., Leroy,C. and Sander,C. (1998) MView: a web-compatible database search or multiple alignment viewer. Bioinformatics, 14, 380–381.
  7. Thompson,J.D., Plewniak,F., Thierry,J. and Poch,O. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res., 28, 2919–2926.
  8. Altschul,S.F., Boguski,M.S., Gish,W. and Wootton,J.C. (1994) Issues in searching molecular sequence databases. Nature Genet., 6, 119–129.
  9. Wootton,J.C. and Federhen,S. (1993) Statistics of local complexity in amino acid sequences and sequence databases. Comp. Chem., 17, 149–163.
  10. Wootton,J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554–571.
  11. Claverie,J.M. and States,D.J. (1993) Information enhancement methods for large scale sequence analysis. Comp. Chem., 17, 191–201.
  12. Hancock,J.M. and Armstrong,J.S. (1994) SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput. Appl. Biosci., 10, 67–70.
  13. Karlin,S. and Altschul,S.F. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl Acad. Sci. USA, 90, 5873–5877.
  14. Henikoff,S. and Henikoff,J.G. (1993) Performance evaluation of amino acid substitution matrices. Proteins, 17, 49–61.
  15. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
  16. Altschul,S.F. and Gish,W. (1996) Local alignment statistics. Ed. R. Doolittle. Methods Enzymol., 266, 460–480.
  17. Karlin,S. and Altschul,S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl Acad. Sci. USA, 87, 2264–2268.
  18. Westbrook,J., Feng,Z., Jain,S., Bhat,T.N., Thanki,N., Ravichandran,V., Gilliland,G.L., Bluhm,W., Weissig,H., Greer,D.S. et al. (2002) The Protein Data Bank: unifying the archive. Nucleic Acids Res., 30, 245–248.
  19. Attwood,T.K., Bradley,P., Flower,D.R., Gaulton,A., Maudling,N., Mitchell,A.L., Moulton,G., Nordle,A., Paine,K., Taylor,P. et al. (2003) PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res., 31, 400–402.
  20. Robinson,J., Waller,M.J., Parham,P., de Groot,N., Bontrop,R., Kennedy,L.J., Stoehr,P. and Marsh,S.G.E. (2003) IMGT/HLA and IMGT/MHC: sequence databases for the study of the major histocompatibility complex. Nucleic Acids Res., 31, 311–314.
  21. Horn,F., Bettler,E., Oliveira,L., Campagne,F., Cohen,F.E. and Vriend,G. (2003) GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res., 31, 294–297.
  22. Lefranc,M.-P. (2003) IMGT, the international ImMunoGeneTics database®. Nucleic Acids Res., 31, 307–310.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
