Skip to main content
Nucleic Acids Research logoLink to Nucleic Acids Research
. 2015 Apr 6;43(Web Server issue):W580–W584. doi: 10.1093/nar/gkv279

The EMBL-EBI bioinformatics web and programmatic tools framework

Weizhong Li 1,, Andrew Cowley 1,, Mahmut Uludag 1, Tamer Gur 1, Hamish McWilliam 1, Silvano Squizzato 1, Young Mi Park 1, Nicola Buso 1, Rodrigo Lopez 1,*
PMCID: PMC4489272  PMID: 25845596

Abstract

Since 2009 the EMBL-EBI Job Dispatcher framework has provided free access to a range of mainstream sequence analysis applications. These include sequence similarity search services (https://www.ebi.ac.uk/Tools/sss/) such as BLAST, FASTA and PSI-Search, multiple sequence alignment tools (https://www.ebi.ac.uk/Tools/msa/) such as Clustal Omega, MAFFT and T-Coffee, and other sequence analysis tools (https://www.ebi.ac.uk/Tools/pfa/) such as InterProScan. Through these services users can search mainstream sequence databases such as ENA, UniProt and Ensembl Genomes, utilising a uniform web interface or systematically through Web Services interfaces (https://www.ebi.ac.uk/Tools/webservices/) using common programming languages, and obtain enriched results with novel visualisations. Integration with EBI Search (https://www.ebi.ac.uk/ebisearch/) and the dbfetch retrieval service (https://www.ebi.ac.uk/Tools/dbfetch/) further expands the usefulness of the framework. New tools and updates such as NCBI BLAST+, InterProScan 5 and PfamScan, new categories such as RNA analysis tools (https://www.ebi.ac.uk/Tools/rna/), new databases such as ENA non-coding, WormBase ParaSite, Pfam and Rfam, and new workflow methods, together with the retirement of depreciated services, ensure that the framework remains relevant to today's biological community.

INTRODUCTION

The European Bioinformatics Institute (EMBL-EBI https://www.ebi.ac.uk) has provided free and open access to a range of bioinformatics applications for sequence analysis since 1998 (1). In 2009 the Job Dispatcher framework (2,3) was released to provide consistent, robust and updatable access to modern bioinformatics tools such as NCBI BLAST+ (4) and PSI-Search (5) for sequence similarity searching; InterProScan (6) and PfamScan (7) for protein functional analysis; and multiple sequence alignment tools such as Clustal Omega (8), Kalign2 (9) and MAFFT (10). Through these applications the latest mainstream bioinformatics databases can be searched, for example ENA (11), Ensembl Genomes (12), UniProt (13), InterPro (14) and Pfam (15).

The framework is used by academic and industry scientists, and in 2014 handled roughly 110 million analysis jobs, up from 65 million in 2013. Help pages, tutorials and user guides (available as protocols (16)) are provided, together with training courses and helpdesk support. Continued feedback from the biological community, collaboration with bioinformatics tools and data providers and comprehensive metrics analysis helps to drive improvements to the accessibility and quality of the services.

THE TOOLS FRAMEWORK

The EMBL-EBI Job Dispatcher is a modular and configuration-driven framework aimed at both novice and expert users. A uniform web browser interface enables users to upload their data or select existing data from our databases for analysis in a wide range of applications (Table 1). Browser inputs are checked to validate all the parameters required for successful job submission and guidance is provided to the user in the case of failure. Default parameter choices are set in collaboration with the tool authors for the intended uses of the tools, and can be adjusted by the user. Results are presented visually and enriched with data from other applications, for example cross-reference annotations via EBI Search (17) or functional domain predictions via InterPro (14). Biological data entries discovered as part of the analysis can be retrieved via the dbfetch service (3). SOAP and REST Web Services provide stable APIs for programmatic use. Input validation and parameter help is also built-in to Web Service use and results can be retrieved in a range of graphical and machine readable formats. Sample Web Services clients are available in a range of programming languages (e.g. C#, Java, Perl, Python and Ruby).

Table 1. Tool services available in the Job Dispatcher framework.

Category Tool
EMBOSS Programs (https://www.ebi.ac.uk/Tools/emboss/) needle, stretcher, water, matcher, transeq, sixpack, backtranseq, backtranambig, pepinfo, pepstats, pepwindow, cpgplot, newcpgreport, isochore & seqret
Multiple Sequence Alignment (https://www.ebi.ac.uk/Tools/msa/) clustal omega, clustalw2, dbclustal, kalign, mafft, mafft_addseq, muscle, mview, tcoffee & prank
Pairwise Sequence Alignment (https://www.ebi.ac.uk/Tools/psa/) needle, stretcher, water, matcher, lalign, wise2dba, genewise & promoterwise
Phylogeny Analysis (https://www.ebi.ac.uk/Tools/phylogeny/) clustalw2 phylogeny & raxml_epa
Protein Functional Analysis (https://www.ebi.ac.uk/Tools/pfa/) censor, fingerprintscan, interproscan 5, pfamscan, phobius, pratt, prosite scan & radar
RNA Analysis (https://www.ebi.ac.uk/Tools/rna/) infernal_cmscan & mapmi
Sequence Format Conversion (https://www.ebi.ac.uk/Tools/sfc/) seqret, readseq & mview
Sequence Operation (https://www.ebi.ac.uk/Tools/so/) censor & seqcksum
Sequence Similarity Search (https://www.ebi.ac.uk/Tools/sss/) ncbiblast+, fasta, ggsearch, glsearch, psiblast, psisearch, ssearch & wublast
Sequence Statistics (https://www.ebi.ac.uk/Tools/seqstats/) pepinfo, pepstats, pepwindow, saps, cpgplot, newcpgplot & isochore
Sequence Translation (https://www.ebi.ac.uk/Tools/st/) transeq, sixpack, backtranseq & backtranambig

New analysis tools and databases

New tool developments include NCBI BLAST+ for sequence similarity searching, InterProScan 5 (6) and PfamScan (7) for protein functional analysis, Infernal_cmscan (18) and MapMi (19) for RNA analysis, RAxML_EPA (20) for phylogenetic analysis and MAFFT_addseq (10) for multiple sequence alignment. Please see the Supplementary Information for sample inputs for PfamScan, Infernal_cmscan and MapMi. New sequence databases include ENA Coding and Non-coding sequence databases, WormBase ParaSite (21), Pfam (15), Rfam (22) along with many new genomes and proteomes as existing databases are updated.

Tool and database retirements

Legacy applications SRS (23), InterProScan 4 (24) and NCBI BLAST (non-plus) have been retired. EMBLCDS (11), HGVbase (25), IPI (26) and proteomes databases (13) have been removed from sequence similarity searching services.

New functionalities

As a result of user-feedback we have incorporated additional workflow functionalities to the framework. Interactive workflows help the investigator move between different tool categories and can be utilised through both the web browser interface and Web Services. Figure 1 illustrates an example workflow constructing a phylogenetic tree from sequence similarity search results. The top BLAST hit sequences (Figure 1a) are selected and aligned using the Clustal Omega tool (8); the alignment (Figure 1b) is then used to generate a phylogenetic analysis (Neighbor-Joining clustering (27)) and the final phylogenetic tree is displayed (Figure 1c). The user can control which sequences are selected at each stage in the process. Sequences are retrieved behind the scenes and robust filtering and validation procedures and additional pre- and post-processing steps have been implemented to ensure successful job submissions across the workflow.

Figure 1.

Figure 1.

An example workflow from NCBI BLAST+ to Clustal Omega and construction of a phylogenetic tree. (a) Perform a NCBI BLAST+ similarity search and select sequence hits from the summary table to align with Clustal Omega; (b) Perform a simple phylogenetic analysis on the Clustal Omega alignment; (c) Visualise the phylogenetic tree.

New result representations

The web interfaces have adopted the latest EMBL-EBI web style guidelines (https://www.ebi.ac.uk/web/guidelines) and are more user-friendly as a result of extensive usability testing. Feature annotations can be displayed that (Figure 2) highlight UniProt sequence features present within well-aligned regions and are available in the FASTA (28), PSI-Search and LALIGN services. The result summary tables (Figure 1a) for sequence similarity searching can now be downloaded in XML, CSV, TSV and JSON formats. The NCBI BLAST+ service now offers more BLAST alignment views, including ASN archive format. Phylogenetic analysis offers output in percentage identity matrix (PIM) and a new tree viewing (Figure 1c) component is now available that uses JavaScript technologies such as BioJS (29) and D3 (d3js.org).

Figure 2.

Figure 2.

An example domain display from PSI-Search output, showing UniProt sequence features that are present in significantly aligned regions.

WSDLs for the SOAP Web Services API have been provided since the first availability of the framework in 2009. Users of the REST API are now supported through the provision of equivalent WADLs for all tools. The parameter settings of analysis jobs can be accessed through the REST API as well. Integrated tests have been implemented to make the Web Services more robust and stable.

Help and documentation

EMBL-EBI offers helpdesk support and training courses for the use of the tool services provided by the framework. General help, FAQ pages, tutorials and example protocols (16) are available for using the services via web browser interfaces and sample clients for Web Services. A brief guide to Web Services technologies is also provided for those wishing to learn more and develop their own client programs (https://www.ebi.ac.uk/Tools/webservices/tutorials/00_contents).

FUTURE DEVELOPMENTS

As well as continuing to maintain existing services, future planned developments include the integration of new tools such as HMMER 3 (30), R-COFFEE (31) and new data resources such as ENA Barcode and Geospatial databases (11). Further cross-resource integration will be available, such as additional annotations to sequence similarity results using the EBI Search Web Services (17) and visualisations using novel client-side technologies that render complex data faster and in more efficient ways than traditional server-side methods. Ensembl data (32) will be available via the NCBI BLAST+ service.

Some applications have been flagged for retirement from EMBL-EBI in 2015. These include ClustalW2 (27), DaliLite (33), DbClustal (34), MaxSprout (35), ReadSeq (36) and WU-BLAST (37). Further details will be announced on the web site.

Additional support for users in the future will include webinars and the production of video-based tutorials and other integrated online learning capabilities.

DISCUSSION

Having a tools framework for EMBL-EBI applications allows users access to a range of services through uniform interfaces and helps the maintenance of a robust, relevant service by enabling individual applications to be added, updated, or retired as required. Improvements to the web browser interface help usability and allow more complex analyses to be carried out through the provision of workflow mechanisms between tools. Integration of other resources such as EBI Search and dbfetch expands the resources the framework can draw on and facilitates user acquisition of biological data. New and updated tools and databases ensure that scientists have access to the most recent analyses and data available, while retirement of depreciated services helps to ensure that the application set is well maintained and resources are dedicated to the most relevant services.

Since becoming available in 2009, the framework has been used by academic and industry users for almost 260 million analysis jobs and the volume of usage has been increasing significantly with roughly 110 million analyses in 2014 alone. Web Services in particular lend themselves to integration in third party pipelines, and the applications have been of use to commercial and academic organisations as well as to other EMBL-EBI teams such as Ensembl Genomes, Pfam and UniProt. Where such integrations are present it is especially important not to break dependencies. So, a careful process of communication and change management is in place, including updates through a range of channels that include mailing lists, news feeds, web site announcements and Twitter.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

SUPPLEMENTARY DATA

Acknowledgments

The authors wish to acknowledge Prof. William Pearson from the University of Virginia for his support and feedback and the web administrators and Systems team at EMBL-EBI for their assistance in the provision of the tools framework service.

FUNDING

Funding for open access charge: European Molecular Biology Laboratory.

Conflict of interest statement. None declared.

REFERENCES

  • 1.Brooksbank C., Bergman M.T., Apweiler R., Birney E., Thornton J. The European Bioinformatics Institute's data resources 2014. Nucleic Acids Res. 2014;42:D18–D25. doi: 10.1093/nar/gkt1206. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Goujon M., McWilliam H., Li W., Valentin F., Squizzato S., Paern J., Lopez R. A new bioinformatics analysis tools framework at EMBL-EBI. Nucleic Acids Res. 2010;38:W695–W699. doi: 10.1093/nar/gkq313. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.McWilliam H., Li W., Uludag M., Squizzato S., Park Y.M., Buso N., Cowley A.P., Lopez R. Analysis Tool Web Services from the EMBL-EBI. Nucleic Acids Res. 2013;41:W597–W600. doi: 10.1093/nar/gkt376. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., Madden T.L. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421–429. doi: 10.1186/1471-2105-10-421. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Li W., McWilliam H., Goujon M., Cowley A., Lopez R., Pearson W.R. PSI-Search: iterative HOE-reduced profile SSEARCH searching. Bioinformatics. 2012;28:1650–1651. doi: 10.1093/bioinformatics/bts240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Jones P., Binns D., Chang H.Y., Fraser M., Li W., McAnulla C., McWilliam H., Maslen J., Mitchell A., Nuka G., et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30:1236–1240. doi: 10.1093/bioinformatics/btu031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Mistry J., Bateman A., Finn R.D. Predicting active site residue annotations in the Pfam database. BMC Bioinformatics. 2007;8:298–311. doi: 10.1186/1471-2105-8-298. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Sievers F., Wilm A., Dineen D., Gibson T.J., Karplus K., Li W., Lopez R., McWilliam H., Remmert M., Söding J.J., et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 2011;7:539–544. doi: 10.1038/msb.2011.75. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Lassmann T., Frings O., Sonnhammer E.L. Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features. Nucleic Acids Res. 2009;37:858–865. doi: 10.1093/nar/gkn1006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Katoh K., Standley D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Silvester N., Alako B., Amid C., Cerdeño-Tárraga A., Cleland I., Gibson R., Goodgame N., Ten Hoopen P., Kay S., Leinonen R., et al. Content discovery and retrieval services at the European Nucleotide Archive. Nucleic Acids Res. 2014;43:D23–D29. doi: 10.1093/nar/gku1129. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Kersey P.J., Allen J.E., Christensen M., Davis P., Falin L.J., Grabmueller C., Hughes D.S., Humphrey J., Kerhornou A., Khobova J., et al. Ensembl Genomes 2013: scaling up access to genome-wide data. Nucleic Acids Res. 2014;42:D546–D552. doi: 10.1093/nar/gkt979. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43:D204–D212. doi: 10.1093/nar/gku989. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Mitchell A., Chang H.Y., Daugherty L., Fraser M., Hunter S., Lopez R., McAnulla C., McMenamin C., Nuka G., Pesseat S., et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 2015;43:D213–D221. doi: 10.1093/nar/gku1243. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Finn R.D., Bateman A., Clements J., Coggill P., Eberhardt R.Y., Eddy S.R., Heger A., Hetherington K., Holm L., Mistry J., et al. Pfam: the protein families database. Nucleic Acids Res. 2014;42:D222–D230. doi: 10.1093/nar/gkt1223. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Lopez R., Cowley A., Li W., McWilliam H. Using EMBL-EBI Services via Web Interface and Programmatically via Web Services. Curr Protoc Bioinformatics. 2014;48:3.12.1–3.12.50. doi: 10.1002/0471250953.bi0312s48. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Valentin F., Squizzato S., Goujon M., McWilliam H., Paern J., Lopez R. Fast and efficient searching of biological data resources–using EB-eye. Brief Bioinform. 2010;11:375–384. doi: 10.1093/bib/bbp065. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Nawrocki E.P., Eddy S.R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–2935. doi: 10.1093/bioinformatics/btt509. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Guerra-Assunção J.A., Enright A.J. MapMi: automated mapping of microRNA loci. BMC Bioinformatics. 2010;11:133–139. doi: 10.1186/1471-2105-11-133. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Rokas A. Phylogenetic analysis of protein sequence data using the Randomized Axelerated Maximum Likelihood (RAXML) Program. Curr. Protoc. Mol. Biol. 2011 doi: 10.1002/0471142727.mb1911s96. doi:10.1002/0471142727.mb1911s96. [DOI] [PubMed] [Google Scholar]
  • 21.Howe K., Davis P., Paulini M., Tuli M.A., Williams G., Yook K., Durbin R., Kersey P., Sternberg P.W. WormBase: Annotating many nematode genomes. Worm. 2012;1:15–21. doi: 10.4161/worm.19574. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Burge S.W., Daub J., Eberhardt R., Tate J., Barquist L., Nawrocki E.P., Eddy S.R., Gardner P.P., Bateman A. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 2013;41:D226–D232. doi: 10.1093/nar/gks1005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Etzold T., Ulyanov A., Argos P. SRS: information retrieval system for molecular biology data banks. Meth. Enzymol. 1996;266:114–128. doi: 10.1016/s0076-6879(96)66010-8. [DOI] [PubMed] [Google Scholar]
  • 24.Quevillon E., Silventoinen V., Pillai S., Harte N., Mulder N., Apweiler R., Lopez R. InterProScan: protein domains identifier. Nucleic Acids Res. 2005;33:W116–W120. doi: 10.1093/nar/gki442. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Fredman D., Munns G., Rios D., Sjöholm F., Siegfried M., Lenhard B., Lehväslaiho H., Brookes A.J. HGVbase: a curated resource describing human DNA variation and phenotype relationships. Nucleic Acids Res. 2004;32:D516–D519. doi: 10.1093/nar/gkh111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Kersey P.J., Duarte J., Williams A., Karavidopoulou Y., Birney E., Apweiler R. The International Protein Index: an integrated database for proteomics experiments. Proteomics. 2004;4:1985–1988. doi: 10.1002/pmic.200300721. [DOI] [PubMed] [Google Scholar]
  • 27.Larkin M.A., Blackshields G., Brown N.P., Chenna R., McGettigan P.A., McWilliam H., Valentin F., Wallace I.M., Wilm A., Lopez R., et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23:2947–2948. doi: 10.1093/bioinformatics/btm404. [DOI] [PubMed] [Google Scholar]
  • 28.Pearson W.R., Lipman D.J. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U.S.A. 1988;85:2444–2448. doi: 10.1073/pnas.85.8.2444. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Gómez J., García L.J., Salazar G.A., Villaveces J., Gore S., García A., Martín M.J., Launay G., Alcántara R., Del-Toro N., et al. BioJS: an open source JavaScript framework for biological data visualization. Bioinformatics. 2013;29:1103–1104. doi: 10.1093/bioinformatics/btt100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Mistry J., Finn R.D., Eddy S.R., Bateman A., Punta M. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res. 2013;41:e121. doi: 10.1093/nar/gkt263. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Wilm A., Higgins D.G., Notredame C. R-Coffee: a method for multiple alignment of non-coding RNA. Nucleic Acids Res. 2008;36:e52. doi: 10.1093/nar/gkn174. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Cunningham F., Amode M.R., Barrell D., Beal K., Billis K., Brent S., Carvalho-Silva D., Clapham P., Coates G., Fitzgerald S., et al. Ensembl 2015. Nucleic Acids Res. 2015;43:D662–D669. doi: 10.1093/nar/gku1010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Holm L., Park J. DaliLite workbench for protein structure comparison. Bioinformatics. 2000;16:566–567. doi: 10.1093/bioinformatics/16.6.566. [DOI] [PubMed] [Google Scholar]
  • 34.Thompson J.D., Plewniak F., Thierry J., Poch O. DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res. 2000;28:2919–2926. doi: 10.1093/nar/28.15.2919. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Holm L., Sander C. Database algorithm for generating protein backbone and side-chain co-ordinates from a C alpha trace application to model building and detection of co-ordinate errors. J. Mol. Biol. 1991;218:183–194. doi: 10.1016/0022-2836(91)90883-8. [DOI] [PubMed] [Google Scholar]
  • 36.Gilbert D. Sequence file format conversion with command-line readseq. Curr. Protoc. Bioinformatics. 2003 doi: 10.1002/0471250953.bia01es00. doi:10.1002/0471250953.bia01es00. [DOI] [PubMed] [Google Scholar]
  • 37.Lopez R., Silventoinen V., Robinson S., Kibria A., Gish W. WU-Blast2 server at the European Bioinformatics Institute. Nucleic Acids Res. 2003;31:3795–3798. doi: 10.1093/nar/gkg573. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

SUPPLEMENTARY DATA

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press

RESOURCES