Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2009 Mar 20;25(11):1422–1423. doi: 10.1093/bioinformatics/btp163

Biopython: freely available Python tools for computational molecular biology and bioinformatics

Peter J A Cock 1,*, Tiago Antao 2, Jeffrey T Chang 3, Brad A Chapman 4, Cymon J Cox 5, Andrew Dalke 6, Iddo Friedberg 7, Thomas Hamelryck 8, Frank Kauff 9, Bartek Wilczynski 10,11, Michiel J L de Hoon 12
PMCID: PMC2682512  PMID: 19304878

Abstract

Summary: The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Biopython includes modules for reading and writing different sequence file formats and multiple sequence alignments, dealing with 3D macro molecular structures, interacting with common tools such as BLAST, ClustalW and EMBOSS, accessing key online databases, as well as providing numerical methods for statistical learning.

Availability: Biopython is freely available, with documentation and source code at www.biopython.org under the Biopython license.

Contact: All queries should be directed to the Biopython mailing lists, see www.biopython.org/wiki/_Mailing_listspeter.cock@scri.ac.uk.

1 INTRODUCTION

Python (www.python.org) and Biopython are freely available open source tools, available for all the major operating systems. Python is a very high-level programming language, in widespread commercial and academic use. It features an easy to learn syntax, object-oriented programming capabilities and a wide array of libraries. Python can interface to optimized code written in C, C++or even FORTRAN, and together with the Numerical Python project numpy (Oliphant, 2006), makes a good choice for scientific programming (Oliphant, 2007). Python has even been used in the numerically demanding field of molecular dynamics (Hinsen, 2000). There are also high-quality plotting libraries such as matplotlib (matplotlib.sourceforge.net) available.

Since its founding in 1999 (Chapman and Chang, 2000), Biopython has grown into a large collection of modules, described briefly below, intended for computational biology or bioinformatics programmers to use in scripts or incorporate into their own software. Our web site lists over 100 publications using or citing Biopython.

The Open Bioinformatics Foundation (OBF, www.open-bio.org) hosts our web site, source code repository, bug tracking database and email mailing lists, and also supports the related BioPerl (Stajich et al., 2002), BioJava (Holland et al., 2008), BioRuby (www.bioruby.org) and BioSQL (www.biosql.org) projects.

2 BIOPYTHON FEATURES

The Seq object is Biopython's core sequence representation. It behaves very much like a Python string but with the addition of an alphabet (allowing explicit declaration of a protein sequence for example) and some key biologically relevant methods. For example,

graphic file with name btp163i1.jpg

Sequence annotation is represented using SeqRecord objects which augment a Seq object with properties such as the record name, identifier and description and space for additional key/value terms. The SeqRecord can also hold a list of SeqFeature objects which describe sub-features of the sequence with their location and their own annotation.

The Bio.SeqIO module provides a simple interface for reading and writing biological sequence files in various formats (Table 1), where regardless of the file format, the information is held as SeqRecord objects. Bio.SeqIO interprets multiple sequence alignment file formats as collections of equal length (gapped) sequences. Alternatively, Bio.AlignIO works directly with alignments, including files holding more than one alignment (e.g. re-sampled alignments for bootstrapping, or multiple pairwise alignments). Related module Bio.Nexus, developed for Kauff et al. (2007), supports phylogenetic tools using the NEXUS interface (Maddison et al., 1997) or the Newick standard tree format.

Table 1.

Selected Bio.SeqIO or Bio.AlignIO file formats

Format R/W Name and reference
fasta R+W FASTA (Pearson and Lipman, 1988)
genbank R+W GenBank (Benson et al., 2007)
embl R EMBL (Kulikova et al., 2006)
swiss R Swiss-Prot/TrEMBL or UniProtKB
(The UniProt Consortium, 2007)
clustal R+W Clustal W (Thompson et al., 1994)
phylip R+W PHYLIP (Felsenstein, 1989)
stockholm R+W Stockholm or Pfam (Bateman et al., 2004)
nexus R+W NEXUS (Maddison et al., 1997)

Where possible, our format names (column ‘Format’) match BioPerl and EMBOSS (Rice et al., 2000). Column ‘R/W’ denotes support for reading (R) and writing (W).

Modules for a number of online databases are included, such as the NCBI Entrez Utilities, ExPASy, InterPro, KEGG and SCOP. Bio.Blast can call the NCBI's online Blast server or a local standalone installation, and includes a parser for their XML output. Biopython has wrapper code for other command line tools too, such as ClustalW and EMBOSS. Bio.PDB module provides a PDB file parser, and functionality related to macromolecular structure (Hamelryck and Manderick, 2003). Module Bio.Motif provides support for sequence motif analysis (searching, comparing and de novo learning). Biopython's graphical output capabilities were recently significantly extended by the inclusion of GenomeDiagram (Pritchard et al., 2006).

Biopython contains modules for supervised statistical learning, such as Bayesian methods and Markov models, as well as unsu pervised learning, such as clustering (De Hoon et al., 2004).

The population genetics module provides wrappers for GENEPOP (Rousset, 2007), coalescent simulation via SIMCOAL2 (Laval and Excoffier, 2004) and selection detection based on a well-evaluated Fst-outlier detection method (Beaumont and Nichols, 1996).

BioSQL (www.biosql.org) is another OBF supported initiative, a joint collaboration between BioPerl, Biopython, BioJava and BioRuby to support loading and retrieving annotated sequences to and from an SQL database using a standard schema. Each project provides an object-relational mapping (ORM) between the shared schema and its own object model (a SeqRecord in Biopython). As an example, xBASE (Chaudhuri and Pallen, 2006) uses BioSQL with both BioPerl and Biopython.

3 CONCLUSIONS

Biopython is a large open-source application programming interface (API) used in both bioinformatics software development and in everyday scripts for common bioinformatics tasks. The homepage www.biopython.org provides access to the source code, documentation and mailing lists. The features described herein are only a subset; potential users should refer to the tutorial and API documentation for further information.

ACKNOWLEDGEMENTS

The OBF hosts and supports the project. The many Biopython contributors over the years are warmly thanked, a list too long to be reproduced here.

Funding: Fundacao para a Ciencia e Tecnologia (Portugal) (grant SFRH/BD/30834/2006 to T.A.).

Conflict of Interest: none declared.

REFERENCES

  1. Chapman B, Chang J. Biopython: Python tools for computational biology. ACM SIGBIO Newslett. 2000;20:15–19. [Google Scholar]
  2. Chaudhuri RR, Pallen MJ. xBASE, a collection of online databases for bacterial comparative genomics. Nucleic Acids Res. 2006;34:D335–D337. doi: 10.1093/nar/gkj140. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bateman A, et al. The Pfam protein families database. Nucleic Acids Res. 2004;32:D138–D141. doi: 10.1093/nar/gkh121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Beaumont MA, Nichols RA. Evaluating loci for use in the genetic analysis of population structure. Proc. R. Soc. Lond. B. 1996;263:1619–1626. [Google Scholar]
  5. Benson DA, et al. GenBank. Nucleic Acids Res. 2007;35:D21–D25. doi: 10.1093/nar/gkl986. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Felsenstein J. PHYLIP -phylogeny inference package (Version 3.2) Cladistics. 1989;5:164–166. [Google Scholar]
  7. Hamelryck T, Manderick B. PDB file parser and structure class implemented in Python. Bioinformatics. 2003;19:2308–2310. doi: 10.1093/bioinformatics/btg299. [DOI] [PubMed] [Google Scholar]
  8. Hinsen K. The molecular modeling toolkit: a new approach to molecular simulations. J. Comp. Chem. 2000;21:79–85. [Google Scholar]
  9. Holland RCG, et al. BioJava: an open-source framework for bioinformatics. Bioinformatics. 2008;24:2096–2097. doi: 10.1093/bioinformatics/btn397. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. De Hoon MJL, et al. Open source clustering software. Bioinformatics. 2004;20:1453–1454. doi: 10.1093/bioinformatics/bth078. [DOI] [PubMed] [Google Scholar]
  11. Kauff F, et al. WASABI: an automated sequence processing system for multi-gene phylogenies. Syst. Biol. 2007;56:523–531. doi: 10.1080/10635150701395340. [DOI] [PubMed] [Google Scholar]
  12. Kulikova T, et al. EMBL nucleotide sequence database in 2006. Nucleic Acids Res. 2006;35:D16–D20. doi: 10.1093/nar/gkl913. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Lavel G, Excoffier L. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics. 2004;20:2485–2487. doi: 10.1093/bioinformatics/bth264. [DOI] [PubMed] [Google Scholar]
  14. Maddison DR, et al. NEXUS: an extensible file format for systematic information. Syst. Biol. 1997;46:590–621. doi: 10.1093/sysbio/46.4.590. [DOI] [PubMed] [Google Scholar]
  15. Oliphant TE. Guide to NumPy. USA: Trelgol Publishing; 2006. [Google Scholar]
  16. Oliphant TE. Python for Scientific Computing. Comput. Sci. Eng. 2007;9:10–20. [Google Scholar]
  17. Pearson WR, Lipman DJ. Improved tools for biological sequence analysis. PNAS. 1988;85:2444–2448. doi: 10.1073/pnas.85.8.2444. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Pritchard L, et al. GenomeDiagram: a Python package for the visualisation of large-scale genomic data. Bioinformatics. 2006;22:616–617. doi: 10.1093/bioinformatics/btk021. [DOI] [PubMed] [Google Scholar]
  19. Rice P, et al. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000;16:276–277. doi: 10.1016/s0168-9525(00)02024-2. [DOI] [PubMed] [Google Scholar]
  20. Rousset F. GENEPOP '007: a complete re-implementation of the GENEPOP software for Windows and Linux. Mol. Ecol. Res. 2007;8:103–106. doi: 10.1111/j.1471-8286.2007.01931.x. [DOI] [PubMed] [Google Scholar]
  21. Stajich JE, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12:1611–1618. doi: 10.1101/gr.361602. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. The UniProt Consortium. 2007 The universal protein resource (UniProt) Nucleic Acids Res. 35 D193-D197 [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Thompson JD, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. doi: 10.1093/nar/22.22.4673. [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES