Version Changes
Revised. Amendments from Version 1
Thanks to the referees’ recommendations, we have improved our manuscript in this revised version. An extra paragraph was added in the introduction to describe the rationale underlying Fragger. Several citations to web servers and protein fragment databases were added to provide a broader overview of similar approaches. Some variables in the algorithm have been renamed to facilitate easy understanding. A new paragraph and one reference about choosing good reference fragments were added. Several changes were made to clarify the choice of reference fragments, and the impact of the effective cutoff values on search speed.
Abstract
Protein modeling and design activities often require querying the Protein Data Bank (PDB) with a structural fragment, possibly containing gaps. For some applications, it is preferable to work on a specific subset of the PDB or with unpublished structures. These requirements, along with specific user needs, motivated the creation of new software to manage and query 3D protein fragments. Fragger is a protein fragment picker that allows protein fragment databases to be created and queried. All fragment lengths are supported and any set of PDB files can be used to create a database. Fragger can efficiently search a fragment database with a query fragment and a distance threshold. Matching fragments are ranked by distance to the query. The query fragment can have structural gaps, and the amino acid sequences allowed to match a query can be constrained via a regular expression of one-letter amino acid codes. Fragger also incorporates a tool to compute the backbone RMSD of one versus many fragments in high throughput. Fragger should be useful for protein design, loop grafting and related structural bioinformatics tasks.
Keywords: protein fragments, protein design, fragments database, structural query, triangular inequality
Introduction
A large number of protein structures are now available (122,761 at the RCSB PDB as of July 2017), and protein fragments are frequently used in structural bioinformatics. Protein structure prediction methods such as Rosetta [1], QUARK [2] and EdaFold [3, 4] use protein fragments as building blocks. Protein fragments are also used in crystallographic phasing [5–7] and model rebuilding [8]. The quality of protein models can be improved by combining protein fragments with molecular dynamics [9]. Other applications include the curation of unresolved loops in crystal structures [10, 11], grafting of loop sequences on protein scaffolds and other protein design algorithms [12, 13].
When there are too many fragments to search, an efficient strategy is necessary to reach sub-linear search times. This problem is well known to the chemoinformatics community, which has developed several efficient strategies to screen large databases of small molecules. For example, geometric embedding and locality sensitive hashing [14], k-d trees [15], a tree data structure (called µ-tree) with a heuristic [16], bounds of similarity scores for chemical fingerprints [17] and a proximity filter based on the logical exclusive-or operator [18] have all been developed to this end.
Currently, several fragment pickers [19–22] and protein fragment databases [23–28] are available. Of particular interest is the Super method [20], which uses a lower bound of RMSD [29] to screen the whole fragment space. However, our research on protein design and refinement of protein decoys for crystallographic phasing required specific options and therefore a new fragment picker.
Methods
Implementation
Fragger exploits the triangle inequality satisfied by RMSD [30] to prune the fragment space (Figure 1 and Algorithm 1). RMSDs are computed efficiently via the QCP method [31]. Fragger is written in OCaml [32], except for backbone RMSD computations, which are performed with a new version of the C++ ranker tool from Durandal [33]. Computations are parallelized on multi-core computers via the Parmap library [34].
Algorithm 1. Query with a fragment and an RMSD threshold. Comments are enclosed in braces.
Input: D: fragment set to query
Input: R: reference fragment set
Input: q: query fragment
Input: d_q: RMSD threshold
Output: M: matching fragment set
M ← D
{fuzzy query: prune the fragment space}
for r_j in R do
    d ← distance(q, r_j)
    d_inf ← d − d_q
    d_sup ← d + d_q
    {distance(f_i, r_j) comes from the database index}
    M ← {f_i ∈ M | distance(f_i, r_j) ∈ [d_inf, d_sup]}
end for
{exact query: refine the result of pruning}
M ← {f_i ∈ M | distance(f_i, q) ≤ d_q}
return M
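The pruning step is valid because any metric satisfies |distance(q, r) − distance(f, r)| ≤ distance(q, f): a fragment whose indexed distance to a reference falls outside [d − d_q, d + d_q] cannot be within d_q of the query. The following Python sketch illustrates Algorithm 1 under simplifying assumptions and is not Fragger's OCaml implementation. Fragments are plain coordinate tuples, and the Euclidean distance stands in for the backbone RMSD after optimal superposition. The names euclidean, build_index and query are hypothetical.

```python
import math

def euclidean(a, b):
    """Stand-in metric (assumption); Fragger uses backbone RMSD instead."""
    return math.dist(a, b)

def build_index(fragments, references, distance=euclidean):
    """Precompute distance(f_i, r_j) for every fragment and every reference."""
    return [[distance(f, r) for r in references] for f in fragments]

def query(fragments, index, references, q, d_q, distance=euclidean):
    """Return (distance, fragment_id) pairs within d_q of q, closest first."""
    candidates = list(range(len(fragments)))
    # Fuzzy query: discard fragments ruled out by the triangle inequality.
    for j, r in enumerate(references):
        d = distance(q, r)
        d_inf, d_sup = d - d_q, d + d_q
        candidates = [i for i in candidates if d_inf <= index[i][j] <= d_sup]
    # Exact query: compute the true distance for the survivors only.
    hits = [(distance(fragments[i], q), i) for i in candidates]
    return sorted(h for h in hits if h[0] <= d_q)
```

Only the exact query calls the distance function on database fragments. The fuzzy query needs one distance computation per reference fragment plus lookups in the precomputed index.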
Figure 1. Left: pruning the fragment space for query distance d_q and query fragment q.
q is at distance d_1 (resp. d_2) from reference fragment r_1 (resp. r_2). Only fragments that are within both d_1 ± d_q of r_1 and d_2 ± d_q of r_2 undergo an RMSD calculation. Middle: 13-residue loops that can connect residue ALA 98 to GLY 110 in chain A of PDB 1MEL. The query loop is shown in red. Only its first and last three residues were used to rank the retrieved fragments. Right: backbone of PDB 1BKR covered with ten-residue fragments from non-homologous proteins retrieved with Fragger.
Fragger allows a database to be queried with a fragment and an RMSD threshold. Matching fragments are ranked by RMSD to the query. Fragger’s ranker tool computes the backbone RMSD of a single fragment versus many. Fragger can handle residue gaps or a selection of residues in the query, create a fragment database from a set of Protein Data Bank (PDB) files, work with all fragment lengths, and extract specific or randomly chosen fragments from a database.
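As an illustration of fragment extraction and gapped queries, the Python sketch below enumerates every contiguous window of a given length along a chain and then keeps only selected positions, such as the first and last three residues of a 13-residue loop. It is a simplified assumption-based example: residues are opaque list items, chain breaks are ignored, and the function names are hypothetical rather than part of Fragger.

```python
def sliding_fragments(residues, length):
    """Yield (start_index, fragment) for each contiguous window of `length` residues."""
    for start in range(len(residues) - length + 1):
        yield start, residues[start:start + length]

def anchor_mask(length, n_anchor):
    """Boolean mask selecting the first and last `n_anchor` positions of a fragment."""
    return [i < n_anchor or i >= length - n_anchor for i in range(length)]

def select_positions(fragment, mask):
    """Keep only the residues flagged True in `mask` (a gapped query)."""
    return [res for res, keep in zip(fragment, mask) if keep]
```

With `anchor_mask(13, 3)`, only six of the thirteen positions of each candidate loop would take part in the comparison with the query.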
Compared to existing fragment pickers, the specific functionalities required by users include:
- Outputting only the N best or N first found fragments matching a query (this can make a query terminate faster)
- Constraining the amino acid sequences allowed to match a query (for loop grafting; such filtering is applied after RMSD pruning of the fragment space)
- Reading and writing PDB fragments from/to a binary format (faster than reading/writing regular PDB files)
- Preventing a list of PDB codes from matching a query
- Automatically varying the RMSD threshold to the query until a given number of fragments is reached.
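Several of these options act as filters applied to fragments that have already passed the RMSD test. The Python sketch below shows one possible way to combine them. The tuple layout, the function names and the threshold schedule (start, step, limit) are assumptions made for illustration and do not describe Fragger's command-line interface or internals.

```python
import re

def filter_hits(hits, seq_pattern=None, excluded=(), n_best=None):
    """Filter (rmsd, pdb_code, sequence) tuples that already passed the RMSD test."""
    regex = re.compile(seq_pattern) if seq_pattern else None
    banned = {code.upper() for code in excluded}
    kept = [h for h in hits
            if h[1].upper() not in banned
            and (regex is None or regex.fullmatch(h[2]))]
    kept.sort(key=lambda h: h[0])  # rank by RMSD to the query
    return kept if n_best is None else kept[:n_best]

def widen_until(search, wanted, start=0.5, step=0.25, limit=5.0):
    """Raise the threshold until `search(threshold)` yields `wanted` hits or `limit` is hit."""
    threshold = start
    hits = search(threshold)
    while len(hits) < wanted and threshold + step <= limit:
        threshold += step
        hits = search(threshold)
    return threshold, hits
```

For example, the pattern `A.*G` restricts matches to fragments that start with alanine and end with glycine, which corresponds to the loop-grafting case shown in the middle panel of Figure 1.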
Operation
Users need to install OPAM and the pdbset command from CCP4 in order to use Fragger.
Details on how to install Fragger and usage examples are provided in the README file of the released software.
Results and discussion
Tests were performed on one core of a 2.4 GHz Intel Xeon workstation with 12 GB of RAM running Ubuntu Linux 12.04. The PDB dataset is composed of all protein structures determined by X-ray crystallography, without highly similar sequences (30% sequence identity cutoff), in order to create a challenging set of fragments to benchmark a protein design algorithm. It contains 13,554 PDB entries. They were retrieved from the Protein Data Bank website using the advanced search tab and ticking the "Retrieve only representatives at 30% sequence identity" box. Querying with a three-residue (resp. nine-residue) fragment takes at least 6.75 s (resp. 5.2 s).
Query times vary with the query fragment, reference fragments, indexed proteins and RMSD tolerance to the query. In general, the longer the required fragment length and the smaller the RMSD tolerance, the faster the query.
Reference fragments can be chosen randomly. Pruning of the search space is better if there are at least three reference fragments that are far from each other. Once an RMSD index has been computed for a randomly chosen fragment (f_i), taking the furthest fragment from it (f_j) and the median fragment (f_k) gives three acceptable reference fragments. For interested contributors, some good heuristics can be found in the literature but were not implemented in Fragger, such as Brin’s greedy algorithm [35].
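The selection of three reference fragments described above can be sketched in Python as follows. This is an illustration under assumptions, not Fragger's code: fragments are coordinate tuples, `math.dist` stands in for the RMSD, the median fragment is taken as the one at the middle rank of distances to f_i, and pick_references is a hypothetical name.

```python
import math
import random

def pick_references(fragments, distance=math.dist, rng=random):
    """Return indices (i, j, k): a random fragment, the furthest from it, and the median one."""
    i = rng.randrange(len(fragments))
    # This list is the distance index of f_i; it can be kept for later queries.
    dists = [distance(fragments[i], f) for f in fragments]
    order = sorted(range(len(fragments)), key=dists.__getitem__)
    j = order[-1]               # furthest fragment from f_i
    k = order[len(order) // 2]  # fragment at the median distance from f_i
    return i, j, k
```

The furthest and median fragments are read from a single index, so choosing them costs one pass of distance computations over the database.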
For one-time tasks, it is not necessary to create RMSD indices and actually query a database, as fragment extraction and RMSD computations are fast enough. For example, it takes only 15 s to generate all (41,200) fragments of 13 residues starting with alanine and ending with glycine (middle of Figure 1). Ranking them to the query takes 1.5 s. When working on PDB files, the ranker tool included with Fragger can compute 66,580 (resp. 23,784) RMSD/s on the backbone of three-residue (resp. nine-residue) fragments. These numbers become 304,149 (resp. 138,744) RMSD/s when working on Fragger’s binary-encoded PDBs. In the future, it might be possible to improve the performance of Fragger by incorporating a faster score than RMSD, such as BCscore [36].
Fragger can be useful for protein design, loop grafting and retrieval of candidates to rebuild low-confidence regions of protein models [6].
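The throughput figures above imply that the binary format speeds up the ranker by roughly 4.6-fold for three-residue fragments and 5.8-fold for nine-residue fragments. The short Python sketch below reproduces this arithmetic from the quoted numbers only. The dictionary and function names are illustrative.

```python
# Throughput figures quoted in the text, in RMSD computations per second,
# keyed by fragment length in residues.
PDB_FILES = {3: 66_580, 9: 23_784}
BINARY_FORMAT = {3: 304_149, 9: 138_744}

def speedup(length):
    """Ratio of binary-format throughput to plain PDB-file throughput."""
    return BINARY_FORMAT[length] / PDB_FILES[length]

def ranking_rate(n_fragments, seconds):
    """Average number of fragments ranked per second."""
    return n_fragments / seconds

if __name__ == "__main__":
    for length in (3, 9):
        print(f"{length}-residue fragments: {speedup(length):.2f}x faster")
    # 41,200 loop fragments ranked in 1.5 s, as reported for the Figure 1 example.
    print(f"{ranking_rate(41_200, 1.5):.0f} fragments ranked per second")
```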
Data availability
All data underlying the results are available as part of the article and no additional source data are required.
Software availability
Fragger can be downloaded from: https://github.com/UnixJunkie/fragger
Archived source code at the time of publication: https://zenodo.org/record/877320
Software license: LGPL.
Funding Statement
This work was supported by the “Initiative Research Unit” program from RIKEN, Japan, the Japan Society for the Promotion of Science (JSPS) and computing resources on the RIKEN Integrated Cluster of Clusters (RICC). FB is a JSPS international fellow.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 2; referees: 2 approved]
References
- 1. Leaver-Fay A, Tyka M, Lewis SM, et al.: Rosetta3: An Object-Oriented Software Suite for the Simulation and Design of Macromolecules. Methods Enzymol. Academic Press, 2011;487:545–574. doi:10.1016/B978-0-12-381270-4.00019-6
- 2. Xu D, Zhang Y: Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins. 2012;80(7):1715–1735. doi:10.1002/prot.24065
- 3. Simoncini D, Berenger F, Shrestha R, et al.: A Probabilistic Fragment-Based Protein Structure Prediction Algorithm. PLoS One. 2012;7(7):e38799. doi:10.1371/journal.pone.0038799
- 4. Simoncini D, Schiex T, Zhang KY: Balancing exploration and exploitation in population-based sampling improves fragment-based de novo protein structure prediction. Proteins. 2017;85(5):852–858. doi:10.1002/prot.25244
- 5. Rodriguez DD, Grosse C, Himmel S, et al.: Crystallographic ab initio protein structure solution below atomic resolution. Nat Methods. 2009;6(9):651–653. doi:10.1038/nmeth.1365
- 6. Shrestha R, Simoncini D, Zhang KY: Error-estimation-guided rebuilding of de novo models increases the success rate of ab initio phasing. Acta Crystallogr D Biol Crystallogr. 2012;68(Pt 11):1522–1534. doi:10.1107/S0907444912037961
- 7. Shrestha R, Zhang KY: A fragmentation and reassembly method for ab initio phasing. Acta Crystallogr D Biol Crystallogr. 2015;71(Pt 2):304–312. doi:10.1107/S1399004714025449
- 8. Adams PD, Afonine PV, Bunkóczi G, et al.: PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr D Biol Crystallogr. 2010;66(Pt 2):213–221. doi:10.1107/S0907444909052925
- 9. Zhang J, Liang Y, Zhang Y, et al.: Atomic-Level Protein Structure Refinement Using Fragment-Guided Molecular Dynamics Conformation Sampling. Structure. 2011;19(12):1784–1795. doi:10.1016/j.str.2011.09.022
- 10. Lee J, Lee D, Park H, et al.: Protein loop modeling by using fragment assembly and analytical loop closure. Proteins. 2010;78(16):3428–3436. doi:10.1002/prot.22849
- 11. Shehu A, Clementi C, Kavraki LE, et al.: Modeling protein conformational ensembles: from missing loops to equilibrium fluctuations. Proteins. 2006;65(1):164–179. doi:10.1002/prot.21060
- 12. Claessens M, Van Cutsem E, Lasters I, et al.: Modelling the polypeptide backbone with ‘spare parts’ from known protein structures. Protein Eng. 1989;2(5):335–345. doi:10.1093/protein/2.5.335
- 13. Tsai HH, Tsai CJ, Ma B, et al.: In silico protein design by combinatorial assembly of protein building blocks. Protein Sci. 2004;13(10):2753–2765. doi:10.1110/ps.04774004
- 14. Cao Y, Jiang T, Girke T: Accelerated similarity searching and clustering of large compound sets by geometric embedding and locality sensitive hashing. Bioinformatics. 2010;26(7):953–959. doi:10.1093/bioinformatics/btq067
- 15. Agrafiotis DK, Lobanov VS: An efficient implementation of distance-based diversity measures based on k-d trees. J Chem Inf Comput Sci. 1999;39(1):51–58. doi:10.1021/ci980100c
- 16. Xu H, Agrafiotis DK: Nearest neighbor search in general metric spaces using a tree data structure with a simple heuristic. J Chem Inf Comput Sci. 2003;43(6):1933–1941. doi:10.1021/ci034150f
- 17. Swamidass SJ, Baldi P: Bounds and algorithms for fast exact searches of chemical fingerprints in linear and sublinear time. J Chem Inf Model. 2007;47(2):302–317. doi:10.1021/ci600358f
- 18. Baldi P, Hirschberg DS, Nasr RJ: Speeding up chemical database searches using a proximity filter based on the logical exclusive or. J Chem Inf Model. 2008;48(7):1367–1378. doi:10.1021/ci800076s
- 19. Gront D, Kulp DW, Vernon RM, et al.: Generalized fragment picking in Rosetta: design, protocols and applications. PLoS One. 2011;6(8):e23294. doi:10.1371/journal.pone.0023294
- 20. Collier JH, Lesk AM, Garcia de la Banda M, et al.: Super: a web server to rapidly screen superposable oligopeptide fragments from the protein data bank. Nucleic Acids Res. 2012;40(Web Server issue):W334–W339. doi:10.1093/nar/gks436
- 21. Guyon F, Martz F, Vavrusa M, et al.: BCSearch: fast structural fragment mining over large collections of protein structures. Nucleic Acids Res. 2015;43(W1):W378–W382. doi:10.1093/nar/gkv492
- 22. Santos KB, Trevizani R, Custodio FL, et al.: Profrager web server: Fragment libraries generation for protein structure prediction. In Proceedings of the International Conference on Bioinformatics & Computational Biology (BIOCOMP). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 2015;38.
- 23. Kim DE, Chivian D, Baker D: Protein structure prediction and analysis using the Robetta server. Nucleic Acids Res. 2004;32(Web Server issue):W526–W531. doi:10.1093/nar/gkh468
- 24. Samson AO, Levitt M: Protein segment finder: an online search engine for segment motifs in the pdb. Nucleic Acids Res. 2009;37(Database issue):D224–D228. doi:10.1093/nar/gkn833
- 25. Debret G, Martel A, Cuniasse P: RASMOT-3D PRO: a 3D motif search webserver. Nucleic Acids Res. 2009;37(Web Server issue):W459–W464. doi:10.1093/nar/gkp304
- 26. Vanhee P, Verschueren E, Baeten L, et al.: BriX: a database of protein building blocks for structural analysis, modeling and design. Nucleic Acids Res. 2011;39(Database issue):D435–D442. doi:10.1093/nar/gkq972
- 27. Nagarajan R, Siva Balan S, Sabarinathan R, et al.: Fragment Finder 2.0: a computing server to identify structurally similar fragments. J Appl Cryst. 2012;45(2):332–334. doi:10.1107/S0021889812001501
- 28. Budowski-Tal I, Nov Y, Kolodny R: FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately. Proc Natl Acad Sci U S A. 2010;107(8):3481–3486. doi:10.1073/pnas.0914097107
- 29. Tramontano A, Lesk AM: Common features of the conformations of antigen-binding loops in immunoglobulins and application to modeling loop conformations. Proteins. 1992;13(3):231–245. doi:10.1002/prot.340130306
- 30. Steipe B: A revised proof of the metric properties of optimally superimposed vector sets. Acta Crystallogr A. 2002;58(Pt 5):506. doi:10.1107/S0108767302011637
- 31. Theobald DL: Rapid calculation of RMSDs using a quaternion-based characteristic polynomial. Acta Crystallogr A. 2005;61(Pt 4):478–480. doi:10.1107/S0108767305015266
- 32. Leroy X, Doligez D, Frisch A, et al.: The OCaml system release 4.00 Documentation and user’s manual. INRIA, France, 2012.
- 33. Berenger F, Shrestha R, Zhou Y, et al.: Durandal: fast exact clustering of protein decoys. J Comput Chem. 2012;33(4):471–474. doi:10.1002/jcc.21988
- 34. Danelutto M, Di Cosmo R: A "Minimal Disruption" Skeleton Experiment: Seamless Map and Reduce Embedding in OCaml. Procedia Comput Sci. 2012;9:1837–1846. doi:10.1016/j.procs.2012.04.202
- 35. Brin S: Near neighbor search in large metric spaces. In Proceedings of the 21th International Conference on Very Large Data Bases. VLDB ’95, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. 1995;574–584.
- 36. Guyon F, Tufféry P: Fast protein fragment similarity scoring using a Binet-Cauchy kernel. Bioinformatics. 2014;30(6):784–791. doi:10.1093/bioinformatics/btt618