Nucleic Acids Research. 2015 May 14;43(Web Server issue):W378–W382. doi: 10.1093/nar/gkv492

BCSearch: fast structural fragment mining over large collections of protein structures

Frédéric Guyon 1, François Martz 1, Marek Vavrusa 1, Jérôme Bécot 1, Julien Rey 1, Pierre Tufféry 1,*
PMCID: PMC4489267  PMID: 25977292

Abstract

Resources to mine the large number of protein structures available today are necessary to better understand how amino acid variations are compatible with conformation preservation, and to assist protein design, engineering and, further, the development of biologic therapeutic compounds. BCSearch is a versatile service to efficiently mine large collections of protein structures. It relies on a new approach based on a Binet–Cauchy kernel that is more discriminative than the widely used root mean square deviation criterion. Its statistics are independent of fragment size, even for short fragments, and it is fast. The systematic mining of large collections of structures such as the complete SCOPe protein structural classification or comprehensive subsets of the Protein Data Bank can be performed in a few minutes. Based on this new score, we propose four innovative applications. BCFragSearch and BCMirrorSearch search for fragments similar and anti-similar to a query, respectively, and return information on the diversity of the sequences of the hits. BCLoopSearch identifies candidate fragments of fixed size matching the flanks of a gapped structure. BCSpecificitySearch analyzes a complete protein structure and returns information about sites having few similar fragments. BCSearch is available at http://bioserv.rpbs.univ-paris-diderot.fr/services/BCSearch.

INTRODUCTION

The large number of protein structures available, thanks to the efforts of structural genomics, now constitutes a valuable resource to analyze in depth the impact of amino acid sequence variation on protein conformation. Efficient and large-scale mining of the structures available offers promising perspectives to assist protein engineering and design. It also meets the increasing interest of the pharmaceutical industry in developing new biologic entities including large peptides, recombinant proteins, antibodies, immunoconjugates or synthetic vaccines, to cite some (1). Indeed, beyond the analysis and the classification of complete protein domains, the focus is progressively moving to a more local level of structure analysis. The present number of entries of the Protein Data Bank (PDB) (2), over 100 000 protein structures, corresponds to several tens of millions of short protein fragments (10–20 amino acids). Designing efficient and fast services that analyze structural similarities with statistical significance at this scale is a challenge.

Whereas numerous approaches have been proposed to classify or align complete protein structures (3,4), fewer methods have been developed for a more local level. Several online facilities have been proposed, which focus on contiguous or linear fragments (5–10). Superimposé (5) combines several search algorithms such as TM-align (11) or CE (12) to search for fragments similar to a query. The FragFinder (6) search engine is based on the comparison of the main chain backbone conformational angles (ϕ and ψ). SA-Mot (7) is based on the encoding of structure as strings of a structural alphabet to search for over-represented conformations among collections of proteins with similar functions. Finally, TopMatch (8), which can generate several alignments between a query and a target protein structure, has recently been updated as TopSearch (9). For the comparison of non-sequential motifs, much more complex and slower algorithms have been proposed. These include Rasmot-3D (13), SPRITE and ASSAM (14) or ProSMoS (15), to cite some.

Another important application of structural fragment mining is knowledge-based loop modeling. It implies the search for fragments matching geometric boundary conditions in subsets of the PDB or SCOPe (16). Some online services are currently available for that purpose. ArchPRED (17) uses the secondary structures flanking the missing loop, their relative orientation and the number of missing residues to identify candidate loop conformations. SuperLooper (18) mines the Loops In Proteins (LIP) database (19), a comprehensive loop database containing all protein segments up to 15 residues from the PDB, to identify fragments matching geometrical criteria between the last two atoms of the main chain of one flank and the first two of the other. FREAD (20) searches for candidate fragments matching conditions on distances between the Cα atoms of the flanks. The method developed by Peng and Yang (21) does not seem to be reachable any longer. The recent FALC-Loop (22) uses a de novo modeling approach combining fragments for loop generation and thus is not stricto sensu based on similarity search.

We have recently introduced a new score based on the Binet–Cauchy kernel, the BC-score. It is a geometric correlation score, with a maximum value of 1 indicating perfect similarity, values close to 0 being associated with unrelated conformations and a minimal value of −1 corresponding to mirror conformations. This score addresses two major drawbacks of the widely used root mean square deviation (RMSD). Firstly, it shows better performance in the discrimination of medium-range RMSD values, which leads to the identification of more consistent similarities. Secondly, its statistical significance is independent of fragment size, even for short fragments. Due to the simplicity of the BC-score formulation, BCSearch provides one of the fastest services for large-scale mining of protein structures. It is able to perform several tens of thousands of comparisons per second, which makes it possible to mine several thousands of structures per second.

In BCSearch, we take advantage of the speed of computation and the accuracy of the BC-score (23) to propose, in a unified framework, new large-scale mining facilities, some of which were previously out of reach. The first application searches for similar fragments within large collections of protein structures, possibly the whole SCOPe or large subsets of the PDB. Taking advantage of the properties of the BC-score, BCSearch is also able to search for mirror conformations. We have previously illustrated that it is, for instance, able to identify left-handed helices that, even if rare, are important for the stability of the protein, for ligand binding or as part of the active site (24). Another application, called BCLoopSearch, is an enhancement of the simple BC-score that makes it possible to mine for two disjoint fragments separated by a given number of residues. Finally, it also becomes possible to quantify, for a complete protein structure, the fragments that are specific to the structure, i.e. fragments rarely found in a given collection of structures.

MATERIALS AND METHODS

Binet–Cauchy score definition and properties

The Binet–Cauchy score as a measure of conformation similarity has already been described in (23). We only recall here the general concepts.

We only consider the coordinates of the α-carbon atoms of the protein fragments. The coordinates of the N-residue fragments to be compared are stored in N × 3 matrices X and Y. The coordinate matrices are centered at the origin. We use the structural score derived from the Binet–Cauchy kernel (23). This score, which we named the Binet–Cauchy score (BC-score), is the cosine between the Grassmann vectors of X and Y:

$$\mathrm{BC}(X, Y) = \frac{\det(X^{T} Y)}{\sqrt{\det(X^{T} X)\,\det(Y^{T} Y)}} \qquad (1)$$

The BC-score is a positive kernel and is rotation independent. It corresponds to a correlation coefficient between the Grassmann representations of X and Y, and thus varies from −1 to 1.
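The definition above can be illustrated with a minimal sketch. It assumes the closed form det(XᵀY) / sqrt(det(XᵀX) det(YᵀY)), which follows from the Cauchy–Binet formula applied to the cosine between Grassmann vectors; the function names (`center`, `cross_matrix`, `det3`, `bc_score`) are illustrative and are not taken from the BCSearch implementation.

```python
import math


def center(coords):
    """Translate (x, y, z) points so that their centroid sits at the origin."""
    n = len(coords)
    cx, cy, cz = (sum(p[k] for p in coords) / n for k in range(3))
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]


def cross_matrix(a, b):
    """Return the 3 x 3 matrix A^T B for two N x 3 coordinate lists."""
    return [[sum(p[i] * q[j] for p, q in zip(a, b)) for j in range(3)]
            for i in range(3)]


def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def bc_score(x, y):
    """BC-score of two equal-length lists of C-alpha coordinates.

    Assumed closed form: det(X^T Y) / sqrt(det(X^T X) * det(Y^T Y)),
    computed after centering both fragments at the origin.
    """
    if len(x) != len(y):
        raise ValueError("fragments must have the same number of residues")
    x, y = center(x), center(y)
    norm = det3(cross_matrix(x, x)) * det3(cross_matrix(y, y))
    if norm <= 0.0:
        # Collinear or planar fragments have a null Grassmann vector.
        raise ValueError("degenerate fragment: no 3D extent")
    return det3(cross_matrix(x, y)) / math.sqrt(norm)
```

Under these assumptions, a rigidly rotated and translated copy of a fragment scores 1 up to rounding, and its mirror image scores −1, in line with the range stated above.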

Importantly, the BC-score is a flexible score. It is maximal (equal to 1) for identical structures. However, it is also possible that BC(X, Y) = 1 for two different fragment conformations with RMSD(X, Y) > 0. In order to control the admissible amount of flexibility, an additional score, called the rigidity score, is used. Denoting by Xi (resp. Yi) the coordinates of the ith Cα of the fragment X (resp. Y) of length N, the rigidity score between the two structures is

[Equations (2) and (3), defining the rigidity score: formula images M5.gif and M6.gif not recoverable from the source]

It corresponds to a measure of the maximum variation of intra-distances between the residues and the geometric center, and intra-distances between the terminal α-carbons.

The BC-score and the RMSD are strongly anti-correlated for very low RMSD values, and both provide comparable measures between close structures. However, since the RMSD averages the distances between atoms, medium-range and even low-range RMSD values do not imply significant conformational similarity. In contrast, the BC-score characterizes global shape similarity more precisely. Combined with the distortion rate, it allows better discrimination among hits in the medium RMSD range (23).

Therefore, contrary to the RMSD, the BC-score can be efficiently used to search for fragments in structure databases with a certain amount of flexibility while discarding spurious fragments which cannot be structurally aligned with the query.

IMPLEMENTATION

Data sets

BCSearch can mine large collections of protein structures. Presently, two collections of structures are available. The first corresponds to the complete collection of structural domains of SCOPe version 2.04 (over 190 000 domains in total) (16), for which it is possible to specify any level of the hierarchy using the class.fold.superfamily.family scheme (e.g. g.3 for Toxic hairpinKnottins, g.3.3 for cyclotides). In order to allow analyses on a subset of high-resolution structures only, a second collection, denoted PDB, corresponds to a subset of the PDB consisting of structures solved by X-ray diffraction at resolutions better than 1.6, 1.8, 2.0, 2.2 or 2.5 Å, and with an R-value less than 0.25 or 1.0, as defined by the PISCES server (25).

BCSearch services

Based on the BC-score, BCSearch comes as a collection of services that address different questions, as illustrated in Figure 1.

Figure 1.

BCSearch services.

BCFragSearch

BCFragSearch performs an exhaustive search, in a collection of structures, for fragments similar to a query. Its aim is to return information about the amino acid sequences observed in similar fragments.

BCMirrorSearch

BCMirrorSearch is similar to BCFragSearch, except that it performs an exhaustive search, in a collection of structures, for fragments having conformations anti-similar to a query. Its aim is to return information about the existence of mirror conformations and their amino acid sequences.
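Both searches can be pictured as a sliding-window scan of each chain of the collection. The following is a minimal sketch under stated assumptions, not the BCSearch implementation: it assumes the determinant form of the BC-score, det(XᵀY) / sqrt(det(XᵀX) det(YᵀY)), ignores the rigidity filter, and uses hypothetical names (`bc_score`, `scan_chain`). A similar search keeps windows with a score at or above the cutoff, and a mirror search keeps windows with a score at or below minus the cutoff.

```python
import math


def _det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _gram(a, b):
    """3 x 3 matrix A^T B for two N x 3 coordinate lists."""
    return [[sum(p[i] * q[j] for p, q in zip(a, b)) for j in range(3)]
            for i in range(3)]


def bc_score(x, y):
    """BC-score of two equal-length C-alpha fragments (assumed determinant form)."""
    def centered(pts):
        n = len(pts)
        mean = [sum(p[k] for p in pts) / n for k in range(3)]
        return [tuple(p[k] - mean[k] for k in range(3)) for p in pts]

    x, y = centered(x), centered(y)
    norm = _det3(_gram(x, x)) * _det3(_gram(y, y))
    if norm <= 0.0:
        raise ValueError("degenerate fragment")
    return _det3(_gram(x, y)) / math.sqrt(norm)


def scan_chain(query, chain, cutoff=0.95, mirror=False):
    """Return (start_index, score) for every window of the chain that passes.

    mirror=False keeps windows with score >= cutoff (similar fragments);
    mirror=True keeps windows with score <= -cutoff (anti-similar fragments).
    """
    n = len(query)
    hits = []
    for start in range(len(chain) - n + 1):
        try:
            score = bc_score(query, chain[start:start + n])
        except ValueError:
            continue  # skip windows with no 3D extent
        keep = score <= -cutoff if mirror else score >= cutoff
        if keep:
            hits.append((start, score))
    # Best hits first: highest score for similar search, lowest for mirror search.
    hits.sort(key=lambda hit: hit[1], reverse=not mirror)
    return hits
```

Running `scan_chain` over every chain of a collection and collecting the residue sequences of the returned windows would give the raw material for the sequence information that both services report.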

BCLoopSearch

BCLoopSearch corresponds to BCFragSearch applied only to the flanking regions of a fragment of interest. It can thus be considered as a search for candidate conformations of a fragment whose conformation is unknown but whose flanks are known, similarly to the problem of loop modeling. BCLoopSearch uses flanks of four residues. Note that the search is performed on a geometrical basis only, and no control over sequence identity is applied during the search. In order to avoid returning fragments that clash with the template structure, a rough check for steric clashes is performed. Candidates for which at least one inter-α-carbon distance to a residue at least three positions away in the amino acid sequence is less than a cutoff of 3 Å are discarded.
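The steric filter described above can be sketched as follows. This is an illustration only: the function name `has_clash`, the dictionary layout (residue index mapped to Cα coordinates) and the choice to test candidate residues against a separate set of template residues are assumptions; only the 3 Å cutoff and the three-position sequence separation come from the text.

```python
import math


def has_clash(loop, template, cutoff=3.0, min_separation=3):
    """Return True if a candidate loop clashes with the template.

    loop and template map residue index -> (x, y, z) C-alpha coordinates.
    A clash is any loop/template pair that is at least min_separation
    positions apart in sequence and closer than cutoff (in angstroms).
    """
    for i, p in loop.items():
        for j, q in template.items():
            if abs(i - j) >= min_separation and math.dist(p, q) < cutoff:
                return True
    return False


def prune(candidates, template):
    """Keep only the candidate loops that pass the rough clash check."""
    return [loop for loop in candidates if not has_clash(loop, template)]
```

Sequence neighbors are skipped because consecutive α-carbons sit roughly 3.8 Å apart and close contacts between residues one or two positions away are expected rather than suspicious.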

BCSpecificitySearch

This service analyzes a complete protein structure and returns its ‘specific’ parts, i.e. the regions associated with fragment conformations rarely found in a collection of reference structures. This collection can be defined as any SCOPe subset corresponding to a valid class/fold/superfamily/family level. The search bank can also be defined as the complete SCOPe, excluding some specified SCOPe subset. Hence, this service makes it possible to retrieve fragments specific to a protein structure at a given level of structural similarity. It also makes it possible to search for fragments that are common to a given SCOPe level and specific to this level, i.e. not present at other levels.

We evaluate fragment specificity with the following score, referred to as the specificity score in the following:

$$S = 1 - \frac{N_{\mathrm{hits}}}{N_{\mathrm{total}}}$$

where Nhits corresponds to the number of proteins where a similar fragment is found and Ntotal is the total number of proteins in the search bank.
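As a worked illustration, the sketch below assumes that the specificity score is 1 − Nhits/Ntotal. This form is inferred from the statement in the applications section that scores greater than 0.995 correspond to less than 0.5% of matches; the function name `specificity_score` is hypothetical.

```python
def specificity_score(n_hits, n_total):
    """Assumed form of the specificity score: 1 - n_hits / n_total.

    n_hits: number of proteins in the search bank with a similar fragment.
    n_total: total number of proteins in the search bank.
    """
    if n_total <= 0:
        raise ValueError("the search bank must contain at least one protein")
    if not 0 <= n_hits <= n_total:
        raise ValueError("n_hits must lie between 0 and n_total")
    return 1.0 - n_hits / n_total
```

Under this assumption, a fragment found in 5 proteins out of a bank of 1000 scores 0.995, while a fragment found in every protein of the bank scores 0.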

Execution times

Typical run times against the full unfiltered SCOPe compendium (over 190 000 domains and 20 million elementary comparisons) are below 1 min for BCFragSearch, BCMirrorSearch and BCLoopSearch, and range from a few seconds to 5 min for BCSpecificitySearch, depending on the size of the query and the collection of structures mined.

Input

As input, BCSearch requires a structure and, in some cases, some sequence information. Structures can be uploaded as PDB-formatted files or retrieved from repositories given a PDB or a SCOPe identifier.

For fragment search and mirror search, a sequence specifying the part of the query structure to use can be input.

For loop search, the complete sequence must be provided. Missing parts of the protein are automatically detected by comparing the sequence to that of the gapped structure.

The collection of structures to mine can correspond to subsets of either the SCOP databank or the PDB at different resolutions. The PDB and SCOP collections can be filtered depending on sequence identity—90, 70, 50 or 30%. Cutoff values can also be set for the BC-score and the rigidity score.

Output

The results pages of all BCSearch services except BCSpecificitySearch return all the hits in a csv file, together with an interactive table ordered by BC-score and truncated to the best 1000 hits to preserve interactivity. For each match, the data reported are: the name of the query, the name of the hit (PDB or SCOP ID), the starting and ending residue numbers of the query and of the match, the BC-score, the rigidity value, the P-value, the RMSD and the sequence of the match. The P-value and the RMSD are calculated between the query and each hit for BCFragSearch and BCMirrorSearch, and between the flanking regions of the query and of the match for BCLoopSearch. When relevant, a sequence logo is also provided. It depicts the sequence variability among the hits. A visualization panel is available thanks to PV (JavaScript Protein Viewer; http://biasmv.github.io/pv/). For BCSpecificitySearch, a dynamic color gradient allows one to interactively explore the structure at various specificity score values.
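The sequence variability that a logo depicts can be summarized from the hit sequences alone. The sketch below is a generic illustration, not the procedure used by the server: it computes per-position residue frequencies and the usual information content, log2(20) minus the Shannon entropy, without any small-sample correction. The function name `column_profile` is hypothetical.

```python
import math
from collections import Counter


def column_profile(sequences):
    """Per-position residue frequencies and information content (bits).

    sequences: equal-length strings of one-letter amino acid codes, one per hit.
    Returns a list of (frequencies, information) tuples, one per position.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all hit sequences must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        total = sum(counts.values())
        freqs = {aa: n / total for aa, n in counts.items()}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        # A fully conserved position reaches the maximum of log2(20) bits.
        profile.append((freqs, math.log2(20) - entropy))
    return profile
```

Positions where all hits share the same residue reach the maximum information content, which is how a conserved motif stands out among fragments retrieved on purely geometrical grounds.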

APPLICATIONS

Fragment mining

Figure 2 illustrates a BCFragSearch run searching for fragments similar to the fragment Cys15–Gly37 of the human zinc finger protein (PDB: 2EMJ), which belongs to the superfamily of the beta–beta–alpha zinc fingers (SCOPe g.37.1). The search was performed against the SCOPe collection at 100% sequence identity, using the default cutoff values of 0.95 and 1.0 for the BC-score and rigidity, respectively. Twenty-eight hits were identified among the 280 proteins of the superfamily. The logo representation of the corresponding sequences clearly shows the C2H2 motif characteristic of zinc binding. Importantly, similar fragments in the remaining members of the superfamily do have indels in the fragment, highlighting the stringency of the BC-score, which detects such events.

Figure 2.

Top: The query structure is the fragment from Cys15 to Gly37 of the human zinc finger protein (cyan) (PDB: 2EMJ). The twenty-eight similar fragments identified by BCFragSearch are depicted in green. Bottom: The corresponding sequence logo shows a conserved C2H2 motif involved in the binding of zinc.

Candidate loop search

To illustrate the BCLoopSearch service, we start from the known complex between glycoprotein Ib alpha and the von Willebrand factor (PDB: 1M10). Residues 226 to 242 of the unbound von Willebrand factor binding domain of glycoprotein Ib alpha (PDB: 1M0Z) undergo a large conformational change of 5.05 Å upon binding to the von Willebrand factor (see Figure 3A). Starting from the unbound conformation, we removed residues 226–242 of the moving loop. We then performed a BCLoopSearch against the complete SCOPe collection at 100% sequence identity, using BC-score and rigidity cutoff values of 0.95 and 1.0, respectively. We obtained 20 different conformations. Besides the bound conformation itself, the 19 other conformations cover a range of RMSD from 0.9 to 17.1 Å (see Figure 3B). The closest conformation to the bound conformation, excluding itself, deviates by 0.9 Å and comes from a ternary complex between von Willebrand factor, glycoprotein Ib alpha and botrocetin (PDB: 1U0N). Thus, BCLoopSearch appears able to return valuable collections of candidate loops. We recall, however, that BCLoopSearch performs only a rough pruning of the candidate loops and that further processing to score them should be considered.

Figure 3.

(A) The unbound structure of the von Willebrand factor binding domain of glycoprotein Ib alpha is depicted in green (PDB: 1M0Z), and is superimposed onto its bound conformation, taken from the complex with the von Willebrand factor, depicted in blue (PDB: 1M10). The red regions correspond to the four-residue flanking fragments used by BCLoopSearch. The 226–242 loop (the loop between the two red flanks) undergoes a conformational change of 5.05 Å. The conformation of this region changes from an unorganized segment to an antiparallel beta sheet. (B) All matches found by BCLoopSearch superimposed on the query. These matches have been manually clustered: in cyan those close to the bound structure, in light green those close to the unbound structure and in pink the outlier.

BCSpecificitySearch

The porcine beta trypsin (PDB: 1QQU) is associated with the SCOPe fold b.47. Using BCSpecificitySearch, it is possible to ask which fragments of the structure are specific to the fold, by searching for occurrences of fragments in the b class while discarding the protein domains of the b.47 fold (5352 protein domains). We used a fragment size of 9, against the corresponding SCOPe subset filtered at 90% sequence identity, with BC-score and rigidity cutoff values of 0.95 and 1, respectively. Figure 4, left, shows the sites associated with specificity scores greater than 0.995, i.e. associated with less than 0.5% of matches. It is striking that these sites define a patch on the structure. Interestingly, Figure 4, right, shows that this patch corresponds to the patch interacting with the soybean trypsin inhibitor (PDB: 1AVX).

Figure 4.

Left: Specific fragments of the porcine beta trypsin (PDB: 1QQU) as identified by BCSpecificitySearch. The red parts correspond to fragments specific to this conformation and the blue parts to non-specific ones. Right: Structure of the complex between porcine pancreatic trypsin (green) and soybean trypsin inhibitor (cyan) (PDB: 1AVX).

CONCLUSION

BCSearch services offer fast and versatile means to mine large collections of structures and extract information about local sequence–structure relationships. It is possible to search for fragments similar to a query, to search for fragments in a mirror conformation or to identify candidate fragments of fixed size matching the flanks of a gapped structure. BCSearch also provides an innovative means to analyze the specificity of local conformations in a complete protein structure by identifying sites associated with infrequent conformations. Due to the properties of the BC-score, the parameters driving the search are few and are independent of fragment size. Using the same framework, the panel of services can still be enlarged. In particular, all services presently perform ungapped searches. As we have shown in one example, accepting a limited number of gaps could help to extend the interest of BCSearch to motif identification from structure, a point left for further development. BCSearch runs typically take from a few seconds up to a few minutes only, depending on the collection of structures to mine, which we hope makes it a useful tool for biologists to analyze, engineer or design proteins.

Acknowledgments

The French IA bioinformatics BipBip; INSERM UMR-S 973; the Ressource Parisienne en Bioinformatique Structurale. Funding for open access charge: INSERM UMR-S 973.

Conflict of interest statement. None declared.

REFERENCES

1. Alvim-Gaston M., Grese T., Mahoui A., Palkowitz A.D., Pineiro-Nunez M., Watson I. Open Innovation Drug Discovery (OIDD): a potential path to novel therapeutic chemical space. Curr. Top. Med. Chem. 2014;14:294–303. doi: 10.2174/1568026613666131127125858.
2. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. The protein data bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
3. Kolodny R., Koehl P., Levitt M. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J. Mol. Biol. 2005;346:1173–1188. doi: 10.1016/j.jmb.2004.12.032.
4. Hasegawa H., Holm L. Advances and pitfalls of protein structural alignment. Curr. Opin. Struct. Biol. 2009;19:341–348. doi: 10.1016/j.sbi.2009.04.003.
5. Bauer R.A., Bourne P.E., Formella A., Frömmel C., Gille C., Goede A., Guerler A., Hoppe A., Knapp E.W., Pöschel T., et al. Superimposé: a 3D structural superposition server. Nucleic Acids Res. 2008;36:47–54. doi: 10.1093/nar/gkn285.
6. Nagarajan R., Siva Balan S., Sabarinathan R., Kirti Vaishnavi M., Sekar K. Fragment Finder 2.0: a computing server to identify structurally similar fragments. J. Appl. Crystallogr. 2012;45:332–334.
7. Regad L., Saladin A., Maupetit J., Geneix C., Camproux A.C. SA-Mot: a web server for the identification of motifs of interest extracted from protein loops. Nucleic Acids Res. 2011;39:203–209. doi: 10.1093/nar/gkr410.
8. Sippl M.J., Wiederstein M. Detection of spatial correlations in protein structures and molecular complexes. Structure. 2012;20:718–728. doi: 10.1016/j.str.2012.01.024.
9. Wiederstein M., Gruber M., Frank K., Melo F., Sippl M.J. Structure-based characterization of multiprotein complexes. Structure. 2014;22:1063–1070. doi: 10.1016/j.str.2014.05.005.
10. Samson A.O., Levitt M. Protein segment finder: an online search engine for segment motifs in the PDB. Nucleic Acids Res. 2009;37:D224–D228. doi: 10.1093/nar/gkn833.
11. Zhang Y., Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33:2302–2309. doi: 10.1093/nar/gki524.
12. Shindyalov I.N., Bourne P.E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998;11:739–747. doi: 10.1093/protein/11.9.739.
13. Debret G., Martel A., Cuniasse P. RASMOT-3D PRO: a 3D motif search webserver. Nucleic Acids Res. 2009;37:459–464. doi: 10.1093/nar/gkp304.
14. Nadzirin N., Gardiner E.J., Willett P., Artymiuk P.J., Firdaus-Raih M. SPRITE and ASSAM: web servers for side chain 3D-motif searching in protein structures. Nucleic Acids Res. 2012;40:W380–W386. doi: 10.1093/nar/gks401.
15. Shi S., Chitturi B., Grishin N.V. ProSMoS server: a pattern-based search using interaction matrix representation of protein structures. Nucleic Acids Res. 2009;37:W526–W531. doi: 10.1093/nar/gkp316.
16. Fox N.K., Brenner S.E., Chandonia J.M. SCOPe: Structural Classification of Proteins extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res. 2014;42:D304–D309. doi: 10.1093/nar/gkt1240.
17. Fernandez-Fuentes N., Zhai J., Fiser A. ArchPRED: a template based loop structure prediction server. Nucleic Acids Res. 2006;34:W173–W176. doi: 10.1093/nar/gkl113.
18. Hildebrand P.W., Goede A., Bauer R.A., Gruening B., Ismer J., Michalsky E., Preissner R. SuperLooper: a prediction server for the modeling of loops in globular and membrane proteins. Nucleic Acids Res. 2009;37:571–574. doi: 10.1093/nar/gkp338.
19. Michalsky E., Goede A., Preissner R. Loops In Proteins (LIP): a comprehensive loop database for homology modelling. Protein Eng. 2003;16:979–985. doi: 10.1093/protein/gzg119.
20. Choi Y., Deane C.M. FREAD revisited: accurate loop structure prediction using a database search algorithm. Proteins. 2010;78:1431–1440. doi: 10.1002/prot.22658.
21. Peng H.P., Yang A.S. Modeling protein loops with knowledge-based prediction of sequence-structure alignment. Bioinformatics. 2007;23:2836–2842. doi: 10.1093/bioinformatics/btm456.
22. Ko J., Lee D., Park H., Coutsias E.A., Lee J., Seok C. The FALC-Loop web server for protein loop modeling. Nucleic Acids Res. 2011;39:210–214. doi: 10.1093/nar/gkr352.
23. Guyon F., Tufféry P. Fast protein fragment similarity scoring using a Binet-Cauchy kernel. Bioinformatics. 2014;30:784–791. doi: 10.1093/bioinformatics/btt618.
24. Novotny M., Kleywegt G.J., Wang G., Dunbrack R.L. A survey of left-handed helices in protein structures. J. Mol. Biol. 2005;347:231–241. doi: 10.1016/j.jmb.2005.01.037.
25. Wang G., Dunbrack R.L. PISCES: a protein sequence culling server. Bioinformatics. 2003;19:1589–1591. doi: 10.1093/bioinformatics/btg224.

