Data in Brief. 2020 Mar 5;29:105383. doi: 10.1016/j.dib.2020.105383

Data set of intrinsically disordered proteins analysed at a local protein conformation level

Akhila Melarkode Vattekatte a,b,c,d,1, Tarun Jairaj Narwani a,b,d, Aline Floch b,e,f,g, Mirjana Maljković h, Soubika Bisoo a,b,d, Nicolas K Shinada a,b,d,i,j, Agata Kranjc a,b,d, Jean-Christophe Gelly a,b,d,k, Narayanaswamy Srinivasan l, Nenad Mitić h, Alexandre G de Brevern a,b,d,k
PMCID: PMC7078294  PMID: 32195305

Abstract

Intrinsically Disordered Proteins (IDPs) have become a hot topic since their characterisation in the 1990s. The data presented in this article are related to our research entitled “A structural entropy index to analyse local conformations in Intrinsically Disordered Proteins” published in Journal of Structural Biology [1]. In this study, we quantified, for the first time, the continuum from rigidity to flexibility and finally to disorder. Non-disordered regions were also highlighted in the ensembles of disordered proteins. This work was done using the Protein Ensemble Database (PED), a database that collects series of protein structures considered as IDPs. The data set consists of a collection of cleaned protein files in classical pdb format that can be readily used as input for most automatic analysis software. The accompanying data include the encoding of all structural information in terms of a structural alphabet, namely Protein Blocks (PBs). An entropy index derived from PBs, which captures the continuum from protein rigidity to flexibility to disorder, is included, together with secondary structure assignments, solvent accessibility and sequence-based disorder predictions. The data may be used for further structural bioinformatics studies of IDPs. They can also be used as a benchmark for evaluating disorder prediction methods.

Keywords: Protein disorder, PDB, Ensembles, Entropy, Local protein conformation, Structural alphabet


Specification Table 

Subject: Biochemistry.
Specific subject area: Structural bioinformatics, protein disorder.
Type of data: A collection of atom coordinates in the pdb format, tables, text files and Figures.
How data were acquired: A survey of the Protein Ensemble Database (PED).
Data format: Raw, analysed and filtered.
Parameters for data collection: A Protein Ensemble Database survey was performed in March 2019. The data set consists of the 25,473 protein structures of 60 ensembles in 24 entries stored in PED, in the Protein Data Bank (pdb) format. The atom coordinate files were cleaned and treated as described below and as such may be used for further automatic analysis.
Description of data collection: Every entry of PED was checked, because some have inconsistencies. Then, all cleaned files were used for Protein Blocks (PBs) assignment, the frequency of each PB was calculated, and local entropy was computed. All files are provided. In a similar way, DSSP was used to assign secondary structure and solvent accessibility for each residue. The dataset also collects disorder predictions generated by the DisoPred and PrDOS web servers. Flat text files are provided for simple use and some Figures for better visualisation.
Data source location: University of Paris, Paris, France.
Data accessibility: Data are given in the paper. They can also be downloaded from: http://www.dsimb.inserm.fr/~debrevern/RESEARCH/IDP-PB/.
Related research article: Akhila Melarkode Vattekatte, Tarun Jairaj Narwani, Aline Floch, Mirjana Maljković, Soubika Bisoo, Nicolas K. Shinada, Agata Kranjc, Jean-Christophe Gelly, Narayanaswamy Srinivasan, Nenad Mitić & Alexandre G. de Brevern, (2020) “A structural entropy index to analyse local conformations in Intrinsically Disordered Proteins”, Journal of Structural Biology, in press [1].

Value of the Data

  • Atomic coordinate files in pdb format are processed in a manner suitable for most analysis programs.

  • The PB assignment and entropy calculation allow defining the rigidity – flexibility – disordered state as done in [1] and are easy to use for further research.

  • The secondary structure assignment and solvent accessibility are provided, as they represent the basis for structural analyses.

  • Two types of disorder prediction methodologies are provided; all these data can be used as a benchmark for evaluating disorder prediction methods.

  • These data were used extensively in the study published in the Journal of Structural Biology [1], and can be useful for researchers interested in the analysis of IDPs and IDRs, as well as for the development of novel prediction approaches. The secondary structure assignment, solvent accessibility and two different disorder prediction methodologies provided alongside will also support such work.

1. Data description

Intrinsically Disordered Proteins (IDPs) and Intrinsically Disordered Regions (IDRs) represent a non-negligible fraction of protein structures. IDPs are not ordered and are likely to be unfolded in solution under native functional conditions [2], [3], [4]. They do not have a well-defined 3-D structure, but adopt an ensemble of conformations. In our recent research [1], we analysed the Protein Ensemble Database (PED3) [5] in the light of a structural alphabet [6]. PED3 is a database that collects series of protein structures associated with IDPs. PED stores 25,473 protein structures of 60 ensembles in 24 entries.

We provide the entire dataset in four separate folders. The data collected in these folders represent the core of our previous research published in the Journal of Structural Biology [1].

The first folder (1_DATA) consists of the raw data, i.e. the 24 entries with accompanying ensembles in the pdb format. They can be downloaded directly from the PED website, but we have cleaned a few of them for better parsing. Each subdirectory is named PEDxAAy-pdb, where x is a number ranging from 1 to 9 and y is a letter ranging from A to D, e.g. PED1AAD-pdb (β-synuclein).
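The naming rule described above can be checked with a short regular expression. This is only a sketch: the pattern is derived from the rule as stated in the text (one digit from 1 to 9, the letters AA, one letter from A to D), and the helper name parse_ped_name is hypothetical.

```python
import re

# Pattern derived from the naming rule in the text: "PED", a digit 1-9, "AA",
# a letter A-D, optionally followed by the "-pdb" suffix of the raw-data folders.
PED_NAME = re.compile(r"^PED([1-9])AA([A-D])(-pdb)?$")


def parse_ped_name(name):
    """Return (number, letter) for a PEDxAAy name, or None if it does not match."""
    match = PED_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


print(parse_ped_name("PED1AAD-pdb"))
print(parse_ped_name("PED4AAB"))
print(parse_ped_name("README"))
```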

The second folder (2_PBs) corresponds to the analysis of local protein conformations in the light of Protein Blocks (PBs, [7]). The PBxplore software [8] was used to translate the protein structures into PBs. For each entry, text files are provided with corresponding Figures. The directory names follow the same PEDxAAy rule. For each structure entry, the pdb files encoded as series of PBs are named PEDxAAy.PB.fasta. When multiple chains are found, the syntax changes slightly to PEDxAAy-chainZ.PB.fasta, with the chain Z added to the name. PBs are small prototypes five residues in length, labelled a to p. The first two and the last two residues cannot be assigned and are labelled Z. In rare cases, too many residues are incomplete in the pdb file, so that PBxplore assigns only stretches of Z.
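As an illustration, the PB sequences could be read back with a few lines of Python. This is a minimal sketch that assumes the .PB.fasta files follow the usual FASTA layout (a header line starting with ">" followed by one or more sequence lines); the function names and the example headers are hypothetical.

```python
def read_pb_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, pb_sequence) tuples."""
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_unassigned(pb_sequence):
    """True when the sequence contains nothing but the placeholder letter Z."""
    return set(pb_sequence) <= {"Z"}


# Hypothetical content, for illustration only.
example = [
    ">model_1 (hypothetical header)",
    "ZZdddfklmmmm",
    "mmnopacdZZ",
    ">model_2 (hypothetical header)",
    "ZZZZZZ",
]
for header, sequence in read_pb_fasta(example):
    print(header, len(sequence), is_unassigned(sequence))
```

With a real file, the same function can be called as read_pb_fasta(open(path)), since it only iterates over lines.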

From the distribution of PBs, the frequency of every PB at each position is computed and saved in PEDxAAy.PB.count files. These frequencies are used to compute an entropy index named Neq, which indicates whether the position is rigid, flexible or disordered. The entropy index is stored in the files named PEDxAAy.PB.Neq. This information is easy to read and parse for future analyses; visual representations are given in the corresponding Figures. Firstly, a PB frequency map is shown (files named PEDxAAy.map.png). In this map, for a given residue position, the colours range from deep blue (absence of a given type of PB) to red (only one type of PB), with green, yellow and orange for intermediate states. Secondly, the same information is shown as PB logos, in which the size of each letter is proportional to its frequency (files named PEDxAAy.PB.logo.png). Finally, 3D visualisations produced with the PyMOL software [9] are provided for three different protein orientations (files named PyMOL_PEDxAAy.png).
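The frequency computation described above can be sketched as follows. The sketch assumes one PB string per model, all of the same length, and it ignores the unassigned letter Z when normalising; that normalisation choice is an assumption made for illustration and is not necessarily how PBxplore builds its .PB.count files.

```python
from collections import Counter

PB_LETTERS = "abcdefghijklmnop"  # the 16 Protein Blocks


def pb_frequencies(pb_sequences):
    """Per-position PB frequencies from equal-length PB strings (one per model).

    Positions labelled Z (unassigned) are skipped, so each position is
    normalised by the number of models that have an assigned PB there.
    """
    if not pb_sequences:
        return []
    length = len(pb_sequences[0])
    if any(len(seq) != length for seq in pb_sequences):
        raise ValueError("all PB sequences must have the same length")
    table = []
    for position in range(length):
        counts = Counter(seq[position] for seq in pb_sequences
                         if seq[position] in PB_LETTERS)
        total = sum(counts.values())
        table.append({pb: n / total for pb, n in counts.items()} if total else {})
    return table


# Four hypothetical models of a seven-residue peptide.
models = ["ZZmmdZZ", "ZZmndZZ", "ZZmmdZZ", "ZZmkdZZ"]
for position, frequencies in enumerate(pb_frequencies(models), start=1):
    print(position, frequencies)
```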

The third folder (3_DDSP) corresponds to the secondary structure assignment performed with the DSSP software [10]. DSSP provides an 8-state assignment (α-helix, π-helix, 3₁₀-helix, bend, turn, β-bridge, β-sheet and coil) as well as the solvent accessibility. These two pieces of information are essential for most structural analyses. Each structure is in a file named PEDxAAy-n.dssp, with n corresponding to the model number. DSSP has been the most widely used secondary structure assignment method for over thirty years.
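The two pieces of information mentioned above can be extracted from a .dssp file with a small fixed-column parser. This is a sketch that assumes the classic DSSP text layout (per-residue records after the line starting with "  #  RESIDUE", chain at column 12, amino acid at column 14, secondary structure at column 17, accessibility in columns 35 to 38); make_record is a hypothetical helper that only builds synthetic lines for the demonstration.

```python
def parse_dssp(lines):
    """Return (chain, residue number, amino acid, sec. structure, accessibility) tuples."""
    residues = []
    in_table = False
    for line in lines:
        if line.startswith("  #  RESIDUE"):
            in_table = True  # per-residue records start after this header line
            continue
        if not in_table or len(line) < 38:
            continue
        if line[13] == "!":
            continue  # chain-break marker, not a residue
        # DSSP leaves the state blank for coil; "-" is used here instead.
        state = line[16] if line[16] != " " else "-"
        residues.append((line[11], int(line[5:10]), line[13], state,
                         int(line[34:38])))
    return residues


def make_record(number, chain, amino_acid, state, accessibility):
    """Build a synthetic DSSP-like line (illustration only, not real DSSP output)."""
    buffer = [" "] * 40
    buffer[5:10] = f"{number:>5}"
    buffer[11] = chain
    buffer[13] = amino_acid
    buffer[16] = state
    buffer[34:38] = f"{accessibility:>4}"
    return "".join(buffer)


demo = [
    "  #  RESIDUE AA STRUCTURE BP1 BP2  ACC",
    make_record(1, "A", "M", " ", 120),
    make_record(2, "A", "K", "H", 35),
]
print(parse_dssp(demo))
```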

The fourth folder (4_DISORDER) contains the disorder prediction outputs. Two very different methodologies were chosen, namely DisoPred 3.1 [11] and PrDOS [12]. Their results can be quite dissimilar, which underlines the importance of a better description of disorder states. Each of the 24 entries is treated individually. The DisoPred subdirectory contains files named name.pbat, which include prediction values for protein-binding residues in disordered regions as well as for disordered and ordered residues. It also includes a corresponding csv file (name.csv) and a simplified version (files named name.comb). An illustrative Figure named annotationGrid.png shows the results of the DisoPred analysis. In the PrDOS subdirectory, a csv file summarises all the results (prdos.name.csv) and a separate plot shows the predicted values along the sequence (in png format). The complete PrDOS output as presented on the website, which provides information on the analysed protein sequence, is also available (files named xAAy-PrDOS.jpeg).

These data were used for the work presented in the Journal of Structural Biology paper [1]. They are presented in a way that can be easily reused by researchers. In addition to the PB analyses, the secondary structure, solvent accessibility and disorder prediction data are useful for the development of new methodologies for predicting disorder and/or protein flexibility.

2. Experimental design, materials, and methods

2.1. Raw data

The raw data were downloaded from the PED website and correspond to a large number of ensembles. PED3 contains 25,473 protein structures of 60 ensembles in 24 entries. Of these entries, 6 have data from both SAXS and NMR, 7 from SAXS only, 10 from NMR only and one from Molecular Dynamics. Some entries have 10 or fewer models, while 8 have more than 500. The PED4AAB entry, the Sendai virus phosphoprotein ensemble, is the most populated with 13,718 models. All the models follow the classical PDB format (without most of the remarks). Some residues are incomplete, which could be problematic for later analyses.
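One simple way to spot incomplete residues is to check that each residue carries its four backbone atoms. The sketch below is an illustration of such a check, not the cleaning procedure used for this data set; it assumes standard fixed-column ATOM records and MODEL records, and the helper atom_line only builds synthetic lines for the demonstration.

```python
BACKBONE = {"N", "CA", "C", "O"}


def incomplete_residues(pdb_lines):
    """Return (residue key, missing backbone atoms) for every incomplete residue."""
    atoms = {}  # (model, chain, resseq, icode, resname) -> set of atom names
    model = 1   # used when the file has no MODEL record
    for line in pdb_lines:
        record = line[:6]
        if record.startswith("MODEL"):
            fields = line.split()
            model = int(fields[1]) if len(fields) > 1 else model
        elif record == "ATOM  ":
            key = (model, line[21], int(line[22:26]), line[26], line[17:20])
            atoms.setdefault(key, set()).add(line[12:16].strip())
    return [(key, sorted(BACKBONE - names))
            for key, names in atoms.items() if not BACKBONE <= names]


def atom_line(serial, name, resname, chain, resseq):
    """Build a synthetic ATOM record with dummy coordinates (illustration only)."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


demo = ["MODEL        1"]
demo += [atom_line(i, name, "ALA", "A", 1)
         for i, name in enumerate(["N", "CA", "C", "O"], start=1)]
demo += [atom_line(i, name, "GLY", "A", 2)
         for i, name in enumerate(["N", "CA"], start=5)]
demo.append("ENDMDL")
print(incomplete_residues(demo))
```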

2.2. Protein blocks

Protein Blocks (PBs) are a structural alphabet composed of 16 local prototypes [7]; they are employed to analyse local conformations. Each PB is characterised by the φ, ψ dihedral angles of five consecutive residues. The PBs m and d can be roughly described as prototypes for the central α-helix and the central β-strand, respectively. PBs a through c primarily represent the N-cap region of the β-strand, while PBs e and f correspond to its C-cap; PBs g through j are specific to coils, PBs k and l correspond to the N-cap region of the α-helix, and PBs n through p to its C-cap [6,13]. PB assignment was carried out for every residue of every model (snapshot) of each ensemble using the PBxplore tool [8], available on GitHub (https://github.com/pierrepo/PBxplore). A useful measure to quantify the flexibility of each amino acid, called Neq (for equivalent number of PBs) [7], was used. Neq is a statistical measure similar to entropy; it represents the average number of PBs a residue may adopt at a given position. Neq is calculated as follows [7]:

$$N_{eq} = \exp\left(-\sum_{x=1}^{16} f_x \ln f_x\right) \qquad (1)$$

where f_x is the frequency of PB x at the position of interest. A Neq value of 1 indicates that only one type of PB is observed, while a value of 16 corresponds to an equal probability for each of the 16 PBs, i.e. a random distribution. We have also computed average Neq values. PBs have been successfully used to analyse molecular dynamics simulations of, e.g., integrins, the Duffy Antigen Chemokine Receptor (DARC) protein, the KiSS1-derived peptide receptor (KISS1R), the HIV-1 capsid protein, an α-1,4-glycosidic hydrolase and the NMDA receptor channel gate.
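The two limiting cases described above can be verified numerically. The following is a minimal sketch of Eq. (1), with the usual convention that a PB with zero frequency contributes nothing to the sum; the function name neq is an illustrative choice.

```python
import math


def neq(frequencies):
    """Equivalent number of PBs, exp(-sum f_x ln f_x), for one position.

    Zero frequencies are skipped, following the convention 0 * ln(0) = 0.
    """
    entropy = -sum(f * math.log(f) for f in frequencies if f > 0.0)
    return math.exp(entropy)


print(neq([1.0]))              # a single PB observed at this position
print(neq([0.5, 0.5]))         # two equally frequent PBs
print(neq([1.0 / 16.0] * 16))  # all 16 PBs equally frequent
```

The three calls give values of 1, 2 and 16 up to floating-point rounding, which matches the interpretation of Neq as an equivalent number of PBs.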

2.3. Secondary structure assignment

Secondary structure assignment was performed using DSSP [10] (DSSP 2015 version 2.2.1; the latest DSSP distribution is available on GitHub at https://github.com/cmbi/xssp) with default parameters [14]. DSSP assigns eight secondary structure states: three helical states (α-, 3₁₀- and π-helices), two types of turn (turns, which have hydrogen bonds, and bends, which do not), the rare β-bridge, the frequent β-strand that composes β-sheets, and the coil (or loop) state.

2.4. Disorder prediction

Two approaches were used, namely DisoPred 3.1 [11] and PrDOS [12]. The first is one of the best-known and most widely used approaches (664 citations in January 2020 as measured by Google Scholar); the second is less well known but also has a large number of citations (463 over the same period). They are based on very different approaches and provide slightly different tendencies depending on the entry, which makes them useful to enrich the analyses.
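A simple way to compare the two predictors on one entry is the fraction of residues that receive the same order/disorder label. The sketch below is illustrative only: the 0.5 cut-off used to binarise the scores is an assumption, the scores are made up, and reading the actual DisoPred and PrDOS files is left out because their layout is not described here.

```python
def binarise(scores, threshold=0.5):
    """Label a residue as disordered when its score reaches the threshold.

    The default threshold of 0.5 is an assumption made for this sketch.
    """
    return [score >= threshold for score in scores]


def agreement(labels_a, labels_b):
    """Fraction of residues given the same label by two predictors."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both predictions must cover the same residues")
    if not labels_a:
        raise ValueError("empty prediction")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


# Made-up per-residue scores for a four-residue example.
predictor_1 = [0.9, 0.2, 0.6, 0.1]
predictor_2 = [0.8, 0.7, 0.4, 0.3]
print(agreement(binarise(predictor_1), binarise(predictor_2)))
```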

Acknowledgments

This work was supported by grants from the Ministry of Research (France), University de Paris, University Paris Diderot, Sorbonne, Paris Cité (France), National Institute for Blood Transfusion (INTS, France), National Institute for Health and Medical Research (INSERM, France), IdEx ANR-18-IDEX-0001 and labex GR-Ex. The labex GR-Ex, reference ANR-11-LABX-0051, is funded by the program “Investissements d'avenir” of the French National Research Agency, reference ANR-11-IDEX-0005-02. TJN, NS and AdB acknowledge the Indo-French Centre for the Promotion of Advanced Research / CEFIPRA for a collaborative grant (number 5302-2). NSh acknowledges support from ANRT. AMV is supported by an Allocation de Recherche Réunion granted by the Conseil Régional de la Réunion and the European Social Fund EU (ESF). MM and NM acknowledge project grants No. 174021 and 44006 from the Ministry of Education, Science and Technological Development, Republic of Serbia. Research in the NS group is supported by the Mathematical Biology program and FIST program sponsored by the Department of Science and Technology and also by the Department of Biotechnology, Government of India in the form of the IISc-DBT partnership programme. Support from UGC, India – Centre for Advanced Studies and the Ministry of Human Resource Development, India is gratefully acknowledged. NS is a J. C. Bose National Fellow.

The authors were granted access to high performance computing (HPC) resources at the French National Computing Centre CINES under grant no. c2013037147, no. A0010707621 and A0040710426 funded by the GENCI (Grand Equipement National de Calcul Intensif). Calculations were also performed on an SGI cluster granted by Conseil Régional Ile de France and INTS (SESAME Grant).

Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Footnotes

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.dib.2020.105383.

Appendix. Supplementary materials

mmc1.zip (68.7MB, zip)

References

1. Melarkode Vattekatte A., Narwani T.J., Floch A., Maljkovic M., Bisoo S., Shinada N.K., Kranjc A., Gelly J.C., Srinivasan N., Mitic N., de Brevern A.G. A structural entropy index to analyse local conformations in intrinsically disordered proteins. J. Struct. Biol. 2020. doi: 10.1016/j.jsb.2020.107464.
2. Uversky V.N. Cracking the folding code. Why do some proteins adopt partially folded conformations, whereas other don't? FEBS Lett. 2002;514:181–183. doi: 10.1016/s0014-5793(02)02359-1.
3. Pavlovic-Lazetic G.M., Mitic N.S., Kovacevic J.J., Obradovic Z., Malkov S.N., Beljanski M.V. Bioinformatics analysis of disordered proteins in prokaryotes. BMC Bioinf. 2011;12:66. doi: 10.1186/1471-2105-12-66.
4. Mitić N.S., Malkov S.N., Kovacevic J.J., Pavlovic-Lazetic G.M., Beljanski M.V. Structural disorder of plasmid-encoded proteins in Bacteria and Archaea. BMC Bioinf. 2018;19:158. doi: 10.1186/s12859-018-2158-6.
5. Varadi M., Kosol S., Lebrun P., Valentini E., Blackledge M., Dunker A.K., Felli I.C., Forman-Kay J.D., Kriwacki R.W., Pierattelli R., Sussman J., Svergun D.I., Uversky V.N., Vendruscolo M., Wishart D., Wright P.E., Tompa P. pE-DB: a database of structural ensembles of intrinsically disordered and of unfolded proteins. Nucleic Acids Res. 2014;42:D326–D335. doi: 10.1093/nar/gkt960.
6. Joseph A.P., Agarwal G., Mahajan S., Gelly J.-C., Swapna L.S., Offmann B., Cadet F., Bornot A., Tyagi M., Valadié H., Schneider B., Cadet F., Srinivasan N., de Brevern A.G. A short survey on protein blocks. Biophys. Rev. 2010;2:137–145. doi: 10.1007/s12551-010-0036-1.
7. de Brevern A.G., Etchebest C., Hazout S. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins. 2000;41:271–287. doi: 10.1002/1097-0134(20001115)41:3<271::aid-prot10>3.0.co;2-z.
8. Barnoud J., Santuz H., Craveur P., Joseph A.P., Jallu V., de Brevern A.G., Poulain P. PBxplore: a tool to analyze local protein structure and deformability with protein blocks. PeerJ. 2017;5:e4013. doi: 10.7717/peerj.4013.
9. DeLano W.L.T. The PyMOL molecular graphics system. DeLano Scientific, San Carlos, CA, USA. 2002. http://www.pymol.org.
10. Kabsch W., Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637. doi: 10.1002/bip.360221211.
11. Jones D.T., Cozzetto D. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics. 2015;31:857–863. doi: 10.1093/bioinformatics/btu744.
12. Ishida T., Kinoshita K. PrDOS: prediction of disordered protein regions from amino acid sequence. Nucleic Acids Res. 2007;35:W460–W464. doi: 10.1093/nar/gkm363.
13. de Brevern A.G. New assessment of a structural alphabet. In Silico Biol. 2005;5:283–289.
14. Touw W.G., Baakman C., Black J., te Beek T.A., Krieger E., Joosten R.P., Vriend G. A series of PDB-related databanks for everyday needs. Nucleic Acids Res. 2015;43:D364–D368. doi: 10.1093/nar/gku1028.

