Abstract
Intrinsically disordered proteins (IDPs) are comprised of significant numbers of residues that form neither helix, sheet, nor any other canonical type of secondary structure. They play important roles in a broad range of biological processes, such as molecular recognition and signalling, largely due to their chameleon-like ability to change structure from unordered when free in solution to ordered when bound to partner molecules. Circular dichroism (CD) spectroscopy is a widely-used method for characterising protein secondary structures, but analyses of IDPs using CD spectroscopy have suffered because the methods and reference datasets used for the empirical determination of secondary structures do not contain adequate representations of unordered structures. This work describes the creation, validation and testing of a standalone Windows-based application, DichroIDP, and a new reference dataset, IDP175, which is suitable for analyses of proteins containing significant amounts of disordered structure. DichroIDP enables secondary structure determinations of IDPs and proteins containing intrinsically disordered regions.
Subject terms: Computational biology and bioinformatics, Protein databases, Computational biophysics
DichroIDP is a method with a bespoke reference dataset for analyzing and determining secondary structures in proteins containing intrinsically disordered regions using circular dichroism spectroscopy.
Introduction
Most globular proteins in their native state are primarily comprised of canonical (helical, sheet and turn) secondary structures and exist in well-defined conformations with specific three-dimensional structures. In contrast, intrinsically disordered proteins (IDPs) tend to form dynamic ensembles of highly flexible polypeptide chains that often have very limited amounts of persistent secondary structures1. In addition, some globular proteins also exhibit intrinsically disordered regions (IDRs) comprised of ~30 or more consecutive amino acid residues, which do not adopt regular secondary structures2. Due to their flexible nature, IDPs and proteins with IDRs have the potential to bind to a range of partner molecules, acquiring different conformations according to the templates provided by the binding partners. This is likely to be a reason why they appear to be involved in a number of regulatory functions, including molecular recognition and signalling3. In humans, for instance, ~80% of “hub” proteins with >10 known binding partners are predicted to contain long disordered regions4.
Circular dichroism (CD) spectroscopy and the related method of synchrotron radiation circular dichroism (SRCD) spectroscopy5 are widely-used techniques for quantitatively analysing the helix, sheet and turn contents of proteins6,7 in different environments and as components of complexes. In most cases, the analyses employ empirical methods that rely on the availability of suitable and broadly-based reference datasets (RDS) derived from proteins with known crystal structures8–10. These types of analyses, however, can be of limited value if the protein to be analysed includes significant numbers of residues that are not present in canonical types of secondary structures. Such residues have usually been grouped together under names such as “other”, “unordered”, “irregular”, “disordered” or “random coil”. Empirical analyses of proteins with such features require reference datasets that contain spectra of proteins with non-canonical structures. However, the currently-available reference datasets used by the CD methods have been derived from proteins that crystallise, and therefore tend to include only limited numbers of examples of natively “unordered” or disordered types of secondary structure, which tend to be missing from crystal structures and are often referred to as “other”. Indeed, computationally, the “other” type of secondary structure is often simply ascribed to the remainder of the protein that is not calculated to be helical, sheet, or in some cases, turn. For example, “other” type structures have been used to refer to loop structures that do not form the strict hydrogen-bond pairings present in different types of tight turns, to unfolded structures present in thermally- or chemically-treated proteins which have lost their tertiary structural interactions, or to intrinsically disordered regions (IDRs) of proteins which do not adopt regular helical or sheet structures.
The aim of this study was therefore to improve the coverage of disordered secondary structure types available in CD reference databases, and the methods used for their analyses by CD spectroscopy. It describes a new reference dataset which includes examples of this class of structure, and an associated novel application method that can be used to analyse CD spectra from a wide-range of protein types, including IDP and IDR-containing proteins.
A number of existing secondary structure analysis tools11–13 have been developed which incorporate or are based on different empirical algorithms for determining the secondary structures of proteins from CD spectroscopic data using the available reference datasets derived from spectra of globular proteins with known crystal structures. These tools include the SELCON314, CONTINLL15, CDSSTR16, and BeStSel12 deconvolution algorithms, and the SESCA13 programme, amongst others. Although there is usually some variation in the results obtained with these different algorithms, the majority of the differences arise not from the different methodologies, but rather from the use of different reference datasets comprised of different proteins11. To date, the available reference datasets with the most comprehensive coverage of protein secondary structure and fold space are the bioinformatics-designed SP175 reference dataset8 (for soluble proteins), the SMP180 reference dataset9 which includes both soluble and membrane proteins, and the SP175+ reference dataset (SP175 augmented by a number of additional beta sheet proteins17). The first two of these are included with the DichroWeb analysis server7,11 and the latter is available in the BeStSel12 analysis server. The SESCA13 programme utilises a number of datasets, including a modified version of SP175.
However, none of the currently available reference datasets contain representatives of proteins that include significant amounts of intrinsic disorder. This is primarily because the disordered regions in globular proteins tend not to be visible in crystal structures, and because IDPs, by their nature, do not form regular crystallisable structures, even though they may contain regions that are statically- or dynamically- well-defined. One existing reference dataset, CDPro4210 from the CDPro software package (dataset 7 available in the DichroWeb11 online analysis resource located at: http://dichroweb.cryst.bbk.ac.uk/html/home.shtml), does contain several spectra of denatured proteins as representatives of “disordered” proteins, which are assumed to be comprised of ~90% unordered structure; but there is no independent evidence that they adopt such structures nor that these denatured structures (produced by chemical unfolding reagents or heating) are related to intrinsically-unfolded regions of native proteins. In order to create a new reference dataset (and an associated analysis method) that distinguishes disordered structures from helix, sheet and turn, it has been necessary to include examples of IDP proteins (Supplementary Table S1 (top)) with those of standard globular proteins that are primarily composed of canonical helical and sheet secondary structures (Supplementary Table S1 (bottom)). However, since IDPs do not readily crystallise, several new bioinformatics methods have been used to predict secondary structures for a number of IDP or IDR-rich proteins directly from their primary sequences. These methods include Spot-1D18, NetSurfP-2.019, RaptorX20 and AlphaFold221. All predict solvent accessibility and backbone dihedral angles, and therefore the potential secondary structure of individual residues in the sequence, using deep learning neural networks trained on structures present in the Protein Data Bank (PDB)22 (Supplementary Table S2). 
Spot-1D and NetSurfP-2.0 output three- and eight-state residue-by-residue secondary structure predictions, whereas RaptorX and AlphaFold2 output atomic coordinates in PDB format. The latter two methods permit secondary structures to be independently calculated using the dictionary of protein secondary structure (DSSP)23 algorithm (in the same way as those used for the soluble proteins in the dataset, which all have crystal structures available in the PDB).
The use of the AlphaFold method to predict protein structures in general has been endorsed by the Critical Assessment of Protein Structure Prediction (CASP)24 assessment competition, which compared leading structure prediction methods in detail for a wide range of proteins. AlphaFold was the top-ranked method overall, with a median GDT (Global Distance Test) score of 92.4 across all targets and 87.0 on the challenging free-modelling category, compared to 72.8 and 61.0 for the next best methods in these categories. However, those assessments were done primarily on fully-ordered proteins, rather than the disordered or partially disordered proteins in the present study. Although IDPs (and ordered proteins with IDRs) are different types of structures from fully ordered proteins, David et al.25, Ruff and Pappu26 and Wilson et al.27 have asserted that, whilst the AlphaFold2 predictions of the 3D structures of the IDP regions may not be exactly defined residue-by-residue, the extent and characteristics of those regions are clearly indicated by AlphaFold2 to be those of IDRs. This is the information required for the present study.
The new reference dataset reported herein is designated IDP175, a name which reflects the inclusion of intrinsically disordered proteins with the low wavelength end of their spectra extending down to a wavelength minimum of 175 nm. It includes spectra (Fig. 1) from both the existing SP175 RDS8 and the newly-characterised group of IDP protein spectra determined in this study, which are not present in any other dataset available to date. All components are publicly available in the Protein Circular Dichroism Data Bank (PCDDB)28. This new dataset should therefore be appropriate for analyses not only of IDPs, but also of proteins which contain mixtures of both ordered and disordered structures. For ease of use, the IDP175 dataset has been incorporated into a stand-alone Windows application called DichroIDP, which utilises SelMat8, a modified version of the SELCON3 algorithm, to determine secondary structures from protein CD spectra.
The IDP175 reference dataset was first cross-validated by leave-one-out procedures using a modified version of DichroIDP that was produced exclusively for the purpose of testing. The IDP175 reference dataset was then trialled in the DichroIDP app using spectra of both IDPs and globular proteins with significant amounts of disorder, in order to demonstrate its general suitability; the results obtained were compared with results using three existing RDS: SP1758, CDPro10 and SP175+17. In the cross-validation tests, the IDP175 and other reference datasets produced roughly comparable results for helix and sheet components, but the IDP175 reference dataset produced a significant improvement for the calculated turn and disordered components based on the Pearson’s correlation and zeta factor criteria8. More crucially, whilst producing similar values for helix and sheet components, IDP175 outperformed all other reference datasets, and also other widely used methods, including BeStSel12 and K2D329, in analyses of the spectra of disordered proteins, as judged by how close the values were to those calculated by the DSSP algorithm based on either their AlphaFold221 or PDB22 structures.
Results
Criteria used for selection of intrinsically disordered proteins
The new proteins in the IDP175 RDS (Supplementary Table S1 (top)) ranged from small (49 residues, the region 174-222 of the translocated actin-recruiting phosphoprotein (Tarp174-222) from Chlamydia trachomatis) to moderate size (>300 residues, the hydrophilic acylated surface protein from Leishmania major (HASPA))30. Only soluble (not membrane) proteins were included, and no proteins with bound chromophores or ligands that absorb in the UV or visible ranges were included, as these could potentially distort the protein spectra, even in the far UV region used for secondary structure analyses.
Choice of proteins included in either the reference or test datasets
The difficulty in expressing and purifying soluble monomeric IDPs meant that there were a limited number of fully IDP proteins or polypeptides available for use. Furthermore, as pointed out by Micsonai et al.31, bioinformatics methods for obtaining secondary structure from protein sequences do not take into account environmental factors, which can radically alter a protein’s conformation. Therefore the number of IDP spectra available was further limited as some of the structural data obtained using these methods were deemed to not match the general form of the CD spectra obtained. Consequently judicious choices had to be made regarding which of the IDP proteins were to be used for creation of the RDS and which were to be used for testing of the RDS. The more proteins in the reference dataset, the more accurate it was likely to be; however, including more of the proteins in the RDS would then limit the number of test proteins available for independent validation calculations. Ultimately the selection of proteins included in the RDS was guided by optimisation of the cross-validation test parameters (see below). The proteins that have been included in IDP175 and those used for testing are listed in Supplementary Tables S1 (top) and S1 (bottom). The spectra of the components of the entire dataset and of only the “fully IDP” proteins included in the dataset are shown in Fig. 1a, b, respectively; not surprisingly, all of the IDP spectra appear to be very similar.
The spectra of alpha-chymotrypsin, alpha-chymotrypsinogen, elastase and soybean trypsin inhibitor, which contain right-hand-twisted beta-sheets (hereafter designated β2 spectra) can resemble the spectra of the IDPs12,31, (see Fig. 1b), causing existing analysis algorithms to assign excessive beta structure to IDP spectra if they are included in the RDS. This was also found to be the case for the IDP175 dataset, especially if there are (even very small) errors in the spectral magnitudes due to inaccurate concentration determinations or cell pathlength measurements. β2 spectra were therefore removed from the IDP175 RDS, along with the spectrum of ferredoxin which gives an anomalous disordered-like spectrum, likely due to the presence of its chromophore.
The test dataset included not only IDPs, but folded proteins with mostly beta sheet (Types 1 and 2 (relaxed and right-hand-twisted)) structures, proteins comprised of both alpha helix and beta sheet, and alpha helical proteins (Fig. 2). The latter inclusions were to demonstrate how the RDS performed in analyses of all common secondary structure types.
Protein spectra sources and selection
Secondary structures of the SP1758 proteins in the dataset were derived from crystal structure coordinates (from the same PDB files that were used for the original SP175 dataset) using the DSSP23 algorithm (Supplementary Table S2). The Spot-1D18, NetSurfP-2.019, RaptorX20, and AlphaFold221 prediction methods were used to generate structural data from the primary sequences of the IDP proteins, which do not readily crystallise and therefore had no crystal structures included in the PDB22. Although the results were similar, RaptorX20 and AlphaFold221 were initially favoured in this study because they generate PDB files that can be analysed with DSSP23 in the same manner as globular protein structures; AlphaFold2 was additionally endorsed by the CASP24 results. Of all four methods, AlphaFold221 was judged to give the best performance in cross-validation results and in the analyses of IDPs in the test dataset, with respect to the disordered fraction (Supplementary Table S3). Therefore only structures obtained from this method were used in the final RDS. Four protein spectra for which structural assignments did not correlate with the appearances of their CD spectra were discarded (Supplementary Table S4). Although Wilson et al.27 suggested that the problem usually manifests in the over-prediction of disordered residues, the discarded proteins were predicted to have more alpha helix by AlphaFold2 than was judged to be the case from the general appearance of the CD spectra (Supplementary Fig. S1). For example, the CD spectrum of alpha synuclein in water indicates a disordered structure with a single negative peak at around 200 nm. However, AlphaFold2 assigns 45% helix to this protein, a structure which would generate a spectrum with noticeable negative peaks at ~222 nm and ~208 nm and a positive peak at ~190 nm.
Definitions of secondary structural classifications used in IDP175
When defining the number of separate classes to be identified from CD spectroscopic data, it is important to consider the information content present in the spectral data: if the data extend down to 190 nm, they have high enough information content8 to distinguish only 5 different types of secondary structures, although this number increases to 7 or 8 if data down to 175 nm (which can be achieved using SRCD instruments) are included. The secondary structural components of two of the most popular general datasets, SP1758 and SMP1809, use the six structural classifications of regular helix, distorted helix, regular sheet, distorted sheet, turns and “other”, where “other” combines everything else. The SP175+ dataset17, used in the BeStSel server12 (which was primarily designed to analyse beta sheets), divides the components into helix, parallel and antiparallel beta sheet, turns and other. However, in the present study, since we are mainly concerned with accurate predictions of the “other” component, and to prevent over-interpretation when data only reach 190 nm, our output was limited to four categories. These are based on their DSSP values, where the DSSP classes H, G and I are combined as helix, sheet is class E, turn is a combination of classes T and S, and disorder is everything else (classes B and O).
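The four-category reduction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used by DichroIDP or 2Struc: the function name `four_state_fractions`, the dictionary `DSSP_TO_CLASS` and the example string are hypothetical, and the input is assumed to be a per-residue string of DSSP codes in which any character other than H, G, I, E, T or S counts as disordered.

```python
from collections import Counter

# Four-state reduction described in the text (names here are hypothetical):
# H, G, I -> helix; E -> sheet; T, S -> turn; everything else -> disordered.
DSSP_TO_CLASS = {
    "H": "helix", "G": "helix", "I": "helix",
    "E": "sheet",
    "T": "turn", "S": "turn",
}
CLASSES = ("helix", "sheet", "turn", "disordered")


def four_state_fractions(dssp_string):
    """Return the fraction of residues falling in each of the four classes."""
    if not dssp_string:
        raise ValueError("empty DSSP string")
    counts = Counter(DSSP_TO_CLASS.get(code, "disordered") for code in dssp_string)
    n_residues = len(dssp_string)
    return {name: counts[name] / n_residues for name in CLASSES}


# Made-up 20-residue assignment, purely for illustration
example = "--HHHHGGTT-EEEE-SSB-"
fractions = four_state_fractions(example)
print(fractions)
```

Treating every unlisted code as disordered mirrors the "everything else" rule in the text, so the four fractions always sum to one.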
Validations and dataset analysis comparisons
Cross-validation studies were first done for all four standard secondary structure types (helix, sheet, turn and disordered) using the “leave one out” method8 (Table 1) in order to show that there is adequate coverage of representative types present in the new IDP175 reference dataset and in a version of the dataset with the low wavelength cutoff of the data truncated to 190 nm (designated IDP175t). The selection of proteins to be part of the RDS (as opposed to test proteins) was optimised to produce the highest correlations for all four categories.
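The leave-one-out idea can be illustrated with the following Python sketch, in which each protein is removed from the reference set in turn and its secondary structure is estimated from the remaining entries. The predictor used here is a plain least-squares mapping chosen only to keep the example runnable; it is an assumption for illustration and is not the SelMat/SELCON3 algorithm used in DichroIDP. All names (`leave_one_out`, `least_squares_predict`) and the synthetic data are hypothetical.

```python
import numpy as np


def leave_one_out(spectra, fractions, predict):
    """Estimate each protein's fractions from a reference set that excludes it.

    spectra:   (n_proteins, n_wavelengths) array of CD spectra
    fractions: (n_proteins, n_classes) array of structure-derived fractions
    predict:   callable(ref_spectra, ref_fractions, query) -> (n_classes,) estimate
    """
    spectra = np.asarray(spectra, dtype=float)
    fractions = np.asarray(fractions, dtype=float)
    n_proteins = spectra.shape[0]
    estimates = np.empty_like(fractions)
    for i in range(n_proteins):
        keep = np.arange(n_proteins) != i          # drop protein i from the reference set
        estimates[i] = predict(spectra[keep], fractions[keep], spectra[i])
    return estimates


def least_squares_predict(ref_spectra, ref_fractions, query):
    """Placeholder predictor (NOT SELCON3/SelMat): linear map from spectrum to fractions."""
    # Solve ref_spectra @ X ~= ref_fractions; small singular values are discarded
    X, *_ = np.linalg.lstsq(ref_spectra, ref_fractions, rcond=1e-10)
    return query @ X


# Synthetic demonstration: spectra built as exact mixtures of four made-up basis spectra
rng = np.random.default_rng(0)
true_fractions = rng.dirichlet(np.ones(4), size=12)   # 12 synthetic "proteins"
basis = rng.normal(size=(4, 6))                        # 4 basis spectra, 6 wavelengths
synthetic_spectra = true_fractions @ basis
estimates = leave_one_out(synthetic_spectra, true_fractions, least_squares_predict)
```

Because the synthetic spectra are noise-free linear mixtures, the left-out fractions are recovered almost exactly here; real spectra would not behave so cleanly, which is why the correlation and deviation statistics are reported.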
Table 1.
a) IDP175 | | | | b) IDP175t | | | |
 | r | δ | ζ | | r | δ | ζ |
H | 0.9270 | 0.0801 | 2.6633 | H | 0.9214 | 0.0831 | 2.5668 |
E | 0.8543 | 0.0886 | 1.9186 | E | 0.8422 | 0.0920 | 1.8476 |
T | 0.5342 | 0.0613 | 1.1365 | T | 0.5411 | 0.0599 | 1.1631 |
D | 0.9322 | 0.0649 | 2.6987 | D | 0.9364 | 0.0617 | 2.8400 |
c) SP175 | | | | d) SP175t | | | |
 | r | δ | ζ | | r | δ | ζ |
H | 0.9299 | 0.0771 | 2.7191 | H | 0.9233 | 0.0807 | 2.5977 |
E | 0.8398 | 0.0871 | 1.8354 | E | 0.8018 | 0.0956 | 1.6715 |
T | 0.3691 | 0.0543 | 1.0353 | T | 0.3915 | 0.0535 | 1.0510 |
D | 0.5945 | 0.0535 | 1.3051 | D | 0.7282 | 0.0484 | 1.4414 |
e) CDPro42 | | | | f) SP175+17 (from BeStSel12) | | | |
 | r | δ | ζ | | r | δ | ζ |
H | 0.9175 | 0.0870 | 2.5001 | H | 0.9087 | 0.0897 | 2.3924 |
E | 0.6980 | 0.1145 | 1.3710 | E | 0.8016 | 0.1042 | 1.6712 |
T | 0.5296 | 0.0771 | 1.0873 | T | 0.4454 | 0.0531 | 1.0967 |
D | 0.7385 | 0.1550 | 1.3511 | D | 0.6623 | 0.054 | 1.3077 |
The following reference datasets were used: a) IDP175; b) IDP175t (low wavelength cut off 190 nm); c) SP1758; d) SP175t (low wavelength cut off 190 nm); e) CDPro4210 and f) SP175+17. The statistical parameters reported are: r, the Pearson’s correlation coefficient; δ, the root mean squared deviation; and ζ, the ratio of the population standard deviation to δ, as defined in the main text. The cross-validation values for all of the reference datasets/assignment methods are similar for helix and sheet secondary structures, but the disordered structure contents are very much improved using the IDP175 and IDP175t reference datasets. H, E, T, and D refer to the helical, sheet, turn, and disordered components, respectively.
The cross validation results (Table 1) were compared with studies using the SP1758 and SP175t RDS (like IDP175t, SP175t uses data to 190 nm, as opposed to the SP175 reference dataset which requires data to 175 nm), CDPro10 (cutoff 190 nm) and SP175+ 17 using the same 4 secondary structural types defined in IDP175, so that the quality of the analyses could be directly compared. All datasets exhibited little difference in the quality of the analyses of the helix and sheet categories, but IDP175 and IDP175t showed significant improvements for the turn and “disordered” categories, as expected.
De novo tests were then done using the spectra of IDPs and folded proteins (shown in Fig. 2b) not present in the reference dataset. The results with IDP175 or IDP175t (depending on the low wavelength cutoff of the test protein data) were compared once more with the other reference datasets mentioned above (Fig. 3 and Supplementary Table S5) and also with results obtained using BeStSel12 and K2D329, a neural network method trained on spectra predicted from PDB structures using DichroCalc32 (Supplementary Table S6). The disordered test proteins were also analysed using SESCA13 with the IDP175 dataset and with DSSP-F, a dataset that comes with the SESCA package (Supplementary Table S7). The calculated secondary structure contents using the IDP175 reference dataset produced values that were closer to those of AlphaFold2 than those produced by the DSSP-F reference data. In addition, the NRMSD values were generally smaller.
Discussion
Early CD spectroscopic secondary structure analysis methods divided proteins into helical, sheet or “random coil” types of secondary structures, and used very limited numbers of proteins with known structures to create reference datasets for simple deconvolutions. Later methods used selection methods14–16, generally with reference datasets consisting of slightly larger numbers of proteins containing representative types of secondary structure. Their secondary structural components were, for the most part, divided into regular helix, distorted helix, regular sheet, distorted sheet, (sometimes) turns and “other” structures, where “other” combines everything else, including residues present in undefined regions of the protein crystal structure. More recently CD reference datasets for soluble8 and membrane proteins9 have been developed using bioinformatics techniques enabling wider coverage of fold and secondary structure spaces.
Most reference datasets available to date include only proteins whose crystal structures are known8,9, but at least one reference dataset10 included a few “denatured” protein structures (produced by acid and heat denaturation) all of which were assumed to contain 90% disordered structure. The availability of a number of stable purified, soluble IDPs, has now enabled the measurements of their CD spectra whilst the emergence of deep learning neural networks such as AlphaFold221, which has been shown to outperform other prediction methods in the Critical Assessment of Protein Structure Prediction exercises, now means there is a method for assigning atomic coordinates to this additional class of protein, which had not proved to be amenable to crystallisation. Both of these developments have thus allowed the construction, validation and testing of a new reference dataset for use with the new DichroIDP application described herein, to characterise proteins that have considerable amounts of disordered structure, often in the presence of canonical secondary structures.
In summary, we have produced a new user-friendly tool for studying an important class of proteins which are disordered or partially disordered, enabling quantitation of the amount of disordered structure present in both primarily folded, and primarily unfolded proteins using CD data. Previously, this class of proteins was not accurately analysed by CD due to the methodologies available and the lack of suitable reference and test protein spectra.
Materials and methods
Materials
The IDP175 reference dataset included the following spectra obtained previously in our lab: MEG-14 (microexon 14 protein from Schistosoma mansoni)33, HASPA and HASPB (hydrophilic acylated surface proteins from Leishmania major)30, bovine casein (Sigma-Aldrich), β-b1 C-terminus34,35, and TARP174-222 (translocated actin recruiting phosphoprotein from Chlamydia trachomatis, donated by Prof. Tharin Blumenschein of the University of East Anglia). Test proteins included four soluble proteins: Bence-Jones lambda protein, bovine trypsin, prealbumin and alpha-lactalbumin present in the SMP180 RDS9 (which were not in the SP175 RDS8), plus pokeweed lectin and saporin (Sigma-Aldrich). The spectra of six additional proteins (osteopontin, UTPase, ecotin, β2-microglobulin, MAGI-1PDZ1 and eGFP) were obtained from existing PCDDB28 entries. Two other IDP test protein spectra were obtained by digitising published spectra of cyclin-dependent-kinase inhibitor, Sic136, and strepsirrhine primate amelogenin37 using the desktop version of WebPlotDigitiser38. The CD spectra of all of the IDP175 proteins are depicted in Fig. 1 (main text), whilst the CD spectra of the test proteins are in Fig. 2 (main text). The secondary structures and UniProt39 codes of all of these proteins (and, where available, their PCDDB IDs) are listed in Supplementary Tables S1 (top), S1 (bottom), and S2.
Methods
Synchrotron radiation circular dichroism spectroscopy
All synchrotron radiation circular dichroism (SRCD) spectra that have not been previously published were measured at synchrotron beamlines CD1 or UV1 at the ISA facility in Aarhus, Denmark except for β-b1 C-terminus, which was measured on beamline CD12 at the SRS, Daresbury, UK.
The protein concentrations were determined by the A280 method with extinction coefficients calculated using the EXPASY webserver40. For comparison, the concentrations of proteins measured on beamlines CD1 and UV1 were also determined in situ using the A205 method whereby the sample absorbance is determined from the HT (high tension) signal and the synchrotron ring current41, and the concentration determined using amino acid extinction coefficients at 205 nm from values by Anthis and Clore42.
Spectra were obtained at 20 °C in quartz cylindrical demountable cells (Hellma UK, Ltd) with optical pathlengths of 0.0015, 0.0024, or 0.0011 cm (each calibrated using the interference method43). In all cases the dataset spectra were measured from a high wavelength of >260 nm down to a low wavelength of at least 175 nm, in 1 nm steps, using averaging times of 1 to 3 s. Data processing was carried out using the CDtoolX software44 as follows: three replicate sample spectra were averaged and a buffer baseline (also the average of three replicate spectra) was subtracted. The net spectrum was calibrated using a spectrum of camphorsulphonic acid45 measured on the same instrument and then scaled to delta epsilon units.
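The averaging, baseline subtraction and scaling steps can be sketched as follows. This is a minimal illustration, not the CDtoolX implementation: it assumes the raw signal is in millidegrees, uses the standard conversion from ellipticity to mean residue ellipticity and then to Δε (division by 3298), and omits the camphorsulphonic acid calibration step. The function name, argument names and numerical inputs are hypothetical.

```python
import numpy as np


def process_spectrum(sample_scans, baseline_scans, mean_residue_weight,
                     conc_mg_per_ml, pathlength_cm):
    """Average replicates, subtract the averaged baseline, scale to delta epsilon.

    Assumes scans are in millidegrees; instrument calibration is not included.
    """
    sample = np.mean(np.asarray(sample_scans, dtype=float), axis=0)
    baseline = np.mean(np.asarray(baseline_scans, dtype=float), axis=0)
    net_mdeg = sample - baseline
    # Mean residue ellipticity (deg cm^2 dmol^-1) = mdeg * 0.1 * MRW / (c * l)
    mean_residue_ellipticity = (net_mdeg * 0.1 * mean_residue_weight
                                / (conc_mg_per_ml * pathlength_cm))
    # Delta epsilon (M^-1 cm^-1) = mean residue ellipticity / 3298
    return mean_residue_ellipticity / 3298.0


# Illustrative numbers only: three replicate scans at two wavelengths
sample_scans = [[-9.0, -4.0], [-10.0, -5.0], [-11.0, -6.0]]
baseline_scans = [[0.5, 0.5], [0.0, 0.0], [-0.5, -0.5]]
delta_epsilon = process_spectrum(sample_scans, baseline_scans,
                                 mean_residue_weight=110.0,
                                 conc_mg_per_ml=10.0,
                                 pathlength_cm=0.0015)
print(delta_epsilon)
```

Because the scaling divides by both concentration and pathlength, small errors in either propagate directly into the spectral magnitude, which is why both were calibrated independently in this work.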
The protein spectra were divided into those incorporated into the reference dataset (Supplementary Table S1 (top)) and the test dataset (Supplementary Table S1 (bottom)), following cross-validation testing (see below) to optimise the reference dataset contents whilst retaining availability of some of the other spectra for validation testing. The new RDS spectra were added to 66 spectra from the SP175 RDS8 obtained from the PCDDB28. In addition, a number of the SP1758 entries have been updated in the PCDDB28; these are indicated by ‘1’ in the 10th position of the PCDDBid.
Methods for assignment of secondary structures
The AlphaFold221 website (https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold22.ipynb#scrollTo=kOblAo-xetgx) was used to produce structures from the IDP protein sequences with the default settings, which produce five models for each sequence. The helix, sheet, and turn secondary structure percentages of the proteins were defined by their DSSP23 classifications using the 2Struc46 webserver. Residues defined as H (alpha helix), G (3₁₀ helix) and I (pi helix) were combined and classified as helix, and the DSSP E class was assigned as beta strand. DSSP S (bend) and T (hydrogen-bonded turn) classes were designated “turn”, whereas B (isolated beta bridge) was combined with the remainder and designated disordered. The values obtained from the five top AlphaFold221 models for each IDP protein were averaged. AlphaFold2 models were also produced for the folded test proteins for comparison. DSSP23 values for the SP1758 proteins and folded test proteins were calculated from their crystal structures in the PDB22 (where available) using the 2Struc webserver46 (Supplementary Tables S1 (top) and S1 (bottom)).
Method for CD-based calculations of secondary structure
The IDP175 reference dataset was incorporated into the selectable list of available reference datasets in the DichroIDP standalone application, which was produced using the Qt framework47. It uses the existing SelMat8 algorithm, rewritten in C++ using the ALGLIB48 package, and can be used for analysing spectra that contain data between a high wavelength of at least 240 nm and any low wavelength between 200 and 175 nm. SelMat8 is a version of SELCON314 in which the sum, fraction and helix rules are relaxed to give at least one solution for any protein spectrum; it was originally written for MATLAB49. Spectra can be scaled if necessary before analysis. The output consists of a table showing results from all stages of the algorithm calculation and includes a list of the proteins in the dataset that are closest to the query spectrum. The final result is presented in a second table that includes the normalised root mean square deviation (NRMSD11) between the query data and the back-calculated spectrum (which is displayed along with the query spectrum for comparison). The RMSD is normalised because, on its own, it does not take into account the relative magnitude of the spectral fitting error. For example, where the CD signal is small in magnitude, a given absolute error is proportionally larger than where the signal is large in magnitude. The widely-used “NRMSD” parameter attempts to rectify this, and is defined as:
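$$\mathrm{NRMSD} = \sqrt{\frac{\sum \left(\theta_{exp} - \theta_{calc}\right)^{2}}{\sum \left(\theta_{exp}\right)^{2}}} \qquad (1)$$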
where θexp and θcalc are the experimental and back-calculated ellipticities, respectively, at each data point in the spectrum, with lower values indicating a closer match between experimental and reference data. The NRMSD (calculated in the same way for all methods) depends on how close the query spectrum is to an average of the nearest selected spectra in the dataset, from which the back-calculated spectrum is calculated. This means that it does not always reflect the accuracy of the secondary structure estimate in every case. This is demonstrated when for example a disordered spectrum is analysed using SP175 (or the BeStSel reference database), or when analysing a β2 spectrum using IDP175. However, the NRMSD does usually give a good indication of accuracy when using an appropriate dataset for analysing the query protein. Hence we have created a number of datasets over the years for different types of proteins, including the IDP dataset reported in this study. The result tables (or any part of them) produced by DichroIDP can be pasted directly into spreadsheet software. There is an extensive help file associated with the app which can be accessed directly from its “help” menu.
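As a worked illustration of the NRMSD, the short Python sketch below assumes the commonly used definition in which the summed squared difference between experimental and back-calculated values is divided by the summed squared experimental values before taking the square root. The function name `nrmsd` and the example numbers are hypothetical and are not taken from DichroIDP.

```python
import numpy as np


def nrmsd(theta_exp, theta_calc):
    """Normalised RMSD between experimental and back-calculated spectra."""
    theta_exp = np.asarray(theta_exp, dtype=float)
    theta_calc = np.asarray(theta_calc, dtype=float)
    if theta_exp.shape != theta_calc.shape:
        raise ValueError("spectra must have the same number of data points")
    numerator = np.sum((theta_exp - theta_calc) ** 2)
    denominator = np.sum(theta_exp ** 2)
    return float(np.sqrt(numerator / denominator))


# A perfect back-calculation gives zero; fitting a flat zero line gives one
perfect = nrmsd([3.0, -4.0, 1.0], [3.0, -4.0, 1.0])
flat = nrmsd([3.0, 4.0], [0.0, 0.0])
print(perfect, flat)
```

The normalisation makes the value independent of the overall scale of the spectrum, so fits to weak and strong spectra can be compared on the same footing.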
Validation and testing
The IDP175 and IDP175t reference datasets were cross-validated using the leave-one-out approach in a modified version of DichroIDP. The statistical parameters reported are the Pearson correlation coefficient (r) and the root mean square deviation (δ). The zeta (ζ) value, which is the ratio of the population standard deviation to δ, is defined (as previously reported8) as follows:
$$\zeta_{x} = \frac{\sigma_{x}}{\delta_{x}} \qquad (2)$$
where σx is the standard deviation of the calculated fractions of secondary structure x and δx is the corresponding root mean square deviation. Values of ζ ≤ 1 indicate an estimate no better than a guess, whereas values of 2–3 are statistically significant. Higher values of r and lower values of δ correspond to better cross-validation performance. The results (Table 1) were compared with the cross-validation of the SP1758, SP175t, CDPro10, and SP175+17 reference datasets using the new definitions of secondary structure classes.
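The cross-validation statistics can be sketched in Python as follows. This is an illustrative outline, not the modified DichroIDP code. The function names are hypothetical, `estimate` stands in for whichever analysis algorithm is applied, and ζ is taken as σ/δ, so that larger values indicate better performance (consistent with the thresholds quoted above). σ is assumed here to be the population standard deviation of the reference fractions.

```python
import math


def leave_one_out(spectra, fractions, estimate):
    """Predict each protein's fraction using a reference set that excludes it.

    `estimate(train_spectra, train_fractions, query_spectrum)` is a
    hypothetical callable standing in for the analysis algorithm.
    """
    predicted = []
    for i, query in enumerate(spectra):
        train_spectra = spectra[:i] + spectra[i + 1:]
        train_fractions = fractions[:i] + fractions[i + 1:]
        predicted.append(estimate(train_spectra, train_fractions, query))
    return predicted


def cross_validation_stats(reference, predicted):
    """Return (r, delta, zeta) for one secondary structure class."""
    n = len(reference)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_ref = sum(reference) / n
    mean_pred = sum(predicted) / n
    ss_ref = sum((a - mean_ref) ** 2 for a in reference)
    ss_pred = sum((b - mean_pred) ** 2 for b in predicted)
    if ss_ref == 0 or ss_pred == 0:
        raise ValueError("correlation is undefined for constant values")
    covariance = sum((a - mean_ref) * (b - mean_pred)
                     for a, b in zip(reference, predicted))
    r = covariance / math.sqrt(ss_ref * ss_pred)           # Pearson correlation
    delta = math.sqrt(sum((a - b) ** 2
                          for a, b in zip(reference, predicted)) / n)  # RMSD
    sigma = math.sqrt(ss_ref / n)                          # population std. dev.
    zeta = sigma / delta if delta > 0 else math.inf
    return r, delta, zeta


# Illustrative (made-up) helix fractions for four proteins.
reference = [0.1, 0.3, 0.5, 0.7]
predicted = [0.2, 0.4, 0.6, 0.8]
r, delta, zeta = cross_validation_stats(reference, predicted)
print(f"r = {r:.3f}, delta = {delta:.3f}, zeta = {zeta:.3f}")
```

In this made-up example every prediction is offset by the same amount, so r is high even though δ is not zero. This is why the three parameters are reported together rather than relying on any single one.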
The reference datasets were then tested for accuracy with the IDP test dataset of spectra of related and unrelated IDPs (Fig. 3 and Supplementary Table S1 [bottom]). Other proteins with [alpha + beta] contents of ≤40%, and thus significant amounts of “other” structure based on their crystal structures, were also included in the test dataset. The test results were compared to those obtained using datasets SP1758, CDPro10 and SP175 + 17 with DichroIDP and the secondary structure assignments mentioned above (Supplementary Table S5). Further comparisons were made using the results from the BeStSel12 and K2D329 servers (Supplementary Table S6) using the secondary structure assignments discussed in references 12 and 29, respectively, and also using the SESCA13 method in conjunction with the IDP175 and DSSP-F RDS (Supplementary Table S7) with the secondary structure assignments used in DichroIDP.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Acknowledgements
This work was supported by grants P024092 to BAW and P024106 to Dr. Robert W. Janes at Queen Mary University of London from the Bioinformatics and Biological Resources programme of the U.K. Biotechnology and Biological Sciences Research Council (BBSRC).
Beamtime grants that enabled collection of the SRCD spectra that comprise the IDP175 reference dataset were provided by the Institute for Synchrotron Facilities (ISA, Denmark), and the CD12 beamline at the SRS Daresbury (now decommissioned). We thank Dr. Mark Richards (formerly a student in the Wallace lab, now at the University of Leicester) for providing us with the β-b1 C-terminus spectrum, (the late) Professor Ricardo DeMarco (University of Sao Paulo, Brazil) for the MEG-14 spectrum, Professor Tharin Blumenschein (University of East Anglia) for the TARP Protein, and Christy Panethymitaki (formerly a student at Imperial College) who produced the HASP proteins and worked with the Wallace lab to obtain their SRCD spectra. We thank Dr. Jose Luis Lopes (formerly of the Wallace lab at Birkbeck, and currently a lecturer at the University of Sao Paulo, Brazil) for helpful discussions. We thank Dr. Robert Janes for his help and advice throughout this project.
Author contributions
B.A.W. conceived of, and initiated, the project. A.J.M. collected, processed and analysed SRCD spectra, and created the DichroIDP software. A.J.M. produced the DSSP secondary structure assignments, ran the self-validation analyses, created the reference datasets, ran the secondary structure analyses on the test proteins, and deposited spectra in the PCDDB. B.A.W. and A.J.M. wrote the manuscript and tested the application. E.D.D. (formerly of Dr. Robert Janes’ lab at Queen Mary University of London) helped with the AlphaFold2 analyses.
Peer review
Peer review information
Communications Biology thanks Mauricio Carbajal-Tinoco and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Gene Chong.
Data availability
The new reference dataset spectra described in this paper, and their associated metadata, have been deposited in the Protein Circular Dichroism Data Bank (PCDDB)28 (located at http://pcddb.cryst.bbk.ac.uk). They include the following proteins: HASPA, HASPB, casein, and Tarp174-222, with consecutive records CD0006406000 to CD0006409000; they are identified by the keyword “IDP175”. MEG-1433 was already present in the PCDDB with PCDDBid CD0004064000. The test protein spectra have also been deposited in the PCDDB with the following PCDDBids: osteopontin, CD0003667000; β2-microglobulin, CD0003894000; Bence Jones protein, CD0000077000; prealbumin, CD0000091000; eGFP, CD0004251000; MAGI-1PDZ1, CD0000596000; UTPase, CD0003897000; ecotin, CD0003896000; trypsin, CD0000096000; and α-lactalbumin, CD0000072000. Amelogenin, Sic1, β-B1 C-terminus1-94, pokeweed lectin and saporin have consecutive records CD0006410000 to CD0006414000 and are identified by the keyword “IDPtest”. The spectra of proteins present in the SP175 dataset are already available in the PCDDB, and are identified by the keyword “SP175”. Twelve existing SP175 entries have been updated for this project; these are identified by a “1” in the 10th position of the PCDDBid. The PCDDB accession codes for each protein, and the secondary structures that were used in creating the IDP reference dataset, are listed in Supplementary Tables S1 (top), S1 (bottom) and S2.
Code availability
The DichroIDP app is freely available for download at: https://dichroidp.cryst.bbk.ac.uk and from GitHub at https://github.com/pcddb/DichroIDPs.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s42003-023-05178-2.
References
- 1. Uversky VN, Gillespie JR, Fink AL. Why are “natively unfolded” proteins unstructured under physiologic conditions? Proteins Struct. Funct. Bioinf. 2000;41:415–427. doi: 10.1002/1097-0134(20001115)41:3<415::aid-prot130>3.0.co;2-7.
- 2. van der Lee R, et al. Classification of intrinsically disordered regions and proteins. Chem. Rev. 2014;114:6589–6631. doi: 10.1021/cr400525m.
- 3. Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol. 2005;6:197–208. doi: 10.1038/nrm1589.
- 4. Haynes C, et al. Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comp. Biol. 2006;2:e100. doi: 10.1371/journal.pcbi.0020100.
- 5. Miles AJ, Wallace BA. Synchrotron radiation circular dichroism spectroscopy of proteins and applications in structural and functional genomics. Chem. Soc. Rev. 2006;35:39–51. doi: 10.1039/b316168b.
- 6. Miles AJ, Janes RW, Wallace BA. Tools and methods for circular dichroism spectroscopy of proteins: a tutorial review. Chem. Soc. Rev. 2021;50:8400–8413. doi: 10.1039/d0cs00558d.
- 7. Whitmore L, Wallace BA. Protein secondary structure analyses from circular dichroism spectroscopy: Methods and reference databases. Biopolymers. 2008;89:392–400. doi: 10.1002/bip.20853.
- 8. Lees JG, Miles AJ, Wien F, Wallace BA. A reference database for circular dichroism spectroscopy covering fold and secondary structure space. Bioinformatics. 2006;22:1955–1962. doi: 10.1093/bioinformatics/btl327.
- 9. Abdul-Gader A, Miles AJ, Wallace BA. A reference dataset for the analyses of membrane protein secondary structures and transmembrane residues using circular dichroism spectroscopy. Bioinformatics. 2011;27:1630–1636. doi: 10.1093/bioinformatics/btr234.
- 10. Sreerama N, Venyaminov SY, Woody RW. Estimation of protein secondary structure from CD spectra: Inclusion of denatured proteins with native proteins in the analysis. Anal. Biochem. 2000;287:243–251. doi: 10.1006/abio.2000.4879.
- 11. Miles AJ, Ramalli SG, Wallace BA. DichroWeb, a website for calculating protein secondary structure from circular dichroism spectroscopic data. Protein Sci. 2021;31:37–46. doi: 10.1002/pro.4153.
- 12. Micsonai A, et al. BeStSel: A web server for accurate protein secondary structure prediction and fold recognition from the circular dichroism spectra. Nucleic Acids Res. 2018;46:W315–W322. doi: 10.1093/nar/gky497.
- 13. Nagy G, Igaev M, Jones NC, Hoffmann SV, Grubmüller H. SESCA: Predicting circular dichroism spectra from protein molecular structures. J. Chem. Theory Comput. 2019;15:5087–5102. doi: 10.1021/acs.jctc.9b00203.
- 14. Sreerama N, Woody RW. A self-consistent method for the analysis of protein secondary structure from circular dichroism. Anal. Biochem. 1993;209:32–44. doi: 10.1006/abio.1993.1079.
- 15. Provencher SW, Glöckner J. Estimation of globular protein secondary structure from circular dichroism. Biochemistry. 1981;20:33–37. doi: 10.1021/bi00504a006.
- 16. Compton LA, Johnson WC., Jr. Analysis of protein circular dichroism spectra for secondary structure using a simple matrix multiplication. Anal. Biochem. 1986;155:155–167. doi: 10.1016/0003-2697(86)90241-1.
- 17. Micsonai A, et al. Accurate secondary structure prediction and fold recognition for circular dichroism spectroscopy. Proc. Nat. Acad. Sci. 2015;112:E3095–E3103. doi: 10.1073/pnas.1500851112.
- 18. Hanson J, Paliwal K, Litfin T, Yang Y, Zhou Y. Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. Bioinformatics. 2019;35:2403–2410. doi: 10.1093/bioinformatics/bty1006.
- 19. Klausen MS, et al. NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins Struct. Funct. Bioinf. 2019;87:520–527. doi: 10.1002/prot.25674.
- 20. Källberg M, et al. Template-based protein structure modelling using the RaptorX web server. Nat. Protoc. 2012;7:1511–1522. doi: 10.1038/nprot.2012.085.
- 21. Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–589. doi: 10.1038/s41586-021-03819-2.
- 22. Burley SK, et al. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic Acids Res. 2021;49:D437–D451. doi: 10.1093/nar/gkaa1038.
- 23. Kabsch W, Sander C. Dictionary of Protein Secondary Structure: Pattern recognition of hydrogen-bonded geometrical features. Biopolymers. 1983;22:2577–2637. doi: 10.1002/bip.360221211.
- 24. Senior AW, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577:706–710. doi: 10.1038/s41586-019-1923-7.
- 25. David A, Islam S, Tankhilevich E, Sternberg MJE. The AlphaFold Database of Protein Structures: A Biologist’s Guide. J. Mol. Biol. 2022;434:167336. doi: 10.1016/j.jmb.2021.167336.
- 26. Ruff KM, Pappu RV. AlphaFold and implications for intrinsically disordered proteins. J. Mol. Biol. 2021;433:167208. doi: 10.1016/j.jmb.2021.167208.
- 27. Wilson CJ, Choy W-Y, Karttunen M. AlphaFold2: A role for disordered protein/region prediction? Int. J. Mol. Sci. 2022;23:4591. doi: 10.3390/ijms23094591.
- 28. Ramalli SG, Miles AJ, Janes RW, Wallace BA. The PCDDB (Protein Circular Dichroism Data Bank): A bioinformatics resource for protein characterisations and methods development. J. Mol. Biol. 2022;6:167441. doi: 10.1016/j.jmb.2022.167441.
- 29. Louis-Jeune C, Andrade-Navarro MA, Perez-Iratxeta C. Prediction of protein secondary structure from circular dichroism using theoretically derived spectra. Proteins. 2012;80:374–381. doi: 10.1002/prot.23188.
- 30. Panethymitaki, C. Kinetoplastid myristoyl CoA: protein N-myristoyltransferase and two substrates, the Leishmania vaccine antigen candidates, HASPA and HASPB. PhD Thesis, Imperial College London. (2005).
- 31. Micsonai A, et al. Disordered–ordered protein binary classification by circular dichroism spectroscopy. Front. Mol. Biosci. 2022;9:863141. doi: 10.3389/fmolb.2022.863141.
- 32. Bulheller BM, Hirst JD. DichroCalc – circular and linear dichroism online. Bioinformatics. 2009;25:539–540. doi: 10.1093/bioinformatics/btp016.
- 33. Lopes JLS, Orcia D, Araujo APU, DeMarco R, Wallace BA. Folding factors and partners for the intrinsically disordered protein micro-exon gene 14 (MEG-14). Biophys. J. 2013;104:2512–2520. doi: 10.1016/j.bpj.2013.03.063.
- 34. Richards, M. W. Structural studies of a Ca++ channel beta subunit using biophysical methods. PhD Thesis, Birkbeck College, University of London (2004).
- 35. Richards MW, et al. Synchrotron radiation circular dichroism and circular dichroism spectroscopic studies for the voltage-dependent calcium channel beta subunit. Biophys. J. 2002;82:456a.
- 36. Brocca S, et al. Order propensity of an intrinsically disordered protein, the cyclin-dependent-kinase inhibitor Sic1. Proteins. 2009;76:731–746. doi: 10.1002/prot.22385.
- 37. Lacruz RS, et al. Structural analysis of a repetitive protein sequence motif in strepsirrhine primate amelogenin. PLoS One. 2011;6:e18028. doi: 10.1371/journal.pone.0018028.
- 38. Rohatgi, A. WebPlotDigitizer at URL https://automeris.io/WebPlotDigitizer, Version: 4.5, (2021).
- 39. The UniProt Consortium. UniProt: The universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49:D480–D489.
- 40. Gasteiger E, et al. ExPASy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 2003;31:3784–3788. doi: 10.1093/nar/gkg563.
- 41. Sutherland, J. Circular Dichroism and the Conformational Analysis of Biomolecules. (Plenum Press, 1996). 616–618.
- 42. Anthis NJ, Clore GM. Sequence-specific determination of protein and peptide concentrations by absorbance at 205 nm. Protein Sci. 2013;22:851–858. doi: 10.1002/pro.2253.
- 43. Miles AJ, Wien F, Lees JG, Wallace BA. Calibration and standardisation of synchrotron radiation and conventional circular dichroism spectrometers. Part 2: Factors affecting magnitude and wavelength. Spectroscopy. 2005;19:43–51.
- 44. Miles AJ, Wallace BA. CDtoolX, a downloadable software package for processing and analyses of circular dichroism spectroscopic data. Protein Sci. 2018;27:1717–1722. doi: 10.1002/pro.3474.
- 45. Miles AJ, et al. Calibration and standardisation of synchrotron radiation circular dichroism and conventional circular dichroism spectrophotometers. Spectroscopy. 2003;17:653–661.
- 46. Klose DP, Wallace BA, Janes RW. 2Struc: The secondary structure server. Bioinformatics. 2010;26:2624–2625. doi: 10.1093/bioinformatics/btq480.
- 47. The Qt Company. https://www.qt.io/.
- 48. Bochkanov, S. A. ALGLIB. http://www.alglib.net.
- 49. MATLAB [7.0]. MathWorks, 2005.