Biophysical Journal. 2013 Jan 22;104(2):488–495. doi: 10.1016/j.bpj.2012.12.012

Inherent Relationships among Different Biophysical Prediction Methods for Intrinsically Disordered Proteins

Fan Jin 1, Zhirong Liu 1
PMCID: PMC3552272  PMID: 23442871

Abstract

Intrinsically disordered proteins do not have stable secondary and/or tertiary structures but still function. More than 50 prediction methods have been developed and inherent relationships may be expected to exist among them. To investigate this, we conducted molecular simulations and algorithmic analyses on a minimal coarse-grained polypeptide model and discovered a common basis for the charge-hydropathy plot and packing-density algorithms that was verified by correlation analysis. The correlation analysis approach was applied to realistic datasets, which revealed correlations among some physical-chemical properties (charge-hydropathy plot, packing density, pairwise energy). The correlations indicated that these biophysical methods find a projected direction to discriminate ordered and disordered proteins. The optimized projection was determined and the ultimate accuracy limit of the existing algorithms is discussed.

Introduction

The traditional sequence-structure-function paradigm serves as the foundation of modern protein science and has been supported by the enormous success of studies of proteins with unique three-dimensional structures. In the 1990s, a particular type of protein was discovered: proteins that have a biological function but, under physiological conditions, lack a stable native structure for all or part of their sequence. Proteins of this kind are called intrinsically disordered proteins (IDPs) (1–4), and are involved in various critical physiological processes such as the regulation of transcription and translation (5), cellular signal transmission, protein phosphorylation, and storage of small molecules (2,6).

Because of their chain flexibility, IDPs are more resistant to various perturbations and are capable of transmitting signals faster and more smoothly than ordered proteins (7,8). Bioinformatics analyses have indicated that >30% of the proteins in eukaryotic cells are IDPs (9,10) and are associated with a wide range of protein-protein interactions (11,12). However, IDPs also have some adverse effects. Many diseases have been reported to be strongly correlated with predicted IDPs. For example, one study found that ∼79% of cancer-related proteins contain disordered regions longer than 30 residues (13). Consequently, IDPs are potential drug targets (14,15).

The development of algorithms for protein disorder prediction has provided valuable tools for the study of IDPs. The algorithms are helpful in understanding the principles of protein folding and function as well as in directing laboratory experiments. More than 50 prediction methods are available as of this writing (16–18). Most of the methods are based on machine learning techniques such as artificial neural networks and support vector machines (9,19–22). These methods perform excellently in predicting IDPs, but because of their black-box nature they usually offer little explanation of the underlying mechanisms. Alternatively, the biophysical methods for predicting IDPs (1,23–27) average physico-chemical properties over the sequence to derive a state index that predicts order/disorder. These methods are usually not as accurate as the machine-learning-based methods, but they have the advantages of simplicity (which makes them faster) and a clearer physical meaning.

Various physico-chemical properties have been exploited in the biophysical methods for IDP prediction. The most intuitive biophysical description of IDPs is the charge-hydropathy plot (CH-plot) proposed by Uversky and co-workers (1,10,28): in the plane of the mean net charge versus mean hydrophobicity, ordered and disordered proteins separate into distinct regions. This property was used to develop an IDP predictor, FoldIndex (24). The mechanism underlying the CH-plot is easily understood because the order/disorder of a protein is governed by the balance between hydrophobic attractive forces and Coulombic repulsive forces. The CH-plot can be regarded as a two-dimensional physico-chemical property with good order/disorder discriminating capacity.

Another well-understood physico-chemical property adopted in IDP predictions is the pairwise energy. Dosztányi et al. (23) determined an effective 20 × 20 interresidue interaction matrix and estimated the energy of a protein based on its amino-acid composition by assuming that the pairwise contacts are completely random. They found a clear separation between the energy distributions of ordered and disordered proteins in which the estimated energy of disordered proteins was higher. An order/disorder predictor called IUPred (29) was developed based on this result. In this approach, the utilized property can be regarded as 20-dimensional.

The final physico-chemical property discussed in this article is the expected packing density of residues (25,26). The packing density of a residue in a protein structure is defined as the number of contacts that the residue has with other residues within a given distance and is similar to the concept of ligancy in chemistry. The packing density is related to the protein flexibility (30) and the protein flexibility unavoidably influences the probability of IDPs (31). It has been demonstrated that the expected packing density of a given sequence is lower for disordered proteins than for ordered proteins (25,26) and based on this result, the FoldUnfold predictor was designed (32). The expected packing density is a one-dimensional property.

Although the above three physico-chemical properties are markedly different in connotation and dimensionality (one-, two-, versus 20-dimensional), the corresponding predictors have all been widely used in IDP studies and have given good performances that are comparable. How can the markedly different physico-chemical properties used in the protein order/disorder predictors produce similar results?

As Leo Tolstoy wrote (33), “Happy families are all alike; every unhappy family is unhappy in its own way.” For IDP predictions, we could paraphrase the statement as: excellent algorithms are all alike; every poor algorithm is disappointing in its own way. In this article, using simulations on a minimal three-letter model, we compared the mechanisms of the IDP prediction algorithms based on the CH-plot, pairwise energy, and expected packing density to reveal the inherent relationships among them.

Materials and Methods

HPQ continuum model and molecular-dynamics simulations

Molecular modeling is very useful in understanding the properties of IDPs (34–37). We adopted a minimal coarse-grained HPQ continuum model described previously by Ashbaugh and Hatch (38) in molecular-dynamics simulations to describe the behaviors of protein order/disorder. The HPQ model is an extension of the conventional HP model (39) and defines three types of amino-acid residues: hydrophobic (H), uncharged polar (P), and charged polar (Q). Both hydrophobic and electrostatic interactions are considered in the model. The hydrophobic interaction is described by a Lennard-Jones-like potential function. The electrostatic interaction is modeled in an infinite solvent by a screened Coulomb potential using the Debye-Hückel theory. Details and related parameters of the HPQ model have been described previously (38).

Molecular-dynamics simulations were performed in the canonical ensemble by Langevin dynamics (40), with the same temperature (comparable to an ambient temperature of 300 K) and integration time-step as those used earlier (38). The sequence length of the polypeptide chains was fixed at N = 150. The sequences were generated randomly under the constraint of specified fractions of hydrophobic (〈H〉) and charged (〈Q〉) residues. For highly hydrophobic sequences, a random collapsed initial conformation was used; for highly polar and charged sequences, a random extended initial conformation was used. For each chain, 5 × 10^6 molecular-dynamics steps were performed for equilibration, followed by 10^7 steps for the evaluation of averages.
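The sequence-generation step can be sketched as follows. This is a minimal illustration, assuming that "specified fractions" means fixed residue counts that are then shuffled; the function name `random_hpq_sequence` and the rounding rule are assumptions, not details given in the text.

```python
import random

def random_hpq_sequence(n, frac_h, frac_q, rng=None):
    """Return a random H/P/Q string of length n with fixed residue fractions."""
    if rng is None:
        rng = random.Random()
    n_h = round(n * frac_h)          # number of hydrophobic residues
    n_q = round(n * frac_q)          # number of charged residues
    if n_h + n_q > n:
        raise ValueError("frac_h + frac_q must not exceed 1")
    residues = ["H"] * n_h + ["Q"] * n_q + ["P"] * (n - n_h - n_q)
    rng.shuffle(residues)            # random order, composition preserved
    return "".join(residues)

# Example: one chain of length N = 150 with <H> = 0.4 and <Q> = 0.2
seq = random_hpq_sequence(150, 0.4, 0.2, random.Random(0))
```

Passing a seeded `random.Random` instance makes the generated sequence reproducible.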

Radius of gyration and coil-to-globule transition

The radius of gyration (Rg) is an intuitive parameter for describing the collapse or extension of protein structures and can be regarded as a structural metric of protein order/disorder. As reported previously by Ashbaugh and Hatch (38), HPQ polypeptide chains undergo a coil-to-globule transition when Rg is measured as a function of 〈H〉. To make a quantitative analysis, we introduced a two-state formulation to empirically fit the behavior of HPQ chains,

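\[
\langle R_g \rangle = p_D R_g^{(D)} + (1 - p_D) R_g^{(O)} = \frac{R_g^{(D)} + R_g^{(O)} \exp\left(-\Delta G / k_B T\right)}{1 + \exp\left(-\Delta G / k_B T\right)}, \tag{1}
\]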

where Rg(D) is the radius of gyration for coil (disordered) state and Rg(O) is the radius of gyration for globule (ordered) state. The value pD is the probability of a chain to be in the disordered state. ΔG is the free energy difference. The values Rg(D), Rg(O), and ΔG all depend on the average hydrophobicity and charge of the sequences. A global fit to the simulation data of HPQ chains with 〈Q〉 = 0, 0.1, 0.2, …, 0.9 and with various 〈H〉 values gives the following expression at the fixed simulating temperature,

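\[
\begin{cases}
R_g^{(D)} = 9.37 + 35.9\langle Q \rangle - 18.6\langle Q \rangle^2 + \left[-10.5 + 31.6\langle Q \rangle - 31.1\langle Q \rangle^2\right]\langle H \rangle, \\
R_g^{(O)} = 2.72 + 0.19\langle Q \rangle, \\
\Delta G / k_B T = 4.81 + 15.8\langle Q \rangle + \left[-16.6 + 8.27\langle Q \rangle\right]\langle H \rangle.
\end{cases} \tag{2}
\]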

It should be noted that the HPQ chains with random sequences are not real proteins because the globule conformations are not unique structures and there is no free-energy barrier between the coil and globule states; therefore, the system is not two-state in nature. As a result, the formalism in Eqs. 1 and 2 should be regarded as merely an effective empirical method to determine whether a polypeptide chain is in a coil (disordered) or globule (ordered) state.

Datasets

To determine the relationships among different prediction methods, we created several datasets from our simulations on the HPQ model and from experimental data on real proteins.

HPQ simulation datasets

We randomly specified the hydrophobicity and charge values (〈H〉 and 〈Q〉) to generate 1000 HPQ sequences (with a constraint of 〈Q〉 < 0.6 because chains with 〈Q〉 ≥ 0.6 are all disordered for all values of 〈H〉). Molecular dynamics simulations were conducted for each sequence to obtain 〈Rg〉. Then the probability of being disordered (coil) was calculated as

\[
p_D = \frac{\langle R_g \rangle - R_g^{(O)}}{R_g^{(D)} - R_g^{(O)}}, \tag{3}
\]

where Rg(D) and Rg(O) are calculated using Eq. 2. Polypeptide chains with pD ≤ 0.5 were classified as ordered and they compose the HPQ ordered dataset, while chains with pD > 0.5 were classified as disordered and they compose the HPQ disordered dataset.
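The two-state bookkeeping of Eqs. 1 and 3 can be illustrated with a short sketch. It takes the state radii and ΔG/kBT as plain inputs rather than evaluating Eq. 2, and it assumes the sign convention in which a positive ΔG favors the disordered state; the function names are illustrative only.

```python
import math

def two_state_rg(rg_d, rg_o, dg_over_kt):
    """Eq. 1: population-weighted radius of gyration of a two-state chain."""
    # Statistical weight of the ordered state relative to the disordered one
    # (assumed convention: positive dg_over_kt favors disorder).
    w = math.exp(-dg_over_kt)
    return (rg_d + rg_o * w) / (1.0 + w)

def p_disordered(rg, rg_d, rg_o):
    """Eq. 3: invert Eq. 1 to get the probability of the disordered state."""
    return (rg - rg_o) / (rg_d - rg_o)

def classify(rg, rg_d, rg_o):
    """Chains with pD <= 0.5 are ordered; chains with pD > 0.5 are disordered."""
    return "disordered" if p_disordered(rg, rg_d, rg_o) > 0.5 else "ordered"

# Round trip with made-up state radii: Eq. 3 recovers the pD implied by Eq. 1.
rg = two_state_rg(10.0, 3.0, 2.0)
p = p_disordered(rg, 10.0, 3.0)
```

In the simulations the measured 〈Rg〉 takes the place of `rg`, so `p_disordered` is the only step needed for classification once Rg(D) and Rg(O) are known.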

Experimental SCOP dataset

The experimental dataset of ordered proteins was obtained from the SCOP database (Ver. 1.75, June 2009) (41). Four SCOP classes (all α; all β; α+β; α/β) were considered and redundant sequences with >30% sequence identity were removed. The final SCOP dataset contained 2005 proteins.

Experimental DisProt datasets

The experimental dataset of disordered proteins is based on DisProt (Ver. 5.6, January 2011) (42), which contained 638 proteins with 1368 disordered regions. We defined a residue to be disordered if, in DisProt, the residue was annotated to be disordered at least once. The disordered ratio of each protein in the DisProt dataset was calculated for use in our analysis.
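The disordered ratio described above can be computed as the fraction of residues covered by at least one annotated region. The sketch below assumes that regions are given as 1-based inclusive (start, end) pairs; that input format is an assumption for illustration, not the DisProt file layout.

```python
def disorder_ratio(length, regions):
    """Fraction of residues annotated as disordered at least once.

    regions: iterable of (start, end) pairs, 1-based and inclusive (assumed format).
    """
    disordered = [False] * length
    for start, end in regions:
        # Clip each region to the sequence and mark its residues.
        for i in range(max(start, 1), min(end, length) + 1):
            disordered[i - 1] = True
    return sum(disordered) / length

# Two overlapping regions on a 100-residue protein cover residues 1-50.
ratio = disorder_ratio(100, [(1, 30), (21, 50)])
```

Overlapping annotations are counted once, which matches the "at least once" rule.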

CH-plot and packing-density prediction algorithms for the HPQ simulation datasets

The CH-plot and packing-density prediction algorithms were developed for the HPQ simulation datasets to reveal the relationships between the two algorithms.

In the CH-plot algorithm, polypeptide chains in the HPQ ordered and disordered datasets were mapped into the (〈H〉,〈Q〉) space, and an optimal straight boundary line was determined to separate the ordered and disordered chains. The boundary line then acts as a prediction criterion to predict a chain to be ordered or disordered, according to the side of the boundary upon which it lies. We used the linear classifier in the statistical pattern recognition toolbox for MATLAB (43) to determine the optimal boundary line.
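As an illustration of fitting a straight boundary in the (〈H〉,〈Q〉) plane, the sketch below uses a Fisher linear discriminant written in plain Python. This is a stand-in for the MATLAB toolbox classifier used in the study, not a reproduction of it, and the data points are invented.

```python
def mean2(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter2(points, mu):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix about the mean mu."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    return sxx, sxy, syy

def fisher_boundary(ordered, disordered):
    """Return (w, b) such that w[0]*H + w[1]*Q + b > 0 predicts 'disordered'."""
    mu_o, mu_d = mean2(ordered), mean2(disordered)
    so, sd = scatter2(ordered, mu_o), scatter2(disordered, mu_d)
    sxx, sxy, syy = so[0] + sd[0], so[1] + sd[1], so[2] + sd[2]
    det = sxx * syy - sxy * sxy
    dx, dy = mu_d[0] - mu_o[0], mu_d[1] - mu_o[1]
    # w = Sw^{-1} (mu_d - mu_o), using the explicit inverse of a 2x2 matrix
    w = ((syy * dx - sxy * dy) / det, (-sxy * dx + sxx * dy) / det)
    mid = ((mu_o[0] + mu_d[0]) / 2, (mu_o[1] + mu_d[1]) / 2)
    b = -(w[0] * mid[0] + w[1] * mid[1])   # boundary passes through the midpoint
    return w, b

def predict(w, b, h, q):
    return "disordered" if w[0] * h + w[1] * q + b > 0 else "ordered"

# Invented (H, Q) points, for illustration only
ordered_pts = [(0.6, 0.05), (0.7, 0.1), (0.65, 0.0), (0.8, 0.15)]
disordered_pts = [(0.2, 0.3), (0.1, 0.4), (0.3, 0.5), (0.25, 0.2)]
w, b = fisher_boundary(ordered_pts, disordered_pts)
```

The line w·x + b = 0 plays the role of the boundary: a chain is assigned to whichever side its (〈H〉,〈Q〉) point falls on.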

In the packing-density algorithm, the average of the expected packing density was calculated in a window that was moved along the sequence; a run of successive residues at least as long as the window size and with an average packing density smaller than a critical value was regarded as disordered. When the predicted disordered residue ratio was smaller/larger than 0.5, the chain was classified as ordered/disordered in a binary manner. The window size and the critical packing-density value were optimized to minimize the error for the HPQ ordered and disordered datasets. This method is an analog of the original version that was developed for real proteins (25). The packing density of the three kinds of residues (H, P, Q) was determined in advance from simulations on 50 sequences with highly collapsed conformations (〈H〉 > 0.7, 〈Q〉 < 0.15). A critical distance of 8 Å was used to calculate the packing density.
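A simplified sketch of the windowed packing-density prediction follows. It marks every residue of any window whose mean expected packing density falls below the critical value, which is one plausible reading of the procedure rather than the exact implementation. The per-residue densities in `DENSITY` are hypothetical placeholders; the window size (55) and critical value (52.8) are the optimized values reported for the HPQ datasets.

```python
# Hypothetical per-residue expected packing densities (not the values from the study)
DENSITY = {"H": 60.0, "P": 50.0, "Q": 45.0}

def predict_order(seq, density=DENSITY, window=55, critical=52.8):
    """Return (label, disordered_ratio) for an H/P/Q sequence."""
    values = [density[r] for r in seq]
    n = len(values)
    disordered = [False] * n
    if n >= window:
        total = sum(values[:window])
        for start in range(n - window + 1):
            if start > 0:
                # Slide the window by one residue using a running sum.
                total += values[start + window - 1] - values[start - 1]
            if total / window < critical:
                for i in range(start, start + window):
                    disordered[i] = True
    ratio = sum(disordered) / n
    # A ratio of exactly 0.5 is treated as ordered here (tie rule is an assumption).
    return ("disordered" if ratio > 0.5 else "ordered"), ratio

label_h, ratio_h = predict_order("H" * 150)
label_q, ratio_q = predict_order("Q" * 150)
```

Optimizing `window` and `critical` then amounts to scanning both parameters and keeping the pair that gives the fewest misclassified chains.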

Results

Order/disorder predictions in the HPQ model

We examined a minimal coarse-grained HPQ continuum model (38), which is capable of describing the hydrophobic and electrostatic interactions in polypeptides, to search for possible clues to why markedly different physico-chemical properties are comparably good in predicting protein order/disorder.

We conducted molecular-dynamics simulations to determine the conformational properties of HPQ polypeptide chains with various fractions of hydrophobic and charged residues (〈H〉,〈Q〉). Consistent with a previous study (38), the chains underwent a coil-to-globule transition with increasing hydrophobicity as measured by the radius of gyration (〈Rg〉) (Fig. 1). We used a two-state formalism as described in Materials and Methods to empirically describe the simulation data, with the fitting results plotted as solid lines in Fig. 1. The global agreement between the fit and the simulation data is satisfactory. The formalistic description provides an effective way to determine the ordered/disordered state of each HPQ polypeptide chain from the simulated 〈Rg〉 value.

Figure 1.

Average radius of gyration (〈Rg〉) of HPQ chains with sequence length N = 150 as a function of the hydrophobic-residue fraction (〈H〉) when the charged-residue fraction is: (from bottom to top) 〈Q〉 = 0.0, 0.1, 0.2, …, 0.9. (Points) Simulation results averaged over five random sequences. (Lines) Fit of the simulation results as given by Eqs. 1 and 2.

We then generated 1000 random sequences with randomly specified 〈H〉 and 〈Q〉 values, and conducted molecular-dynamics simulations to calculate the 〈Rg〉 for each chain. The obtained 〈Rg〉 values were combined with the two-state formalism to determine whether the chain was ordered (globule) or disordered (coil) as described in Materials and Methods. The resulting ordered and disordered chains comprise the HPQ ordered and disordered datasets, respectively.

With the constructed HPQ datasets, we developed the CH-plot and the packing-density algorithms for order/disorder prediction. The results are summarized in Fig. 2. In the CH-plot, a sharp boundary between ordered and disordered chains is clearly seen. The corresponding algorithm based on the optimized straight boundary line possesses excellent performance in order/disorder prediction, giving only 16 wrong predictions (i.e., mistakes) among 1000 chains. For the packing-density algorithm, the mistake ratio (68:1000) was quite good, but not as good as the mistake ratio for the CH-plot method. This situation was reversed for real proteins where the packing-density algorithm performs better than the CH-plot methods (25). This result encouraged us to investigate the possible reasons for these observations.

Figure 2.

Performance of the CH-plot and packing-density algorithms on the HPQ model. The ordered and disordered datasets are shown (circles and rectangles), respectively. (a) Performance of the CH-plot with the determined boundary shown (solid line). (b) Performance of the packing-density algorithm: (open symbols) the successful predictions; (solid symbols) the false predictions. The optimized window size was 55 and the critical packing-density value was 52.8.

Relationships among three order/disorder algorithms

The results described above for the CH-plot and packing-density algorithms on the HPQ model provided valuable clues for us to understand the relationships among the various biophysical order/disorder algorithms. When a mean-field approximation is adopted, that is, the properties of a chain are assumed to be solely determined by the amino-acid composition of the sequence, an HPQ chain is completely described by two variables, 〈H〉 and 〈Q〉. In such a situation, the order/disorder behaviors of HPQ chains can be represented as a phase diagram in the (〈H〉,〈Q〉) plane. The results displayed in Fig. 2 a indicated that the boundary between ordered and disordered phases is approximately a straight line. Consequently, the CH-plot algorithm that optimizes a straight boundary is highly accurate. On the other hand, in the packing-density algorithm, the (〈H〉,〈Q〉) of a chain is mapped to a one-dimensional variable, the expected packing density D, which is used to discriminate order/disorder and is calculated as

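\[
D(\langle H \rangle, \langle Q \rangle) = D^{(H)}\langle H \rangle + D^{(Q)}\langle Q \rangle + D^{(P)}\left(1 - \langle H \rangle - \langle Q \rangle\right), \tag{4}
\]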

where D(H), D(Q), and D(P) are the packing density values of the H, Q, and P residues, respectively. Equation 4 is a linear function of 〈H〉 and 〈Q〉, and accordingly, the contour of D(〈H〉,〈Q〉) is composed of parallel lines as shown in Fig. 3 a. In the packing-density algorithm, a critical D value is used to discriminate the ordered and disordered polypeptides so that the predicted phase boundary for the algorithm is also a straight line under the mean-field approximation. If the packing-density algorithm works reasonably well, its predicted boundary should be close to the CH-plot boundary.
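The linearity argument can be made concrete with a small sketch: the expected packing density of Eq. 4, the straight boundary obtained by setting it equal to a critical value, and the direction normal to every contour line. The residue densities used in the example are hypothetical; only the critical value 52.8 comes from the HPQ optimization.

```python
def expected_packing_density(h, q, d_h, d_q, d_p):
    """Eq. 4: mean-field expected packing density at composition (h, q)."""
    return d_h * h + d_q * q + d_p * (1.0 - h - q)

def boundary_q(h, d_crit, d_h, d_q, d_p):
    """Solve D(h, q) = d_crit for q: a straight line in the (H, Q) plane."""
    return (d_crit - d_p - (d_h - d_p) * h) / (d_q - d_p)

def boundary_normal(d_h, d_q, d_p):
    """Gradient of D with respect to (h, q); normal to all contour lines."""
    return (d_h - d_p, d_q - d_p)

# Hypothetical residue densities, for illustration only
D_H, D_Q, D_P = 60.0, 45.0, 50.0
q_on_boundary = boundary_q(0.5, 52.8, D_H, D_Q, D_P)
```

Because the gradient does not depend on (〈H〉,〈Q〉), all contours are parallel, and changing the critical value only shifts the predicted boundary without rotating it.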

Figure 3.

Relationships between the CH-plot and packing-density algorithms in the HPQ model. (a) The CH-plot boundary (red) and the packing-density contour (blue) in the (〈H〉, 〈Q〉) plane. (Thick line) Contour line with the critical packing-density value. (Arrows) Lines normal to the boundaries in the CH-plot and packing-density algorithms. (b and c) Correlations between the packing density and the CH-plot projection (b) for three residues and (c) at the polypeptide level.

Note that the D value at (〈H〉,〈Q〉) can be regarded as the projection of (〈H〉,〈Q〉) normal to the predicted boundary (indicated by the blue arrow in Fig. 3 a). Similarly, when (〈H〉,〈Q〉) is projected normal to the CH-plot boundary (indicated by the red arrow in Fig. 3 a), the projection (hereafter called the CH projection) should correlate with the packing-density value. In Fig. 3 b, the CH projection and the packing density of three residues are shown. Clearly, there is good correlation between them. When the data are presented at the polypeptide level (Fig. 3 c), a similar correlation is observed.
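The projection and the correlation test can be sketched as follows. The normal direction and the residue packing densities are hypothetical numbers chosen for illustration; the three residue types are represented by their pure compositions (〈H〉,〈Q〉) = (1,0), (0,0), and (0,1).

```python
import math

def ch_projection(h, q, normal):
    """Project the point (h, q) onto the unit vector along `normal`."""
    nx, ny = normal
    return (h * nx + q * ny) / math.hypot(nx, ny)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical normal to the CH-plot boundary and hypothetical residue densities
normal = (1.0, -0.6)
points = {"H": (1.0, 0.0), "P": (0.0, 0.0), "Q": (0.0, 1.0)}
density = {"H": 60.0, "P": 50.0, "Q": 45.0}

proj = [ch_projection(h, q, normal) for h, q in points.values()]
r = pearson(proj, list(density.values()))
```

The same two functions apply at the polypeptide level by replacing the three pure compositions with the (〈H〉,〈Q〉) of each chain and its expected packing density.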

The insights obtained in the HPQ model were applied to real systems. Correlations between the packing density and the CH projection for both real residues and proteins were plotted and are shown in Fig. 4. At the residue level, the correlation was moderate (R = 0.46) and not as evident as in the HPQ model. At the protein level, however, the correlation was enhanced (R = 0.70). These features reflect the complexity of real systems that are 20-dimensional rather than two-dimensional, and may also originate from the fact that these algorithms were trained on proteins and not on residues. Despite the differences, the correlations shown in Fig. 4 clearly indicated an inherent relationship between the CH-plot and packing-density algorithms.

Figure 4.

Correlations between the CH-plot and packing-density algorithms in real systems. (a) Correlation at the 20-residue level. (b) Correlation at the protein level in the SCOP (blue circles) and DisProt (red rectangles) datasets. (Straight lines) Linear fits of data; the correlation coefficients are also shown. To reduce the overwhelming number of SCOP data points, the same numbers of SCOP and DisProt data points were used in the global linear fit.

This analysis can be extended to other biophysical algorithms for order/disorder prediction. For the pairwise-energy algorithm (23), the physico-chemical property that is used is characterized by a 20 × 20 interaction matrix. To make a correlation analysis at the residue level, we used principal component analysis (44) to extract a dominant component from the matrix. At the protein level, the calculated total energy without the principal component analysis reduction was compared directly with the packing density or with the CH projection. The results of this analysis showed that the correlation between the pairwise-energy and packing-density algorithms was remarkable at both the residue and protein levels (Fig. 5, a and b). In comparison, the correlation between the pairwise-energy and the CH-plot algorithms was lower (Fig. 5, c and d), and similar to the correlation between the packing-density and CH-plot algorithms (Fig. 3, b and c).
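One simple way to reduce a symmetric interaction matrix to a single per-residue scale is to take its leading eigenvector, sketched here with power iteration. Whether the matrix was centered or scaled before the principal component analysis is not stated in the text, so this is only an assumed variant; the 3 × 3 matrix is a toy stand-in for the 20 × 20 one.

```python
import math

def dominant_component(matrix, iterations=500):
    """Power iteration: eigenvalue of largest magnitude and its unit eigenvector."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the converged unit vector.
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * mv[i] for i in range(n))
    return eigenvalue, v

# Toy symmetric "interaction matrix" (invented numbers)
toy = [[-2.0, -1.0, 0.5],
       [-1.0, -1.5, 0.2],
       [0.5, 0.2, 0.3]]
lam, vec = dominant_component(toy)
```

Each entry of `vec` then serves as a one-dimensional property of the corresponding residue type, which can be correlated with the packing density or the CH projection.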

Figure 5.

Relationships between the pairwise-energy algorithm and other algorithms. (a and b) Correlation between the pairwise-energy and packing-density algorithms. (c and d) Correlation between the pairwise-energy and CH-plot algorithms. The first principal component of the pairwise-energy matrix was used in the analysis at the residue level (a and c). SCOP and DisProt datasets (blue circles and red rectangles), respectively (b and d).

Optimized projection and ultimate performance of biophysical prediction methods

The above analysis of the packing density, CH-plot, and pairwise-energy algorithms revealed that the common essence of these methods is to find a projected direction in the 20-dimensional amino-acid composition space that can be used to discriminate ordered and disordered proteins. The directions that are used in effective algorithms are unavoidably close to each other and show high correlations. With this understanding, it follows that, for any given dataset, there should exist a best projected direction that determines the ultimate performance of all biophysical prediction methods with similar underlying characteristics.

We used the DisProt dataset to approximately determine the best projected direction. The disorder ratios of the proteins in DisProt were expressed as a linear function of their amino-acid compositions, that is, as a dot product between the optimized projection vector and the 20-dimensional amino-acid composition. Thus, the determination of the optimized projection is converted into a linear fit problem that can be easily solved. The optimized projection values for the 20 amino-acid residues are shown in Table 1. The Lys residue had the highest value, indicating a strong disorder propensity; the Trp residue had the lowest value.

Table 1.

The optimized projection values for the 20 amino-acid residues

Amino-acid residue | Optimized projection | Amino-acid residue | Optimized projection
Cys | −2.41 | Thr | 0.91
Met | 1.44 | Ser | 1.38
Phe | −0.83 | Gln | 1.02
Ile | −1.36 | Asn | −1.63
Leu | −1.41 | Glu | 1.59
Val | −2.03 | Asp | 1.21
Trp | −2.79 | His | 0.27
Tyr | −1.57 | Arg | 0.42
Ala | 0.53 | Lys | 1.85
Gly | 0.90 | Pro | 1.66

The optimized projection was determined by solving a linear fit between the disorder ratio and the amino-acid composition of the proteins in DisProt with the optimized projection as the parameter vector.
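The linear fit behind the optimized projection can be sketched as an ordinary least-squares problem without an intercept, solved here through the normal equations in plain Python. A three-letter alphabet and synthetic data are used for brevity; the helper names are illustrative, and the actual fit was over 20 residue types and the DisProt disorder ratios.

```python
def composition(seq, alphabet):
    """Fraction of each residue type of `alphabet` in `seq`."""
    return [seq.count(a) / len(seq) for a in alphabet]

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_projection(compositions, ratios):
    """Least-squares w minimizing sum over proteins of (w . c - ratio)^2."""
    k = len(compositions[0])
    ata = [[sum(c[i] * c[j] for c in compositions) for j in range(k)] for i in range(k)]
    atb = [sum(c[i] * r for c, r in zip(compositions, ratios)) for i in range(k)]
    return solve_linear(ata, atb)

# Synthetic check: recover a known projection vector from noiseless data.
true_w = [-1.0, 0.5, 2.0]
comps = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0.5, 0.5, 0], [0.2, 0.3, 0.5]]
ratios = [sum(w * c for w, c in zip(true_w, comp)) for comp in comps]
fitted = fit_projection(comps, ratios)
```

No separate intercept is needed because the composition fractions sum to one, so a constant offset can be absorbed into the projection values.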

We quantitatively compared the performance of different physico-chemical properties in order/disorder prediction by adopting a procedure as described here. Proteins in DisProt with sequences longer than 100 residues and disordered ratio larger than 0.5 were used as the positive set (disordered set), while the SCOP dataset was adopted as the negative set (ordered set). To make order/disorder prediction using a specified residue property, we calculated the average of the property over the sequence residues (without considering factors such as window averaging) for every protein and used it as the order/disorder indicator. If the calculated value of a protein was smaller than a critical value, it was predicted to be disordered. Otherwise, it was predicted to be ordered (or vice versa, depending on the specified property). The prediction performance was measured by the average between the false-positive ratio and the false-negative ratio. The resulting performances of different physico-chemical properties are shown in Fig. 6 as a function of the used critical value.
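The performance measure can be sketched as a scan over critical values, scoring each by the average of the false-negative and false-positive ratios. The example assumes the "smaller than critical means disordered" direction; the scores are invented, and the choice of candidate thresholds (midpoints between sorted scores) is an assumption made for the sketch.

```python
def false_ratio(pos_scores, neg_scores, critical):
    """Average of false-negative and false-positive ratios.

    A protein is predicted disordered when its score is below `critical`.
    pos_scores: disordered (positive) set; neg_scores: ordered (negative) set.
    """
    fn = sum(s >= critical for s in pos_scores) / len(pos_scores)
    fp = sum(s < critical for s in neg_scores) / len(neg_scores)
    return 0.5 * (fn + fp)

def best_critical(pos_scores, neg_scores):
    """Return (lowest false ratio, critical value) over candidate thresholds."""
    scores = sorted(set(pos_scores) | set(neg_scores))
    # Candidate cuts: midpoints between neighboring scores, plus both extremes.
    cuts = [scores[0] - 1.0]
    cuts += [(a + b) / 2 for a, b in zip(scores, scores[1:])]
    cuts += [scores[-1] + 1.0]
    return min((false_ratio(pos_scores, neg_scores, c), c) for c in cuts)

# Invented sequence-averaged property values
pos = [1.0, 2.0, 3.0, 6.0]   # disordered proteins
neg = [4.0, 5.0, 7.0, 8.0]   # ordered proteins
best = best_critical(pos, neg)
```

For properties where larger values indicate disorder, the same functions can be used after negating the scores.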

Figure 6.

Performance of various physico-chemical properties in order/disorder prediction. Scales for five properties are shown at the bottom of the figure. The properties (denoted as x) were also normalized into a comparable scale (x′) using the mean value (〈x〉) and the distribution width (σx) of the positive (disordered) set: x′ = (x − 〈x〉)/σx, which was adopted in aligning the different curves.

The best performance of each method and the distance from the accuracy limit are summarized in Table 2. Comparatively, the CH projection was the least accurate. The accuracies of the packing-density and pairwise-energy algorithms were high and similar to each other as well as to the ultimate performance given by the optimized projection. This result is consistent with the high correlation that was seen between the packing-density and pairwise-energy algorithms (Fig. 5). The advantage of the optimized projection with respect to the other properties was very small (for example, its best false ratio was only 1.2 percentage points lower than that of the pairwise energy), suggesting that the existing biophysical algorithms have achieved close to their ultimate performance. In Fig. 6, we also included the analysis of another amino-acid propensity scale, TOP-IDP, that was directly optimized by Campen et al. (45) to discriminate between ordered and disordered proteins. The accuracy of TOP-IDP is slightly higher than that of the packing-density and pairwise-energy algorithms, but slightly lower than that of the optimized projection (possibly because different datasets were used).

Table 2.

The best performance of each method and the distance from the accuracy limit

Method | Best performance (false ratio, %) | Distance to optimized projection (%)
Packing density | 16.36 | 3.08
CH-plot | 22.12 | 8.84
Pairwise energy | 14.48 | 1.20
TOP-IDP | 14.28 | 1.00
Optimized projection | 13.28 | 0.00

The best performance of each prediction method was defined as the lowest false ratio under the optimal critical value. The distance from the accuracy limit was measured as the difference between the best performance of each method and that of optimized projection.

Extended analysis: the amyloid propensity

Protein fibrillogenesis requires relatively unfolded conformations (46). Interestingly, the packing density has been found to be an efficient parameter for amyloid prediction (25,26), suggesting that there may be a close relationship between the protein order/disorder and amyloid prediction algorithms. We applied the correlation analysis approach to examine this further. We used the amyloid propensity of the 20 amino-acid residues at pH 7.0 from a previous study (47) and calculated their correlation with various properties that are used in order/disorder prediction (Fig. 7). We found that the amyloid propensity correlated with all the examined properties of order/disorder, with correlation coefficients between 0.64 and 0.85. These results supported the concept that the mechanism of amyloid fibril formation is closely related to that of protein order/disorder (25), suggesting that the interaction between the amyloid and protein order/disorder algorithms deserves further investigation (for example, to apply order/disorder algorithms in amyloid prediction, and vice versa).

Figure 7.

Correlations between the amyloid propensity and various physico-chemical properties used in order/disorder prediction.

Discussion

The three physico-chemical properties analyzed above (CH-plot, packing density, pairwise energy) look markedly different. They even differ in dimensionality (two-, one-, and 20-dimensional, respectively). However, the analysis revealed that they are inherently connected with each other and that there exists a common basis for the prediction algorithms employing them. Therefore, these properties are not independent in protein disorder/flexibility prediction. Although our analysis was conducted on only a few properties, we expect that similar inherent relationships exist for other properties used in order/disorder predictors, whether or not machine learning algorithms are adopted. Studies of such inherent relationships may provide valuable insights into the underlying mechanisms of the algorithms and help improve prediction accuracy.

The IDP prediction algorithms considered in this work all belong to the class of biophysical methods, which calculate the average of certain amino-acid properties over the sequence as an index of order/disorder. The simplicity and clear meaning of this class of methods make it more feasible to reveal the inherent relationships among them, which were then utilized to develop deeper analyses such as the optimized projection and the ultimate performance. Generally speaking, it would be interesting to include more sophisticated machine learning methods. For example, Fig. 6 can be generated for any prediction method by averaging over all positions. However, the black-box nature of machine learning methods makes it more difficult to extract their underlying dominant factor. Therefore, we did not pursue this aim in this study.

Conclusion

In conclusion, we have studied the inherent relationships among some of the biophysical algorithms for protein order/disorder prediction. The molecular simulations conducted on an HPQ continuum polypeptide model with only three types of residues showed that the relationships among different algorithms could be revealed by a correlation analysis. When the correlation analysis approach was applied to the SCOP and DisProt datasets, an obvious correlation was observed among the CH-plot, the packing density, and the pairwise energy at both the residue and protein levels. An optimized projection was determined in the order/disorder phase space as an estimated ultimate limit of the biophysical algorithms. Further, we have shown that the existing algorithms are quite close to their limit.

Acknowledgments

The authors thank Daqi Yu, Ning Yin, and Shuangyu Bi for insightful discussions.

This work was supported by the National Natural Science Foundation of China (grants No. 20973016 and No. 11021463) and the Ministry of Science and Technology of China (grant No. 2009CB918500).


Articles from Biophysical Journal are provided here courtesy of The Biophysical Society
