Protein Science: A Publication of the Protein Society. 2022 Dec;31(12):e4496. doi: 10.1002/pro.4496

Rapid prediction and analysis of protein intrinsic disorder

Guy W Dayhoff II, Vladimir N Uversky
PMCID: PMC9679974  PMID: 36334049

Abstract

Protein intrinsic disorder is found in all kingdoms of life and is known to underpin numerous physiological and pathological processes. Computational methods play an important role in characterizing and identifying intrinsically disordered proteins and protein regions. Herein, we present a new high‐efficiency web‐based disorder predictor named Rapid Intrinsic Disorder Analysis Online (RIDAO) that is designed to facilitate the application of protein intrinsic disorder analysis in genome‐scale structural bioinformatics and comparative genomics/proteomics. RIDAO integrates six established disorder predictors into a single, unified platform that reproduces the results of individual predictors with near‐perfect fidelity. To demonstrate the potential applications, we construct a test set containing more than one million sequences from one hundred organisms comprising over 420 million residues. Using this test set, we compare the efficiency and accessibility (i.e., ease of use) of RIDAO to five well‐known and popular disorder predictors, namely: AUCpreD, IUPred3, metapredict V2, flDPnn, and SPOT‐Disorder2. We show that RIDAO yields per‐residue predictions at a rate two to six orders of magnitude greater than the other predictors and completely processes the test set in under an hour. RIDAO can be accessed free of charge at https://ridao.app.

Keywords: comparative genomics, comparative proteomics, disorder analysis, disorder prediction, intrinsically disordered proteins, intrinsically disordered regions, protein intrinsic disorder, structural bioinformatics

1. INTRODUCTION

Anfinsen's classic work on bovine pancreatic ribonuclease 1 helped establish the protein folding problem. 2 Since then, extensive work has been done to decipher the protein folding code and build computer models capable of predicting a protein's native structure based solely on its amino acid sequence. While some proteins require a unique and nearly time‐independent structure before becoming functional, most proteins exhibit some degree of time‐dependent structural heterogeneity in their native, functional state due to intrinsic disorder. 3 Protein intrinsic disorder can manifest globally or locally giving rise to intrinsically disordered proteins (IDPs) and intrinsically disordered protein regions (IDPRs), respectively. 4 These proteins and protein regions have been found in all kingdoms of life, 5 , 6 , 7 , 8 , 9 , 10 , 11 often serve as actors in important physiological processes, 3 , 12 , 13 , 14 and are implicated in numerous human diseases. 15 , 16 , 17 , 18 , 19 , 20 IDPs and IDPRs have challenged the structure–function paradigm, 21 , 22 , 23 complicated the protein folding problem, 24 , 25 , 26 and hindered structural proteomics studies. 27

There are significant differences between the amino acid sequences of proteins with stable, well‐defined structures and IDPs. 28 These differences, which include attributes, such as hydrophobicity, content of charged residues, 29 and sequence complexity, 30 serve as the basis for predicting if a protein is expected to be wholly or partially disordered under physiological conditions. To date, more than a hundred disorder predictors have been developed. 31 Some predictors, for example, the Predictors of Natural Disordered Regions (PONDRs), 30 produce disorder profiles by providing continuous per‐residue disorder propensity predictions from N‐termini to C‐termini (Figure 1). Per‐residue predictors, like the PONDRs, are useful for identifying IDPRs. Other predictors, including charge‐hydropathy (CH) plots, 29 and those based on cumulative distribution functions (CDFs), 11 yield binary predictions that indicate whether a given sequence is expected to be wholly disordered or ordered under physiological conditions. These predictors, and binary disorder predictors in general, are used to identify IDPs.

FIGURE 1.

Human p53 Rapid Intrinsic Disorder Analysis Online (RIDAO) disorder profile. Plotting the residue number against disorder score, also known as disorder propensity, for all residues in a sequence yields a disorder profile. Residues with a disorder score exceeding 0.5 are predicted to be disordered. Residues with a disorder score below 0.5 are predicted to be ordered. The 0.5 threshold is represented by a dashed line. Here we show a complete RIDAO disorder profile, including the mean disorder profile (MDP) error, PONDRs VLXT, VL3, and VSL2, as well as IUPred‐Long, IUPred‐Short, and PONDR‐FIT

Disorder predictors have a history of being implemented into web‐servers and made available to the public free of charge. However, with a few exceptions, 32 , 33 these servers are not designed with genome‐scale structural bioinformatics or comparative genomics/proteomics in mind. We previously developed a web‐crawler style disorder predictor for internal usage that was implemented into a tool we refer to as Disorder Spider (DiSpi) in our previous studies, see for example, refs. 34 , 35 , 36 , 37 , 38 , 39 DiSpi works by first aggregating disorder profiles from six well‐known disorder predictors: PONDR VLXT, 40 PONDR VL3, 41 PONDR VSL2B, 42 PONDR‐FIT, 43 IUPred‐Short, and IUPred‐Long. 44 Then, a mean disorder profile (MDP) is computed along with the standard error. The final output consists of all seven sets of predictions along with the error computed from the MDP as well as a disorder profile plot (Figure 1). Additionally, DiSpi performs CH‐CDF analysis 45 , 46 , 47 and yields CH‐CDF plots (Figure 2), which enables rapid discrimination between flavors of disorder. 48 Despite the fact that DiSpi was designed to facilitate the application of disorder prediction in comparative genomics, it suffers from the same computational bottlenecks as the web‐servers it crawls.
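The aggregation step described above can be illustrated with a minimal sketch. It assumes that the six per‐residue profiles are already available as equal‐length lists and that the standard error is the sample standard deviation divided by the square root of the number of predictors; the text does not state which estimator is used, and the function name `mean_disorder_profile` is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_disorder_profile(profiles):
    """Return (mdp, sem): per-residue mean and standard error across predictors.

    `profiles` is a list of per-residue score lists, one per predictor.
    Assumption: standard error = sample standard deviation / sqrt(n predictors).
    """
    if len(profiles) < 2:
        raise ValueError("at least two profiles are needed for a standard error")
    if len({len(p) for p in profiles}) != 1:
        raise ValueError("all profiles must cover the same number of residues")

    n = len(profiles)
    mdp, sem = [], []
    # zip(*profiles) walks the sequence residue by residue, yielding the
    # scores that every predictor assigned to that residue.
    for scores in zip(*profiles):
        mdp.append(mean(scores))
        sem.append(stdev(scores) / sqrt(n))
    return mdp, sem


if __name__ == "__main__":
    toy_profiles = [[0.2, 0.8], [0.4, 0.6]]  # two predictors, two residues
    print(mean_disorder_profile(toy_profiles))
```

Residues whose mean score exceeds 0.5 would then be read as disordered, in line with the threshold used in Figure 1.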

FIGURE 2.

Charge‐hydropathy (CH)‐CDF plots. Plots of CH versus cumulative distribution function (CDF) predictions yield multinomial classification of protein disorder. Classification is based on which quadrant a protein is placed in. When both methods predict disorder, the protein is found in quadrant 4 and is predicted to be disordered. Alternatively, when both methods predict a protein to be well‐structured it is found in quadrant 2 and is predicted to be structured. If the two methods disagree, then the protein is predicted to be “mixed” or “rare” and is found in quadrant 3 or quadrant 1, respectively. Here, we present 100 mapped proteins pulled from the human proteome (left) and slime mold proteome (right)

In this work, we construct a high‐efficiency unified disorder predictor based on DiSpi that does not rely on web‐crawling. This command‐line predictor, henceforth referred to as Rida, subsequently serves as the basis for the development of a second, web‐based, disorder predictor. Apart from rounding errors, the disorder scores produced by these two unified predictors exactly match those produced by each of their original, stand‐alone counterparts—PONDR VLXT, PONDR VSL2B, PONDR VL3, PONDR‐FIT, IUPred‐Short, and IUPred‐Long. Herein, we present RIDAO, an easy to use, publicly accessible, high‐efficiency, web‐based disorder predictor. We compare the accessibility and computational efficiency of RIDAO with five powerful and popular disorder predictors 49 —AUCpreD, 50 flDPnn, 51 IUPred3, 52 metapredict V2, 53 and SPOT‐Disorder2. 54 Using a test set containing more than one million sequences comprising over 420 million residues, we find that RIDAO is the most efficient predictor, analyzing the complete dataset in under an hour with a per‐residue prediction rate (Equation (1)) at least two orders of magnitude greater than the other predictors.

2. RESULTS AND DISCUSSION

2.1. Implementation

Rida is a single‐threaded C program that is designed to run on 1 CPU core. RIDAO, on the other hand, is a Ruby on Rails web application with a C backend that implements the logic of Rida. Unlike Rida, the RIDAO backend takes advantage of multi‐threading; as a result, RIDAO can run on as many CPU cores as are available to it. This in turn allows RIDAO to treat multiple sequences simultaneously without a loss in efficiency. In principle, this design paradigm could be applied to any predictor to yield improved efficiency.

Unlike existing web‐servers, RIDAO requires users to log in using an email address and password. This design choice may be viewed unfavorably due to concerns regarding privacy and FAIR data principles 55 (findability, accessibility, interoperability, and reusability). However, in our view, the benefits of this choice outweigh these drawbacks. One benefit of user logins is the ability of RIDAO to create dedicated user workspaces, wherein results can be aggregated, stored, and shared. Within the user workspace, three types of job submissions can be made: (a) users can enter sequences manually, (b) users can upload multi‐FASTA files, and (c) users can upload a compressed file (.zip) containing any number of multi‐FASTA files. In all cases, RIDAO requires FASTA formatted sequences and allows, but does not require, FASTA headers. Once submitted, a job is added to the user's workspace and placed into the run queue to wait for system resources to become available. The run queue is implemented in a first in, first out fashion and is intended to prevent over‐subscription of system resources. Once the necessary system resources are available, the next scheduled job in the queue moves into an active state and processing begins. After processing, the results are compressed (.zip) before the job moves into the final, completed state. Once a job has completed its run, users can view RIDAO generated figures by accessing the job within their workspace. Users can also download or share their results, including raw data, generated figures, and the scripts used to generate said figures. Job sharing among RIDAO users is a straightforward, in‐app process, designed to facilitate collaboration. Results can also be shared outside of RIDAO via a downloadable link that does not require credentials to access.

2.2. Efficiency

Each predictor's efficiency was evaluated on the test set over a 24 hr period (Figure 3). We included Rida in this evaluation for two reasons. First, unlike the other predictors, RIDAO is web‐based and can only be accessed via a web browser. Second, RIDAO leverages parallel processing to analyze multiple sequences simultaneously while the other predictors do not. Rida, on the other hand, is restricted to serial processing and is therefore more directly comparable to the others. Rida and RIDAO completed processing the entire dataset within the allotted 24 hr window. Specifically, RIDAO completed the job in ≈42 min and Rida finished in ≈613 min. The other predictors were unable to process the entire dataset during the 24 hr period. On average, we found that metapredict V2 processed 64.0% of the dataset (644,823 sequences); IUPred3 processed 25.0% of the dataset (252,267 sequences); AUCpreD processed 23.8% of the dataset (240,016 sequences); flDPnn processed 0.7% of the dataset (6,557 sequences); and SPOT‐Disorder2 processed less than 0.1% of the dataset (34 sequences).

FIGURE 3.

Efficiency Evaluation. Each predictor was evaluated on the same hardware over a 24 hr period using a set of 1,007,540 sequences. Rapid Intrinsic Disorder Analysis Online (RIDAO) and Rida completed the entire set within the allotted time; metapredict V2 completed 644,823; IUPred3 completed 252,267; AUCpreD completed 240,016; flDPnn completed 6,557; and SPOT‐Disorder2 completed 34. (Top left) test set performance over 24 hr, (top right) test set performance over 24 hr in log10 scale, (bottom left) per‐residue performance (Equation (1)). RIDAO outperforms other predictors by two orders of magnitude. (Bottom right) per‐residue prediction rate (Equation (1)) in log10 scale

Based on the 24 hr performance, we evaluate the per‐residue prediction rate, $R_p$, of each predictor, $p$, as,

$$R_p = t^{-1} \sum_{s=0}^{S} l_s, \tag{1}$$

where $t$ is the observed runtime in seconds, $S$ is the number of processed sequences, $s$ is the sequence index, and $l_s$ is the length of sequence $s$. Using Equation (1), we find that RIDAO performed predictions at a rate of 170,153.28 residues per second (RPS) and Rida performed at a rate of 11,658.13 RPS. For metapredict V2, IUPred3, AUCpreD, flDPnn, and SPOT‐Disorder2 the observed rates were: 3,155.12 RPS; 1,378.18 RPS; 1,265.44 RPS; 21.62 RPS; and 0.13 RPS, respectively.
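Equation (1) amounts to dividing the total number of processed residues by the observed runtime. The following sketch shows the calculation for a toy input; the function name `per_residue_rate` and the example values are hypothetical and are not taken from the benchmark.

```python
def per_residue_rate(sequence_lengths, runtime_seconds):
    """Per-residue prediction rate in residues per second (Equation (1)).

    `sequence_lengths` holds the length of every processed sequence and
    `runtime_seconds` is the observed wall-clock runtime.
    """
    if runtime_seconds <= 0:
        raise ValueError("runtime must be positive")
    return sum(sequence_lengths) / runtime_seconds


if __name__ == "__main__":
    # Three hypothetical sequences processed in two seconds.
    print(per_residue_rate([100, 250, 650], 2.0))
```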

Considering these rates and the total number of residues in the test set (428,786,255), the estimated total runtime, $T^{*}_{total}$, can be computed as,

$$T^{*}_{total} = \frac{N_r}{R_p}, \tag{2}$$

where $N_r$ is the total number of residues in the test set and $R_p$ is the per‐residue prediction rate defined in Equation (1). For metapredict V2, IUPred3, AUCpreD, flDPnn, and SPOT‐Disorder2, the $T^{*}_{total}$ is computed to be 37.8 hr; 86.4 hr; 94.1 hr; 5,509.1 hr (≈229 days); and 916,209.9 hr (≈104 years), respectively (Figure 4).
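The runtime estimates follow directly from Equation (2) and the rates reported above. The short sketch below repeats the arithmetic using the residue count and rates quoted in the text; only the variable and function names are our own.

```python
TOTAL_RESIDUES = 428_786_255  # residues in the test set, as reported

# Observed per-residue prediction rates (residues per second), as reported.
OBSERVED_RATES = {
    "metapredict V2": 3155.12,
    "IUPred3": 1378.18,
    "AUCpreD": 1265.44,
    "flDPnn": 21.62,
    "SPOT-Disorder2": 0.13,
}


def estimated_runtime_hours(total_residues, rate):
    """Estimated total runtime in hours (Equation (2), converted from seconds)."""
    return total_residues / rate / 3600.0


ESTIMATES = {
    name: estimated_runtime_hours(TOTAL_RESIDUES, rate)
    for name, rate in OBSERVED_RATES.items()
}

if __name__ == "__main__":
    for name, hours in ESTIMATES.items():
        print(f"{name}: {hours:,.1f} hr")
```

Applying the same function to the RIDAO and Rida rates gives runtimes of roughly 42 min and 613 min, which agrees with the measured values.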

FIGURE 4.

Runtime evaluation. Actual and estimated runtimes reported in log scale. Rapid Intrinsic Disorder Analysis Online (RIDAO) required 42 min of runtime while Rida required 613 min. Runtimes for the other predictors, which did not completely process the test set during the 24 hr test period, are estimated (Equation (2)) based on observed per‐residue prediction rates shown in Figure 3. Specifically, 38 hr are estimated for metapredict V2, 86 hr for IUPred3, and 94 hr for AUCpreD; flDPnn is estimated to require 229 days and SPOT‐Disorder2 is estimated to require 104 years of runtime

We show in Figure 3 that RIDAO yields disorder predictions at a per‐residue rate that is two to six orders of magnitude greater than metapredict V2, IUPred3, AUCpreD, flDPnn, and SPOT‐Disorder2. Rida is also shown to out‐perform each predictor by at least an order of magnitude.

The difference in efficiency between these predictors may be explained in terms of design principles, prediction methods, and implementation choices. AUCpreD, metapredict V2, Rida, and RIDAO have all been developed with computational efficiency in mind and employ some form of machine‐learning to make predictions. While neural networks serve as the underpinnings of flDPnn and SPOT‐Disorder2, these predictors were not designed for high‐throughput applications. Consequently, both flDPnn and SPOT‐Disorder2 rely on inefficient system calls to various supplemental tools prior to performing disorder predictions. These system calls bog down the prediction process dramatically. This is especially the case for SPOT‐Disorder2, which must invoke HHBlits, PSI‐BLAST, CCMpred, SPIDER3, SPOT‐1D, SPOT‐Contact, and DCA prior to making its disorder prediction. The inclusion of PSI‐BLAST in particular exacts a heavy toll, despite running on 32 CPU cores.

Each predictor we examined produces per‐residue disorder propensities that can be viewed as a disorder profile or converted to binary predictions. Rida, RIDAO, flDPnn, and SPOT‐Disorder2 yield additional outputs as well. For Rida and RIDAO, these outputs include disorder propensities from seven different predictors. Additionally, they compute mean disorder propensities and their standard errors. Furthermore, RIDAO tabulates results for whole sequences based on each of its constituent predictors. These tabulated results come with an increase in computational cost and include the rank (the average disorder score over the entire sequence), the percentage of disordered residues based on a disorder propensity threshold of 0.5, the average distance from the CDF boundary line (dCDF) with respect to PONDR VLXT, and the distance from the boundary line in CH phase space (dCH). When taken together, the dCDF and dCH allow for the construction of the CH‐CDF plots shown in Figure 2. The production of these data points is a unique feature of RIDAO. For flDPnn, the additional outputs include eight data points in addition to the disorder propensities and their respective binary predictions. These data points describe propensities and binary predictions for linker‐specific disorder, protein‐binding, DNA‐binding, and RNA‐binding. While SPOT‐Disorder2 does not directly produce additional outputs itself, the supplemental tools it requires do. These supplemental outputs include a contact map, secondary structure predictions, a position‐specific scoring matrix and more. With the exception of RIDAO's tabulated results, the additional outputs of the aforementioned predictors also serve as inputs for their final predictions. SPOT‐Disorder2 and flDPnn would see improved computational efficiency by combining their sublayers, as RIDAO does, instead of daisy chaining them together.
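Two of the whole‐sequence quantities mentioned above, the average disorder score and the percentage of disordered residues, are simple reductions of a disorder profile. A minimal sketch is given below; it assumes that a residue counts as disordered when its score strictly exceeds the 0.5 threshold, and the function name `summarize_profile` is hypothetical.

```python
def summarize_profile(scores, threshold=0.5):
    """Return (average disorder score, percentage of disordered residues).

    Assumption: a residue is disordered when its score is strictly
    greater than `threshold`.
    """
    if not scores:
        raise ValueError("profile must contain at least one residue")
    average = sum(scores) / len(scores)
    disordered = sum(1 for score in scores if score > threshold)
    return average, 100.0 * disordered / len(scores)


if __name__ == "__main__":
    print(summarize_profile([0.2, 0.6, 0.8, 0.4]))
```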

The energy estimation method was elaborated and implemented in the IUPred family of predictors. 44 The IUPred predictors are unique in that they employ a physical model describing the interaction energy between residues to make predictions instead of relying on machine‐learning. RIDAO outperformed IUPred3 by two orders of magnitude with respect to its per‐residue prediction rate. This difference in performance may be ascribed to the parallel processing strategy employed by RIDAO. However, Rida also outperforms IUPred3 by an order of magnitude despite being restricted to serial processing on a single CPU core. This is surprising given that the energy estimation method is incorporated into Rida and RIDAO and suggests that a more efficient implementation of IUPred3 could be constructed.

Rida and RIDAO are most comparable to AUCpreD and metapredict V2, as each of these predictors is self‐contained and designed with proteome‐scale analysis in mind. These predictors all employ neural networks to make disorder predictions. AUCpreD leverages a deep convolutional neural network and is implemented in C++; metapredict V2 uses a bidirectional recurrent neural network with long short‐term memory and is implemented in Python. Rida and RIDAO employ feed‐forward neural networks, as well as support vector machines (SVMs). Rida is implemented in C and RIDAO is implemented in Ruby with a C backend. Although implementations in C and C++ can be expected to run faster than those in Python, 56 , 57 the choice of language in this case does not explain the dramatic difference in per‐residue prediction rates compared to RIDAO, especially given that metapredict V2, which is implemented in Python, outperforms AUCpreD. Therefore, these differences must be a consequence of the neural network architecture 58 and program design.

2.3. Accessibility

RIDAO was designed to be accessible to a large number of end‐users, without requiring command‐line experience. Nevertheless, RIDAO is a web‐based predictor and is therefore inaccessible to individuals without an internet connection. For these individuals, Rida is available for download as a precompiled command‐line program. RIDAO does not restrict user input with respect to file size or the number of submitted sequences, nor does it restrict the frequency of job submissions. To the best of the authors' knowledge, existing web‐servers providing disorder prediction services are all limited in some capacity. A consequence of these restrictions is the need for end‐users to set up local installations in order to process large datasets.

With the exception of AUCpreD, each predictor described herein has an associated web‐server. Although AUCpreD is not available as a web‐server, the software acquisition and installation is easy and straightforward for users with command‐line experience. The acquisition and installation of metapredict V2 is streamlined as well. The authors have prepared a package that is available through the de facto Python package management system, which can be used as a stand‐alone program or a Python package. The simplicity of this process is appreciated given that the metapredict V2 web‐server only accepts a single sequence at a time, requiring multi‐FASTA datasets to be processed locally.

SPOT‐Disorder2, flDPnn, and IUPred3 also offer web‐servers for end‐users but restrict user submissions. SPOT‐Disorder2 limits submissions to 10 sequences, each containing less than 750 residues. Similarly, flDPnn limits submissions to 20 sequences and IUPred3 limits submission to 1 MB of data. For larger jobs, flDPnn provides a stand‐alone version in an easy to use Docker container, as well as source code for local installations. IUPred3 and SPOT‐Disorder2 also offer stand‐alone versions for download. In the case of IUPred3, the setup is easy and only requires one additional Python package, that is, SciPy. In contrast, setting up SPOT‐Disorder2 requires a number of additional tools to also be acquired and installed, each with their own dependencies, and an appreciable amount of storage to house the required NCBI database. The accessibility of SPOT‐Disorder2 could be improved through the addition of an installation script to automate the acquisition and installation of all dependencies. Alternatively, SPOT‐Disorder2 could provide a ready‐to‐run container, similar to flDPnn.

3. CONCLUSIONS

This article introduces a new high‐efficiency web‐based disorder prediction tool, RIDAO. The purpose of this tool is to facilitate the application of disorder analysis in comparative genomics/proteomics and genome‐scale structural bioinformatics. To this end, we have focused on investigating computational efficiency and accessibility (i.e., ease of use) instead of accuracy in this article. Readers interested in the accuracies of the predictors integrated into RIDAO as well as those examined in this work are referred to the original publications 42 , 43 , 44 , 50 , 51 , 52 , 53 , 54 , 59 , 60 and the recent critical assessment of protein intrinsic disorder experiment. 49 Importantly, we note that the accuracies of the disorder predictors integrated into RIDAO remain unchanged from their original publications. Indeed, with the exception of rounding errors, RIDAO reproduces disorder scores from its constituent predictors with perfect fidelity. Moving forward we plan to expand the analytical functionality of RIDAO. This includes but is not limited to the development and integration of additional disorder predictors, as well as tools for annotation of functional disorder sites, such as molecular recognition features and posttranslational modification sites. RIDAO can be accessed for free at: https://ridao.app.

4. MATERIALS AND METHODS

In this section, we outline the concepts and methods of disorder prediction which are leveraged by Rida and RIDAO. For any given sequence, six per‐residue disorder predictors are used to construct an MDP. These predictors, that is, PONDRs VLXT, VL3, and VSL2B along with IUPred‐Long, IUPred‐Short, and PONDR‐FIT, are introduced below. In addition to creating disorder profiles, RIDAO also performs multinomial classification by creating CH‐CDF plots. These plots, along with the underlying CH and CDF analyses, are also discussed. Finally, we detail how predictor efficiency is evaluated.

4.1. CH phase space

Unfavorable thermodynamic interactions between nonpolar solutes and their aqueous environments give rise to the hydrophobic effect, 61 , 62 which is one of the major driving forces behind protein folding. 63 Early work on the protein folding problem led to the elaboration of different amino acid scales describing the affinity and antipathy between each amino acid and water. 63 , 64 , 65 George Rose, for example, used empirical water–ethanol transfer free energies 66 to derive his scale. 67 Kyte and Doolittle, on the other hand, based their scale 68 on interior–exterior amino acid distribution data collected from experiment 69 as well as empirical water‐vapor transfer free energies. 70 , 71 Today, these scales and others like them serve as the basis for CH plots and as features for machine‐learning‐based approaches.

4.1.1. CH plots and FoldIndex

Uversky et al. brought to bear Kyte and Doolittle's hydropathy scale to examine the relationship between sequence composition and disorder in the native state. 29 They showed, with a set of 336 proteins, that a border between natively folded and IDPs could be drawn in CH phase space. The so‐called “Uversky plot” takes the mean net charge at pH 7.0, $R$, and graphs it against the mean hydropathy, $\hat{H}_{KD}$, derived from Kyte and Doolittle's scale. Their borderline can be expressed by the equation

$$R = 2.785\,\hat{H}_{KD} - 1.151 \tag{3}$$

which can be rearranged to yield the FoldIndex 72 :

$$I_F^{KD} = 2.785\,\hat{H}_{KD} - R - 1.151 \tag{4}$$

CH plots enable rapid binary classification of IDPs without the need for experimental isolation and characterization.

Moreover, when Equation (4) is applied to a sequence using a sliding window, disorder profiles can be constructed to detect IDPRs.
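The FoldIndex calculation and its sliding‐window form can be sketched as follows. The hydropathy values are those of the Kyte and Doolittle scale. Several details are assumptions that the text does not spell out: the hydropathy is rescaled to the range 0 to 1, the mean net charge is taken as the absolute difference between the counts of Lys/Arg and Asp/Glu divided by the sequence length (histidine is treated as neutral at pH 7.0), the window is truncated at the termini, and the default window of 51 residues is illustrative. Under these conventions, negative values fall on the disordered side of the boundary in Equation (3).

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Mean hydropathy, rescaled from [-4.5, 4.5] to [0, 1] (assumption)."""
    return sum((KYTE_DOOLITTLE[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_net_charge(seq):
    """Absolute mean net charge, counting K/R as +1 and D/E as -1 (assumption)."""
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    return abs(positive - negative) / len(seq)


def fold_index(seq):
    """FoldIndex of a sequence or sequence window (Equation (4))."""
    return 2.785 * mean_hydropathy(seq) - mean_net_charge(seq) - 1.151


def fold_index_profile(seq, window=51):
    """Per-residue FoldIndex using a centered window, truncated at the termini."""
    half = window // 2
    return [
        fold_index(seq[max(0, i - half): i + half + 1])
        for i in range(len(seq))
    ]


if __name__ == "__main__":
    print(fold_index("E" * 10))   # highly charged, low hydropathy
    print(fold_index("I" * 10))   # uncharged, high hydropathy
    print(fold_index_profile("ACDEFGHIKLMNPQRSTVWY", window=5))
```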

4.2. Predictors of natural disordered regions

Work by Romero et al. in 1997 resulted in the first formal predictors of disorder, 73 which were eventually called PONDRs. 30 To distinguish various PONDRs from one another, a descriptive nomenclature is used. 60 For example, XL1 and NL1 are both first generation predictors trained on long disordered regions (i.e., more than 30 residues); however, the XL1 dataset contains X‐ray‐characterized sequences while the NL1 dataset contains sequences characterized by nuclear magnetic resonance. 74 Regardless of how the training data were obtained, all PONDRs employ machine‐learning techniques and operate on a sliding window of residues. Typically, features are selected based on conditional probabilities 75 and include but are not limited to: amino‐acid compositions, attributes derived directly from compositions, and attributes derived from some function of compositions. 74 The chosen features are weighted and combined in a nonlinear fashion by way of SVMs, artificial neural networks (ANNs), or logistic regression to yield a final prediction.

4.2.1. PONDR VLXT

The early PONDRs were all specialized with respect to length or position and therefore were unable to accurately predict disorder along the entire length of a protein sequence. 40 , 73 , 76 , 77 , 78 Despite this limitation, these predictors paved the way for future development and, for the time, were accurate within their own domains. 79 PONDR VLXT 30 , 40 was the first predictor designed to work over the entire length of a sequence. VLXT integrates three feed‐forward neural network predictors, two for the terminal regions (i.e., XT) and one for the internal region (i.e., VL1). The final prediction is the result of merging the individual results of each underlying PONDR by way of two averaging passes over the entire sequence.

4.2.2. PONDR VL3

The first generation VL dataset consisted of 15 variously characterized long regions of disorder. 30 This number was increased to 145 for the second generation dataset 80 and was further refined to 152 samples to produce a third generation dataset, VL3. 41 PONDR VL3 feeds 20 features through an ensemble of 10 feed‐forward ANNs before selecting the final result by a simple majority vote. In the fifth Critical Assessment of Structure Prediction (CASP5) experiment, the VL3 predictor, along with two variants (i.e., VL3‐H and VL3‐P), was shown to outperform VLXT on long regions of disorder by margins of 7–16%. However, VLXT, with an accuracy of 58%, outperformed the same VL3 predictors on short regions of disorder by margins of 23–33%.

4.2.3. PONDR VSL2B

In CASP6, a significant improvement in predicting short regions of disorder was achieved with PONDR VSL1. 81 Later, the VSL1 dataset was refined and a new two‐level predictor, PONDR VSL2, along with two variants, VSL2P and VSL2B, was developed using three linear SVMs. 42 VSL2 predictors work in two stages. First, two length‐specific predictors, VSL2‐L and VSL2‐S, make their own predictions for every residue in the sequence. Then, a meta‐predictor, VSL2‐M1, weights and combines their outputs to provide a final per‐residue result. VSL2, VSL2P, and VSL2B differ only with respect to their input features. VSL2 uses a total of 54 features, including 22 derived from multiple sequence alignments and 6 from secondary structure predictions. In contrast, VSL2P and VSL2B use reduced feature sets of 48 and 26 features, respectively. VSL2B strictly employs features derived from the amino acid composition of the sequence and is therefore less computationally expensive than VSL2 and VSL2P. While the overall accuracy of VSL2B is 2–3% less than VSL2P and VSL2, its lower computational cost makes it well‐suited for genome‐scale studies.

4.3. Pairwise energy estimations

Machine learning‐based disorder predictors, like the PONDRs, are inherently biased due to the difficulty of constructing large unbiased training datasets. Ideally, each disordered segment in a training set would have been characterized by all available experimental techniques. Additionally, the distributions of segment size and location would be expected to be uniform and free of false positives. Despite a continuous influx of empirical data on IDPs and IDPRs, much more work is needed before an ideal training dataset can be constructed. The IUPred predictor circumvents this issue by relying on an energy estimation method, instead of machine learning, to generate disorder profiles.

4.3.1. IUPred

The energy estimation method incorporated into IUPred is based on statistical inter‐residue interaction potentials, which can be approximated from solved protein structures. 82 Starting with an initial 20 by 20 matrix, $M^{(0)}$, where the superscript $(0)$ denotes the zeroth approximation, and each element, $m_{ij}$, represents the statistical potential for the pairing of amino acid types $i$ and $j$, a final matrix, $M^{*}$, may be derived iteratively. At each iteration, the error is computed between the predicted energies and observed energies as:

$$\Delta M_{ij}^{n} = \ln \frac{p_{ij}^{observed}}{p_{ij}^{n,\,predicted}}, \tag{5}$$

where $p_{ij}$ is the pairing frequency of amino acid type $i$ with amino acid type $j$. Then, the computed error is applied as a correction to yield the next approximation:

$$M_{ij}^{n} = M_{ij}^{n-1} + \Delta M_{ij}^{n-1}. \tag{6}$$

This process is repeated until the magnitude of the error falls below a predetermined threshold. Provided this process converges, the final matrix $M^{*}$ can be used to predict form from sequence. For any given protein in a known conformation, the calculated energy, $E_i^p$, of a residue of type $i$ at position $p$ can then be expressed as:

$$E_i^p = \sum_{j=1}^{20} M_{ij}^{*}\, c_j^p, \tag{7}$$

where $c_j^p$ is the conformation‐specific interaction frequency between the residue of type $i$ at position $p$ and residues of type $j$. Alternatively, the energy of a given residue may be estimated purely from the primary structure without prior knowledge of the actual conformation:

$$e_i^p = \sum_{j=1}^{20} P_{ij}\, n_j^p, \tag{8}$$

where $e_i^p$ is the estimated energy for the residue at position $p$ with type $i$, $P_{ij}$ is a pairwise energy prediction matrix entry for amino acid types $i$ and $j$, and $n_j^p$ is the $j$th element of the amino acid composition vector, which is specific for position $p$. The elements of matrix $P$ are determined beforehand by least‐squares fitting of a reference set of proteins with known conformations. To develop IUPred, a reference set comprised of 785 nonredundant globular proteins was used.
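The sequence‐only estimate in Equation (8) can be illustrated with a short sketch. The fitted matrix $P$ is not reproduced in this article, so the example takes any 20 by 20 matrix as input. The way the position‐specific composition vector is built here (relative frequencies of the residues within a fixed half‐window around position $p$, excluding the residue itself) is an assumption made for illustration, as are the function names and the default half‐window of 10 residues.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def local_composition(seq, p, half_window):
    """Composition vector n^p: relative frequencies of the neighbours of p.

    Assumption: neighbours are all residues within `half_window` positions
    of p on either side, excluding the residue at p itself.
    """
    lo = max(0, p - half_window)
    hi = min(len(seq), p + half_window + 1)
    counts = [0.0] * len(AMINO_ACIDS)
    neighbours = 0
    for q in range(lo, hi):
        if q == p:
            continue
        counts[INDEX[seq[q]]] += 1.0
        neighbours += 1
    if neighbours == 0:
        return counts  # isolated residue: all-zero composition
    return [c / neighbours for c in counts]


def estimated_energy(seq, p, P, half_window=10):
    """Estimated energy e_i^p = sum_j P[i][j] * n_j^p (Equation (8))."""
    i = INDEX[seq[p]]
    n = local_composition(seq, p, half_window)
    return sum(P[i][j] * n[j] for j in range(len(AMINO_ACIDS)))


if __name__ == "__main__":
    # Placeholder matrix for demonstration only; NOT the fitted IUPred matrix.
    identity = [[1.0 if i == j else 0.0 for j in range(20)] for i in range(20)]
    print(estimated_energy("ACAC", 0, identity))
```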

4.4. Combinatory techniques (metapredictors)

The initial results on disorder prediction implied that there are at least two distinct types or “flavors” of protein intrinsic disorder. 73 Models trained on long segments saw sharp decreases in accuracy when applied to short segments and vice versa. The ability of VLXT 30 , 40 to accurately predict disorder along an entire sequence stems from the fact that it combines three other domain specific predictors. Indeed, numerous examples of combinatory predictors outperforming their individual constituents exist in the literature.

4.4.1. PONDR‐FIT

PONDR‐FIT 43 is an ANN meta‐predictor that integrates three PONDRs (i.e., VLXT, VSL2B, and VL3) with FoldIndex, IUPred, and TOP‐IDP. The underlying predictors all generate disorder profiles given an amino acid sequence and, with the exception of TOP‐IDP, are described elsewhere. TOP‐IDP is a numerical amino acid disorder propensity scale elaborated from an initial pool of 517 sequence attributes using genetic algorithms. Similar to FoldIndex, it can be applied in a sliding window to yield a disorder profile. Each prediction made by PONDR‐FIT depends on 126 input features, which are the outputs of its 6 constituent predictors operating on sliding windows of 21 residues (6 × 21 = 126). Overall, PONDR‐FIT yields predictions that are, on average, 11% more accurate when compared to VLXT, VSL2B, VL3, FoldIndex, IUPred, or TOP‐IDP alone.
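The 126‐feature input described above can be sketched as follows. This is only an illustration of the bookkeeping (6 profiles × 21 window positions); the function name, the ordering of features, and the clamping of the window at the termini are assumptions and are not taken from the PONDR‐FIT implementation.

```python
def window_features(profiles, position, window=21):
    """Collect the scores of each constituent profile in a window centered
    on `position`, concatenated predictor by predictor.

    Assumption: positions that fall outside the sequence are clamped to the
    nearest terminal residue.
    """
    half = window // 2
    features = []
    for profile in profiles:
        last = len(profile) - 1
        for offset in range(-half, half + 1):
            q = min(max(position + offset, 0), last)
            features.append(profile[q])
    return features


# Six placeholder disorder profiles for a 30-residue sequence.
profiles = [[k + 0.01 * q for q in range(30)] for k in range(6)]
features = window_features(profiles, position=15)
print(len(features))  # 126
```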

4.4.2. CH‐CDF

Disorder profiles are useful for identifying IDPRs as they provide a continuous measure of disorder throughout a sequence. Alternatively, the data in a disorder profile can be viewed as a histogram of disorder propensities. This histogram then serves as input for the CDF. 83 Afterward, for a given predictor, the optimal boundary line between ordered and disordered proteins can be computed and CDF plots can be used for the binary classification of IDPs, 11 , 84 similar to CH plots. PONDR VLXT was the first per‐residue predictor to be given this treatment. 11
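The conversion of a disorder profile into a CDF can be sketched as below. The number of bins and the function name are assumptions for illustration, per‐residue scores are assumed to lie in [0, 1], and the predictor‐specific boundary line is not reproduced here.

```python
def disorder_cdf(profile, bins=10):
    """Histogram the per-residue scores into equal-width bins on [0, 1] and
    return the cumulative fraction of residues up to and including each bin."""
    counts = [0] * bins
    for score in profile:
        k = min(int(score * bins), bins - 1)  # a score of 1.0 goes in the last bin
        counts[k] += 1
    total = len(profile)
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / total)
    return cdf


mostly_ordered = [0.05, 0.05, 0.15, 0.95]
print(disorder_cdf(mostly_ordered))
```

A profile dominated by low scores produces a CDF that rises early, whereas a profile dominated by high scores stays low until the final bins; the position of the curve relative to a boundary is what allows binary classification.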

Oldfield et al. compared the CDF and CH methods and found disagreements between their predictions when applied to the same data sets. 85 Their results may be interpreted through the lens of the protein quartet model, 86 which classifies soluble proteins as being natively extended, relaxed (pre‐molten), collapsed (molten), or well‐structured. When the two methods are in agreement, the protein can be expected to belong to either the extended or the well‐structured class. When the two methods do not agree, the protein likely belongs to an intermediate class.

Combined CH‐CDF plots (Figure 2) can be constructed to perform multinomial classification. 45 , 46 , 47 , 48 Classification is based on the quadrant in which a protein is placed. When both methods predict disorder, the protein is found in quadrant 4 and is predicted to be disordered. Alternatively, when both methods predict a protein to be well‐structured, it is found in quadrant 2 and is predicted to be structured. If the two methods disagree, then the protein is predicted to be “mixed” or “rare.” These proteins are found in quadrant 3 or quadrant 1, respectively. Together, CH and CDF analyses provide sought‐after insight into the sequence‐based partitioning of functional disorder flavors. 73 , 78 , 80 , 87 , 88
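The quadrant logic can be summarized in a few lines. The function below is a hypothetical helper that takes the two binary calls as inputs. The text above does not state which disagreement corresponds to which quadrant, so the mapping used here (CDF disorder with CH order as “mixed,” and the reverse as “rare”) is an assumption.

```python
def ch_cdf_quadrant(ch_disordered, cdf_disordered):
    """Return (quadrant, label) from the binary CH and CDF predictions."""
    if ch_disordered and cdf_disordered:
        return 4, "disordered"
    if not ch_disordered and not cdf_disordered:
        return 2, "structured"
    if cdf_disordered:
        # Assumption: CDF predicts disorder while CH predicts order.
        return 3, "mixed"
    # Assumption: CH predicts disorder while CDF predicts order.
    return 1, "rare"


print(ch_cdf_quadrant(ch_disordered=True, cdf_disordered=True))  # (4, 'disordered')
```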

4.5. Efficiency evaluation

To evaluate the efficiency of each predictor, a test set comprising 1,007,540 sequences taken from 100 UniProt 89 reference proteomes is constructed. Complete reference proteomes are selected from all kingdoms of life (Table 1). Specifically, the test set contains: 76,772 sequences from 9 fungi; 210,052 sequences from 8 protists; 329,717 sequences from 8 animals; 66,211 sequences from 14 bacteria; 59,809 sequences from 19 archaea; 259,300 sequences from 6 plants; and 5,679 sequences from 36 viruses. Sequences in the dataset containing fewer than 31 residues are extended by appending a poly‐alanine tail at the C‐terminus such that each sequence meets the minimum length requirement for all predictors (31 residues). Additionally, we convert U to C, B to N, O to K, Z to N, and X to G so that no sequences are rejected by any of the predictors. After modification, the dataset contains a grand total of 428,786,255 residues and is available for download at https://ridao.app/publications. Each predictor is given 24 hr of runtime to process as much of this dataset as possible.
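The two sequence modifications described above can be expressed compactly. The sketch below is not the script used to build the test set; the function and constant names are hypothetical, and the conversion to upper case is an added assumption.

```python
# Residue substitutions stated in the text: U->C, B->N, O->K, Z->N, X->G.
SUBSTITUTIONS = str.maketrans({"U": "C", "B": "N", "O": "K", "Z": "N", "X": "G"})
MIN_LENGTH = 31  # minimum length accepted by all predictors


def normalize_sequence(sequence):
    """Replace nonstandard residue codes, then pad short sequences with a
    C-terminal poly-alanine tail up to MIN_LENGTH residues."""
    sequence = sequence.upper().translate(SUBSTITUTIONS)  # upper() is an assumption
    if len(sequence) < MIN_LENGTH:
        sequence += "A" * (MIN_LENGTH - len(sequence))
    return sequence


print(normalize_sequence("MUBOZX"))
```

Sequences that already contain 31 or more standard residues pass through unchanged.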

TABLE 1.

Test set reference proteomes. Reference proteomes from all kingdoms of life used to construct the test set for predictor efficiency evaluation. The final test set is available for download at https://ridao.app/publications. It contains 1,007,540 sequences from 100 organisms comprising 428,786,255 residues

Kingdom | UniProt IDs | Total sequences
Animalia | UP000001940, UP000000803, UP000005640, UP000002358, UP000410492, UP000000437, UP000186698, UP000000589 | 329,717
Archaea | UP000000536, UP000000590, UP000000792, UP000001018, UP000001974, UP000002613, UP000006794, UP000008243, UP000011555, UP000305339, UP000000554, UP000000758, UP000001013, UP000001903, UP000002487, UP000006663, UP000007722, UP000011511, UP000297295 | 59,809
Bacteria | UP000000428, UP000000537, UP000000625, UP000000807, UP000001364, UP000001414, UP000001425, UP000001570, UP000001584, UP000002332, UP000028875, UP000095713, UP000184536, UP000192674 | 66,211
Fungi | UP000000560, UP000000561, UP000000591, UP000001805, UP000001861, UP000002149, UP000002311, UP000002485, UP000007431 | 76,772
Plantae | UP000001514, UP000006548, UP000006727, UP000006729, UP000011750, UP000595140 | 259,300
Protist | UP000000542, UP000000600, UP000001449, UP000001542, UP000002195, UP000009168, UP000011087, UP000013827 | 210,052
Virus | UP000000354, UP000000730, UP000000854, UP000000938, UP000001134, UP000001240, UP000001451, UP000002241, UP000002242, UP000002500, UP000006720, UP000007639, UP000008158, UP000008286, UP000008288, UP000008597, UP000009253, UP000009255, UP000009294, UP000012678, UP000018636, UP000101154, UP000105007, UP000114976, UP000145840, UP000202558, UP000203665, UP000204142, UP000217350, UP000241349, UP000247059, UP000249273, UP000281246, UP000383418, UP000464024, UP000514836 | 5,679

The total time required to read inputs, make predictions, and write outputs is taken into consideration when evaluating efficiency. The input order of sequences is kept constant throughout this work, and any predictors capable of reading multi‐FASTA files are given the entire set as a single input. If a multi‐FASTA file cannot be used, the test set is broken up into individual FASTA files. Importantly, we note that flDPnn could not load the test set using a single multi‐FASTA file without experiencing catastrophic failure; for that reason, flDPnn was evaluated using individual FASTA files. Furthermore, because it has been reported that a majority of the metapredict V2 runtime is spent loading its neural network, 53 we evaluated metapredict V2 using both individual FASTA files and a single multi‐FASTA file. Herein, we report the results of the latter, for which metapredict V2's code was modified to write outputs as they are generated instead of at the end of processing. This modification was necessary because metapredict V2 does not finish processing the multi‐FASTA file input within the allotted time and therefore would not otherwise produce any output. Each predictor is evaluated on the same server that hosts RIDAO. This server employs a 2.60 GHz Intel Xeon E5‐2640 v3 CPU, 64 GB of 2,133 MHz DDR4 SDRAM, and two NVIDIA GeForce GTX 980s.
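For predictors that cannot read a multi‐FASTA file, the test set must be split into single‐record inputs while preserving the input order. The following sketch shows one way this could be done in memory; it is not the script used in this work, and the function name is hypothetical.

```python
def split_multifasta(text):
    """Split multi-FASTA text into a list of single-record FASTA strings,
    keeping records in their original order."""
    records, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">") and current:
            records.append("\n".join(current) + "\n")
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


example = ">a\nMK\nLV\n>b\nGG\n"
for record in split_multifasta(example):
    print(record, end="")
```

Each returned string can then be written to its own file and passed to the predictor one at a time, which also means any per‐invocation start‐up cost is paid once per sequence.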

AUTHOR CONTRIBUTIONS

Vladimir N. Uversky motivated and guided the work. Guy W. Dayhoff conceived the project, designed and developed the web application, performed analysis, made figures, and wrote the manuscript.

CONFLICT OF INTEREST

Both authors declare no conflict of interest.

ACKNOWLEDGMENTS

G. W. D. thanks Christopher J. Oldfield and Bin Xue for providing original model parameters needed to recreate PONDR VLXT, PONDR VSL2, PONDR VL3, and PONDR‐FIT. G. W. D. was supported by an NSF Graduate Research Fellowship (NSF‐1746051).

Dayhoff GW II, Uversky VN. Rapid prediction and analysis of protein intrinsic disorder. Protein Science. 2022;31(12):e4496. 10.1002/pro.4496

Review Editor: Nir Ben‐Tal

Funding information: National Science Foundation, Grant/Award Number: NSF‐1746051

Contributor Information

Guy W. Dayhoff, II, Email: gdayhoff@usf.edu.

Vladimir N. Uversky, Email: vuversky@usf.edu.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available from the corresponding author upon reasonable request.

REFERENCES

  • 1. Anfinsen CB, Haber E, Sela M, White F Jr. The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc Natl Acad Sci U S A. 1961;47(9):1309–1314.
  • 2. Dill KA, MacCallum JL. The protein‐folding problem, 50 years on. Science. 2012;338(6110):1042–1046.
  • 3. Wright PE, Dyson HJ. Intrinsically unstructured proteins: re‐assessing the protein structure‐function paradigm. J Mol Biol. 1999;293(2):321–331.
  • 4. Habchi J, Tompa P, Longhi S, Uversky VN. Introducing protein intrinsic disorder. Chem Rev. 2014;114(13):6561–6588.
  • 5. Peng Z, Yan J, Fan X, et al. Exceptionally abundant exceptions: Comprehensive characterization of intrinsic disorder in all domains of life. Cell Mol Life Sci. 2015;72(1):137–151.
  • 6. Xue B, Dunker AK, Uversky VN. Orderly order in protein intrinsic disorder distribution: Disorder in 3500 proteomes from viruses and the three domains of life. J Biomol Struct Dyn. 2012;30(2):137–149.
  • 7. Uversky VN, Dunker AK. Understanding protein non‐folding. Biochim Biophys Acta. 2010;1804(6):1231–1264.
  • 8. Uversky VN. The mysterious unfoldome: Structureless, underappreciated, yet vital part of any given proteome. J Biomed Biotechnol. 2010;2010:1–14.
  • 9. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol. 2004;337(3):635–645.
  • 10. Dunker AK, Lawson JD, Brown CJ, et al. Intrinsically disordered protein. J Mol Graph Model. 2001;19(1):26–59.
  • 11. Dunker AK, Romero P, Obradovic Z, Garner EC, Brown CJ. Intrinsic protein disorder in complete genomes. Genome Inform Ser. 2000;11:161–171.
  • 12. Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005;6(3):197–208.
  • 13. Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN. Flexible nets: The roles of intrinsic disorder in protein interaction networks. FEBS J. 2005;272(20):5129–5148.
  • 14. Van Der Lee R, Buljan M, Lang B, et al. Classification of intrinsically disordered regions and proteins. Chem Rev. 2014;114(13):6589–6631.
  • 15. Iakoucheva LM, Brown CJ, Lawson JD, Obradović Z, Dunker AK. Intrinsic disorder in cell‐signaling and cancer‐associated proteins. J Mol Biol. 2002;323(3):573–584.
  • 16. Cheng Y, LeGall T, Oldfield CJ, Dunker AK, Uversky VN. Abundance of intrinsic disorder in protein associated with cardiovascular disease. Biochemistry. 2006;45(35):10448–10460.
  • 17. Uversky VN, Oldfield CJ, Dunker AK. Intrinsically disordered proteins in human diseases: Introducing the d2 concept. Annu Rev Biophys. 2008;37:215–246.
  • 18. Hegyi H, Buday L, Tompa P. Intrinsic structural disorder confers cellular viability on oncogenic fusion proteins. PLoS Comput Biol. 2009;5(10):e1000552.
  • 19. Uversky VN, Davé V, Iakoucheva LM, et al. Pathological unfoldomics of uncontrolled chaos: Intrinsically disordered proteins and human diseases. Chem Rev. 2014;114(13):6844–6879.
  • 20. Kulkarni P, Uversky VN. Intrinsically disordered proteins in chronic diseases. Biomolecules. 2019;9(4):147.
  • 21. Dunker AK, Obradovic Z. The protein trinity: Linking function and disorder. Nat Biotechnol. 2001;19(9):805–806.
  • 22. Tompa P, Fuxreiter M. Fuzzy complexes: Polymorphism and structural disorder in protein–protein interactions. Trends Biochem Sci. 2008;33(1):2–8.
  • 23. DeForte S, Uversky VN. Order, disorder, and everything in between. Molecules. 2016;21(8):1090.
  • 24. Receveur‐Bréchot V, Bourhis JM, Uversky VN, Canard B, Longhi S. Assessing protein disorder and induced folding. Proteins. 2006;62(1):24–45.
  • 25. Sugase K, Dyson HJ, Wright PE. Mechanism of coupled folding and binding of an intrinsically disordered protein. Nature. 2007;447(7147):1021–1025.
  • 26. Tompa P. Intrinsically disordered proteins: A 10‐year recap. Trends Biochem Sci. 2012;37(12):509–516.
  • 27. Oldfield CJ, Xue B, Van YY, et al. Utilization of protein intrinsic disorder knowledge in structural proteomics. Biochim Biophys Acta. 2013;1834(2):487–498.
  • 28. Brown CJ, Johnson AK, Dunker AK, Daughdrill GW. Evolution and disorder. Curr Opin Struct Biol. 2011;21(3):441–446.
  • 29. Uversky VN, Gillespie JR, Fink AL. Why are ‘natively unfolded’ proteins unstructured under physiologic conditions? Proteins. 2000;41(3):415–427.
  • 30. Romero P, Obradovic Z, Li X, Garner EC, Brown CJ, Dunker AK. Sequence complexity of disordered protein. Proteins. 2001;42(1):38–48.
  • 31. Zhao B, Kurgan L. Deep learning in prediction of intrinsic disorder in proteins. Comput Struct Biotechnol J. 2022;20:1286–1294.
  • 32. Yan J, Mizianty MJ, Filipow PL, Uversky VN, Kurgan L. RAPID: Fast and accurate sequence‐based prediction of intrinsic disorder content on proteomic scale. Biochim Biophys Acta. 2013;1834(8):1671–1680.
  • 33. Peng Z, Mizianty MJ, Kurgan L. Genome‐scale prediction of proteins with long intrinsically disordered regions. Proteins. 2014;82(1):145–158.
  • 34. Dayhoff GW, van Regenmortel MH, Uversky VN. Intrinsic disorder in protein sense‐antisense recognition. J Mol Recognit. 2020;33(10):e2868.
  • 35. Van Bibber NW, Haerle C, Khalife R, Xue B, Uversky VN. Intrinsic disorder in tetratricopeptide repeat proteins. Int J Mol Sci. 2020;21(10):3709.
  • 36. Van Bibber NW, Haerle C, Khalife R, Dayhoff GW, Uversky VN. Intrinsic disorder in human proteins encoded by core duplicon gene families. J Phys Chem B. 2020;124(37):8050–8070.
  • 37. Gutierrez‐Beltran E, Elander PH, Dalman K, et al. Tudor staphylococcal nuclease is a docking platform for stress granule components and is essential for SnRK1 activation in Arabidopsis. EMBO J. 2021;40:e105043.
  • 38. Koshkin SA, Anatskaya OV, Vinogradov AE, et al. Isolation and characterization of human colon adenocarcinoma stem‐like cells based on the endogenous expression of the stem markers. Int J Mol Sci. 2021;22(9):4682.
  • 39. Polanco C, Polanco C, Uversky VN, et al. Bioinformatics‐based characterization of proteins related to SARS‐CoV‐2 using the polarity index method® (PIM®) and intrinsic disorder predisposition. Curr Proteom. 2022;19(1):51–64.
  • 40. Li X, Romero P, Rani M, Dunker AK, Obradovic Z. Predicting protein disorder for N‐, C‐ and internal regions. Genome Inform Ser. 1999;10:30–40.
  • 41. Radivojac P, Obradović Z, Brown CJ, Dunker AK. Prediction of boundaries between intrinsically ordered and disordered protein regions. Pac Symp Biocomput. 2003;216–227.
  • 42. Peng K, Radivojac P, Vucetic S, Dunker AK, Obradovic Z. Length‐dependent prediction of protein intrinsic disorder. BMC Bioinf. 2006;7(1):1–17.
  • 43. Xue B, Dunbrack RL, Williams RW, Dunker AK, Uversky VN. PONDR‐FIT: A metapredictor of intrinsically disordered amino acids. Biochim Biophys Acta. 2010;1804(4):996–1010.
  • 44. Dosztanyi Z, Csizmok V, Tompa P, Simon I. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J Mol Biol. 2005;347(4):827–839.
  • 45. Mohan A, Sullivan WJ Jr, Radivojac P, Dunker AK, Uversky VN. Intrinsic disorder in pathogenic and non‐pathogenic microbes: Discovering and analyzing the unfoldomes of early‐branching eukaryotes. Mol Biosyst. 2008;4(4):328–340.
  • 46. Sun X, Xue B, Jones WT, Rikkerink E, Dunker AK, Uversky VN. A functionally required unfoldome from the plant kingdom: Intrinsically disordered N‐terminal domains of GRAS proteins are involved in molecular recognition during plant development. Plant Mol Biol. 2011;77(3):205–223.
  • 47. Xue B, Oldfield CJ, Van YY, Dunker AK, Uversky VN. Protein intrinsic disorder and induced pluripotent stem cells. Mol Biosyst. 2012;8(1):134–150.
  • 48. Huang F, Oldfield C, Meng J, et al. Subclassifying disordered proteins by the CH‐CDF plot method. Pac Symp Biocomput. 2012;128–139.
  • 49. Necci M, Piovesan D, Tosatto SC. Critical assessment of protein intrinsic disorder prediction. Nat Methods. 2021;18(5):472–481.
  • 50. Wang S, Ma J, Xu J. AUCpreD: Proteome‐level protein disorder prediction by AUC‐maximized deep convolutional neural fields. Bioinformatics. 2016;32(17):i672–i679.
  • 51. Hu G, Katuwawala A, Wang K, et al. flDPnn: Accurate intrinsic disorder prediction with putative propensities of disorder functions. Nat Commun. 2021;12(1):1–8.
  • 52. Erdős G, Pajkos M, Dosztányi Z. IUPred3: Prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Res. 2021;49(W1):W297–W303.
  • 53. Emenecker RJ, Griffith D, Holehouse AS. Metapredict: A fast, accurate, and easy‐to‐use predictor of consensus disorder and structure. Biophys J. 2021;120(20):4312–4319.
  • 54. Hanson J, Paliwal KK, Litfin T, Zhou Y. SPOT‐Disorder2: Improved protein intrinsic disorder prediction by ensembled deep learning. Genom Proteom Bioinf. 2019;17(6):645–656.
  • 55. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):1–9.
  • 56. Prechelt L. An empirical comparison of C, C++, Java, Perl, Python, Rexx and Tcl. IEEE Comput. 2000;33(10):23–29.
  • 57. Oden L. Lessons learned from comparing C‐CUDA and Python‐Numba for GPU‐computing. In 2020 28th Euromicro International Conference on Parallel, Distributed and Network‐Based Processing (PDP):216–223. IEEE; 2020.
  • 58. Laudani A, Lozito GM, Riganti Fulginei F, Salvini A. On training efficiency and computational costs of a feed forward neural network: A review. Comput Intell Neurosci. 2015;2015:818243.
  • 59. Peng K, Vucetic S, Radivojac P, Brown CJ, Dunker AK, Obradovic Z. Optimizing long intrinsic disorder predictors with protein evolutionary information. J Bioinf Comput Biol. 2005;3(1):35–60.
  • 60. Obradovic Z, Peng K, Vucetic S, Radivojac P, Brown CJ, Dunker AK. Predicting intrinsic disorder from amino acid sequence. Proteins. 2003;53(S6):566–572.
  • 61. Frank HS, Evans MW. Free volume and entropy in condensed systems III. Entropy in binary liquid mixtures; partial molal entropy in dilute solutions; structure and thermodynamics in aqueous electrolytes. J Chem Phys. 1945;13(11):507–532.
  • 62. Jencks WP. Catalysis in chemistry and enzymology. Dover Publications, New York; 1987.
  • 63. Kauzmann W. Some factors in the interpretation of protein denaturation. Adv Protein Chem. 1959;14:1–63.
  • 64. Traube J. Ueber die capillaritätsconstanten organischer stoffe in wässerigen lösungen. Justus Liebigs Ann Chem. 1891;265(1):27–55.
  • 65. Tanford C. How protein chemists learned about the hydrophobic factor. Protein Sci. 1997;6(6):1358–1366.
  • 66. Nozaki Y, Tanford C. The solubility of amino acids and two glycine peptides in aqueous ethanol and dioxane solutions: Establishment of a hydrophobicity scale. J Biol Chem. 1971;246(7):2211–2217.
  • 67. Rose GD. Prediction of chain turns in globular proteins on a hydrophobic basis. Nature. 1978;272(5654):586–590.
  • 68. Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157(1):105–132.
  • 69. Chothia C. The nature of the accessible and buried surfaces in proteins. J Mol Biol. 1976;105(1):1–12.
  • 70. Wolfenden R, Cullis P, Southgate C. Water, protein folding, and the genetic code. Science. 1979;206(4418):575–577.
  • 71. Wolfenden R, Andersson L, Cullis P, Southgate C. Affinities of amino acid side chains for solvent water. Biochemistry. 1981;20(4):849–855.
  • 72. Prilusky J, Felder CE, Zeev‐Ben‐Mordehai T, et al. FoldIndex©: A simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics. 2005;21(16):3435–3438.
  • 73. Romero P, Obradovic Z, Kissinger C, Villafranca J, Dunker AK. Identifying disordered regions in proteins from amino acid sequence. In Proceedings of International Conference on Neural Networks (ICNN'97). 1:90–95. IEEE; 1997.
  • 74. Xie Q, Obradovic Z, Arnold GE, Garner E, Romero P, Dunker AK. The sequence attribute method for determining relationships between sequence and protein disorder. Genome Inform Ser. 1998;9:193–200.
  • 75. Arnold GE, Dunker AK, Johns SJ, Douthart RJ. Use of conditional probabilities for determining relationships between amino acid sequence and protein secondary structure. Proteins. 1992;12(4):382–399.
  • 76. Romero P, Obradovic Z, Kissinger CR, et al. Thousands of proteins likely to have long disordered regions. Pac Symp Biocomput. 1998;3:437–448.
  • 77. Garner E, Cannon P, Romero P, Obradovic Z, Dunker AK. Predicting disordered regions from amino acid sequence common themes despite differing structural characterization. Genome Inform Ser. 1998;9:201–213.
  • 78. Garner E, Romero P, Dunker AK, Brown C, Obradovic Z. Predicting binding regions within disordered proteins. Genome Inform Ser. 1999;10:41–50.
  • 79. Li X, Brown CJ, Obradovic Z, Garner EC, Dunker AK. Comparing predictors of disordered protein. Genome Inform Ser. 2000;11:172–184.
  • 80. Vucetic S, Brown CJ, Dunker AK, Obradovic Z. Flavors of protein disorder. Proteins. 2003;52(4):573–584.
  • 81. Obradovic Z, Peng K, Vucetic S, Radivojac P, Dunker AK. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins. 2005;61(S7):176–182.
  • 82. Thomas PD, Dill KA. An iterative method for extracting energy‐like quantities from protein structures. Proc Natl Acad Sci U S A. 1996;93(21):11628–11633.
  • 83. Sprent P, Sprent P. Applied nonparametric statistical methods. New York, NY: Chapman and Hall London, 1989. OCLC: 858942954.
  • 84. Xue B, Oldfield CJ, Dunker AK, Uversky VN. CDF it all: Consensus prediction of intrinsically disordered proteins based on various cumulative distribution functions. FEBS Lett. 2009;583(9):1469–1474.
  • 85. Oldfield CJ, Cheng Y, Cortese MS, Brown CJ, Uversky VN, Dunker AK. Comparing and combining predictors of mostly disordered proteins. Biochemistry. 2005;44(6):1989–2000.
  • 86. Uversky VN. Natively unfolded proteins: A point where biology waits for physics. Protein Sci. 2002;11(4):739–756.
  • 87. Romero P, Obradovic Z, Dunker AK. Intelligent data analysis for protein disorder prediction. Art Intell Rev. 2000;14(6):447–484.
  • 88. Williams R, Obradovic Z, Mathura V, et al. The protein non‐folding problem: Amino acid determinants of intrinsic order and disorder. Pac Symp Biocomput. 2001;89–100.
  • 89. UniProt Consortium. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47(D1):D506–D515.

