Nucleic Acids Research. 2007 Jun 12;35(Web Server issue):W460–W464. doi: 10.1093/nar/gkm363

PrDOS: prediction of disordered protein regions from amino acid sequence

Takashi Ishida 1,*, Kengo Kinoshita 1,2
PMCID: PMC1933209  PMID: 17567614

Abstract

PrDOS is a server that predicts the disordered regions of a protein from its amino acid sequence (http://prdos.hgc.jp). The server accepts a single protein amino acid sequence, in either plain text or FASTA format. The prediction system is composed of two predictors: a predictor based on local amino acid sequence information and one based on template proteins. The server combines the results of the two predictors and returns a two-state prediction (order/disorder) and a disorder probability for each residue. The prediction results are sent by e-mail, and the server also provides a web-interface to check the results.

INTRODUCTION

Recent progress in structural genomics has revealed that many proteins have regions with very flexible and unstable structures, even in their native states. Such proteins or regions are referred to as natively disordered or unstructured (1). Disordered protein regions often cause difficulties in purification and crystallization, and are a bottleneck in high-throughput structure determination (2). It would therefore be quite useful to identify the disordered regions of target proteins from their amino acid sequences.

The prediction of disordered regions is also important for the functional annotation of proteins. Under the classical ‘lock-and-key’ theory (3), it is hard to imagine that natively disordered regions have any biological meaning. However, disordered regions are reportedly involved in many biological processes, such as regulation, signaling and cell cycle control (4,5). The primary role of natively disordered regions seems to be the molecular recognition of proteins or DNA. Disorder-to-order transitions are frequently observed upon ligand binding, and the flexibility of the disordered regions may be necessary to facilitate interactions with multiple partners with high specificity and low affinity (6). In addition, recent research has indicated that phosphorylation sites are frequently found in disordered regions, and thus the prediction of phosphorylation sites is expected to be improved by the accurate identification of disordered regions (7).

Disordered protein regions have characteristic amino acid sequence features, such as a higher frequency of hydrophilic and charged residues, or low sequence complexity (4). Disordered regions are therefore predictable from these features, and various prediction methods have been reported (8–10).

We have also developed a system to predict disordered regions from the amino acid sequence. Our system is composed of two predictors: one based on the local amino acid sequence, and one based on template proteins (homologous proteins for which structural information is available). The first predictor is implemented using a support vector machine (SVM) algorithm (11) applied to the position-specific score matrix (or profile) of the input sequence. More precisely, a sliding window is used to map individual residues into a feature space. A similar idea has already been used in secondary structure prediction, as in PSIPRED (12). The second predictor assumes the conservation of intrinsic disorder in protein families (13,14), and is implemented simply, using PSI-BLAST (15) and our own measure of disorder, as described later. The final prediction combines the results of the two predictors.

The performance of disorder prediction methods has been evaluated since 2002 by the structural biology community in CASP, the Critical Assessment of Techniques for Protein Structure Prediction benchmark (16). In 2006, the seventh round of CASP (CASP7) was held, and the assessors evaluated our method. Our method achieved high performance [estimated accuracy (Q2) >90% at a sensitivity of 0.56], especially for short disordered regions. The details are available on the CASP7 meeting web page at http://predictioncenter.org/casp7/meeting/presentations/Presentations_assessors/CASP7_DR_Bordoli.pdf (our group number is 443 and our team name is fais). PrDOS is the web interface to this prediction system.

Inputting data and accessing results

The server requires protein amino acid sequences in either plain text or FASTA (17) format as input. The user can submit multi-FASTA input to predict the disordered regions of several proteins at once; because of limited computational resources, such input may contain at most 100 sequences. The server accepts the 20 single-letter codes for the standard amino acids and the code ‘X’, which is generally used for non-standard amino acids. Other letters, such as ‘U’ for selenocysteine, are automatically replaced by ‘X’. When a single protein sequence is submitted, the user can choose to receive the prediction result by either e-mail or the web interface. The user can also select the prediction false positive rate, that is, the rate at which ordered residues are incorrectly predicted as disordered. The acceptable false positive rate depends strongly on the purpose of the prediction, so the user should choose a value that suits the application; this parameter can also be changed later on the result web page. The true positive rate corresponding to each false positive rate can be read from the receiver operating characteristic (ROC) curve on the web page. This ROC curve was derived from the results of the 5-fold cross-validation test on the training set, by varying the order/disorder threshold and calculating the true positive rate at each false positive rate. The default false positive rate is 5%.
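The input rules described above can be illustrated with a short sketch. It is not the server's code: the function names, the default record name `query` and the treatment of digits or symbols (also mapped to ‘X’ here) are assumptions made for the example; only the alphabet, the replacement by ‘X’ and the limit of 100 sequences come from the text.

```python
from typing import List, Tuple

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acid codes
MAX_SEQUENCES = 100                     # limit stated in the text


def sanitize(sequence: str) -> str:
    """Upper-case the sequence, drop whitespace, replace other letters by 'X'."""
    return "".join(
        c if c in STANDARD else "X"
        for c in sequence.upper()
        if not c.isspace()
    )


def parse_input(text: str) -> List[Tuple[str, str]]:
    """Parse plain-text or (multi-)FASTA input into (name, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if name is not None or chunks:
                records.append((name or "query", sanitize("".join(chunks))))
            name, chunks = line[1:].strip() or "query", []
        else:
            chunks.append(line)
    if name is not None or chunks:
        records.append((name or "query", sanitize("".join(chunks))))
    if len(records) > MAX_SEQUENCES:
        raise ValueError(f"at most {MAX_SEQUENCES} sequences are accepted")
    return records
```

For example, `parse_input(">a\nMKU\n>b\nACD")` yields two records, with the selenocysteine code in the first one replaced by ‘X’.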

Although the calculation time is sensitive to the length of the query protein and the server conditions, a typical prediction will take from 5 to 10 min. The user can check the estimated calculation time on the submission confirmation page. The e-mail results also include the URL of the result web page. The result web page contains the result of the two-state prediction with the given false positive rate, and the disorder profile plot (Figure 1). The user can also download the raw prediction results in the CSV format or the CASP format from the same page.
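The link between a chosen false positive rate and the two-state prediction can be sketched as follows. This is a minimal illustration, not the server's implementation: it assumes per-residue scores with known labels (1 for disordered, 0 for ordered) from a cross-validation run containing both classes, and it assumes that a residue is called disordered when its score is at or above the threshold. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def rates(scores: Sequence[float], labels: Sequence[int],
          threshold: float) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) at a threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / positives, fp / negatives


def threshold_for_fpr(scores: Sequence[float], labels: Sequence[int],
                      max_fpr: float) -> float:
    """Lowest observed score usable as threshold without exceeding max_fpr."""
    best = float("inf")  # calling nothing disordered always satisfies the bound
    # Lowering the threshold can only raise the false positive rate,
    # so we can stop at the first candidate that exceeds the bound.
    for candidate in sorted(set(scores), reverse=True):
        _, fpr = rates(scores, labels, candidate)
        if fpr > max_fpr:
            break
        best = candidate
    return best


def two_state(scores: Sequence[float], threshold: float) -> List[str]:
    """'D' for residues predicted disordered, 'O' for ordered."""
    return ["D" if s >= threshold else "O" for s in scores]
```

Sweeping `max_fpr` from 0 to 1 and recording the true positive rate returned by `rates` at each threshold traces out an ROC curve of the kind shown on the web page.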

Figure 1.

An example of the prediction result page for HIV-1 NEF (PDB code: 2NEF). (A) The prediction result of the two-state prediction (disorder/order) is shown in this part. The red residues are predicted to be disordered at the given prediction false positive rate. (B) The plot of disorder probability of each residue along the sequence is shown in this part. Residues beyond the red threshold line in this plot are predicted to be disordered. The user can change the size of the plot through the web-interface.

Figure 1 shows a typical result page. The query protein is the HIV-1 negative factor (Nef) protein, whose monomer is known to have a disordered region at the N-terminus; this region is critically important for binding to an SH3 domain (18).

Prediction flow

Step 1: Making the sequence profile

The information content of a single amino acid sequence is vastly enriched by using information about homologous proteins. For this purpose, a multiple alignment with the homologues is more useful than the single amino acid sequence alone. Our system uses a position-specific score matrix (PSSM, or profile), which represents similar information in a more convenient form than a multiple alignment of the homologues. Therefore, in the first step, the target amino acid sequence is converted into a PSSM by two rounds of PSI-BLAST searches against the NCBI non-redundant (nr) amino acid sequence database (19) with default parameters. The following two predictions are then performed using the PSSM.
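As an illustration only, a comparable PSSM can be produced with the `psiblast` program from NCBI BLAST+. The text does not say which PSI-BLAST executable or version the server runs, so the program choice, the file names and the helper function below are assumptions; the only settings taken from the text are the two search rounds and the otherwise default parameters.

```python
from typing import List


def psiblast_command(query_fasta: str, database: str,
                     pssm_out: str, rounds: int = 2) -> List[str]:
    """Build a BLAST+ psiblast command that writes an ASCII PSSM.

    All other parameters are left at their defaults, as in the text.
    """
    return [
        "psiblast",
        "-query", query_fasta,         # single-sequence FASTA file
        "-db", database,               # e.g. a local copy of nr
        "-num_iterations", str(rounds),
        "-out_ascii_pssm", pssm_out,   # PSSM from the last completed round
    ]
```

On a machine with BLAST+ and a formatted database installed, the returned list can be passed to `subprocess.run(command, check=True)`.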

Step 2: Prediction based on local amino acid sequence information

In the first predictor, the prediction is done using an SVM, which is a supervised machine learning technique. The SVM was trained using a non-redundant set of protein chains from the Protein Data Bank (PDB) (20), selected with the PISCES server (21). The training set was selected by the following criteria: determined by X-ray crystallography, resolution ≤2.0 Å, R-factor ≤0.25, sequence identities to each other ≤20% and sequence length >50. Disordered regions in these proteins were identified as the missing residues listed in the REMARK 465 lines of the PDB files. Residues with crystal or biological contacts with other chains were excluded, because such contacts may stabilize disordered residues into an ordered state. As a result, 1954 chains with 5110 disordered residues (4.8%) and 109 921 ordered residues (95.2%) were used as the training set. The protein sequence information was then converted into the input vector. The input vector consisted of PSSM information and spacers in a 27-residue window centered at the residue (Figure 2). A spacer indicates whether a window position lies beyond the N- or C-terminus: it was set to 1 if the position was beyond either terminus, and to 0 otherwise. Each element of the PSSM was scaled into the range from −1.0 to 1.0 by dividing by 10. The resulting input vector has 567 dimensions [=(20 + 1) × 27].

Figure 2.

Diagram of the sequence encoding scheme. The sequence information in a 27-residue window is converted into an input vector by aligning the elements in a fixed order. For each site, the PSSM values for the 20 amino acid types and the spacer information are appended to the input vector, so the total dimension of the input vector is 567 [=(20 + 1) × 27].
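The encoding can be sketched in a few lines. The window size, the division by 10 and the spacer rule follow the text; the order of the elements within each site (20 PSSM values followed by the spacer) and the zero filling of PSSM slots beyond the termini are assumptions made for this example, as is the function name.

```python
from typing import List, Sequence

WINDOW = 27          # residues per window, as in the text
HALF = WINDOW // 2   # 13 positions on each side of the central residue
N_AMINO = 20         # PSSM columns per residue


def encode_residue(pssm: Sequence[Sequence[float]], center: int) -> List[float]:
    """Encode the window around `center` as a (20 + 1) * 27 = 567-vector."""
    vector: List[float] = []
    for pos in range(center - HALF, center + HALF + 1):
        if 0 <= pos < len(pssm):
            vector.extend(score / 10.0 for score in pssm[pos])  # scaled PSSM row
            vector.append(0.0)                   # spacer: inside the sequence
        else:
            vector.extend([0.0] * N_AMINO)       # assumption: zero-filled
            vector.append(1.0)                   # spacer: beyond a terminus
    return vector
```

Calling `encode_residue` once per residue of the query gives one 567-dimensional vector per residue, which is the form in which residues are presented to the SVM.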

For the query sequence, the same encoding is carried out, and the trained SVM predicts the disorder propensity of each residue. Note that an SVM is a binary classifier, and thus it returns only order or disorder as its prediction result. We therefore use the distance from the decision plane in the feature space, called the decision value, as the prediction value.
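For reference, the decision value of a kernel SVM has the standard textbook form below. The symbols are not taken from the text: x is the encoded window of the query residue, x_k and y_k ∈ {−1, +1} are the support vectors and their labels, λ_k and b are the learned coefficients and bias, and K is the kernel, which the text does not specify.

```latex
f(\mathbf{x}) = \sum_{k} \lambda_k \, y_k \, K(\mathbf{x}_k, \mathbf{x}) + b,
\qquad
\hat{y}(\mathbf{x}) = \operatorname{sign} f(\mathbf{x})
```

The sign of f(x) gives the binary order/disorder call, while its magnitude grows with the distance from the decision plane, which is why it can serve as a graded prediction value.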

Step 3: Template-based prediction

In the second predictor, the prediction is done using the alignments of homologues with structures that have been determined. The sequence homologues are searched against the PDB, using a PSI-BLAST search with the PSSM obtained in the first step. The alignments of the hit sequences with e-values <1.0e-3 are used for the prediction. If there are no significant hits, then this prediction is skipped. The disorder tendency of the ith residue, P_i, is defined by the following equation:

$$P_i = \frac{\sum_{j=1}^{n} I_j \, \alpha_j}{\sum_{j=1}^{n} I_j}$$

where n is the number of alignments, I_j is the sequence identity of the jth hit and α_j is set to 1 if the aligned residue in the jth hit is disordered; otherwise, it is 0. In other words, P_i evaluates the weighted ratio of disordered residues among the homologous proteins.
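A small sketch of this measure follows, reading the definition as an identity-weighted fraction of disordered aligned residues, as the phrase "weighted ratio" suggests. The data layout, the function names and the handling of hits that do not cover a residue (they are skipped here) are assumptions made for the example.

```python
from typing import List, Optional, Sequence, Tuple

# One hit: (sequence identity I_j, per-residue flags along the query).
# A flag is 1 (aligned residue disordered), 0 (ordered) or None (not aligned).
Hit = Tuple[float, Sequence[Optional[int]]]


def template_disorder(identities: Sequence[float],
                      flags: Sequence[int]) -> Optional[float]:
    """Identity-weighted fraction of disordered residues at one position."""
    total = sum(identities)
    if total == 0:
        return None  # no usable hit at this position
    return sum(i * a for i, a in zip(identities, flags)) / total


def template_profile(length: int, hits: Sequence[Hit]) -> List[Optional[float]]:
    """Disorder tendency for every residue of a query of the given length."""
    profile = []
    for pos in range(length):
        covering = [(ident, f[pos]) for ident, f in hits if f[pos] is not None]
        profile.append(template_disorder([c[0] for c in covering],
                                         [c[1] for c in covering]))
    return profile
```

With two hits of 90% and 30% identity, a residue that is disordered only in the closer homologue receives a tendency of 90 / (90 + 30) = 0.75.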

Step 4: Combining prediction results

To combine the results of the two independent predictions, their weighted average is calculated. The weight for the prediction based on local amino acid sequence information is 1.0, and the weight for the template-based prediction is about 0.11. These weights were obtained by optimizing the ROC score (22) on the results of the 5-fold cross-validation test. Next, a moving-average low-pass filter is applied along the sequence to smooth the prediction results. This smoothing avoids unrealistic predictions, such as an isolated ordered residue within a long disordered region. Finally, the prediction values are scaled to the range from 0.0 to 1.0, so that they correspond to the disorder probability used in CASP.
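The three operations can be sketched as below. Only the two weights come from the text. The normalization by the sum of the weights, the fallback when the template prediction was skipped, the smoothing window of 5 residues and the linear rescaling between fixed bounds with clipping are all assumptions made for this example, and the text does not say how the two scores are brought onto a common scale before averaging.

```python
from typing import List, Optional, Sequence

W_LOCAL = 1.0      # weight of the local-sequence (SVM) prediction, from the text
W_TEMPLATE = 0.11  # approximate weight of the template prediction, from the text


def combine(local: Sequence[float],
            template: Optional[Sequence[float]]) -> List[float]:
    """Weighted average of the two per-residue predictions."""
    if template is None:  # assumption: template step skipped, use SVM alone
        return list(local)
    return [(W_LOCAL * l + W_TEMPLATE * t) / (W_LOCAL + W_TEMPLATE)
            for l, t in zip(local, template)]


def moving_average(values: Sequence[float], window: int = 5) -> List[float]:
    """Centered moving average; the window is truncated at the termini."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def rescale(values: Sequence[float], low: float, high: float) -> List[float]:
    """Map [low, high] linearly onto [0, 1], clipping values outside it."""
    return [min(1.0, max(0.0, (v - low) / (high - low))) for v in values]
```

For instance, `moving_average([1, 1, 0, 1, 1], window=3)` raises the isolated low value in the middle to 2/3, which is the kind of outlier the smoothing step is meant to suppress.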

ACKNOWLEDGEMENTS

This work was supported by a Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Culture, Sports and Technology of Japan. Computation time was provided by the Super Computer System, Human Genome Center, Institute of Medical Science, The University of Tokyo. Funding to pay the Open Access publication charges for this article was provided by Japan Science and Technology Agency, Institute for Bioinformatics Research and Development (JST-BIRD).

Conflict of interest statement. None declared.

REFERENCES

  • 1. Tompa P. Intrinsically unstructured proteins. Trends Biochem. Sci. 2002;27:523–533. doi: 10.1016/s0968-0004(02)02169-2.
  • 2. Oldfield CJ, Ulrich EL, Cheng Y, Dunker AK, Markley JL. Addressing the intrinsic disorder bottleneck in structural proteomics. Proteins. 2005;59:444–453. doi: 10.1002/prot.20446.
  • 3. Fischer E. Einfluss der configuration auf die wirkung der enzyme. Ber. Dt. Chem. Ges. 1894;27:2985–2993.
  • 4. Dunker AK, Brown CJ, Lawson JD, Iakoucheva LM, Obradovic Z. Intrinsic disorder and protein function. Biochemistry. 2002;41:6573–6582. doi: 10.1021/bi012159+.
  • 5. Wright PE, Dyson HJ. Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol. 1999;293:321–331. doi: 10.1006/jmbi.1999.3110.
  • 6. Dyson HJ, Wright PE. Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol. 2005;6:197–208. doi: 10.1038/nrm1589.
  • 7. Iakoucheva LM, Radivojac P, Brown CJ, O'Connor TR, Sikes JG, Obradovic Z, Dunker AK. The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res. 2004;32:1037–1049. doi: 10.1093/nar/gkh253.
  • 8. Garner E, Cannon P, Romero P, Obradovic Z, Dunker AK. Predicting disordered regions from amino acid sequence: common themes despite differing structural characterization. Genome Inform. Ser. Workshop Genome Inform. 1998;9:201–213.
  • 9. Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB. Protein disorder prediction: implications for structural proteomics. Structure. 2003;11:1453–1459. doi: 10.1016/j.str.2003.10.002.
  • 10. Jones DT, Ward JJ. Prediction of disordered regions in proteins from position specific score matrices. Proteins. 2003;53:573–578. doi: 10.1002/prot.10528.
  • 11. Vapnik V. Statistical Learning Theory. New York: John Wiley & Sons; 1998.
  • 12. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 1999;292:195–202. doi: 10.1006/jmbi.1999.3091.
  • 13. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT. Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J. Mol. Biol. 2004;337:635–645. doi: 10.1016/j.jmb.2004.02.002.
  • 14. Chen JW, Romero P, Uversky VN, Dunker AK. Conservation of intrinsic disorder in protein domains and families: I. A database of conserved predicted disordered regions. J. Proteome Res. 2006;5:879–887. doi: 10.1021/pr060048x.
  • 15. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
  • 16. Melamud E, Moult J. Evaluation of disorder predictions in CASP5. Proteins. 2003;53:561–565. doi: 10.1002/prot.10533.
  • 17. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. 1988;85:2444–2448. doi: 10.1073/pnas.85.8.2444.
  • 18. Lee CH, Saksela K, Mirza UA, Chait BT, Kuriyan J. Crystal structure of the conserved core of HIV-1 Nef complexed with a Src family SH3 domain. Cell. 1996;85:931–942. doi: 10.1016/s0092-8674(00)81276-3.
  • 19. McEntyre J, Ostell J. The NCBI Handbook. Bethesda, MD: National Library of Medicine (US), NCBI; 2005.
  • 20. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
  • 21. Wang G, Dunbrack RL. PISCES: a protein sequence culling server. Bioinformatics. 2003;19:1589–1591. doi: 10.1093/bioinformatics/btg224.
  • 22. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin. Chem. 1993;39:561–577.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press