Appendix 1—table 6. Retrospective analysis of predictor quality at different stages during the training process.
AUC values for distinguishing proteomic phase-separating sequences from the human proteome are shown for prediction scores derived from pi-contact frequencies (average contacts predicted per residue) obtained at each training step of the protocol, listed in order of their sequential development. Each prediction score is the highest number of contacts predicted for any 100-residue window in the sequence. The relative effects of different contact types were assessed by excluding contacts from each score, keeping only the indicated contact type, and retesting. The standard error of the mean (SEM), estimated by bootstrap analysis, is consistently in the range 0.021 to 0.039.
| Training step | AUC at training step | Sidechain contacts only | Backbone contacts only | Short-range sidechain only | Long-range sidechain only | Short-range backbone only | Long-range backbone only |
|---|---|---|---|---|---|---|---|
| (1) Baseline Frequencies | 0.57 | 0.51 | 0.84 | 0.52 | 0.50 | 0.73 | 0.80 |
| (2) Context-Averaged Frequencies | 0.57 | 0.51 | 0.86 | 0.53 | 0.51 | 0.77 | 0.83 |
| (3) Smoothed Frequency Predictions | 0.82 | 0.64 | 0.89 | 0.59 | 0.65 | 0.71 | 0.85 |
| (4) Weight Optimized Final Predictor | 0.88 | N/A | N/A | N/A | N/A | N/A | N/A |
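
The scoring and evaluation described in the legend can be illustrated with a minimal sketch. This is not the authors' code: the function names, the handling of sequences shorter than one window, the number of bootstrap resamples, and the use of the standard deviation of resampled AUCs as the bootstrap error estimate are all assumptions made for illustration. Only the 100-residue window and the use of AUC with a bootstrap error come from the legend.

```python
import random
from statistics import stdev

WINDOW = 100  # window length in residues, as stated in the legend


def max_window_score(per_residue, window=WINDOW):
    """Highest summed contact prediction over any contiguous window.

    Assumption: a sequence shorter than one window is scored by its total.
    """
    if len(per_residue) <= window:
        return float(sum(per_residue))
    current = sum(per_residue[:window])
    best = current
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(per_residue)):
        current += per_residue[i] - per_residue[i - window]
        best = max(best, current)
    return float(best)


def auc(positives, negatives):
    """Probability that a random positive outscores a random negative.

    Ties count as one half (the Mann-Whitney formulation of the AUC).
    """
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def bootstrap_error(positives, negatives, n_boot=200, seed=0):
    """Spread of the AUC over resamples drawn with replacement.

    Assumption: n_boot and seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    resampled = []
    for _ in range(n_boot):
        pos = rng.choices(positives, k=len(positives))
        neg = rng.choices(negatives, k=len(negatives))
        resampled.append(auc(pos, neg))
    return stdev(resampled)
```

Here `positives` would hold the window scores of the phase-separating sequences and `negatives` those of the background proteome. A "sidechain only" or "backbone only" column would correspond to computing the per-residue values from that contact type alone before calling `max_window_score`.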