Table 4. Statistical analysis of site non-optimality for protein crystallizability engineering using the independent test dataset for the CRYs class (sequence redundancy removed at 25% sequence identity).
Ci threshold | All^a (645599) | Secondary structure: Coil (296571) | Secondary structure: Helix (260752) | Secondary structure: Sheet (88276) | Disorder: Disorder (66129) | Disorder: Order (536962) | Buried/Exposed: Exposed (275189) | Buried/Exposed: Buried (370410) | Side chain entropy^b: SCE (96423) | Side chain entropy^b: SCE_E (74506) | Side chain entropy^b: SCE_B (21917) |
---|---|---|---|---|---|---|---|---|---|---|---|
Ci > 0.005 | 52.2% | 49.8% | 56.1% | 49.1% | 57.3% | 52.3% | 51.1% | 53.2% | 36.3% | 38.0% | 30.4% |
Ci > 0.010 | 32.3% | 30.2% | 35.4% | 29.8% | 39.3% | 32.2% | 31.7% | 32.7% | 21.4% | 22.9% | 16.3% |
Ci > 0.02 | 15.3% | 14.2% | 16.9% | 14.6% | 22.7% | 15.1% | 15.9% | 15.0% | 10.6% | 11.6% | 7.01% |
Ci > 0.05 | 4.08% | 3.92% | 4.36% | 3.83% | 8.80% | 3.76% | 4.65% | 3.67% | 3.37% | 3.80% | 1.90% |
Ci > 0.1 | 1.22% | 1.27% | 1.23% | 1.02% | 3.83% | 0.99% | 1.53% | 0.99% | 1.26% | 1.44% | 0.62% |
Ci > 0.2 | 0.33% | 0.41% | 0.27% | 0.22% | 1.64% | 0.19% | 0.46% | 0.23% | 0.43% | 0.48% | 0.24% |

Ci threshold | Charged amino acids: Negative (73537) | Charged amino acids: Positive (83209) | Charged amino acids: Charged (156745) | Hydrophobic: Low (196812) | Hydrophobic: Middle (164236) | Hydrophobic: High (284551) | Sequence loci^c: N-terminal (36180) | Sequence loci^c: Intermediate (573239) | Sequence loci^c: C-terminal (36180) |
---|---|---|---|---|---|---|---|---|---|
Ci > 0.005 | 35.4% | 55.4% | 46.0% | 49.6% | 51.7% | 54.3% | 67.5% | 50.6% | 62.3% |
Ci > 0.010 | 22.7% | 36.0% | 29.7% | 31.2% | 30.3% | 34.1% | 48.9% | 30.4% | 45.6% |
Ci > 0.02 | 12.4% | 19.0% | 15.9% | 15.7% | 13.9% | 16.0% | 29.3% | 13.7% | 28.1% |
Ci > 0.05 | 4.45% | 6.01% | 5.28% | 4.42% | 3.81% | 4.01% | 10.8% | 3.18% | 11.8% |
Ci > 0.1 | 1.62% | 2.35% | 2.00% | 1.35% | 1.30% | 1.08% | 3.76% | 0.78% | 5.66% |
Ci > 0.2 | 0.56% | 0.95% | 0.77% | 0.36% | 0.47% | 0.23% | 0.90% | 0.14% | 2.75% |

^a The dataset contains 2,342 proteins, including 1,814 proteins currently classified as non-crystallizable. The number of residues in each group is shown in parentheses.
^b Statistical analysis of side-chain entropy considered three residue types with high conformational entropy (K, Q, and E). SCE denotes the number of KQE residues in the entire sequence, while SCE_E and SCE_B denote the numbers of KQE residues annotated as exposed or buried, respectively.
^c N-terminal and C-terminal denote the first and last 20 residues of each protein sequence, respectively. The Intermediate group comprises all remaining residues, i.e., every residue outside the N-terminal and C-terminal groups.
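
The groupings defined in the footnotes can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the 0-based indexing, the handling of sequences shorter than 40 residues (the N-terminal group takes precedence), and the toy scores are all assumptions made for illustration. Only the 20-residue terminal window, the KQE residue set, and the "fraction of residues above a Ci threshold" reading of the table come from the text.

```python
from collections import Counter

TERMINAL_LENGTH = 20                      # window size stated in footnote c
HIGH_ENTROPY_RESIDUES = frozenset("KQE")  # residue types stated in footnote b


def sequence_locus(index, length, window=TERMINAL_LENGTH):
    """Assign a 0-based residue index to a sequence-locus group.

    Assumption: for sequences shorter than 2 * window, the N-terminal
    group takes precedence over the C-terminal group.
    """
    if index < window:
        return "N-terminal"
    if index >= length - window:
        return "C-terminal"
    return "Intermediate"


def locus_counts(sequence):
    """Count how many residues of a sequence fall into each locus group."""
    n = len(sequence)
    return Counter(sequence_locus(i, n) for i in range(n))


def count_kqe(sequence):
    """Count K, Q and E residues (the SCE column, per sequence)."""
    return sum(residue in HIGH_ENTROPY_RESIDUES for residue in sequence)


def fraction_above(scores, threshold):
    """Fraction of per-residue scores strictly greater than the threshold."""
    scores = list(scores)
    if not scores:
        return 0.0
    return sum(score > threshold for score in scores) / len(scores)


# Toy usage with made-up data (hypothetical, not taken from the dataset).
toy_sequence = "A" * 100
toy_scores = [0.001, 0.006, 0.02, 0.3]

print(dict(locus_counts(toy_sequence)))   # 20 / 60 / 20 residues per group
print(count_kqe("MKQEAK"))                # 4
print(fraction_above(toy_scores, 0.005))  # 0.75
```

Each percentage in the table would then correspond to `fraction_above` applied to the Ci values of all residues in one group, pooled over the proteins in the test set.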