2016 Feb 24;6:21383. doi: 10.1038/srep21383

Table 4. Statistical analysis of site non-optimality for protein crystallizability engineering using the independent test dataset for the CRYs class (sequence redundancy removed at 25% sequence identity).

		Secondary structure			Disorder		Buried/Exposed		Side chain entropy^b		
C_i threshold	All^a (645599)	Coil (296571)	Helix (260752)	Sheet (88276)	Disorder (66129)	Order (536962)	Exposed (275189)	Buried (370410)	SCE (96423)	SCE_E (74506)	SCE_B (21917)
C_i > 0.005	52.2%	49.8%	56.1%	49.1%	57.3%	52.3%	51.1%	53.2%	36.3%	38.0%	30.4%
C_i > 0.010	32.3%	30.2%	35.4%	29.8%	39.3%	32.2%	31.7%	32.7%	21.4%	22.9%	16.3%
C_i > 0.02	15.3%	14.2%	16.9%	14.6%	22.7%	15.1%	15.9%	15.0%	10.6%	11.6%	7.01%
C_i > 0.05	4.08%	3.92%	4.36%	3.83%	8.80%	3.76%	4.65%	3.67%	3.37%	3.80%	1.90%
C_i > 0.1	1.22%	1.27%	1.23%	1.02%	3.83%	0.99%	1.53%	0.99%	1.26%	1.44%	0.62%
C_i > 0.2	0.33%	0.41%	0.27%	0.22%	1.64%	0.19%	0.46%	0.23%	0.43%	0.48%	0.24%

	Charged amino acids			Hydrophobicity			Sequence loci^c		
C_i threshold	Negative (73537)	Positive (83209)	Charged (156745)	Low (196812)	Middle (164236)	High (284551)	N-terminal (36180)	Intermediate (573239)	C-terminal (36180)
C_i > 0.005	35.4%	55.4%	46.0%	49.6%	51.7%	54.3%	67.5%	50.6%	62.3%
C_i > 0.010	22.7%	36.0%	29.7%	31.2%	30.3%	34.1%	48.9%	30.4%	45.6%
C_i > 0.02	12.4%	19.0%	15.9%	15.7%	13.9%	16.0%	29.3%	13.7%	28.1%
C_i > 0.05	4.45%	6.01%	5.28%	4.42%	3.81%	4.01%	10.8%	3.18%	11.8%
C_i > 0.1	1.62%	2.35%	2.00%	1.35%	1.30%	1.08%	3.76%	0.78%	5.66%
C_i > 0.2	0.56%	0.95%	0.77%	0.36%	0.47%	0.23%	0.90%	0.14%	2.75%

^aThe dataset contains 2,342 proteins, including 1,814 proteins currently classified as non-crystallizable. The number of residues in each group is shown in parentheses.

^bStatistical analysis of side-chain entropy considered the three residue types with high side-chain conformational entropy: Lys (K), Gln (Q) and Glu (E), abbreviated KQE. SCE denotes the number of KQE residues in the entire sequence, while SCE_E and SCE_B denote the numbers of KQE residues annotated as exposed or buried, respectively.

^cN-terminal and C-terminal denote the first and last 20 residues of each protein sequence, respectively. The Intermediate group comprises all remaining residues, i.e. those outside the N-terminal and C-terminal groups.
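
The grouping described in footnotes b and c can be illustrated with a minimal Python sketch. It assumes that each table cell is the percentage of residues in a group whose C_i value exceeds the threshold, and that per-residue C_i scores are already available as a list aligned with the sequence. The function names (`locus_labels`, `percent_above`, `summarize`) and the handling of sequences shorter than 40 residues are illustrative assumptions, not part of the original analysis.

```python
from typing import Dict, List, Sequence

TERMINAL_LEN = 20            # footnote c: first and last 20 residues
KQE = frozenset("KQE")       # footnote b: high side-chain-entropy residue types
THRESHOLDS = (0.005, 0.010, 0.02, 0.05, 0.1, 0.2)  # thresholds used in Table 4


def locus_labels(length: int, terminal_len: int = TERMINAL_LEN) -> List[str]:
    """Label each position as N-terminal, Intermediate or C-terminal.

    Assumption: for sequences shorter than 2 * terminal_len the source does
    not say how overlaps are resolved; here the N-terminal label wins.
    """
    labels = []
    for i in range(length):
        if i < terminal_len:
            labels.append("N-terminal")
        elif i >= length - terminal_len:
            labels.append("C-terminal")
        else:
            labels.append("Intermediate")
    return labels


def percent_above(scores: Sequence[float], threshold: float) -> float:
    """Percentage of scores strictly greater than the threshold."""
    if not scores:
        return float("nan")
    return 100.0 * sum(s > threshold for s in scores) / len(scores)


def summarize(sequence: str, scores: Sequence[float]) -> Dict[str, Dict[float, float]]:
    """Per-group percentage of residues with C_i above each threshold."""
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    groups: Dict[str, List[float]] = {
        "All": list(scores),
        "N-terminal": [],
        "Intermediate": [],
        "C-terminal": [],
        "SCE": [],  # scores of K, Q and E residues only
    }
    for aa, score, label in zip(sequence, scores, locus_labels(len(sequence))):
        groups[label].append(score)
        if aa in KQE:
            groups["SCE"].append(score)
    return {
        name: {t: percent_above(vals, t) for t in THRESHOLDS}
        for name, vals in groups.items()
    }
```

Pooling the per-residue scores of all proteins in a dataset before calling `percent_above` would give dataset-level percentages in the same layout as the table; the remaining groups (secondary structure, disorder, solvent exposure) would need per-residue annotations that this sketch does not model.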