Table 8.
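Construction Step | Training Set (total) | Training Set (soluble) | Training Set (insoluble) | Test Set 1 (total) | Test Set 1 (soluble) | Test Set 1 (insoluble) | Test Set 2 (total) | Test Set 2 (soluble) | Test Set 2 (insoluble) |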
---|---|---|---|---|---|---|---|---|---|
Input | 129,593 | - | - | 2,001 | 1,000 | 1,001 | 9,703 | - | - |
Pre-processing and solubility assignment | 109,648 | - | - | 2,001 | 1,000 | 1,001 | - | - | - |
Redundancy removal | 87,969 | 40,905 | 14,064 | 2,001 | 1,000 | 1,001 | 9,423 | 5,718 | 3,705 |
Removal of short sequences and sequences with unknown residues | 82,902 | 50,004 | 32,898 | 2,001 | 1,000 | 1,001 | 9,420 | 5,715 | 3,705 |
Removal of transmembrane proteins | 76,274 | 45,603 | 30,671 | 2,001 | 1,000 | 1,001 | 8,769 | 5,421 | 3,348 |
Removal of insoluble sequences with available PDB structure | 72,756 | 42,530 | 30,226 | 2,001 | 1,000 | 1,001 | 8,754 | 5,421 | 3,333 |
Clustering to 25% identity | 49,369 | 26,422 | 22,947 | 2,001 | 1,000 | 1,001 | 3,945 | 2,078 | 1,867 |
Overlap removal with test sets (15% identity) | 46,028 | 24,920 | 21,108 | 2,001 | 1,000 | 1,001 | 3,945 | 2,078 | 1,867 |
Class and length balancing | 40,317 | 19,718 | 20,599 | 2,001 | 1,000 | 1,001 | 3,729 | 1,864 | 1,865 |
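
The counts above can be sanity-checked by confirming that the soluble and insoluble counts add up to the reported total at each step. The following is a minimal sketch, not part of the original pipeline: the names `TRAINING_ROWS`, `inconsistent_rows`, and `soluble_fraction` are hypothetical, the step labels are shortened, and the numbers are copied from the training-set columns exactly as printed in Table 8.

```python
# Training-set counts copied from Table 8: step -> (total, soluble, insoluble).
# Rows whose soluble/insoluble breakdown is "-" in the table are left out.
# Step labels are shortened here for readability (an illustrative choice).
TRAINING_ROWS = {
    "Redundancy removal": (87_969, 40_905, 14_064),
    "Removal of short sequences": (82_902, 50_004, 32_898),
    "Removal of transmembrane proteins": (76_274, 45_603, 30_671),
    "Removal of insoluble sequences with PDB structure": (72_756, 42_530, 30_226),
    "Clustering to 25% identity": (49_369, 26_422, 22_947),
    "Overlap removal with test sets": (46_028, 24_920, 21_108),
    "Class and length balancing": (40_317, 19_718, 20_599),
}


def inconsistent_rows(rows):
    """Return {step: total - (soluble + insoluble)} for rows that do not add up."""
    return {
        step: total - (soluble + insoluble)
        for step, (total, soluble, insoluble) in rows.items()
        if total != soluble + insoluble
    }


def soluble_fraction(rows, step):
    """Fraction of soluble sequences among the labeled sequences at a given step."""
    _total, soluble, insoluble = rows[step]
    return soluble / (soluble + insoluble)


if __name__ == "__main__":
    for step, gap in inconsistent_rows(TRAINING_ROWS).items():
        print(f"{step}: total exceeds soluble + insoluble by {gap}")
    final = soluble_fraction(TRAINING_ROWS, "Class and length balancing")
    print(f"Soluble fraction after balancing: {final:.3f}")
```

With the values as printed, every training-set row passes the check except "Redundancy removal", where the soluble and insoluble counts do not sum to the stated total; this may point to a transcription error in that row. The same check can be applied to the test-set columns by substituting their counts.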