2021 Dec 17;22(24):13555. doi: 10.3390/ijms222413555

Table 8.

Construction steps for dataset preparation and the number of sequences retained after each step. Note: the final numbers of sequences in the training and test sets after pre-processing are in bold.

| Construction step | Training set (total) | Training set, soluble | Training set, insoluble | Test set 1 (total) | Test set 1, soluble | Test set 1, insoluble | Test set 2 (total) | Test set 2, soluble | Test set 2, insoluble |
|---|---|---|---|---|---|---|---|---|---|
| Input | 129,593 | - | - | 2001 | 1000 | 1001 | 9703 | - | - |
| Pre-processing and solubility assignment | 109,648 | - | - | 2001 | 1000 | 1001 | - | - | - |
| Redundancy removal | 87,969 | 40,905 | 14,064 | 2001 | 1000 | 1001 | 9423 | 5718 | 3705 |
| Removal of short sequences and sequences with unknown residues | 82,902 | 50,004 | 32,898 | 2001 | 1000 | 1001 | 9420 | 5715 | 3705 |
| Removal of transmembrane proteins | 76,274 | 45,603 | 30,671 | 2001 | 1000 | 1001 | 8769 | 5421 | 3348 |
| Removal of insoluble sequences with available PDB structure | 72,756 | 42,530 | 30,226 | 2001 | 1000 | 1001 | 8754 | 5421 | 3333 |
| Clustering to 25% identity | 49,369 | 26,422 | 22,947 | 2001 | 1000 | 1001 | 3945 | 2078 | 1867 |
| Overlap removal with test sets at 15% identity | 46,028 | 24,920 | 21,108 | 2001 | 1000 | 1001 | 3945 | 2078 | 1867 |
| Class and length balancing | **40,317** | **19,718** | **20,599** | **2001** | **1000** | **1001** | **3729** | **1864** | **1865** |
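
To make the attrition across the construction steps easier to read, the following minimal Python sketch computes, for the training set only, the fraction of sequences retained at each step relative to the previous one, and the overall fraction retained. The counts are copied from the training-set total column of Table 8; the names `TRAINING_TOTALS`, `step_retention`, and `overall_retention` are illustrative and do not come from the article.

```python
# Training-set totals after each construction step, copied from Table 8.
TRAINING_TOTALS = [
    ("Input", 129_593),
    ("Pre-processing and solubility assignment", 109_648),
    ("Redundancy removal", 87_969),
    ("Removal of short sequences and sequences with unknown residues", 82_902),
    ("Removal of transmembrane proteins", 76_274),
    ("Removal of insoluble sequences with available PDB structure", 72_756),
    ("Clustering to 25% identity", 49_369),
    ("Overlap removal with test sets at 15% identity", 46_028),
    ("Class and length balancing", 40_317),
]


def step_retention(totals):
    """Return (step, count, fraction retained relative to the previous step)."""
    rows = []
    previous = None
    for step, count in totals:
        # The first row has no predecessor, so it is reported as fully retained.
        fraction = 1.0 if previous is None else count / previous
        rows.append((step, count, fraction))
        previous = count
    return rows


def overall_retention(totals):
    """Return the final count divided by the input count."""
    return totals[-1][1] / totals[0][1]


if __name__ == "__main__":
    for step, count, fraction in step_retention(TRAINING_TOTALS):
        print(f"{step:<65} {count:>8,} {fraction:7.1%}")
    print(f"Overall fraction retained: {overall_retention(TRAINING_TOTALS):.1%}")
```

Under these counts, roughly 31% of the input training sequences remain after class and length balancing. The same functions can be applied to the Test Set 2 totals by passing a list built from that column.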