Table 1.
Number of protein sequences at each subcellular location in the benchmark dataset D3106 created by Shen et al.
| Subcellular location | Number of protein sequences |
|---|---|
| Nucleus | 1021 |
| Cytoplasm | 817 |
| Extracellular | 385 |
| Mitochondrion | 364 |
| Plasma membrane | 354 |
| Endoplasmic reticulum | 229 |
| Golgi apparatus | 161 |
| Cytoskeleton | 79 |
| Centriole | 77 |
| Lysosome | 77 |
| Peroxisome | 47 |
| Endosome | 24 |
| Microsome | 24 |
| Synapse | 22 |
| Total | 3681 |
The dataset D3106 covers 14 subcellular locations, which are listed in the first column of this table; the second column gives the number of proteins found at each location. The dataset contains 3106 protein sequences, but the location counts sum to 3681 because certain sequences are found in more than one location. The sequences are distributed unevenly across the 14 locations: 32.9% of the sequences are located in the nucleus and 26.3% in the cytoplasm, while fewer than 1% are located at the synapse. The dataset is therefore unbalanced: the 3681 positive cases account for only 8.47% of all 3106 × 14 = 43,484 sequence-location cases.
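
The percentages quoted in this legend can be rechecked directly from the table. The Python sketch below is only an illustrative calculation, not part of the original work; the variable names (`counts`, `n_sequences`, `share`, `positive_fraction`) are assumptions introduced here, and the counts are copied from the table above.

```python
# Per-location counts copied from Table 1 (dataset D3106).
counts = {
    "Nucleus": 1021,
    "Cytoplasm": 817,
    "Extracellular": 385,
    "Mitochondrion": 364,
    "Plasma membrane": 354,
    "Endoplasmic reticulum": 229,
    "Golgi apparatus": 161,
    "Cytoskeleton": 79,
    "Centriole": 77,
    "Lysosome": 77,
    "Peroxisome": 47,
    "Endosome": 24,
    "Microsome": 24,
    "Synapse": 22,
}

n_sequences = 3106           # number of distinct protein sequences
n_locations = len(counts)    # 14 subcellular locations

# Sum of location annotations; exceeds n_sequences because
# a sequence may be annotated with more than one location.
total_labels = sum(counts.values())

# Percentage of sequences annotated with each location.
share = {loc: 100.0 * n / n_sequences for loc, n in counts.items()}

# Fraction of positive cases among all (sequence, location) pairs.
positive_fraction = total_labels / (n_sequences * n_locations)

# Average number of locations per sequence.
labels_per_sequence = total_labels / n_sequences

for loc, pct in share.items():
    print(f"{loc:<22} {counts[loc]:>5} {pct:6.2f}%")
print(f"Total annotations: {total_labels}")
print(f"Positive cases: {100.0 * positive_fraction:.2f}% of all cases")
print(f"Locations per sequence: {labels_per_sequence:.3f}")
```

Because the per-location percentages are taken relative to the 3106 sequences rather than the 3681 annotations, they sum to more than 100%; this is a consequence of the multi-location sequences.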