2021 Oct 22;22(Suppl 10):515. doi: 10.1186/s12859-021-04404-0

Table 1.

Number of protein sequences in benchmark dataset D3106 created by Shen et al.

Subcellular location     Number of protein sequences
Nucleus                  1021
Cytoplasm                817
Extracellular            385
Mitochondrion            364
Plasma membrane          354
Endoplasmic reticulum    229
Golgi apparatus          161
Cytoskeleton             79
Centriole                77
Lysosome                 77
Peroxisome               47
Endosome                 24
Microsome                24
Synapse                  22
Total                    3681

The dataset D3106 covers 14 subcellular locations, which are listed in the first column of this table. The number of proteins found at each location is listed in the second column. The dataset contains 3106 protein sequences, but the location counts sum to 3681 because some sequences are found in more than one location. The sequences are distributed unevenly across the 14 locations: 32.9% of the sequences are located in the nucleus and 26.3% in the cytoplasm, while fewer than 1% are located in the synapse. The dataset is therefore unbalanced: the 3681 positive cases account for only 8.47% of all 3106 × 14 = 43,484 sequence-location cases.
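
The percentages quoted in this note can be rechecked directly from the table. The following is a minimal sketch in Python, not code from the original work; the names `LOCATION_COUNTS`, `N_SEQUENCES`, and `label_statistics` are hypothetical and introduced here only for illustration. It assumes that each percentage is taken relative to the 3106 sequences and that the positive rate is taken relative to the 3106 × 14 sequence-location cases.

```python
# Per-location counts copied from Table 1 (dataset D3106).
LOCATION_COUNTS = {
    "Nucleus": 1021,
    "Cytoplasm": 817,
    "Extracellular": 385,
    "Mitochondrion": 364,
    "Plasma membrane": 354,
    "Endoplasmic reticulum": 229,
    "Golgi apparatus": 161,
    "Cytoskeleton": 79,
    "Centriole": 77,
    "Lysosome": 77,
    "Peroxisome": 47,
    "Endosome": 24,
    "Microsome": 24,
    "Synapse": 22,
}

# Number of distinct protein sequences, as stated in the table note.
N_SEQUENCES = 3106


def label_statistics(counts, n_sequences):
    """Return (total labels, per-location fraction of sequences, positive rate).

    A sequence with several locations contributes one label per location,
    so the total can exceed the number of sequences.
    """
    total_labels = sum(counts.values())
    fractions = {loc: n / n_sequences for loc, n in counts.items()}
    # Share of positive cells in the (sequences x locations) label matrix.
    positive_rate = total_labels / (n_sequences * len(counts))
    return total_labels, fractions, positive_rate


total_labels, fractions, positive_rate = label_statistics(LOCATION_COUNTS, N_SEQUENCES)

print(f"Total labels: {total_labels}")
for loc in ("Nucleus", "Cytoplasm", "Synapse"):
    print(f"{loc}: {fractions[loc]:.1%} of sequences")
print(f"Positive cases: {positive_rate:.2%} of {N_SEQUENCES * len(LOCATION_COUNTS)} cases")
```

Because a sequence can carry several location labels, the per-location fractions do not sum to 1; their sum equals the average number of locations per sequence.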