Data in Brief. 2016 May 21;8:105–107. doi: 10.1016/j.dib.2016.05.024

Benchmark data for identifying multi-functional types of membrane proteins

Shibiao Wan, Man-Wai Mak, Sun-Yuan Kung
PMCID: PMC4889873  PMID: 27294176

Abstract

Identifying membrane proteins and their multi-functional types is an indispensable yet challenging topic in proteomics and bioinformatics. In this article, we provide the data used for training and testing Mem-ADSVM (Wan et al., 2016. “Mem-ADSVM: a two-layer multi-label predictor for identifying multi-functional types of membrane proteins” [1]), a two-layer multi-label predictor of the multi-functional types of membrane proteins.


Specifications Table

Subject area: Biology
More specific subject area: Bioinformatics/Computational Biology
Type of data: Text
How data was acquired: Processed datasets obtained by searching the UniProtKB/Swiss-Prot database with a series of stringent criteria
Data format: Analyzed
Experimental factors: Proteins were manually annotated and were extracted from UniProtKB.
Experimental features: For each protein sequence, its associated gene ontology (GO) information was retrieved by searching a compact GO-term database [2], [3], [4] with the accession number of its homolog.
Data source location: Hong Kong SAR, China
Data accessibility: The dataset is available with this article and at http://bioinfo.eie.polyu.edu.hk/MemADSVMServer/datasets.html

Value of the data

  • Knowing the functional types of membrane proteins can help to elucidate their biological functions.

  • This article presents the first comprehensive dataset that contains non-membrane proteins, single-functional-type membrane proteins and multi-functional-type membrane proteins.

  • The dataset presented here can be used as an important benchmark dataset to evaluate the performance of membrane-protein predictors.

1. Data

Using benchmark datasets to evaluate the performance of predictors is of great significance in various domains of bioinformatics [5], [6], [7], [8], [9], [10], such as membrane protein type prediction [11]. However, existing benchmark datasets for predicting membrane proteins are either incomplete or non-stringent. This data article describes a stringent and comprehensive benchmark dataset that comprises non-membrane proteins, single-functional-type membrane proteins and multi-functional-type membrane proteins. All of the benchmark datasets (Dataset II(C) together with Dataset I, Dataset II(A) and Dataset II(B)) are accessible at http://bioinfo.eie.polyu.edu.hk/MemADSVMServer/datasets.html.

2. Experimental design, materials and methods

The dataset described here, which we name ‘Dataset II(C)’, is a benchmark for evaluating Mem-ADSVM [1], a web server that identifies membrane proteins and their multi-functional types. Dataset II(C) was built from two previous datasets [5], [8], which we name Dataset I [5] and Dataset II(A) [8]. First, we retrieved all of the 7965 non-membrane proteins in Dataset I. Dataset I was created as follows: (1) select proteins from the UniProtKB/Swiss-Prot database; (2) exclude protein sequences annotated with “fragment”; (3) exclude protein sequences with fewer than 50 amino acid residues; (4) remove protein sequences annotated with ambiguous words such as “by similarity”, “potential” and “probable”; (5) remove sequences annotated with “membrane protein”; (6) use BLASTCLUST [12] to reduce the sequence similarity to no more than 80%.

Dataset II(A) was obtained by a similar procedure, except that membrane proteins were collected rather than excluded and the sequence identity was reduced to 25% instead of 80%. Because the sequence identity of Dataset I (80%) was much higher than that of Dataset II(A) (25%), we used BLASTCLUST to reduce the sequence similarity of these non-membrane proteins to 25%, which left 2009 non-membrane proteins.

We then combined these 2009 non-membrane proteins with Dataset II(A) (5307 membrane proteins) to form Dataset II(C), a total of 7316 proteins, of which 7126 belong to one type, 185 to two types and 5 to three types. Specifically, the distribution of Dataset II(C) is as follows: (1) 626 single-pass type I, (2) 299 single-pass type II, (3) 42 single-pass type III, (4) 73 single-pass type IV, (5) 2437 multi-pass, (6) 403 lipid-anchor, (7) 172 GPI-anchor, (8) 1450 peripheral and (9) 2009 non-membrane.
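The counts reported above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the published numbers, not part of the authors' pipeline; the variable names are illustrative. Because a protein with two or three types is counted once under each of its types, the per-type counts should sum to the total number of type labels rather than to the number of proteins.

```python
# Per-type counts for Dataset II(C), copied from the text.
type_counts = {
    "single-pass type I": 626,
    "single-pass type II": 299,
    "single-pass type III": 42,
    "single-pass type IV": 73,
    "multi-pass": 2437,
    "lipid-anchor": 403,
    "GPI-anchor": 172,
    "peripheral": 1450,
    "non-membrane": 2009,
}

# Number of proteins that carry exactly k type labels, also from the text.
proteins_by_label_count = {1: 7126, 2: 185, 3: 5}

membrane_proteins = 5307                      # Dataset II(A)
non_membrane_proteins = type_counts["non-membrane"]

# Each protein is counted once here.
total_proteins = sum(proteins_by_label_count.values())

# Each protein is counted once per type it belongs to.
total_labels = sum(k * n for k, n in proteins_by_label_count.items())

# The protein total matches membrane + non-membrane proteins.
assert total_proteins == membrane_proteins + non_membrane_proteins

# The per-type counts sum to the number of labels, not the number of proteins.
assert total_labels == sum(type_counts.values())

print(total_proteins, total_labels)  # 7316 7511
```

Both assertions hold, so the per-type distribution (7511 labels) is consistent with the 7316 proteins once multi-type proteins are counted under each of their types.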

Acknowledgments

This work was in part supported by the RGC of Hong Kong SAR Grant nos. PolyU152117/14E and PolyU152068/15E.

Footnotes

Appendix A

Transparency document associated with this article can be found in the online version at doi:10.1016/j.dib.2016.05.024.

Appendix B

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.dib.2016.05.024.

Contributor Information

Shibiao Wan, Email: shibiao.wan@connect.polyu.hk.

Man-Wai Mak, Email: enmwmak@polyu.edu.hk.

Sun-Yuan Kung, Email: kung@princeton.edu.

Appendix A. Transparency document

Supplementary material

mmc1.docx (12.5KB, docx)

Appendix B. Supplementary material

Supplementary material

mmc2.zip (2.3MB, zip)

References

  • 1. Wan S., Mak M.W., Kung S.Y. Mem-ADSVM: a two-layer multi-label predictor for identifying multi-functional types of membrane proteins. J. Theor. Biol. 2016;398:32–42. doi: 10.1016/j.jtbi.2016.03.013.
  • 2. Wan S., Mak M.W. Machine Learning for Protein Subcellular Localization Prediction. De Gruyter; Germany: 2015. p. 192.
  • 3. Wan S., Mak M.W., Kung S.Y. mLASSO-Hum: a LASSO-based interpretable human-protein subcellular localization predictor. J. Theor. Biol. 2015;382:223–234. doi: 10.1016/j.jtbi.2015.06.042.
  • 4. Wan S., Mak M.W., Kung S.Y. R3P-Loc: a compact multi-label predictor using ridge regression and random projection for protein subcellular localization. J. Theor. Biol. 2014;360:34–45. doi: 10.1016/j.jtbi.2014.06.031.
  • 5. Huang C., Yuan J.Q. A multilabel model based on Chou's pseudo-amino acid composition for identifying membrane proteins with both single and multiple functional types. J. Membr. Biol. 2013;246(4):327–334. doi: 10.1007/s00232-013-9536-9.
  • 6. Wan S., Mak M.W., Kung S.Y. mPLR-Loc: an adaptive decision multi-label classifier based on penalized logistic regression for protein subcellular localization prediction. Anal. Biochem. 2015;473:14–27. doi: 10.1016/j.ab.2014.10.014.
  • 7. Wan S., Mak M.W., Kung S.Y. HybridGO-Loc: mining hybrid features on gene ontology for predicting subcellular localization of multi-location proteins. PLoS One. 2014;9(3). doi: 10.1371/journal.pone.0089545.
  • 8. Xiao X., Zou H.L., Lin W.Z. iMem-Seq: a multi-label learning classifier for predicting membrane proteins types. J. Membr. Biol. 2015;248(4):745–752. doi: 10.1007/s00232-015-9787-8.
  • 9. Wan S., Mak M.W., Kung S.Y. mGOASVM: multi-label protein subcellular localization based on gene ontology and support vector machines. BMC Bioinform. 2012;13. doi: 10.1186/1471-2105-13-290.
  • 10. Wan S., Mak M.W., Kung S.Y. GOASVM: a subcellular location predictor by incorporating term-frequency gene ontology into the general form of Chou's pseudo-amino acid composition. J. Theor. Biol. 2013;323:40–48. doi: 10.1016/j.jtbi.2013.01.012.
  • 11. Wan S., Mak M.W., Kung S.Y. Mem-mEN: predicting multi-functional types of membrane proteins by interpretable elastic nets. IEEE/ACM Trans. Comput. Biol. Bioinform. 2015. doi: 10.1109/TCBB.2015.2474407.
  • 12. http://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html


Articles from Data in Brief are provided here courtesy of Elsevier
