Frontiers of Computer Science. 2022 Oct 26;17(3):173902. doi: 10.1007/s11704-022-2011-y

AE-TPGG: a novel autoencoder-based approach for single-cell RNA-seq data imputation and dimensionality reduction

Shuchang Zhao 1,2, Li Zhang 1,3, Xuejun Liu 1,2
PMCID: PMC9607720  PMID: 36320820

Abstract

Single-cell RNA sequencing (scRNA-seq) has become an effective tool for high-throughput transcriptomic studies. It circumvents the averaging artifacts of bulk RNA-seq and yields new perspectives on the cellular diversity of superficially homogeneous populations. Although various sequencing techniques have reduced the amplification bias and the low capture efficiency caused by the small amount of starting material, technical noise and biological variation are inevitably introduced during the experimental process, resulting in frequent dropout events that greatly hinder downstream analysis. Considering the bimodal expression pattern and the right-skewed distribution of normalized scRNA-seq data, we propose a customized autoencoder based on a two-part generalized gamma distribution (AE-TPGG) for scRNA-seq data analysis. The model accounts for the mixed discrete-continuous nature of scRNA-seq data with a two-part model and uses the generalized gamma (GG) distribution to fit the positive, right-skewed continuous part. The autoencoder enables AE-TPGG to capture the inherent relationships between genes. In addition to producing a low-dimensional representation, AE-TPGG provides a denoised imputation based on the statistical characteristics of gene expression. Results on real datasets demonstrate that the proposed model is competitive with current imputation methods and improves a diverse set of typical scRNA-seq data analyses.
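To make the two-part idea concrete, the following is a minimal sketch of a per-observation log-likelihood that combines a point mass at zero with a generalized gamma density for positive values. It is illustrative only. The abstract does not give the parameterization used in AE-TPGG, so the Stacy form f(x) = (p / a^d) x^(d-1) exp(-(x/a)^p) / Γ(d/p) is assumed here, and the names `gg_logpdf`, `tpgg_loglik`, `tpgg_mean`, `pi_zero`, `a`, `d`, and `p` are hypothetical. In an autoencoder setting, such parameters would presumably be produced by the decoder for each gene and cell; that wiring is not shown.

```python
import math


def gg_logpdf(x, a, d, p):
    """Log-density of the generalized gamma (assumed Stacy form) at x > 0.

    a: scale, d and p: shape parameters (all positive).
    """
    if x <= 0:
        raise ValueError("x must be positive")
    return (math.log(p) - d * math.log(a) + (d - 1) * math.log(x)
            - (x / a) ** p - math.lgamma(d / p))


def tpgg_loglik(x, pi_zero, a, d, p):
    """Log-likelihood of one value under a two-part model.

    pi_zero is the probability of an exact zero; positive values
    follow the generalized gamma density with weight (1 - pi_zero).
    """
    if x == 0:
        return math.log(pi_zero)
    return math.log1p(-pi_zero) + gg_logpdf(x, a, d, p)


def tpgg_mean(pi_zero, a, d, p):
    """Mean of the two-part model: (1 - pi_zero) * a * Γ((d+1)/p) / Γ(d/p)."""
    log_ratio = math.lgamma((d + 1) / p) - math.lgamma(d / p)
    return (1 - pi_zero) * a * math.exp(log_ratio)
```

With `d = p = 1` the positive part reduces to an exponential distribution with scale `a`, which gives a quick sanity check. The model mean is shown only as one plausible summary of the fitted distribution; whether AE-TPGG uses it as the imputed value is not stated in the abstract.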

Electronic Supplementary Material

Supplementary material is available in the online version of this article at 10.1007/s11704-022-2011-y.

Keywords: scRNA-seq, autoencoder, TPGG, data imputation, dimensionality reduction

Electronic Supplementary Material

11704_2022_2011_MOESM1_ESM.pdf (476.4KB, pdf)

AE-TPGG: A Novel Autoencoder-Based Approach for Single-cell RNA-seq Data Imputation and Dimensionality Reduction

Acknowledgements

This research was supported by the National Natural Science Foundation of China (Grant Nos. 62136004, 61802193), the National Key R&D Program of China (2018YFC2001600, 2018YFC2001602), the Natural Science Foundation of Jiangsu Province (BK20170934), and the Fundamental Research Funds for the Central Universities (NJ2020023). We thank the researchers who openly shared their code and research resources, and the anonymous reviewers.

Footnotes

Shuchang Zhao received the BSc degree from Suzhou University, China, in 2013 and the MSc degree from Anhui University of Science and Technology, China, in 2016. He is currently working toward the PhD degree in the PARNEC group of the College of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics (NUAA), China. His research interests include machine learning and bioinformatics.

Li Zhang received the BSc degree from Changsha University of Science and Technology, China, in 2007, and the MSc and PhD degrees from Nanjing University of Aeronautics and Astronautics (NUAA), China, in 2010 and 2015, respectively. In 2016, he joined the College of Computer Science and Technology, Nanjing Forestry University, China, as a lecturer. His current research interests include machine learning and bioinformatics.

Xuejun Liu received the BSc and MSc degrees in computer science from the Nanjing University of Aeronautics and Astronautics (NUAA), China in 1999 and 2002, respectively, and the PhD degree in computer science from the University of Manchester, UK in 2006. Currently, she is a professor in the PARNEC group of the College of Computer Science and Technology at NUAA, China. Her research interests include machine learning and its practical applications, including bioinformatics.

Articles from Frontiers of Computer Science are provided here courtesy of Nature Publishing Group