Patterns. 2022 Mar 11;3(3):100465. doi: 10.1016/j.patter.2022.100465

Data-driven assessment of dimension reduction quality for single-cell omics data

Xiaoru Dong 1, Rhonda Bacher 1
PMCID: PMC9058902  PMID: 35510193

Abstract

Dimension reduction (DR) techniques have become synonymous with single-cell omics data due to their ability to generate attractive visualizations and enable analyses of high-dimensional data. In this issue of Patterns, Johnson et al. develop a statistical approach to assist in selecting high-quality reduced representations to improve analyses and biological interpretations.


Main text

Single-cell RNA sequencing (scRNA-seq) experiments have revolutionized the field of genomics by capturing gene-expression data at the level of individual cells, allowing researchers to uncover the biological properties of individual cells in complex tissues. To enable such insights, these high-dimensional sequencing data require a combination of specific preprocessing steps, including quality control, normalization, dimensionality reduction (DR), and clustering.1 Following these steps, one can identify rare cell types, discover trajectories representing biological processes, and identify differentially expressed genes across conditions for particular cell types. However, it has been demonstrated that these analyses and their biological conclusions are influenced by the approaches used for preprocessing.2 Notably, DR has been a focus in single-cell analysis given its ubiquity across visualization and computational methods.

In general, DR involves projecting high-dimensional data into a lower-dimensional space in order to reduce noise in the data while retaining key features. The traditional and most familiar form of DR is principal component analysis (PCA), which applies a linear transformation and preserves Euclidean distances between data points as far as the retained components allow. More recent nonlinear approaches, such as t-distributed stochastic neighbor embedding (t-SNE)3 and uniform manifold approximation and projection (UMAP),4 have become popular for single-cell data and are highly regarded for their ability to produce appealing visualizations of cell clusters. This is because they aim to preserve the local structure of the data while typically ignoring or placing less emphasis on the global structure of the data, i.e., the distances between distant cells. The nonlinear algorithms are also stochastic and heavily dependent on hyperparameters chosen by users.5 Besides visualization, DR is required for the majority of downstream analyses. Thus, choosing an appropriate DR method, one that retains the structure of the original data and imposes the least distortion of biological signals, is a priority.
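To make "preserving local structure" concrete, the following is a minimal sketch of one common way to quantify it: the fraction of each cell's k nearest neighbors in the original space that remain among its k nearest neighbors in the embedding. This is an illustration only and is not the statistic used by EMBEDR; the function names `nearest_neighbors` and `knn_preservation` and the toy data are hypothetical.

```python
import math

def nearest_neighbors(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Rank every other point by Euclidean distance to point i.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors

def knn_preservation(original, embedding, k):
    """Per-cell fraction of k nearest neighbors shared by the two spaces."""
    nn_high = nearest_neighbors(original, k)
    nn_low = nearest_neighbors(embedding, k)
    return [len(a & b) / k for a, b in zip(nn_high, nn_low)]

# Toy data: four "cells" with three features that vary only along feature 0,
# and a 1-D embedding that keeps feature 0 (so all distances are preserved).
cells = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
embedding = [(c[0],) for c in cells]
scores = knn_preservation(cells, embedding, k=1)
```

A score of 1.0 for a cell means its local neighborhood is unchanged by the embedding; values near 0 indicate that the embedding has placed the cell among different neighbors.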

The concern around DR approaches used on single-cell data has largely led to the development of novel DR methods or heuristic guidelines based on benchmarking studies.6,7 However, choosing an optimal DR method for a given dataset and analysis remains an open question. In this issue of Patterns, Johnson et al.8 tackle this problem by developing a quantitative quality assessment scheme: empirical marginal resampling better evaluates dimensionality reduction (EMBEDR). EMBEDR distinguishes structures in the reduced-dimension embedding that are consistent with those in the original high-dimensional data from those attributable to noise, allowing users to determine which DR representation captures the structure of the original data most accurately.

The key to EMBEDR’s evaluation is the introduction of a quality statistic termed the empirical embedding statistic, which compares cell-to-cell distance distributions between the original data and its reduced-dimension embedding.8 The quality statistic is generated for each DR method and compared to the distribution of quality statistics calculated on null datasets generated via marginal resampling. An empirical hypothesis test is performed comparing each cell’s quality statistic to the null quality distribution, with p values calculated as the probability that the observed data yield a lower-quality embedding than the null datasets. A small p value for a cell indicates that the cell’s structure in the embedding is close to its structure in the original high-dimensional data and is unlikely to be attributable to noise.
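The resampling and testing steps described above can be sketched in a few lines. This is a hedged illustration and not EMBEDR's actual code or API: the function names are hypothetical, the null data are assumed to be drawn with replacement from each feature's own values, and the quality statistic is assumed to be divergence-like so that lower values mean better quality.

```python
import random

def marginal_resample(data, rng):
    """Build one null dataset by resampling each feature (column) independently.

    Assumption: values are drawn with replacement from each column, which keeps
    every feature's marginal distribution but breaks correlations between features.
    """
    n_cells = len(data)
    columns = list(zip(*data))
    null_columns = [rng.choices(col, k=n_cells) for col in columns]
    # Transpose back to a list of cells (rows).
    return [list(row) for row in zip(*null_columns)]

def empirical_p_value(observed, null_stats):
    """Fraction of null statistics that are at least as good as the observed one.

    Assumption: the statistic is divergence-like, so lower means better quality.
    """
    return sum(s <= observed for s in null_stats) / len(null_stats)

# Toy example: four cells, two features (values are made up for illustration).
rng = random.Random(0)
data = [[1, 10], [2, 20], [3, 30], [4, 40]]
null_data = marginal_resample(data, rng)

# One of four null statistics is as good as the observed value, so p = 0.25.
p = empirical_p_value(0.1, [0.5, 0.6, 0.7, 0.05])
```

In a full analysis, each null dataset would be embedded with the same DR method and hyperparameters as the real data, and the null statistics passed to `empirical_p_value` would come from those null embeddings.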

EMBEDR is implemented in Python and provides users multiple evaluations of the DR approaches. For example, visualizing the cell-specific p values provides users a measure of where signals are best preserved in a given embedding and most likely to reflect biological signal. EMBEDR can also be used to select the optimal hyperparameters for a given approach and compare embeddings across DR methods. EMBEDR also allows users to explore the locally optimal embedding for each cell type. Johnson et al. emphasize that the globally optimal embedding does not necessarily mean that the quality in each local cell type is ideal and that performing local optimization may facilitate identification of rare cell types.8
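The distinction between a globally optimal and a locally optimal embedding can be illustrated with a small sketch. It assumes per-cell p values have already been computed for several hyperparameter settings; the settings, the p values, the function names, and the use of the median as a global summary are all hypothetical choices made for illustration, not EMBEDR's procedure.

```python
from statistics import median

def best_global(p_values_by_setting):
    """Setting with the lowest median per-cell p value (an assumed summary)."""
    return min(p_values_by_setting, key=lambda s: median(p_values_by_setting[s]))

def best_per_cell(p_values_by_setting):
    """For each cell, the setting that gives that cell its lowest p value."""
    settings = list(p_values_by_setting)
    n_cells = len(p_values_by_setting[settings[0]])
    return [
        min(settings, key=lambda s: p_values_by_setting[s][i])
        for i in range(n_cells)
    ]

# Hypothetical per-cell p values for three perplexity settings (made-up numbers).
p_values = {
    10: [0.30, 0.20, 0.25, 0.01],
    30: [0.05, 0.04, 0.06, 0.40],
    100: [0.10, 0.12, 0.09, 0.50],
}
global_choice = best_global(p_values)   # 30
local_choice = best_per_cell(p_values)  # [30, 30, 30, 10]
```

Here the setting that is best overall (30) is a poor choice for the fourth cell, which is best embedded at a setting of 10. This mirrors the point that a single global choice can hide poorly embedded cells, such as those from a rare cell type.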

As interest in scRNA-seq technologies grows, datasets are increasing in size and complexity. DR will continue to be a key step for visualizing and analyzing single-cell RNA data, and identifying an optimal DR method remains a high priority. Johnson et al. demonstrate EMBEDR’s ability to assist users in objectively selecting the most appropriate DR method by quantitatively measuring each cell’s quality in embeddings. Given the increasing number of methodologies being developed for single-cell analyses, we anticipate a greater focus on data-driven method selection9 and comprehensive evaluation frameworks10 in the coming years.

Acknowledgments

Declaration of interests

The authors declare no competing interests.

References

  • 1. Bacher R., Kendziorski C. Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biol. 2016;17:63. doi: 10.1186/s13059-016-0927-y.
  • 2. Wang C., Gao X., Liu J. Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data. BMC Bioinformatics. 2020;21:440. doi: 10.1186/s12859-020-03797-8.
  • 3. van der Maaten L., Hinton G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008;9:2579–2605.
  • 4. McInnes L., Healy J., Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv. 2020:1802.03426.
  • 5. Kobak D., Berens P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 2019;10:5416. doi: 10.1038/s41467-019-13056-x.
  • 6. Sun S., Zhu J., Ma Y., Zhou X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 2019;20:269. doi: 10.1186/s13059-019-1898-6.
  • 7. Raimundo F., Vallot C., Vert J.-P. Tuning parameters of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 2020;21:212. doi: 10.1186/s13059-020-02128-7.
  • 8. Johnson E., Kath W., Mani M. EMBEDR: Distinguishing Signal from Noise in Single-Cell Omics Data. Patterns. 2022.
  • 9. Heiser C.N., Lau K.S. A Quantitative Framework for Evaluating Single-Cell Data Structure Preservation by Dimensionality Reduction Techniques. Cell Rep. 2020;31:107576. doi: 10.1016/j.celrep.2020.107576.
  • 10. Germain P.-L., Sonrel A., Robinson M.D. pipeComp, a general framework for the evaluation of computational pipelines, reveals performant single cell RNA-seq preprocessing tools. Genome Biol. 2020;21:227. doi: 10.1186/s13059-020-02136-7.

Articles from Patterns are provided here courtesy of Elsevier
