2022 Jun 6;40(11):1644–1653. doi: 10.1038/s41587-022-01341-y

Extended Data Fig. 1. Challenges for identification of trait/phenotype-relevant cells using colocalization-based approaches and the network-based solution.


a, Kernel density plots show that high sparsity is common across curated scATAC-seq datasets. Sparsity is characterized using the peak-by-cell matrices of five different datasets. The sparsity of a peak is defined as the proportion of cells that show no signal (zero-valued) for that peak (left), and the sparsity of a cell is defined as the proportion of peaks that show no signal for that cell (right). The 10X PBMC scATAC-seq dataset is used in the subsequent SCAVENGE analysis. b, To identify the causal cell type or state relevant to a genetic trait, the most commonly used strategy is colocalization of risk variants with epigenetic signals in regulatory elements (peaks). However, this approach is uninformative for the majority of cells when applied to scATAC-seq profiles. Because scATAC-seq data are noisy and sparse, absence of signal is extensive across cells and regulatory peaks, and it cannot be determined whether a given absence has a technical or a biological cause. Therefore, only a few cells demonstrate reliable phenotypic relevance. c, Global high-dimensional features of individual cells are sufficient to represent the underlying cell identities or states, which allows the relationships among cells to be readily inferred. We reason that the truly relevant cell populations can be revealed and recovered by building a search engine over cell-to-cell networks that enables the discovery of similar cells with the same phenotype. d, UMAP plots display M-kNN graph construction from the latent space for the 10X PBMC scATAC-seq profiles.
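
The two sparsity definitions in panel a and the graph construction in panel d can be illustrated with a minimal sketch. It is not the SCAVENGE implementation: it reads "M-kNN" as a mutual k-nearest-neighbour graph, assumes Euclidean distance in the latent space, and uses hypothetical function names (`peak_sparsity`, `cell_sparsity`, `mutual_knn_edges`) and toy data chosen only for illustration.

```python
from math import dist

def peak_sparsity(matrix):
    """Per peak: fraction of cells with zero signal (rows = peaks, columns = cells)."""
    n_cells = len(matrix[0])
    return [sum(1 for v in row if v == 0) / n_cells for row in matrix]

def cell_sparsity(matrix):
    """Per cell: fraction of peaks with zero signal."""
    n_peaks = len(matrix)
    n_cells = len(matrix[0])
    return [sum(1 for row in matrix if row[j] == 0) / n_peaks
            for j in range(n_cells)]

def mutual_knn_edges(embedding, k):
    """Edges (i, j) with i < j where each cell is among the other's k nearest cells.

    Assumption: Euclidean distance on latent coordinates; the metric and k
    used in the original analysis are not stated in this caption.
    """
    n = len(embedding)
    neighbours = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(embedding[i], embedding[j]))
        neighbours.append(set(ranked[:k]))
    return sorted((i, j)
                  for i in range(n)
                  for j in neighbours[i]
                  if i < j and i in neighbours[j])

# Toy peak-by-cell count matrix: 3 peaks x 4 cells (made-up values).
toy_matrix = [
    [1, 0, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 0, 0],
]

# Toy 2-D latent coordinates for 5 cells (made-up values).
toy_embedding = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (2.0, 2.0)]

peak_sp = peak_sparsity(toy_matrix)
cell_sp = cell_sparsity(toy_matrix)
edges = mutual_knn_edges(toy_embedding, k=1)
```

In the toy embedding, the fifth cell has the second cell as its nearest neighbour, but that choice is not reciprocated, so a mutual kNN graph leaves it unconnected at k = 1. This reciprocity requirement is what separates a mutual kNN graph from a plain kNN graph, where the edge would be kept.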