SCRaPL: A Bayesian hierarchical framework for detecting technical associates in single cell multiomics data

Christos Maniatis; Catalina A Vallejos; Guido Sanguinetti

doi:10.1371/journal.pcbi.1010163

PLoS Comput Biol. 2022 Jun 21;18(6):e1010163. doi: 10.1371/journal.pcbi.1010163

Editor: Jingyi Jessica Li

PMCID: PMC9249169 PMID: 35727848

Abstract

Single-cell multi-omics assays offer unprecedented opportunities to explore epigenetic regulation at cellular level. However, high levels of technical noise and data sparsity frequently lead to a lack of statistical power in correlative analyses, identifying very few, if any, significant associations between different molecular layers. Here we propose SCRaPL, a novel computational tool that increases power by carefully modelling noise in the experimental systems. We show on real and simulated multi-omics single-cell data sets that SCRaPL achieves higher sensitivity and better robustness in identifying correlations, while maintaining a similar level of false positives as standard analyses based on Pearson and Spearman correlation.

Author summary

Single-cell multi-omics assays offer unprecedented opportunities to explore epigenetic regulation at cellular level. However, high levels of noise frequently hide genomic regions with strong epigenetic regulation or produce misleading results. By carefully addressing this common problem, SCRaPL aims to become a useful tool in the hands of practitioners seeking to understand the role of particular genomic regions in the epigenetic landscape. Using different single cell multi-omics datasets, we have demonstrated that SCRaPL can increase detection rates up to five times compared to standard practices. This can improve the performance of tools used for post-experimental analysis but, more importantly, it can highlight currently unknown genomic regions that are worth investigating further.

This is a PLOS Computational Biology Methods paper.

Introduction

High throughput single cell assays based on next generation sequencing are revolutionising our understanding of biology, with profound implications both fundamental and translational [1]. Single cell technologies avoid the confounding factors emerging from averaging over potentially heterogeneous cell populations [2], providing a global map of biological cell-to-cell variability at the molecular level [3].

While single-cell transcriptomic technologies are rapidly reaching maturity, more recent platforms have emerged that enable simultaneous large scale measurements of multiple molecular layers within the same cell. Multi-omics assays can now capture DNA methylation and gene expression [4, 5], gene expression and copy-number variations [5], DNA accessibility and gene expression [6, 7], and chromatin accessibility along with DNA methylation and gene expression [8] for the same cell. Such platforms have enormous potential to elucidate the mechanisms of epigenetic regulation in unprecedented detail.

Despite the huge potential for breakthroughs, technical limitations in multi-omics technologies create formidable statistical challenges in the interpretation of their results. Single-cell sequencing technologies are notoriously affected by high noise levels, including very strong data sparsity. Such problems are amplified in multi-omics studies, where multiple independent sources of noise might affect the joint distribution of the measurements. Additionally, challenges with normalization strategies, batch effects or other latent variables related to cellular processes might further prevent biological components from emerging clearly from the data [9]. As a result, direct adoption of classical statistical tools to assess associations between different molecular layers (e.g. Pearson or Spearman correlation) routinely leads to underpowered analyses, which are only able to identify a handful of significant associations [4, 8, 10].

In this paper, we argue that proper treatment of noise is essential in order to robustly retrieve significant statistical associations. To do so, we introduce SCRaPL (Single Cell Regulatory Pattern Learning), a Bayesian hierarchical model to infer associations between different omics components. The Bayesian hierarchical framework, which has already been extensively used in single-omics single-cell analyses (e.g. [11, 12]), explicitly and transparently decomposes noise in the data, enabling efficient extraction of biological signals from technical noise. We demonstrate on both synthetic and real data sets that SCRaPL is both highly accurate and sensitive, identifying much larger numbers of statistically significant associations than standard correlation analyses while retaining a good control on false positives.

Results

SCRaPL: Single Cell Regulatory Pattern Learning

SCRaPL is a tool for exploratory analysis of high-throughput, single cell data, which aims to establish more robust associations between different molecular layers. The example we will focus on here is relating the expression of a specific gene with its epigenetic state, measured either by DNA methylation or chromatin accessibility. However, different types of associations might be considered by introducing alternative noise models within our framework.

Our starting point is the observation that correlations, while invaluable tools to generate hypotheses, are critically sensitive to noise. In particular, it is well known that adding uncorrelated noise to correlated random variables reduces the estimates of correlation, thus weakening statistical power in any analysis. To obviate this problem, SCRaPL introduces a hierarchical model, schematically described in Fig 1.
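The attenuation effect described above can be checked numerically. The following is a minimal sketch and not part of SCRaPL: it draws correlated, unit-variance Gaussian pairs, adds independent Gaussian noise to each coordinate, and compares the empirical Pearson correlation before and after. The function name `attenuation_demo`, the noise level and the sample size are illustrative assumptions. For independent additive noise with variances τ_1² and τ_2², the correlation is expected to shrink from ρ to ρσ_1σ_2/√((σ_1² + τ_1²)(σ_2² + τ_2²)), which for unit variances and a common noise level τ reduces to ρ/(1 + τ²).

```python
import numpy as np


def attenuation_demo(rho=0.7, noise_sd=1.0, n=20000, seed=0):
    """Return (clean_corr, noisy_corr, predicted_noisy_corr) for unit-variance signals."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    # Correlated "true" signals, one row per simulated cell.
    clean = rng.multivariate_normal(np.zeros(2), cov, size=n)
    # Independent measurement noise added to each coordinate (assumed Gaussian).
    noisy = clean + rng.normal(0.0, noise_sd, size=clean.shape)
    clean_corr = np.corrcoef(clean[:, 0], clean[:, 1])[0, 1]
    noisy_corr = np.corrcoef(noisy[:, 0], noisy[:, 1])[0, 1]
    # Theoretical attenuation for unit signal variance and equal noise variance.
    predicted = rho / (1.0 + noise_sd**2)
    return clean_corr, noisy_corr, predicted
```

With equal signal and noise variances, the observed correlation is roughly halved, which is the kind of loss in effect size that a noise-aware model tries to recover.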

Fig 1 — Here, we assume observed data consists of RNA expression and DNA methylation. 1A Schematic representation of the SCRaPL model. 1B SCRaPL’s graphical model, depicting the statistical dependencies between observed genomic data (Y_ij1 is RNA expression; Y_ij2 is DNA methylation), their associated latent variables (X_ij1, X_ij2) and feature-specific model parameters (μ_j, Σ_j). The additional parameter π_j is specific to the noise model that is assigned to RNA expression data and captures zero inflation. Full details are given in the model description section in *Methods*.

Briefly, for each cell i, the SCRaPL model associates observed values Y_ij = (Y_ij1, Y_ij2)′ for each feature j (e.g., gene/promoter pair) with a bivariate Gaussian vector (denoted as X_ij = (X_ij1, X_ij2)′) with unknown latent mean μ_j and covariance matrix Σ_j. The latter is parameterized such that ρ_j captures the feature-specific underlying correlation across both molecular layers. The latent variables X_ij are then passed through a suitable nonlinear link function to generate the expected value of the observation. The observation noise model, as well as the nonlinear link function, are tailored to the type of assay being analysed (and can also be designed in a data-driven fashion by using model selection techniques). In particular, we use a zero-inflated Poisson noise model for RNA expression and binomial noise models for DNA methylation or chromatin accessibility; full details are given in the model description subsection, Eqs (1)–(4), Methods section. SCRaPL then uses Bayesian inference to reconstruct the latent mean values and correlation ρ_j from independent observations over many cells. A probabilistic decision rule together with Bayesian multiple testing correction methods [Expected False Discovery Rate, EFDR; [13]] can be deployed to quantify association strength and assign statistical significance to the reported correlations.
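To make the decision rule more concrete, the sketch below shows one common way an EFDR-calibrated rule can be implemented from posterior samples. It is an assumption-laden illustration rather than SCRaPL's actual code: it assumes that each feature is summarised by the posterior tail probability p_j = P(|ρ_j| > γ | data), that features with p_j > α are flagged, and that the EFDR is the average of (1 − p_j) over flagged features. The names `tail_probability`, `efdr` and `choose_alpha`, the threshold γ and the search grid are all hypothetical.

```python
import numpy as np


def tail_probability(rho_samples, gamma):
    """Monte Carlo estimate of P(|rho_j| > gamma | data) for each feature.

    rho_samples has shape (n_draws, n_features).
    """
    return np.mean(np.abs(np.asarray(rho_samples)) > gamma, axis=0)


def efdr(post_prob, alpha):
    """Expected FDR when flagging features whose posterior probability exceeds alpha."""
    post_prob = np.asarray(post_prob, dtype=float)
    flagged = post_prob > alpha
    if not flagged.any():
        return 0.0
    # Average posterior probability that a flagged feature is a false discovery.
    return float(np.mean(1.0 - post_prob[flagged]))


def choose_alpha(post_prob, target=0.10, grid=None):
    """Smallest alpha on the grid whose EFDR does not exceed the target."""
    if grid is None:
        grid = np.linspace(0.5, 0.999, 500)
    for alpha in grid:
        if efdr(post_prob, alpha) <= target:
            return float(alpha)
    return None
```

Because raising α removes the least certain features first, the EFDR is non-increasing in α, so the first grid value that meets the target also retains the largest number of flagged features.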

Benchmarking SCRaPL using synthetic data

To assess the estimation performance of SCRaPL, we experimented on synthetic datasets consisting of 300 simulated features (pairs of gene expression and promoter methylation values). The experiments were varied to cover a number of different scenarios: numbers of cells; coverage levels; fraction of zeros in expression data (zero inflation, ZI); as well as different latent mean and covariance structures. A detailed description of the various simulation scenarios is provided in Table 1 and S3 Text.

Table 1. Summary of synthetic data experiments.

In all cases, latent means and standard deviations were set as μ_j1 = 4, μ_j2 = 1, σ_j1 = 3 and σ_j2 = 2. Unless otherwise stated, our simulations were based on: I = 60 cells, J = 300 features, 20% ZI rate on average for the expression data (π_j = 0.20) and an average methylation coverage (n_ij) equal to 275 (sampled from a Uniform distribution with range [50, 500]) across cells and genes. When varying the number of cells, we use I ∈ {5, 10, 25, 50, 100, 200, 400, 800, 1600}. When varying expression ZI, we use π_j ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.8}. When varying methylation coverage, we sample n_ij from Uniform distributions with ranges given by [5, 10], [10, 20], [20, 50], [50, 250] and [500, 1000]. Full details are provided in S3 Text.

Experiment	Description
1	Correlations ρ_j sampled from a Beta(15, 15) distribution, varying number of cells.
2	Correlations ρ_j sampled from a Beta(15, 15) distribution, varying expression ZI.
3	Correlations ρ_j sampled from a Beta(15, 15) distribution, varying methylation coverage.
4	Correlations ρ_j sampled from a U[−0.8, −0.6] distribution, varying number of cells.
5	Correlations ρ_j sampled from a U[−0.8, −0.6] distribution, varying expression ZI.
6	Correlations ρ_j sampled from a U[−0.8, −0.6] distribution, varying methylation coverage.
7	As experiment 1, but latent expression means sampled from scVI.
8	As experiment 2, but latent expression means sampled from scVI.
9	As experiment 3, but latent expression means sampled from scVI.
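The default settings listed in the Table 1 caption can be turned into a small simulator. This is a hedged sketch, not the authors' simulation code: it draws the latent bivariate Gaussian for one feature, generates zero-inflated Poisson expression counts through an exponential link, and generates binomial methylation counts. Two choices are assumptions made here for illustration only: the cell-specific scaling factors are set to 1, and a logistic link maps the second latent variable to a methylation probability (the link used for the binomial model is not stated in this part of the text). The function name `simulate_feature` is hypothetical.

```python
import numpy as np


def simulate_feature(rho, n_cells=60, mu=(4.0, 1.0), sigma=(3.0, 2.0),
                     zi=0.20, coverage=(50, 500), seed=0):
    """Simulate expression counts, methylated reads and coverage for one feature."""
    rng = np.random.default_rng(seed)
    s1, s2 = sigma
    # Covariance matrix parameterised by the latent correlation rho.
    cov = np.array([[s1**2, rho * s1 * s2],
                    [rho * s1 * s2, s2**2]])
    x = rng.multivariate_normal(np.asarray(mu), cov, size=n_cells)

    # Expression: zero-inflated Poisson, exponential link, scaling factors fixed to 1.
    counts = rng.poisson(np.exp(x[:, 0]))
    dropout = rng.random(n_cells) < zi
    expr = np.where(dropout, 0, counts)

    # Methylation: binomial with uniformly sampled coverage and an ASSUMED logistic link.
    n_reads = rng.integers(coverage[0], coverage[1] + 1, size=n_cells)
    p_meth = 1.0 / (1.0 + np.exp(-x[:, 1]))
    meth = rng.binomial(n_reads, p_meth)
    return expr, meth, n_reads, x
```

Calling `simulate_feature(-0.7)` would produce one feature resembling experiments 4 to 6, and looping over 300 sampled correlations would give a data set of the size used in the benchmarks.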


Here, we primarily focus on estimation accuracy for the feature-specific latent correlation ρ_j, but also summarize results for other parameters to give a complete view. Violin plots summarizing the difference between SCRaPL’s posterior estimates and the generating parameters as a function of the number of cells can be found in Fig 2. Results for other model parameters are displayed in S3 Text (see Figs A-I in S3 Text).

Fig 2 — (2A) Difference between estimated and true correlation as a function of the number of cells for SCRaPL, Spearman and Pearson. (2B) Estimated correlation as a function of true correlation for SCRaPL, Spearman and Pearson in synthetic datasets with 300 genes and 1600 cells. Each dot represents a gene and is color-coded by inference approach.

We start by considering a situation of perfect model specification (experiments 1–3 in S3 Text), in order to assess the identifiability of our model and to document the degradation of correlation estimates obtained with classical methods. In this case, we observe that all methods provide estimates of correlation with zero-mean expected error, with an accuracy which increases with the number of cells in the data set. However, particularly for relatively low numbers of cells, the accuracy of the estimates was considerably higher for SCRaPL than for the classical Pearson and Spearman methods. Fig 2A shows a comparison of the three methods as we vary the number of cells, with the ZI level fixed to 20% (a benign setting similar to what is encountered in high-depth plate-based technologies). SCRaPL outperforms both Spearman and Pearson by a large margin for all numbers of cells considered. Even in the most favourable case of 1600 cells (an unreasonably large number of cells for plate-based technologies), Pearson and Spearman systematically underestimate the absolute value of the correlation, while SCRaPL returns an accurate estimate for all true correlation values, as shown in the scatterplot in Fig 2B. So, while the signed errors of all methods average to zero across features, Spearman and Pearson shrink correlations towards zero, potentially leading to lower power (see next section). As expected, the performance of all methods degrades with increasing levels of ZI (see S3 Text). However, we did not observe significant differences in SCRaPL correlation estimates across different levels of coverage (see S3 Text).

To probe the importance of prior specification, we generated data where the underlying correlation values ρ_j were in an area with low prior mass (experiments 4, 5 and 6 in S3 Text). In this case, we did observe some bias in our estimates (see Figs D-F in S3 Text), particularly when the number of cells is low. Similarly, performance diminishes with increasing ZI levels and stays relatively intact across different coverage levels.

As a final test of more severe model mismatch, we evaluated predictive performance in a scenario where we retained the same noise model, but replaced the latent multivariate Gaussian distribution by expression rates inferred using a variational auto-encoder [scVI; [14]] that was trained on the scRNAseq data from [15] (see S3 Text, experiments 7–9). Despite the model mismatch, we observed good estimation performance for ρ_j across a range of simulation parameters (see Figs G-I in S3 Text).

SCRaPL improves the power to identify associations between molecular layers in mouse embryonic stem and brain cells

We next consider two single cell multi-omics datasets generated by scNMT-seq [8] and the 10x Genomics Multiome ATAC plus Gene Expression platform. Samples correspond to mouse embryonic stem cells (mESC) and brain cells (fresh cortex, hippocampus, and ventricular zone) (mEBC) at various developmental stages (embryonic days 4.5, 5.5, 6.5 and 7.5 for mESC and 18 for mEBC), which comprise the exit from pluripotency and primary germ layer [15] and the end of retinal ganglion cell generation [16].

In mESC cells we investigate correlations between expression and methylation of protein-coding promoters within ±2.5 kbp of the Transcription Start Site (TSS). The mEBC data set consists instead of accessibility (measured by ATAC-seq) and expression data. In that case we quantify the associations between expression and accessibility of enhancers lying in a region of 12.5 kbp from the gene (an analysis of associations between expression and promoter accessibility for mESC cells is shown in S3 Text). Importantly, the two data sets were obtained using technologies with widely differing technical characteristics: the plate-based scNMT platform returns good coverage levels (and hence low ZI) for a limited number of cells, while the 10x platform assays many more cells but with lower coverage and higher dropout rates. After quality control, the resulting data sets contained 9480 features (gene promoters) and 679 mESCs, and 4249 features (enhancers) and 4052 mEBCs, respectively (Methods).

To compare the power of different methods to detect associations between molecular layers, we considered, alongside SCRaPL, the classical Spearman and Pearson correlation tests (Methods). Pearson correlation in particular has been widely used for single cell multi-omics data (e.g. [4, 8]); neither method takes noise into account when producing estimates of correlation. Molecular layer associations were called significant by controlling the EFDR (for SCRaPL) and the FDR (for Pearson and Spearman) at 10%.
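For the frequentist tests, FDR control across thousands of features is typically applied to the per-feature p-values. The sketch below implements the Benjamini-Hochberg step-up procedure as one standard way of doing this; whether this exact procedure was used for the Pearson and Spearman analyses is an assumption here, since the details are deferred to Methods. The function name `benjamini_hochberg` is illustrative.

```python
import numpy as np


def benjamini_hochberg(pvals, fdr=0.10):
    """Boolean mask of rejected hypotheses under the Benjamini-Hochberg procedure."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    ranked = pvals[order]
    # Compare the k-th smallest p-value with the threshold fdr * k / m.
    below = ranked <= fdr * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: reject everything up to the largest rank that passes.
        k = np.max(np.nonzero(below)[0])
        rejected[order[:k + 1]] = True
    return rejected
```

The mask is returned in the original feature order, so it can be used directly to index a table of gene/promoter pairs.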

Fig 3 shows the summary results of these analyses. On the ESC data set, SCRaPL retrieves approximately 2.5 to 3 times more associations than Spearman and Pearson testing, respectively: 217 (SCRaPL) versus 68 (Pearson) and 85 (Spearman) (Fig 3C and Table A in S4 Text). Fig 3A shows Bayesian volcano plots, demonstrating how SCRaPL captures many more associations than the frequentist alternatives. The overwhelming majority of the associations recovered by frequentist methods are also captured by SCRaPL (Pearson and Spearman tend to be in very good agreement on this data set). We also looked at accessibility-expression pairs, but due to weak signal no significant features were found by any of the methods. Later, we will investigate the biological significance of these results, showing how the greater statistical power of SCRaPL does in fact afford greater insights into the underlying biology.

Fig 3B and 3D show the analogous results for the analysis of EBC data. Here the picture is completely different: while SCRaPL still detects many more associations than Pearson, Spearman testing collapses and detects only one significant association. This points to a statistical vulnerability of Spearman testing when applied to data with high zero inflation in both molecular layers. More precisely, the large number of zeros present in both expression and accessibility mEBC data creates a large set of rank ties, an intrinsic mathematical problem for Spearman correlation. This is reflected in the 4180, 816 and 1 associations detected by SCRaPL, Pearson and Spearman, respectively (Fig 3D and Table B in S4 Text). Both SCRaPL and, to a lesser extent, Pearson testing identify as significant a greater fraction of association pairs between accessibility and expression, as expected from the basic biology of gene expression. In particular, the overwhelming majority of correlations between proximal enhancer accessibility and gene expression were deemed to be significant by SCRaPL, reflecting the importance of proximal enhancers in the regulation of gene expression.
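The extent of the tie problem is easy to quantify for any given feature. The helper below is a small illustrative sketch (the name `tie_fraction` is hypothetical): it reports the fraction of observations whose value is shared with at least one other cell, which is the share of the data that cannot be ranked unambiguously.

```python
import numpy as np


def tie_fraction(values):
    """Fraction of observations whose value is shared with at least one other observation."""
    values = np.asarray(values)
    _, counts = np.unique(values, return_counts=True)
    # Sum the sizes of all groups of repeated values.
    return float(counts[counts > 1].sum() / values.size)
```

For sparse count data in which most cells record zero, this fraction approaches one in both layers, leaving very little rank information for a Spearman test to work with.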

While the ability of SCRaPL to detect larger numbers of associations is certainly an encouraging feature, it is essential to characterize whether this is due to greater power, or simply to a greater vulnerability to false positives. However, determining empirically the false positive rate is challenging as access to ground truth correlation values for each feature is impossible.

To address these issues, we proceed pragmatically by constructing negative control data sets in which observations of methylation and expression values for a particular feature in different cells are randomly permuted. This will destroy any correlation structure between the two quantities, so that features detected as significant in negative control data can be considered as false positives. Here, we constructed 5 negative control datasets. For all negative controls, SCRaPL and Pearson/Spearman testing only detected a handful of associations, consistently fewer than for the original data (see Tables A-B in S4 Text). These results suggest that all methods control for false positives, reinforcing the significance of the associations retrieved.
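A permutation-based negative control of this kind can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: data are assumed to be stored as arrays of shape (cells, features), one layer is shuffled across cells independently for each feature, and the same permutation is applied to methylated read counts and their coverage so that each pair stays together. The function names `permutation_indices` and `apply_permutation` are hypothetical.

```python
import numpy as np


def permutation_indices(n_cells, n_features, seed=0):
    """One independent random ordering of the cells per feature, shape (n_cells, n_features)."""
    rng = np.random.default_rng(seed)
    return np.column_stack([rng.permutation(n_cells) for _ in range(n_features)])


def apply_permutation(layer, idx):
    """Reorder each column of a (n_cells, n_features) array with its own permutation."""
    return np.take_along_axis(np.asarray(layer), idx, axis=0)
```

Applying the same `idx` to the methylation counts and to the coverage matrix, while leaving expression untouched, preserves every per-feature marginal distribution but breaks the cell-level pairing between layers. Repeating this with different seeds gives several negative control data sets.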

In summary, these results demonstrate that SCRaPL displays significantly increased statistical power in detecting associations between different molecular layers for both main types of multi-omics technological platforms. Intriguingly, both SCRaPL and Pearson testing appear to be largely insensitive to the type of technology, with SCRaPL identifying between 2.5 and 5 times more associations. Instead, Spearman testing reveals an intrinsic weakness in dealing with high sparsity data, making it potentially unsuitable as a tool for 10x multi-omics data analysis.

SCRaPL associations are influenced by data sparsity and are robust to outliers

Fig 3A–3D (and Figs A-C in S6 Text) show clearly that, while most Pearson/Spearman associations are also detected by SCRaPL, there are still some discrepancies. It is therefore natural to wonder to what extent the signals detected by the alternative methods are different, and what factors influence the different outcomes, in particular the much greater detection power obtained by SCRaPL.

From the modeling perspective, there are two major differences. First, SCRaPL considers noise models which capture overdispersion and take into account coverage in the epigenomic data. This should make SCRaPL associations less vulnerable to outlier values (e.g. genes with low average expression with one or two high readings) or to epigenomic measurements with low coverage. Second, SCRaPL includes zero inflation in its accessibility/expression model, and can therefore attribute some measurements of zero expression to that component should the evidence dictate so. In the rest of this section, we present some empirical evidence that supports the presence of these benefits in our real data analysis.

We consider the set of associations which are called significant by at least one method, and split it into three categories: associations on which the methods agree, associations labelled as significant by SCRaPL but not by Pearson/Spearman testing, and vice versa. We then analyze these three sets to detect common patterns, discussing some examples to substantiate our findings.

Features for which Pearson/Spearman testing and SCRaPL agree tend to have high coverage and a small number of zeros in expression (or accessibility for 10x data). An example feature called significant by both SCRaPL and Pearson is shown in Fig 4A.

To gain more insight into the factors driving SCRaPL inferences, it is interesting to focus on associations whose significance differs between the two methods. An example of an association detected by Pearson/Spearman testing but not SCRaPL is shown in Fig 4B. As we can see, this feature has a large fraction of zero expression values together with very low methylation coverage. As a result, SCRaPL, while placing most of the posterior mass over negative correlation values, cannot confidently exclude the possibility of no correlation. This example perfectly illustrates that divergences between SCRaPL and Pearson/Spearman testing are often driven not by expected values, but by the fact that SCRaPL additionally performs uncertainty quantification on its results.

An example of an association deemed significant by SCRaPL, but not by Pearson/Spearman testing, is shown in Fig 4C. In this case, we tend to have medium to high expression and good coverage. However, Pearson/Spearman correlations remain below detection levels due to a number of observations with zero expression. This is an example where SCRaPL can be particularly beneficial, since the noise model can better capture the potential effect of zero inflation.

To provide a more quantitative, global explanation of the differences between SCRaPL and Pearson, we regress the absolute difference in inferred correlation against methylation coverage and percentage of zero counts for each feature across all cells. The resulting regressions, shown in Fig 4D, demonstrate a weak but consistent effect of both forms of noise, confirming that differences between the two methods are more prominent in noisier situations where methylation coverage is low or sparsity is high. This analysis is also confirmed for Pearson and Spearman in both mESC and mEBC data in Fig A in S11 Text.
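A regression of this kind can be sketched in a few lines. The code below is illustrative only and makes two assumptions: each covariate is analysed with its own simple least-squares line (the text does not say whether the regressions were fitted jointly or separately), and the per-feature sparsity is summarised as the fraction of cells with a zero count. The names `fit_line` and `zero_fraction` are hypothetical.

```python
import numpy as np


def fit_line(covariate, abs_diff):
    """Least-squares slope and intercept of abs_diff regressed on a single covariate."""
    slope, intercept = np.polyfit(np.asarray(covariate, dtype=float),
                                  np.asarray(abs_diff, dtype=float), deg=1)
    return float(slope), float(intercept)


def zero_fraction(counts):
    """Per-feature fraction of cells with a zero count; counts is (n_cells, n_features)."""
    return np.mean(np.asarray(counts) == 0, axis=0)
```

Here `abs_diff` would hold, for each feature, the absolute difference between the SCRaPL and Pearson correlation estimates, and the covariate would be either the mean methylation coverage or the output of `zero_fraction`.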

SCRaPL identifies biologically meaningful epigenetic regulation in early mouse gastrulation

To provide further biological support to SCRaPL associations, we perform our own exploratory analysis in early gastrulation phases using SCRaPL significant findings.

We start by choosing early pluripotency and germ cell markers where methylation’s strong repressive role is widely investigated (e.g. [15, 17, 18]). Developmental pluripotency markers (i.e. Dppa2, Dppa4, Dppa5a) exhibit strong regulatory patterns, with the generally high expression levels in days 4.5 and 5.5 being gradually suppressed as cells diversify into progenitors of major organs. Methylation’s strong silencing role was also found in Dnmt3l, a catalytically inactive DNA methyltransferase that cooperates with Dnmt3a and Dnmt3b to methylate DNA [19]. In addition, our analysis identified a series of genes with strong regulatory action: Atp6v0d1, vital to embryonic development [20]; Tex19.1, vital to spermatogenesis and placenta-supported development [21]; and others with unknown roles, such as Zfp981 and Trap1a.

To complete the exploratory analysis, we perform Gene Set Enrichment Analysis (GSEA) using DAVID [22] to establish links with biological phenomena observed in early embryogenesis and gene promoter methylation. To identify the processes we require a minimum of 7 genes, allow p-values up to 0.3 and sort them by enrichment score. As a result we identified a total of fifteen developmental and house-keeping processes (see Fig A in S7 Text). The highest enrichment scores are found for angiogenesis and in utero embryonic development, with 2.6 and 2.1 respectively. For house-keeping processes we obtain proteolysis, ion transport and negative regulation of transcription, with enrichment scores of 2.2, 1.9 and 1.7 respectively. Using the same filtering parameters in DAVID with the set of genes detected by Pearson, we recover a single process, regulation of transcription, with enrichment score 1.5. Spearman testing detects a larger number of associations than Pearson on the ESC data set, and consequently has increased power in detecting enriched processes. In this case the number of recovered processes increases to 7, which however consist primarily of house-keeping processes (see Fig B in S7 Text).

This analysis confirms the biological plausibility of the identified SCRaPL associations. It should be emphasised that the enrichment analysis has only been possible due to the larger number of associations identified by SCRaPL: GSEA analyses require considerable numbers of genes to identify any significant enrichment. This underlines the fact that technical variability not only erodes correlation but significantly under-powers downstream exploratory analysis in multi-omics data. Hence by modelling data generative processes we can increase substantially the scope of downstream interpretative analyses of single-cell multi-omics data.

Using SCRaPL as a data denoising tool

The detection of associations between layers is only one of the many possible analyses which can be performed on multi-omics data sets. A substantial line of research has recently emerged around the topic of data integration, which aims to combine data from multiple layers measured in different cells obtained from the same biological system. The goal of such analyses is to enhance our understanding of cellular identity and function [23]. Popular platforms like Seurat [24] implement data integration via a dimensionality reduction approach based on Canonical Correlation Analysis (CCA), a technique based on Singular Value Decomposition of empirical correlation matrices [25]. Despite its proven capabilities, CCA is not designed to handle count data. We therefore wondered whether SCRaPL’s likelihoods, tailored to specific data formats, could in certain cases provide a valuable addition to the integration pipeline. Specifically, here we use SCRaPL as a denoising tool, and perform data integration at the level of the latent variables, rather than the raw data. In this subsection we follow the vignettes provided by Seurat’s authors [26] and compare the results with and without SCRaPL’s denoising. We note that this analysis is only a proof of concept, as SCRaPL uses multi-omics data collected in the same cells as input and therefore such integration is not required.

For the comparison between SCRaPL-denoised and raw data we looked at peripheral blood mononuclear cell (PBMC) data [27]. This dataset contains expression and accessibility for 12000 PBMCs gathered from a healthy 25-year-old donor (see Methods for details on data pre-processing).

To perform data integration, we remove cell-specific noise by sampling latent space accessibility/expression from the respective posterior distributions obtained from SCRaPL. In cases of peaks mapped to multiple genes, readings were averaged. These data were integrated by Seurat [24], skipping the TF-IDF (Seurat’s accessibility data preprocessing) and scRNA normalization (aimed at expression data) steps. Standard performance monitoring plots, such as label transfer between single cell expression and accessibility data as well as integration plots, are presented in Fig 5. In general, epigenomic and transcriptomic layers have integrated well for both raw and SCRaPL-preprocessed data, as suggested by Fig 5C and 5D. This picture remains consistent across multiple other trials (see Fig A in S12 Text). The integration metrics found in Fig 5A and 5D show a comparable performance between SCRaPL preprocessing and raw data. This is in stark contrast with the results of the preceding sections, which demonstrated a consistent superiority of SCRaPL in detecting associations at the level of individual features. The reason for this is probably to be found in the dimensionality reduction performed by Seurat: canonical components found by CCA are obtained via an averaging process which already does an excellent job at filtering out noise, much in the way that robust principal components can often be extracted also from noisy data.

Fig 5 — Visualization of scRNA and scATAC data on the same plot for raw (5C) and SCRaPL-preprocessed (5D) data.

Discussion

Single cell multi-omics sequencing technologies are rapidly becoming an important tool to understand epigenetic regulation for individual cells in complex biological processes, such as early embryo development. However, analysis of such data still presents a major bottleneck, due to the high-dimensionality, sparsity and heterogeneous noise affecting them. In this paper, we argued that the introduction of noise-aware approaches is fundamental in developing the field of single-cell multi-omics. We introduced SCRaPL, a Bayesian approach to perform perhaps the most basic and common multi-omics analysis, the discovery of correlative associations between different data modalities. By employing dedicated noise models in a latent-Gaussian framework, SCRaPL achieves more powerful and more robust results than simple analyses based on Pearson correlation, which is by far the most widespread tool currently used.

Our analyses were based on existing annotation, where the expression of a given gene was correlated with epigenetic data from a nearby genomic region (promoters or nearby enhancers). This appears to be a reasonable demonstration of the tool, although it clearly limits the scope for discovery of interesting biological processes such as distal regulation. It should be pointed out that SCRaPL could also be used to test associations between unannotated regions along the lines explored in e.g. [12], [28].

The Bayesian hierarchical framework employed by SCRaPL also offers a template for the application of more complex analysis techniques (such as clustering, dimensionality reduction and network inference) to multi-omics data. In many analyses, we expect that consistent handling of noise will be valuable, although it should be pointed out that some downstream analyses already perform noise filtering implicitly. This was demonstrated in our comparison with the CCA approach implemented in Seurat [24], which effectively averages out noise during dimensionality reduction, yielding very similar results to SCRaPL. As with most Bayesian methods, SCRaPL does suffer from a higher computational burden, particularly when compared with extremely simple analyses, such as Pearson correlation. Extension of noise-aware Bayesian methods to different single-cell multi-omics analyses may therefore require the adoption and evaluation of more efficient computational inference techniques, such as variational inference [29].

Materials and methods

A Bayesian hierarchical framework for noisy single cell multi-omics data

SCRaPL implements a Bayesian hierarchical approach that is tailored to the data generated by single cell multi-omic assays. Here, we assume that matched data is available for two molecular phenotypes, but our formulation is flexible and can in principle be expanded to include additional layers. A graphical representation for the model implemented in SCRaPL is provided in Fig 1. The distribution of a latent vector X_ij is used to capture the association across molecular layers. For each cell i (∈{1, … I}) and feature j (∈{1, … J}), the latter is given by

$$
X_{ij} = \begin{pmatrix} X_{ij1} \\ X_{ij2} \end{pmatrix} \,\Big|\, \mu_{j}, \Sigma_{j} \overset{\text{ind}}{\sim} N(\mu_{j}, \Sigma_{j}), \tag{1}
$$

where

$$
\mu_{j} = \begin{pmatrix} \mu_{j1} \\ \mu_{j2} \end{pmatrix} \quad \text{and} \quad \Sigma_{j} = \begin{pmatrix} \sigma_{j1}^{2} & \rho_{j} \sigma_{j1} \sigma_{j2} \\ \rho_{j} \sigma_{j1} \sigma_{j2} & \sigma_{j2}^{2} \end{pmatrix}. \tag{2}
$$
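The latent layer in Eqs (1) and (2) can be illustrated with a minimal Python sketch. The function names and the parameter values used below (μ_j = (4, 0)′, σ_j1 = 1, σ_j2 = 0.5, ρ_j = −0.7) are assumptions chosen for illustration only; they are not taken from the SCRaPL code base or from the analyses in this article.

```python
import math
import random


def covariance_matrix(sigma1, sigma2, rho):
    """Build the 2x2 covariance matrix of Eq (2) from sigma_j1, sigma_j2 and rho_j."""
    off_diag = rho * sigma1 * sigma2
    return [[sigma1 ** 2, off_diag], [off_diag, sigma2 ** 2]]


def sample_latent(mu, sigma1, sigma2, rho, rng):
    """Draw one latent vector (x_ij1, x_ij2) from N(mu_j, Sigma_j) as in Eq (1).

    The Cholesky factor of Sigma_j is written out by hand for the 2x2 case.
    """
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x1 = mu[0] + sigma1 * z1
    x2 = mu[1] + sigma2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return x1, x2


def empirical_correlation(pairs):
    """Pearson correlation of a list of (x1, x2) pairs."""
    n = len(pairs)
    mean1 = sum(p[0] for p in pairs) / n
    mean2 = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - mean1) * (p[1] - mean2) for p in pairs)
    var1 = sum((p[0] - mean1) ** 2 for p in pairs)
    var2 = sum((p[1] - mean2) ** 2 for p in pairs)
    return cov / math.sqrt(var1 * var2)


# Illustrative values only (assumptions, not estimates from the paper).
rng = random.Random(0)
draws = [sample_latent((4.0, 0.0), 1.0, 0.5, -0.7, rng) for _ in range(5000)]
latent_correlation = empirical_correlation(draws)
```

With enough draws, the empirical correlation of the latent pairs should be close to the chosen ρ_j, which is the quantity that SCRaPL targets.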

In this formulation, we assume independence across all features, which are analyzed separately (this enables trivial parallelization across features). Different noise models are then assigned to each molecular layer based on the properties of the associated data. We use one of two likelihoods, depending on the type of data. For count data (i.e. gene expression in mESC/mEBC and chromatin accessibility in mEBC) we use a zero-inflated Poisson noise model, and for the rest (i.e. DNAm and accessibility in mESC) we use a Binomial distribution. Specific noise models for each of the data types considered here are described below.

RNA expression noise model

Let Y_ij1 be a random variable representing the number of raw read-counts observed for each cell i and feature j. Conditional on the value of the latent variable X_ij1, we use an exponential link function and assume that

$$
P(Y_{ij1} = y_{ij1} \mid X_{ij1} = x_{ij1}, s_{i}, \pi_{j}) =
\begin{cases}
(1 - \pi_{j}) \dfrac{(s_{i} e^{x_{ij1}})^{y_{ij1}} \exp(-s_{i} e^{x_{ij1}})}{y_{ij1}!} & \text{if } y_{ij1} > 0, \\
\pi_{j} + (1 - \pi_{j}) \exp(-s_{i} e^{x_{ij1}}) & \text{if } y_{ij1} = 0.
\end{cases} \tag{3}
$$

The latter corresponds to a zero-inflated Poisson (ZIP) model with an exponential link, where s_i (> 0) is a cell-specific scaling factor that accounts for global differences across cells (e.g. due to sequencing depth) and π_j (∈[0, 1]) represents a zero-inflation parameter (if π_j = 0, Eq (3) reduces to a Poisson model). The exponential link function leads to a zero-inflated Poisson-lognormal model, whose variations have been previously used for single cell RNA sequencing (scRNAseq) data [30, 31]. In practice, we infer scaling factors s_i using scran [32] and use them as known model offsets.
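As an illustration, Eq (3) can be evaluated on the log scale as in the following minimal Python sketch. The function name is hypothetical, and the sketch assumes 0 ≤ π_j < 1 so that the logarithms are defined.

```python
import math


def zip_log_likelihood(y, x, s, pi):
    """Log of Eq (3): zero-inflated Poisson with rate s_i * exp(x_ij1).

    y  : observed raw read count (non-negative integer)
    x  : latent value x_ij1
    s  : cell-specific scaling factor s_i (> 0)
    pi : zero-inflation parameter pi_j (assumed to satisfy 0 <= pi < 1)
    """
    rate = s * math.exp(x)
    if y == 0:
        # Zeros arise either from the inflation component or from the Poisson.
        return math.log(pi + (1.0 - pi) * math.exp(-rate))
    # Positive counts can only come from the Poisson component.
    return math.log(1.0 - pi) + y * math.log(rate) - rate - math.lgamma(y + 1)
```

Setting `pi = 0` recovers the standard Poisson log-likelihood, in line with the remark above.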

The need for a zero-inflation component is a matter of debate for scRNA-seq data [33] and may depend on the experimental protocol used to generate the data. See Comparing between alternative models later in this section for a quantitative approach to evaluate the need for zero-inflation in specific datasets.

DNAm noise model

For each cell i and feature j, let n_ij be the number of CpG sites within a pre-specified genomic region (e.g. gene promoter) for which DNAm reads were obtained. These capture differences in coverage across cells and features. The conditional model for the number of methylated CpG sites Y_ij2 is then assumed to follow a binomial distribution such that

$$
P(Y_{ij2} = y_{ij2} \mid X_{ij2} = x_{ij2}, n_{ij}) = \binom{n_{ij}}{y_{ij2}} \Phi(x_{ij2})^{y_{ij2}} \left(1 - \Phi(x_{ij2})\right)^{n_{ij} - y_{ij2}}, \tag{4}
$$

where Φ(⋅) denotes the standard normal cumulative distribution function, i.e. the model uses a probit link.
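A minimal Python sketch of Eq (4) is given below, with Φ computed from the error function. The function names are hypothetical, and the sketch assumes moderate values of x_ij2 so that Φ(x_ij2) is strictly between 0 and 1 in floating point.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def binomial_probit_log_likelihood(y, n, x):
    """Log of Eq (4): y methylated CpG sites out of n covered sites.

    The methylation probability is Phi(x), where x is the latent value x_ij2.
    """
    p = normal_cdf(x)
    log_binom = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_binom + y * math.log(p) + (n - y) * math.log(1.0 - p)
```

For example, a latent value of 0 corresponds to a methylation probability of 0.5, and increasingly negative values of x_ij2 push the probability towards 0.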

Chromatin accessibility noise model

The choice of noise model depends on the format of the input data. If the data consists of the number of features Y_ij2 in a genomic region [e.g. as in [34]], our modelling approach follows the one described for RNA expression data. If the data consists of open peaks within a genomic region (e.g. as for the 10X scATAC-seq protocol), the same binomial noise model used for DNA methylation data can be applied.

Parameter interpretation

To aid the interpretation of each model parameter, mean and variance expressions are derived for the noise models introduced above after integrating out the distribution of the latent vector X_ij (see S8 Text). In both cases, μ_j1 and μ_j2 control the overall RNA expression and DNAm values for the population of cells under study. Moreover, σ_j1 and σ_j2 capture the excess of variability (overdispersion) that is observed with respect to the baseline noise model. Finally, ρ_j captures the latent correlation between molecular layers.
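As a worked sketch for the RNA layer (derived here directly from Eqs (1) and (3); the notation $\bar{\lambda}_{ij}$ is introduced only for this illustration and S8 Text remains the reference derivation), integrating out X_ij1 gives

$$
\bar{\lambda}_{ij} = s_{i} \exp\left(\mu_{j1} + \tfrac{1}{2} \sigma_{j1}^{2}\right), \qquad
E[Y_{ij1}] = (1 - \pi_{j}) \bar{\lambda}_{ij},
$$

$$
\mathrm{Var}[Y_{ij1}] = (1 - \pi_{j}) \bar{\lambda}_{ij} \left[ 1 + \bar{\lambda}_{ij} \left( e^{\sigma_{j1}^{2}} - 1 + \pi_{j} \right) \right].
$$

When π_j = 0 and σ_j1 = 0 the variance equals the mean, as in a Poisson model. Any σ_j1 > 0 or π_j > 0 inflates the variance beyond the mean, which is the overdispersion referred to above.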

Prior specification

A popular prior choice for covariance matrices is the inverse Wishart distribution. However, this has been shown to bias correlation coefficients depending on whether marginal variances are small or large [35]. Instead, [36] used a separation strategy to decouple correlation from marginal variances. Our prior specification for Σ_j is based on the parametrization introduced in Eq (2), with independent priors assigned to all feature-specific parameters. Our prior specification is given by

$$
\pi_{j} \overset{\text{ind}}{\sim} \text{Beta}(a_{j}, b_{j}), \tag{5}
$$

$$
\mu_{j} \overset{\text{iid}}{\sim} N(m, H), \tag{6}
$$

$$
\sigma_{j1}, \sigma_{j2} \overset{\text{iid}}{\sim} \text{Inv-Gamma}(c_{1}, c_{2}), \tag{7}
$$

$$
\rho_{j} \overset{\text{iid}}{\sim} \text{Beta}_{[-1, 1]}(d_{1}, d_{2}). \tag{8}
$$

In Eq (8), the prior for ρ_j corresponds to a four-parameter Beta distribution, whose support has been scaled to be [−1, 1]. In order to avoid systematically favoring positive or negative correlations, we centered the prior at 0 by setting d₁ = d₂. We then tuned these prior hyper-parameters on negative control data (see S2 Text), eventually choosing Beta(15, 15) as it helped to suppress false positive detection rates. More information about this is provided in S5 Text. For the remaining hyper-parameters, default values were set as a_j = 2, b_j = 8 to encourage low zero inflation. Moreover, we set c₁ = 2.5 and c₂ = 4.5, although other values within a reasonable range also work. Finally, we set m = (4, 0)′ for mESC data and m = (4, 3)′ for mEBC data, and H was set to be a 2 × 2 identity matrix.
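To make the default priors concrete, the following minimal Python sketch draws one set of feature-specific parameters from Eqs (5) to (8). Two points are assumptions of this sketch rather than statements from the article: the Inv-Gamma(c₁, c₂) prior is read as having shape c₁ and scale c₂, and the prior mean (4, 0)′ is the mESC default. The function name is hypothetical.

```python
import random


def sample_prior(rng, a=2.0, b=8.0, c1=2.5, c2=4.5, d=15.0, prior_mean=(4.0, 0.0)):
    """Draw (pi_j, mu_j, sigma_j1, sigma_j2, rho_j) from the priors in Eqs (5)-(8)."""
    pi = rng.betavariate(a, b)
    # H is the 2x2 identity matrix, so the two components of mu_j are independent.
    mu = (rng.gauss(prior_mean[0], 1.0), rng.gauss(prior_mean[1], 1.0))
    # Assumption: Inv-Gamma(shape=c1, scale=c2), i.e. the reciprocal of a
    # Gamma variable with shape c1 and scale 1 / c2.
    sigma1 = 1.0 / rng.gammavariate(c1, 1.0 / c2)
    sigma2 = 1.0 / rng.gammavariate(c1, 1.0 / c2)
    # Beta(d, d) on [0, 1] rescaled to [-1, 1]; symmetric around zero.
    rho = 2.0 * rng.betavariate(d, d) - 1.0
    return {"pi": pi, "mu": mu, "sigma1": sigma1, "sigma2": sigma2, "rho": rho}


rng = random.Random(1)
prior_draws = [sample_prior(rng) for _ in range(2000)]
```

Under Beta(15, 15), most prior mass for ρ_j lies near zero, which is consistent with the aim of suppressing false positives.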

Implementation

As the posterior distribution associated with the model above does not have a closed analytical form, inference is implemented using the No-U-Turn Sampler [37], a state-of-the-art variant of Hamiltonian Monte Carlo [38].

For all the analyses shown in this article, we obtained 5000 samples from this algorithm and discarded the first 3000 iterations (burn-in) before estimating model parameters. Sampler parameters were tuned during burn-in to reach a target acceptance ratio of 0.65. Convergence was monitored using the Gelman-Rubin criterion [39].
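For readers unfamiliar with the Gelman-Rubin criterion, the sketch below computes the classical potential scale reduction factor for a single scalar parameter (e.g. ρ_j) from several chains of equal length. The number of chains is not stated above, so the use of multiple chains here is an assumption of this sketch, and the function name is hypothetical.

```python
import math
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains: list of equally long lists of post burn-in samples, one per chain.
    Values close to 1 suggest that the chains have mixed.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average within-chain variance; B: between-chain variance scaled by n.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)
```

Chains that explore the same distribution give a value near 1, whereas chains stuck in different regions give a value well above 1.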

A probabilistic rule to detect statistically significant associations across layers

SCRaPL identifies features with statistically significant correlation across multi-omics layers (e.g. RNA expression and promoter DNAm) based on the posterior distribution of feature-specific latent correlation parameters ρ_j. Our decision rule depends on whether the posterior mass for |ρ_j| is concentrated around high values. As in [40], this is quantified by the following tail posterior probabilities

$$
p_{j}(\gamma) = P(|\rho_{j}| \geq \gamma), \tag{9}
$$

where γ (> 0) denotes a minimum correlation threshold. If p_j(γ) is greater than a probability threshold α, a statistically significant correlation is reported for feature j.
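In practice, the tail posterior probability in Eq (9) can be approximated by the fraction of posterior samples of ρ_j whose absolute value is at least γ. A minimal Python sketch, with a hypothetical function name, is:

```python
def tail_posterior_probability(rho_samples, gamma):
    """Monte Carlo estimate of p_j(gamma) = P(|rho_j| >= gamma) from posterior samples."""
    hits = sum(1 for rho in rho_samples if abs(rho) >= gamma)
    return hits / len(rho_samples)


def is_significant(rho_samples, gamma, alpha):
    """Decision rule: report feature j if p_j(gamma) exceeds the threshold alpha."""
    return tail_posterior_probability(rho_samples, gamma) > alpha
```

Because the rule uses |ρ_j|, strongly negative and strongly positive correlations are treated symmetrically.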

Suitable values for γ and α could be chosen using different approaches. In principle, γ can be fixed a priori by the user. Instead, we adopt a data-driven approach based on the distribution of feature-specific posterior estimates obtained for |ρ_j| using negative control datasets (see S4 Text). This distribution can be used to quantify the strength of correlation estimates that can be expected by chance for a given sample size and sequencing depth. As a default choice, we select γ to match the 90% quantile of the distribution described above. For a fixed value of γ, a grid search is used to select α according to a target expected false discovery rate (EFDR). The latter is defined as

$$
\text{EFDR}_{\alpha} = \frac{\sum_{j = 1}^{J} (1 - p_{j}(\gamma)) \, I(p_{j}(\gamma) \geq \alpha)}{\sum_{j = 1}^{J} I(p_{j}(\gamma) \geq \alpha)}, \tag{10}
$$

where $I (A) = 1$ if A is true, 0 otherwise. Our default target EFDR is equal to 10%.
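The EFDR in Eq (10) and the grid search over α can be sketched as follows. The exact selection rule used by the grid search is not spelled out above, so this sketch assumes one reasonable choice: the smallest α on a regular grid whose EFDR does not exceed the target, which maximizes the number of reported features. Function names and the grid resolution are hypothetical.

```python
def efdr(tail_probs, alpha):
    """Expected false discovery rate of Eq (10) for a probability threshold alpha.

    tail_probs: list of p_j(gamma), one per feature.
    Returns None when no feature is selected (the ratio is undefined).
    """
    selected = [p for p in tail_probs if p >= alpha]
    if not selected:
        return None
    return sum(1.0 - p for p in selected) / len(selected)


def choose_alpha(tail_probs, target=0.10, grid_size=100):
    """Smallest alpha on the grid {0, 1/grid_size, ..., 1} with EFDR <= target.

    This selection rule is an assumption of the sketch. Returns None if no
    grid value meets the target.
    """
    for k in range(grid_size + 1):
        alpha = k / grid_size
        value = efdr(tail_probs, alpha)
        if value is not None and value <= target:
            return alpha
    return None
```

Since raising α keeps only features with higher tail probabilities, the EFDR cannot increase with α, so the first grid value that meets the target also yields the largest set of discoveries.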

Current approach based on Pearson/Spearman correlation

To date, single cell multi-omics analyses have primarily used the Pearson/Spearman correlation coefficient r to quantify associations between different types of molecular data [e.g. [4, 8]]. These estimates are directly derived from the observed data and do not assume a specific noise model. As the input for this calculation, gene expression data is typically normalised [e.g. using scran; [32]] and subsequently log-transformed after adding a pseudocount, while DNAm is normalised by coverage [note that the addition of a pseudocount is arbitrary and has been shown to distort variance estimates; [31]].

Based on these estimates, statistically significant correlations are selected by contrasting the hypotheses H₀: |r| ≤ u and H₁: |r| > u, for some threshold u. To control the False Discovery Rate (FDR) across features, the Benjamini-Hochberg correction [41] is typically used.
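For reference, the Benjamini-Hochberg step-up procedure can be sketched in a few lines of Python. The sketch takes per-feature p-values as given (how they are obtained for the hypotheses above is described in S9 Text) and uses a hypothetical function name.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return a list of booleans marking the hypotheses rejected at the given FDR.

    Step-up rule: find the largest rank k such that p_(k) <= k / m * fdr and
    reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

Unlike the EFDR rule used by SCRaPL, this correction operates on p-values derived from the observed correlation estimates rather than on posterior probabilities.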

Comparing between alternative models

SCRaPL is a noise-aware approach with error models crafted to address challenges related to various multi-omics data. Our likelihood of choice for count data, such as gene expression and accessibility in 10X data, is the Poisson distribution. Since there is a debate surrounding the extent to which zero-inflation is required [33], we take an unbiased stance and use posterior model samples to perform model selection using the Deviance Information Criterion [DIC; [42]]. DIC assesses goodness of fit while penalizing large effective numbers of parameters, with lower DIC values indicating preferable models.

To assess the need for zero-inflation in SCRaPL, we fit the zero-inflated and the standard Poisson models to the methylation/expression mESC data and to the accessibility/expression mEBC data. For the large majority of features in the mESC and mEBC data, DIC favors zero inflation, as shown in Fig 6.

Fig 6. The more negative the DIC difference, the stronger the evidence in favor of the model with zero inflation on the gene expression component, and vice versa. As a visual reference, zero is marked with a dashed red line.
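As a minimal sketch of how DIC can be computed from posterior samples, the function below uses the standard definition DIC = D(θ̄) + 2 p_D, where D = −2 log-likelihood, θ̄ is the posterior mean and p_D = D̄ − D(θ̄) is the effective number of parameters. The function name and inputs are hypothetical and the sketch is not taken from the SCRaPL code base.

```python
import statistics


def deviance_information_criterion(log_lik_samples, log_lik_at_posterior_mean):
    """DIC from per-sample log-likelihoods and the log-likelihood at the posterior mean.

    log_lik_samples: log-likelihood of the data evaluated at each posterior sample.
    log_lik_at_posterior_mean: log-likelihood evaluated at the posterior mean.
    """
    mean_deviance = -2.0 * statistics.fmean(log_lik_samples)
    deviance_at_mean = -2.0 * log_lik_at_posterior_mean
    effective_parameters = mean_deviance - deviance_at_mean  # p_D
    return deviance_at_mean + 2.0 * effective_parameters
```

Computing this quantity for the zero-inflated and the standard Poisson model and taking the difference gives a per-feature comparison in which lower values favor the corresponding model.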

Single cell multi-omic datasets

We applied SCRaPL in the context of two single cell multiomic datasets. First, we consider the mESC dataset generated by [15] using the scNMT-seq protocol [8]. For these data, our analysis focuses on the correlation between gene expression and DNAm. We also experimented with the chromatin accessibility and gene expression pair, but with limited success due to the sparsity of the accessibility data. Our second case study considers mEBC data generated using the 10X Genomics Multiome ATAC plus gene expression platform. Quality control steps applied to both datasets are described in S1 Text.

To aggregate DNAm data across mESC and to link open chromatin from mESC to nearby genes, we follow a window-based approach. Reads are mapped using the GRCm38 mouse genome (accession number GSE56879). For more information, the reader is directed to S1 Text. When looking at methylation/gene expression of promoter regions in mESC data, a window of ±2.5 kbp was used. For chromatin accessibility/gene expression in the same dataset, the window was ±0.25 kbp. Similarly, for accessibility and expression in the mEBC data, we map enhancers to genes at most ±12.5 kbp away. To assess how our window choices affect results, we experimented with multiple window sizes and noticed minimal impact on the results.
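The window-based linking can be illustrated with a short Python sketch. It assumes, purely for illustration, that the window is centred on a single reference coordinate per gene (such as a transcription start site) and that regions are given as (start, end) pairs on the same chromosome; the actual preprocessing is described in S1 Text, and the function name is hypothetical.

```python
def regions_near_gene(reference_position, regions, window_bp):
    """Return the regions overlapping [reference_position - window_bp, reference_position + window_bp].

    reference_position: gene coordinate the window is centred on (assumption of this sketch).
    regions: list of (start, end) tuples on the same chromosome as the gene.
    window_bp: half-width of the window in base pairs, e.g. 2500 for +/-2.5 kbp.
    """
    lower = reference_position - window_bp
    upper = reference_position + window_bp
    return [(start, end) for start, end in regions if start <= upper and end >= lower]
```

For instance, with a reference position of 10,000 and a ±2.5 kbp window, a region spanning 7,400 to 7,600 is linked to the gene, while a region spanning 12,600 to 13,000 is not.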

Subsequently, a quality control step was applied to both datasets. For mESC data we removed features with zero variance in each modality and for which the percentage of expression zeros was above 80%. This resulted in a dataset with 9480 features and 679 cells. Similarly, for mEBC data we removed features with more than 80% of zeroes in accessibility or expression, leading to a dataset with 4249 features and 4052 cells.
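The mESC filtering rule described above can be sketched as a per-feature predicate. The function name and the representation of each feature as two lists of per-cell values are assumptions of this sketch.

```python
def passes_quality_control(expression, epigenetic, max_zero_fraction=0.8):
    """Keep a feature only if both modalities vary across cells and the
    fraction of zero expression counts does not exceed max_zero_fraction.

    expression: per-cell expression counts for the feature.
    epigenetic: per-cell epigenetic values (e.g. methylation rates) for the feature.
    """
    zero_fraction = sum(1 for value in expression if value == 0) / len(expression)
    has_variance = len(set(expression)) > 1 and len(set(epigenetic)) > 1
    return has_variance and zero_fraction <= max_zero_fraction
```

Applying such a predicate to every feature and keeping those that pass yields the filtered feature set used for inference.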

The data denoising analysis was done using peripheral blood mononuclear cells (PBMCs). For illustration purposes, we downsampled the dataset to 3000 cells by keeping only the ones with the highest sum across peaks. Peaks and genes were then reduced from 180000 to 30000 and from 36000 to 10000 respectively, based on their variability. Finally, the top 60k features in association magnitude were used by SCRaPL.

Supporting information

S1 Text. Data preprocessing.

Here we discuss the preprocessing and quality control steps taken in the mESC and mEBC datasets. This includes aggregating raw epigenetic data from multiple cells, normalizing single cell transcriptomic data, integrating epigenomic and transcriptomic layers and removing low quality data.

(PDF)

S2 Text. Creating negative control datasets.

In this section we describe step by step the generation of negative control data, explaining how we address problems such as missing coverage.

(PDF)

S3 Text. Synthetic data.

We include an extensive analysis of synthetic data experiments used to develop SCRaPL. In particular, we include three sets of experiments that investigate SCRaPL’s performance as a function of the number of cells, zero-inflation and coverage. The first set is performed on data sampled from the model, the second on data sampled from the model with the correlation drawn from a U[−0.8, −0.6] distribution, and the third on data partly sampled from a deep generative model and partly from the model.

(PDF)

S4 Text. Negative control experiments.

In this section we lay out the detection rate comparisons between SCRaPL and Pearson for methylation/expression of mESC, accessibility/expression of mESC and accessibility/expression of mEBC.

(PDF)

S5 Text. Choosing between correlation priors.

In this section we present a data driven approach for choosing prior hyper-parameters.

(PDF)

S6 Text. Extended comparisons between SCRaPL, Pearson and Spearman predictions.

In this section a more extended comparison between SCRaPL, Pearson and Spearman predictions is provided.

(PDF)

S7 Text. Gene set enrichment analysis.

The complete gene set enrichment analysis using DAVID is provided.

(PDF)

S8 Text. Connecting SCRaPL error model to likelihoods currently employed by practitioners.

We demonstrate that SCRaPL’s expression likelihood serves as a valid alternative, as it exhibits the over-dispersion property that practitioners seek.

(PDF)

S9 Text. Null hypothesis testing.

We give a thorough description of the hypothesis testing done to identify regions with strong regulatory action.

(PDF)

S10 Text. Efficiency analysis.

We provide evidence of SCRaPL’s scaling as a function of problem size.

(PDF)

S11 Text. Macroscopic analysis of SCRaPL behavior compared to Pearson and Spearman Correlation.

We provide figures additional to Fig 3D that demonstrate how SCRaPL behaves differently from Pearson/Spearman correlation when high zero inflation and low CpG coverage are present in the data.

(PDF)

S12 Text. Stability of denoising.

Figures summarizing scRNA and scATAC integration using posterior latent space samples.

(PDF)

Acknowledgments

We would like to thank Chantriolnt-Andreas Kapourani and Michalis Michaelides for their valuable comments and discussion.

Data Availability

The mESC dataset is available in the Gene Expression Omnibus under accession number GSE121708 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121708). Similarly, the mEBC dataset can be found on the official 10X Genomics website (https://www.10xgenomics.com/resources/datasets/fresh-embryonic-e-18-mouse-brain-5-k-1-standard-2-0-0). The PBMC dataset is also available on the 10X Genomics website (https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k). A Python implementation of SCRaPL, along with preprocessing scripts, is available at https://github.com/chrmaniatis/SCRaPL.

Funding Statement

Funding from Engineering and Physical Sciences Research Council (EPSRC) Centre for Doctoral Training in Data Science (grant EP/L016427/1) supported CM. The funders had no role in study design, data collection and analysis, decisions to publish, or preparation of the manuscript.

References

1. Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nature Methods. 2009;6. doi: 10.1038/nmeth.1315
2. Shapiro E, Biezuner T, Linnarsson S. Single-cell sequencing-based technologies will revolutionize whole-organism science. Nature Reviews Genetics. 2013;14. doi: 10.1038/nrg3542
3. Bock C, Farlik M, Sheffield NC. Multi-Omics of Single Cells: Strategies and Applications. Trends in Biotechnology. 2016;34. doi: 10.1016/j.tibtech.2016.04.004
4. Angermueller C, Clark SJ, Lee HJ, Macaulay IC, Teng MJ, Hu TX, et al. Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nature Methods. 2016;13. doi: 10.1038/nmeth.3728
5. Hu Y, Huang K, An Q, Du G, Hu G, Xue J, et al. Simultaneous profiling of transcriptome and DNA methylome from a single cell. Genome Biology. 2016;17. doi: 10.1186/s13059-016-0950-z
6. Cao J, Cusanovich DA, Ramani V, Aghamirzaie D, Pliner HA, Hill AJ, et al. Joint profiling of chromatin accessibility and gene expression in thousands of single cells. Science. 2018;361. doi: 10.1126/science.aau0730
7. Chen S, Lake BB, Zhang K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nature Biotechnology. 2019;37. doi: 10.1038/s41587-019-0290-0
8. Clark SJ, Argelaguet R, Kapourani CA, Stubbs TM, Lee HJ, Alda-Catalinas C, et al. ScNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nature Communications. 2018;9. doi: 10.1038/s41467-018-03149-4
9. Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics. 2015;16. doi: 10.1038/nrg3833
10. Hernando-Herraez I, Evano B, Stubbs T, Commere PH, Bonder MJ, Clark S, et al. Ageing affects DNA methylation drift and transcriptional cell-to-cell variability in mouse muscle stem cells. Nature Communications. 2019;10. doi: 10.1038/s41467-019-12293-4
11. Vallejos CA, Marioni JC, Richardson S. BASiCS: Bayesian Analysis of Single-Cell Sequencing Data. PLoS Computational Biology. 2015;11. doi: 10.1371/journal.pcbi.1004333
12. Kapourani CA, Argelaguet R, Sanguinetti G, Vallejos CA. scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution. Genome Biology. 2021;22. doi: 10.1186/s13059-021-02329-8
13. Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004;5. doi: 10.1093/biostatistics/5.2.155
14. Lopez R, Regier J, Cole MB, Jordan MI, Yosef N. Deep generative modeling for single-cell transcriptomics. Nature Methods. 2018;15. doi: 10.1038/s41592-018-0229-2
15. Argelaguet R, Clark SJ, Mohammed H, Stapel LC, Krueger C, Kapourani CA, et al. Multi-Omics profiling of mouse gastrulation at single-cell resolution. Nature. 2019;576. doi: 10.1038/s41586-019-1825-8
16. Robinson SR, Dreher B. The visual pathways of eutherian mammals and marsupials develop according to a common timetable. Brain, Behavior and Evolution. 1990;36. doi: 10.1159/000316082
17. Eckersley-Maslin M, Alda-Catalinas C, Blotenburg M, Kreibich E, Krueger C, Reik W. Dppa2 and Dppa4 directly regulate the Dux-driven zygotic transcriptional program. Genes and Development. 2019;33. doi: 10.1101/gad.321174.118
18. Hu YG, Hirasawa R, Hu JL, Hata K, Li CL, Jin Y, et al. Regulation of DNA methylation activity through Dnmt3L promoter methylation by Dnmt3 enzymes in embryonic development. Human Molecular Genetics. 2008;17. doi: 10.1093/hmg/ddn165
19. Neri F, Krepelova A, Incarnato D, Maldotti M, Parlato C, Galvagni F, et al. Dnmt3L antagonizes DNA methylation at bivalent promoters and favors DNA methylation at gene bodies in ESCs. Cell. 2013;155. doi: 10.1016/j.cell.2013.08.056
20. Miura GI, Froelick GJ, Marsh DJ, Stark KL, Palmiter RD. The d subunit of the vacuolar ATPase (Atp6d) is essential for embryonic development. Transgenic Research. 2003;12. doi: 10.1023/A:1022118627058
21. Tarabay Y, Kieffer E, Teletin M, Celebi C, Montfoort AV, Zamudio N, et al. The mammalian-specific Tex19.1 gene plays an essential role in spermatogenesis and placenta-supported development. Human Reproduction. 2013;28. doi: 10.1093/humrep/det129
22. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols. 2009;4. doi: 10.1038/nprot.2008.211
23. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, et al. Comprehensive Integration of Single-Cell Data. Cell. 2019;177. doi: 10.1016/j.cell.2019.05.031
24. Hao Y, Hao S, Andersen-Nissen E, Mauck WM, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184. doi: 10.1016/j.cell.2021.04.048
25. Loan CFV. Generalizing the Singular Value Decomposition. SIAM Journal on Numerical Analysis. 1976;13.
26. Hoffman SL, Collaborators. Integrating scRNA-seq and scATAC-seq data; 2021. Available from: https://satijalab.org/seurat/articles/atacseq_integration_vignette.html.
27. 10X. PBMC from a healthy donor - granulocytes removed through cell sorting (10k); 2020. Available from: https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k.
28. Cusanovich DA, Hill AJ, Aghamirzaie D, Daza RM, Pliner HA, Berletch JB, et al. A Single-Cell Atlas of In Vivo Mammalian Chromatin Accessibility. Cell. 2018;174. doi: 10.1016/j.cell.2018.06.052
29. Kingma DP, Welling M. Auto-encoding variational bayes. 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings. 2014.
30. Gu J, Wang X, Halakivi-Clarke L, Clarke R, Xuan J. BADGE: A novel Bayesian model for accurate abundance quantification and differential analysis of RNA-Seq data. BMC Bioinformatics. 2014;15. doi: 10.1186/1471-2105-15-S9-S6
31. Townes FW, Hicks SC, Aryee MJ, Irizarry RA. Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biology. 2019;20. doi: 10.1186/s13059-019-1861-6
32. Lun ATL, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research. 2016;5. doi: 10.12688/f1000research.9501.2
33. Svensson V. Droplet scRNA-seq is not zero-inflated. Nature Biotechnology. 2020;38. doi: 10.1038/s41587-019-0379-5
34. Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP, et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature. 2015;523. doi: 10.1038/nature14590
35. Liu H, Zhang Z, Grimm KJ. Comparison of Inverse Wishart and Separation-Strategy Priors for Bayesian Estimation of Covariance Parameter Matrix in Growth Curve Analysis. Structural Equation Modeling. 2016;23. doi: 10.1080/10705511.2015.1057285
36. Barnard J, McCulloch R, Meng XL. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica. 2000;10.
37. Hoffman MD, Gelman A. The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research. 2014;15.
38. Duane S, Kennedy AD, Pendleton BJ, Roweth D. Hybrid Monte Carlo. Physics Letters B. 1987;195:216-22. Available from: https://www.sciencedirect.com/science/article/pii/037026938791197X.
39. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science. 1992;7. doi: 10.1214/ss/1177011136
40. Bochkina N, Richardson S. Tail posterior probability for inference in pairwise and multiclass gene expression data. Biometrics. 2007;63. doi: 10.1111/j.1541-0420.2007.00807.x
41. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological). 1995;57.
42. Spiegelhalter DJ, Best NG, Carlin BP, Linde AVD. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society Series B: Statistical Methodology. 2002;64. doi: 10.1111/1467-9868.00353

PLoS Comput Biol. 2022 Jun 21;18(6):e1010163. doi: 10.1371/journal.pcbi.1010163.r001

Author response to previous submission

3 Sep 2021

Attachment

Submitted filename: 117169_1_rebuttal_2136573_qwwftc.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010163.r002

Decision Letter 0

Jingyi Jessica Li, Sushmita Roy

27 Sep 2021

Dear Dr Maniatis,

Thank you very much for submitting your manuscript "SCRaPL: hierarchical Bayesian modelling of associations in single cell multi-omics data" (PCOMPBIOL-D-21-01603) for consideration at PLOS Computational Biology. As with all papers, your manuscript was reviewed by members of the editorial board. Based on our initial assessment, we regret that we will not be pursuing this manuscript for publication at PLOS Computational Biology.

We found that the manuscript would require a significant amount of revision to reach the quality of formal submission. The current issues include the inconsistent notations, widespread typos (e.g., Figure 4 and its caption are hardly comprehensible), and insufficient real data evidence. We would like to see another real dataset where the proposed method shows significant advances. In addition to the Pearson correlation, comparison with existing single-cell methods such as scLink is also necessary to show the advantage of the proposed method. If you find these comments addressable, please submit your revised manuscript as a new submission. Please also fully address the comments of the three ReviewCommons reviewers.

We are sorry that we cannot be more positive on this occasion. We very much appreciate your wish to present your work in one of PLOS's Open Access publications. Thank you for your support, and we hope that you will consider PLOS Computational Biology for other submissions in the future.

Sincerely,

Jingyi Jessica Li

Guest Editor

PLOS Computational Biology

Sushmita Roy

Deputy Editor

PLOS Computational Biology

PLoS Comput Biol. 2022 Jun 21;18(6):e1010163. doi: 10.1371/journal.pcbi.1010163.r003

Author response to Decision Letter 0

3 Jan 2022

Attachment

Submitted filename: Response_to_reviewers.pdf

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010163.r004

Decision Letter 1

Jingyi Jessica Li, Sushmita Roy

9 Feb 2022

Dear Dr Maniatis,

Thank you very much for submitting your manuscript "SCRaPL: hierarchical Bayesian modelling of associations in single cell multi-omics data" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Jingyi Jessica Li

Guest Editor

PLOS Computational Biology

Sushmita Roy

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this work, the authors developed a method to identify region-gene association based on single-cell multi-omics data. The method is based on a Bayesian hierarchical model, which uses zero-inflated Poisson model with logit link function for the modalities of gene expression and chromatin accessibility, and binomial model with probit link for the modality of methylation. Overall, the method and study look solid. This is also an important problem in single-cell multi-omics. This could be a nice addition to the literature. I think Spearman correlation is more commonly used under this setting, because of its robustness to outliers. It will be good to also include Spearman correlation in the comparison.

Reviewer #2: Maniatis et al. propose a method named SCRaPL to investigate the correlation in multi-omics single-cell datasets. The method is novel in its usage of a Bayesian hierarchical model to infer associations between different omics components. If the claim that the method has higher power and a good control on false positives can be better supported in analysis, this will be a potentially useful method in single-cell studies.

Writing:

Since a significant amount of description is included in the supplementary file, the authors need to improve the clarity of the main text to help readers navigate between the manuscript and the supplementary file. I found myself spending a lot of time searching for explanations in the supplementary file.

Methods:

It is not clear to me what’s the meaning of Y. From formula (3), it seems that it should represent raw counts. However, the supplementary methods mention that the RNA data is normalized during preprocessing. Please clarify.

The authors need to explain how the data (except for gene expression) is binarized in order to use the Binomial distribution. Would the binarization cutoff significantly impact the final results?

Is there any justification for the usage of the probit link function in formula (4)?

Key derivation steps to obtain the posterior distribution are not given. The distribution should be added to the Methods, and the key steps should be included at least in the supplementary file.

No software package is available for others to use the method.

Results:

In the experiments with synthetic data, (1) what’s the definition of “gene coverage”? (2) I would suggest moving the plots of true and inferred correlations to the main manuscript. (3) The Methods section describes the approach to identify statistically significant correlations using SCRaPL. Can the authors show the accuracy of this method on these datasets?

In the analysis of mESC data, “a dataset with 9480 features and 679 cells” was used. This number is much smaller than the possible number of features. How many genes or DNAm features are included in these 9480 features? How would it affect the performance of SCRaPL if less stringent filtering were applied and more features were included? Similar questions apply to the mEBC data.

Can the authors also show the comparison between SCRaPL and Pearson’s correlation (power and false positive rate) using the synthetic data?

The last Results section presents SCRaPL as a data denoising method, and performs Seurat integration with and without SCRaPL’s preprocessing. (1) From Figure 4, it is not clear to me that SCRaPL’s preprocessing improves the analysis. Can the authors provide some quantitative comparisons? (2) A more detailed description needs to be provided in Methods. With SCRaPL’s preprocessing, what data is provided as the input into Seurat? (3) Since the procedure involves sampling from posterior distributions, how different are the integration results if the data are sampled multiple times?

Reviewer #3: It seems that the authors have largely addressed the previous reviewers' comments. However, the authors need to check that every single comment has been replied to. For example, I don't see a response to the first comment of the first reviewer. Also, the figure legends need to be improved to discuss each of the subplots. Such descriptions are lacking for figures 2 and 4.

For the software package on GitHub, I don't see any instructions about how to use the software or how to reproduce the results in the paper. This needs to be significantly improved.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology, see http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. 2022 Jun 21;18(6):e1010163. doi: 10.1371/journal.pcbi.1010163.r005

Author response to Decision Letter 1

12 Apr 2022

Attachment

Submitted filename: Response to reviewers.pdf

(PDF, 118.3 KB)

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010163.r006

Decision Letter 2

Jingyi Jessica Li, Sushmita Roy

2 May 2022

Dear Dr Maniatis,

We are pleased to inform you that your manuscript 'SCRaPL: A Bayesian hierarchical framework for detecting technical associates in single cell multiomics data' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Jingyi Jessica Li

Guest Editor

PLOS Computational Biology

Sushmita Roy

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: All my previous comments have been addressed.

Reviewer #2: The revised manuscript has addressed all my questions.

Reviewer #3: The authors have addressed all my concerns.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: None

Reviewer #2: None

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010163.r007

Acceptance letter

Jingyi Jessica Li, Sushmita Roy

13 Jun 2022

PCOMPBIOL-D-21-01603R2

SCRaPL: A Bayesian hierarchical framework for detecting technical associates in single cell multiomics data

Dear Dr Maniatis,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Text. Data preprocessing.

(PDF, 78.3 KB)

S2 Text. Creating negative control datasets.

In this section we describe, step by step, the generation of negative control data and explain how we address problems such as missing coverage.

(PDF, 120.2 KB)

S3 Text. Synthetic data.

(PDF, 1.2 MB)

S4 Text. Negative control experiments.

In this section we lay out the detection rate comparisons between SCRaPL and Pearson correlation for methylation/expression of mESC, accessibility/expression of mESC and accessibility/expression of bESC.

(PDF, 798.9 KB)

S5 Text. Choosing between correlation priors.

In this section we present a data-driven approach for choosing prior hyper-parameters.

(PDF, 657.3 KB)

S6 Text. Extended comparisons between SCRaPL, Pearson and Spearman predictions.

In this section we provide an extended comparison between SCRaPL, Pearson and Spearman predictions.

(PDF, 1.8 MB)

S7 Text. Gene set enrichment analysis.

The complete gene set enrichment analysis using DAVID is provided.

(PDF, 1.5 MB)

S8 Text. Connecting SCRaPL error model to likelihoods currently employed by practitioners.

We demonstrate that SCRaPL’s expression likelihood serves as a valid alternative, as it exhibits the over-dispersion property that practitioners seek.

(PDF, 148.8 KB)

S9 Text. Null hypothesis testing.

We give a thorough description of the hypothesis testing done to identify regions with strong regulatory action.

(PDF, 150.9 KB)

S10 Text. Efficiency analysis.

We provide evidence of SCRaPL’s scaling as a function of problem size.

(PDF, 68.8 KB)

S11 Text. Macroscopic analysis of SCRaPL behavior compared to Pearson and Spearman Correlation.

(PDF, 1 MB)

S12 Text. Stability of denoising.

Figures summarizing scRNA and scATAC integration using posterior latent space samples.

(PDF, 2.1 MB)

Attachment

Submitted filename: 117169_1_rebuttal_2136573_qwwftc.pdf

(PDF, 139.5 KB)

Attachment

Submitted filename: Response_to_reviewers.pdf

(PDF, 98 KB)

Attachment

Submitted filename: Response to reviewers.pdf

(PDF, 118.3 KB)

Data Availability Statement

