. 2023 Dec 4;21(1):28–31. doi: 10.1038/s41592-023-02112-6

Modeling fragment counts improves single-cell ATAC-seq analysis

Laura D Martens 1,2,3, David S Fischer 2,4, Vicente A Yépez 1, Fabian J Theis 1,2,3,4, Julien Gagneur 1,2,3,5
PMCID: PMC10776385  PMID: 38049697

Abstract

Single-cell ATAC sequencing coverage in regulatory regions is typically binarized as an indicator of open chromatin. Here we show that binarization is an unnecessary step that neither improves goodness of fit, clustering, cell type identification nor batch integration. Fragment counts, but not read counts, should instead be modeled, which preserves quantitative regulatory information. These results have immediate implications for single-cell ATAC sequencing analysis.

Subject terms: Machine learning, Computational models


This paper proposes quantitative modeling of single-cell ATAC-seq data, which improves various downstream analyses.

Main

Single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq)1 is a major method employed to study chromatin regulation2. It employs Tn5 transposase to insert sequencing adaptors into accessible genome regions, resulting in reads representing Tn5 insertions in individual cells1 (Fig. 1a,b). When analyzing scATAC-seq data, open chromatin regions are generally identified on the pooled data as peaks, which are genomic regions with a significant excess of reads compared to the background1,3,4. Alternative approaches define the feature set as genomic windows or bins5,6 (Supplementary Table 1). Subsequently, the reads overlapping each feature are counted for each cell, yielding a typically very sparse matrix with less than 10% non-zero counts7.

Fig. 1. scATAC-seq data are quantitative and fragments, rather than reads, should be counted.

a, Illustrated is the scATAC-seq protocol and count aggregation strategy. Tn5 transposases insert into open chromatin regions, cut the DNA and attach sequencing adaptors (blue and red). Two Tn5 insertions create one fragment with adaptors. The orientation of the insertion is important as only fragments flanked with two distinct barcodes can be captured and amplified. Fragments are sequenced paired-end and aligned to the genome. scATAC-seq peak calling is performed using reads from multiple cells. Once peak regions are identified, reads (deduplicated fragment ends) or fragments overlapping the peak region are counted for each cell separately. b, Genome viewer snapshot of one peak region in the NeurIPS dataset at the promoter of the human gene RERE showing multiple insertions in a single cell. The tracks show, from top to bottom, the coverage of one batch used for peak calling, the aligned read pairs of a single cell, the peak region and genome annotation. The peak region overlaps with five reads and three fragments. c, Read count distribution on the entire NeurIPS dataset. The striking odd/even pattern in read count distribution reflects that reads come in pairs and suggests that fragment counts, rather than reads, should be modeled. Pie chart showing the percentage of all non-zero peaks with one, two or more than two reads (inset). d, Distribution of the approximated fragment count does not show an even/odd pattern. e, Variance of read counts across cells against mean read counts. Each dot represents one peak region. When fragment ends (reads) are counted, the variance of read counts is about twice the mean (gray dotted line), which is not consistent with a Poisson distribution (solid gray line). f, Same as e, but for fragment counts. The variance of fragment counts is approximately equal to the fragment count mean, consistent with a Poisson distribution (solid gray line).

Machine-learning modeling of scATAC-seq data supports investigations of single-cell genome regulation, including identification of cell types and differentially accessible regions and inference of transcription factor activity. The loss function and data representation are crucial determinants of a model’s predictive power. Many methods default to binarizing the count matrix due to overall data sparsity and the conceptualization of chromatin accessibility as a binary state5–10 (Supplementary Table 1). While some approaches handle the data quantitatively3,11,12, there exists no systematic evaluation of the impact of binarization.

Here, we compare binarization versus count-based modeling on scATAC-seq data modeling tasks and assess the quality of the learnt latent space using multiple downstream evaluations. We based our analysis on four publicly available datasets representing different protocols, species and tissues13–16 (Supplementary Table 1; Methods). First, we considered the proportion of peaks above the typical binarization threshold of one read. Across all datasets, over 65% of non-zero peaks had more than one read count (Fig. 1c and Extended Data Fig. 1). In the NeurIPS dataset, for instance, 74% of non-zero peaks had counts of two, with 12% having even higher counts. We furthermore saw a fivefold increase in peaks with even compared to odd counts (Fig. 1c). This pattern can be explained as an artifact of the count aggregation strategy used in the 10x Genomics CellRanger ATAC pipeline4, which counts reads (deduplicated fragment ends) instead of fragments (Fig. 1a). As scATAC-seq generates paired-end reads, even counts are predominant, whereas odd counts only occur when one read of a pair falls outside the peak region (Fig. 1a,b). In contrast, fragment counts showed a regular monotonic decay (Fig. 1d and Extended Data Fig. 1; Methods). Many methods rely on the read count matrices generated by the 10x pipeline or adopt the same counting strategy3,5–10,17 (Supplementary Table 1); however, no benchmark has compared the read and fragment count strategies.

Extended Data Fig. 1. Comparison of read and fragment counts.

a, b) Read count (a) and fragment count (b) distribution on the Satpathy dataset14. c, d) Read count (c) and fragment count (d) distribution of the sci-ATAC-seq3 dataset16. Plotted is a 10% random subset as the dataset consists of ~700 K cells. e) Fragment count distribution on the fly dataset15. CellRanger ATAC read counts were unavailable for this dataset as we generated fragment counts directly with Signac. f, g) Pie chart showing the percentage of all non-zero peaks with 1, 2, or more than 2 reads for the Satpathy dataset (f), sciATAC-seq3 dataset (10% random subset) (g). h) Pie chart with the percentage of all non-zero peaks with one or more than one fragment for the fly dataset (read counts are not available for this dataset). i, j) Variance of read counts across cells against mean read counts for the Satpathy dataset (i) and sciATAC-seq3 dataset (j). Each dot represents one peak region. When fragment ends (reads) are counted, the variance of read counts is around twice the mean (gray dotted line), which is not consistent with a Poisson distribution (solid gray line). k, l, m) Same as (i, j), but for fragment counts.

The alternating pattern of odd and even read counts does not align with standard statistical count distributions, such as the Poisson. We found that the variance of read counts for each region across cells was approximately twice the mean (Fig. 1e and Extended Data Fig. 1), violating the Poisson assumption of equal mean and variance. In contrast, the mean-variance relationship of fragment counts was broadly consistent with a Poisson distribution across the four datasets (Fig. 1f and Extended Data Fig. 1).
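The factor of two follows from simple scaling: if fragment counts F are Poisson with rate λ and both ends of every fragment fall inside the peak, the read count is 2F, so its mean is 2λ and its variance 4λ. The following is a minimal simulation sketch of that idealized case, not the analysis code used for the figures; the rate of 0.5, the number of cells and the helper names are illustrative assumptions.

```python
import math
import random
from statistics import fmean, pvariance


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def variance_to_mean(counts):
    """Ratio of population variance to mean (close to 1 for Poisson data)."""
    return pvariance(counts) / fmean(counts)


rng = random.Random(0)
# Hypothetical peak: fragment counts across 20,000 cells with rate 0.5.
fragments = [sample_poisson(0.5, rng) for _ in range(20_000)]
# Idealized read counting: both ends of every fragment lie inside the peak.
reads = [2 * f for f in fragments]

fragment_ratio = variance_to_mean(fragments)
read_ratio = variance_to_mean(reads)
print(f"variance/mean  fragments: {fragment_ratio:.2f}  reads: {read_ratio:.2f}")
```

In real data some fragment ends fall outside the peak, so the observed ratio for read counts is only approximately two.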

Altogether, these results have two implications. First, scATAC-data carries information beyond binary accessibility. Second, fragment counts, but not read counts, can be more suitably modeled with the Poisson distribution.

To assess how modeling fragment counts, rather than binarized signals, affects latent space learning, we adapted the PeakVI model, a state-of-the-art variational autoencoder (VAE) for scATAC-data9. Originally designed for binarized data, PeakVI learns the probability that a peak in each cell is accessible, while accounting for cell-specific effects and region biases through learnt factors. We modified PeakVI’s last layer to instead model Poisson-distributed fragment counts (Poisson VAE; Methods). As the total number of fragments per cell varies drastically across cells (Extended Data Fig. 2a), we incorporated the total fragment count as a precomputed offset in the loss instead of learning a cell-specific factor. Similarly, we tested the effect of including the precomputed offset in the binary case (Binary VAE; Methods).

Extended Data Fig. 2. Fragment count distribution and performance evaluation with excluded high counts and downsampled data.

a) Average fragment count distribution per peak for all four datasets. The sci-ATAC-seq3 dataset is 50% sparser than the 10x datasets. b) Average precision of the Poisson VAE and the Binary VAE model on the NeurIPS13 dataset for all cell-peaks and only the subset of cell-peaks with less than ten counts. c) Average precision for the Poisson VAE and Binary VAE model at different downsampling thresholds. P values were computed using the two-sided paired t-test. In boxplots, the central line denotes the median, boxes represent the interquartile range (IQR), and whiskers show the distribution except for outliers. Outliers are all points outside 1.5 times the IQR.

We first evaluated model performance across the four datasets by benchmarking the models on predicting the presence of at least one read, the standard binarization threshold. For binary models, we used the predicted probability of a region being open, while for quantitative models, we converted predictions into the probability of having a count exceeding zero (Methods). There was no benefit from using binarized data in the 10x datasets as Poisson VAE significantly outperformed PeakVI and Binary VAE in reconstructing binarized counts (Fig. 2a). Notably, a substantial performance gain was achieved by controlling for the observed rather than predicted total fragment counts as the binary model (Binary VAE) also showed significantly better reconstruction than PeakVI. We further verified that the performance improvement was not a result of disproportionately giving more weight to regions with high counts (Extended Data Fig. 2b). In contrast, the sparser sci-ATAC-seq3 dataset (median peak fragment count 0.017 versus 0.036 in the 10x datasets; Extended Data Fig. 2a and Supplementary Table 1) did not benefit from using quantitative information or the observed total fragment count. Downsampling of the NeurIPS dataset confirmed that the advantages of the quantitative model increased with a higher total fragment count (Extended Data Fig. 2c).

Fig. 2. Binarizing scATAC-seq data is unnecessary and hides quantitative information.

a, Comparison of the Poisson VAE, Binary VAE and PeakVI models on reconstructing the binarized cell-peak matrix of the NeurIPS, the Satpathy, the Fly and the sci-ATAC-seq3 datasets for ten cross-validation (CV) runs. Poisson VAE and Binary VAE use the observed total fragment count. The horizontal line denotes the median. P values were computed using a two-sided paired Wilcoxon test and Benjamini–Hochberg corrected. **P = 0.0019, *P = 0.0195, NS, not significant, P = 0.0695. b, Uniform Manifold Approximation and Projection (UMAP) of the integrated latent space of all NeurIPS batches, colored by cell type for the Poisson VAE model. The isolated label ID2-hi myeloid progenitors and the erythrocyte lineage are annotated. UMAPs for all other methods and datasets are in Extended Data Figs. 5–8. c, Enrichment (odds ratio, one-sided Fisher exact test) of distal regulatory elements, super-enhancers in bone marrow, promoters of highly expressed genes and promoters of highly variable genes in the scATAC-seq peaks of the NeurIPS dataset. Peaks are sorted by the fraction of counts above the binarization threshold and grouped according to different quantiles. *P < 0.0001. d, Correlation of expression of the SLC4A1 gene and fragment counts in its promoter. The two-sided Spearman correlation analysis was computed on cells with at least one fragment count in the promoter (n = 775). The P values were adjusted for multiple testing using the Benjamini–Hochberg correction. We restricted the plot to cells of similar total fragment count (0.25–0.75 quantile) to not capture effects driven by total fragment count. e–g, Log-normalized gene expression over normalized accessibility of the SLC4A1 gene for the Poisson VAE (e), Binary VAE model (f) and cisTopic model (g). Cell type separation is measured with the silhouette width and area under the receiver operating characteristic (ROC) curve and is better with the Poisson VAE model.
In all boxplots, the central line denotes the median, boxes represent the interquartile range (IQR) and whiskers show the distribution except for outliers. Outliers are all points outside 1.5 × IQR. AUC, area under the curve. B, B cell; T, T cell; Mono, Monocyte; prog, progenitor; HSC, Hematopoietic stem cell; ILC, Innate lymphoid cell; Lymph, Lymphoid; MK/E, Megakaryocyte and Erythrocyte; G/M, Granulocyte and Myeloid; NK, Natural Killer cell; cDC2, Classical dendritic cell type 2; pDCs, Plasmacytoid dendritic cells.

We also evaluated the learnt latent representations using several integration metrics divided into two categories, batch integration and bioconservation18. In addition to the three VAE models, we compared the embedding techniques of three widely used methods (Supplementary Table 1): latent semantic indexing (LSI; Signac3 and ArchR5); latent Dirichlet allocation (cisTopic8) and SCALE10, a deep generative model. While binary methods performed reasonably well across the datasets, there was no apparent benefit in utilizing binarized data (Extended Data Figs. 3, 4a and 5–8). cisTopic, Signac and SCALE are not explicitly designed for batch correction and may consequently exhibit lower scores in batch correction metrics (Supplementary Table 1). Batch correction can matter, as demonstrated by the successful integration of the Kenyon cell subtype (KC-g) in the Fly dataset (Extended Data Fig. 7) achieved by Poisson VAE, Binary VAE and PeakVI, which explicitly account for batch effects. Nevertheless, our observation that binarization offered no clear benefit remained consistent across different weightings of bioconservation and batch correction metrics (Extended Data Fig. 4b).

Extended Data Fig. 3. Full integration metrics per dataset.

Comparison of integration accuracy for Poisson VAE, Binary VAE, PeakVI9, Signac3 using LSI, cisTopic8 using LDA and SCALE10 on (a) the NeurIPS, (b) the Satpathy (c) the fly and (d) the sci-ATAC-seq3 datasets. For cisTopic and Signac, additional batch correction was performed using Harmony28. Metrics are categorized into batch correction and bioconservation categories. Reported is the mean over ten cross-validation runs. Overall scores were computed using a 40:60-weighted mean of batch correction and bioconservation scores.

Extended Data Fig. 4. Overall score of integration including different weightings of bioconservation and batch correction.

a) Comparison of integration accuracy for embeddings generated with Poisson VAE, Binary VAE, PeakVI, Signac, cisTopic, and SCALE on the four datasets. For cisTopic and Signac, additional batch correction was performed using Harmony. Overall integration accuracy scores were computed using a 40:60-weighted mean of batch correction and bioconservation scores. P values were computed using the two-sided paired Wilcoxon test; Benjamini–Hochberg corrected. Error bars represent the 95% confidence interval over ten cross-validation runs. b) Overall score computed from different bioconservation and batch correction weightings.

Extended Data Fig. 5. UMAPs of integrated latent space for the NeurIPS dataset.

UMAP of the integrated latent space of the NeurIPS dataset using the Poisson VAE, Binary VAE, PeakVI, Signac using LSI, cisTopic using LDA, and SCALE model. Cells are colored by cell type (top row) and batch (bottom row). For cisTopic and Signac, additional batch correction was performed using Harmony.

Extended Data Fig. 8. UMAPs of integrated latent space for the sci-ATAC-seq3 dataset.

UMAP of the integrated latent space of the sciATAC-seq3 dataset using the Poisson VAE, Binary VAE, PeakVI, Signac using LSI, cisTopic using LDA, and SCALE model. Cells are colored by cell type (top row) and batch (bottom row). For cisTopic and Signac, additional batch correction was performed using Harmony.

Extended Data Fig. 7. UMAPs of integrated latent space for the Fly dataset.

UMAP of the integrated latent space of the fly dataset using the Poisson VAE, Binary VAE, PeakVI, Signac using LSI, cisTopic using LDA, and SCALE model. Cells are colored by cell type (top row) and batch (bottom row). For cisTopic and Signac, additional batch correction was performed using Harmony.

Beyond the lack of advantage in using binarized data, preserving quantitative information can enhance cell representation. For instance, Poisson VAE better recovered the rare cell type ID2-hi myeloid progenitors in the NeurIPS dataset (Supplementary Table 1), as indicated by the improved isolated label F1 score (Fig. 2b and Extended Data Figs. 3 and 5).

We further investigated the biological signal represented by quantitative data to understand effects that could be captured in the Poisson VAE. We first examined high-count peaks and found they tend to be broader (Extended Data Fig. 9a) and enriched for promoter regions of highly expressed genes, highly variable genes and super-enhancers (Fig. 2c; Methods). Conversely, low-count peaks were associated with distal enhancer elements, consistent with previous bulk observations highlighting the accessibility differences between active transcription start sites (TSSs) and enhancers2. Next, we examined whether increased TSS accessibility correlated with higher gene expression using the NeurIPS dataset, focusing on cells with at least one fragment in the promoter region. We observed a significant correlation (i.e., Spearman correlation P < 0.05) between promoter accessibility and gene expression in 481 out of 3,879 genes (12.4%, 2.5-times higher than expected, binomial test P < 0.05), in agreement with a recent preprint19. To illustrate, we considered cell type markers among the top 20 highest correlated genes (Extended Data Fig. 9b), including SLC4A1, a gene involved in the red blood cell lineage20 (Spearman correlation 0.12, P = 0.001; Fig. 2b,d). Similarly, we found a significant correlation for genes involved in other biological lineages (Extended Data Fig. 9c–e). We tested whether the Poisson VAE model can capture this quantitative accessibility signal and enhance cell type discrimination in these promoter regions. Indeed, the normalized accessibility from Poisson VAE showed improved cell type separation compared to cisTopic and Binary VAE in three out of four cases (Fig. 2e–g and Extended Data Fig. 10; Methods).

Extended Data Fig. 9. Peak length distribution and correlation of gene expression with chromatin accessibility counts for selected marker genes.

a) Peak length distribution for peaks in the top 0.05 quantile (n = 5727) and bottom 0–0.95 quantile (n = 110,760) according to the fraction of counts above the binarization threshold. High-count peaks are significantly longer. The P value was computed using a two-sided Wilcoxon test. b) Expression of genes (rows) associated with each cell type (columns). CR1L is involved in the red blood cell lineage34 (Proerythroblast, Erythroblast, Normoblast). CD74 is expressed in antigen-presenting cells and is known to regulate mature B-cell survival35. MAFB is a transcription factor that represses erythrocyte programs in myeloid cells36. Correlation of gene expression and fragment counts in the promoter of the (c) CD74 gene (n = 7000), (d) CR1L gene (n = 1917), and (e) MAFB gene (n = 1845). The two-sided Spearman correlation analysis was computed on fragment counts greater than 0. P values were adjusted for multiple testing using the Benjamini–Hochberg correction. We restricted the plot to cells of similar total fragment count (0.25–0.75 quantile) to avoid capturing effects driven by total fragment count. We see a quantitative signal in promoter accessibility that would be lost by binarization. In all boxplots, the central line denotes the median, boxes represent the interquartile range (IQR), and whiskers show the distribution except for outliers. Outliers are all points outside 1.5 times the IQR.

Extended Data Fig. 10. Cell type separation on promoters of marker genes.

a, b, c) Log-normalized gene expression against normalized accessibility for the Poisson VAE (top row), Binary VAE model (middle row), and cisTopic model (bottom row) for the (a) CD74 gene, (b) CR1L gene, and (c) MAFB gene. Cell type separation is measured with the silhouette width and area under the ROC curve and is better with the Poisson VAE model for CR1L and MAFB and second for CD74. d) Multiple biological factors make DNA accessibility in single cells quantitative rather than binary. They include a diploid genome, density of chromatin packaging, nucleosome spacing, TFs in a peak region preventing the Tn5 from binding, and sequence preferences of Tn5.

In conclusion, we found that scATAC-seq binarization is unnecessary and results in a loss of useful information. What makes scATAC-seq quantitative? Chromatin accessibility is highly dynamic and nucleosome turnover rates are of the same order of magnitude as the scATAC-seq incubation duration1,21. Furthermore, transcription factors, not unlike transposase, must diffuse through the nucleus to access DNA, potentially reaching distinct chromosome territories and compartments with various efficiencies (Extended Data Fig. 10d). Also, a single genomic position in diploid cells may not be simultaneously open or closed on both alleles. Our observations indicate that scATAC-seq fragment counts capture this continuum of chromatin accessibility19. Even though the advantage of quantitative modeling is diminished for very sparse datasets, treating scATAC-seq data quantitatively is more general than binarization and it matters for studying highly expressed and highly variable genes, including important marker genes. These findings have immediate practical implications as using a Poisson loss instead of a binary loss incurs no additional computational cost. Future directions include investigating other typically binarized settings, such as scChIP-seq22, and alternative count distributions such as the negative binomial.

Methods

Input data and preprocessing

NeurIPS dataset

The multiome hematopoiesis dataset from the NeurIPS 2021 challenge13 was downloaded from the AWS bucket s3://openproblems-bio/public/. We did not perform any additional filtering of the data. scATAC-seq BAM files were downloaded from the Gene Expression Omnibus (GEO) under accession code GSE194122.

Satpathy dataset

The second hematopoiesis dataset14 was downloaded from GEO (accession code GSE129785). Specifically, we used the processed count matrix and metadata files scATAC-Hematopoiesis-All.cell-barcodes.txt.gz, scATAC-Hematopoiesis-All.mtx.gz and scATAC-Hematopoiesis-All.peaks.txt.gz. We then filtered the peaks to only those that were detected in at least 1% of the cells in the sample, reducing the data from 571,400 to 134,104 peaks.

Fly dataset

Raw fragment files for chromatin accessibility of the fly brain15 were downloaded from GEO (accession code GSE163697). Additionally, peak regions, cell barcodes and cell metadata were extracted from the cisTopic object AllTimepoints_cisTopic.Rds, which was downloaded from flybrain.aertslab.org. Fragments were counted per peak region using the Signac function FeatureMatrix. We then filtered the peaks to be detected in at least 1% of all cells. Furthermore, we excluded cells labeled unknown (CellType_lvl1 equal to ‘unk’ or ‘-’).

sci-ATAC-seq3 dataset

Count matrices and metadata were downloaded from GEO (accession code GSE149683)16. Peaks were filtered to be accessible in at least 1% of all cells.

Fragment computation

The standard 10x protocol for generating the cell-peaks matrix is to count the fragment ends (reads). To estimate fragment counts, we rounded all uneven counts to the next highest even number and halved the resulting read counts.
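The rounding rule above amounts to a ceiling division by two. A minimal sketch on a toy list of per-peak read counts (the function name is illustrative, not part of any pipeline):

```python
def estimate_fragment_counts(read_counts):
    """Approximate fragment counts from per-peak read (fragment-end) counts.

    Odd counts are rounded up to the next even number and then halved,
    which is the same as integer ceiling division by two.
    """
    return [(r + 1) // 2 for r in read_counts]


reads = [0, 1, 2, 3, 4, 5]
fragments = estimate_fragment_counts(reads)
print(fragments)  # [0, 1, 1, 2, 2, 3]
```

An odd read count is treated as a fragment with one end outside the peak, so it still contributes one full fragment.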

Poisson VAE model

Let $X_{N \times P}$ be a fragment count matrix consisting of $N$ cells and $P$ peak regions. We model the counts $x_{cp}$ with a variational autoencoder:

$$
\begin{aligned}
z_c &\sim \mathrm{Normal}\big(f_\mu(x_c),\, f_\sigma(x_c)\big) \\
\rho_{cp} &= g_p(z_c, s_c) \\
w_{cp} &= \mathrm{softmax}(\rho_{cp} + r_p) \\
\lambda_{cp} &= \exp(l_c)\, w_{cp} \\
x_{cp} &\sim \mathrm{Poisson}(\lambda_{cp})
\end{aligned}
$$

The neural networks $f_\mu, f_\sigma$ encode the parameters of a multivariate normal random variable from which $z_c$ is drawn. $g_p$ is a neural network that maps the latent representation $z_c$ concatenated to the batch annotation $s_c$ back to the dimension of peaks. $r_p$ captures a region-specific bias such as the mean fragment count or peak length and is learned directly. $l_c$ refers to the log-transformed total fragment counts per cell, $l_c = \log\left(\sum_p x_{cp}\right)$. $w_{cp}$ is constrained to encode how the $\exp(l_c)$ fragments of a cell are distributed on average over all peaks by using a softmax activation in the last layer. This means that $\sum_p w_{cp} = 1$.
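To make the last-layer computation concrete, the following is a minimal pure-Python sketch for a single cell; it is not the scvi-tools implementation. The values standing in for the decoder output and the region bias are made up, and the function names are illustrative assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def poisson_rates(rho, region_bias, total_fragments):
    """Rates lambda_cp = exp(l_c) * softmax_p(rho_cp + r_p).

    exp(l_c) is the observed total fragment count of the cell, so the
    rates of one cell sum to that total.
    """
    weights = softmax([r + b for r, b in zip(rho, region_bias)])
    return [total_fragments * w for w in weights]


def poisson_nll(counts, rates):
    """Negative Poisson log-likelihood summed over peaks."""
    return sum(
        lam - x * math.log(lam) + math.lgamma(x + 1)
        for x, lam in zip(counts, rates)
    )


# Hypothetical cell with four peaks; rho_c and r_p stand in for network outputs.
x_c = [2, 0, 1, 0]
rho_c = [1.2, -0.5, 0.3, -1.0]
r_p = [0.1, 0.0, -0.2, 0.0]

lam_c = poisson_rates(rho_c, r_p, total_fragments=sum(x_c))
loss = poisson_nll(x_c, lam_c)
print(lam_c, loss)
```

Because the offset is precomputed from the data, the network only has to learn how a cell's fragments are distributed across peaks, not how many there are in total.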

Binary VAE model

The Binary VAE model models binarized counts:

$$
y_{cp} = \begin{cases} 0 & \text{if } x_{cp} = 0 \\ 1 & \text{if } x_{cp} > 0 \end{cases}
$$

The binarized signal was modeled as follows:

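$$
\begin{aligned}
z_c &\sim \mathrm{Normal}\big(f_\mu(y_c),\, f_\sigma(y_c)\big) \\
\rho_{cp} &= g_p(z_c, s_c) \\
\theta_{cp} &= \sigma(\rho_{cp} + r_p + \tilde{l}_c) \\
y_{cp} &\sim \mathrm{Ber}(\theta_{cp})
\end{aligned}
$$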

We included the proportion of non-zeros by modeling:

$$
\tilde{l}_c = \sigma^{-1}\!\left(\frac{1}{P}\sum_p y_{cp}\right)
$$

Here, $\sigma^{-1}$ is the logit function. This way $\theta_{cp}$ is equal to the mean accessibility of the cell for $\rho_{cp} = r_p = 0$.
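The property stated in the last sentence can be checked numerically. A minimal sketch on a made-up cell with five peaks (function names are illustrative):

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def logit(p):
    return math.log(p / (1.0 - p))


def binarize(counts):
    """y_cp = 1 if the peak has at least one fragment, else 0."""
    return [1 if x > 0 else 0 for x in counts]


def accessibility_offset(y):
    """Logit of the fraction of non-zero peaks in one cell.

    Undefined when the fraction is exactly 0 or 1; such cells would
    need clipping, which is not shown here.
    """
    return logit(sum(y) / len(y))


x_c = [2, 0, 1, 0, 0]
y_c = binarize(x_c)
offset = accessibility_offset(y_c)

# With rho_cp = r_p = 0 the predicted probability equals the mean accessibility.
theta_at_zero = sigmoid(0.0 + 0.0 + offset)
print(y_c, theta_at_zero)
```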

Encoder and decoder functions

The functions $f_\mu, f_\sigma$ and the function $g_p$ are encoder and decoder functions, respectively. To be as comparable as possible to PeakVI as implemented in scvi-tools9,23 (v.0.20.3), we used the same architecture. Specifically, these networks consisted of two repeated blocks of fully connected neural networks with a fixed number of hidden dimensions set to the square root of the number of input dimensions, a dropout layer, a layer-norm layer and leakyReLU activation. The last layer in the encoder maps to a defined number of latent dimensions $n_{\text{latent}}$.

Training procedure

We used the default PeakVI training procedure with a learning rate of 0.0001, weight decay of 0.001 and minibatch size of 128 and used early stopping on the validation reconstruction loss. We used a random training, validation and test set of 80%, 10% and 10%, respectively. This was repeated ten times. We computed all evaluation metrics on the left-out test cells.
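A minimal sketch of the repeated random 80/10/10 split over cell indices. Using the run index as the random seed and a dataset of 1,000 cells are assumptions made for illustration; the seeding actually used is not stated here.

```python
import random


def split_cells(n_cells, seed, fractions=(0.8, 0.1, 0.1)):
    """Randomly partition cell indices into training, validation and test sets."""
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_cells)
    n_val = round(fractions[1] * n_cells)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remaining cells
    return train, val, test


# Ten repetitions, each with its own (hypothetical) seed.
splits = [split_cells(1000, seed=run) for run in range(10)]
for train, val, test in splits:
    print(len(train), len(val), len(test))
```

Evaluation metrics would then be computed on the `test` indices of each repetition only.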

Hyperparameter optimization

All models were run using the default PeakVI parameters. For the reconstruction task, we optimized the number of latent dimensions $n_{\text{latent}}$ on the validation set for each dataset and model on reconstructing the binary accessibility matrix as measured by average precision. The range tested was from 10 to 100 in increments of 10.

Benchmarking methods

cisTopic

We used the Python implementation of cisTopic, pycisTopic8,24 (v.1.0.3.dev2+g45b7e66.d20230426). cisTopic objects were created from the binarized count matrices. We then modeled the topics using the Mallet algorithm on 10 to 100 topics in steps of 10. We selected the optimal topic number using the suggested model selection metrics Minmo_201125 and log-likelihood26. Finally, dimensionality reduction was performed on the cell-topic matrix, optionally after first running Harmony27 (harmonypy, v.0.0.9) to reduce batch effects.

SCALE

We used the provided Python script on github.com/jsxlei/SCALE to run SCALE10 on the binarized count matrix. We set the number of clusters to the number of cell types in the dataset.

For visualization, a two-dimensional UMAP28 (umap-learn, v.0.5.3) of the integrated latent space was generated based on the 15-nearest-neighbor graph. The cross-validation run with the best reconstruction was used.

Signac

Count matrices were loaded into ChromatinAssays using Signac3 (v.1.9.0) and Seurat29 (v.4.3.0) without additional filtering (min.cells = min.features = 0). We then computed the LSI embedding using the default procedure (RunTFIDF followed by RunSVD). We removed components that correlated with the total fragment count by more than 0.5. To investigate the effect of batch normalization, we created a batch-normalized LSI embedding by running RunHarmony with the respective batch variable as input.

Evaluation

Reconstruction metrics

The reconstruction metrics were calculated on the binarized matrix. Poisson rate parameters $\lambda_{cp}$ were transformed to a Bernoulli probability $\theta_{cp}$ by computing the probability of getting one or more fragments in a peak for a given cell:

$$
\theta_{cp} = P(x_{cp} > 0 \mid \lambda_{cp}) = 1 - P(x_{cp} = 0 \mid \lambda_{cp}) = 1 - e^{-\lambda_{cp}}
$$

Average precision

As our reconstruction task is highly imbalanced (only a small fraction of all peaks are accessible), we used the average precision score as implemented in scikit-learn (v.1.2.2) to evaluate the reconstruction. Average precision estimates the area under the precision-recall curve.
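The two steps of this evaluation, converting Poisson rates into probabilities of a non-zero count and scoring them with average precision, can be sketched as follows. This is a pure-Python illustration on made-up values that assumes no tied scores; the evaluation itself used the scikit-learn implementation.

```python
import math


def prob_nonzero(lam):
    """P(x > 0 | lambda) = 1 - exp(-lambda); expm1 keeps precision for small rates."""
    return -math.expm1(-lam)


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive label.

    Entries are ranked by decreasing score. Assumes distinct scores and
    at least one positive label.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# Hypothetical predicted rates and observed binarized counts for four cell-peak pairs.
rates = [2.0, 1.0, 0.5, 0.05]
observed = [1, 0, 1, 0]

theta = [prob_nonzero(lam) for lam in rates]
ap = average_precision(observed, theta)
print(theta, ap)
```

Because the transformation from rate to probability is monotonic, ranking by probability or by rate gives the same average precision.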

Integration metrics

We used the scib18 (v.1.1.3) implementation for computing the integration metrics on the latent embedding of the cells. We used all available metrics using default parameters but excluded metrics that were specifically developed for single-cell RNA sequencing datasets (highly variable genes score and cell cycle score) and kBET due to its long run time. The trajectory score was only run for the NeurIPS dataset, which had a precomputed ATAC trajectory. Scib categorizes the metrics into metrics that measure batch correction and biology conservation.

Bioconservation comprises the following metrics that are applied to predefined cell-type labels that each dataset provided:

Normalized mutual information

This measures the consistency of two clusterings. Here, we compare how well a clustering on the integrated embedding agrees with predefined cell-type labels. For optimal clustering, the scib package runs Louvain clustering at resolutions ranging from 0.1 to 2 in steps of 0.1.

Adjusted Rand index

This is a different metric to compare the clusterings with the predefined cell-type labels.

Label silhouette width

This measures the within-cluster distance of cells compared to the distance to the closest neighboring cluster. A value close to 1 indicates a high separation between clusters. We used the predefined cell labels to define clusters for the label silhouette width calculation.

Graph cLISI

This measures the separation of the kNN graph. It evaluates the likelihood of observing the same cell-type label in the nearest neighbors, indicating good cell-type separation.

Isolated label metrics

The isolated labels are defined as the cell types present in the fewest number of batches (Supplementary Table 1). Two metrics evaluate how well isolated labels separate from other cell types. The F1 score is the harmonic mean of precision and recall. The isolated label silhouette measures the average silhouette width (ASW) of the isolated label compared to all non-isolated labels.

Trajectory conservation

This computes the correlation of inferred pseudotime ordering before and after integration.

Four metrics measure different levels of batch integration:

Principal component regression

This measures the amount of variance of the principal components of the embedded space that can be explained by the batch variables before and after integration.

Graph connectivity

This measures whether the kNN graph of the embedding connects all cells that have the same cell-type label. If there are strong batch effects, this will not be the case.

Graph iLISI

This measures the mixture of the kNN graph. It evaluates the likelihood of observing different batch labels in the nearest neighbors, indicating a good batch mixing.

Batch silhouette width

This is a metric similar to the label silhouette width but applied to batch labels. To ensure that higher scores represent better mixing, the silhouette metric is subtracted from 1. The ASW is computed separately for each cell label to assess the mixing within cells of the same label. Finally, the individual ASW scores for each cell label are averaged to obtain an overall measure of batch mixing.
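
The procedure described above can be sketched in plain Python as follows. Two details are assumptions of this example: the absolute value of the silhouette is taken before subtracting from 1 (as we understand scib to do), and cell labels that occur in only one batch are skipped. Function names and the Euclidean distance on coordinate tuples are likewise illustrative.

```python
import math
from collections import defaultdict


def silhouette_values(points, groups):
    """Per-point silhouette width using Euclidean distance."""
    values = []
    for i, p in enumerate(points):
        dist_sum = defaultdict(float)
        count = defaultdict(int)
        for j, q in enumerate(points):
            if i != j:
                dist_sum[groups[j]] += math.dist(p, q)
                count[groups[j]] += 1
        own = groups[i]
        others = [dist_sum[g] / count[g] for g in count if g != own]
        if count[own] == 0 or not others:
            values.append(0.0)  # singleton group or single group: defined as 0
            continue
        a = dist_sum[own] / count[own]  # mean distance within own group
        b = min(others)                 # mean distance to the closest other group
        values.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return values


def batch_asw(points, batches, cell_labels):
    """Average of 1 - |silhouette| on batch labels, computed per cell label."""
    by_label = defaultdict(list)
    for i, lab in enumerate(cell_labels):
        by_label[lab].append(i)

    per_label = []
    for idx in by_label.values():
        sub_batches = [batches[i] for i in idx]
        if len(set(sub_batches)) < 2:
            continue  # batch mixing is undefined with a single batch
        sil = silhouette_values([points[i] for i in idx], sub_batches)
        per_label.append(sum(1 - abs(s) for s in sil) / len(sil))
    if not per_label:
        raise ValueError("no cell label is present in more than one batch")
    return sum(per_label) / len(per_label)
```

With this convention, a cell label whose batches occupy clearly separate regions of the embedding contributes a score near 0, and a label whose batches overlap contributes a higher score.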

Enrichment analysis

Enrichment analysis was performed with respect to four sets of regulatory elements: distal enhancers, super-enhancers, highly expressed genes and highly variable genes.

Annotations for distal enhancers in the hg38 genome assembly were downloaded from ENCODE Registry of CREs (v.3, screen.encodeproject.org)30. They were then subset to distal cCREs with enhancer-like signatures (dELS) and CTCF-bound cCREs with enhancer-like signatures (CTCF-bound, dELS).

Super-enhancers were downloaded from SEdb 2.0 (www.licpathway.net/sedb/)31. Only bone marrow samples were included.

Highly expressed genes were computed using the preprocessed single-cell RNA sequencing data from the NeurIPS dataset. They were defined as the top 2,000 genes ranked by mean expression across all cells.

Highly variable genes were computed with scanpy32 (v.1.9.2) using Seurat-based highly variable gene selection with default parameter settings.

We filtered annotations to overlap with at least one peak of the NeurIPS dataset. Region overlap was determined using the pyRanges package (v.0.0.124). Odds ratios and significance were computed using the Fisher exact test implemented in scipy (v.1.10.1) and corrected for multiple testing with Benjamini–Hochberg at a false discovery rate of 0.05.
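
To make the statistical step concrete, here is a dependency-free sketch of a two-sided Fisher exact test on a 2 × 2 table followed by Benjamini–Hochberg adjustment. The analysis itself used scipy; the functions below are illustrative stand-ins with names chosen for this example, and the table layout (peaks of interest versus other peaks, overlapping versus not overlapping an annotation) is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Odds ratio and two-sided P value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))

    if b * c == 0:
        odds_ratio = float("inf") if a * d > 0 else float("nan")
    else:
        odds_ratio = (a * d) / (b * c)
    return odds_ratio, min(1.0, p_value)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value downwards
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

An annotation set would then be called enriched when its adjusted P value is at most 0.05 and its odds ratio exceeds 1.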

Correlation with gene expression analysis

We used the peak annotation of CellRanger ATAC to subset high-count peaks to promoter regions. CellRanger annotates a peak as a promoter if it overlaps with the promoter region (−1,000 bp, +100 bp) of any transcription start site4. Then, we computed the Spearman correlation between a cell’s fragment count in the promoter peaks and the gene expression count using scipy, taking only cells with a fragment count >1 into account. Because this correlation can be driven by cells with a high total fragment count, we restricted the computation to cells whose total fragment count lay between the 0.25 and 0.75 quantiles.
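
The filtering and correlation steps can be sketched as below for a single promoter peak and its gene. This is a pure-Python stand-in for the scipy call used in the analysis; the function names, the per-peak interpretation of the fragment count >1 filter and the inclusive quantile method are assumptions of the example.

```python
import statistics


def average_ranks(values):
    """Ranks starting at 1, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def promoter_expression_correlation(fragments, expression, total_fragments):
    """Correlate promoter fragment counts with expression in depth-matched cells."""
    q25, _, q75 = statistics.quantiles(total_fragments, n=4, method="inclusive")
    keep = [
        i for i in range(len(fragments))
        if fragments[i] > 1 and q25 <= total_fragments[i] <= q75
    ]
    return spearman([fragments[i] for i in keep], [expression[i] for i in keep])
```

Restricting to the middle half of the total fragment count distribution removes the cells in which a high promoter count could simply reflect a deeply sequenced cell.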

Normalized accessibility

We can use the learned latent space and generative model of Poisson VAE and Binary VAE to produce denoised and normalized estimates of accessibility, controlling for sequencing depth23. To this end, we defined the normalized accessibility of the model output using the median total fragment count across all cells. For cisTopic, we used the imputed and normalized accessibility scores.
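
As a rough illustration of depth normalization, suppose the model yields, for each cell, a depth-independent estimate of the fraction of fragments expected in each peak. Scaling these fractions by the median total fragment count then places every cell at the same reference depth. The function name, the input format and the assumption that the model output takes this fractional form are all hypothetical and not taken from the model code.

```python
import statistics


def normalized_accessibility(counts, peak_fractions):
    """Scale per-cell peak fractions to the median sequencing depth.

    counts:         observed cell x peak fragment counts (list of rows)
    peak_fractions: hypothetical model output, one row per cell,
                    giving the expected fraction of fragments per peak
    """
    totals = [sum(row) for row in counts]
    reference_depth = statistics.median(totals)  # shared depth for all cells
    return [[reference_depth * f for f in row] for row in peak_fractions]


# Three toy cells with total fragment counts of 2, 10 and 20.
counts = [[1, 1], [4, 6], [10, 10]]
fractions = [[0.5, 0.5], [0.4, 0.6], [0.5, 0.5]]
normalized = normalized_accessibility(counts, fractions)
```

In the toy example the first and third cells receive identical normalized values although their observed depths differ tenfold, which is the intended effect of controlling for sequencing depth.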

We compared the normalized accessibility of the models by computing the cell type separation using the silhouette width and ROC AUC.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Online content

Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41592-023-02112-6.

Supplementary information

Reporting Summary (1.3MB, pdf)
Supplementary Table 1 (19.4KB, xlsx)

Description of the datasets and detailed information on scATAC-seq methods including their counting and binarization strategy.

Acknowledgements

We thank I. L. Ibarra, F. Curion, A. Karollus and P. T. da Silva for feedback on the manuscript. L.D.M. acknowledges support by the Helmholtz Association under the joint research school Munich School for Data Science and J.G. acknowledges the Deutsche Forschungsgemeinschaft (SFB/TransRegio TRR267, Project-ID 403584255). F.J.T. acknowledges support by the Helmholtz Association’s Initiative and Networking Fund through Helmholtz AI (ZT-I-PF-5-01) and the European Union (DeepCell 101054957). The views and opinions expressed are those of the authors and do not necessarily reflect those of the European Union or the European Research Council. Neither the European Union nor the granting authority can be held responsible for them. Figure 1a is adapted from the ‘ATAC Sequencing’ template by BioRender.com (2022) and Extended Data Figure 10d is adapted from ‘Regulation of Transcription in Eukaryotic Cells’, retrieved from app.biorender.com/biorender-templates.

Extended data

Extended Data Fig. 6. UMAPs of integrated latent space for the Satpathy dataset.

UMAP of the integrated latent space of the Satpathy dataset using the Poisson VAE, Binary VAE, PeakVI, Signac using LSI, cisTopic using LDA, and SCALE model. Cells are colored by cell type (top row) and batch (bottom row). For cisTopic and Signac, additional batch correction was performed using Harmony.

Author contributions

L.D.M. conducted the analysis and implemented the models. J.G. and F.J.T. conceived and supervised the project with the help of D.S.F. and V.A.Y. All authors wrote and contributed to the manuscript. The authors read and approved the final manuscript.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Funding

Open access funding provided by Helmholtz Zentrum München – Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH).

Data availability

Raw published data for the NeurIPS, Satpathy, the Fly and the sci-ATAC-seq3 datasets are available from the GEO under accession codes GSE194122, GSE129785, GSE163697 and GSE149683, respectively. Annotations for distal enhancers in the hg38 genome assembly were downloaded from ENCODE Registry of CREs (v.3, screen.encodeproject.org). Super-enhancers were downloaded from SEdb v.2.0 (www.licpathway.net/sedb/).

Code availability

All models, code and notebooks to reproduce our analysis and figures, as well as a tutorial notebook to use the Poisson VAE model, are available at github.com/theislab/scatac_poisson_reproducibility. The code has additionally been archived and is available on Zenodo at 10.5281/zenodo.8356171 (ref. 33). The Poisson VAE model is available as an extension of the scvi-tools suite at github.com/lauradmartens/scvi-tools.

Competing interests

F.J.T. consults for Immunai Inc., Singularity Bio B.V., CytoReason Ltd, Cellarity and has ownership interest in Dermagnostix GmbH and Cellarity. The remaining authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Fabian J. Theis, Email: fabian.theis@helmholtz-munich.de

Julien Gagneur, Email: gagneur@in.tum.de

Extended data

Extended data is available for this paper at 10.1038/s41592-023-02112-6.

Supplementary information

The online version contains supplementary material available at 10.1038/s41592-023-02112-6.

References

  • 1. Buenrostro JD, et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature. 2015;523:486–490. doi: 10.1038/nature14590.
  • 2. Klemm SL, Shipony Z, Greenleaf WJ. Chromatin accessibility and the regulatory epigenome. Nat. Rev. Genet. 2019;20:207–220. doi: 10.1038/s41576-018-0089-8.
  • 3. Stuart T, Srivastava A, Madad S, Lareau CA, Satija R. Single-cell chromatin state analysis with Signac. Nat. Methods. 2021;18:1333–1341. doi: 10.1038/s41592-021-01282-5.
  • 4. 10x Genomics. CellRanger ATAC Algorithms Overview. support.10xgenomics.com/single-cell-atac/software/pipelines/latest/algorithms/overview
  • 5. Granja JM, et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 2021;53:403–411. doi: 10.1038/s41588-021-00790-6.
  • 6. Fang R, et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 2021;12:1337. doi: 10.1038/s41467-021-21583-9.
  • 7. Li Z, et al. Chromatin-accessibility estimation from single-cell ATAC-seq data with scOpen. Nat. Commun. 2021;12:6386. doi: 10.1038/s41467-021-26530-2.
  • 8. Bravo González-Blas C, et al. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat. Methods. 2019;16:397–400. doi: 10.1038/s41592-019-0367-1.
  • 9. Ashuach T, Reidenbach DA, Gayoso A, Yosef N. PeakVI: a deep generative model for single-cell chromatin accessibility analysis. Cell Rep. Methods. 2022;2:100182. doi: 10.1016/j.crmeth.2022.100182.
  • 10. Xiong L, et al. SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat. Commun. 2019;10:4576. doi: 10.1038/s41467-019-12630-7.
  • 11. Schep AN, Wu B, Buenrostro JD, Greenleaf WJ. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods. 2017;14:975–978. doi: 10.1038/nmeth.4401.
  • 12. Ji Z, Zhou W, Hou W, Ji H. Single-cell ATAC-seq signal extraction and enhancement with SCATE. Genome Biol. 2020;21:161. doi: 10.1186/s13059-020-02075-3.
  • 13. Luecken, M. D. et al. A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (2021).
  • 14. Satpathy AT, et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 2019;37:925–936. doi: 10.1038/s41587-019-0206-z.
  • 15. Janssens J, et al. Decoding gene regulation in the fly brain. Nature. 2022;601:630–636. doi: 10.1038/s41586-021-04262-z.
  • 16. Domcke S, et al. A human cell atlas of fetal chromatin accessibility. Science. 2020;370:eaba7612. doi: 10.1126/science.aba7612.
  • 17. Bredikhin D, Kats I, Stegle O. MUON: multimodal omics analysis framework. Genome Biol. 2022;23:42. doi: 10.1186/s13059-021-02577-8.
  • 18. Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods. 2022;19:41–50. doi: 10.1038/s41592-021-01336-8.
  • 19. Miao, Z. & Kim, J. Is single nucleus ATAC-seq accessibility a qualitative or quantitative measurement? Preprint at bioRxiv 10.1101/2022.04.20.488960 (2022).
  • 20. Reithmeier RAF, et al. Band 3, the human red cell chloride/bicarbonate anion exchanger (AE1, SLC4A1), in a structural context. Biochim. Biophys. Acta Biomembr. 2016;1858:1507–1532. doi: 10.1016/j.bbamem.2016.03.030.
  • 21. Deal RB, Henikoff JG, Henikoff S. Genome-wide kinetics of nucleosome turnover determined by metabolic labeling of histones. Science. 2010;328:1161–1164. doi: 10.1126/science.1186777.
  • 22. Rotem A, et al. Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state. Nat. Biotechnol. 2015;33:1165–1172. doi: 10.1038/nbt.3383.
  • 23. Gayoso A, et al. A Python library for probabilistic analysis of single-cell omics data. Nat. Biotechnol. 2022;40:163–166. doi: 10.1038/s41587-021-01206-w.
  • 24. Bravo González-Blas C, et al. SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks. Nat. Methods. 2023;20:1355–1367. doi: 10.1038/s41592-023-01938-4.
  • 25. Mimno, D., Wallach, H. M., Talley, E., Leenders, M. & McCallum, A. Optimizing semantic coherence in topic models. In Proc. 2011 Conference on Empirical Methods in Natural Language Processing 262–272 (Association for Computational Linguistics, 2011).
  • 26. Griffiths TL, Steyvers M. Finding scientific topics. Proc. Natl Acad. Sci. USA. 2004;101:5228–5235. doi: 10.1073/pnas.0307752101.
  • 27. Korsunsky I, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods. 2019;16:1289–1296. doi: 10.1038/s41592-019-0619-0.
  • 28. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for dimension reduction. Preprint at arXiv 10.48550/arXiv.1802.03426 (2020).
  • 29. Hao Y, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184:3573–3587. doi: 10.1016/j.cell.2021.04.048.
  • 30. Moore JE, et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature. 2020;583:699–710. doi: 10.1038/s41586-020-2493-4.
  • 31. Jiang Y, et al. SEdb: a comprehensive human super-enhancer database. Nucleic Acids Res. 2019;47:D235–D243. doi: 10.1093/nar/gky1025.
  • 32. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15. doi: 10.1186/s13059-017-1382-0.
  • 33. Martens, L. D. et al. Analysis code used in publication. Zenodo 10.5281/zenodo.8356171 (2023).
  • 34. Miwa T, Zhou L, Hilliard B, Molina H, Song W-C. Crry, but not CD59 and DAF, is indispensable for murine erythrocyte protection in vivo from spontaneous complement attack. Blood. 2002;99:3707–3716. doi: 10.1182/blood.V99.10.3707.
  • 35. Lapter S, et al. A role for the B-cell CD74/macrophage migration inhibitory factor pathway in the immunomodulation of systemic lupus erythematosus by a therapeutic tolerogenic peptide. Immunology. 2011;132:87–95. doi: 10.1111/j.1365-2567.2010.03342.x.
  • 36. Blank V, Andrews NC. The Maf transcription factors: regulators of differentiation. Trends Biochem. Sci. 1997;22:437–441. doi: 10.1016/S0968-0004(97)01105-5.


Articles from Nature Methods are provided here courtesy of Nature Publishing Group
