Abstract
In this work, we extend Chromap, an ultrafast method for single-cell ATAC-seq data alignment, to directly report peak-based quality control (QC) metrics, such as the fraction of reads in peaks, without calling peaks. Recent single-cell ATAC-seq analysis methods like SnapATAC2 use genome-interval-based features for data analysis, which precludes filtering low-quality cells with common peak-based QC metrics. We show that Chromap’s QC metrics capture additional low-quality cells missed by SnapATAC2 and improve downstream analysis results without sacrificing computational efficiency.
Keywords: Single-cell ATAC-seq, quality control, sketch
Background
Single-cell ATAC-seq (scATAC-seq) profiles the heterogeneity of chromatin accessibility across thousands of individual cells rather than providing a single, bulk view. The common practice of analyzing ATAC-seq data with features defined by bulk peaks may miss accessible regions that are specific to rare cell populations. This has inspired the development of highly efficient methods, such as SnapATAC [1,2] and ArchR [3], that use genome-wide tiles of fixed-size intervals as features. Avoiding peak calling, however, makes it impossible to generate several important quality control (QC) metrics, like the fraction of reads in peak regions (FRIP) score and the number of peaks [4]. Rare cell populations still share many accessible regions with other cell populations, e.g., regions near housekeeping genes, so peak-based QC metrics are applicable to them too. Chromap [5] is a method for ultrafast read alignment and preprocessing of scATAC-seq data, and it utilizes a sketch, which we refer to as the cache, to reuse alignment information for reads from peak regions and save running time. In this work, we augment this cache structure to report peak-based QC metrics that are related to the FRIP score and the number of peaks for each cell (Figure 1, Methods). In particular, the FRIP score for a cell is predicted by a linear regression model with predictors like the fraction of reads hitting the cache. Cache occupancy is the number of distinct cache slots that reads from the cell hit, reflecting the number of peaks for a cell. We show that these peak-based QC metrics capture additional low-quality cells missed by SnapATAC2 [2].
Figure 1. Overview of Chromap’s method for estimating FRIP and cache occupancy.

Each read is queried against the candidate cache to speed up the alignment. Specifically, if its sequence of minimizers is present in the cache, the possible alignment positions will be returned, and this hit history is recorded. Chromap utilizes four features (Fraction of Reads hit Cache (FRIC), duplicate reads, unmapped reads, and low-MAPQ reads) in a multi-variate linear model to estimate the FRIP. This model is trained using a random subset of 1,000 barcodes from the 10k Human PBMC dataset, where the true FRIP values are computed using MACS3 and BEDTools. The cache occupancy of a cell is the number of distinct cache slots that reads from the cell have hit.
Results and Discussion
We first applied SnapATAC2 v2.7.0 to analyze the alignment file generated by Chromap v0.3.0 on a human PBMC scATAC-seq dataset from 10x Genomics with 10k cells (Figure 2a). After applying SnapATAC2’s QC filters (Supplementary Table 1, Supplementary Figure 1a), including transcription start site enrichment (TSSE) score and doublet detection, we observed that cluster 13 had noticeably lower estimated FRIP scores and cache occupancy than other clusters (Figure 2b,c, Supplementary Figure 1b). To validate Chromap’s QC metrics, we calculated the true FRIP score (defined in Methods) for each cell and observed a strong correlation with the estimated FRIP score (Pearson r=0.887, Figure 2d, Supplementary Figure 1c). A low number of peaks in a cell may be due to issues like inefficient Tn5 transposition, while an extremely high number may suggest the cell is a “union” of multiple cells (i.e., a doublet). Using the doublets called by SnapATAC2 as a reference, we found that cache occupancy was predictive of doublet status (Figure 2e, Supplementary Figure 1d). We then filtered out the Chromap-identified low-quality cells that were missed by SnapATAC2, where thresholds for Chromap’s QC metrics were adaptively determined (Supplementary Figure 1e,f). Among the 125 filtered cells, 72 were from cluster 13. The removed cells tended to have lower true FRIP values and TSSE scores than the retained cells (Supplementary Figure 1g). After re-clustering, the remaining cluster 13 cells were absorbed into the original cluster 0 (Supplementary Figure 1h,i). Since there were no differentially accessible regions (DARs) between the original cluster 13 and cluster 0 cells (Supplementary Figure 1j), we inferred that the clusters were originally separated due to noise, and the merged result after Chromap’s QC filtering was more reasonable. Even with more stringent SnapATAC2 filters, Chromap’s QC could still identify low-quality cells and yielded cleaner clusters (Supplementary Note 1).
Figure 2. Analysis of Chromap’s quality metrics on scATAC-seq datasets (10k Human PBMC and 8k Mouse Cortex).

(a) UMAP clustering of the 10k Human PBMC dataset. Overlaying of (b) Chromap’s estimated FRIP and (c) Chromap’s cache occupancy metrics over the UMAP of the 10k Human PBMC dataset. (d) Correlation between the true FRIP using peak calling of MACS3 and the estimated FRIP value output by Chromap. (e) ROC curve showing classification power of the total number of fragments and the cache occupancy to identify doublets. Figures (f-j) follow the same pattern as (a-e) but correspond to the 8k Mouse Cortex dataset.
We next applied Chromap and SnapATAC2 to another 10x Genomics scATAC-seq data set from the mouse cortex with 8k cells (Figure 2f, Supplementary Figure 2a). We observed a trend similar to that in the human data set, where Chromap’s QC metrics identified additional low-quality cells (Figure 2g-j, Supplementary Figure 2b), especially in cluster 11 (Supplementary Figure 2c). Using Chromap’s QC metrics, we filtered out 237 cells, including 13% of cluster 11 cells, the largest fraction among clusters (Supplementary Figure 2d-g). After re-clustering, 63% of the original cluster 11 remained in the same cluster, named new cluster 13 (Supplementary Figure 2h). As expected, since new cluster 13 was mostly a subset of cluster 11, SnapATAC2 discovered 15% fewer accessible regions in the new cluster 13 than in the original cluster 11 (Supplementary Figure 2i). Nevertheless, the new cluster 13 contained 1,368 more DARs than cluster 11, and thus a higher fraction of accessible regions were DARs, suggesting that Chromap’s QC filtering could increase the overall signal-to-noise ratio in terms of identifying DARs. We conducted pathway enrichment analysis for the genes containing DARs of cluster 11 and the new cluster 13, respectively, using clusterProfiler [6]. DAR-related genes in both scenarios were mostly enriched in the axonogenesis pathway (Supplementary Figure 2j). The DAR-related genes unique to the new cluster 13 were enriched in the Wnt signaling pathway (Supplementary Figure 2k), consistent with the pathway’s crucial role in axon development [7,8].
Although the TSSE score is regarded as more important than peak-based QC metrics for ATAC-seq data analysis [4], our evaluations demonstrate that peak-based QC metrics help remove additional low-quality cells missed by SnapATAC2 and improve downstream analysis. More importantly, peak-based QC metrics are crucial for data types like ChIP-seq [4] and for less-studied organisms, where gene annotations for TSSE score estimation may be incomplete. Therefore, future work is needed to integrate Chromap with methods like SnapATAC2 and to extend the approach to other single-cell data platforms, such as single-cell ChIP-seq [9], and to more organisms.
Conclusions
The new version of Chromap can output approximate peak-based QC metrics without sacrificing computational efficiency (Supplementary Table 2). This will complement genome-interval-based scATAC-seq data analysis methods like SnapATAC2, saving the extra effort of calling pseudo-bulk peaks. Furthermore, Chromap’s QC metrics allow these analysis methods to be applied to data where QC with the TSSE score is suboptimal.
Methods
Chromap’s cache was originally designed to speed up alignments. Each read is digested into a sequence of minimizers [10], meaning that we subsample the read’s k-mers according to a random hash function and store them in a vector that preserves their original order. This minimizer vector is then used as a key to access the cache slot (m_1 + m_M) % N, where m_i is the value of the i-th minimizer in the vector, M is the size of the vector, and N is the cache size. If the minimizer vector is present in that slot, the list of potential alignment locations (candidates) is returned without querying the genome index for each minimizer, which speeds up the alignment process. When deciding which minimizer vector and associated alignment candidates to keep in the cache, each slot holds the minimizer vector that is substantially more frequent than all the other minimizer vectors mapped to the same slot. The frequency is based on the reads that have been processed so far in a streaming fashion (Supplementary Note 2).
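The slot computation described above can be illustrated with a minimal Python sketch. This is not Chromap's implementation: the hash function (`hash_kmer`), the k-mer and window sizes, the list-based cache layout, and the candidate position are all assumptions made for illustration, and the frequency-based update policy of Supplementary Note 2 is omitted.

```python
import hashlib


def hash_kmer(kmer):
    """Hypothetical 64-bit k-mer hash (a stand-in for the aligner's hash function)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizer_vector(read, k=5, w=4):
    """Hashed minimizers of a read, kept in their original order.

    k and w are illustrative values, not Chromap defaults.
    """
    hashes = [hash_kmer(read[i:i + k]) for i in range(len(read) - k + 1)]
    minimizers, last_pos = [], -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        pos = start + window.index(min(window))  # leftmost minimum in the window
        if pos != last_pos:  # record each selected k-mer once
            minimizers.append(hashes[pos])
            last_pos = pos
    return minimizers


def cache_slot(minimizers, cache_size):
    """Slot index (m_1 + m_M) % N from the first and last minimizer values."""
    return (minimizers[0] + minimizers[-1]) % cache_size


def lookup(cache, minimizers):
    """Return cached candidates only if the slot stores this exact minimizer vector."""
    if not minimizers:
        return None
    entry = cache[cache_slot(minimizers, len(cache))]
    if entry is not None and entry[0] == tuple(minimizers):
        return entry[1]
    return None


# Usage: store one (made-up) candidate position and query it back.
cache = [None] * 1024
mins = minimizer_vector("ACGTTGCAGGCTAACGTTAGC")
cache[cache_slot(mins, len(cache))] = (tuple(mins), [("chr1", 12345)])
print(lookup(cache, mins))
```

Comparing the full minimizer vector on lookup matters because distinct vectors can share a slot; only the first and last minimizer values enter the index.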
We can utilize the cache to estimate the FRIP score for each cell. Since peak regions are areas with high read coverage, minimizer vectors of reads in peak regions are likely to be stored in Chromap’s cache. Therefore, if we keep track of how often a cell’s reads hit or miss the cache (Figure 1), we can estimate FRIP. Specifically, we implement a multi-variate linear model that uses statistics collected within Chromap, including fraction of reads hit cache (FRIC), duplicate reads, unmapped reads, and low MAPQ reads (default good MAPQ threshold is 30), to estimate FRIP for each cell in the dataset (Figure 1). To obtain the true FRIP values for training, we took a random subset of 1,000 SnapATAC2-QCed cells from the human PBMC scATAC-seq dataset and computed the true FRIP values by calling peaks with MACS3 [11] and finding overlaps with BEDTools [12]. The linear model was fit to the logit-transformed FRIP values, and therefore, when Chromap outputs the estimated FRIP to the summary file, it applies an inverse logit transformation. This ensures the output values are strictly within the 0 to 1 range, since FRIP is a fraction by definition. The coefficients fit from these 1,000 cells were all significant (Supplementary Table 3), and FRIC had the smallest p-value, 1.49e-246, among the four variables. These coefficient values were used for the human and mouse scATAC-seq data analysis.
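The logit-scale regression described above can be sketched in a few lines of Python with NumPy. The sketch below is illustrative only: the data are synthetic, the coefficients in `synthetic_coef` are made up and are not the values in Supplementary Table 3, and treating the four predictors as per-cell fractions is an assumption.

```python
import numpy as np


def logit(p):
    """Map fractions in (0, 1) to the real line."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))


def inverse_logit(x):
    """Map real values back to (0, 1)."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-x))


def fit_frip_model(features, true_frip):
    """Least-squares fit of logit(FRIP) on per-cell features plus an intercept."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, logit(true_frip), rcond=None)
    return coef


def predict_frip(coef, features):
    """Estimated FRIP: linear predictor followed by the inverse logit."""
    X = np.column_stack([np.ones(len(features)), features])
    return inverse_logit(X @ coef)


# Synthetic example. Columns are assumed to be per-cell fractions:
# FRIC, duplicate reads, unmapped reads, low-MAPQ reads.
rng = np.random.default_rng(0)
features = rng.uniform(0.05, 0.95, size=(200, 4))
synthetic_coef = np.array([-1.0, 3.0, -0.5, -2.0, -1.5])  # made-up values
true_frip = inverse_logit(
    np.column_stack([np.ones(len(features)), features]) @ synthetic_coef
)

coef = fit_frip_model(features, true_frip)
estimated = predict_frip(coef, features)
```

Fitting on the logit scale and transforming back guarantees that every estimate lies strictly between 0 and 1, whatever the feature values are.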
The other Chromap QC metric is called cache occupancy, which is the number of unique cache slots occupied by reads from a particular cell. This metric is based on the observation that each cache slot in Chromap’s cache typically corresponds to a unique region in the genome. Therefore, if we keep track of all the unique slots that a particular cell is mapping to, it would reflect the number of peak regions in that cell. To compute this metric in practice, Chromap uses a k-MinHash sketch [13] to estimate the cardinality of the set of cache slot indices for each cell. We observed that using k=250 is sufficient for accurately estimating the cache occupancy with a low average deviation (Supplementary Figure 3).
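As an illustration of how a k-minimum-values sketch can estimate the number of distinct cache slots hit by a cell, consider the following Python sketch. It assumes the standard estimator (k - 1) / h_k, where h_k is the k-th smallest hash value normalized to (0, 1]; the class name `KMinHashSketch`, the hash function, and this choice of estimator are assumptions and may differ from what Chromap does internally.

```python
import hashlib
import heapq


class KMinHashSketch:
    """Keeps the k smallest distinct hash values seen so far."""

    def __init__(self, k=250):
        self.k = k
        self._heap = []        # negated hashes, so heap[0] is the largest kept value
        self._members = set()  # kept hash values, used to skip repeats

    @staticmethod
    def _hash01(item):
        """Hypothetical hash of an item to a float in (0, 1]."""
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        return (int.from_bytes(digest, "big") + 1) / (2**64 + 1)

    def add(self, slot_index):
        h = self._hash01(slot_index)
        if h in self._members:
            return  # this slot is already represented
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the current k-th smallest value with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def cardinality(self):
        """Exact count while fewer than k values are kept, else (k - 1) / h_k."""
        if len(self._heap) < self.k:
            return float(len(self._heap))
        return (self.k - 1) / -self._heap[0]


# Usage: one sketch per cell barcode, fed with the slot index of every cache hit.
small = KMinHashSketch(k=250)
for slot in list(range(100)) * 3:  # 100 distinct slots, each hit three times
    small.add(slot)

large = KMinHashSketch(k=250)
for slot in range(5000):
    large.add(slot)
```

Each cell then needs only k stored values regardless of how many slots it hits, and the expected relative error shrinks roughly as 1/sqrt(k).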
Supplementary Material
Acknowledgements
This work was carried out at the Advanced Research Computing at Hopkins (ARCH) core facility (rockfish.jhu.edu), which is supported by the National Science Foundation (NSF) grant number OAC 1920103.
Funding
This work is supported by the NIH grants P20GM130454 (Dartmouth), 3P20GM130454-05WS (Dartmouth), R01HG011392 (B.L.), and R35GM139602 (B.L.).
Availability of code and data
Chromap is available at https://github.com/haowenz/chromap, and we used version 0.3.0 in the evaluations. The code for the experiments is at https://github.com/oma219/chromapQC-exps. The 10k human PBMC scATAC-seq dataset is available as 10k Human PBMCs, ATAC v2, Chromium X (https://www.10xgenomics.com/datasets/10k-human-pbmcs-atac-v2-chromium-x-2-standard) on the 10x Genomics website. The 8k mouse Cortex scATAC-seq dataset is available as 8k Adult Mouse Cortex Cells, ATAC v2, Chromium X (https://www.10xgenomics.com/datasets/8k-adult-mouse-cortex-cells-atac-v2-chromium-x-2-standard) on the 10x Genomics website.
References
1. Fang R, Preissl S, Li Y, Hou X, Lucero J, Wang X, et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat Commun. 2021;12:1337.
2. Zhang K, Zemke NR, Armand EJ, Ren B. A fast, scalable and versatile tool for analysis of single-cell omics data. Nat Methods. 2024;21:217–27.
3. Granja JM, Corces MR, Pierce SE, Bagdatli ST, Choudhry H, Chang HY, et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet. 2021;53:403–11.
4. Hitz BC, Lee J-W, Jolanki O, Kagda MS, Graham K, Sud P, et al. The ENCODE Uniform Analysis Pipelines [Internet]. bioRxiv; 2023. [cited 2025 Jun 8]. p. 2023.04.04.535623. Available from: https://www.biorxiv.org/content/10.1101/2023.04.04.535623v1
5. Zhang H, Song L, Wang X, Cheng H, Wang C, Meyer CA, et al. Fast alignment and preprocessing of chromatin profiles with Chromap. Nat Commun. 2021;12:6566.
6. Yu G, Wang L-G, Han Y, He Q-Y. clusterProfiler: an R Package for Comparing Biological Themes Among Gene Clusters. OMICS. 2012;16:284–7.
7. He C-W, Liao C-P, Pan C-L. Wnt signalling in the development of axon, dendrites and synapses. Open Biol. 2018;8:180116.
8. Stanganello E, Zahavi EE, Burute M, Smits J, Jordens I, Maurice MM, et al. Wnt Signaling Directs Neuronal Polarity and Axonal Growth. iScience. 2019;13:318–27.
9. Rotem A, Ram O, Shoresh N, Sperling RA, Goren A, Weitz DA, et al. Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state. Nat Biotechnol. 2015;33:1165–72.
10. Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004;20:3363–9.
11. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biology. 2008;9:R137.
12. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26:841–2.
13. Broder AZ. On the resemblance and containment of documents. Proceedings Compression and Complexity of SEQUENCES 1997. (Cat No97TB100171).
