FBA: feature barcoding analysis for single cell RNA-Seq

Jialei Duan; Gary C Hon

Bioinformatics. 2021 May 17;37(22):4266–4268. doi: 10.1093/bioinformatics/btab375

Editor: Olga Vitek

PMCID: PMC9502162 PMID: 33999185

Abstract

Motivation

Single cell RNA-Seq (scRNA-Seq) has broadened our understanding of cellular heterogeneity and provided valuable insights into cellular functions. Recent experimental strategies extend scRNA-Seq readouts to include additional features, including cell surface proteins and genomic perturbations. These ‘feature barcoding’ strategies rely on converting molecular and cellular features to unique sequence barcodes, which are then detected with the transcriptome.

Results

Here, we introduce FBA, a flexible and streamlined package to perform quality control, quantification, demultiplexing, multiplet detection, clustering and visualization of feature barcoding assays.

Availability and implementation

FBA is available on PyPI at https://pypi.org/project/fba and on GitHub at https://github.com/jlduan/fba.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

scRNA-Seq has been a transformative tool to investigate cellular heterogeneity and cellular function. Since cell state is multifaceted, scRNA-Seq has been extended to simultaneously detect multiple orthogonal genomic features in the same cell. Feature barcoding assays encode molecular and cellular features as genetic barcodes, and this information is directly sequenced as an RNA readout and assigned to individual cells. Molecular feature barcoding assays can simultaneously measure RNA and protein expression in the same cells (Peterson et al., 2017; Shahi et al., 2017; Stoeckius et al., 2017), barcode samples for multiplexing (McGinnis et al., 2019; Stoeckius et al., 2018) and detect perturbations and phenotypes at cellular resolution (Datlinger et al., 2017; Replogle et al., 2020; Rubin et al., 2019; Xie et al., 2017).

Feature barcoding applications are expected to increase (Regev et al., 2017; van der Wijst et al., 2020). These assays rely on the accurate quantification of feature barcodes. However, one challenge to analysis is that many diverse feature barcoding specifications have been adopted, with different barcode lengths, positions and amplification strategies. Despite sophisticated computational approaches for cell hashing, demultiplexing and integrative analysis (Hao et al., 2020; Kim et al., 2020; Lian et al., 2020), flexible quality control and preprocessing tools are lacking. Here, we introduce FBA, a flexible and streamlined preprocessing package for quality control, quantification, demultiplexing, multiplet detection, clustering and visualization of feature barcoding assays.

2 Methods and implementation

FBA implements single cell partitioning and unique molecular identifier (UMI) quantification for feature barcoding assays. We assume that the initial input consists of paired-end fastq files, containing cell identifiers, UMIs and feature barcodes/enriched transcripts.

In the first step, FBA uses the FastSS algorithm to identify cell/feature barcodes, tolerating mismatches to account for PCR and/or sequencing errors (Bocek et al., 2007). Enriched transcripts are aligned to the transcriptome reference with Bowtie2 (Langmead and Salzberg, 2012) or BWA (Li and Durbin, 2009). To accurately identify barcodes, FBA searches all candidates within the mismatch threshold and chooses the most similar barcode. We break ties using the strategy implemented in the MAQ aligner, discarding barcodes with higher sequencing quality at mismatched bases (Li et al., 2008). FBA also incorporates flexible filters to increase the stringency of barcode alignment (e.g. matching sequences that flank barcodes, identifying barcodes at specified distances away from the expected position). In the second step, FBA performs UMI deduplication at specified read coordinates using UMI-tools (Smith et al., 2017) and quantifies a feature barcoding abundance matrix for downstream analysis. In the optional third step, for cell hashing or CRISPR perturbation experiments, FBA examines the feature matrix to make presence/absence calls for features across all cells. We use an approach inspired by Stoeckius et al. (2018) and HDBSCAN for clustering with customizable cutoffs (Campello et al., 2013) (Fig. 1, Supplementary Text).
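The barcode matching described above can be illustrated with a minimal sketch. It is not FBA's implementation: it assumes equal-length barcodes, substitution errors only, and a deletion-neighborhood index in the spirit of FastSS. The function names (`deletion_variants`, `build_index`, `match_barcode`) and the toy barcodes are hypothetical. Ties between equally distant candidates are broken by preferring the candidate whose mismatched read bases have the lowest summed quality; a read that is still ambiguous is left unassigned.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(seq, max_del):
    """Return seq plus every string made by deleting up to max_del characters."""
    variants = {seq}
    for n_del in range(1, max_del + 1):
        for idx in combinations(range(len(seq)), n_del):
            drop = set(idx)
            variants.add("".join(c for i, c in enumerate(seq) if i not in drop))
    return variants


def build_index(barcodes, max_mismatch):
    """Map each deletion variant to the reference barcodes that produce it."""
    index = defaultdict(set)
    for barcode in barcodes:
        for variant in deletion_variants(barcode, max_mismatch):
            index[variant].add(barcode)
    return index


def match_barcode(read_seq, read_qual, index, max_mismatch):
    """Return the best reference barcode for a read, or None if absent/ambiguous.

    read_qual holds one Phred score per base of read_seq.
    """
    candidates = set()
    for variant in deletion_variants(read_seq, max_mismatch):
        candidates |= index.get(variant, set())

    scored = []
    for barcode in candidates:
        if len(barcode) != len(read_seq):
            continue  # this sketch only handles substitutions
        mismatches = [i for i, (a, b) in enumerate(zip(barcode, read_seq)) if a != b]
        if len(mismatches) <= max_mismatch:
            quality_sum = sum(read_qual[i] for i in mismatches)
            scored.append((len(mismatches), quality_sum, barcode))

    if not scored:
        return None
    scored.sort()
    # Same distance and same mismatch quality: cannot decide, so discard.
    if len(scored) > 1 and scored[0][:2] == scored[1][:2]:
        return None
    return scored[0][2]


# Toy whitelist (hypothetical sequences).
index = build_index(["AAAA", "CCAA"], max_mismatch=1)
# "CAAA" is one mismatch from both barcodes; the low-quality base at
# position 1 makes "CCAA" the more plausible origin.
best = match_barcode("CAAA", [30, 10, 30, 30], index, max_mismatch=1)
```

Here `best` is `"CCAA"`. With uniform qualities the same read would be left unassigned, which is the conservative choice for an ambiguous barcode.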

Fig. 1. The workflow of FBA: a flexible and streamlined package for feature barcoding assays. To link feature barcodes to their single cell transcriptomes, FBA relies on cell and feature barcode information on paired reads. qc: FBA searches cell and feature barcodes unbiasedly along full-length reads and generates diagnostic information, including barcode distribution, library specificity, etc. extract/filter: FBA matches cell and feature barcodes with flexible and customizable thresholds. count: FBA performs UMI deduplication at specified read coordinates. demultiplex: In this optional step, for cell hashing or CRISPR perturbation experiments, FBA examines the feature matrix to make presence/absence calls for features across all cells. FBA also generates summary plots of demultiplexing results.
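To make the count step concrete, the following is a simplified sketch of UMI deduplication and matrix construction. It is an assumption-laden stand-in, not the UMI-tools method that FBA uses: UMIs are collapsed greedily, with the most abundant UMI absorbing any UMI within a given Hamming distance. The names `dedup_umis` and `count_matrix` and the example records are hypothetical.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def dedup_umis(umi_counts, max_mismatch=1):
    """Greedy collapse: abundant UMIs absorb neighbors within max_mismatch."""
    kept = []
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    for umi, _ in sorted(umi_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        is_neighbor = any(
            len(k) == len(umi) and hamming(k, umi) <= max_mismatch for k in kept
        )
        if not is_neighbor:
            kept.append(umi)
    return kept


def count_matrix(records, max_mismatch=1):
    """Build {feature: {cell: n_unique_umis}} from (cell, feature, umi) records."""
    grouped = defaultdict(Counter)
    for cell, feature, umi in records:
        grouped[(cell, feature)][umi] += 1

    matrix = defaultdict(dict)
    for (cell, feature), umis in grouped.items():
        matrix[feature][cell] = len(dedup_umis(umis, max_mismatch))
    return {feature: dict(cells) for feature, cells in matrix.items()}


# Hypothetical records after barcode extraction.
records = (
    [("c1", "sgA", "AAAA")] * 3
    + [("c1", "sgA", "AAAT")]   # one mismatch from AAAA, collapsed into it
    + [("c1", "sgA", "GGGG")]
    + [("c2", "sgB", "CCCC")]
)
matrix = count_matrix(records, max_mismatch=1)
```

For these records, cell `c1` has two unique UMIs for `sgA` and cell `c2` has one for `sgB`. Setting `max_mismatch=0` would count every distinct UMI sequence separately.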

2.1 Feature comparison to existing tools

Compared to existing tools such as Cell Ranger or CITE-seq-Count (Supplementary Fig. S1), FBA has several advantages (Supplementary Text). First, FBA supports the analysis of CRISPR, cell hashing, targeted transcript and custom barcoding libraries. Unlike Cell Ranger, FBA can analyze feature barcoding libraries in standalone mode, without a transcriptome. Second, FBA is computationally efficient and fast (Supplementary Fig. S2). Third, FBA offers multiple customizable options for alignment and UMI quantification. Fourth, FBA supports diverse inputs and custom configurations. Unlike Cell Ranger, FBA supports the analysis of feature barcoding libraries with non-uniform read lengths, mixed length cell/feature barcodes and different starting positions. Fifth, FBA supports quality control (QC) and custom barcoding. The QC module is a useful tool to inspect library quality, including analysis of barcode coordinates, read structure, possible contamination and non-specific amplification. Bulk mode can aid in the design and troubleshooting of custom barcoding assays in bulk before performing costly single cell experiments. Sixth, FBA has a visualization module that generates summary plots of demultiplexing results. Overall, FBA increases the flexibility of analyzing feature barcoding assays.

3 Results

3.1 Applications to single cell CRISPR screen analyses

We applied FBA to a single-cell CRISPR screening dataset from 10× Genomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets/4.0.0/SC3_v3_NextGem_DI_CRISPR_10K). In this dataset, A375 cells were transduced with non-target and target sgRNA (Rab1a sgRNA) separately. After culturing and selection, the two samples were mixed at a 1:1 ratio and ∼10 000 cells were subsequently sequenced. FBA summarizes the distribution of barcode positions and base-level abundance (Supplementary Fig. S3a and b) to confirm that the correct fragments have been enriched. After barcode extraction, FBA identifies ∼65% of total read pairs as having valid barcodes. We observe that the ratio of non-target to target sgRNA is about 2:1. Next, FBA performs UMI quantification with a 1-mismatch tolerance and detects an average of ∼477 UMIs per cell. Finally, FBA shows that ∼90% of cells have at least one feature barcode detected after demultiplexing and ∼10% of cells have more than one sgRNA detected (Supplementary Fig. S3c and d).
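The demultiplexing summary reported here (fraction of cells with at least one feature, fraction with more than one) can be sketched as follows. This is an illustration only: a fixed per-feature UMI cutoff stands in for the model-based thresholds FBA derives, and the function names, cutoff values and toy counts are all hypothetical.

```python
def call_features(matrix, cutoffs):
    """Make presence/absence calls.

    matrix:  {feature: {cell: umi_count}}
    cutoffs: {feature: minimum UMI count to call the feature present}
    Returns {cell: set of features called present}.
    """
    cells = {cell for counts in matrix.values() for cell in counts}
    calls = {cell: set() for cell in cells}
    for feature, counts in matrix.items():
        for cell, n_umis in counts.items():
            if n_umis >= cutoffs[feature]:
                calls[cell].add(feature)
    return calls


def summarize(calls):
    """Fraction of cells with at least one, and with more than one, feature."""
    n_cells = len(calls)
    if n_cells == 0:
        return {"cells": 0, "frac_with_feature": 0.0, "frac_multiple": 0.0}
    with_any = sum(1 for features in calls.values() if len(features) >= 1)
    multiple = sum(1 for features in calls.values() if len(features) > 1)
    return {
        "cells": n_cells,
        "frac_with_feature": with_any / n_cells,
        "frac_multiple": multiple / n_cells,
    }


# Toy UMI counts for two sgRNAs across four cells (hypothetical values).
toy_matrix = {
    "non_target": {"c1": 50, "c2": 2, "c3": 40, "c4": 0},
    "RAB1A": {"c1": 1, "c2": 60, "c3": 35, "c4": 1},
}
toy_calls = call_features(toy_matrix, {"non_target": 10, "RAB1A": 10})
toy_summary = summarize(toy_calls)
```

In this toy example three of four cells carry at least one sgRNA and one cell (`c3`) carries both, so `frac_with_feature` is 0.75 and `frac_multiple` is 0.25.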

3.2 Other functionalities

Applications to cell hashing analyses. We applied FBA to a published cell hashing dataset (Stoeckius et al., 2018). FBA finds that ∼92% of sequenced read pairs have valid cell barcodes and feature barcodes (hashtags) (Supplementary Fig. S4a and b). Of the total reads sequenced, 22% are unique molecular fragments, which contribute to the final quantification matrix. Demultiplexing and clustering results can be found in Supplementary Fig. S4c and d.
Bulk mode. When developing single cell-based perturbation assays, it is common practice to build a bulk library first to test the overall design and detection of desired perturbations. We designed a bulk mode in FBA, which estimates (i) the number of reads that have valid feature barcodes and (ii) the distribution of feature barcodes.
Detection of enriched transcripts. FBA can also process targeted enrichment libraries, for example exogenous transgenes such as eGFP. Customized thresholds and read positions can be set for cell barcode recognition and enriched transcript quantification.
Quality control mode. To facilitate the design and troubleshooting of feature barcoding assays, FBA has a quality control mode, which enables unbiased barcode matching along full-length reads and provides a detailed summary.
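The two bulk-mode estimates, together with the position-agnostic search of the quality control mode, can be sketched in a few lines. This is a simplified illustration rather than FBA's code: it uses exact substring matching instead of mismatch-tolerant search, and the function name `bulk_summary`, the barcode names and the reads are hypothetical.

```python
from collections import Counter


def bulk_summary(reads, feature_barcodes):
    """Summarize feature barcode content of a bulk library.

    reads:            iterable of read sequences
    feature_barcodes: {feature_name: barcode_sequence}
    Barcodes are searched anywhere in the read (exact match only); when
    several barcodes occur, the leftmost one is assigned.
    """
    per_feature = Counter()
    positions = Counter()
    n_reads = 0
    n_valid = 0
    for read in reads:
        n_reads += 1
        hits = [
            (read.find(seq), name)
            for name, seq in feature_barcodes.items()
            if seq in read
        ]
        if not hits:
            continue
        position, name = min(hits)
        n_valid += 1
        per_feature[name] += 1
        positions[position] += 1
    return {
        "reads": n_reads,
        "valid": n_valid,
        "frac_valid": n_valid / n_reads if n_reads else 0.0,
        "per_feature": dict(per_feature),
        "positions": dict(positions),
    }


# Hypothetical barcodes and reads.
toy_barcodes = {"sgA": "ACGTAC", "sgB": "TTGGCC"}
toy_reads = ["NNACGTACNN", "TTGGCCAAAA", "GGGGGGGGGG", "AAACGTACAA"]
bulk = bulk_summary(toy_reads, toy_barcodes)
```

Three of the four toy reads contain a barcode, two for `sgA` and one for `sgB`. The `positions` tally shows where barcodes start within reads, which is the kind of diagnostic that helps reveal an unexpected read structure before a single cell experiment is run.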

3.3 Performance comparison to existing tools

We applied FBA to a cell hashing dataset consisting of a mixture of four cell lines (Supplementary Fig. S5), where transcriptomes can be used as a gold standard to distinguish cell identity. Demultiplexing of hashtag oligonucleotides (HTOs) indicates that all singlets called by FBA are correct, with a false-negative rate of 3.0%. Since Cell Ranger and CITE-seq-Count cannot demultiplex cell hashing datasets, we compared FBA with Seurat, a popular R package for single cell analysis (Hao et al., 2020). Against the same gold standard, Seurat misidentifies one singlet and has a false-negative rate of 4.4%.
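A comparison of this kind can be scored with a short function. The definitions below are assumptions, since the text does not spell them out: a singlet is "misidentified" when it is assigned a hashtag that disagrees with the transcriptome-derived identity, and a "false negative" is a gold-standard singlet that the tool fails to call as a singlet. The function name and the toy labels are hypothetical.

```python
def evaluate_singlets(gold, called):
    """Score demultiplexing calls against a gold standard.

    gold:   {cell: true_identity} for cells known to be singlets
    called: {cell: assigned_identity, or None if not called a singlet}
    Assumed definitions: misidentified = called with the wrong identity;
    false negative = gold singlet that was not called a singlet.
    """
    misidentified = 0
    missed = 0
    for cell, truth in gold.items():
        label = called.get(cell)
        if label is None:
            missed += 1
        elif label != truth:
            misidentified += 1
    rate = missed / len(gold) if gold else 0.0
    return {"misidentified": misidentified, "false_negative_rate": rate}


# Toy example with four gold-standard singlets (hypothetical labels).
gold = {"c1": "lineA", "c2": "lineB", "c3": "lineC", "c4": "lineD"}
called = {"c1": "lineA", "c2": None, "c3": "lineA", "c4": "lineD"}
scores = evaluate_singlets(gold, called)
```

In the toy data one cell is missed and one is assigned the wrong cell line, giving one misidentified singlet and a false-negative rate of 0.25.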

4 Conclusions

FBA is a flexible and streamlined toolbox for quality control, quantification and demultiplexing of various feature barcoding assays. It can be applied to customized feature barcoding specifications, including different CRISPR constructs or targeted enriched transcripts. FBA allows users to customize a wide range of parameters for the quantification and demultiplexing process. FBA also has a user-friendly quality control module, which is helpful in troubleshooting feature barcoding experiments.

Supplementary Material

btab375_Supplementary_Data


Acknowledgements

The authors acknowledge the BioHPC computational infrastructure at UT Southwestern for providing HPC and storage resources that have contributed to the research results reported within this paper.

Funding

G.C.H. was supported by Cancer Prevention and Research Institute of Texas (CPRIT) [RP190451], National Institutes of Health [DP2GM128203], the Welch Foundation [I-1926-20170325], the Burroughs Wellcome Fund [1019804] and the Green Center for Reproductive Biology.

Data availability: No new data were generated in this research. Analyzed data were downloaded from: https://support.10xgenomics.com/single-cell-gene-expression/datasets/4.0.0/SC3_v3_NextGem_DI_CRISPR_10K.

Competing interests: The authors declare no competing interests.

Contributor Information

Jialei Duan, The Green Center, Cecil H. and Ida Green Center for Reproductive Biology Sciences, Dallas, TX, 75390, USA.

Gary C Hon, The Green Center, Cecil H. and Ida Green Center for Reproductive Biology Sciences, Dallas, TX, 75390, USA; Division of Basic Reproductive Biology Research, Department of Obstetrics and Gynecology, Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.

References

Bocek T. et al. (2007) Fast Similarity Search in Large Dictionaries.
Campello R.J.G.B. et al. (2013) Density-Based Clustering Based on Hierarchical Density Estimates. In: Advances in Knowledge Discovery and Data Mining. Springer: Berlin, Heidelberg, pp. 160–172.
Datlinger P. et al. (2017) Pooled CRISPR screening with single-cell transcriptome readout. Nat. Methods, 14, 297–301.
Hao Y. et al. (2020) Integrated analysis of multimodal single-cell data. Cold Spring Harbor Laboratory, 2020.10.12.335331.
Kim H.J. et al. (2020) CiteFuse enables multi-modal analysis of CITE-seq data. Bioinformatics, 36, 4137–4143.
Langmead B., Salzberg S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.
Lian Q. et al. (2020) Artificial-cell-type aware cell-type classification in CITE-seq. Bioinformatics, 36, i542–i550.
Li H., Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.
Li H. et al. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res., 18, 1851–1858.
McGinnis C.S. et al. (2019) MULTI-seq: sample multiplexing for single-cell RNA sequencing using lipid-tagged indices. Nat. Methods, 16, 619–626.
Peterson V.M. et al. (2017) Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol., 35, 936–939.
Regev A. et al.; Human Cell Atlas Meeting Participants. (2017) The Human Cell Atlas. Elife, 6, e27041.
Replogle J.M. et al. (2020) Combinatorial single-cell CRISPR screens by direct guide RNA capture and targeted sequencing. Nat. Biotechnol., 38, 954–961.
Rubin A.J. et al. (2019) Coupled single-cell CRISPR screening and epigenomic profiling reveals causal gene regulatory networks. Cell, 176, 361–376.e17.
Shahi P. et al. (2017) Abseq: ultrahigh-throughput single cell protein profiling with droplet microfluidic barcoding. Sci. Rep., 7, 44447.
Smith T. et al. (2017) UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy. Genome Res., 27, 491–499.
Stoeckius M. et al. (2018) Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol., 19, 224.
Stoeckius M. et al. (2017) Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods, 14, 865–868.
van der Wijst M. et al. (2020) The single-cell eQTLGen consortium. Elife, 9, e52155.
Xie S. et al. (2017) Multiplexed engineering and analysis of combinatorial enhancer activity in single cells. Mol. Cell, 66, 285–299.e5.
