Abstract
Summary
CRISPR screens are increasingly performed to associate genotypes with phenotypes. So far, however, their analysis has required specialized computational knowledge to transform high-throughput next-generation sequencing (NGS) data into sequence formats amenable to downstream analysis. We developed ReCo, a stand-alone and user-friendly analytics tool for generating read count tables from NGS data of single and combinatorial CRISPR libraries and screens. Together with cutadapt and bowtie2 for rapid sequence trimming and alignment, ReCo enables the automated generation of read count tables from staggered NGS reads for the downstream identification of gRNA-induced phenotypes.
Availability and implementation
ReCo is published under the MIT license and available at: https://github.com/KaulichLab/ReCo.
1 Introduction
The CRISPR-Cas system has emerged as an important tool for genome editing (Jinek et al. 2012, Cong et al. 2013, Wang and Doudna 2023). In its engineered version, the system consists of two components, a Cas endonuclease and a single guide RNA (sgRNA or gRNA) that guides the Cas enzyme to a predefined locus in the genome. Depending on the type of system, the targeted locus can be perturbed in multiple ways, among them the induction of double-strand breaks (causing insertions or deletions, InDels), the editing of individual bases (base or prime editing) (Anzalone et al. 2020), or the recruitment of effector domains to activate or repress gene transcription (CRISPRa or CRISPRi) (Gilbert et al. 2014, Liu et al. 2022). In its most widely used form, a Cas nuclease, e.g. SpCas9, induces a DNA double-strand break in coding exons, resulting in frameshift mutations that cause functional knockouts of the genes of interest. When target sites are bundled in a gRNA library, a population of mutant cells can be generated and screened for a phenotype of interest, enabling unbiased genotype-to-phenotype associations (Shalem et al. 2014, Wang et al. 2014, Bock et al. 2022). To do so, the gRNA expression cassette is stably integrated into the host cell genome, which allows its population frequency to be used as a surrogate for the gRNA-induced phenotype (Shalem et al. 2014, Wang et al. 2014, Ford et al. 2019). gRNA frequencies are quantified by NGS, and different screening time points are compared with the gRNA frequencies of the library. Due to the low sequence diversity of gRNA libraries (only the gRNA part of the NGS read is variable), gRNA amplicons are commonly sequenced with staggered oligos, which shift the gRNA position randomly within a window of up to eight nucleotides and thereby avoid low-diversity issues during NGS runs. This, however, prevents the extraction of gRNA sequences from a fixed position within the NGS reads, so additional read trimming and alignment steps are required for data processing.
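To illustrate why staggering complicates gRNA extraction, the following minimal Python sketch locates a gRNA whose start shifts by zero to eight nucleotides between reads. It is not ReCo's implementation (ReCo delegates trimming to cutadapt); the flanking sequence `CACCG`, the 20-nt gRNA length, and the function name `extract_grna` are assumptions chosen purely for illustration.

```python
from typing import Optional


def extract_grna(read: str, upstream: str, grna_len: int = 20,
                 max_stagger: int = 8) -> Optional[str]:
    """Return the sequence directly downstream of `upstream`, or None.

    The flank is only accepted if it starts at an offset between 0 and
    `max_stagger` (inclusive), mimicking a stagger window.
    """
    pos = read.find(upstream, 0, max_stagger + len(upstream))
    if pos == -1:
        return None  # flank missing or shifted beyond the stagger window
    start = pos + len(upstream)
    grna = read[start:start + grna_len]
    # Reject reads that end before the full gRNA is covered.
    return grna if len(grna) == grna_len else None


# Hypothetical example: the same gRNA preceded by staggers of 0, 3 and 8 nt.
FLANK = "CACCG"                  # assumed constant vector sequence
GRNA = "ACGTACGTACGTACGTACGT"    # assumed 20-nt gRNA
reads = ["T" * n + FLANK + GRNA + "GTTTTAGA" for n in (0, 3, 8)]
extracted = [extract_grna(r, FLANK) for r in reads]
```

In this toy example, a fixed-position slice such as `read[5:25]` would recover the gRNA only from the unstaggered read, whereas searching for the flank recovers it from all three.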
Although this setup is widely used, automated pipelines that generate gRNA read count tables from staggered NGS data are lacking, which makes it difficult for groups with limited computational expertise to analyze their CRISPR libraries and screening samples. To close this gap, we present the Read Counting tool ReCo, which automatically generates read count tables from single-end and paired-end NGS fastq files with minimal input requirements.
ReCo is implemented as a Python 3 package that can also be run as a standalone command line tool. It uses the parallelization capabilities of two external tools, cutadapt and bowtie2 (Martin 2011, Langmead and Salzberg 2012), to decrease sample processing time. ReCo can process arbitrary numbers of single-end and paired-end samples per run, corresponding to single or combinatorial CRISPR gRNA libraries. The tool requires only minimal information per sample: a unique sample name and the locations of the fastq and gRNA library files. Optionally, ReCo integrates expected sequencing depths and accepts vector maps in SnapGene format to account for 3Cs-technology-based samples (Wegner et al. 2019, 2020, Diehl et al. 2021). If provided with a vector file, ReCo will automatically identify the 3Cs-template sequences and include their abundance in the final report.
2 Results
PinAPL-py is the only open-access tool for generating read count files from staggered NGS runs (Fig. 1a). However, the implementation of PinAPL-py leaves room for improvement, particularly for samples with high sequencing depths or total reads (Spahn et al. 2017). Additionally, PinAPL-py is unable to handle combinatorial CRISPR samples. Thus, we implemented the stand-alone command line tool ReCo to automatically trim, align, and count single and combinatorial gRNA sequences from staggered NGS reads. Unlike previous tools, ReCo demands little user interaction and runs locally without the need to upload data, a step for which connection speed can be a limiting factor. ReCo operates on user-provided sample names and the locations of the Illumina fastq and gRNA library files. Users can optionally provide expected sequencing depths per sample as well as plasmid maps in SnapGene (.dna) format to identify and account for 3Cs-technology-derived samples. In that case, ReCo identifies the 3Cs placeholder gRNA sequence by sample subsampling (Supplementary Fig. S1). Sequence trimming and alignment by cutadapt and bowtie2, respectively, are run iteratively, with the option to use the parallelization parameters of both tools to decrease the running time per sample. To improve run time further, particularly for highly diverse samples, all unique putative sequences are first counted and only then aligned to the gRNA sequence library. As output, ReCo provides one .csv read count file per sample containing the gRNA counts. For troubleshooting purposes, ReCo also reports all sequences that could not be aligned to the provided gRNA library. The final read count table is visualized as a plot panel in .png and .pdf formats and is accompanied by a .txt file reporting the ReCo run parameters. The analysis plot panel summarizes trimming and alignment rates, expected and observed sample sequencing depth, the distribution of read counts, as well as gRNA completeness and sample distribution skew. Optionally, the 3Cs template gRNA placeholder sequence is highlighted for quality control purposes.
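The count-first strategy described above can be sketched as follows. This is a simplified illustration rather than ReCo's code: an exact dictionary lookup stands in for the bowtie2 alignment, and the function name `count_reads`, the `library` mapping, and the toy sequences are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_reads(trimmed_seqs: Iterable[str],
                library: Dict[str, str]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Count gRNAs by collapsing identical sequences before "aligning".

    `library` maps gRNA name -> gRNA sequence. Returns the per-gRNA read
    counts and the counts of sequences that matched no library entry.
    """
    # Step 1: collapse identical trimmed sequences into (sequence, count).
    unique_counts = Counter(trimmed_seqs)

    # Step 2: "align" each unique sequence once. A real pipeline would call
    # an aligner here; the exact-match lookup is a stand-in assumption.
    seq_to_name = {seq: name for name, seq in library.items()}
    counts = {name: 0 for name in library}
    unaligned: Dict[str, int] = {}
    for seq, n in unique_counts.items():
        name = seq_to_name.get(seq)
        if name is None:
            unaligned[seq] = n      # kept for troubleshooting
        else:
            counts[name] += n       # one lookup accounts for n reads
    return counts, unaligned


# Toy example: four reads, three unique sequences, so three lookups.
library = {"g1": "AAAA", "g2": "CCCC", "g3": "TTTT"}
counts, unaligned = count_reads(["AAAA", "AAAA", "CCCC", "GGGG"], library)
```

Because the number of lookups depends on the number of unique sequences rather than the number of reads, the alignment cost in this sketch is bounded by sample diversity, not sequencing depth.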
Figure 1.
The ReCo pipeline. (a) Trimming and alignment strategy of ReCo. Single-end read samples are trimmed with cutadapt to isolate the putative gRNA sequences. Unique sequences are counted and then aligned only once to the gRNA library to reduce the number of required alignments. Paired-end read samples use the single-end pipeline for both reads individually. In an intermediate step, unique sequence combinations from corresponding reads are counted. In the final step, the unique gRNA combinations are mapped against the individual sets of alignments. The graph was created with BioRender.com. (b) Benchmarking of running times for ReCo and PinAPL-py. Sequencing samples of sizes between 0.5 million and 1.2 billion reads were processed on 15 cores and the time that was required for the individual steps was measured in seconds and is shown for each sample. While in the PinAPL-py algorithm, the time requirements grew for each processing step, in the ReCo algorithm, only the trimming procedure required more time in relation to the number of input reads. (c) The ratio of ReCo and PinAPL-py running times increases with the number of processed reads, meaning that the time requirements for PinAPL-py increase faster than those for ReCo.
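The paired-end strategy in panel (a) can be sketched in the same spirit. The snippet below is an assumption-laden illustration, not ReCo's code: exact lookups replace the per-read alignments, and `count_pairs`, `lib1`, and `lib2` are hypothetical names.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_pairs(pairs: Iterable[Tuple[str, str]],
                lib1: Dict[str, str],
                lib2: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Count gRNA combinations from (read 1, read 2) sequence pairs.

    `lib1` and `lib2` map gRNA sequence -> gRNA name for each read.
    """
    # Intermediate step: count unique sequence combinations.
    pair_counts = Counter(pairs)

    # Single-end step, once per read: resolve each unique sequence one time.
    aln1 = {s: lib1.get(s) for s in {p[0] for p in pair_counts}}
    aln2 = {s: lib2.get(s) for s in {p[1] for p in pair_counts}}

    # Final step: map unique combinations onto the two sets of alignments.
    combos: Counter = Counter()
    for (s1, s2), n in pair_counts.items():
        g1, g2 = aln1[s1], aln2[s2]
        if g1 is not None and g2 is not None:
            combos[(g1, g2)] += n
    return dict(combos)


# Toy example with one pair whose first read matches no library entry.
pairs = [("AAAA", "CCCC"), ("AAAA", "CCCC"), ("AAAA", "GGGG"), ("TTTT", "CCCC")]
combo_counts = count_pairs(pairs, {"AAAA": "gA"}, {"CCCC": "gC", "GGGG": "gG"})
```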
3 Benchmarking
To assess the relative performance of ReCo, we benchmarked it against PinAPL-py (Spahn et al. 2017). PinAPL-py was chosen as it is the only other available tool that operates on staggered NGS reads. Benchmarking was limited to single-end NGS reads, as PinAPL-py does not accept paired-end NGS reads. We used the test data set provided by PinAPL-py, which contains 67.9 million reads and is derived from a genome-wide CRISPR-Cas knockout drop-out screen in A375 melanoma cells using the SpCas9 Brunello gRNA library (Doench et al. 2016). We separated the benchmarking into two aspects: the number of found gRNAs with their associated alignment rates, and the required run time. Alignment rates were similar, with 83.88% for PinAPL-py and 83.91% for ReCo, as were the numbers of found gRNAs, with 98.58% (76 341 of 77 441) for PinAPL-py and 98.62% (76 372 of 77 441) for ReCo. However, we found an issue within PinAPL-py’s alignment parameters that resulted in the failure to detect 31 gRNA sequences that are the reverse complement of other gRNAs, an issue that does not occur with ReCo. To benchmark the required run time, sampled datasets corresponding to 500K, 1M, 5M, 10M, 25M, 50M, 100M, 200M, 300M, 600M, and 1.2B reads were derived from the original test data and processed individually by PinAPL-py and ReCo on 15 cores, with no other jobs running, to maximize parallelization and ensure a fair comparison. While the run time of PinAPL-py grew exponentially with sample size and was dominated by trimming, alignment, and mapping/counting, the run time of ReCo was determined solely by trimming, with alignment and mapping/counting being decoupled from sample size (Fig. 1b). With increasing sample size, the ratio between the required run times for PinAPL-py and ReCo increased (Fig. 1c), demonstrating that ReCo scales better with samples of high diversity, such as combinatorial libraries or multiple diverse samples.
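As a hedged illustration of the reverse-complement issue, the sketch below lists library gRNAs whose sequence is the reverse complement of another library gRNA, and re-derives the reported percentages from the stated counts. The function names and the toy library are hypothetical; this is not code from ReCo or PinAPL-py.

```python
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_pairs(library: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return sorted name pairs whose sequences are reverse complements.

    `library` maps gRNA name -> sequence. Self-complementary (palindromic)
    sequences are ignored because they pair only with themselves.
    """
    by_seq = {seq: name for name, seq in library.items()}
    pairs = set()
    for name, seq in library.items():
        partner = by_seq.get(reverse_complement(seq))
        if partner is not None and partner != name:
            pairs.add(tuple(sorted((name, partner))))
    return sorted(pairs)


# Toy library: "a" and "b" are reverse complements of each other.
toy_pairs = reverse_complement_pairs({"a": "AACG", "b": "CGTT", "c": "GGGA"})

# Re-deriving the reported detection rates from the counts in the text.
LIBRARY_SIZE = 77_441
pinapl_pct = round(100 * 76_341 / LIBRARY_SIZE, 2)   # 98.58
reco_pct = round(100 * 76_372 / LIBRARY_SIZE, 2)     # 98.62
difference = 76_372 - 76_341                         # 31 gRNAs
```

The difference of 31 detected gRNAs equals the number of reverse-complement gRNAs reported as missed by PinAPL-py.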
4 Conclusions
ReCo is a scalable read-counting tool for single and combinatorial CRISPR gRNA library data. It automatically recognizes gRNA positions in staggered single-end and paired-end NGS reads, generates read count files for further data analysis, and provides a visual quality control report summarizing the percentage of aligned and trimmed reads, expected and obtained sequencing depth, as well as gRNA and sample distribution skew. Combined with downstream CRISPR analysis tools, ReCo allows experienced and inexperienced users to efficiently analyze gRNA- and gene-level phenotypes across diverse CRISPR screen conditions.
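Two of the quality control metrics mentioned here can be approximated from a read count table as sketched below. The text does not state how ReCo computes them, so the definitions are assumptions: completeness is taken as the fraction of library gRNAs with at least one read, and skew as the ratio of the 90th to the 10th percentile of read counts, a convention often used for gRNA libraries. The function names are hypothetical.

```python
from typing import Sequence


def completeness(counts: Sequence[int]) -> float:
    """Fraction of gRNAs with at least one read (assumed definition)."""
    if not counts:
        raise ValueError("counts must not be empty")
    return sum(1 for c in counts if c > 0) / len(counts)


def skew_ratio(counts: Sequence[int], lower: float = 0.10,
               upper: float = 0.90) -> float:
    """Ratio of the upper to the lower percentile of read counts.

    Percentiles use a simple nearest-rank rule on the sorted counts.
    Returns infinity if the lower percentile is zero.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    ordered = sorted(counts)
    last = len(ordered) - 1
    low = ordered[round(lower * last)]
    high = ordered[round(upper * last)]
    return high / low if low > 0 else float("inf")


# Toy read counts for an 11-gRNA library with one gRNA missing.
toy_counts = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
toy_completeness = completeness(toy_counts)   # 10 of 11 gRNAs detected
toy_skew = skew_ratio(toy_counts)             # 90 / 10
```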
Acknowledgements
We thank all members of the Kaulich lab for their continuous support. Figures 1 and S1 were created with BioRender.com.
Contributor Information
Martin Wegner, Goethe University Frankfurt, University Hospital, Institute of Biochemistry II, 60590, Frankfurt am Main, Germany.
Manuel Kaulich, Goethe University Frankfurt, University Hospital, Institute of Biochemistry II, 60590, Frankfurt am Main, Germany; Frankfurt Cancer Institute, Frankfurt am Main, Germany; Cardio-Pulmonary Institute, Frankfurt am Main, Germany.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
M.W. is an employee of Vivlion GmbH. M.K. is a co-founder, shareholder, and chief officer of Vivlion GmbH.
Funding
This work was supported by the Hessisches Ministerium für Wissenschaft und Kunst (HMWK) [LOEWE—FCI IIIL5-519/03/03.001]; the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) [CPI-EXC 2026, 259130777—SFB 1177]; and the Bundesministerium für Bildung und Forschung (BMBF, Cluster4Future, Proxidrugs). Funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
References
- Anzalone AV, Koblan LW, Liu DR et al. Genome editing with CRISPR-Cas nucleases, base editors, transposases and prime editors. Nat Biotechnol 2020;38:824–44.
- Bock C, Datlinger P, Chardon F et al. High-content CRISPR screening. Nat Rev Methods Primers 2022;2:23.
- Cong L, Ran FA, Cox D et al. Multiplex genome engineering using CRISPR/Cas systems. Science 2013;339:819–23.
- Diehl V, Wegner M, Grumati P et al. Minimized combinatorial CRISPR screens identify genetic interactions in autophagy. Nucleic Acids Res 2021;49:5684–704.
- Doench JG, Fusi N, Sullender M et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat Biotechnol 2016;34:184–91.
- Ford K, McDonald D, Mali P et al. Functional genomics via CRISPR-Cas. J Mol Biol 2019;431:48–65.
- Gilbert LA, Horlbeck MA, Adamson B et al. Genome-scale CRISPR-mediated control of gene repression and activation. Cell 2014;159:647–61.
- Jinek M, Chylinski K, Fonfara I et al. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 2012;337:816–21.
- Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods 2012;9:357–9.
- Liu G, Lin Q, Jin S et al. The CRISPR-Cas toolbox and gene editing technologies. Mol Cell 2022;82:333–47.
- Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17:10.
- Shalem O, Sanjana NE, Hartenian E et al. Genome-scale CRISPR-Cas9 knockout screening in human cells. Science 2014;343:84–7.
- Spahn PN, Bath T, Weiss RJ et al. PinAPL-Py: a comprehensive web-application for the analysis of CRISPR/Cas9 screens. Sci Rep 2017;7:15854.
- Wang JY, Doudna JA. CRISPR technology: a decade of genome editing is only the beginning. Science 2023;379:eadd8643.
- Wang T, Wei JJ, Sabatini DM et al. Genetic screens in human cells using the CRISPR-Cas9 system. Science 2014;343:80–4.
- Wegner M, Diehl V, Bittl V et al. Circular synthesized CRISPR/Cas gRNAs for functional interrogations in the coding and noncoding genome. Elife 2019;8:e42549.
- Wegner M, Husnjak K, Kaulich M. Unbiased and tailored CRISPR/Cas gRNA libraries by Synthesizing Covalently-closed-circular (3Cs) DNA. Bio Protoc 2020;10:e3472.