Abstract
Motivation:
Current CRISPR guide RNA design tools rely on reference genomes, overlooking how genetic variation impacts editing outcomes. As genome editing advances toward clinical applications, incorporating population diversity becomes essential for ensuring therapeutic efficacy across diverse populations.
Results:
We present CRISPR-HAWK, a framework integrating individual- and population-scale variants and haplotypes into gRNA design. Analyzing therapeutic targets across 79,648 genomes reveals that genetic variants substantially alter guide performance. For the clinically approved sickle cell disease therapeutic guide targeting BCL11A, we identify haplotypes that completely abolish predicted cutting activity. Across seven therapeutic loci, 82.5% of guides contain variants modifying on-target activity. Variants also create novel protospacer adjacent motif sites generating individual-specific guides invisible to reference-based design. These findings demonstrate that variant-aware selection is critical for equitable genome editing.
1. Introduction
CRISPR-Cas systems present unprecedented opportunities for therapeutic development by offering a powerful means to precisely modify DNA within living cells. Originally repurposed from Streptococcus pyogenes Cas9 (SpCas9) for targeted genome editing, CRISPR has since evolved into a versatile platform comprising a diverse array of Cas orthologs and engineered variants (Jiang and Doudna, 2017). Central to this technology is a guide RNA (gRNA) that directs the Cas complex to a complementary genomic sequence, contingent on the presence of a protospacer adjacent motif (PAM). This modular mechanism enables a spectrum of biological outcomes, including the introduction of double-strand breaks, precise nucleotide substitution via base editors (Rees and Liu, 2018), combinations of genetic modifications via prime editing (Anzalone et al., 2019), and epigenetic or transcriptional modulation through CRISPRa/i systems (Thakore et al., 2016; Kampmann, 2018).
The success of CRISPR genome editing relies on two key factors: achieving high on-target efficiency while minimizing off-target activity (Clement et al., 2020). On-target efficiency refers to the ability of a gRNA to precisely direct the Cas nuclease to the intended genomic locus, thereby executing the desired genetic modification. Conversely, off-target effects occur when the CRISPR system binds and edits unintended genomic loci with partial sequence similarity, potentially resulting in harmful or confounding mutations (Cho et al., 2014). Several computational tools have been developed to support gRNA design by optimizing both aspects (Hanna and Doench, 2020). Tools such as Cas-Designer (Cho et al., 2014) or CHOPCHOP (Labun et al., 2019) identify candidate guides by aligning input PAM sequences against a reference genome. For each reported gRNA, they compute different sequence-based features, such as GC content, and estimate the number and location of potential off-target sites. This information is then used to rank guides based on predicted editing efficiency and target specificity. CRISPick (Doench et al., 2016; DeWeirdt et al., 2021) and CRISPRon (Anthon et al., 2022) integrate machine learning models trained on experimental data to prioritize guides with high predicted activity and minimal off-target potential. CRISPOR (Concordet and Haeussler, 2018), meanwhile, integrates information on single nucleotide polymorphisms (SNPs) that may overlap with candidate gRNA sequences.
However, existing tools have a critical limitation: gRNA selection is based on the reference genome, thereby potentially overlooking on-target sites introduced or modulated by genetic variants. While CRISPOR annotates candidate gRNAs with overlapping SNPs, the guide selection process still relies on the reference sequence. Consequently, the combined effect of multiple variants or haplotypes is not considered, nor is the impact of genetic variants on on-target activity quantitatively assessed. While it is well recognized that variants can create or modulate off-target sites (Cancellieri et al., 2023; Lazzarotto et al., 2025), there is growing evidence that variants can also affect on-target efficiency. This occurs through direct alteration of the intended protospacer sequence or changes in its local genomic context (Canver et al., 2018). Such effects are critical in therapeutic applications, where even subtle differences in editing efficiency or specificity can impact clinical efficacy and safety (Lessard et al., 2017; Liu et al., 2021).
The clinical significance of this problem is exemplified by recent therapeutic applications. Casgevy (exagamglogene autotemcel), one of the first FDA-approved CRISPR therapies for sickle cell disease and beta-thalassemia, relies on precise editing of the BCL11A enhancer using guide sg1617 (Frangoul et al., 2021). Yet these treatments, designed using reference genomes, may have variable efficacy across genetically diverse populations.
As genome editing advances toward personalized and clinical applications, the need to account for both individual- and population-level genetic variation becomes increasingly important (Scott and Zhang, 2017). Incorporating genetic diversity into gRNA design is key to enhancing on-target efficiency, reducing off-target effects, and ensuring the safety and efficacy of therapeutic outcomes.
To address these challenges, we developed CRISPR-HAWK, which is, to our knowledge, the first framework to integrate both genetic variants and haplotypes in gRNA design. By reconstructing individual haplotypes from population-scale genetic variant datasets, CRISPR-HAWK enables sample-specific guide selection while providing comprehensive assessment of guide performance. On-target efficiency of candidate gRNAs is predicted using machine learning models, while off-target activity is assessed through seamless integration with CRISPRitz (Cancellieri et al., 2020), a robust off-target nomination search engine. Candidate guides are further annotated with genetic, variant, and functional information to support informed gRNA selection. We demonstrate the utility of CRISPR-HAWK by designing candidate gRNAs targeting clinically relevant or widely tested genomic regions in human, while accounting for genetic variants from the 1000 Genomes Project (1000G) (Consortium et al., 2015; Zheng-Bradley et al., 2017), Human Genome Diversity Project (HGDP) (Bergström et al., 2020), and genome Aggregation Database (gnomAD) (Karczewski et al., 2020; Chen et al., 2024) datasets. We show that genetic variation gives rise to numerous population- and individual-specific alternative gRNAs within each analyzed target region, revealing that many candidate guides are overlooked when relying solely on the reference genome. Moreover, we analyze how genetic variants impact on-target activity of gRNAs designed on the reference genome. These findings underscore the importance of adopting variant-aware strategies to ensure accurate, efficient, and equitable genome editing, particularly in therapeutic contexts.
2. Materials & Methods
CRISPR-HAWK is a command-line tool designed for the efficient enumeration, scoring, and annotation of CRISPR-Cas guide RNAs within user-defined genomic regions. Figures 1 and 2 provide an overview of its architecture, illustrating the required and optional inputs, the main computational steps, and the structure of the resulting outputs. The following subsections detail the implementation of CRISPR-HAWK, including its strategy for integrating genetic variants and haplotype data, the algorithmic details underlying gRNA search and scoring, and the annotation pipeline. We further describe the datasets and experimental settings used for benchmarking and performance evaluation.
Fig. 1: Overview of CRISPR-HAWK architecture.
(A) CRISPR-HAWK designs gRNAs from a reference genome, by specifying genomic coordinates, a PAM motif, and a spacer length. Optional inputs include variant datasets for variant- and haplotype-aware guide design, genomic annotation files, and user-provided candidate guide coordinates for assessing variant effects. (B) Genomic regions of interest are extracted and (C) combined with variant data, when available, to reconstruct haplotype-resolved sequences. (D) Reconstructed haplotypes and PAM sequences are encoded using a 4-bit binary scheme that supports IUPAC ambiguity codes, enabling fast bitwise pattern matching. (E) Encoded sequences are scanned to identify PAM occurrences, and adjacent spacers are extracted to generate candidate gRNAs. (F) Candidate guides are evaluated using three complementary metrics: on-target efficiency of haplotype-matched guides, residual on-target activity of reference-designed guides on variant-containing target sequences, and guide specificity via integrated predictive models. (G) Guides are annotated with functional, gene-level, and cancer-related features using default annotation tracks (ENCODE, GENCODE, COSMIC) or user-provided BED files. (H) CRISPR-HAWK generates comprehensive outputs including a table of all candidate guides identified within the input region. Optional outputs include a table of reconstructed haplotypes, graphical reports, a dedicated report for user-selected candidate guides, a genome-wide list of predicted off-target sites, and a guide activity map integrating residual on-target activity with specificity.
Fig. 2: Overview of CRISPR-HAWK outputs and visualizations.
(A) The primary output summarizes all candidate gRNAs with their associated performance metrics in a tabular report, including on-target efficiency, residual on-target activity, and guide specificity. (B) When requested, a haplotype table is provided, listing each reconstructed haplotype alongside the corresponding samples and variants. (C) Graphical reports include scatter plots depicting residual on-target activity across variant-containing target sequences, with point size indicating the number of individuals carrying each variant, and pie charts classifying gRNAs by type (reference, spacer alternative, PAM alternative, or both). (D) Users may optionally restrict the analysis to selected candidate guides. In this mode, the tool produces a dedicated report detailing all variant-containing target sequences and their associated scores for on-target efficiency, residual on-target activity, and guide specificity. (E) A genome-wide list of predicted off-target sites is generated for each candidate gRNA using CRISPRitz. (F) The guide activity map integrates residual on-target activity with guide specificity, providing a global view of guide performance across diverse genomic backgrounds.
2.1. Required and Optional Input Parameters
CRISPR-HAWK operates through a set of main input parameters guiding its search and annotation workflow (Figure 1A). Four inputs are mandatory: (i) a reference genome in FASTA format, (ii) genomic coordinates specifying the target regions in BED format, (iii) the PAM recognized by the selected Cas nuclease, and (iv) the desired spacer length. In addition to these core inputs, users may optionally provide variant datasets in VCF format (phased or unphased) to enable variant- and haplotype-aware guide design, as well as pre-defined candidate gRNAs for focused evaluation. Functional genomic annotation in BED format can also be provided to enrich the resulting guides with contextual information such as regulatory or coding region overlap.
2.2. Retrieving Target Genomic Regions and Haplotypes Reconstruction
Given a reference genome and a set of genomic coordinates, CRISPR-HAWK begins by extracting the corresponding target genomic regions from the reference genome (Figure 1B). To ensure sufficient context for both gRNA identification and downstream analyses, such as on-target efficiency prediction and variant mapping, each region is symmetrically extended by 100 base pairs upstream and downstream. The resulting extended sequences form the basis for integrating genetic variation and reconstructing haplotype-resolved representations of the target loci (Figure 1C). To incorporate genetic diversity, CRISPR-HAWK enriches the extracted sequences with single-nucleotide variants (SNVs) and short insertions/deletions (indels) provided in input VCF files. The tool supports both phased and unphased genotype data. For phased data, CRISPR-HAWK reconstructs haplotypes by creating two separate sequences per individual corresponding to the maternal and paternal alleles and sequentially inserts the appropriate variants. This process yields accurate reconstruction of each individual’s diploid genomic context within the target region. In the case of unphased data, where the chromosomal origin of each allele is unknown, heterozygous and multiallelic positions are initially encoded using IUPAC ambiguity codes which represent sets of possible nucleotides at each site. To resolve these ambiguities, CRISPR-HAWK performs combinatorial enumeration of all feasible haplotypes that could arise from the observed genotypes, only for sequences identified as guide candidates (see Section 2.3). This selective strategy significantly reduces memory consumption and computational time, enabling the analysis of large population-scale datasets, such as gnomAD. To further reduce computational burdens, CRISPR-HAWK collapses biologically equivalent haplotypes using hashing-based deduplication. 
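The unphased branch described above can be illustrated with a minimal sketch. It assumes a candidate sequence has already been encoded with IUPAC ambiguity codes and simply expands it into every concrete sequence, then collapses duplicates through a hash-based container. The function names (enumerate_haplotypes, deduplicate) are hypothetical and are not taken from the CRISPR-HAWK code base; genotype-consistency constraints that a real implementation may enforce are not modeled here.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def enumerate_haplotypes(candidate: str) -> list[str]:
    """Expand an IUPAC-encoded candidate into all concrete sequences.

    Hypothetical helper: the number of sequences grows as the product of the
    ambiguity-set sizes, which is why expansion is restricted to short
    guide candidates rather than whole regions.
    """
    choices = [IUPAC[base] for base in candidate.upper()]
    return ["".join(combo) for combo in product(*choices)]


def deduplicate(sequences: list[str]) -> list[str]:
    """Collapse identical sequences, keeping first-seen order.

    dict keys are hashed, so this is a simple stand-in for the
    hashing-based deduplication mentioned in the text.
    """
    return list(dict.fromkeys(sequences))
```

Restricting the expansion to guide-length windows keeps the combinatorial growth bounded: a 23-nt candidate with two heterozygous biallelic sites yields four sequences, whereas expanding a full region with dozens of such sites would not be tractable.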
Importantly, the haplotype reconstruction process considers the combined effect of SNVs and indels in both phased and unphased contexts, representing each individual’s sequence consistently. Each reconstructed haplotype is annotated with rich metadata including the list of occurring variants, the samples or populations of origin, and, for phased variants, the phasing status. This haplotype-centric approach enables CRISPR-HAWK to identify gRNAs with improved population coverage and individual-level precision.
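For phased data, the reconstruction of the two haplotype copies of one individual can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: variants are given as VCF-style (position, REF, ALT, genotype) tuples with 1-based positions, sites are biallelic, genotypes are phased strings such as "0|1", and variants on the same copy do not overlap. The names apply_variants and phased_haplotypes are hypothetical.

```python
def apply_variants(ref_seq: str, region_start: int,
                   variants: list[tuple[int, str, str]]) -> str:
    """Apply (pos, ref, alt) variants to ref_seq.

    region_start is the 1-based genomic coordinate of ref_seq[0].
    Variants are applied from right to left so that indels do not shift
    the offsets of variants that still have to be applied.
    """
    seq = ref_seq
    for pos, ref, alt in sorted(variants, reverse=True):
        i = pos - region_start
        if seq[i:i + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        seq = seq[:i] + alt + seq[i + len(ref):]
    return seq


def phased_haplotypes(ref_seq: str, region_start: int,
                      calls: list[tuple[int, str, str, str]]) -> tuple[str, str]:
    """Build the two haplotype sequences of one individual.

    calls holds (pos, ref, alt, genotype) with phased biallelic genotypes
    such as "0|1"; allele "1" on a copy means that copy carries ALT.
    """
    haplotypes = []
    for copy in (0, 1):
        carried = [(p, r, a) for p, r, a, gt in calls
                   if gt.split("|")[copy] == "1"]
        haplotypes.append(apply_variants(ref_seq, region_start, carried))
    return haplotypes[0], haplotypes[1]
```

As a usage example, a window "ACGTACGT" starting at coordinate 100 with a heterozygous SNV at 101 (C>T, "0|1") and a homozygous 1-bp deletion at 104 (AC>A, "1|1") gives one copy carrying only the deletion and one copy carrying both variants.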
2.3. Genomic Regions Binary Encoding and Search for Candidate Guides
Guide RNAs are identified by scanning the reconstructed haplotype sequences for the user-specified PAM motif on both forward and reverse strands. This step is computationally intensive, as it requires exhaustive base-by-base comparisons across potentially thousands of haplotypes. To address this, CRISPR-HAWK employs an optimized binary encoding strategy that reduces the computational complexity of guide discovery to linear time with respect to the length of each haplotype (Figure 1D). Each haplotype is converted into a vectorial representation using a 4-bit encoding scheme that supports both standard nucleotides and IUPAC ambiguity codes, allowing for efficient representation of both unambiguous and heterozygous positions. This encoding enables fast bitwise operations for PAM detection and guide extraction, significantly enhancing performance, particularly when processing large and genetically diverse populations. Upon detection of a valid PAM sequence, the adjacent spacer sequence, of user-defined length, is extracted (Figure 1E). For hits on the reverse strand the corresponding sequences are reverse-complemented to preserve consistent orientation. These extracted sequences are hereafter referred to as candidate gRNAs. To capture the impact of genetic variation, CRISPR-HAWK systematically flags all candidate gRNAs that overlap with variant sites. Within each guide, variant-affected nucleotides are displayed in lowercase, while reference-matching bases remain in uppercase, enabling clear distinction between conserved and variable positions. This representation facilitates rapid visual inspection and supports flexible downstream filtering, allowing users to prioritize guides based on their tolerance or sensitivity to genetic variation, depending on the intended application.
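The encoding and scan described in this section can be sketched in a few lines. The sketch assumes one bit per nucleotide (A=1, C=2, G=4, T=8), with each IUPAC code encoded as the bitwise OR of its bases, so that a haplotype position is compatible with a PAM position whenever the bitwise AND of the two codes is nonzero. It also assumes a PAM located 3' of the spacer, as for SpCas9. The exact bit layout and data structures used by CRISPR-HAWK are not specified in the text, and the names below (encode, revcomp, find_guides) are hypothetical.

```python
# One bit per nucleotide; ambiguity codes are the OR of their member bases.
BITS = {"A": 1, "C": 2, "G": 4, "T": 8}
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
ENCODE = {code: sum(BITS[b] for b in bases) for code, bases in IUPAC_SETS.items()}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def encode(seq: str) -> list[int]:
    """Convert a nucleotide/IUPAC string into a list of 4-bit codes."""
    return [ENCODE[b] for b in seq.upper()]


def revcomp(seq: str) -> str:
    """Reverse complement, including IUPAC ambiguity codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_guides(seq: str, pam: str = "NGG", spacer_len: int = 20):
    """Scan both strands for PAM-compatible sites with a 3' PAM.

    Returns (start, strand, spacer, pam_hit) tuples, where start is the
    0-based leftmost forward-strand coordinate of the spacer+PAM site and
    spacer/pam_hit are reported 5'->3' on the strand of the hit.
    """
    pam_bits = encode(pam)
    hits = []
    for strand, s in (("+", seq.upper()), ("-", revcomp(seq))):
        bits = encode(s)
        for i in range(spacer_len, len(s) - len(pam) + 1):
            window = bits[i:i + len(pam)]
            # Nonzero AND at every position means the site can match the PAM.
            if all(b & p for b, p in zip(window, pam_bits)):
                start = i - spacer_len if strand == "+" else len(s) - (i + len(pam))
                hits.append((start, strand, s[i - spacer_len:i], s[i:i + len(pam)]))
    return hits
```

Each position costs a constant number of AND operations, so the scan is linear in the haplotype length for a fixed PAM. A hit on an ambiguous position (for example R against G) only means that at least one underlying allele matches; resolving which concrete haplotypes carry the PAM is left to the enumeration step of Section 2.2.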
2.4. Scoring and Annotating Guides
Accurate evaluation of the designed gRNAs requires assessing their cleavage efficiency, specificity, and functional genomic context. CRISPR-HAWK integrates multiple scoring models (Figure 1F) and a comprehensive annotation framework (Figure 1G) to support informed guide selection.
We distinguish three complementary performance metrics. On-target efficiency predicts the cleavage activity of a guide at its intended target site, estimated using sequence-based machine learning models. Residual on-target activity predicts how effectively a reference-designed guide will cleave variant-containing target sequences, quantifying the impact of genetic mismatches on expected performance. Guide specificity measures selectivity across the genome by aggregating predicted off-target cleavage likelihoods. The following subsections describe each component in detail.
2.4.1. Predicting On-Target Efficiency
Several computational models have been developed to predict on-target cleavage efficiency of gRNAs (Doench et al., 2014; Sherkatghanad et al., 2023). CRISPR-HAWK integrates two widely used scoring models for SpCas9: Azimuth (Doench et al., 2016) and Rule Set 3 (RS3) (DeWeirdt et al., 2022). Azimuth employs a machine learning model trained on more than 4,000 experimentally validated gRNAs targeting coding regions across 17 genes. RS3 extends Azimuth by incorporating additional sequence and thermodynamics features in the model, including poly(T) content, spacer-DNA melting temperature, and minimum free energy of the folded gRNA structure. To support Cas12a applications, CRISPR-HAWK also integrates DeepCpf1 (Kim et al., 2018), a deep learning model trained to estimate the on-target activity of Cpf1 (Cas12a) systems using high-throughput experimental profiling data.
2.4.2. Predicting Residual On-Target Activity Across Variant Haplotypes
A critical challenge in variant-aware gRNA design is predicting how genetic variants at target sites affect the cutting efficiency of reference-designed guides. We address this by repurposing established off-target prediction models as surrogates for residual on-target activity when variants create mismatches. The key insight is that the same biophysical principles governing Cas9 binding and cleavage at mismatched off-target sites also apply when variants introduce mismatches at intended on-target sites.
CRISPR-HAWK implements this approach using two established models: cutting frequency determination (CFD) (Doench et al., 2016) and Elevation (Listgarten et al., 2018). While originally developed to predict off-target cleavage with sequence mismatches, these models quantify the fundamental relationship between sequence complementarity and Cas nuclease activity. When genetic variants alter a target site, they create imperfect base-pairing analogous to off-target mismatches.
For each reference-designed gRNA, CRISPR-HAWK identifies all alternative target sequences arising from genetic variants in the input population. These variant-containing sequences represent individual-specific on-target sites that differ from the reference. The tool then computes CFD and Elevation scores between each reference guide and its alternative targets. Lower scores indicate reduced predicted cleavage efficiency due to variant-induced mismatches, while scores near 1.0 suggest preserved activity despite genetic variation. This framework enables systematic assessment of how population-level genetic diversity may modulate therapeutic efficacy of CRISPR-Cas systems designed from reference genomes.
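To make the scoring idea concrete, the sketch below computes a CFD-style score as the product of position- and mismatch-specific weights between a reference guide and a variant-containing target of equal length. It is only a schematic of the multiplicative form: the published CFD model uses an experimentally derived weight table (and a PAM term) that is not reproduced here, the weights in the usage note are placeholder values chosen for illustration, and the function name residual_activity and the (position, guide base, target base) key layout are assumptions.

```python
def residual_activity(guide: str, target: str,
                      mismatch_weights: dict[tuple[int, str, str], float],
                      pam_weight: float = 1.0) -> float:
    """CFD-style multiplicative score between a guide and a target site.

    mismatch_weights maps (1-based position, guide base, target base) to a
    fraction of retained activity. A perfect match scores pam_weight
    (1.0 by default); any mismatch with weight 0 abolishes predicted activity.
    Indel-containing targets of different length are not handled here.
    """
    if len(guide) != len(target):
        raise ValueError("guide and target must have the same length")
    score = pam_weight
    for pos, (g, t) in enumerate(zip(guide.upper(), target.upper()), start=1):
        if g != t:
            # A missing weight raises KeyError rather than being silently ignored.
            score *= mismatch_weights[(pos, g, t)]
    return score
```

For example, with the placeholder table {(2, "C", "T"): 0.5, (4, "T", "G"): 0.0}, the target "ATGT" scored against guide "ACGT" retains half of the activity, while "ACGG" is predicted to lose it entirely. These numbers illustrate the mechanics only and carry no biological meaning.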
2.4.3. Estimating Guide Specificity
To estimate guide specificity, CRISPR-HAWK integrates CRISPRitz (Cancellieri et al., 2020), a high-performance genome-wide search engine that identifies putative off-target sites allowing a user-defined number of mismatches and DNA/RNA bulges. For computational efficiency, off-target enumeration is performed against the reference genome, providing a baseline estimate of global specificity for each candidate guide. To extend this analysis to population-scale contexts, CRISPR-HAWK generates input files compatible with CRISPRme (Cancellieri et al., 2023), a variant- and haplotype-aware off-target nomination tool. This interoperability enables evaluation of guide specificity across diverse genetic backgrounds.
For each predicted off-target site, CRISPR-HAWK computes CFD and Elevation scores to estimate the likelihood of unintended cleavage. A global specificity score is then calculated for each gRNA by aggregating the CFDs of all its predicted off-targets, allowing users to rank and prioritize guides with minimal off-target potential.
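The text does not state the aggregation formula. One commonly used form, assumed here purely for illustration, is 1 / (1 + sum of off-target CFD scores), which equals 1 for a guide with no predicted off-targets and decreases as off-target cleavage likelihoods accumulate. The function name specificity_score is hypothetical, and the formula may differ from the one implemented in CRISPR-HAWK.

```python
def specificity_score(offtarget_cfds: list[float]) -> float:
    """Aggregate off-target CFD scores into a single specificity value.

    Assumed form: 1 / (1 + sum(CFD)). The on-target site itself is
    expected to be excluded from offtarget_cfds.
    """
    if any(not 0.0 <= cfd <= 1.0 for cfd in offtarget_cfds):
        raise ValueError("CFD scores are expected to lie in [0, 1]")
    return 1.0 / (1.0 + sum(offtarget_cfds))


def rank_guides(guides: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Rank guides from most to least specific under the assumed score."""
    scored = [(name, specificity_score(cfds)) for name, cfds in guides.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Under this form, many weak off-targets and a single strong one can yield the same score, so the per-site table remains necessary for interpreting a ranking.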
2.4.4. Annotating Guides
To support biological interpretation and downstream filtering, CRISPR-HAWK annotates candidate gRNAs with functional genomic context and sequence-derived features. Guides are annotated using user-provided genomic features, such as regulatory regions, coding exons, or disease-associated loci (for example, COSMIC cancer-related annotations; Sondka et al., 2024). Additionally, CRISPR-HAWK computes sequence-derived features that may influence gRNA performance. GC content is calculated for each guide, as it has been shown to affect Cas binding affinity and gRNA stability (Yuen et al., 2017).
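GC content is the fraction of G and C bases in the spacer. A minimal sketch follows; the function name gc_content is hypothetical, and the case-insensitive handling is an assumption motivated by the lowercase notation used for variant-affected positions (Section 2.3). Ambiguity codes, if present, are not counted as G or C.

```python
def gc_content(spacer: str) -> float:
    """Return the fraction of G/C bases in a spacer (case-insensitive)."""
    if not spacer:
        raise ValueError("spacer must not be empty")
    s = spacer.upper()
    return (s.count("G") + s.count("C")) / len(s)
```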
2.5. Reports Generation
CRISPR-HAWK generates a comprehensive set of output files and visualizations summarizing all candidate gRNAs and their associated features (Figures 1H and 2). The main output is a tabular report listing each candidate gRNA with its genomic coordinates, strand, PAM sequence, spacer sequence, predicted efficiency and specificity scores, functional and gene-level annotations, as well as haplotype and sample associations when variant data are provided (Figure 2A). When variant-aware analysis is performed, CRISPR-HAWK optionally produces a table reporting the reconstructed haplotypes for the input genomic region, including the constituent variants, population of origin, and allele frequencies (Figure 2B).
To facilitate data interpretation, the tool provides a series of graphical summaries (Figure 2C). These include comparative plots of on-target efficiency scores for reference-derived versus alternative guide sequences, illustrating the influence of genetic variation on predicted activity. Additional visualizations show the distribution of identified gRNAs by type (reference-only, variant-containing spacer, novel PAM, or both).
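The four guide types shown in these summaries follow from two yes/no questions: does a variant fall in the spacer, and does a variant fall in the PAM. The sketch below derives them from the lowercase convention of Section 2.3, assuming (this is not stated in the text) that the same convention is applied to PAM bases. The function name classify_guide and the label strings are hypothetical.

```python
from collections import Counter


def classify_guide(spacer: str, pam: str) -> str:
    """Classify a guide by where variant-affected (lowercase) bases occur."""
    spacer_variant = any(base.islower() for base in spacer)
    pam_variant = any(base.islower() for base in pam)
    if spacer_variant and pam_variant:
        return "both"
    if pam_variant:
        return "PAM alternative"
    if spacer_variant:
        return "spacer alternative"
    return "reference"


def type_counts(guides: list[tuple[str, str]]) -> Counter:
    """Count guides per type, e.g. as input for a pie chart."""
    return Counter(classify_guide(spacer, pam) for spacer, pam in guides)
```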
If candidate guide coordinates are supplied, CRISPR-HAWK generates a dedicated report restricted to the specified guides, detailing their corresponding alternative sequences arising from genetic variants, with associated residual on-target activity scores (Figure 2D). When the off-target search option is enabled, an off-target summary table is produced, listing all predicted off-target sites for each guide with their genomic positions, mismatch and bulge counts, and CFD/Elevation scores (Figure 2E). For each candidate guide, CRISPR-HAWK also produces a guide activity map that integrates residual on-target activity across variant haplotypes with guide specificity, providing a global view of guide performance across diverse genomic backgrounds (Figure 2F).
3. Results
We evaluated CRISPR-HAWK across clinically relevant and well-characterized genomic regions to assess its ability to design and characterize variant- and haplotype-aware gRNAs using population-scale variant datasets. All analyses were performed using the GRCh38 human genome assembly, enriched with variants from three major population-scale datasets: the 1000 Genomes Project (1000G) (Consortium et al., 2015; Zheng-Bradley et al., 2017), the Human Genome Diversity Project (HGDP) (Bergström et al., 2020), and the genome Aggregation Database (gnomAD) (Karczewski et al., 2020). The 1000G dataset includes whole-genome sequencing data from 2,504 individuals spanning 26 populations, organized into five superpopulations. The HGDP dataset comprises 929 individuals representing 54 populations, grouped into broader ancestries. The gnomAD dataset aggregates high-quality whole-genome sequencing data from 76,215 unrelated individuals, encompassing diverse genetic ancestries.
Using these datasets, we applied CRISPR-HAWK to two complementary tasks. First, we performed de novo guide enumeration and efficiency prediction at the BCL11A +58 erythroid enhancer, the target of sg1617, the clinically optimized guide used in Casgevy (exagamglogene autotemcel), one of the first FDA-approved CRISPR therapies for sickle cell disease and β-thalassemia. This analysis enabled both identification of alternative candidate guides and assessment of how population-level variants affect the predicted activity of sg1617 itself. Second, we quantified the effects of genetic variation on residual on-target activity and guide specificity across a broader panel of guides, including therapeutic gRNAs currently in clinical development and benchmark guides commonly used in off-target characterization studies.
All analyses presented in this paper were performed on a Linux workstation with an AMD Ryzen Threadripper 3970X CPU with 32 cores and 64 GB of RAM.
3.1. Variant-Aware Guide Design on the BCL11A +58 Erythroid Enhancer
To demonstrate CRISPR-HAWK’s functionality, we focused on the BCL11A +58 erythroid enhancer, a regulatory region critical for hemoglobin gene expression (Bauer et al., 2013), targeted by the clinically optimized SpCas9 guide sg1617 (Frangoul et al., 2021; Canver et al., 2015; Wu et al., 2019). Analyzing the enhancer region (Figure 3A), CRISPR-HAWK identified several gRNA candidates on both the reference and alternative genomes (Figure 3B). Specifically, the tool found 295 gRNAs from 1000G and 268 from HGDP, with 40.3% and 34.3% containing genetic variants, respectively. A small subset (4.07% in 1000G; 2.61% in HGDP) originated from novel PAM motifs introduced by genetic variants and appeared exclusively in non-reference haplotypes. In contrast, the broader variant spectrum of gnomAD yielded a markedly higher fraction of guides carrying variants (94.28%), underscoring the importance of large-scale variant databases for comprehensive guide design (Figure 3B). Interestingly, using gnomAD variants CRISPR-HAWK identified four alternative gRNAs for sg1617.
Fig. 3: Variant-aware gRNA design and scoring analysis of the BCL11A +58 erythroid enhancer.
(A) The clinically optimized guide sg1617 targets the +58 erythroid enhancer located within an intronic region of BCL11A. (B) Classification of gRNAs identified across three variant datasets (1000G, HGDP, and gnomAD): reference guides, spacer alternative guides (variants in spacer only), PAM alternative guides (variants creating novel PAM), and guides with variants in both PAM and spacer. (C) On-target efficiency scores (Azimuth) for the top 25 guides at the BCL11A enhancer, comparing reference sequences to haplotype-matched alternative sequences. Guides are ranked by maximum absolute score difference; sg1617 is shown in bold. Point size indicates the number of individuals carrying each variant. (D) Residual on-target activity (CFD) for the top 25 reference-designed guides at the BCL11A enhancer, showing predicted cleavage efficiency on variant-containing target sequences. Guides are ranked by maximum delta between reference and variant targets; sg1617 is shown in bold. Point size indicates the number of individuals carrying each variant. (E) Relationship between residual on-target activity and guide specificity for sg1617 and its variant-containing target sequences. Background shading indicates guide penalty, with green representing optimal combinations of high residual activity and high specificity. Off-target searches were performed with CRISPRme v2.1.7 using 1000G and HGDP variants, allowing up to 6 mismatches and 2 DNA/RNA bulges. (F) Summary of variant-containing target sequences for sg1617, reporting haplotype, population of origin (Pop), allele frequency (AF), sample count, residual on-target activity, on-target efficiency, and guide specificity.
We first assessed the on-target efficiency of guides designed to match specific haplotypes. By comparing the on-target efficiency scores predicted using the Azimuth model across the top 25 guides showing the largest score differences, we observed that specific variant combinations altered predicted activity, either enhancing or reducing efficiency depending on local sequence context (Figure 3C). Notably, a subset of alleles reduced predicted activity below 0.2, suggesting potential functional impairment. For sg1617 specifically, guides redesigned to perfectly match variant-containing sequences showed Azimuth scores comparable to the reference guide, indicating that variants in this region do not inherently compromise guide design potential. Although these predictions derive from computational models and may not fully reflect in vivo performance, they illustrate how genetic variants can modulate gRNA cutting efficiency. Dataset-specific efficiency estimates for the BCL11A enhancer are provided in Supplementary Data Section 1 and Supplementary Figure 1.
The preceding analysis assumes guides are redesigned to match each patient’s haplotype; however, in clinical practice, a single reference-designed guide is typically administered to all patients. To assess how variants affect the residual on-target activity of reference-designed guides, we computed CFD scores between each guide and its variant-containing target sequences (Figure 3D). In this context, variants function analogously to mismatches at off-target sites, potentially reducing or abolishing cleavage. Among the top 25 guides showing the largest CFD deltas, we observed a consistent reduction in predicted cleavage efficiency. Several alternative on-target sites displayed complete loss of activity (CFD = 0). Notably, this includes sg1617, the clinically deployed guide, which showed complete loss of predicted activity (CFD = 0) for two target sequences harboring gnomAD variants, suggesting that the approved therapy may have reduced or absent efficacy in patients carrying these variants.
Many of these variants occur at moderate to high population frequencies, implying that guides designed exclusively on the reference genome may fail to achieve effective editing in certain individuals. As with Azimuth, low CFD values do not imply total loss of cleavage but rather a reduced likelihood of efficient binding or cutting. Dataset-specific CFD results are detailed in Supplementary Data Section 2 and Supplementary Figure 2.
To complete the characterization, we assessed the off-target potential of sg1617 and its alternatives using CRISPRme v2.1.7 (Cancellieri et al., 2023), integrating 1000G and HGDP variants and allowing up to six mismatches and two RNA/DNA bulges (Figures 3E–F). Among the alternative target sequences (Figure 3F), those carrying the haplotypes chr2-60495268-T/G-chr2-60495273-G/A and chr2-60495276-G/A exhibited complete loss of predicted cleavage activity (CFD = 0), whereas the sequence carrying haplotype chr2-60495273-G/A retained moderate activity (CFD = 0.308). In contrast, the alternative target with haplotype chr2-60495283-G/C showed minimal impact on residual on-target activity (CFD = 0.913). Regarding guide specificity, most alternative sequences for sg1617 displayed a modest improvement in predicted specificity (Figure 3E). Notably, haplotype chr2-60495283-G/C maintained both high residual on-target activity (CFD = 0.913) and specificity comparable to the reference guide, whereas haplotypes with reduced on-target activity (CFD = 0) showed improved specificity, likely because the same mismatches that reduce on-target cleavage also reduce off-target binding.
To generalize these findings, we extended the analysis to two loci targeted by the CRISPR-Cas12a system, HBG1 and HBG2 (De Dreuzy et al., 2019). The analysis revealed similar patterns to those observed for BCL11A, where genetic variation modulated guide performance (Supplementary Data Section 3 and Supplementary Figures 3 and 4).
Taken together, these results demonstrate that CRISPR-HAWK enables comprehensive assessment of guide performance across three dimensions: on-target efficiency for haplotype-matched designs, residual activity when reference guides encounter variant targets, and genome-wide specificity.
3.2. Residual On-Target Activity of Therapeutic and Benchmark gRNAs
We next extended the residual on-target activity analysis to additional genomic loci targeted by gRNAs in clinical or preclinical development (Figure 4A), including the TRBC1/TRBC2 (Stadtmauer et al., 2020), HBB (DeWitt et al., 2016; Xu et al., 2019b), HBG1/HBG2 (Métais et al., 2019; De Dreuzy et al., 2019), and CCR5 (Xu et al., 2017; Xu et al., 2019a) genes. To provide broader context, we also included EMX1 and FANCF, benchmark loci commonly employed in off-target assessment studies (Tsai et al., 2015). For each guide, we computed CFD scores between the reference-designed sequence and all variant-containing target sequences identified across the three population datasets.
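To make the scoring step concrete, the sketch below shows the multiplicative structure of a CFD-style score: a PAM weight multiplied by one penalty per mismatched spacer position. The weight function is a deliberately simple placeholder and does not reproduce the published CFD matrix of Doench et al. (2016); all names and the toy guide sequence are hypothetical. It is meant only to show how a single variant in the target can lower, or fully abolish, the predicted activity of a reference-designed guide.

```python
from typing import Callable


def toy_mismatch_weight(position: int, guide_base: str, target_base: str) -> float:
    """Placeholder penalty, NOT the published CFD matrix.

    position is 0-based from the PAM-distal end of a 20-nt spacer;
    PAM-proximal mismatches are penalized more strongly.
    """
    return 1.0 - (position + 1) / 20.0


def cfd_style_score(
    guide: str,
    target: str,
    pam_weight: float = 1.0,
    weight_fn: Callable[[int, str, str], float] = toy_mismatch_weight,
) -> float:
    """Multiply the PAM weight by one penalty per mismatched position."""
    if len(guide) != len(target):
        raise ValueError("guide and target spacers must have equal length")
    score = pam_weight
    for i, (g, t) in enumerate(zip(guide.upper(), target.upper())):
        if g != t:
            score *= weight_fn(i, g, t)
    return score


if __name__ == "__main__":
    guide = "ACGTACGTACGTACGTACGT"          # toy spacer, not a real gRNA
    distal_variant = "TCGTACGTACGTACGTACGT"  # mismatch at position 0
    proximal_variant = "ACGTACGTACGTACGTACGA"  # mismatch at position 19
    print(cfd_style_score(guide, guide))
    print(cfd_style_score(guide, distal_variant))
    print(cfd_style_score(guide, proximal_variant))
```

In an actual analysis, the placeholder would be replaced by the published mismatch and PAM weights, with the reference-designed guide scored against each variant-containing target sequence in turn.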
Fig. 4: Residual on-target activity of therapeutic and benchmark gRNAs across variant-containing target sites.
(A) Summary of analyzed guides, including target genes, Cas nucleases, clinical status, genomic coordinates, and PAM sequences. (B) CFD scores quantifying predicted cleavage efficiency of reference-designed guides on variant-containing target sequences. Each point represents a variant-containing target sequence; point size reflects the number of individuals carrying the corresponding variant(s).
For most analyzed guides, variants reduced predicted cleavage efficiency, with several cases showing a complete loss of activity (CFD = 0) at specific variant-containing target sites (Figure 4B). Notably, variants causing loss of cleavage activity often occurred in haplotypes shared by more than 200 individuals, as observed for HBB_2:5226804 and TRBC2:142801351 (Figure 4B). In contrast, other variants modulated gRNA activity more moderately, resulting in only partial decreases in CFD scores and suggesting a lesser impact on cleavage potential. For some guides, such as those targeting TRBC1, all variant-containing sites retained high predicted activity, indicating relatively low susceptibility to sequence variation at this locus.
Across the seven loci analyzed, 82.5% of reference-designed guides contained at least one variant predicted to modify on-target activity. Collectively, these results emphasize that even clinically validated gRNAs designed on the reference genome may experience substantial variability in predicted editing efficiency when applied to genetically diverse populations.
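A summary statistic of this kind can be derived from per-guide CFD scores as sketched below. The sketch assumes that a guide counts as affected when at least one of its variant-containing target sequences has a CFD score below 1; the exact criterion used to obtain the reported percentage is not restated here, and the function name and toy values are hypothetical.

```python
from typing import Dict, List


def fraction_affected(cfd_by_guide: Dict[str, List[float]], tol: float = 1e-9) -> float:
    """Fraction of guides with at least one variant target scoring below 1.

    cfd_by_guide maps a guide identifier to the CFD scores of the
    reference-designed guide on each variant-containing target sequence.
    Guides with no variant targets (empty list) count as unaffected.
    """
    if not cfd_by_guide:
        raise ValueError("no guides provided")
    affected = sum(
        1 for scores in cfd_by_guide.values() if any(s < 1.0 - tol for s in scores)
    )
    return affected / len(cfd_by_guide)


if __name__ == "__main__":
    # Toy values for illustration only, not results from the paper.
    toy_scores = {
        "guide_a": [1.0, 0.913],
        "guide_b": [],
        "guide_c": [0.0],
        "guide_d": [1.0],
    }
    print(f"{100 * fraction_affected(toy_scores):.1f}% of guides affected")
```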
4. Conclusions
In this study, we introduced CRISPR-HAWK, a computational framework that integrates individual- and population-specific genetic variants and haplotype information into CRISPR-Cas gRNA design and evaluation. By accounting for genetic diversity, CRISPR-HAWK enables variant- and haplotype-aware gRNA design and assessment of three complementary performance metrics: on-target efficiency for haplotype-matched guides, residual on-target activity when reference-designed guides encounter variant targets, and genome-wide specificity, supporting the development of population-aware and personalized genome editing strategies.
Applying CRISPR-HAWK to the BCL11A +58 erythroid enhancer, we demonstrated that naturally occurring variants can substantially alter gRNA sequence composition and predicted editing performance. Integration of population-scale variant data from 1000 Genomes, HGDP, and gnomAD revealed that a large fraction of candidate guides contains variants capable of modifying PAM recognition or spacer complementarity. Critically, analysis of sg1617, the guide used in Casgevy, revealed that certain variant-containing target sequences are predicted to completely abolish cleavage activity (CFD = 0), suggesting that the approved therapy may have reduced efficacy in individuals carrying these variants. Extending this analysis to therapeutic and benchmark gRNAs across seven loci confirmed these findings: 82.5% of reference-designed guides contained at least one variant predicted to modify on-target activity. These results emphasize the need for variant-aware evaluation in translational genome editing pipelines.
Although CRISPR-HAWK characterizes guide performance across these three axes, several limitations should be noted. CRISPR-HAWK relies on computational predictors that, while trained on experimental data, may not fully capture in vivo editing outcomes; experimental validation remains essential for clinical applications. The integrated scoring models were developed and validated for specific Cas nucleases (Azimuth and CFD for SpCas9; DeepCpf1 for Cas12a); performance may vary when applied to other enzymes or engineered variants. The current implementation primarily addresses single nucleotide variants, with limited support for complex indels that may also affect guide performance. Additionally, predictions do not account for chromatin accessibility or local epigenetic context, which can substantially influence editing efficiency at specific loci. Delivery efficiency, which varies across cell types and delivery methods, is also beyond the scope of the current framework. Finally, population variant databases, while extensive, may not capture rare or population-specific variants relevant to individual patients.
Overall, CRISPR-HAWK provides a comprehensive and scalable solution for assessing the functional consequences of genetic variation on CRISPR-Cas targeting. Its ability to incorporate haplotype and population-level variant data represents a methodological advancement toward precision genome editing, facilitating the development of guides optimized for diverse genetic backgrounds and enhancing the safety and efficacy of therapeutic interventions.
Supplementary Material
Supplementary information: Supplementary data are available at Bioinformatics online.
Acknowledgments
The authors thank the members of the InfOmics lab at the University of Verona and of the Pinello lab at Massachusetts General Hospital for their valuable suggestions.
Funding
L.P. was supported by NIH R01HG013618, NIH UM1HG012010, and the Rappaport MGH Research Scholar Award (2024-2029).
Footnotes
Competing interests
No competing interest is declared.
Data Availability
The CRISPR-HAWK source code (v0.1.2) is available at https://github.com/pinellolab/CRISPR-HAWK and https://github.com/InfOmics/CRISPR-HAWK. All data used to perform the analyses and generate the figures presented in this paper, including the CRISPR-HAWK guide reports and the off-target sites identified by CRISPRme for the sg1617 gRNA and its haplotype-matched alternatives, are available at https://doi.org/10.5281/zenodo.18070463. The scripts used to generate the results and figures reported in the manuscript are available at https://github.com/pinellolab/CRISPR-HAWK/tree/main/paper.
References
- Anthon C., Corsi G. I., and Gorodkin J. Crispron/off: Crispr/cas9 on-and off-target grna design. Bioinformatics, 38(24):5437–5439, 2022.
- Anzalone A. V., Randolph P. B., Davis J. R., Sousa A. A., Koblan L. W., Levy J. M., Chen P. J., Wilson C., Newby G. A., Raguram A., et al. Search-and-replace genome editing without double-strand breaks or donor dna. Nature, 576(7785):149–157, 2019.
- Bauer D. E., Kamran S. C., Lessard S., Xu J., Fujiwara Y., Lin C., Shao Z., Canver M. C., Smith E. C., Pinello L., et al. An erythroid enhancer of bcl11a subject to genetic variation determines fetal hemoglobin level. Science, 342(6155):253–257, 2013.
- Bergström A., McCarthy S. A., Hui R., Almarri M. A., Ayub Q., Danecek P., Chen Y., Felkel S., Hallast P., Kamm J., et al. Insights into human genetic variation and population history from 929 diverse genomes. Science, 367(6484):eaay5012, 2020.
- Cancellieri S., Canver M. C., Bombieri N., Giugno R., and Pinello L. Crispritz: rapid, high-throughput and variant-aware in silico off-target site identification for crispr genome editing. Bioinformatics, 36(7):2001–2008, 2020.
- Cancellieri S., Zeng J., Lin L. Y., Tognon M., Nguyen M. A., Lin J., Bombieri N., Maitland S. A., Ciuculescu M.-F., Katta V., et al. Human genetic diversity alters off-target outcomes of therapeutic gene editing. Nature Genetics, 55(1):34–43, 2023.
- Canver M. C., Smith E. C., Sher F., Pinello L., Sanjana N. E., Shalem O., Chen D. D., Schupp P. G., Vinjamur D. S., Garcia S. P., et al. Bcl11a enhancer dissection by cas9-mediated in situ saturating mutagenesis. Nature, 527(7577):192–197, 2015.
- Canver M. C., Joung J. K., and Pinello L. Impact of genetic variation on crispr-cas targeting. The CRISPR journal, 1(2):159–170, 2018.
- Chen S., Francioli L. C., Goodrich J. K., Collins R. L., Kanai M., Wang Q., Alföldi J., Watts N. A., Vittal C., Gauthier L. D., et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature, 625(7993):92–100, 2024.
- Cho S. W., Kim S., Kim Y., Kweon J., Kim H. S., Bae S., and Kim J.-S. Analysis of off-target effects of crispr/cas-derived rna-guided endonucleases and nickases. Genome research, 24(1):132–141, 2014.
- Clement K., Hsu J. Y., Canver M. C., Joung J. K., and Pinello L. Technologies and computational analysis strategies for crispr applications. Molecular cell, 79(1):11–29, 2020.
- Concordet J.-P. and Haeussler M. Crispor: intuitive guide selection for crispr/cas9 genome editing experiments and screens. Nucleic acids research, 46(W1):W242–W245, 2018.
- 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature, 526(7571):68, 2015.
- De Dreuzy E., Heath J., Zuris J. A., Sousa P., Viswanathan R., Scott S., Da Silva J., Ta T., Capehart S., Wang T., et al. Edit-301: an experimental autologous cell therapy comprising cas12a-rnp modified mpb-cd34+ cells for the potential treatment of scd. Blood, 134:4636, 2019.
- DeWeirdt P. C., Sanson K. R., Sangree A. K., Hegde M., Hanna R. E., Feeley M. N., Griffith A. L., Teng T., Borys S. M., Strand C., et al. Optimization of ascas12a for combinatorial genetic screens in human cells. Nature biotechnology, 39(1):94–104, 2021.
- DeWeirdt P. C., McGee A. V., Zheng F., Nwolah I., Hegde M., and Doench J. G. Accounting for small variations in the tracrrna sequence improves sgrna activity predictions for crispr screening. Nature Communications, 13(1):5255, 2022.
- DeWitt M. A., Magis W., Bray N. L., Wang T., Berman J. R., Urbinati F., Heo S.-J., Mitros T., Muñoz D. P., Boffelli D., et al. Selection-free genome editing of the sickle mutation in human adult hematopoietic stem/progenitor cells. Science translational medicine, 8(360):360ra134, 2016.
- Doench J. G., Hartenian E., Graham D. B., Tothova Z., Hegde M., Smith I., Sullender M., Ebert B. L., Xavier R. J., and Root D. E. Rational design of highly active sgrnas for crispr-cas9–mediated gene inactivation. Nature biotechnology, 32(12):1262–1267, 2014.
- Doench J. G., Fusi N., Sullender M., Hegde M., Vaimberg E. W., Donovan K. F., Smith I., Tothova Z., Wilen C., Orchard R., et al. Optimized sgrna design to maximize activity and minimize off-target effects of crispr-cas9. Nature biotechnology, 34(2):184–191, 2016.
- Frangoul H., Altshuler D., Cappellini M. D., Chen Y.-S., Domm J., Eustace B. K., Foell J., de la Fuente J., Grupp S., Handgretinger R., et al. Crispr-cas9 gene editing for sickle cell disease and β-thalassemia. New England Journal of Medicine, 384(3):252–260, 2021.
- Hanna R. E. and Doench J. G. Design and analysis of crispr–cas experiments. Nature biotechnology, 38(7):813–823, 2020.
- Jiang F. and Doudna J. A. Crispr–cas9 structures and mechanisms. Annual review of biophysics, 46:505–529, 2017.
- Kampmann M. Crispri and crispra screens in mammalian cells for precision biology and medicine. ACS chemical biology, 13(2):406–416, 2018.
- Karczewski K. J., Francioli L. C., Tiao G., Cummings B. B., Alföldi J., Wang Q., Collins R. L., Laricchia K. M., Ganna A., Birnbaum D. P., et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581(7809):434–443, 2020.
- Kim H. K., Min S., Song M., Jung S., Choi J. W., Kim Y., Lee S., Yoon S., and Kim H. H. Deep learning improves prediction of crispr–cpf1 guide rna activity. Nature biotechnology, 36(3):239–241, 2018.
- Labun K., Montague T. G., Krause M., Torres Cleuren Y. N., Tjeldnes H., and Valen E. Chopchop v3: expanding the crispr web toolbox beyond genome editing. Nucleic acids research, 47(W1):W171–W174, 2019.
- Lazzarotto C. R., Li Y., Flory A. R., Chyr J., Yang M., Katta V., Urbina E., Lee G., Wood R., Matsubara A., et al. Population-scale cellular guide-seq-2 and biochemical change-seq-r profiles reveal human genetic variation frequently affects cas9 off-target activity. bioRxiv, pages 2025–02, 2025.
- Lessard S., Francioli L., Alfoldi J., Tardif J.-C., Ellinor P. T., MacArthur D. G., Lettre G., Orkin S. H., and Canver M. C. Human genetic variation alters crispr-cas9 on-and off-targeting specificity at therapeutically implicated loci. Proceedings of the National Academy of Sciences, 114(52):E11257–E11266, 2017.
- Listgarten J., Weinstein M., Kleinstiver B. P., Sousa A. A., Joung J. K., Crawford J., Gao K., Hoang L., Elibol M., Doench J. G., et al. Prediction of off-target activities for the end-to-end design of crispr guide rnas. Nature biomedical engineering, 2(1):38–47, 2018.
- Liu W., Li L., Jiang J., Wu M., and Lin P. Applications and challenges of crispr-cas gene-editing to disease treatment in clinics. Precision clinical medicine, 4(3):179–191, 2021.
- Métais J.-Y., Doerfler P. A., Mayuranathan T., Bauer D. E., Fowler S. C., Hsieh M. M., Katta V., Keriwala S., Lazzarotto C. R., Luk K., et al. Genome editing of hbg1 and hbg2 to induce fetal hemoglobin. Blood advances, 3(21):3379–3392, 2019.
- Rees H. A. and Liu D. R. Base editing: precision chemistry on the genome and transcriptome of living cells. Nature reviews genetics, 19(12):770–788, 2018.
- Scott D. A. and Zhang F. Implications of human genetic variation in crispr-based therapeutic genome editing. Nature medicine, 23(9):1095–1101, 2017.
- Sherkatghanad Z., Abdar M., Charlier J., and Makarenkov V. Using traditional machine learning and deep learning methods for on-and off-target prediction in crispr/cas9: a review. Briefings in Bioinformatics, 24(3):bbad131, 2023.
- Sondka Z., Dhir N. B., Carvalho-Silva D., Jupe S., Madhumita N., McLaren K., Starkey M., Ward S., Wilding J., Ahmed M., et al. COSMIC: a curated database of somatic variants and clinical data for cancer. Nucleic acids research, 52(D1):D1210–D1217, 2024.
- Stadtmauer E. A., Fraietta J. A., Davis M. M., Cohen A. D., Weber K. L., Lancaster E., Mangan P. A., Kulikovskaya I., Gupta M., Chen F., et al. Crispr-engineered t cells in patients with refractory cancer. Science, 367(6481):eaba7365, 2020.
- Thakore P. I., Black J. B., Hilton I. B., and Gersbach C. A. Editing the epigenome: technologies for programmable transcription and epigenetic modulation. Nature methods, 13(2):127–137, 2016.
- Tsai S. Q., Zheng Z., Nguyen N. T., Liebers M., Topkar V. V., Thapar V., Wyvekens N., Khayter C., Iafrate A. J., Le L. P., et al. Guide-seq enables genome-wide profiling of off-target cleavage by crispr-cas nucleases. Nature biotechnology, 33(2):187–197, 2015.
- Wu Y., Zeng J., Roscoe B. P., Liu P., Yao Q., Lazzarotto C. R., Clement K., Cole M. A., Luk K., Baricordi C., et al. Highly efficient therapeutic gene editing of human hematopoietic stem cells. Nature medicine, 25(5):776–783, 2019.
- Xu L., Yang H., Gao Y., Chen Z., Xie L., Liu Y., Liu Y., Wang X., Li H., Lai W., et al. Crispr/cas9-mediated ccr5 ablation in human hematopoietic stem/progenitor cells confers hiv-1 resistance in vivo. Molecular Therapy, 25(8):1782–1789, 2017.
- Xu L., Wang J., Liu Y., Xie L., Su B., Mou D., Wang L., Liu T., Wang X., Zhang B., et al. Crispr-edited stem cells in a patient with hiv and acute lymphocytic leukemia. New England Journal of Medicine, 381(13):1240–1247, 2019a.
- Xu S., Luk K., Yao Q., Shen A. H., Zeng J., Wu Y., Luo H.-Y., Brendel C., Pinello L., Chui D. H., et al. Editing aberrant splice sites efficiently restores β-globin expression in β-thalassemia. Blood, The Journal of the American Society of Hematology, 133(21):2255–2262, 2019b.
- Yuen G., Khan F. J., Gao S., Stommel J. M., Batchelor E., Wu X., and Luo J. Crispr/cas9-mediated gene knockout is insensitive to target copy number but is dependent on guide rna potency and cas9/sgrna threshold expression level. Nucleic acids research, 45(20):12039–12053, 2017.
- Zheng-Bradley X., Streeter I., Fairley S., Richardson D., Clarke L., Flicek P., and 1000 Genomes Project Consortium. Alignment of 1000 genomes project reads to reference assembly grch38. Gigascience, 6(7):gix038, 2017.