Abstract
High-fidelity clustered regularly interspaced palindromic repeats (CRISPR)-associated protein 9 (Cas9) variants have been developed to reduce the off-target effects of CRISPR systems at a cost of efficiency loss. To systematically evaluate the efficiency and off-target tolerance of Cas9 variants in complex with different single guide RNAs (sgRNAs), we applied high-throughput viability screens and a synthetic paired sgRNA–target system to assess thousands of sgRNAs in combination with two high-fidelity Cas9 variants HiFi and LZ3. Comparing these variants against wild-type SpCas9, we found that ∼20% of sgRNAs are associated with a significant loss of efficiency when complexed with either HiFi or LZ3. The loss of efficiency is dependent on the sequence context in the seed region of sgRNAs, as well as at positions 15–18 in the non-seed region that interacts with the REC3 domain of Cas9, suggesting that the variant-specific mutations in the REC3 domain account for the loss of efficiency. We also observed various degrees of sequence-dependent off-target reduction when different sgRNAs are used in combination with the variants. Given these observations, we developed GuideVar, a transfer learning-based computational framework for the prediction of on-target efficiency and off-target effects with high-fidelity variants. GuideVar facilitates the prioritization of sgRNAs in the applications with HiFi and LZ3, as demonstrated by the improvement of signal-to-noise ratios in high-throughput viability screens using these high-fidelity variants.
Graphical Abstract
Graphical Abstract.
INTRODUCTION
Clustered regularly interspaced palindromic repeats (CRISPR)/CRISPR-associated protein 9 (CRISPR/Cas9) technology is a widely used tool for genome editing and is currently being tested in clinical trials for therapeutic applications (1–5). Many applications of this technology utilize wild-type Streptococcus pyogenes Cas9 (WT SpCas9) owing to its high on-target activity. However, the off-target effects of the WT SpCas9 have raised critical concerns in scientific and clinical applications (6–9). To address this issue, high-fidelity Cas9 variants have been developed, including eSpCas9 (10), SpCas9-HF1 (11), HypaCas9 (12), evoCas9 (13), xCas9 (14), Sniper-Cas9 (15) and HiFi (16). Despite the improved specificity of these engineered or evolved Cas9 variants, they often sacrifice the editing efficiency compared with WT SpCas9 (17). Recently, two promising Cas9 variants, HiFi and LZ3, have been reported to be associated with enhanced specificity while maintaining on-target activity comparable with WT SpCas9 (16,18). However, the cleavage activities of these high-fidelity variants were only tested in complex with a limited number of single guide RNAs (sgRNAs), and it is unclear if the improvements made by the variants are general for all sgRNAs or are specific to a subset of sgRNAs.
The efficiency and off-target tolerance of the CRISPR/Cas9 system are affected not only by the Cas9 protein, but also by the sequence context of the sgRNA (19–21). We and others have proposed sets of sequence rules for the prediction and optimization of sgRNAs with WT SpCas9 (7,19,22,23). Some of the rules, such as a guide-intrinsic mismatch tolerance (GMT) that can be explained by a kinetic model, are generally applicable to different Cas9 variants (24). However, structural analysis has shown that the mutations introduced into the Cas9 variants significantly alter the interacting patterns between the Cas9 protein and the RNA/DNA heteroduplex (12). Therefore, it is expected that the sequence determinants underlying the sensitivity and specificity of Cas9 variants could be divergent from the rules derived for WT SpCas9 (21).
To systematically explore the variant-specific sequence rules that determine the editing efficiency of Cas9 variants, we applied high-throughput viability screens to evaluate the efficiency of ∼24000 sgRNAs complexed with HiFi, LZ3 or WT SpCas9. Moreover, we measured the off-target tolerance of Cas9 variants in the context of 328 sgRNAs and 1753 target sequences using a synthetic paired sgRNA–target system. These data allowed comprehensive modeling of the efficiency and off-target effect of the variants, which further led to the development of a new machine learning framework for sgRNA design for application with the high-fidelity Cas9 variants.
MATERIALS AND METHODS
Cell culture
DLD-1 (CCL-221), H2171 (CRL-5929) and HEK293T (CRL-3216) cell lines were purchased from the ATCC. DLD-1 and HEK293T cells were maintained in RPMI1640 and Dulbecco’s modified Eagle’s medium (DMEM; Gibco), respectively, supplemented with 10% fetal bovine serum (FBS; Sigma), 1% penicillin–streptomycin (Gibco) at 37°C with 5% CO2. H2171 cells were cultured in HITES medium, supplemented with 5% FBS at 37°C with 5% CO2. All the cell lines were regularly tested and shown to be mycoplasma free.
Plasmid construction
WT SpCas9 expression plasmid lentiCas9-Puro was created by replacing the blasticidin (Blast) resistance gene with the puromycin (Puro) resistance gene from plasmid lentiCas9-Blast (Addgene, #52962). HiFi Cas9 and LZ3 Cas9 constitutive expression plasmids lentiHiFi-Puro and lentiLZ3-Puro were constructed by the Gibson Assembly Site-Directed Mutagenesis approach (NEB, #E2621) using lentiCas9-Puro as the template. Inducible pCW-HiFi-Cas9 and pCW-LZ3-Cas9 plasmids were generated by Gibson assembly site-directed mutagenesis from pCW-Cas9 (Addgene, #50661). For endogenous gene (YAP1 and MTAP) knockout, the sgRNAs were constructed into lentiGuide-Blast vector according to the protocol from Feng Zhang's lab. The lentiGuide-Blast plasmid was generated by replacing the Puro resistance gene in lentiGuide-Puro (Addgene, #52963) with the Blast resistance gene. The sgRNA sequences of YAP1 and MTAP are 5′-TGCCCCAGACCGTGCCCATG-3′ and 5′-TCTGCCCGGGAGCTAAAACG-3′. An sgRNA (5′-GCTTACGATGGAGCCAGAG-3′) targeting the AAVS1 gene was used as the control.
Lentivirus production and titration
For virus packaging, HEK293T cells (4 × 106) were seeded into a 10 cm tissue culture dish containing 10 ml of fresh medium and kept at 37°C overnight. A 4 μg aliquot of endotoxin-free lentiviral plasmid (lentiCas9, lentiHiFi or lentiLZ3), 4 μg of psPAX2 (Addgene, #12260) and 2 μg of pMD2.G (Addgene, #12259) were mixed in 500 μl of Opti-MEM (Gibco) with 30 μl of X-tremeGene HP DNA transfection reagent (Roche, #06366236001) at room temperature for 10 min, and then dropwise added to the 10 cm dish with HEK293T cells. The supernatant containing lentivirus was filtered through a 0.45 μm syringe filter 48 h after transfection. Lentivirus was aliquoted and frozen at –80°C until use. To test endogenous gene knockout, lentivirus production was performed using the same procedure employing the endotoxin-free lentiviral plasmid lentiGuide-sgYAP1 or -sgMTAP.
For virus titration, cells were seeded into 24-well plates with 1 × 105 cells/well for 6–8 h for DLD-1 or overnight for HEK293T cells before adding various volumes of lentivirus. H2171 cells were seeded into a 24-well plate with 5 × 105 cells/well and incubated with various volumes of lentivirus just after seeding. To increase infection efficiency, 8 μg/ml polybrene (Millipore, #TR-1003-G) was supplemented. After a 48 h infection, cells were subjected to selection with Puro (2 μg/ml). Cell viability was tested using CellTiter-Glo® reagent (Promega, #G7572) 3 days after selection and the virus titer was determined based on the cell survival rate.
Stable cell line construction
The colorectal adenocarcinoma cell line DLD-1 and small cell lung cancer cell line H2171 are two model cell lines that can be easily transfected and transduced, and have been successfully utilized in high-throughput CRISPR knockout screens (25–32). We therefore selected DLD-1 and H2171 for construction of stable cell lines with constitutive expression of WT SpCas9 or Cas9 variants. Lentivirus infection was performed with the same procedure as virus titration mentioned above using a similar virus volume for WT SpCas9, HiFi and LZ3. Virus volumes were adjusted accordingly to keep Cas9 expression consistent among WT SpCas9, HiFi and LZ3. After Puro (2 μg/ml) selection, the pooled cells with individual Cas9 were subcultured and expanded for determining Cas9 expression by western blot. For the generation of HEK293T cell lines expressing doxycycline-inducible WT SpCas9 or Cas9 variants, cells were infected with pCW-Cas9, pCW-HiFi or pCW-LZ3 lentivirus followed by Puro selection. Monoclonal HEK293T cells with inducible WT SpCas9 or Cas9 variants were produced by limiting dilution. To verify the knockout efficiency of the stable cell lines, the cells were infected with lentiviral lentiGuide-sgYAP1 or -sgMTAP. The knockout of YAP1 or MTAP in the resulting cells after Blast selection was determined by western blot. An sgRNA targeting the AAVS1 gene was used as the control.
Western blot
Proteins were extracted with RIPA buffer [50 mM Tris–HCl, pH 8.0, 150 mM NaCl, 1% Triton X-100, 0.5% sodium deoxycholate, 0.1% sodium dodecyl sulfate (SDS)] supplemented with 1× proteinase inhibitor cocktail (Roche, #11836153001). Proteins were separated on 4–15% gradient gels (Bio-Rad) and transferred onto 0.45 μm polyvinylidene fluoride (PVDF) membranes, followed by blocking with 5% non-fat milk and incubation with primary antibody against Cas9 (CST, #14697, 1:2000), YAP1 (CST, #14074, 1:2000), MTAP (CST, #4158, 1:2000) or β-actin (Sigma, #A5316, 1:10000). After overnight incubation at 4°C, the membranes were rinsed with Tris-buffered saline–Tween (TBST) and incubated with secondary anti-rabbit or anti-mouse antibody at 1:10 000 for 1–2 h at room temperature. After rinsing, bands were visualized with ECL (Pierce, #32106). β-Actin was used as the loading control.
sgRNA library design
The EpiC library and T1 library were synthesized and constructed as described in our previous publications (22,31). For Tiling library design, we compiled a set of essential transcription factor genes in DLD-1 cells which encode multiple domain proteins, with particular attention on zinc finger proteins. To define the essentiality, we calculated the CRISPR score using the CRISPR knockout screen datasets in DepMap (25) and set the threshold to < –0.5. These proteins are largely not well identified. We manually added several genes encoding proteins with well-identified domains as the positive control genes. All the sgRNAs (19 nt) targeting exons (extent 20 bp up- and downstream) were included as long as there is a protospacer adjacent motif (PAM; NGG). The sgRNAs for the top 50 essential genes (four sgRNAs per gene) in our previous CRISPR knockout screen data were chosen as positive sgRNA controls (31). The 500 sgRNAs targeting intergenic regions derived from the Sabatini dataset (33) were included as the negative sgRNA controls. The redundant sgRNAs and those targeting multiple loci were removed and yielded a final Tiling library with 11762 sgRNAs. Each sgRNA was flanked with 5′ sequence: TATCTTGTGGAAAGGACGAAACACCg and 3′ sequence: GTTTTAGAGCTAGAAATAGCAAGTTAAAAT. This oligo library was synthesized as a pool by Custom Array Inc. (Bothell, WA, USA). The sequences of sgRNAs are provided in Supplementary Table S1.
Plasmid library construction and transformation
We amplified the synthesized oligo library using the following primer pair GGCTTTATATATCTTGTGGAAAGGACGAAACACCG (forward) and CTAGCCTTATTTTAACTTGCTATTTCTAGCTCTAAAAC (reverse) with NEBNext High-Fidelity Master Mix (NEB, #M0531) for 10 cycles. The pooled library cloning and transformation processes were performed according to our previous publication (31). Briefly, the amplified oligo library was purified and ligated into BsmBI-digested LentiGuide-Blast using Gibson assembly. The expression of an sgRNA and the Blast resistance gene is driven under the U6 promoter and the EF1α promoter, respectively. The transformation was performed with 2 μl of the ligation product for each tube of electrocompetent cells (Lucigen, #60242) using a GenePulser (BioRad) according to the manufacturer's protocol, and the cells were plated onto 15 cm plates with carbenicillin selection (50 μg/ml). After 14 h, all the colonies were collected as a pool for plasmid library extraction with Endotoxin-Free Plasmid Maxiprep (Qiagen, #12362).
Pooled screen
The lentiviral libraries were prepared as described previously (31). Before determining the multiplicity of infection (MOI), DLD-1 and HEK293T cells were seeded into 10 cm dishes at the cell density of 6 × 106 cells/dish in 10 ml of fresh medium for 6–8 h and overnight, respectively. H2171 cells were seeded into 6-well plates with 5 × 106 cells/well. For transduction, cells were incubated with different volumes of virus supplemented with 8 μg/ml polybrene for 36–48 h and then selected with Blast (20 μg/ml for DLD-1 and HEK293T, 10 μg/ml for H2171) for 5 days. Cell survival rate was measured by CellTiter-Glo assay and cell counting, and MOIs for different virus volumes were calculated compared with the non-Blast treated cells.
The pooled CRISPR knockout screens were performed using a similar procedure to the MOI testing mentioned in this section. DLD-1 cells with WT SpCas9, HiFi or LZ3 were infected with the EpiC library or Tiling library. H2171 cells with WT SpCas9, HiFi or LZ3 were infected with the EpiC library. Three independent replicates were set for each cell line with a minimum of 2 × 107 cells/replicate aiming for at least 500-fold coverage. Cells were infected with the lentiviral library at an MOI of ∼0.3 for 36–48 h, followed by Blast selection for 5 days. The resulting cells were collected and passaged in fresh medium at the density of 1 × 106 cells/ml for H2171 cells and 1 × 107 cells/15 cm dish for DLD-1 cells. Cells were passaged every 2–3 days and cell number at a minimum of 500-fold coverage of the library was maintained during each passaging. After 20 days of maintenance (23 days for the DLD-1 cell lines with the EpiC library), cells were collected for genomic DNA (gDNA) extraction using the Blood & Cell Culture Midi kit (Qiagen, #13343) according to the manufacturer's instructions.
For off-target evaluation, we performed high-throughput screens in HEK293T cells with doxycycline-inducible WT SpCas9, HiFi or LZ3 using a synthetic dual-target system (22). The library T1 was designed with an sgRNA expression cassette paired with its relevant off-target and on-target sequences arranged in tandem. The screening procedure was performed as the screens in the DLD-1 cells. Cell pellets were collected on day 9 after Cas9 induction for gDNA extraction.
Library amplification and deep sequencing
The sgRNA inserts were obtained by two rounds of polymerase chain reaction (PCR) amplification from gDNA. The first round was performed with primer pairs listed in Supplementary Table S2 with 16 cycles using NEBNext High Fidelity Master Mix (NEB, # M0541) for the screens with EpiC and Tiling libraries and NEBNext Ultra II Q5 Master Mix (NEB, #M0544) for the screens with the T1 library. A 40 μg aliquot of gDNA was used as the template in eight PCRs to achieve ∼500-fold coverage. The resulting products were purified with a PCR purification kit (Qiagen) and 5 μl of purified products were used as the template for the second round of PCR with 12 cycles. The PCR products were gel purified, quantified by Qubit and qPCR, and deep sequenced using Nextseq500 at MDACC-Smithville Next Generation Sequencing Core. Single-end (75 bp) sequencing was conducted for EpiC and Tiling libraries, and paired-end (75 bp) sequencing was used for the T1 library.
CRISPR screen data processing
The pooled CRISPR knockout screen data were processed using MoPAC (https://sourceforge.net/projects/mopac/). Basically, the sequencing reads were first aligned to the sgRNA library and counted. The read counts were then processed with a quality control module, a rank-weighted average algorithm for gene essentiality measurement, a normalization module for removing biases caused by different depths of selection, and the assessment of dropout effects of each sgRNA based on the normal distribution (Z-score).
For the high-throughput off-target screen data generated with a synthetic dual-target system, we adopted the same computational pipeline described in our previous study to call the read counts for five different types of indels and calculate the off–on ratios for each sgRNA–target pair (22).
Sequence feature analysis for efficiency loss with Cas9 variants
To determine the sequence features associated with efficiency loss with HiFi and LZ3, we first extracted sgRNAs that target essential genes and classified them into three different groups based on their dropout effects in WT SpCas9 and the variant screens: the ‘Inefficient’ group (Z-score > –3 in the WT SpCas9 screen), the ‘WT efficient only’ group [Z-score < –3 in the WT SpCas9 screen and > 2-fold standard deviation (SD) difference between WT SpCas9 and the variant screens] and the ‘Both efficient’ group (Z-score < –3 in the WT SpCas9 screen and < 2-fold SD difference between WT SpCas9 and the variant screens). As for HF1, the sgRNAs were also classified into three different groups based on the corresponding indel rates in WT SpCas9 and HF1 screens: the ‘Inefficient’ group (indel rate < 0.6 in WT SpCas9 screen), the ‘WT efficient only’ group (indel rate > 0.6 in WT SpCas9 and > 2-fold SD difference between WT SpCas9 and HF1 screens) and the ‘Both efficient’ group (indel rate > 0.6 in WT SpCas9 screen and < 2-fold SD difference between WT SpCas9 and HF1 screens). We then calculated the log odds ratios of different types of nucleotides at each position along the spacer sequence between the ‘Both efficient’ and ‘WT efficient only’ groups. The sequence logos were generated using the logomaker software (34).
We used the Kullback–Leibler (KL) divergence to assess the importance of different nucleotide positions. Let ,
,
and
represent the frequency of A, T, C and G at the ith position in the ‘Both efficient’ group, and
,
,
and
represent the frequency of A, T, C and G at the ith position in the ‘WT efficient only’ group. The KL divergence was calculated as follows:
![]() |
Larger KL divergence values indicate larger differences between two groups of sgRNAs at a specific nucleotide position and higher importance of the specific position for the efficiency loss with Cas9 variants.
Efficiency prediction for HiFi and LZ3
To predict the on-target efficiency for Cas9 variants using the CRISPR screen data generated in this study, we extracted sgRNAs targeting essential genes in the variant screens with both Tiling and EpiC libraries, which resulted in 9875 sgRNAs in the training dataset. For each sgRNA, we averaged the Z-scores in the HiFi and LZ3 screens as the final measurement for the knockout efficiency for Cas9 variants.
We first tried four conventional machine learning methods: a linear regression model (LR), a support vector regression model with RBF kernel (SVM), a random forest regression model (RF) and a gradient boost tree regression model (GBT). All the methods were implemented through the corresponding modules from the scikit-learn package in Python. Next, we implemented a two-step transfer learning strategy to predict the on-target efficiency of sgRNAs for Cas9 variants. In the pre-training step, we took two recurrent neural network (RNN) models as our source models, which were pre-trained on the large datasets with >50000 sgRNAs in WT SpCas9 and SpCas9-HF1 screens, respectively. The bidirectional long short-term memory (BiLSTM) layers of two RNN models were then frozen to avoid destroying any of the information they contain during future training rounds. In the fine-tuning step, the outputs of BiLSTM layers of the two RNN models together with mono- and dinucleotide sequence features were passed to downstream dense layers, the parameters of which were further fine-tuned with HiFi and LZ3 screen data described above. For the mononucleotide features, the 20 nt sgRNA sequence was binarized into a 4 × 20 two-dimensional array, with 0s and 1s indicating the absence or presence of four different nucleotides (A, T, C and G) at every single position. For the dinucleotide features, the 20 nt sgRNA sequence was binarized into a 16 × 19 two-dimensional array, with 0s and 1s indicating the absence or presence of 16 different dinucleotides (AT, AC, AG, AA, TT, TA, TG, TC, CC, CA, CG, CT, GG, GA, GT and GC) at each of the constitutive positions along the sgRNA sequence. To assess the effects of mono- and dinucleotide features on the performance, we built four different prediction models: (i) without any additional sequence features; (ii) with mononucleotide sequence features; (iii) with dinucleotide sequence features; and (iv) with both mono- and dinucleotide sequence features. We performed 5-fold cross-validation and calculated the Spearman's correlations between predicted scores and experimentally measured dropout effects as evaluation metrics. This process was randomly repeated 10 times to ensure robustness and reliability.
To evaluate the performance of different machine learning methods, we adopted a cross-validation approach by randomly splitting the dataset 10 times into two groups, among which 80% were used as the training set and 20% were used for the testing. The performance was assessed by computing the Spearman correlation between the predicted and observed efficiency scores. For the independent evaluation of our transfer learning-based model which we named GuideVar-on, we compared our model with two other deep learning-based models, DeepHF (35) and DeepSpCas9variants (17), on a dataset generated in the previous study, which includes the on-target indel rates for 59 sgRNAs targeting different genomic loci with different Cas9 variants (18). DeepHF is the combination of the RNN model and important biological features, which predicts the efficiency of sgRNAs for eSpCas9(1.1), SpCas9-HF1 and WT SpCas9. The source codes of DeepHF were downloaded from https://github.com/izhangcd/DeepHF. DeepSpCas9variants is a CNN-based deep learning model that predicts the activity of 16 different types of variants at any target sequence. The source codes of the DeepSpCas9variants were downloaded from https://github.com/NahyeKim/DeepSpCas9variants. Of note, only the models predicting the efficiency for SpCas9-HF1 were used for the comparison. The performance was evaluated as the Spearman correlation between the predicted efficiency score and experimentally measured indel frequency for each sgRNA.
Estimation of off-target effect from dual-target assay
To assess the relative off-target effect of an sgRNA at the off-target site compared with its on-target sequence, we calculated the off–on ratio using the following formula:
![]() |
Where ,
and
indicate the read counts of indels at off-target site only, at on-target site only and at both on- and off-target sites, respectively.
Because HiFi and LZ3 showed highly consistent off-target effect (Figure 3C), we averaged the off–on ratios of HiFi and LZ3 at specific target sequences for a more accurate measurement of off-target effects with the variants.
Figure 3.
Sequence-specific off-target tolerance of Cas9 variants. (A) Schematic plot of the synthetic dual-target system for the evaluation of off-target effects for a specific sgRNA–target pair. (B) Boxplot comparing the off–on ratios of sgRNA–target pairs harboring 1-, 2- or 3-mismatches (MM) among the screens with WT SpCas9, HiFi and LZ3. (C) Pairwise comparison of the off-target rates of sgRNA–target pairs among the screens with WT SpCas9, HiFi and LZ3. The consistency of off-target effects is measured using Pearson correlation. (D) Scatter plot comparing the off–on ratios of all the 1-MM sgRNA–target pairs between WT SpCas9 and the Cas9 variants (VT: average off–on ratio of HiFi and LZ3). The sgRNA–target pairs are assigned to three groups based on the off-target effects with WT SpCas9 and the degree of off-target reduction by the variants. (E) Boxplot showing the variant-mediated off-target reduction at each mismatch position measured as the ratio of off-target effects (off–on ratio) between the variants and WT SpCas9 (VT/WT ratio). The P-value is calculated using the two-tailed Manny–Whitney U-test. (F) The median VT/WT ratios corresponding to different mismatch contexts between the sgRNA and the target DNA. (G) The correlation between off-target effects (VT/WT ratio) and the expected energy barriers caused by the mismatches during R-loop formation. (H) Scatter plot showing the association between the energy barrier caused by a specific mismatch type and the median VT/WT ratio for that mismatch type. rXdY refers to the mismatch type where the sgRNA sequence is X and the target DNA sequence is Y.
As the result, we computed the off–on ratios for 815 sgRNA–target pairs with different mismatch types and positions from our training set. To further measure the degree of off-target reduction by the variants (VT) relative to WT SpCas9, we calculated the VT/WT ratio using the following formula:
![]() |
where ,
and
indicate off–on ratios for HiFi, LZ3 and WT SpCas9, respectively.
Calculation of mismatch-associated energy barrier
To quantify the energy barrier associated with a specific mismatch type between sgRNA and the targeted DNA, we computed the difference of base-stacking energy between DNA–DNA and DNA–RNA hybridization over three nucleotides centered at the mismatch position based on the nucleic acid duplex energy parameters adopted from the previous study (36). To determine the influence of the energy barrier on reducing off-target effects by the variants, we computed the ratio of off-target effects (off–on ratio) between the variants and WT SpCas9 (VT/WT ratio) for each sgRNA–target pair and the corresponding energy barriers. We also calculated the Pearson correlation between the expected energy barrier for a certain mismatch type and the median VT/WT ratio of all sgRNA–target pairs with that type of mismatch.
Off-target prediction for HiFi and LZ3
We model the sequence-specific off-target effect as the combination of four factors: (i) the individual mismatch effect (IME), which is the averaged effect for a certain type of mismatch that occurred at the specific position of the sgRNAs derived from our previous WT SpCas9 screen (22); (ii) the GMT, which measures the off-target tolerance from the intrinsic sequence of an sgRNA and is predictable with a CNN model; (iii) the mismatch type (MT), which describes the type of the mismatch between an sgRNA and the target sequence (the MT is encoded as a 12 bit binary vector representing the occurrence of 12 different mismatch types rAdA, rAdC, rAdG, rCdA, rCdC, rCdT, rGdA, rGdG, rGdT, rUdC, rUdG and rUdT, where the value for the exactly occurring mismatch type is 1 and all the others are 0s); and (iv) the mismatch position (MP), which describes the position of the mismatch between an sgRNA and the target sequence. The MP is encoded as a 20 bit binary vector representing 20 positions along the target sequence, where the value at the mismatch-occurring position is 1 and all the other positions are 0s. We tested a simple linear regression (LR) model as well as three non-linear models including a support vector regression model with RBF kernel (SVM), a random forest regression model (RF) and a gradient boost tree regression model (GBT), to integrate the four factors described above. All the methods were implemented through the corresponding modules from the scikit-learn package in Python. To evaluate the performance of different methods, we adopted a cross-validation approach by randomly splitting the dataset 20 times into two groups, of which 80% were used as the training set and the remaining 20% were used for the testing. The performance was assessed by computing the Spearman correlation between the predicted off-target effects and experimentally measured off–on ratios.
Given sgRNA–target pairs with N (N > 1) mismatches, we first predict the off-target effects for each single-mismatched pair with the machine learning models described above, denoted as ,
,
. The overall off-target effect is computed by incorporating the combinatorial effects (CE) estimated from our previous WT SpCas9 screen data. We denote
as the CE between two mismatches that occur at the ith and the jth nucleotide relative to the PAM (
. The overall combinatorial effect is computed as the geometric mean of all the possible combinations among the N mismatches:
![]() |
The final off-target effect (GuideVar-off) is then computed as the following formula:
![]() |
Considering that there are no off-target prediction methods specifically for Cas9 variants, we sought to compare GuideVar-off with the three state-of-art methods used to predict off-target effects for WT SpCas9. (i) Elevation is a two-layer regression model where the first layer learns to predict the effects of a single mismatch and the second layer learns how to combine single-mismatch effects into a final score (37). The source codes of Elevation were downloaded from https://github.com/Microsoft/Elevation. (ii) CRISPR-Net is a more recent deep learning method using a recurrent convolutional network, which shows superior performance compared with other machine learning approaches (38). The source codes to implement CRISPR-Net were downloaded from https://codeocean.com/capsule/9553651/tree/v1. (iii) MOFF is a model-based off-target predictor that includes three factors corresponding to the multiplication of IME, CE and GMT (22). The source codes to implement MOFF were downloaded from https://github.com/MDhewei/MOFF. We compared the methods on the TTISS dataset, which contains 630 detected off-target sites across 59 sgRNAs with WT SpCas9, HiFi, LZ3 and SpCas9-HF1 (18). For each sgRNA–target pair, we measured the off-target effect as the off–on ratio, which is calculated as the detected off-target reads divided by on-target reads. We adopted the Spearman correlations between the measured and predicted off–on ratios for quantitative evaluations.
sgRNA design for high-fidelity Cas9 variants using GuideVar
To facilitate the rational sgRNA design for the CRISPR applications using high-fidelity Cas9 variants, we built a unified computational framework GuideVar, which consists of the following steps. First, given the sequences of a group of sgRNA candidates, GuideVar will predict the on-target score for each sgRNA using GuideVar-on. Second, GuideVar will map the sgRNAs to the genome to search for potential off-target sites harboring up to five mismatches using CRISPRitz, a software for rapid and high-throughput in silico off-target site identification (39). A final off-target score for each sgRNA is calculated as the logarithm of the sum of the predicted off-target effects for all the potential off-target sites. Third, GuideVar will classify all the sgRNAs into three categories based on the predicted on- and off-target scores: the first class is of high on-target score (>0.5) and low off-target score (<1.0), which will be selected as priority; the second class is of high on-target score but high off-target score (>1.0), which can still be used but is less preferred; and the third class is of low on-target score (<0.5), which is highly recommended to be discarded. The source codes of GuideVar are available at https://github.com/MDhewei/GuideVar.
To test the framework, we applied GuideVar to all the sgRNAs in the EpiC library and compared the average log-fold changes (LFCs) of sgRNAs targeting essential genes and non-essential genes within each category classified by GuideVar. To quantitate the separation, we computed the strictly standardized mean difference (SSMD) between the sgRNAs targeting essential and non-essential genes. A higher SSMD indicates greater separation between the essential and non-essential genes (40).
RESULTS
Guide-specific loss of efficiency revealed by high-throughput knockout screens with Cas9 variants HiFi and LZ3
High-fidelity Cas9 variants have been developed for increased specificity of the Cas9 system. Among over a dozen Cas9 variants available to date, HiFi and LZ3 are two variants that show optimal trade-off between activity and specificity (18). To systematically compare the editing efficiency between WT SpCas9 and Cas9 variants, we performed high-throughput viability screens in cell lines with stable expression of WT SpCas9, HiFi or LZ3. A colorectal adenocarcinoma cell line DLD-1 and a small cell lung cancer cell line H2171 were individually engineered with WT SpCas9 or the two Cas9 variants by lentiviral transduction. Western blot confirmed equivalent expression of the three types of Cas9 and effective gene knockouts in all six engineered cell lines (Supplementary Figure S1). The procedures of high-throughput viability screens are demonstrated in Figure 1A and detailed in the Methods. Here, we used two independent lentiviral plasmid libraries, each containing ∼12000 sgRNAs (Supplementary Figure S2). The first library, named EpiC, contains sgRNAs targeting ∼1000 epigenetic regulators and cancer-related genes, as well as control sgRNAs targeting core essential and non-essential genes (31). The second is a Tiling sgRNA library that contains all sgRNAs targeting the exons of 26 essential transcription factors required for the proliferation and survival of DLD-1 cells. Of the six engineered cell lines, the DLD-1 cells were transduced with either the EpiC or Tiling libraries, and the H2171 cells were transduced with the EpiC library. After an ∼20 day selection, the sgRNA sequences were amplified, sequenced and counted for abundance relative to plasmid DNA (Supplementary Tables S3–S5).
Figure 1.
Performance of high-fidelity Cas9 variants in high-throughput CRISPR/Cas9 viability screens. (A) Experimental procedures of CRISPR screens using WT SpCas9, HiFi and LZ3 in combination with different sgRNA libraries in DLD-1 and H2171 cells. Schematic of constructs used for the expression of WT SpCas9 and the variants (upper), and sgRNAs (bottom) are boxed. (B) Pairwise comparison of the LFCs of sgRNAs among the screens using WT SpCas9, HiFi or LZ3 in DLD-1 cells with the EpiC library. The consistency of dropout effects was measured using Pearson correlation. (C) Scatter plot comparing the LFCs of sgRNAs targeting non-essential genes in WT SpCas9 and variant screens in DLD-1 cells with the EpiC library. The sgRNAs were classified into three categories: ‘No depletion’ (gray dots); ‘Both depletion’ (red dots) for both WT SpCas9 and the variants; and ‘WT depletion only’ (blue dots). (D) Boxplot showing the predicted off-target effects (MOFF score) of the sgRNAs targeting non-essential genes in each category in (C). (E) Scatter plot comparing the LFCs of sgRNAs targeting essential genes in WT SpCas9 and variant screens in DLD-1 cells with the EpiC library. (F) Boxplot showing the predicted off-target effects of the sgRNAs targeting essential genes in each category in (E). (G) The proportion of the sgRNAs targeting non-essential or essential genes, for each sgRNA category in (C) and (E).
We first conducted quality control analysis based on the correlation of independent replicates and the dropout effects corresponding to sgRNAs targeting essential genes and non-essential control loci (Supplementary Figuresthe S3–S5). Our results confirmed reliability and consistency of the screening data in this study. Next, we tested the consistency of dropout effects among the screens with WT SpCas9, HiFi and LZ3, based on pairwise Pearson correlations of the LFCs of sgRNA abundance (Figure 1B; Supplementary Table S6). In DLD-1 cells transduced with the EpiC library, HiFi and LZ3 showed a high correlation (r = 0.93), whereas the correlations decreased when comparing HiFi and LZ3 against WT SpCas9 (r = 0.87 and 0.83, respectively). Upon further examination, we found that a fraction of sgRNAs was associated with significantly less LFCs when HiFi and LZ3 were used, as compared with the screen with WT SpCas9 (Figure 1B). This observation was reproducible in DLD-1 cells transduced with the Tiling library and in H2171 cells transduced with the EpiC library (Supplementary Figure S6A, B; Supplementary Tables S7 and S8). Thus, the correlation analysis suggests a guide-specific variant-dependent cell dropout effect.
We reason that two factors may account for the decreased dropout effects with Cas9 variants. First, the variants reduce the off-target effect which could lead to a defect in cell viability. Second, some sgRNAs can be associated with loss of efficiency when targeting essential genes using the variants. To dissect these two factors, we examined the sgRNAs in the EpiC library that target non-essential and essential genes individually. Considering the high correlation between HiFi and LZ3, we averaged the LFCs of sgRNA abundance corresponding to the two variants in DLD-1 cells for further analysis. For the sgRNAs targeting non-essential genes, 151 (1.8%) out of 8224 showed a high dropout effect (LFC < –1) in the WT SpCas9 screen (Figure 1C). As expected, the dropped-out sgRNAs are associated with high off-targeting potential measured by the MOFF score (22), indicating that a small fraction of sgRNAs leads to a viability defect of cells independent of the function of their target genes due to the off-target effects (Figure 1D). Among them, 88 (58.3%) showed decreased dropout effects (>2-fold difference in sgRNA abundance between the variants and WT SpCas9 screens) in HiFi and LZ3 screens, suggesting that the majority of the off-target effects were reduced by the variants. For the sgRNAs targeting essential genes, 2021 (50.7%) of 3990 showed dropout effects (LFC < –1) in the WT SpCas9 screen. Among them, 379 (18.8%) sgRNAs are associated with decreased dropout effects (>2-fold difference) in the variant screens compared with the WT SpCas9 screen (Figure 1E, blue dots). Interestingly, those sgRNAs depleted only in the WT SpCas9 screen are associated with similar MOFF scores to those without a dropout effect, suggesting that the difference in dropout effects is mainly caused by the loss of efficiency of the sgRNAs in variant screens instead of the reduction of off-target effects (Figure 1F). Indeed, among all the sgRNAs in the EpiC library that were depleted only in the WT SpCas9 screen, nearly 80% target essential genes where on-target functional perturbations lead to phenotypic changes (Figure 1G). We observed similar results to the EpiC library in H2171 cells and to the Tiling library in DLD-1 cells (Supplementary Figure S6C–E). Taken together, these lines of evidence indicate that sgRNA-specific loss of efficiency is the major causal factor that accounts for the difference in dropout effects between the screens with Cas9 variants and that with WT SpCas9.
Sequence features that contribute to the loss of efficiency with Cas9 variants
Next, we asked if the nucleotide sequence context of sgRNAs contributes to the loss of efficiency in Cas9 variant screens. To answer this question, we combined the sgRNAs targeting essential genes in the EpiC and Tiling libraries and categorized them into three groups: ‘Inefficient’ (not efficient in either WT SpCas9 or Cas9 variant screens), ‘WT efficient only’ (efficient only in the WT SpCas9 screen) and ‘Both efficient’ (efficient in both WT SpCas9 and Cas9 variant screens). We then computed the log odds ratios of spacer nucleotide frequency at positions 1–19 between the ‘Both efficient’ and ‘WT efficient only’ groups. We observed highly consistent sequence features between HiFi and LZ3 (Figure 2A, B; Supplementary Figure S7). Of note, the identified sequence features are also consistent for sgRNAs with or without a mismatch at the 5′-appended ‘g’ nucleotide (Supplementary Figure S8), which has been reported to impact the activity of sgRNAs in complex with Cas9 variants (41). We further extended the analysis to an orthogonal dataset generated through direct measurement of indel frequency induced by WT SpCas9 or HF1, another high-fidelity Cas9 variant (35). Despite different types of Cas9 variants and experimental settings, many sequence features are reproducible among HiFi, LZ3 and HF1 (Figure 2C; Supplementary Figure S7), suggesting a common mechanism underlying the sequence-dependent loss of efficiency among the three variants. To determine the importance of nucleotide positions, we calculated the KL divergence that measures the difference in nucleotide distribution at each position between the ‘Both efficient’ and ‘WT efficient only’ groups. We found that sequences in three regions significantly contribute to the loss of efficiency: (i) positions 2–5 in the seed region proximal to the PAM; (ii) positions 7–11, excluding position 9, at the junction of the seed and non-seed regions; and (iii) positions 15–18 in the non-seed region (Figure 2D).
Figure 2.
Sequence features associated with efficiency loss with Cas9 variants. (A and B) Left: scatter plot showing the categorization of sgRNAs into three groups: ‘Inefficient’ (gray), ‘Both efficient’ (blue) and ‘WT efficient only’ (red), based on the screens with (A) HiFi and (B) LZ3. Right: the log odds ratios of nucleotide frequency between ‘Both efficient’ and ‘WT efficient only’ groups. The data were retrieved by combining all sgRNAs targeting essential genes in the EpiC and Tiling libraries. (C) Left: scatter plot showing the three sgRNA groups defined based on the indel rates in a published dataset using WT SpCas9 and the Cas9 variant HF1 (35). Right: the log odds ratios of nucleotide frequency between the groups of ‘Both efficient’ and ‘WT efficient only’. (D) The KL divergence representing the significance of nucleotide difference between the ‘WT efficient only’ and ‘Both efficient’ groups at each position of the spacer. The KL divergence was averaged among the variants HiFi, LZ3 and HF1. (E) Upper: the point mutations (red asterisks) introduced into HiFi, LZ3 and HF1. Bottom: the structural representation of local interactions between the RNA/DNA heteroduplex and REC3 domain of WT SpCas9 (PDB: 7S4U). Positions 9–11 interacting with residues are marked in blue, positions 15–17 interacting with residues are marked in red and the residues mutated in any of the three Cas9 variants are marked with asterisks.
Previous reports indicated that sgRNA efficiency with WT SpCas9 is highly associated with the nucleotide sequence in the seed region (3,7,42). With regard to the variants, however, our results showed that the sequence context in the non-seed region also contributes to the variant-specific editing efficiency. We hypothesized that the observed sequence features in the non-seed region are associated with the structural alteration mediated by the mutations introduced into the variants. Comparing the mutations in HiFi, LZ3 and HF1, we found that all three variants harbor mutations in REC3, a non-catalytic domain that targets complementarity and governs the HNH nuclease to regulate catalytic competence (Figure 2E, upper) (11,16,18,43). Interestingly, the structural analysis revealed that positions 9–11 and 15–17, but not 12–14, interact with the REC3 domain of WT SpCas9 protein (Figure 2E, bottom) (44). Of note, it has been reported that mutations in REC3 can disrupt the interaction between REC3 and the RNA/DNA heteroduplex, while the interaction is required for the catalytic function (12,43,45,46). Therefore, our results support a model in which specific spacer sequences at positions 15–18 and/or 9–11 mediate the activity of Cas9 variants via conformational regulation of REC3–RNA/DNA interaction.
Sequence-dependent off-target tolerance of Cas9 variants
Despite Cas9 variants being able to greatly reduce off-target effects compared with WT SpCas9, some sgRNAs targeting non-essential genes showed significant dropout effects in our viability screens when HiFi and LZ3 were used, corresponding to high off-target tolerance (Figure 1C, D). To systematically investigate the sequence-specific off-target tolerance of the Cas9 variants, we evaluated the off-target effects of 328 sgRNAs using a synthetic paired sgRNA–target system that allows high-throughput measurement of on-target and off-target indel rates in parallel (Figure 3A; Supplementary Table S9) (22). For each sgRNA, we designed seven off-target sequences, comprising three targets with one mismatch (1-MM), three with two mismatches (2-MM) and one with three mismatches (3-MM). Consistent with the viability screens (Figure 1B; Supplementary Figure S6A, B), the on-target indel rates of HiFi and LZ3 showed a high correlation, whereas the correlations between the variants and WT SpCas9 were much lower (Supplementary Figure S9). In general, the off-target effects, as measured by off–on ratios, were significantly lower for the two Cas9 variants compared with WT SpCas9 (Figure 3B; Supplementary Figure S10; Supplementary Table S10), confirming the overall reduced off-target tolerance of Cas9 variants. The off-target rates of HiFi and LZ3 were also highly correlated, in comparison with the reduced correlation between the variants and WT SpCas9 (Figure 3C).
Upon further examination, we found that 317 sgRNA–target pairs were associated with different degrees of reduction of off-target effects when the variants were used (Figure 3D). To quantify the reduction of off-target effects by the variants, we computed the ratio of off-target effects between the variants and WT SpCas9 (VT/WT ratio) for 317 1-MM sgRNA–target pairs that are associated with strong off-target effects with WT SpCas9 (off–on ratio > 0.5). We then asked if the position of the mismatch accounts for the variation of the VT/WT ratio. Interestingly, we observed significantly lower VT/WT ratios when a mismatch occurs at positions 15–18 of the RNA/DNA duplex that interacts with the REC3 domain of Cas9 (Figure 3E). This observation suggests that the variant-specific mutations of REC3 and the mismatches at positions 15–18 can be synergistic in blocking Cas9-mediated DNA cleavage. In addition to the mismatch positions, we also found that the nucleotide contexts of RNA–DNA mismatches are associated with different degrees of VT/WT ratios (Figure 3F). In previous studies, we and others have reported that an RNA–DNA mismatch leads to an energy barrier during R-loop progression, where the quantity of the energy barrier is predictable from the context of the mismatch and its adjacent nucleotides (22,36,47,48). Therefore, we computed the expected energy barrier based on the sequences of the 317 sgRNA–target pairs and found that those pairs with greater energy barriers are associated with stronger reduction of off-target effects by the variants (Figure 3G). Among the 12 single mismatch types, rUdC (i.e. uracil in RNA and cytosine in DNA), rCdC, rAdG, rAdC and rAdA are associated with greater energy barriers and smaller VT/WT ratios (Figure 3H), suggesting that the variants are more effective in overcoming the off-target effects at the genomic off-target sites harboring these mismatches. Taken together, these lines of evidence indicate that the degree of off-target reduction by the variants is dependent on the position and nucleotide context of the mismatches.
GuideVar facilitates optimal sgRNA selection for Cas9 variants
We next sought to predict the sequence-specific on-target efficiency and off-target effects for HiFi and LZ3 using the datasets generated in this study. Over the past years, machine learning models have been developed and trained to predict on-target efficiency and off-target effects of WT SpCas9 and a few Cas9 variants based on a large volume of datasets (17,22,35,49). To take advantage of these pre-trained models, we developed a transfer learning framework, named GuideVar, by which previous models on WT SpCas9 and HF1 are ‘transferred’ to new models to enhance the predictive power on HiFi and LZ3.
GuideVar consists of two modules: GuideVar-on for the prediction of on-target efficiency, and GuideVar-off for the prediction of an off-target effect. In GuideVar-on, we transferred two long short-term memory (LSTM) models in DeepHF that were trained on >50000 sgRNAs (35). As shown in Figure 4A, the flattened layers of the LSTM models were concatenated, together with the mono- and dinucleotide sequence features (see the Methods; Supplementary Figure S11), as the inputs of a multilayer neural network that was trained using the HiFi and LZ3 data in this study. Based on 10-fold cross-validation, the transfer learning model significantly improved the predictive power of on-target efficiency compared with the two source models in DeepHF and other machine learning methods (Figure 4B). We further evaluated the performance of GuideVar-on using an independent TTISS dataset in which the on-target indel rates were measured for 59 sgRNAs with the expression of HiFi or LZ3 (18). As a result, GuideVar-on outperformed two recent deep learning-based models for the prediction of sgRNA efficiency (Figure 4C; Supplementary Table S11), suggesting its robustness for applications with HiFi and LZ3.
Figure 4.
GuideVar predicts on-target efficiency and off-target effects of HiFi and LZ3 using transfer learning. (A) Schematic representation of GuideVar-on for predicting on-target efficiency. (B) Performance comparison of different machine learning models using 10-fold cross-validation. The performance was assessed based on the Spearman correlation between the predicted and observed efficiency scores. (C) Performance comparison between GuideVar-on and the other two deep learning methods for HiFi and LZ3 based on an independent TTISS dataset (18). (D) Schematic representation of GuideVar-off for predicting off-target effects. (E) Performance comparison between GuideVar-off and other off-target prediction tools for HiFi and LZ3 based on the TTISS dataset. (F) The distribution of all the sgRNAs in the EpiC library with on-target scores predicted by GuideVar-on and off-target scores by GuideVar-off. These sgRNAs are categorized into three groups: ‘Low efficiency’, ‘High efficiency high off-target’ (high on high off) and ‘High efficiency low off-target’ (high on low off). (G) Boxplot comparing the dropout effects of the sgRNAs targeting essential (blue) and non-essential (red) genes among different sgRNA categories in the screen with HiFi. SSMD scores were computed to measure the effect size of each sgRNA category defined in (F). (H) Effect size of the EpiC library before and after sgRNA selection using GuideVar, in the screens with WT SpCas9, HiFi or LZ3. The effect size is measured in SSMD.
Recently, we developed MOFF, an off-target predictive model for WT SpCas9 (22). MOFF integrates multiple sequence-dependent factors, including IME parameterized with a matrix and a GMT modeled using a CNN. We note that the degree of off-target reduction by the variants is dependent on the mismatch position (MP) and mismatch type (MT) (Figure 3E, F). Therefore, GuideVar-off incorporates MP and MT, together with the quantitative measures of IME and GMT from MOFF, as the inputs for model learning (Figure 4D). Based on cross-validation, we tested a panel of linear or non-linear machine learning models, including LR, RF, SVM and GBT. Among them, GBT achieved the best performance (Supplementary Figure S12), thus it was implemented in GuideVar-off. We observed progressive improvement in predictive power when the four types of sequence features (GMT, IME, MT and MP) were successively added to the model (Supplementary Figure S13), suggesting that all the features are associated with off-target effects of the variants. To predict off-target effects when multiple mismatches are present, GuideVar-off adopted the δ coefficients in MOFF to compute the combinatorial effect of multiple mismatches (22). We compared the performance of GuideVar-off with other state-of-the-art methods based on the TTISS dataset (18), where GuideVar-off consistently outperformed other models for both HiFi and LZ3 (Figure 4E; Supplementary Table S12).
Finally, we built a unified computational framework that integrates the GuideVar-on and GuideVar-off for the selection of optimal sgRNAs in applications with HiFi and LZ3. To test the framework, we applied GuideVar to all the sgRNAs in the EpiC library (Figure 4F), resulting in 7553 (61.8%) sgRNAs that were predicted to be of high efficiency and low off-target effects, as well as 681 (5.6%) highly off-targeting sgRNAs and 3980 (32.6%) inefficient sgRNAs. For each category of sgRNAs, we compared the dropout effects of the sgRNAs that target essential genes and non-essential genes and calculated the SSMD, a measure of the effect size of sgRNA libraries (40,50,51). As expected, the sgRNAs in the high-efficiency low-off-target group showed the highest effect size (SSMD = 1.73 and 1.64 for HiFi and LZ3, respectively) (Figure 4G; Supplementary Figure S14). The high-fidelity Cas9 variants are rarely used in high-throughput CRISPR screens due to the compromise of efficiency. Consistently, we observed lower SSMD scores for HiFi and LZ3 when the entire EpiC library was used for screening. However, with the aid of GuideVar, the selected subset of sgRNAs showed an almost equivalent effect size among WT SpCas9, HiFi and LZ3 (Figure 4H). Therefore, GuideVar will potentially facilitate the usage of high-fidelity Cas9 variants for high-throughput CRISPR screens when an off-target effect is a critical concern. For the convenience of users, we computed GuideVar scores for all the sgRNAs targeting the exomes of human, mouse and monkey genomes. This resource will facilitate broader application of high-fidelity Cas9 variants aided by sgRNA prioritization.
DISCUSSION
High-fidelity Cas9 variants have been developed to reduce the off-target effect of the CRISPR/Cas9 system, but the variants could also compromise the on-target efficiency, which limits their applications. In this study, we found that the efficiency loss of HiFi and LZ3 is guide specific and dependent on the sequence of the spacer. Importantly, the sequence context at positions 15–18 relative to the PAM is highly associated with the efficiency loss. Because nucleotides at positions 15–18 of the RNA/DNA heteroduplex interact with the REC3 domain of Cas9 where mutations are introduced into the variants, our results suggest a variant-specific sequence rule for on-target efficiency. Of note, most Cas9 variants developed to date harbor mutations in the REC3 domain (12,13,43). We observed similar sequence rules on HiFi and LZ3 to those of HF1, another REC3 mutant variant, using a dataset generated from an orthogonal study (Figure 2A–C). Therefore, the sequence rules derived from this study are likely to be applicable to other Cas9 variants. Despite the elaborate engineering of Cas9 variants, we also observed various degrees of off-target reduction by the variants over a large panel of sgRNAs and targets. The guide-specific off-target reduction is dependent on the position and nucleotide context of the mismatch. In particular, a mismatch at positions 15–18 is associated with effective off-target reduction by the variants, suggesting synergistic effects between REC3 mutations and those RNA–DNA mismatches at positions 15–18. In future study, it would be interesting to explore the structural conformations with REC3 mutations and the RNA–DNA mismatches for in-depth understanding of the mechanism underlying the synergistic effects. Together, these findings not only elucidate the guide-specific loss of efficiency and off-target reduction, but also suggest the feasibility of optimizing sgRNA design by the implementation of sequence rules for Cas9 variants.
Given the observations of guide-specific loss of efficiency and off-target reduction by the variants, it is expected that the high-fidelity variants could effectively reduce off-target effects with little sacrifice of efficiency for a subset of sgRNAs. We developed GuideVar to facilitate guide prioritization in applications with HiFi and LZ3. In GuideVar, we introduced a transfer learning framework by which previous machine learning models are transferred to the new model to enhance the predictive power. We demonstrated the utilization of GuideVar via a high-throughput screen, in which a computationally selected subset of sgRNAs exhibited improved performance with the Cas9 variants, where the effect size is equivalent to or even better than WT SpCas9 (Figure 4H). Therefore, we anticipate that GuideVar holds potential to broaden the application of the high-fidelity Cas9 variants in scientific and clinical settings. Moreover, the success of transfer learning in our study suggests its future utilization for sgRNA optimization, considering the rapidly growing body of CRISPR datasets and computational models.
Supplementary Material
ACKNOWLEDGEMENTS
We thank Dr Jianjun Shen, MDACC-Smithville Molecular Biology Core and MDACC-Smithville Next Generation Sequencing Core for help with sequencing.
Author contributions: H.X., L.Z. and W.H. conceptualized the study. L.Z. and R.F. performed the experiments. W.H. performed computational analysis. H.X., W.H. and L.Z. provided data interpretation. H.X. supervised the project. All authors participated in writing the manuscript.
Contributor Information
Liang Zhang, Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
Wei He, Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
Rongjie Fu, Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
Shuyue Wang, Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
Yiwen Chen, Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
Han Xu, Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA; Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA; The Center for Cancer Epigenetics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
DATA AVAILABILITY
The raw sequencing data (fastq files) generated in this study are deposited at the Sequence Read Archive (SRA) under the BioProject accession number PRJNA977711. The GuideVar scores of all the sgRNAs targeting the exomes within genomes of human, mouse and monkey are deposited at FigShare and are freely accessible at https://figshare.com/s/721551365ff2c4b92976.
CODE AVAILABILITY
The source codes of GuideVar are available at https://github.com/MDhewei/GuideVar.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
The Cancer Prevention and Research Institute of Texas (CPRIT) [RR160097 to H.X.]; the National Institutes of Heatlh (NIH)[R35GM137927 to H.X., and R01GM130838 and R01NS117668 to Y.C.]. H.X. is a CPRIT Scholar in Cancer Research.
Conflict of interest statement. None declared.
REFERENCES
- 1. Cho S.W., Kim S., Kim J.M., Kim J.S.. Targeted genome engineering in human cells with the Cas9 RNA-guided endonuclease. Nat. Biotechnol. 2013; 31:230–232. [DOI] [PubMed] [Google Scholar]
- 2. Cong L., Ran F.A., Cox D., Lin S.L., Barretto R., Habib N., Hsu P.D., Wu X.B., Jiang W.Y., Marraffini L.A.et al.. Multiplex genome engineering using CRISPR/Cas systems. Science. 2013; 339:819–823. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Jinek M., Chylinski K., Fonfara I., Hauer M., Doudna J.A., Charpentier E.. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science. 2012; 337:816–821. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Mali P., Yang L.H., Esvelt K.M., Aach J., Guell M., DiCarlo J.E., Norville J.E., Church G.M.. RNA-guided human genome engineering via Cas9. Science. 2013; 339:823–826. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Rafii S., Tashkandi E., Bukhari N., Al-Shamsi H.O.. Current status of CRISPR/Cas9 application in clinical cancer research: opportunities and challenges. Cancers. 2022; 14:947–960. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Fu Y.F., Foden J.A., Khayter C., Maeder M.L., Reyon D., Joung J.K., Sander J.D.. High-frequency off-target mutagenesis induced by CRISPR-Cas nucleases in human cells. Nat. Biotechnol. 2013; 31:822–826. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Hsu P.D., Scott D.A., Weinstein J.A., Ran F.A., Konermann S., Agarwala V., Li Y.Q., Fine E.J., Wu X.B., Shalem O.et al.. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol. 2013; 31:827–832. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Tsai S.Q., Zheng Z., Nguyen N.T., Liebers M., Topkar V.V., Thapar V., Wyvekens N., Khayter C., Iafrate A.J., Le L.P.et al.. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat. Biotechnol. 2015; 33:187–197. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Kim D., Bae S., Park J., Kim E., Kim S., Yu H.R., Hwang J., Kim J.I., Kim J.S. Digenome-seq: genome-wide profiling of CRISPR-Cas9 off-target effects in human cells. Nat. Methods. 2015; 12:237–243. [DOI] [PubMed] [Google Scholar]
- 10. Slaymaker I.M., Gao L.Y., Zetsche B., Scott D.A., Yan W.X., Zhang F.. Rationally engineered Cas9 nucleases with improved specificity. Science. 2016; 351:84–88. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. Kleinstiver B.P., Pattanayak V., Prew M.S., Tsai S.Q., Nguyen N.T., Zheng Z.L., Joung J.K.. High-fidelity CRISPR-Cas9 nucleases with no detectable genome-wide off-target effects. Nature. 2016; 529:490–495. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Chen J.S., Dagdas Y.S., Kleinstiver B.P., Welch M.M., Sousa A.A., Harrington L.B., Sternberg S.H., Joung J.K., Yildiz A., Doudna J.A.. Enhanced proofreading governs CRISPR-Cas9 targeting accuracy. Nature. 2017; 550:407–410. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Casini A., Olivieri M., Petris G., Montagna C., Reginato G., Maule G., Lorenzin F., Prandi D., Romanel A., Demichelis F.et al.. A highly specific SpCas9 variant is identified by in vivo screening in yeast. Nat. Biotechnol. 2018; 36:265–271. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Hu J.H., Miller S.M., Geurts M.H., Tang W.X., Chen L.W., Sun N., Zeina C.M., Gao X., Rees H.A., Lin Z.et al.. Evolved Cas9 variants with broad PAM compatibility and high DNA specificity. Nature. 2018; 556:57–63. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Lee J.K., Jeong E., Lee J., Jung M., Shin E., Kim Y.H., Lee K., Jung I., Kim D., Kim S.et al.. Directed evolution of CRISPR-Cas9 to increase its specificity. Nat. Commun. 2018; 9:3048–3057. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Vakulskas C.A., Dever D.P., Rettig G.R., Turk R., Jacobi A.M., Collingwood M.A., Bode N.M., McNeill M.S., Yan S.Q., Camarena J.et al.. A high-fidelity Cas9 mutant delivered as a ribonucleoprotein complex enables efficient gene editing in human hematopoietic stem and progenitor cells. Nat. Med. 2018; 24:1216–1224. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Kim N., Kim H.K., Lee S., Seo J.H., Choi J.W., Park J., Min S., Yoon S., Cho S.R., Kim H.H.. Prediction of the sequence-specific cleavage activity of Cas9 variants. Nat. Biotechnol. 2020; 38:1328–1336. [DOI] [PubMed] [Google Scholar]
- 18. Schmid-Burgk J.L., Gao L.Y., Li D., Gardner Z., Strecker J., Lash B., Zhang F.. Highly parallel profiling of Cas9 variant specificity. Mol. Cell. 2020; 78:794–800. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Doench J.G., Fusi N., Sullender M., Hegde M., Vaimberg E.W., Donovan K.F., Smith I., Tothova Z., Wilen C., Orchard R.et al.. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 2016; 34:184–191. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Jiang F.G., Taylor D.W., Chen J.S., Kornfeld J.E., Zhou K.H., Thompson A.J., Nogales E., Doudna J.A.. Structures of a CRISPR-Cas9 R-loop complex primed for DNA cleavage. Science. 2016; 351:867–871. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Pacesa M., Loeff L., Querques I., Muckenfuss L.M., Sawicka M., Jinek M.. R-loop formation and conformational activation mechanisms of Cas9. Nature. 2022; 609:191–196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Fu R.J., He W., Dou J.Z., Villarreal O.D., Bedford E., Wang H.E., Hou C., Zhang L., Wang Y.L., Ma D.C.et al.. Systematic decomposition of sequence determinants governing CRISPR/Cas9 specificity. Nat. Commun. 2022; 13:474–488. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Xu H., Xiao T.F., Chen C.H., Li W., Meyer C.A., Wu Q., Wu D., Cong L., Zhang F., Liu J.S.et al.. Sequence determinants of improved CRISPR sgRNA design. Genome Res. 2015; 25:1147–1157. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Klein M., Eslami-Mossallam B., Arroyo D.G., Depken M.. Hybridization kinetics explains CRISPR-Cas off-targeting rules. Cell Rep. 2018; 22:1413–1423. [DOI] [PubMed] [Google Scholar]
- 25. Tsherniak A., Vazquez F., Montgomery P.G., Weir B.A., Kryukov G., Cowley G.S., Gill S., Harrington W.F., Pantel S., Krill-Burger J.M.et al.. Defining a cancer dependency map. Cell. 2017; 170:564–576. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Munoz D.M., Cassiani P.J., Li L., Billy E., Korn J.M., Jones M.D., Golji J., Ruddy D.A., Yu K., McAllister G.et al.. CRISPR screens provide a comprehensive assessment of cancer vulnerabilities but generate false-positive hits for highly amplified genomic regions. Cancer Discov. 2016; 6:900–913. [DOI] [PubMed] [Google Scholar]
- 27. Martin T.D., Cook D.R., Choi M.Y., Li M.Z., Haigis K.M., Elledge S.J.. A role for mitochondrial translation in promotion of viability in K-ras mutant cells. Cell Rep. 2017; 20:427–438. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Veeneman B., Gao Y., Grant J., Fruhling D., Ahn J., Bosbach B., Bienkowska J., Follettie M., Arndt K., Myers J.et al.. PINCER: improved CRISPR/Cas9 screening by efficient cleavage at conserved residues. Nucleic Acids Res. 2020; 48:9462–9477. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Yau E.H., Kummetha I.R., Lichinchi G., Tang R., Zhang Y., Rana T.M.. Genome-wide CRISPR screen for essential cell growth mediators in mutant KRAS colorectal cancers. Cancer Res. 2017; 77:6330–6339. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Zimmermann M., Murina O., Reijns M.A.M., Agathanggelou A., Challis R., Tarnauskaite Z., Muir M., Fluteau A., Aregger M., McEwan A.et al.. CRISPR screens identify genomic ribonucleotides as a source of PARP-trapping lesions. Nature. 2018; 559:285–289. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31. Gao G.Z., Zhang L., Villarreal O.D., He W., Su D., Bedford E., Moh P., Shen J.J., Shi X.B., Bedford M.T.et al.. PRMT1 loss sensitizes cells to PRMT5 inhibition. Nucleic Acids Res. 2019; 47:5038–5048. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Hart T., Chandrashekhar M., Aregger M., Steinhart Z., Brown K.R., MacLeod G., Mis M., Zimmermann M., Fradet-Turcotte A., Sun S.et al.. High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell. 2015; 163:1515–1526. [DOI] [PubMed] [Google Scholar]
- 33. Wang T., Yu H.Y., Hughes N.W., Liu B.X., Kendirli A., Klein K., Chen W.W., Lander E.S., Sabatini D.M.. Gene essentiality profiling reveals gene networks and synthetic lethal interactions with oncogenic Ras. Cell. 2017; 168:890–903. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34. Tareen A., Kinney J.B.. Logomaker: beautiful sequence logos in Python. Bioinformatics. 2020; 36:2272–2274. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Wang D.Q., Zhang C.D., Wang B., Li B., Wang Q., Liu D., Wang H.Y., Zhou Y., Shi L.M., Lan F.et al.. Optimized CRISPR guide RNA design for two high-fidelity Cas9 variants by deep learning. Nat. Commun. 2019; 10:4284–4297. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Alkan F., Wenzel A., Anthon C., Havgaard J.H., Gorodkin J.. CRISPR-Cas9 off-targeting assessment with nucleic acid duplex energy parameters. Genome Biol. 2018; 19:177–189. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37. Listgarten J., Weinstein M., Kleinstiver B.P., Sousa A.A., Joung J.K., Crawford J., Gao K., Hoang L., Elibol M., Doench J.G.et al.. Prediction of off-target activities for the end-to-end design of CRISPR guide RNAs. Nat. Biomed. Eng. 2018; 2:38–47. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Lin J.C., Zhang Z.L., Zhang S.X., Chen J.Y., Wong K.C.. CRISPR-net: a recurrent convolutional network quantifies CRISPR off-target activities with mismatches and indels. Adv. Sci. 2020; 7:1903562. [Google Scholar]
- 39. Cancellieri S., Canver M.C., Bombieri N., Giugno R., Pinello L.. CRISPRitz: rapid, high-throughput and variant-aware in silico off-target site identification for CRISPR genome editing. Bioinformatics. 2020; 36:2001–2008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. DeWeirdt P.C., McGee A.V., Zheng F.Y., Nwolah I., Hegde M., Doench J.G.. Accounting for small variations in the tracrRNA sequence improves sgRNA activity predictions for CRISPR screening. Nat. Commun. 2022; 13:5255–5265. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Kim S., Bae T., Hwang J., Kim J.S.. Rescue of high-specificity Cas9 variants using sgRNAs with matched 5' nucleotides. Genome Biol. 2017; 18:218–224. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Wu X.B., Scott D.A., Kriz A.J., Chiu A.C., Hsu P.D., Dadon D.B., Cheng A.W., Trevino A.E., Konermann S., Chen S.D.et al.. Genome-wide binding of the CRISPR endonuclease Cas9 in mammalian cells. Nat. Biotechnol. 2014; 32:670–676. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Slaymaker I.M., Gaudelli N.M.. Engineering Cas9 for human genome editing. Curr. Opin. Struct. Biol. 2021; 69:86–98. [DOI] [PubMed] [Google Scholar]
- 44. Bravo J.P.K., Liu M.S., Hibshman G.N., Dangerfield T.L., Jung K., McCool R.S., Johnson K.A., Taylor D.W.. Structural basis for mismatch surveillance by CRISPR-Cas9. Nature. 2022; 603:343–347. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Zhu X., Clarke R., Puppala A.K., Chittori S., Merk A., Merrill B.J., Simonovic M., Subramaniam S.. Cryo-EM structures reveal coordinated domain motions that govern DNA cleavage by Cas9. Nat. Struct. Mol. Biol. 2019; 26:679–685. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Kim D., Luk K., Wolfe S.A., Kim J.S. Evaluating and enhancing target specificity of gene-editing nucleases and deaminases. Annu. Rev. Biochem. 2019; 88:191–220. [DOI] [PubMed] [Google Scholar]
- 47. Eslami-Mossallam B., Klein M., Smagt C.V.D., Sanden K.V.D., Jones S.K., Hawkins J.A., Finkelstein I.J., Depken M. A kinetic model predicts SpCas9 activity, improves off-target classification, and reveals the physical basis of targeting fidelity. Nat. Commun. 2022; 13:1367–1385. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Zhang D., Hurst T., Duan D.S., Chen S.J.. Unified energetics analysis unravels SpCas9 cleavage activity for optimal gRNA design. Proc. Natl Acad. Sci. USA. 2019; 116:8693–8698. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Kim H.K., Lee S., Kim Y., Park J., Min S., Choi J.W., Huang T.P., Yoon S., Liu D.R., Kim H.H.. High-throughput analysis of the activities of xCas9, SpCas9-NG and SpCas9 at matched and mismatched target sequences in human cells. Nat. Biomed. Eng. 2020; 4:111–124. [DOI] [PubMed] [Google Scholar]
- 50. Meyers R.M., Bryan J.G., McFarland J.M., Weir B.A., Sizemore A.E., Xu H., Dharia N.V., Montgomery P.G., Cowley G.S., Pantel S.et al.. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat. Genet. 2017; 49:1779–1784. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51. McFarland J.M., Ho Z.V., Kugener G., Dempster J.M., Montgomery P.G., Bryan J.G., Krill-Burger J.M., Green T.M., Vazquez F., Boehm J.S.et al.. Improved estimation of cancer dependencies from large-scale RNAi screens using model-based normalization and data integration. Nat. Commun. 2018; 9:4610–4622. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The raw sequencing data (fastq files) generated in this study are deposited at the Sequence Read Archive (SRA) under the BioProject accession number PRJNA977711. The GuideVar scores of all the sgRNAs targeting the exomes within genomes of human, mouse and monkey are deposited at FigShare and are freely accessible at https://figshare.com/s/721551365ff2c4b92976.