Abstract
CRISPR-based genome editing relies on guide RNA sequences to target specific regions of interest. A large number of methods have been developed to predict how efficient different guides are at inducing indels. As more experimental data becomes available, methods based on machine learning have become more prominent. Here, we explore whether quantifying the uncertainty around these predictions can be used to design better guide selection strategies. We demonstrate that a deep ensemble approach achieves better performance than a single model, while also providing uncertainty quantification. This allows us to design, for the first time, strategies that consider uncertainty in guide RNA selection. These strategies achieve precision over 90% and can identify suitable guides for >93% of genes in the mouse genome.
Keywords: CRISPR-Cas9, guide design, machine learning, uncertainty quantification
Introduction
CRISPR-based methodologies have established themselves as a very important instrument for genomic manipulation [1]. Fundamentally, these technologies employ a CRISPR-associated (Cas) endonuclease alongside a sequence-specific RNA component that directs the nuclease and other functions associated with the Cas enzyme towards a predetermined genomic locus [2, 3, 4]. This guide RNA (gRNA) is engineered to target specific genomic sequences for editing. Specifically, in the context of Cas9, a potential CRISPR target site requires the presence of a protospacer adjacent motif (PAM) characterized by an NGG sequence, with the adjacent 20 nucleotide sequence upstream serving as a template for constructing the gRNA.
Over the preceding decade, the scientific community has leveraged CRISPR technology for a wide array of purposes, ranging from foundational research to practical applications. These include the development of animal models for disease research [5], the genetic study of endangered species [6], enhancement of agricultural crop resilience [7], and the pioneering of novel therapeutic approaches [3].
However, despite the broad spectrum of applications underscoring the versatility of CRISPR-based genome editing, the process of gRNA design remains a non-trivial and intricate task, demanding careful consideration and expertise.
One of the objectives when designing gRNAs is to maximize the on-target efficiency, which can be understood as the rate at which the desired edit is obtained. Liu et al. [8] highlight how a wide range of factors have been investigated for their effects on gRNA efficiency. To assist in the process, many tools have been developed [9, 10].
While most of the tools can identify some efficient guides, the overlap between them is often limited [9]. This behaviour can be exploited to develop consensus approaches that outperform individual tools [11, 12], but there remains considerable room for further improvement.
As more experimental data have become available, improvements have increasingly been sought through machine learning approaches, with a particular focus on deep learning [13, 14, 15, 16].
One limitation of complex machine learning methods is the risk of treating them as black boxes and putting too much trust in their output. This is particularly true for deep learning, where explainability is a challenge [17, 18].
In this paper, we explore the notion of uncertainty. Can we deploy simple and scalable strategies to estimate the uncertainty in the predicted efficiency? If this is achievable, can we develop new guide RNA design strategies that incorporate that uncertainty, and will that improve the quality of the guides being selected?
Materials and methods
CRISPRon
Xiang et al. [16] experimentally generated indel frequencies for 10 592 Cas9 gRNAs with minimal overlap with existing datasets that also report indel frequencies. Using this data, they developed a deep learning model called CRISPRon. The initial CRISPRon model was trained using one-hot encodings of 30 bp sequences (4 bp upstream + 20 bp spacer + 3 bp PAM + 3 bp downstream) and other features such as melting point and RNA-DNA binding energy. The initial results showed a strong correlation with the existing dataset from [14]. The two datasets were therefore merged to create a new dataset of 23 902 guides, which was used to train the final version of CRISPRon. CRISPRon was reported to have a better correlation on both an internal independent test set and an external test set, outperforming popular models such as Azimuth [13], DeepSpCas9 [14], and DeepHF [15].
The CRISPRon model is essentially a convolutional neural network taking the 30 bp sequence as input, modified to take additional features as input to its later layers, which resemble those of a regular feedforward neural network. See Figure 2.a in Ref. [16] for a diagrammatic representation of the precise neural network architecture.
Figure 2.

Absolute prediction error as a function of uncertainty (measured using IQR)
Given the strong performance of the model, we implemented it in PyTorch [19] as a base model which we modified to allow for uncertainty quantification. The updated architecture is shown in Fig. 1.
Figure 1.
Architecture diagram of the deep learning model used by each ensemble member. A key change is that the output now provides the parameters of a Beta distribution instead of the indel frequency
Uncertainty quantification and deep ensemble approach
We modify the CRISPRon model in two ways, to capture both the aleatoric (data-based) and epistemic (model-based) uncertainties in the predictions.
For the former, rather than outputting a single prediction, the inherent variability in the data is accounted for by modelling the response variable as coming from a Beta distribution (which takes values between zero and one, the range of possible efficiency values).
The model is trained via maximum likelihood, by performing (stochastic) gradient descent on the negative log-likelihood function for a collection of Beta-distributed responses. We provide a brief overview of the approach here. Our implementation uses the Beta class contained in the Distributions module of PyTorch, which has two parameters that we denote by α and β. We write the log probability function for a choice of parameters (α, β), evaluated at an observed response y, as log f(y; α, β).
Rather than have each neural network output a single value that is used as the predicted response, our approach has each neural network output two values, which are assigned to be α and β. In our setting, these values will differ depending on the input x to the neural network and the neural network parameters θ. Thus, we can write the outputs of the neural network for the data corresponding to the k-th observation as α_θ(x_k) and β_θ(x_k).
With the above notation in hand, the loss function that is minimized with respect to the neural network parameters θ is then

ℓ(θ) = −∑_k log f(y_k; α_θ(x_k), β_θ(x_k)).
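This negative log-likelihood loss can be sketched in plain Python (a minimal illustration; the paper's implementation uses PyTorch's Beta class, and the function names here are our own):

```python
import math

def beta_log_prob(y, alpha, beta):
    # Log-density of Beta(alpha, beta) evaluated at y, matching what
    # torch.distributions.Beta(alpha, beta).log_prob(y) computes.
    log_B = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1) * math.log(y) + (beta - 1) * math.log(1 - y) - log_B

def nll_loss(params, ys):
    # Loss minimized during training: negative sum of log-probabilities,
    # where params[k] = (alpha, beta) is the network output for observation k.
    return -sum(beta_log_prob(y, a, b) for (a, b), y in zip(params, ys))
```

In training, each (alpha, beta) pair would be produced by the network for the corresponding input, and the sum would be taken over a mini-batch.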
A prediction can be obtained by outputting the expected value of the respective Beta distribution. The uncertainties for a single model can be obtained by simulating many times from the corresponding Beta distribution.
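As a sketch of this procedure (names are our own; the percentile choices anticipate the U98 and IQR metrics defined under 'Metrics, thresholds, and tools'):

```python
import random

def single_model_prediction(alpha, beta, n=10_000, seed=0):
    # Point prediction: the expected value (mean) of the Beta distribution.
    point = alpha / (alpha + beta)
    # Uncertainty: simulate many times from the distribution and
    # read off empirical quantile ranges.
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(n))
    iqr = samples[int(0.75 * n)] - samples[int(0.25 * n)]   # 75th - 25th
    u98 = samples[int(0.99 * n)] - samples[int(0.01 * n)]   # 99th - 1st
    return point, iqr, u98
```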
Whilst the above approach is capable of modelling the uncertainty in the response variable for a given input, it does not capture our inherent uncertainty in the model itself that should be used to make such predictions.
To overcome the latter issue, we use a simple deep ensemble approach. The essence of the approach is simple: one trains not one model but a large number of models, each with a different initialization for the training procedure. Because the training objective function has many local minima, different initializations result in training ending in different minima, yielding a collection of distinct models.
Lakshminarayanan et al. [20] reported that ensembles of five single models (M = 5) saw significantly improved uncertainty estimates across all evaluated cases. Therefore, our ensemble used 25 single models (M = 25) to strike a balance between choosing a conservatively high number of models to ensure good uncertainty quality, whilst avoiding unnecessary computational effort with diminishing returns.
Despite the apparent simplicity, such an approach is effective as deep ensembling can be viewed as a crude approximation to sampling from the posterior distribution of a Bayesian model [21], which is a standard approach for accounting for model-based uncertainty (but is typically computationally intractable in deep learning settings). An additional advantage of deep ensembles is that they tend to produce better results in terms of performance (generalization to unseen data) [22].
As mentioned, our initial modelling uses a probabilistic (Beta-distribution based) extension of Xiang et al. [16]. However, instead of using a single model, we fit an ensemble of 25 models. Each ensemble member was trained for 50 epochs. The input for our deep-learning model uses the same 30 bp sequence (4 bp upstream + 20 bp spacer + 3 bp PAM + 3 bp downstream) and the sequence melting point as a feature.
The ensemble approach is configured as an equally-weighted model average, where all of the predictions for the ensemble members are averaged to give a final prediction.
Uncertainty bands for prescribed quantiles can be produced by simulating many times from each individual ensemble member's Beta distribution, aggregating the resulting samples, and computing the empirical quantiles of this simulated response.
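The aggregation just described can be sketched as follows (an illustrative sketch; function and variable names are our own):

```python
import random

def ensemble_summary(member_params, n_per_member=2_000, seed=0):
    # Equally-weighted model average for the point prediction ...
    means = [a / (a + b) for a, b in member_params]
    point = sum(means) / len(means)
    # ... and pooled simulation across all members for uncertainty bands.
    rng = random.Random(seed)
    pooled = sorted(
        rng.betavariate(a, b)
        for a, b in member_params
        for _ in range(n_per_member)
    )
    def quantile(p):
        return pooled[min(int(p * len(pooled)), len(pooled) - 1)]
    return point, quantile(0.25), quantile(0.75)
```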
Data
We combined the CRISPRon datasets with other datasets containing experimental results on indel frequency: [14, 15, 23]. All datasets were filtered to remove duplicated entries and NaN results.
The datasets from [15] and [16] were expanded to 30 bp sequences. The expansion was achieved by aligning the provided sequences to the reference human genome (NCBI RefSeq assembly: GCF_000001405.40) using Bowtie2 [24], and extracting the extended sequences with SAMtools [25]. Any sequence that had multiple perfect alignments from Bowtie2 was excluded.
After the above preprocessing, 13 359 guides remained from [14], 49 523 from [15], 9161 from [23] and 9886 from [16], yielding 81 195 guides in total. The processed datasets were then merged into a final dataset. This includes 80 464 guides that were unique to a dataset and 731 that appeared in more than one (728 guides were present in two datasets, three guides were present in three datasets; none were present in all datasets). For the majority of the 731 guides, we did not observe any significant difference between the datasets. In the final merged dataset, they are included only once, and represented by their average value.
The 30 bp sequences were individually one-hot encoded and their respective melting points were calculated. Finally, our processed dataset was divided into training and testing portions, of sizes 61 195 and 20 000 guides, respectively.
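For illustration, the one-hot encoding step can be sketched as below (a minimal sketch with our own function name; the melting-point calculation is omitted):

```python
def one_hot(seq):
    # One-hot encode a 30 bp sequence over the alphabet A, C, G, T,
    # yielding a 30x4 binary matrix (one row per position).
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    matrix = [[0, 0, 0, 0] for _ in seq]
    for i, base in enumerate(seq.upper()):
        matrix[i][index[base]] = 1
    return matrix
```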
Metrics, thresholds, and tools
Three methods are used to evaluate the performance of approaches involving our ensemble model and its associated uncertainty ranges.
The first method compares the predicted score from the model with the observed indel frequency, i.e. the actual score. This considers the correlation between the scores, as in [16], and the absolute prediction error.
To explore guide design strategies, it is convenient to transform the scores into binary classes. Is this guide efficient or not? Is it accepted by the model or not?
The second level of evaluation is then to consider the level of performance for this binary decision problem. For any gene, there are often dozens of candidate guides, so it is more important that the selected guides are efficient than that all efficient guides are selected. As a result, precision is the key metric of interest. For completeness, recall is also reported.
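For this binary decision problem, precision and recall take their usual form (a minimal sketch; names are our own, with guides represented by identifiers):

```python
def precision_recall(selected, efficient):
    # selected:  set of guide IDs accepted by the model
    # efficient: set of guide IDs whose observed indel frequency
    #            exceeds the efficiency cutoff
    true_positives = len(selected & efficient)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(efficient) if efficient else 0.0
    return precision, recall
```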
The third level of evaluation tests the assumption that a high recall is not essential. All potential CRISPR sites in the mouse genome are extracted and evaluated, and we count the number of selected guides for each gene.
The selection of a threshold to turn the observed indel frequencies into binary classes is somewhat arbitrary. Based on the distribution of the training data and on the practical constraints of genome editing experiments, we chose 0.7 as the default. To explore the impact of that choice, we also repeated all tests with thresholds of 0.6 and 0.8.
To make the binary decision of selecting a guide or not, we can consider the score alone, or the score in conjunction with some notion of uncertainty. To quantify the uncertainty we used two metrics: the range between the 99th and 1st percentiles (referred to as U98), and the interquartile range (IQR), defined as the difference between the 75th and 25th percentiles. In what follows, we denote by t_s the threshold on the score, t_U the threshold on the U98, and t_I the threshold on the IQR. For t_s, the threshold is defined directly on the predicted score. For t_U and t_I, the threshold is defined based on the range of values observed with the training data. For instance, t_I = 0.2 means we pick the cutoff so that 20% of the guides used for training had an IQR in predicted scores below it (which corresponds to an IQR value of 0.13). To select guides, we only choose guides that have a predicted score above t_s and an uncertainty (if used) below t_U (or t_I).
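The selection rule can be sketched as follows: the uncertainty cutoff is calibrated as a percentile of the training-set uncertainties, then applied together with the score threshold (names and the dictionary layout are our own):

```python
def calibrate_cutoff(train_uncertainties, fraction):
    # Map a fraction (e.g. 0.2) to an absolute cutoff: the value below
    # which that fraction of training guides fall.
    ranked = sorted(train_uncertainties)
    return ranked[int(fraction * len(ranked))]

def select_guides(guides, score_min, unc_cutoff=None):
    # Accept guides scoring above score_min and, if an uncertainty
    # cutoff is supplied, with uncertainty below it.
    return [
        g for g in guides
        if g["score"] > score_min
        and (unc_cutoff is None or g["iqr"] < unc_cutoff)
    ]
```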
These three levels of evaluation are used to assess our ensemble approach. To understand the contribution of the ensembling and of uncertainty quantification, we also use a single prediction from the first model of the ensemble. Note that this single model should not be seen as an exact replication of CRISPRon, due to the use of the Beta distribution as described above.
To facilitate comparisons, we also include results from CRISPRon itself. CRISPRon was tested using the threshold for efficient guides defined by [16].
We also add the results from the Crackling method to our comparisons. Crackling was included because it outperformed all tools it was tested against on the Wang and Doench datasets [12], and because it performs a different type of ensembling by taking the consensus between three distinct methods when evaluating guides.
Results
Scoring performance
After training the ensemble, the performance was assessed on the testing set of 20 000 guides. The Spearman correlation and Pearson correlation were 0.842 and 0.838, respectively. To understand the performance of the ensemble, the first member was selected and tested in isolation, resulting in a Spearman correlation of 0.809 and Pearson correlation of 0.805. The ensemble model provides superior performance.
Next, we looked at the absolute error in the predicted score. For the ensemble model, we observe a mean absolute error of 0.1103 (standard deviation 0.0924). In contrast, the single model had a mean of 0.1187 and standard deviation of 0.1023. Again, there is a clear advantage to the deep ensemble.
Figure 2 shows the error as a function of uncertainty (measured using IQR). The lower the uncertainty, the lower the prediction error. This further highlights the benefits of the ensemble approach. It also provides a motivation for using that uncertainty in the guide selection: if the uncertainty is low, the predicted score is more likely to be accurate, and filtering for highly-scored guides should return efficient ones.
Guide selection performance
Next, we explored different guide selection strategies to understand how the ensemble model and the uncertainty quantification could be exploited to select more efficient guides. We varied the threshold on the predicted score, and the thresholds and on the uncertainty metrics.
Table 1 shows some of the top-performing threshold combinations, sorted by precision. It is not surprising that a majority of configurations use a score threshold close to 0.7, given how we binarized the actual scores into efficient/inefficient classes: in the default configuration, we used a threshold of 0.7 on the actual score. We explore this in the section ‘Impact of threshold choice’.
Table 1.
Some of the best configurations based on the 20K holdout testing set.
| Thresholds | Precision | Recall |
|---|---|---|
| Score (), IQR () | 100.00 | 0.02 |
| Score (), U98 () | 98.06 | 9.58 |
| Score (), IQR () | 97.94 | 9.58 |
| Score (), U98 () | 97.83 | 6.39 |
| Score (), U98 () | 97.17 | 18.68 |
| Score (), IQR () | 96.94 | 19.85 |
| Score (), U98 () | 96.80 | 19.69 |
| Score (), U98 () | 96.00 | 26.68 |
| Score (), IQR () | 95.39 | 29.66 |
| Score (), IQR () | 95.36 | 29.70 |
| Score (), U98 () | 94.89 | 28.80 |
| Score (), U98 () | 94.69 | 29.15 |
| Score (), U98 () | 94.66 | 29.19 |
| Score (), U98 () | 94.35 | 36.37 |
| Score (), IQR () | 93.92 | 39.28 |
Results are ordered by precision and duplicates are removed.
What is more interesting is that we obtain very high precision, with all these configurations scoring above 93%. The guides being selected have a very high chance of being efficient in practice.
It is also interesting to note that applying the threshold to the U98 or to the IQR makes little difference. The threshold value matters more than the uncertainty metric it is applied to.
As expected, the recall values are lower. This is a direct consequence of prioritizing precision. The assumption is that these recall values (especially those above 15%) are high enough to select enough guides for most genes. We explore this in the section ‘Whole-genome performance’.
When the uncertainty threshold is very low, the score threshold becomes redundant. We do not see predictions where the uncertainty is very low and the predicted score is low, so a very tight constraint on the uncertainty means that only high-score predictions are selected. Similarly, if the score threshold is extremely high, the uncertainty is always low (because the prediction is an average and is bounded by 1, so it can only be very high if all individual scores are high) and the uncertainty threshold becomes redundant. This leads to identical sets of results, so Table 1 does not show duplicates. Such extreme configurations are not considered practical.
In Table 2, we fix the score threshold at 0.7 and explore the impact of the deep ensemble and of uncertainty quantification. Just using the ensemble offers a 3% increase in precision (81.46% to 84.28%). Taking the uncertainty into account adds another 6%–14% increase in precision. Of course, the cost is a lower recall, but some configurations still provide a recall above 55%.
Table 2.
Impact of the uncertainty threshold.
| Thresholds | Precision | Recall |
|---|---|---|
| Score (), U98 () | 98.06 | 9.58 |
| Score (), IQR () | 97.94 | 9.58 |
| Score (), IQR () | 96.94 | 19.85 |
| Score (), U98 () | 96.80 | 19.69 |
| Score (), IQR () | 95.36 | 29.70 |
| Score (), U98 () | 94.66 | 29.19 |
| Score (), IQR () | 93.92 | 39.28 |
| Score (), U98 () | 93.42 | 38.31 |
| Score (), IQR () | 92.65 | 48.36 |
| Score (), U98 () | 92.05 | 45.51 |
| Score (), IQR () | 90.91 | 55.17 |
| Score (), U98 () | 90.50 | 51.08 |
| Ensemble () | 84.28 | 79.55 |
| Single model () | 81.46 | 79.18 |
All configurations use a score threshold of 0.7. Results are ordered by precision.
Performance comparison on previous datasets
In order to further our understanding of the model performance, we used the datasets generated by [26] and [27], referred to as the Wang dataset and the Doench dataset, respectively. The Wang dataset contained 1169 labelled guides: 731 efficient and 438 inefficient. After removing the guides seen during training, 1146 guides remained (714 efficient, 432 inefficient).
The Doench dataset contained 1841 guides. Guides in the top 20% of per-gene percent rank were labelled efficient, as outlined in [27]. This resulted in 369 efficient and 1472 inefficient guides before removing guides seen during training. After filtering, 630 guides remained (124 efficient, 506 inefficient).
The Wang dataset was obtained from [28], Supplementary Table S1. The Doench dataset was obtained from [27], Supplementary Table S7.
We ran different configurations of our model on the remaining guides. The results for the Wang dataset and Doench dataset are shown in Tables 3 and 4, respectively.
Table 3.
Selected threshold configurations tested on the filtered Wang dataset, compared with Crackling.
| Thresholds | Precision | Recall |
|---|---|---|
| Score (), U98 () | 100.00 | 2.38 |
| Score (), IQR () | 100.00 | 0.98 |
| Score (), U98 () | 100.00 | 0.98 |
| CRISPRon () | 100.00 | 0.98 |
| Score (), U98 () | 96.97 | 4.48 |
| Score (), U98 () | 96.25 | 10.78 |
| Score (), IQR () | 95.83 | 9.66 |
| Score (), IQR () | 95.60 | 12.18 |
| Score (), U98 () | 95.16 | 8.26 |
| Score (), IQR () | 94.44 | 2.38 |
| Score (), IQR () | 94.12 | 4.48 |
| Score (), IQR () | 94.12 | 6.72 |
| Score (), U98 () | 93.62 | 6.16 |
| Crackling (N = 3) | 90.00 | 13.87 |
| Crackling (N = 2) | 83.95 | 44.68 |
Table 4.
Selected threshold configurations tested on the filtered Doench dataset, compared with Crackling.
| Thresholds | Precision | Recall |
|---|---|---|
| Score (), U98 () | 62.50 | 4.03 |
| Score (), IQR () | 54.55 | 4.84 |
| CRISPRon () | 50.00 | 0.81 |
| Score (), IQR () | 50.00 | 7.26 |
| Score (), U98 () | 47.06 | 6.45 |
| Score (), IQR () | 46.43 | 10.48 |
| Score (), U98 () | 42.86 | 9.68 |
| Score (), IQR () | 40.00 | 1.61 |
| Score (), IQR () | 40.00 | 12.90 |
| Score (), U98 () | 40.00 | 11.29 |
| Score (), IQR () | 40.00 | 16.13 |
| Score (), U98 () | 40.00 | 14.52 |
| Crackling (N = 3) | 35.82 | 19.35 |
| Crackling (N = 2) | 27.83 | 51.61 |
It is important to note that for the Doench dataset the efficient/inefficient classes are based on fold change, not indel frequency. Some indels may not lead to a change in expression, so it is a more difficult task.
For the Wang dataset, we obtained high precision, ranging from 93.62% to 100%. This is higher than the results obtained by Crackling, at the cost of a lower recall. CRISPRon also achieves a precision of 100%, matching the stricter threshold configurations.
For the Doench dataset, we obtained the highest precision amongst all the tools tested, ranging from 40% to 62.50%. CRISPRon achieved the third highest precision, at 50%, whilst having the lowest recall. As on the Wang dataset, Crackling achieved the highest recall whilst having lower precision.
Impact of threshold choice
As discussed in the section ‘Metrics, thresholds and tools’, the threshold used to transform the indel frequency into binary classes is somewhat arbitrary. To ensure that the results described in the previous section are not an artefact of that choice, we repeated the same evaluation with a lower and a higher threshold. Overall, the results are consistent:
- Precision is very high. Recall is lower, but for many threshold configurations it is sufficiently high.
- Extreme threshold values for either score or uncertainty make the other threshold redundant and produce a very narrow set of guides (with high precision but a very low recall).
- A score threshold close to the threshold used to define efficiency generally produces good performance, but other configurations can also work.
Table 5 shows the impact of the efficiency threshold on a range of configurations. Our model reaches a high precision for all configurations if the efficiency boundary is at an indel frequency of 0.6 or 0.7. A boundary at 0.8 is more challenging, but several configurations still reach a precision above 91%. These results confirm that the model generalizes well.
Table 5.
Impact of the threshold used to define efficiency.
| | Efficiency at 0.6 | | Efficiency at 0.7 | | Efficiency at 0.8 | |
|---|---|---|---|---|---|---|
| Thresholds | Precision | Recall | Precision | Recall | Precision | Recall |
| Score (), IQR () | 98.32 | 16.07 | 96.94 | 19.85 | 91.97 | 25.73 |
| Score (), IQR () | 97.57 | 24.25 | 95.36 | 29.70 | 89.24 | 37.97 |
| Score (), IQR () | 96.80 | 32.33 | 93.89 | 39.28 | 86.20 | 49.27 |
| Score (), IQR () | 96.19 | 40.10 | 92.59 | 48.36 | 82.85 | 59.12 |
| Score (), IQR () | 95.10 | 46.35 | 90.50 | 55.26 | 79.40 | 66.24 |
| Score (), IQR () | 98.32 | 16.07 | 96.94 | 19.85 | 91.97 | 25.73 |
| Score (), IQR () | 97.57 | 24.25 | 95.36 | 29.70 | 89.24 | 37.97 |
| Score (), IQR () | 96.83 | 32.33 | 93.92 | 39.28 | 86.22 | 49.27 |
| Score (), IQR () | 96.21 | 40.09 | 92.65 | 48.36 | 82.90 | 59.12 |
| Score (), IQR () | 95.32 | 46.17 | 90.91 | 55.17 | 79.85 | 66.20 |
| Score (), IQR () | 98.32 | 16.07 | 96.94 | 19.85 | 91.97 | 25.73 |
| Score (), IQR () | 97.56 | 24.21 | 95.39 | 29.66 | 89.27 | 37.92 |
| Score (), IQR () | 96.97 | 31.76 | 94.29 | 38.69 | 86.79 | 48.66 |
| Score (), IQR () | 96.52 | 37.21 | 93.60 | 45.21 | 85.35 | 56.32 |
| Score (), IQR () | 96.11 | 39.91 | 93.11 | 48.43 | 84.69 | 60.19 |
Whole-genome performance
Finally, we evaluated the performance of our model across the entire mouse genome. Here, we do not have a ground truth for all potential CRISPR sites. Instead, we use some of the best-performing methods from the section ‘Guide selection performance’, for which we know the precision, and assess whether their recall is sufficient to identify efficient guides across a large number of genes.
The results are shown in Fig. 3. While some settings are too extreme (as previously discussed), the best-performing configuration does very well: it can identify at least one guide for 93.67% of the genes, and our earlier evaluation placed its precision above 90%. For 81.16% of the genes, it can identify more than three guides. This enables multi-targeting, which has been shown by [29] to dramatically increase knockout efficiency.
Figure 3.
Accepted guides per gene (binned by guide counts)
Overall, this confirms that the deep ensemble approach can be used to identify, for most genes, guides that are all but guaranteed to be efficient.
Discussion
Uncertainty can be quantified, and it can be used to improve guide selection
We proposed and tested a deep ensemble method to evaluate the efficiency of gRNAs. The findings of [16] established that deep learning is capable of performing this evaluation; our approach uses an ensemble of deep-learning models rather than relying on a single model. The ensemble was found to predict gRNA efficiency more accurately while also achieving a higher recall.
Crucially, our deep ensemble also provides a method to quantify the uncertainty in the score prediction. Uncertainty is often overlooked and it is typically not incorporated into existing methods (apart from a few exceptions such as [30]). Our method provides a practical solution to including uncertainty in gRNA selection strategies.
These novel guide selection strategies rely not only on the predicted score but also on how confident we can be in that score. We showed that these strategies achieve very high precision, and we confirm our hypothesis that while the recall is lowered, it remains high enough to identify guides for most genes.
A visual representation of using CRISPR Deep Ensemble to assess efficient guides is shown in Fig. 4. The deep ensemble generalizes well, including to datasets where guide efficiency is reported using different metrics.
Figure 4.
Visual representation of using CRISPR Deep Ensemble in a pipeline. (a) Extract 30mer (4 bp + spacer (20 bp) + PAM (3 bp) + 3 bp) guide sequences from the target gene, genome, exon, etc. (b) Generate one-hot encodings and calculate melting points for all candidate guides. (c) Predict Beta distribution parameters (α, β) with each ensemble member. (d) Sample each ensemble distribution N times. We have highlighted the samples in red. (e) Calculate the ensemble outputs from the collected samples. (f) Apply thresholds on the predicted indel frequency, U98, and IQR to select efficient guides
Off-target evaluation
While on-target efficiency is an important aspect of guide design, it is equally important to consider the potential for off-target editing. Off-target edits can occur when a CRISPR site similar to the guide is mistaken for the target site. Any site that differs from the guide sequence by one to four bases is considered a potential off-target site.
An entire workflow will include both on-target and off-target evaluations, typically in independent modules. These evaluations run consecutively: one first identifies ‘efficient’ guides (i.e. high on-target activity), and later filters those to only keep the ones that are also ‘safe’ (i.e. low off-target risk).
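The consecutive evaluation described above can be sketched as follows (an illustrative sketch; the predicate names are our own, standing in for whichever on-target and off-target modules a pipeline uses):

```python
def design_pipeline(candidates, is_efficient, is_safe):
    # Consecutive evaluation: on-target filtering first, then
    # off-target safety filtering on the survivors.
    efficient = [g for g in candidates if is_efficient(g)]
    return [g for g in efficient if is_safe(g)]
```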
The method outlined in the paper enables uncertainty-aware on-target prediction. It can be easily integrated into an overall pipeline that will also include the off-target assessment, for instance leveraging the modular structure of Crackling [12]. Such a modular approach means the overall workflow can also incorporate the latest developments in off-target scoring [31].
Future directions
This paper represents a first attempt at leveraging uncertainty quantification to design guide RNAs. While the results are very promising, there are a number of directions to explore.
Here, we used a Beta distribution to capture aleatoric uncertainty, but the approach can accommodate other distributions. It would also be interesting to evaluate the impact of the size of the ensemble.
Methods based on the consensus between multiple approaches can produce good results [12]. Combining this consensus philosophy with the ability to estimate uncertainty could lead to new solutions with improved recall.
As more experimental data continues to become available, the ensemble can be retrained and improved, or extended to other Cas proteins.
Conclusion
In this paper, we investigated the use of deep ensembles to improve the prediction of the on-target efficiency of CRISPR guide RNAs. We showed that this approach can capture both the aleatoric and epistemic uncertainties in the predictions.
We also showed that the ensemble provides a more accurate score and that, by combining it with the uncertainty estimates, we can design guide selection strategies with a very high precision. This comes with a lower recall, but we also showed that it remains high enough to identify suitable guide RNAs for most genes.
This represents a practical and easily interpretable solution to leverage uncertainty quantification in CRISPR guide RNA design, and opens interesting directions for future research.
Our deep ensemble model is available at https://github.com/bmds-lab/CRISPR_DeepEnsemble.
Conflict of interest statement. None declared.
Supplementary Material
Contributor Information
Carl Schmitz, School of Computer Science, Queensland University of Technology, Brisbane, QLD 4000, Australia; Centre for Data Science, Queensland University of Technology, Brisbane, QLD 4000, Australia.
Jacob Bradford, School of Computer Science, Queensland University of Technology, Brisbane, QLD 4000, Australia; Centre for Data Science, Queensland University of Technology, Brisbane, QLD 4000, Australia.
Robert Salomone, School of Computer Science, Queensland University of Technology, Brisbane, QLD 4000, Australia; Centre for Data Science, Queensland University of Technology, Brisbane, QLD 4000, Australia; School of Mathematics and Statistics, UNSW Sydney, Sydney, NSW 2052, Australia.
Dimitri Perrin, School of Computer Science, Queensland University of Technology, Brisbane, QLD 4000, Australia; Centre for Data Science, Queensland University of Technology, Brisbane, QLD 4000, Australia.
Funding
C.S. was supported by an Australian Government Research Training Program Scholarship. D.P. was supported by the Australian Research Council [ARC Discovery Project DP210103401].
Data availability
Our deep ensemble model is available at https://github.com/bmds-lab/CRISPR_DeepEnsemble.
References
- 1. Wang JY, Doudna JA. CRISPR technology: a decade of genome editing is only the beginning. Science 2023;379:eadd8643.
- 2. Meaker GA, Hair EJ, Gorochowski TE. Advances in engineering CRISPR-Cas9 as a molecular Swiss Army knife. Synth Biol (Oxf) 2020;5:ysaa021. 10.1093/synbio/ysaa021
- 3. Chavez M, Chen X, Finn PB et al. Advances in CRISPR therapeutics. Nat Rev Nephrol 2023;19:9–22.
- 4. Villiger L, Joung J, Koblan L et al. CRISPR technologies for genome, epigenome and transcriptome editing. Nat Rev Mol Cell Biol 2024;25:464–87.
- 5. Shrock E, Güell M. CRISPR in animals and animal models. Prog Mol Biol Transl Sci 2017;152:95–114.
- 6. Cleves PA, Tinoco AI, Bradford J et al. Reduced thermal tolerance in a coral carrying CRISPR-induced mutations in the gene for a heat-shock transcription factor. Proc Natl Acad Sci USA 2020;117:28899–905.
- 7. Zaidi SS-e-A, Mahas A, Vanderschuren H et al. Engineering crops of the future: CRISPR approaches to develop climate-resilient and disease-resistant plants. Genome Biol 2020;21:289.
- 8. Liu G, Zhang Y, Zhang T. Computational approaches for effective CRISPR guide RNA design and evaluation. Comput Struct Biotechnol J 2020;18:35–44.
- 9. Bradford J, Perrin D. A benchmark of computational CRISPR-Cas9 guide design methods. PLOS Comput Biol 2019;15:e1007274.
- 10. Li C, Chu W, Gill RA et al. Computational tools and resources for CRISPR/Cas genome editing. Genomics Proteomics Bioinf 2023;21:108–26.
- 11. Bradford J, Perrin D. Improving CRISPR guide design with consensus approaches. BMC Genomics 2019;20:931.
- 12. Bradford J, Chappell T, Perrin D. Rapid whole-genome identification of high quality CRISPR guide RNAs with the Crackling method. CRISPR J 2022;5:410–21.
- 13. Doench JG, Fusi N, Sullender M et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat Biotechnol 2016;34:184–91.
- 14. Kim HK, Kim Y, Lee S et al. SpCas9 activity prediction by DeepSpCas9, a deep learning–based model with high generalization performance. Sci Adv 2019;5:eaax9249.
- 15. Wang D, Zhang C, Wang B et al. Optimized CRISPR guide RNA design for two high-fidelity Cas9 variants by deep learning. Nat Commun 2019;10:4284.
- 16. Xiang X, Corsi GI, Anthon C et al. Enhancing CRISPR-Cas9 gRNA efficiency prediction by data integration and deep learning. Nat Commun 2021;12:3238.
- 17. Ali S, Abuhmed T, El-Sappagh S et al. Explainable Artificial Intelligence (XAI): what we know and what is left to attain trustworthy artificial intelligence. Inf Fusion 2023;99:101805.
- 18. Zelenka NR, Di Cara N, Sharma K et al. Data hazards in synthetic biology. Synth Biol (Oxf) 2024;9:ysae010. 10.1093/synbio/ysae010
- 19. Paszke A, Gross S, Massa F et al. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 2019;32:8026–37.
- 20. Lakshminarayanan B, Pritzel A, Blundell C. Simple and scalable predictive uncertainty estimation using deep ensembles. In: Guyon I, Luxburg UV, Bengio S et al. (eds.), Advances in Neural Information Processing Systems, Vol. 30. Red Hook, NY, USA: Curran Associates, Inc., 2017.
- 21. Hoffmann L, Elster C. Deep ensembles from a Bayesian perspective. arXiv, arXiv:2105.13283, 2021, preprint: not peer reviewed.
- 22. Ganaie MA, Hu M, Malik AK et al. Ensemble deep learning: a review. Eng Appl Artif Intell 2022;115:105151.
- 23. Kim N, Kim HK, Lee S et al. Prediction of the sequence-specific cleavage activity of Cas9 variants. Nat Biotechnol 2020;38:1328–36.
- 24. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9:357–9.
- 25. Danecek P, Bonfield JK, Liddle J et al. Twelve years of SAMtools and BCFtools. GigaScience 2021;10:02.
- 26. Wang T, Wei JJ, Sabatini DM et al. Genetic screens in human cells using the CRISPR-Cas9 system. Science 2014;343:80–4.
- 27. Doench JG, Hartenian E, Graham DB et al. Rational design of highly active sgRNAs for CRISPR-Cas9–mediated gene inactivation. Nat Biotechnol 2014;32:1262–7.
- 28. Xu H, Xiao T, Chen C-H et al. Sequence determinants of improved CRISPR sgRNA design. Genome Res 2015;25:1147–57.
- 29. Sunagawa GA, Sumiyama K, Ukai-Tadenuma M et al. Mammalian reverse genetics without crossing reveals Nr3a as a short-sleeper gene. Cell Rep 2016;14:662–77.
- 30. Kirillov B, Savitskaya E, Panov M et al. Uncertainty-aware and interpretable evaluation of Cas9–gRNA and Cas12a–gRNA specificity for fully matched and partially mismatched targets with deep kernel learning. Nucleic Acids Res 2022;50:e11.
- 31. Schmitz C, Bradford J, Salomone R et al. Fast and scalable off-target assessment for CRISPR guide RNAs using partial matches. In: 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). Lisbon, Portugal: IEEE, 2024, 1649–54.