Fig. 3. MLDE predictions of the on-target activity of KKH-SaCas9 with three sgRNAs.
a Enrichment and NDCG of MLDE runs using combinations of embeddings (Bepler/Georgiev) and model parameters (p1 and p2) are plotted against the size of the training data. The best-fit lines summarize 3 replicates for each combination of embedding and model parameters in MLDE runs using 5, 10, 20, 50, and 70% of the training data. b Predicted versus empirical fitness of variants in the best-performing MLDE runs using 20% of the input training data, for the different combinations of embeddings and model parameters. The fitness predicted by MLDE is plotted against the empirical fitness (on-target activity of KKH-SaCas9 with three sgRNAs: sg1, sg2, and sg3) in the best-performing runs (ranked according to NDCG and enrichment score). The maximum fitness value in the training data is indicated at the bottom-right corner of each panel. The top-5% hits in the prediction are highlighted in red, while the top-5% variants from the empirical data are outlined in black. Wild-type KKH-SaCas9 (WT-KKH) and the top-performing variants N888Q, N888Q/A889S, and A889S are labeled. The source data are provided as a Source Data file.