Abstract
Transposon (IS200/IS605)-encoded TnpB proteins are predecessors of class 2 type V CRISPR effectors and have emerged as one of the most compact genome editors identified so far. Here, we optimized the design of Deinococcus radiodurans (ISDra2) TnpB for application in mammalian cells (TnpBmax), leading to an average 4.4-fold improvement in editing. In addition, we developed variants mutated at position K76 that recognize alternative target-adjacent motifs (TAMs), expanding the targeting range of ISDra2 TnpB. We further generated an extensive dataset on TnpBmax editing efficiencies at 10,211 target sites. This enabled us to delineate rules for on- and off-target editing and to devise a deep learning model, termed TEEP (TnpB Editing Efficiency Predictor), capable of predicting ISDra2 TnpB guiding RNA (ωRNA) activity with high performance (r > 0.8). Employing TEEP, we achieved editing efficiencies up to 75.3 % in the murine liver and 65.9 % in the murine brain after adeno-associated virus (AAV) vector delivery of TnpBmax. Overall, the advancements and tools presented in this study facilitate the application of TnpB as an ultracompact programmable endonuclease in research and therapeutics.
Introduction
CRISPR-Cas systems in prokaryotes provide adaptive immunity against foreign nucleic acids and can be reprogrammed for gene editing (1–3). Recent studies suggest that the class 2 CRISPR effectors Cas9 and Cas12 evolved independently from two IS200/IS605 transposon-encoded nuclease superfamily members, IscB and TnpB, respectively (4, 5). IscB and TnpB proteins are typically more compact than Cas9 and Cas12 endonucleases (4) and are guided by RNAs derived from the transposon right end (RE) element to bind and cleave substrate DNA (4, 5). TnpB of Deinococcus radiodurans (ISDra2) was the first member of the TnpB family employed for genome editing in mammalian cells (5). Analogous to CRISPR guide RNAs, the RE RNA of ISDra2 TnpB (referred to as ωRNA) is composed of a static part (scaffold) that is engaged by the protein, and a programmable stretch (guide) with complementarity to the target DNA. In addition, similar to the protospacer adjacent motif of Cas9 proteins, ISDra2 TnpB requires a 5’-TTGAT target adjacent motif (TAM) for target recognition and cleavage (5–7).
Despite recent proof-of-concept that ISDra2 TnpB can be adopted for genome editing, its activity is lower compared to CRISPR-Cas nucleases (8), and the complex TAM motif restricts its targeting range. Here, we optimized ISDra2 TnpB for mammalian genome editing (TnpBmax). We further developed a variant that targets non-canonical TAMs (TnpBmax-K76A) and a deep learning model that predicts the efficiency of ωRNAs (termed TnpB Editing Efficiency Predictor, TEEP). Employing these tools resulted in efficient genome editing in cell lines and in vivo in the murine liver and brain.
Results
Efficient DNA editing in mammalian cells with TnpBmax
ISDra2 TnpB represents an ultracompact RNA-guided endonuclease (RGEN). It can be targeted to a desired site in the genome by reprogramming the guide sequence of the ωRNA and cleaves DNA via a single RuvC domain (Fig. 1a-b). While fusing an SV40 nuclear localization sequence (NLS) to the C-terminus of ISDra2 TnpB allowed the installation of indel (insertion and deletion) mutations in mammalian cells (5), the resulting editing efficiencies are moderate (Fig. 1c)(8). Since previous studies have shown that the efficiency of genome editors can be increased by modifying the NLS and codon usage (9, 10), we generated constructs in which the coding sequence of ISDra2 TnpB was optimized for mammalian codon usage (GenScript) and in which NLS and protein linker domains were arranged in different combinations (ARC-1 to ARC-13; Fig. 1d). When tested on 94 sites using a target-matched library, in which genomically integrated cassettes contain matched pairs of ωRNAs and target sites (Supplementary Fig. 1a; Datafile S1), we found that codon optimization alone already led to a significant (P < 0.0001) increase in editing rates (ARC-0 vs. ARC-10, 3.3-fold; Fig. 1d). Benchmarking of the different designs revealed that ARC-13, which contains an additional GS linker and a bipartite NLS (BPNLS) sequence at the 3'-end, further improves editing rates by 1.3-fold (ARC-10 vs. ARC-13, P = 0.0125; Fig. 1d). Higher editing efficiencies with ARC-13 (hereafter termed TnpBmax) compared to ARC-10 and ARC-0 were also verified at 7 endogenous sites (Fig. 1e; Extended Data Fig. 1). Moreover, western blot analysis indicates that this increase in editing is primarily due to higher protein expression rather than increased nuclear shuttling (Supplementary Fig. 1b-g).
Figure 1. DNA cleavage and base editing in mammalian cells with enhanced TnpBmax.
(A) Schematic representation of the Deinococcus radiodurans ISDra2 locus, and TnpB engaged with target DNA. The transposon is flanked by the left-end (LE) and right-end (RE) elements and consists of the tnpA and tnpB genes. TnpB can be used as an RNA-guided DNA endonuclease by programming the ωRNA derived from the right end of the transposon to match a sequence 3' of the transposon adjacent motif (TAM). (B) Comparison of TnpB to CRISPR Class 2 RNA-guided endonucleases (RGENs). aa, amino acids; As, Acidibacillus sulfuroxidans; Cj, Campylobacter jejuni; Nme, Neisseria meningitidis; Sp, Streptococcus pyogenes. (C) Genome editing in the human embryonic kidney cell line (HEK293T) over the course of 9 days with three different ωRNAs and TnpB. Indels, insertions and deletions; d, day; dots represent the mean ± s.d. of n ≥ 3 independent biological replicates. (D) Benchmarking of different TnpB architectures (ARC-1 to ARC-13) in HEK293T cells on a target-matched library with N = 94 individual ωRNA-target pairs. Each data point represents the mean of n = 4 independent biological replicates. Means were compared by two-tailed t-test. Bar represents the mean of N = 94 independent target sites. NLS, nuclear localization sequence; BPNLS, bipartite NLS; SRAD, Serine-Arginine-Alanine-Aspartic acid; GS, Glycine-Serine. (E) Benchmarking (fold change) of ISDra2, ISAam1, and ISYmu1 TnpB designs on endogenous loci in HEK293T cells. Indel values (%) were normalized to the ARC-0 design of the respective TnpB. Each data point represents the mean of n = 3 independent biological replicates. Bar represents the mean ± s.d. of N = 7 (ISDra2 and ISAam1) or N = 8 (ISYmu1) target sites. ᵃCodon optimization and design from Xiang et al. (11). (F) Comparison of the indel frequencies of different RGENs in HEK293T cells. 
TnpBmax (mean = 27.9 %, N = 94); ISAam1max (mean = 9.1 %, N = 98); ISYmu1max (mean = 10.1 %, N = 68); AsCas12fᵃ (CasMINI, mean = 5.6 %, N = 58); Nme2Cas9ᵃ (mean = 16.3 %, N = 82); CjCas9ᵃ (mean = 22.9 %, N = 67); SpCas9ᵃ (mean = 87.9 %, N = 91); ᵃData of CRISPR RGENs from Schmidheini et al., 2023 (8). N, number of individual target sites. Each dot represents the mean of n ≥ 3 independent biological replicates. Means were compared by two-tailed t-test. (G) Schematic representation of a nuclease-deficient TnpB(D191A) adenine base editor and adenine base editing at seven individual DNA target sites with TnpB(D191A)-TadA8e (C-ABE) or TadA8e-TnpB(D191A) (N-ABE). Adenine bases (A) in the DNA R-loop are converted to Inosine (I) by TadA8e fused to TnpB. Inosine is repaired to Guanine (G) within the cell. Substrate bases for the base editor are highlighted (bold lines).
We next explored whether the ARC-13 design could also enhance editing efficiencies of other TnpB orthologs and Fanzors, which are TnpB-related eukaryotic RNA-guided endonucleases. We first adapted the ARC-0-8-, 10- and 13- design to TnpB from Anoxybacillus amylolyticus (ISAam1) and Youngiibacter multivorans (ISYmu1), which are TnpB orthologs that have been previously employed for mammalian genome editing (11). When tested on 7 (ISAam1) or 8 (ISYmu1) endogenous loci, ARC-13 consistently outperformed all other tested designs, including the designs initially reported in Xiang et al. (11) (Fig. 1e, Extended Data Fig. 1c,d). Likewise, ARC-13 also outperformed other designs when tested on Fanzor from Spizellomyces punctatus (SpuFz1-v2) (12) (Extended Data Fig. 1e).
Finally, we compared indel formation rates of TnpBmax to other commonly used RGENs in a target-matched ωRNA/guide RNA library. The library consisted of randomly chosen target sites (8), with every target site containing the optimal protospacer length and PAM or TAM sequence for each tested RGEN (20 bp and 5'-TTGAT for TnpBmax, 20 bp and 5’-TTTAA for ISAam1max, 20 bp and 5’-TTGAT for ISYmu1max, 20 bp and 5’-TTTR for AsCas12f, 22 bp and 5’-N3AACAC for CjCas9, 22 bp and 5’-N4CC for Nme2Cas9 and 19 bp and 5’-NGG for SpCas9; Datafile S1). The library was genomically integrated into HEK293T cells, which were subsequently transfected with plasmids expressing the different RGENs. HTS analysis of target sites filtered for ≥ 100 reads revealed that TnpBmax outperforms all other tested small-sized RNA-guided nucleases, including Cas12f (CasMINI; 4.9-fold), CjCas9 (1.2-fold) and Nme2Cas9 (1.7-fold) (Fig. 1f). However, Cas9 from Streptococcus pyogenes (SpCas9) was still the most efficient endonuclease, with 3.2-fold higher editing compared to TnpBmax (Fig. 1f).
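The read-depth filter and fold-change comparison described above can be illustrated with a minimal sketch. This is not the analysis pipeline used in the study: the `SiteCounts` record, its field names, and the helper functions are hypothetical, and steps such as background subtraction or replicate averaging are omitted.

```python
from dataclasses import dataclass

MIN_READS = 100  # read-depth threshold stated in the text


@dataclass
class SiteCounts:
    """Hypothetical per-site read counts from deep amplicon sequencing."""
    site_id: str
    total_reads: int
    indel_reads: int


def indel_percent(site):
    """Indel frequency in percent, or None if the site fails the read filter."""
    if site.total_reads < MIN_READS:
        return None
    return 100.0 * site.indel_reads / site.total_reads


def mean_indel_percent(sites):
    """Mean indel frequency over all sites that pass the read filter."""
    values = [v for v in map(indel_percent, sites) if v is not None]
    if not values:
        raise ValueError("no site passed the read-depth filter")
    return sum(values) / len(values)


def fold_change(mean_a, mean_b):
    """Ratio of two mean editing rates (editor A relative to editor B)."""
    return mean_a / mean_b
```

Note that a ratio of library-wide means need not equal the mean of per-site ratios, so fold changes computed this way can differ slightly from values reported in a figure.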
Base editing is a more recent genome editing technique that allows for the precise installation of point mutations via direct DNA deamination without requiring double-stranded DNA breaks (DSBs) (13–15). Base editors comprise a single-stranded DNA deaminase fused to a nuclease-impaired Cas9 or Cas12 enzyme. To test whether base editing can also be achieved using TnpB, we introduced a mutation in the RuvC domain of TnpB (D191A) to eliminate its nuclease activity (5) and fused the lab-evolved adenine deaminase TadA8e (16) to either its N- or C-terminus. Importantly, both architectures resulted in robust A•T to G•C conversions, with up to 16.6 % editing and an editing window ranging from position 2 to 12 of the target sites (Fig. 1g).
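As an illustration of the editing window reported above, the following sketch lists the adenines in a target sequence that fall within positions 2 to 12. It assumes that position 1 is the TAM-proximal base of the target and that the sequence is given for the strand on which adenines are deaminated; both conventions are assumptions made here for illustration, and the function names are hypothetical.

```python
WINDOW_START = 2   # first position of the reported editing window
WINDOW_END = 12    # last position of the reported editing window


def adenines_in_window(target, start=WINDOW_START, end=WINDOW_END):
    """Return 1-based positions of adenines inside the editing window."""
    return [
        pos
        for pos, base in enumerate(target.upper(), start=1)
        if base == "A" and start <= pos <= end
    ]


def fully_edited(target, start=WINDOW_START, end=WINDOW_END):
    """Sequence obtained if every adenine in the window were read as G.

    This is an upper bound; real editing is partial and position dependent.
    """
    editable = set(adenines_in_window(target, start, end))
    return "".join(
        "G" if pos in editable else base
        for pos, base in enumerate(target.upper(), start=1)
    )
```

For a hypothetical target `GATACAAGGTCCATGCAGTA`, the candidate positions are 2, 4, 6, and 7; the adenines at positions 13, 17, and 20 lie outside the window.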
In summary, generating TnpBmax, a codon- and NLS-optimized version of ISDra2 TnpB, facilitates indel formation in mammalian cells and can be utilized for base editing.
Determinants of ωRNA design
To systematically assess cleavage efficiency at 11,188 target sites in parallel, we established a HEK293T cell pool with a genomically integrated target-matched ωRNA library (Fig. 2a). Transfection of the cell pool with a plasmid encoding TnpBmax and analysis of the target sites by deep amplicon sequencing revealed indel efficiencies up to 69.8 %, with a strong correlation between independent biological replicates (r = 0.98, R ≥ 0.95, Supplementary Fig. 2a-c) and to endogenous loci (r = 0.82, Supplementary Fig. 2d). When we first performed an unbiased analysis of sequences with below- and above-average editing efficiencies, we observed a preference of TnpBmax for purine-rich sequences (Fig. 2b-c). Next, we assessed the influence of single and double mismatches between the guide and the target sequence. We identified a critical seed region spanning the first 12 bases of the ωRNA, which did not tolerate transition, transversion, or deletion mutations. At positions 13-15, mismatches were partially tolerated, whereas at positions 16-20, neither mismatches nor deletions negatively influenced editing efficiencies (Fig. 2d-h, Supplementary Fig. 3a-b). Additionally, trimming the last four bases of the 20-base guide sequence of the ωRNA did not lead to a significant reduction in editing efficiencies (Fig. 2i). These results align with previous studies, which also reported greater importance of the seed region for ωRNA binding (6, 7) (Supplementary Fig. 4a-b).
Figure 2. Massively parallel target-matched library screen reveals principles for ωRNA guide design.
(A) Schematic representation of the target-matched ωRNA library screen in HEK293T cells. TAM, transposon adjacent motif; MM, mismatch; HDVr, hepatitis delta virus ribozyme; txn, transfection; d, days. (B) Per-position nucleotide representation of target sites performing above (Pattern A) or below (Pattern B) the average. (C) Editing efficiencies in Neuro-2a or HEK293T cells with pattern A or B synthetically integrated and transfected with the respective ωRNA. Bar represents the mean ± s.d. of n ≥ 2 independent biological replicates. (D-G) Position-dependent impact of single (1x) or double (2x) transition or transversion mismatches (MM) on ωRNA activity. Dots represent the mean of n = 3 independent biological replicates of N = 4 individual target sites. Box plots show the 25th and 75th percentiles, with whiskers extending to the minimum and maximum values; each individual value is plotted, and the line in the box marks the median. (H) Normalized TnpB-mediated indels (FC, fold change) in the DNA target with one-nucleotide deletion throughout the target region. n = 3 independent biological replicates. (I) Influence of ωRNA length (15-25 nt) on ωRNA activity relative to a 20 nt ωRNA. N, number of individual target sites; 15-nt (mean = 1.05, N = 9); 16-nt (mean = 0.95, N = 9); 17-nt (mean = 1.05, N = 8); 18-nt (mean = 1.04, N = 9); 19-nt (mean = 1.77, N = 8); 21-nt (mean = 0.91, N = 6); 22-nt (mean = 0.87, N = 7); 23-nt (mean = 1.02, N = 7); 24-nt (mean = 0.76, N = 7); 25-nt (mean = 1.05, N = 7). (J) Schematic overview of the GUIDE-seq workflow for TnpB off-target detection. dsODN, double-stranded oligodeoxynucleotide; DSB, double-strand break. (K) Sequences of off-target sites identified by GUIDE-seq. The top line presents the intended target sequence with cleaved sites below and mismatches to the on-target site highlighted in color. GUIDE-seq read counts are shown on the right.
The limited acceptance of mismatches in the 12 bp seed region combined with the five bp 5’-TTGAT TAM requirement implies high target specificity of TnpBmax. To validate this hypothesis, we conducted GUIDE-seq analysis (17) on HEK293T cells treated with TnpBmax and four different ωRNAs, respectively (Fig. 2j-k, Supplementary Fig. 5a-b). Importantly, only two of the four tested ωRNAs showed off-target activity, and consistent with our data from the target-matched library screen, these off-target sites did not display mismatches at the TAM-proximal end of the ωRNA guide (Fig. 2k).
In summary, our library screen reveals that TnpBmax favors purine-rich sequences and that ωRNAs do not tolerate mismatches in the seed region. Together with the TAM requirement, this makes TnpBmax a highly specific genome editing tool.
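The position-dependent mismatch tolerance summarized above can be turned into a simple rule-based screen for candidate off-target sites. The sketch below is an illustration under stated assumptions, not a validated specificity predictor: the category labels and function names are hypothetical, sequences are compared as DNA strings of equal length with position 1 at the TAM-proximal end, and deletions are not modeled.

```python
SEED_END = 12      # positions 1-12: mismatches not tolerated
PARTIAL_END = 15   # positions 13-15: mismatches partially tolerated


def mismatch_positions(guide, site):
    """1-based positions at which guide and candidate site differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    pairs = zip(guide.upper(), site.upper())
    return [pos for pos, (g, s) in enumerate(pairs, start=1) if g != s]


def classify_site(guide, site):
    """Coarse label for a candidate site based on where mismatches fall."""
    positions = mismatch_positions(guide, site)
    if not positions:
        return "perfect match"
    if any(pos <= SEED_END for pos in positions):
        return "seed mismatch"          # editing expected to be strongly reduced
    if any(pos <= PARTIAL_END for pos in positions):
        return "partially tolerated"    # editing expected to be reduced
    return "tolerated"                  # positions 16-20 only
```

Under this rule, a genomic site that carries the TAM and differs from the guide only at positions 16 to 20 would be flagged as a likely off-target, which matches the observation that detected off-target sites lacked mismatches at the TAM-proximal end.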
TnpBmax variants engineered for alternative TAM recognition
The 5'-TTGAT TAM sequence occurs on average only once in every 512 bases. While this enhances the specificity of ISDra2 TnpB, it also narrows its targeting range. To first validate the 5'-TTGAT TAM motif, we adapted the high-throughput PAM determination assay (HT-PAMDA) (18) to TnpBmax (Fig. 3a). Our results confirm that efficient targeting is limited to 5’-NTTGAT TAMs, with a slight preference for 5’-YTTGAT TAMs over 5’-RTTGAT TAMs (Fig. 3b). Further supporting these results, each possible single base-pair mutation in the TAM substantially reduced indel efficiencies in HEK293T cells (Fig. 3c).
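The stated frequency follows from a simple calculation: a fixed 5-nt motif matches one in 4^5 = 1,024 positions on a given strand, and counting both strands halves the expected spacing to 512 bp. The sketch below reproduces this arithmetic. It assumes equal base frequencies and independent positions, which real genomes only approximate, and the function name is hypothetical.

```python
def expected_spacing_bp(motif_length, n_motifs=1, both_strands=True):
    """Expected distance in bp between motif occurrences in random sequence.

    motif_length: number of nucleotides in the motif
    n_motifs:     number of distinct sequences accepted (1 for a fixed motif)
    both_strands: count occurrences on the forward and reverse strand
    """
    per_position = n_motifs / 4 ** motif_length
    if both_strands:
        per_position *= 2
    return 1 / per_position


print(expected_spacing_bp(5))                      # 512.0
print(expected_spacing_bp(5, both_strands=False))  # 1024.0
```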
Figure 3. Structure-guided rational engineering of TnpB to accept alternative TAMs.
(A) Molecular characterization of the 6-nucleotide TAM of TnpB via the high-throughput TAM determination assay (HT-TAMDA). HTS, high-throughput sequencing. (B) Cleavage rate (k) for two individual ωRNAs on 4⁶ TAMs each. (C) TnpB activity on mismatched TAMs at seven individual target sites in HEK293T cells normalized to the activity on the 5’-TTGAT TAM. The non-canonical base in the TAM is shown in lowercase and highlighted in red. Each datapoint represents the average of n = 3 independent biological replicates. (D) Structural details of 5’-TTGAT TAM sequence recognition (from PDB 8EXA). Residues K76, Q80, and dG-3 are highlighted. (E) Logo plots of the top 10 TAM motifs derived from HT-TAMDA of TnpBmax and rationally engineered variants thereof. (F-G) Activity of TnpBmax and variants thereof tested on 5’-TTtAT and 5’-TTGAT TAMs in HEK293T cells. Bar represents the mean ± s.d. of n = 2 independent biological replicates. (H) TnpB-WT and TnpB-K76A activity on 11 individual target sites tested on 5’-TTGAT, 5’-TTtAT, 5’-TTcAT, and 5’-TTaAT TAMs. Values represent the mean of n = 2 independent biological replicates. (I) Sequences of off-target sites identified by GUIDE-seq. The top line presents the intended target sequence with cleaved sites below and mismatches to the on-target site highlighted in color. GUIDE-seq read counts are shown on the right.
Since recent structural characterization of ISDra2 TnpB revealed that the amino acid residues K76 and Q80 directly interact with the dG-3 position of the 5’-TTGAT TAM site (Fig. 3d) (6, 7), we next tested whether modifying these amino acid residues could alter TAM recognition of TnpBmax. We performed site-directed mutagenesis at positions 76 and 80 to encompass all possible 20 amino acids and assessed the activity of these variants on alternative TAMs with either A or T at position 3 (dA-3 or dT-3). While none of the Q80X variants showed a marked increase in target DNA cleavage at the alternative TAMs, six K76X variants showed a >5-fold enhancement in activity compared to TnpBmax at one or both alternative TAMs (K76A, K76C, K76G, K76R, K76S, K76T; Supplementary Fig. 6a-b). To further investigate the TAM preference of these six K76X variants, we performed HT-TAMDA on two target sites with 5N-TAMs (Fig. 3e, Extended Data Fig. 2). Interestingly, all variants displayed reduced affinity to the canonical dG-3 but developed a new preference for dT-3 and to a lesser extent for dC-3 and dA-3. The variants K76A, K76C, and K76G exhibited a minor increase in the acceptance of dC-2 over the canonical dT-2 (Fig. 3e, Extended Data Fig. 2).
To verify the recognition of non-canonical TAM sequences in cultured cells, we transfected the six novel K76X variants into HEK293T cells to target 5’-TTtAT or 5’-TTGAT TAMs. Three K76X variants (K76A, K76C and K76S) exhibited editing efficiencies above 17.0 % at the 5’-TTtAT TAM, with TnpBmax-K76A reaching 21.2 % (Fig. 3f). These editing efficiencies were comparable to wild type TnpBmax at the same target site paired with the canonical 5’-TTGAT TAM (Fig. 3g). Further testing of TnpBmax-K76A on a library of 11 target sites, combined with all four 5’-TTNAT TAMs, confirmed the shift in the TAM preference from 5’-TTGAT to 5’-TTtAT and to a lesser extent 5’-TTcAT (Fig. 3h).
To evaluate off-target editing activity of the three most potent novel TnpBmax variants (TnpBmax-K76A, TnpBmax-K76C, and TnpBmax-K76S), we next conducted GUIDE-seq experiments in HEK293T cells using three distinct ωRNAs (Fig. 3i, Supplementary Fig. 7). While the ωRNA targeting the DYNC1H1 locus exhibited no off-target editing with any of the tested variants, the ωRNAs targeting HPRT1 and VEGFA triggered off-target editing (Fig. 3i). The HPRT1 off-target sites, however, did not show any mismatches to the 12bp seed region of the ωRNA, and editing at these sites was therefore expected.
In summary, engineering of TnpBmax at residue K76 resulted in novel variants that collectively target 5’-TYKAT TAMs, substantially increasing the targeting range of ISDra2 TnpB.
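To illustrate how the expanded TAM changes the set of targetable sites, the sketch below scans a DNA sequence on both strands for a TAM followed directly on its 3' side by a 20-nt target. It is a minimal example rather than a guide design tool: the function names are hypothetical, only a subset of IUPAC codes is supported, and the degenerate motif 5'-TYKAT describes the K76 variants collectively, not the preference of any single variant.

```python
import re
from itertools import product

# Subset of IUPAC nucleotide codes, written as regular-expression fragments
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "[CT]", "K": "[GT]", "R": "[AG]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def expand_tam(tam):
    """List every concrete sequence matched by a degenerate TAM."""
    options = [IUPAC[base].strip("[]") for base in tam.upper()]
    return ["".join(combo) for combo in product(*options)]


def find_targets(seq, tam="TTGAT", guide_len=20):
    """Return (strand, start, TAM, target) for each TAM plus 3' target.

    'start' is the 0-based TAM position on the scanned strand.
    A lookahead is used so that overlapping sites are all reported.
    """
    tam_pattern = "".join(IUPAC[base] for base in tam.upper())
    pattern = re.compile(f"(?=({tam_pattern})([ACGT]{{{guide_len}}}))")
    seq = seq.upper()
    hits = []
    for strand, strand_seq in (("+", seq), ("-", revcomp(seq))):
        for match in pattern.finditer(strand_seq):
            hits.append((strand, match.start(), match.group(1), match.group(2)))
    return hits
```

`expand_tam("TYKAT")` yields four concrete motifs (TCGAT, TCTAT, TTGAT, TTTAT), which is the basis for the roughly fourfold increase in targetable sites under the assumption of equal base frequencies.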
Deep learning predicts ωRNA efficiency
Given the strong dependency of TnpBmax on the nucleotide composition of the target site (Fig. 2b-c), we next sought to develop a computational model that can predict the activity of ωRNAs. The dataset of our target-matched library screen, which contains editing efficiencies at 10,211 different sites, was split into 70 % training, 20 % testing, and 10 % validation sequences to develop and validate various machine learning models. The models were trained to predict the efficiency of any given ωRNA by learning from the nucleotide sequence itself and several sequence-related features such as minimum free energy, GC content, and melting temperature (Supplementary Note, Supplemental Data S1). In total, we evaluated five different classes of machine learning models: tree-based models (eXtreme Gradient Boosting Trees, XGBoost), Feed-Forward Neural Networks (FNN), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Transformer-based deep learning models (Fig. 4a). Model performance was evaluated by calculating Pearson (r) and Spearman (R) correlation between predicted efficiency scores and the observed values using the test dataset. Among all tested models, the TEEP-CNN (r = 0.81, R = 0.77) and TEEP-RNN (r = 0.80, R = 0.77) models demonstrated the best prediction performance (Fig. 4d-e). Further validating both models, we observed a strong correlation between predicted and actual editing efficiencies in a separate target-matched library in HEK293T cells (r > 0.84, R > 0.75) and Neuro-2a cells (r > 0.82, R > 0.78) (Fig. 4f-g). Since editing efficiencies of TnpB might not solely depend on the ωRNA sequence but also on other factors, such as the chromatin state at the target locus, we next assessed the performance of both models on endogenous loci. 
Analysis of 10 sites in HEK293T cells and 19 sites in Neuro-2a cells revealed that both models were still able to predict editing efficiencies with a correlation of r > 0.73 (R > 0.77) in HEK293T cells and r > 0.71 (R > 0.57) in Neuro-2a cells (Fig. 4h-i). Furthermore, both TEEP models accurately predicted editing efficiencies of an external dataset, where ISDra2 TnpB was directed to 7 endogenous loci in HEK293T cells (Fig. 4j)(6). Finally, we also tested whether the TEEP models are able to predict editing efficiencies with TnpBmax-K76A. Examination of 11 target sites with 5’-TTtAT TAMs revealed a similarly strong correlation between predicted and observed values (r > 0.72, R > 0.71; Fig. 4k).
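The data split and the two evaluation metrics described above can be sketched in a few lines of dependency-free Python. This is an illustration only: the random shuffling, the seed, and the rounding of subset sizes are assumptions, since the text does not state how the 70/20/10 split was drawn, and the function names are hypothetical.

```python
import random
from math import sqrt


def split_dataset(items, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle and split items into training, test, and validation subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(fractions[0] * len(items))
    n_test = round(fractions[1] * len(items))
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]  # remainder, about 10 %
    return train, test, validation


def pearson(x, y):
    """Pearson correlation coefficient (r) of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (R): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Reporting both coefficients is informative because Pearson's r measures linear agreement between predicted and observed efficiencies, whereas Spearman's R only measures whether the ranking of ωRNAs is preserved, which is often what matters when selecting the best guide among candidates.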
Figure 4. Machine learning accurately predicts ωRNA activity.
(A) Comparison of 12 machine learning algorithms predicting TnpB editing efficiency. feat, feature; seq, sequence; XGBoost, eXtreme Gradient Boosting; FNN, feedforward neural network; CNN, convolutional neural network; RNN, recurrent neural network. Values represent mean + s.d. of n = 5 runs. (B-C) Schematic representation of the two best-performing algorithms (CNN and RNN), hereafter termed TEEP. (D-E) Performance evaluation of TEEP-CNN and TEEP-RNN on sequences from the model training (test dataset). r, Pearson’s correlation coefficient; N, number of individual target sites. Datapoints represent the mean of n = 3 independent biological replicates. (F-G) Validation of TEEP-CNN and TEEP-RNN predictions on target-matched libraries integrated in HEK293T and Neuro-2a cells. (H-I) Performance evaluation of TEEP-CNN and TEEP-RNN on individual endogenous loci in HEK293T and Neuro-2a cells. (J) Correlation of TEEP-CNN and TEEP-RNN predictions on an external dataset by Nakagawa et al. (6). (K) Performance evaluation of TEEP-CNN and TEEP-RNN on ωRNAs tested with TnpB-K76A and 5’-TTtAT TAMs in HEK293T cells. N, number of individual target sites; r, Pearson’s correlation coefficient. Dots represent the mean of n = 3 independent biological replicates.
In conclusion, we developed RNN and CNN models that robustly predict ωRNA activities for TnpB. To facilitate usage and accessibility of these models we made TEEP publicly available via go.tnpb.app.
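For readers who want to prepare inputs for sequence-based models of this kind, the sketch below shows one common way to encode a target sequence and to derive simple composition features. It does not reproduce the TEEP input pipeline: the one-hot layout and function names are assumptions, and features such as minimum free energy or melting temperature require dedicated tools and are not computed here.

```python
BASES = "ACGT"  # column order of the one-hot encoding (an arbitrary choice)


def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element rows, one per base.

    Bases outside A, C, G, T (for example N) become all-zero rows.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def purine_fraction(seq):
    """Fraction of A and G bases, relevant given the purine preference."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("G")) / len(seq)
```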
In vivo genome editing with TnpBmax from single AAV vectors
The compact nature of TnpB allows combined delivery with its ωRNA on single-stranded (ss) and self-complementary (sc) AAV vectors, making it an ideal tool for in vivo genome editing. To first assess whether we can further reduce the size of the TnpB genome editing system, we determined the minimal active ωRNA sequence by progressively trimming its ends (Fig. 5a). In line with a recent study (6), we found that removing bases from the 5’ end was well tolerated until position 114 (Fig. 5b; Supplementary Fig. 8a-d). Likewise, replacement of the stem 3b (6) (−70G to −49A) loop with a GAAA-tetraloop did not decrease editing activity (Supplementary Fig. 8e). To further minimize the ωRNA design, we also tested a set of 11 separate ωRNAs with and without the HDV ribozyme sequence. However, ωRNAs without the HDV ribozyme showed slightly reduced activity (Fig. 5c-d), which could not be rescued by replacement of the HDV ribozyme sequence with shorter RNA stabilizing motifs (Supplementary Fig. 8f). We therefore proceeded with a 117-nt ωRNA-scaffold (mini-ωRNA) and the HDV ribozyme attached to the guide sequence.
Figure 5. Programmable in vivo genome editing with TnpBmax.
(A) Schematic representation of the ISDra2 TnpB 3' end and the overlapping ωRNA. RuvC, nuclease domain; aa, amino acids; nt, nucleotides. (B) Identification of a minimal ωRNA by progressive trimming in HEK293T cells. Bar represents the mean ± s.d. of n = 4 independent biological replicates. (C-D) Comparison of ωRNAs with and without (w/o) hepatitis delta virus ribozyme (HDVr) on eleven individual target sites (TS) in HEK293T cells; n = 3 independent biological replicates. Box plots show the 25th and 75th percentiles, with whiskers extending to the minimum and maximum values; each individual value is plotted, and the line in the box marks the median. (E) TEEP predictions (left) and experimental values (right, Neuro-2a cells) for eight Dnmt1-targeting ωRNAs and four Pcsk9-targeting ωRNAs. The arrow indicates the ωRNAs picked for in vivo validation. Bar represents the mean ± s.d. of n ≥ 3 independent biological replicates (for experimental values). (F) Schematic representation of the single-stranded (ss) AAV9 and self-complementary (sc) AAV9 designs for in vivo use. AAV9, adeno-associated virus serotype 9; EFS, EF-1α short promoter; P3, liver-specific promoter; U6, Pol III-dependent promoter for ωRNA expression; NLS, nuclear localization sequence; HDVr, hepatitis delta virus ribozyme; WPRE, Woodchuck Hepatitis virus posttranscriptional regulatory element. Schematic representation of AAV injection routes in C57BL/6J newborn or adult mice. ICV, intracerebroventricular. (G-I) TnpB-mediated editing at the Dnmt1 locus determined by deep amplicon sequencing in separated brain regions of mice treated with 5.0×10¹² vg/kg (ssAAV) or 5.0×10¹³ vg/kg (ssAAV and scAAV). BS, brain stem; CTX, cortex; Hipp, hippocampus; Hypo, hypothalamus; MB, midbrain; OB, olfactory bulb; ST, striatum; TM, thalamus; CTRL, control. Each dot represents data from one animal; bar represents the mean ± s.d. of n = 3 animals. 
(J) Editing in newborn mice treated with either 1.0×10¹³ vg/kg or 5.0×10¹³ vg/kg of ssAAV9-TnpB-Pcsk9. Each dot represents data from one animal; bar represents the mean ± s.d. of n = 5 animals for the 1.0×10¹³ vg/kg dose and n = 3 animals for the 5.0×10¹³ vg/kg dose and the control. (K) Editing efficiencies of TnpB delivered from dose-matched single-stranded and self-complementary AAV9 in adult mice. Each dot represents data from one animal; bar represents the mean ± s.d. of n = 3 animals. (L) Relative Pcsk9 mRNA, PCSK9 protein, and low-density lipoprotein (LDL) levels in adult mice treated with 5.0×10¹³ vg/kg of scAAV9-TnpB-Pcsk9. Values were normalized to untreated control mice. Each dot represents data from one animal; bar represents the mean ± s.d. of n = 3 animals (mRNA and protein levels) or only the mean (LDL) of n = 2 animals. (M) Editing efficiencies of TnpB delivered from self-complementary AAV9 in adult mice in heart, kidney, lung, and genital organs. Each dot represents data from one animal; bar represents the mean ± s.d. of n = 3 animals. (N) Schematic representation of the on- and off-target assessment via CAST-seq. CAST-seq exploits locus-specific decoy primers to improve the sensitivity in detecting off-target mediated translocations and chromosomal aberrations at the on-target site. (O) CAST-seq analysis of genomic DNA isolated from adult mice treated with scAAV9-TnpB-Pcsk9 (5.0×10¹³ vg/kg). Circos plot shows on-target rearrangements in green and off-target mediated translocations (OMT) in red (none present).
We next applied TEEP to design different mini-ωRNA guides targeting murine Dnmt1 and Pcsk9. After confirming the performance of the mini-ωRNAs in Neuro-2a cells (Fig. 5e), the most efficient guides were cloned into single-stranded or self-complementary AAVs (ssAAV, scAAV) for co-expression together with TnpBmax (Fig. 5f, Supplementary Fig. 9a-b). First, ssAAV9 targeting Dnmt1 was delivered at a dose of 5×10¹² vg/kg into newborn C57BL/6J mice via intracerebroventricular (ICV) injection. After eight weeks, different regions of the mouse brain were analyzed by deep amplicon sequencing (Fig. 5g). Editing was observed throughout the brain, with the highest efficiencies occurring in the cortex (8.9 %; Fig. 5g). While administration of a 10-fold higher ssAAV9 dose did not increase indel efficiencies (6.5 % editing in the cortex; Fig. 5h), delivery of scAAV vectors at a dose of 5×10¹³ vg/kg led to substantially enhanced editing (14.1 % in the cortex; Fig. 5i). Furthermore, direct intracortical injection of the same vector at a dose of 5×10¹³ vg/kg resulted in 21.9 % editing in the bulk cortex and 65.9 % editing when only the tissue around the injection site was analyzed (Extended Data Fig. 3a-c). Next, we systemically administered ssAAV9 vectors into neonatal and adult mice, expressing TnpBmax under the liver-specific P3 promoter and the ωRNA targeting the Pcsk9 locus under the U6 promoter. After six weeks, we isolated genomic DNA from liver necropsies and hepatocytes for deep amplicon sequencing. At a dose of 1×10¹³ vg/kg, we observed editing efficiencies of 12.4 % in the liver (15.3 % in isolated hepatocytes) (Fig. 5j). Raising the vector dose 5-fold resulted in only slightly enhanced editing efficiencies when mice were treated either as neonates or adults (19.4 % and 15.3 % in the liver; 34.2 % and 19.2 % in isolated hepatocytes; Fig. 5j-k). 
However, using scAAV again increased editing efficiencies, with a vector dose of 5×10¹³ vg/kg leading to 56.0 % editing in the liver (75.3 % in isolated hepatocytes; Fig. 5k). These editing efficiencies also resulted in substantial PCSK9 protein reduction and an associated decrease in blood cholesterol levels (Fig. 5l). Subsequent analysis revealed that editing was predominantly localized to the liver (Fig. 5m), and that administration of ssAAV-TnpB or scAAV-TnpB vectors did not result in a significant induction of pro-inflammatory cytokines (Supplementary Fig. 10). Finally, we also assessed whether in vivo genome editing with TnpBmax led to undesired chromosomal aberrations. However, when we performed CAST-Seq on genomic DNA isolated from hepatocytes of animals treated with scAAV9-TnpB-Pcsk9, only on-target aberrations but no off-target-mediated translocations between Pcsk9 and other loci were observed (Fig. 5n-o).
Discussion
In this study, we enhanced the efficiency and extended the targeting range of ISDra2 TnpB for genome editing in mammalian cells. Our TnpBmax editor exhibits higher activity than other previously reported small-sized programmable nucleases, and due to the 5-nt TAM and stringent 12-nt seed region, its specificity is comparable to that of SpCas9. By introducing amino acid substitutions at position 76, we further expanded the targeting scope of ISDra2 TnpB from the canonical 5’-TTGAT TAM to 5’-TYKAT TAMs, which occur on average four times more often in the genome. Finally, we generated machine learning models for predicting ωRNA efficiencies, enabling us to design in vivo genome editing experiments where TnpBmax achieved up to 75 % editing efficiency in mice. The model (TEEP) is accessible via go.tnpb.app. Overall, the comprehensive characterization of ISDra2 TnpB, coupled with the development of novel variants and machine learning models for ωRNA selection, provides a valuable resource for researchers utilizing this ultracompact genome editor.
Methods
Molecular cloning
PCRs were performed using the Q5 High-Fidelity DNA polymerase (New England Biolabs, NEB). All expression vectors were assembled using NEBuilder HiFi DNA assembly (NEB). Plasmids expressing ωRNAs were cloned according to the protocol presented in Extended Data Fig. 4. Plasmids used in mammalian tissue culture were purified using NucleoBond Xtra Midi kits (Macherey-Nagel) or GeneJET Plasmid Miniprep Kit (Thermo Fisher). Primer sequences used are listed in Supplemental Datafile S1. Oligonucleotide sequences were purchased from Microsynth or Integrated DNA Technologies (IDT). PCRs and plasmid constructs were confirmed via Sanger or Nanopore sequencing (by Microsynth). TnpB (and Fanzor) coding sequences were codon optimized using the GenSmart™ Codon Optimization tool provided by GenScript (expression host organism: human).
Target library cloning
ωRNA-target-pairs were designed in silico and ordered as single-stranded DNA oligo pools (TWIST Bioscience). The library containing the ωRNA and the corresponding target sequence was prepared using a one-step cloning process to prevent uncoupling of the ωRNA and target sequence. A schematic representation of the workflow is shown in Supplementary Fig. 11. The oligonucleotide pools were PCR-amplified in 10 cycles according to the manufacturer’s protocols using NEBNext Ultra II polymerase (NEB). The resulting amplicons were then purified using 0.8 × volumes of paramagnetic AMPure XP beads (Beckman Coulter) following the manufacturer’s instructions for PCR cleanup and assembled into a linearized (SpeI, NEB) library acceptor plasmid using NEBuilder HiFi DNA Assembly Master Mix (NEB) for 1 h at 50 °C. The product was precipitated by adding one volume of Isopropanol (99 %), 0.01 volumes of GlycoBlue coprecipitant (Invitrogen), and 0.02 volumes of 5 M NaCl solution. The mix was vortexed for 10 sec and incubated at room temperature for 15 min, followed by 15 min centrifugation (15,000 × g). The supernatant was discarded and replaced by two volumes of ice-cold ethanol (80 %). Ethanol was removed immediately, and the pellet was air-dried for 1 min. The pellet was dissolved in TE buffer (10 mM Tris, 0.1 mM EDTA) for 10 min. The pool was purified and electroporated into Endura electrocompetent cells (Lucigen) using a Gene Pulser II device (Bio-Rad). Transformed cells were recovered for 1 h and spread on Luria–Bertani agar plates (245 × 245 mm, Thermo Fisher Scientific) containing 100 μg/mL ampicillin. After incubation at 30 °C for 14 h, the colonies were scraped, and plasmids were purified.
Cell culture
Cell lines used in this study were incubated at 37 °C and 5 % CO2 within cell culture incubators. We maintained HEK293T cells (ATCC CRL-3216) in Dulbecco's modified Eagle's medium (DMEM) with added GlutaMAX (Thermo Fisher Scientific). Neuro-2a (ATCC CCL-131) cells were maintained in Eagle’s Minimum Essential Medium (EMEM). Both media were supplemented with 10 % (v/v) fetal bovine serum (FBS; Sigma-Aldrich) and 1 % (v/v) penicillin/streptomycin (Thermo Fisher Scientific). Cells were maintained at confluency below 90 % and tested negative for Mycoplasma contamination. Cells were authenticated by the supplier by short tandem repeat analysis.
Transfections
HEK293T or Neuro-2a cells (7.5 × 10^4/well) were seeded into 48-well flat-bottom cell culture plates (Corning) and transfected 24 h after seeding with 500 ng of editor and 250 ng of ωRNA expression plasmid, using 1.5 μL of Lipofectamine 2000 (Invitrogen) per well. One day later, the medium was removed, and cells were detached using one drop of TrypLE (Gibco) per well, resuspended in fresh medium (containing 2.5 μg/mL puromycin for experiments with selection), and plated again into 48-well flat-bottom cell culture plates. Cells were harvested 4 days after transfection except for the ωRNA editing time series (Fig. 1c; 1-9 days), target-matched library screens (Fig. 1d-f, Fig. 2a-i, Fig. 3c, Fig. 4d-g, Fig. 5c, Supplementary Fig. 2, Supplementary Fig. 3, Supplementary Fig. 4, Supplementary Fig. 9f; 10 days) and TnpBmax-K76A editing at 5’-TTtAT and 5’-TTGAT TAMs (Fig. 3f-g; 8 days). Cells transfected with plasmids expressing OMEGA effector constructs were selected with 2.5 μg/mL puromycin, starting 1 day post-transfection. Cells transfected with scAAV or ssAAV constructs for TnpB expression (in vitro test, Figure S9b) were not selected with antibiotics. To obtain genomic DNA, cells were resuspended in 30 μL 1× PBS and 10 μL of lysis buffer (4× Lysis Buffer: 10 mM Tris–HCl at pH 8, 2 % Triton X, 1 mM EDTA, and 1 % freshly added Proteinase K (Qiagen)). Lysis was performed in a thermocycler (Bio-Rad) using the following program: 60 °C, 60 min; 95 °C, 10 min; 4 °C, hold. The lysate was diluted to a final volume of 100 μL using nuclease-free water for subsequent PCRs.
Target-matched library integration and effector screen
For integration of the library, the respective plasmid pool was transfected into HEK293T or Neuro-2a cells alongside plasmids encoding the Sleeping Beauty transposase (Addgene #34879) and rtTA (Addgene #163601) at a coverage of 2,000 cells per library member. Transfections were carried out in 150 mm cell culture dishes (Nunc EasYDish) at 9.0 × 10^6 cells/dish using 80 μg plasmid DNA (equal molar ratio of all plasmids) and 80 μL Lipofectamine 2000 (Invitrogen) per dish, each premixed in 1.5 mL Opti-MEM (Gibco). Transfection mixes were incubated for 20 minutes and added dropwise to the cell culture medium. Post-transfection, cells underwent doxycycline induction (500 ng/mL) and three passages (every 3 days) of selection with blasticidin. On day 0 of the screen, HEK293T-Library cells received the RNA-guided DNA nuclease effector (e.g. TnpB-P2A-GFP-Puro) and were subjected to 10 days of puromycin selection (passaging cells on days 1, 4, and 7). Libraries with N ≤ 100 members were screened in 6-well flat-bottom cell culture plates (Corning) with 0.6 × 10^6 cells/well using 4 μg plasmid DNA and 8 μL Lipofectamine 2000, each premixed in 0.25 mL Opti-MEM. Libraries with N > 10,000 members were screened in 3 × 150 mm cell culture dishes (9.0 × 10^6 cells/dish) using 80 μg plasmid DNA and 80 μL Lipofectamine 2000 per dish, each premixed in 1.5 mL Opti-MEM.
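The relationship between library size, coverage, and the number of dishes can be checked with simple arithmetic. The sketch below is illustrative only: it assumes a library of 10,211 members (the number of target sites reported in the abstract, which may differ from the size of the cloned library) and uses the stated 2,000-fold coverage and seeding density of 9.0 × 10^6 cells per 150 mm dish. The function names are hypothetical.

```python
import math


def cells_required(library_size, coverage_per_member):
    """Number of cells needed to reach the desired coverage per library member."""
    return library_size * coverage_per_member


def dishes_required(cells, cells_per_dish):
    """Smallest whole number of dishes that holds the required cells."""
    return math.ceil(cells / cells_per_dish)


LIBRARY_SIZE = 10_211        # assumption: taken from the abstract
COVERAGE = 2_000             # cells per library member (from the text)
CELLS_PER_DISH = 9_000_000   # 9.0 x 10^6 cells per 150 mm dish (from the text)

total_cells = cells_required(LIBRARY_SIZE, COVERAGE)
n_dishes = dishes_required(total_cells, CELLS_PER_DISH)
print(total_cells, n_dishes)
```

Under these assumptions, roughly 2.0 × 10^7 cells are needed, which fits into three 150 mm dishes and is consistent with the three dishes used for libraries with N > 10,000 members.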
High-throughput sequencing (deep amplicon sequencing)
Preparation of DNA for high-throughput sequencing (HTS) was performed as previously described (19). In short, the first PCR (GoTaq Green Mastermix, Promega) was performed to amplify genomic sites of interest with primers containing Illumina forward and reverse adapter sequences. For the p5/p7 barcoding PCR, NEBNext High-Fidelity 2× PCR Master Mix (NEB) was used according to the manufacturer’s instructions. The final pool was quantified on the Qubit 3.0 (Invitrogen) instrument. Libraries were sequenced on a MiSeq or NovaSeq 6000 (Illumina, 150 bp, paired-end). Amplicon sequences were analyzed using custom Python scripts (refer to ‘Data Availability’ and ‘Code Availability’ sections) or CRISPResso2 (20).
High-throughput TAM detection assay
HT-PAMDA was performed as recently described (18). In brief, TnpB was cloned in pCMV-T7-SpCas9-P2A-EGFP (Addgene #139987) and expressed in HEK293T cells for 48 hours. Whole-cell lysates were collected and normalized to a concentration corresponding to 150 nM fluorescein dye. Target-specific ωRNAs were in vitro transcribed using HiScribe T7 High Yield RNA Synthesis Kit. Substrate libraries with different target sites and TAM libraries were cloned into p11-LacY-wtxq (Addgene #69056). ωRNAs (1.1 μM) and normalized cell lysate (83 nM fluorescein) were complexed for 10 minutes at 37 °C. The RNP mixture (0.5 μM ωRNA, 37.5 nM fluorescein lysate) was added to the substrate library (2.5 nM), and the reaction was stopped after different time intervals (1, 8, and 32 minutes). For all time points, substrate libraries were individually PCR amplified using time point-specific barcodes, followed by an amplification using protein variant-specific barcodes. Samples were pooled and sequenced on a NovaSeq 6000 (Illumina). TnpB was characterized on two different target sites (GTCAGTGTGATAGGATCCGT and GTGATGGGAGCCCTTCTTCT). Two independent replicates were performed on different days.
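To illustrate how a rate constant can be derived from the three time points, the following minimal sketch fits a first-order depletion model, f(t) = exp(−k·t), to the fraction of uncleaved substrate by log-linear least squares through the origin. This is a simplified stand-in and not the published HT-PAMDA analysis pipeline, which performs its own normalization and fitting; the function name and the synthetic input values are assumptions.

```python
import math


def fit_rate_constant(times_min, fraction_uncleaved):
    """Least-squares estimate of k for f(t) = exp(-k * t), fitted through the origin.

    Taking logarithms gives ln f = -k * t, so k = -sum(t * ln f) / sum(t^2).
    """
    numerator = sum(t * math.log(f) for t, f in zip(times_min, fraction_uncleaved))
    denominator = sum(t * t for t in times_min)
    return -numerator / denominator


# Time points from the assay (minutes); fractions are synthetic, generated with k = 0.1.
times = [1, 8, 32]
fractions = [math.exp(-0.1 * t) for t in times]

k = fit_rate_constant(times, fractions)
print(k, math.log10(k))  # rate constant and its log10, as plotted per TAM
```

With real data, one such fit would be made per TAM and target site, and the log10 rate constants averaged across replicates.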
Genome-wide, unbiased identification of DSBs (double-strand breaks) enabled by sequencing (GUIDE-Seq)
In brief, 2 × 10^5 HEK293T cells were resuspended in SF nucleofection solution (TnpBmax experiment) or homemade nucleofection solution as previously described (21) (TnpBmax-K76X experiment), combined with 666 ng of TnpB variant expression plasmid, 334 ng of ωRNA coding plasmid, and an additional 30 pmol of double-stranded oligodeoxynucleotide (dsODN), following the original GUIDE-seq protocol (17). For negative controls, only the dsODN was transfected. Nucleofections were performed in replicates using the CM-130 program on a Lonza 4-D Nucleofector instrument strip with 20 μL nucleofection solution according to the manufacturer’s protocol. Transfected cells were harvested ∼96 h post-transfection, and genomic DNA was purified according to the Puregene DNA Purification protocol (Gentra systems). GUIDE-seq relies on incorporating a short dsODN tag into DNA breaks. Consequently, following genomic DNA purification, the integration of the dsODN tag and efficient indel formation at the on-target site were confirmed through deep amplicon sequencing. Subsequently, the genomic DNA was sheared to an average size of 500 bp using Covaris E220 in accordance with the manufacturer’s protocol. Sample libraries were sequenced on the Illumina MiSeq instrument. Fastq files were analyzed using the open-source GUIDE-Seq software (version 1.1) (22). Consolidated reads were mapped to the human hg38 reference genome. Upon identification of the genomic regions integrating dsODNs in aligned data, off-target sites were retained if at most eight mismatches against the target were present and if absent in the background controls. Visualization of aligned off-target sites is provided as a color-coded sequence grid. GUIDE-seq data can be found in Supplemental Data S1, and the respective sequencing data are deposited in the NCBI Sequence Read Archive (accession ID: PRJNA1019264).
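The retention rule for off-target sites (at most eight mismatches to the target and absent from the background controls) can be sketched as follows. This is a simplified illustration and not the GUIDE-Seq software itself: it compares equal-length sequences position by position (Hamming distance) and ignores alignment gaps; the site identifiers, sequences, and function names are hypothetical.

```python
def count_mismatches(site, target):
    """Count position-wise mismatches between two equal-length sequences."""
    if len(site) != len(target):
        raise ValueError("site and target must have the same length")
    return sum(a != b for a, b in zip(site.upper(), target.upper()))


def retain_off_targets(candidates, target, background_ids, max_mismatches=8):
    """Keep candidate sites with few mismatches that are absent from the controls.

    candidates: iterable of (site_id, sequence) pairs
    background_ids: set of site_ids also detected in dsODN-only controls
    """
    kept = []
    for site_id, sequence in candidates:
        if site_id in background_ids:
            continue  # also found in the negative control, so discard
        if count_mismatches(sequence, target) <= max_mismatches:
            kept.append(site_id)
    return kept


# Hypothetical toy data for illustration only.
target = "ACGTACGTACGTACGTACGT"
candidates = [
    ("chr1:100", "ACGTACGTACGTACGTACGT"),  # 0 mismatches
    ("chr2:200", "ACGTACGTACGTTTTTTTTT"),  # 6 mismatches
    ("chr3:300", "TTTTTTTTTTTTTTTTTTTT"),  # 15 mismatches
    ("chr4:400", "ACGTACGTACGTACGTACGT"),  # present in background control
]
print(retain_off_targets(candidates, target, background_ids={"chr4:400"}))
```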
CAST-seq
CAST-seq (chromosomal aberrations analysis by single targeted linker-mediated PCR sequencing) was performed as described initially by Turchiano et al., 2021 (23) with minor modifications (24). Briefly, genomic DNA was fragmented by enzymatic digestion (NEBNext® Ultra™ II FS DNA Library Prep, NEB) to obtain an average fragment length of 600 bp. After linker ligation (NEBNext® Ultra™ II FS DNA Library Prep, NEB) and DNA purification, two rounds of PCR utilizing Hotstart Q5 polymerase (NEB) were performed with the following conditions: 20 cycles at 98 °C for 10 s, 63 °C (first reaction) or 68 °C (second reaction) for 20 s, 72 °C for 20 s. A third PCR introduced the barcoded Illumina adaptor for sequencing (NEBNext Multiplex Oligos for Illumina, NEB). Libraries were sequenced by Azenta Life Sciences using the Illumina NovaSeq platform with a 2 × 150 bp configuration.
Adeno-associated virus production
Vectors (AAV2 serotype 9) were produced by the Viral Vector Facility of the Neuroscience Center Zurich. Briefly, AAV vectors were ultracentrifuged and diafiltered. Physical titers (vg/mL) were determined using a Qubit 3.0 fluorometer (Invitrogen).
Animal studies
Mouse experiments were performed in accordance with protocols approved by the local animal welfare laws, guidelines, and policies (Kantonales Veterinäramt Zürich, ethical permission no. ZH022/2022). Mice (C57BL/6J) were housed in a pathogen-free animal facility at the Institute of Pharmacology and Toxicology at the University of Zurich and kept in a temperature- and humidity-controlled room (21 °C, 50 % RH) on a 12-h light/dark cycle and fed a standard laboratory chow (Kliba Nafag no. 3437 with 18.5 % crude protein). Mice were fasted for 3–4 h, after which blood was collected from the inferior vena cava prior to liver perfusion.
Intracerebroventricular injections
Newborn mice (P1) were injected with AAV vectors via intracerebroventricular injection (ICV, 5 × 10^12 vg/kg or 5 × 10^13 vg/kg). Animals were anesthetized using isoflurane (5 % isoflurane with 1000 mL/min in 100 % O2) and placed into a fitted stereotaxic mouse frame on a warming surface to maintain body temperature. During injections, anesthesia was maintained at 1.5 % isoflurane with 400 mL/min in 100 % O2. 2 μL of the AAV suspension was injected into each of the right and left hemispheres using a Hamilton syringe (Hamilton®, 10 microliters 701 RN syringe; Hamilton® small RN, ga 31/15 mm needles).
Stereotactic injection
Adult mice were administered buprenorphine (0.1 mg/kg body weight) subcutaneously 30 minutes before surgery, followed by anesthesia with isoflurane (induction with 5 % isoflurane at 1000 mL/min in 100 % O2 and maintenance with 1.5-2.5 % isoflurane at 400 mL/min in 100 % O2 during surgeries). For injections in the cortex, 2 × 2 μL of AAV vector (titer: 2.6 × 10^13 vg/mL) were microinjected unilaterally into the left and right hemispheres by stereotaxic surgery at the following coordinates: 0.25 mm anteroposterior (AP); ±1.5 mm mediolateral (ML); -0.5 mm dorsoventral (DV). AAV injections were performed using a homemade glass needle at a speed of 1 μL/min. The needle was slowly removed 5 minutes after injection, and the wound was sutured using Vicryl 5-0 suture (Ethicon).
Brain Isolation
Animals were euthanized by CO2 incubation, and brains were isolated and cut into 1 mm sections using an acrylic brain matrix. Brain regions were separated using a surgical scalpel. Tissue was lysed by addition of a lysis buffer (50 mM Tris pH 8.5-9, 50 mM NaCl, 2.5 mM EDTA, 0.05 % SDS and 1 % freshly added proteinase K) and incubation at 60 °C for 2 h, followed by inactivation of proteinase K at 95 °C for 10 min.
Primary hepatocyte isolation
Mice were euthanized using CO2 and immediately perfused with Hank’s balanced salt solution (Thermo Fisher Scientific) plus 0.5 mM EDTA via the inferior vena cava and a subsequent incision in the portal vein. During this step, one liver lobe was squeezed off via a thread to inhibit perfusion of this lobe and collect whole liver samples for whole liver lysates. After blanching of the liver, mice were perfused with digestion medium (low-glucose DMEM plus 1× penicillin-streptomycin (Thermo Fisher Scientific), 15 mM HEPES, and freshly added Liberase (Roche)) for 5 min. Livers were isolated in cold isolation medium (low-glucose DMEM supplemented with 10 % (vol/vol) FBS plus 1×penicillin–streptomycin (Thermo Fisher Scientific) and GlutaMax (Thermo Fisher Scientific)), and the liver was gently dissociated to yield a cell suspension that was passed through a 100-μm filter. The suspension was then centrifuged at 50 × g for 2 min and washed with isolation medium twice until the supernatant was clear. The primary hepatocytes were pelleted for direct lysis for deep amplicon sample preparation as previously described (19).
Clinical chemistry
Mouse PCSK9 protein levels were determined using Mouse Proprotein Convertase 9/PCSK9 Quantikine ELISA Kit (R&D Systems) according to the manufacturer’s instructions. Absorbance was measured at 450 nm and background at 540 nm; the latter was subtracted for quantification. Total cholesterol, triglyceride, and high-density lipoprotein (HDL) from all mouse samples were measured as routine parameters at the Division of Clinical Chemistry and Biochemistry at the University Children’s Hospital Zurich using Alinity ci-series. LDL (low-density lipoprotein) levels were calculated by using the Friedewald formula.
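For reference, the Friedewald estimate subtracts HDL cholesterol and an approximation of VLDL cholesterol (triglycerides divided by a unit-dependent factor) from total cholesterol. The sketch below is a minimal illustration; the source does not state which units were used, so the unit handling and the function name are assumptions. The formula was originally derived for human plasma.

```python
def friedewald_ldl(total_cholesterol, hdl, triglycerides, units="mg/dL"):
    """Estimate LDL cholesterol with the Friedewald formula.

    LDL = total cholesterol - HDL - triglycerides / factor,
    where factor is 5 for mg/dL and 2.2 for mmol/L.
    """
    factors = {"mg/dL": 5.0, "mmol/L": 2.2}
    if units not in factors:
        raise ValueError("units must be 'mg/dL' or 'mmol/L'")
    return total_cholesterol - hdl - triglycerides / factors[units]


# Hypothetical values for illustration only.
print(friedewald_ldl(200.0, 50.0, 150.0))               # mg/dL
print(friedewald_ldl(5.0, 1.3, 2.2, units="mmol/L"))    # mmol/L
```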
Western blotting
HEK293T cells were harvested 96 h after transfection with TnpB-encoding plasmids. Nuclear and cytoplasmic proteins were isolated using the NE-PER™ Nuclear and Cytoplasmic Extraction Reagents Kit (Thermo Scientific) according to the manufacturer’s instructions. For the overall protein expression analysis, cells were lysed using radioimmunoprecipitation assay (RIPA) buffer (150 mM Tris pH 8.0, 150 mM NaCl, 0.1 % SDS, 0.5 % sodium deoxycholate, 1 % NP-40; Thermo Scientific), supplemented with protease inhibitor (Sigma-Aldrich) and PhosSTOP™ (Sigma-Aldrich). Protein concentrations were determined using the Pierce Bicinchoninic Acid (BCA) Protein Assay Kit (Thermo Scientific) following the manufacturer’s protocol. Protein amounts were normalized to the fraction with the lowest total protein concentration and maximum possible loading volume (overall protein and cytoplasmic fraction: 23 μg; nuclear fraction: 8 μg). Proteins were separated by SDS-polyacrylamide gel electrophoresis using NuPAGE 4-12 % Bis-Tris gradient gels (Thermo Scientific) and transferred to a 0.45 μm nitrocellulose membrane (Amersham). After blocking, membranes were incubated with mouse anti-FLAG (for detection of TnpB-FLAG protein) (1:500; cat. no. F3165-.2MG, Sigma-Aldrich), rabbit anti-beta-actin (1:100’000; cat. no. 81115-1-RR, clone no. 13E5, Cell Signaling Technology) and rabbit anti-Lamin B1 (1:1500; cat. no. ab16048, abcam) or rabbit anti-histone H3K4me (1:1000; cat. no. 39915, Active Motif). Signals were detected by fluorescence using IRDye® 800CW Goat anti-rabbit IgG secondary antibody (1:15’000; cat. no. 926-32211, LI-COR bio) and IRDye® 680RD Goat anti-mouse IgG secondary antibody (1:15’000; cat. no. 926-68070, LI-COR bio) and a LI-COR Odyssey® DLx imaging system. Signal intensities were determined using Image Studio™ version 5.5 (LI-COR).
Multiplexed detection of inflammation-linked cytokines
Serum samples were analyzed using LEGENDPlex Mouse Inflammation Panel (13-plex; Biolegend) according to the manufacturer’s instructions. Data were collected by flow cytometry on a FACSymphony A5 (BD Biosciences) and analyzed using LEGENDPlex software (Biolegend).
Quantitative polymerase chain reaction (qPCR)
RNA extraction was performed with the QIAGEN RNeasy Mini Kit, and cDNA was generated with the GoScript Reverse Transcriptase kit (Promega) according to the manufacturers’ instructions. For qPCR, 2 μL of 1:10-diluted cDNA was added to 8 μL of 5× HOT FIREPol EvaGreen qPCR Supermix (Solis BioDyne). Amplification and detection were performed on a LightCycler 480 II (Roche). Relative gene expression was determined using the comparative CT method. Genes with a median CT value of more than 33 cycles and a difference of less than 3.3 cycles to the no-template control (H2O) were defined as not detectable.
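The comparative CT calculation and the detectability rule can be sketched as follows. This is a minimal illustration: the choice of reference gene, the example CT values, and the function names are assumptions, and the difference to the water control is taken as an absolute value because the source does not specify its sign.

```python
import statistics


def is_detectable(ct_values, ntc_ct, max_ct=33.0, min_gap=3.3):
    """Apply the rule from the text: a gene is 'not detectable' if its median CT
    exceeds max_ct AND lies within min_gap cycles of the water control."""
    median_ct = statistics.median(ct_values)
    not_detectable = median_ct > max_ct and abs(ntc_ct - median_ct) < min_gap
    return not not_detectable


def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Comparative CT method: fold change = 2 ** (-ddCT)."""
    delta_sample = ct_target_sample - ct_ref_sample      # dCT in treated sample
    delta_control = ct_target_control - ct_ref_control   # dCT in control sample
    return 2.0 ** (-(delta_sample - delta_control))


# Hypothetical CT values for illustration only.
print(relative_expression(24.0, 18.0, 22.0, 18.0))   # ddCT = 2
print(is_detectable([34.0, 35.0, 36.0], ntc_ct=36.0))
print(is_detectable([25.0, 26.0, 27.0], ntc_ct=36.0))
```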
Machine learning
Here we describe the machine learning models developed in this study. The model is formulated to solve a supervised regression problem. The training data are assumed to be D = {(x_i, y_i)}, i = 1, ..., N, where x_i represents the feature vector of the ωRNA target site and y_i represents the corresponding editing efficiency, with y_i ∈ [0, 100]. We trained a model f_θ: x → y that predicts the editing efficiency for a given ωRNA target site.
Input representation
To represent the ωRNA, we used target sequences of 20 bases in length along with other features encompassing hand-engineered attributes derived from the target sequence (Supplemental Data S1). In total, our representation comprises the target sequence, 256 hand-engineered features extracted from the target sequence, and 10 other hand-engineered features that are not directly associated with the target sequence. We tested all models in two main scenarios: a) using only the target sequence and b) combining the target sequence with hand-engineered features. For traditional machine learning models, we used all hand-engineered features and the target sequence. For deep learning models, we only used the target sequence and those hand-engineered features that were not directly derived from the target sequence.
Model architecture
For model development, we explored both traditional machine learning and deep learning methods tailored for sequential data. We trained XGBoost as a traditional baseline model. For deep learning, we tested a feedforward neural network, convolutional neural network (CNN), recurrent neural network (RNN), and a transformer-based network. In Fig. 4, we present the results of 12 models, encompassing 5 distinct model classes trained on various input representations. ‘target seq’ represents the 20-length ωRNA target sequence, ‘10 features’ denotes the hand-engineered features not directly linked to the target sequence, and ‘all features’ signifies all 256+10 hand-engineered features combined with the target sequence. For XGBoost, we tested the target-sequence-only input representation using both index-based encoding (e.g., xi = [1,3,0,1,2, …], length 20) and one-hot encoding (flattened length 80), denoted as XGBoost (target seq; index encoding) and XGBoost (target seq; one-hot encoding), respectively. XGBoost (all features, without target seq) refers to XGBoost trained solely on the hand-engineered features without including the target sequence. The results derived from the XGBoost models reveal that the target sequence plays a crucial role in accurate prediction while adding other features did not significantly impact performance. The CNN and RNN detailed structures are depicted in Fig. 2b-c and ‘Supplementary Note’ List 1 and List 2. The CNN architecture utilized 1D convolution with a kernel size of two and a stride size of one. It consisted of three convolutional layers with 32, 64, and 124 filters, respectively, followed by the ReLU activation function. The features learned from the final CNN layer were flattened and fed into a two-layer MLP with ReLU activation. Regarding the RNN, we employed one embedding layer and two layers of Bi-directional LSTM (Long Short-Term Memory) with a hidden dimension of 64. 
Analogous to the CNN, the Bi-LSTM model also passed its learned features through a two-layer Multi-Layer Perceptron (MLP) to generate the final predictions. Additional information is available on GitHub via https://github.com/uzh-dqbm-cmi/Tnpb.
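The two target-sequence encodings and the effect of the convolutional settings on sequence length can be illustrated with a short sketch. The base-to-index mapping (A, C, G, T → 0, 1, 2, 3) is an assumption, since the source does not state it, and the length calculation assumes no padding or pooling between layers; the authors' implementation in the GitHub repository may differ. The example sequence is one of the target sites listed in the Methods.

```python
BASES = "ACGT"  # assumed index order; not specified in the source


def index_encode(seq):
    """Index-based encoding: one integer per base (length 20 for a 20-nt target)."""
    return [BASES.index(base) for base in seq.upper()]


def one_hot_encode(seq):
    """Flattened one-hot encoding: four binary values per base (length 80)."""
    flat = []
    for idx in index_encode(seq):
        flat.extend(1 if j == idx else 0 for j in range(len(BASES)))
    return flat


def conv1d_out_len(length, kernel_size=2, stride=1):
    """Output length of an unpadded 1D convolution."""
    return (length - kernel_size) // stride + 1


target = "GTCAGTGTGATAGGATCCGT"
print(index_encode(target)[:5], len(one_hot_encode(target)))

# Sequence length after each of the three convolutional layers (kernel 2, stride 1).
length = len(target)
for _ in range(3):
    length = conv1d_out_len(length)
# Flattened feature count entering the MLP, given 124 filters in the last layer.
print(length, length * 124)
```

Under these assumptions the sequence length shrinks from 20 to 17 positions, so the flattened input to the two-layer MLP would have 17 × 124 features.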
Training setup and experimental results
During model training, we employed a random train-test split, repeated five times. Each dataset was divided into training and validation sets to monitor the best-performing model. We used mean squared error (MSE) as the objective function, coupled with weight decay and dropout regularization (25) (see ‘Supplementary Note’ section 1.4). Our models were trained using the Adam optimizer (26). The CNN model was trained for 300 epochs with a batch size of 100, while the bidirectional LSTM model was trained for 500 epochs with a batch size of 1500.
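The repeated random splitting and the MSE objective can be sketched in a few lines. The split fractions, the seeding scheme, and the function names below are assumptions chosen for illustration; the source does not report them, and the authors' code in the GitHub repository should be consulted for the actual setup.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted editing efficiencies."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def repeated_splits(n_samples, n_repeats=5, test_frac=0.2, val_frac=0.1, seed=0):
    """Return n_repeats random (train, validation, test) index partitions.

    test_frac and val_frac are illustrative values, not taken from the source.
    """
    splits = []
    for repeat in range(n_repeats):
        rng = random.Random(seed + repeat)  # a different shuffle per repeat
        indices = list(range(n_samples))
        rng.shuffle(indices)
        n_test = int(n_samples * test_frac)
        n_val = int(n_samples * val_frac)
        test = indices[:n_test]
        val = indices[n_test:n_test + n_val]
        train = indices[n_test + n_val:]
        splits.append((train, val, test))
    return splits


splits = repeated_splits(100)
print(len(splits), [len(part) for part in splits[0]])
print(mse([0.0, 100.0], [10.0, 90.0]))
```

In such a setup, the validation set is used to select the best-performing model in each repeat, and performance is reported on the held-out test set.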
Extended Data
Extended Data Figure 1. Benchmarking of TnpB and Fanzor architectures in HEK293T cells.
(A) Schematic representation of experimental workflow and designs. NLS, nuclear localization sequence; BPNLS, bipartite NLS; SRAD, Serine-Arginine-Alanine-Aspartic acid; GS, Glycine-Serine; PuroR, Puromycin resistance; d, days; HTS, high-throughput sequencing; a, codon optimization and design from Xiang et al. (11) and Saito et al. (12). (B-D) Benchmarking of different architectures of ISDra2, ISAam1 and ISYmu1 TnpBs. Number of analyzed endogenous targets: ISDra2 TnpB, N = 7; ISAam1 TnpB, N = 7; ISYmu1 TnpB, N = 8. Each dot represents the mean of n = 3 independent biological replicates; the black bar represents the mean of all target sites tested for the respective design. Means were compared by two-tailed t-test. (E) Benchmarking of SpuFz1-v2 Fanzor embedded in various designs tested at one endogenous locus (B2M). Each bar represents the mean ± s.d. of n = 3 independent biological replicates, and a two-tailed t-test was used for statistical comparison. Indel frequencies are shown in Datafile S1.
Extended Data Figure 2. High-throughput TAM determination assay (HT-TAMDA) of TnpBmax and variants thereof.
The log10 (rate constant) represents the mean of two replicates against two distinct target sequences.
Extended Data Figure 3. Direct intracortical injection of scAAV-TnpB-Dnmt1.
(A) Schematic representation of stereotactic scAAV injection. (B-C) TnpBmax-mediated editing at the Dnmt1 locus determined by deep amplicon sequencing in separated brain regions of mice treated with 5.0 × 10^13 vg/kg scAAV. CTX, cortex; BS, brain stem; Hipp, hippocampus; Hypo, hypothalamus; MB, midbrain; OB, olfactory bulb; ST, striatum; TM, thalamus; CTRL, control. Each dot represents data from one animal; bar represents the mean ± s.d. of n = 3 animals.
Extended Data Figure 4. Detailed protocol for ωRNA guide cloning.
Step 1: Digest and purify the ωRNA acceptor plasmid with BbsI. Step 2: Perform ligation or Golden-Gate-Assembly of phosphorylated and annealed oligonucleotides into the digested pωRNA-acceptor.
Acknowledgements
We thank the Functional Genomics Center Zurich for technical support and access to instruments at the University of Zurich and ETH Zurich, the mRNA platform at UZH/USZ and S. Pascolo, J. Frei, and C. Wyss for the production and purification of RNAs, the viral vector facility of UZH and J.-C. Paterna and M. Rauch for production of AAVs, G. Andrieux for bioinformatic analysis of CAST-seq data, O. Melkonyan for HT-TAMDA analysis, as well as J. Häberle and N. Rimann for measurements of blood LDL levels. We thank I. Querques, M. Jinek, M. Pacesa, L.-M. Koch, Lotti, and members of the Schwank lab for valuable discussions, feedback, and help throughout the study. This work was supported by the URPPs (University Research Priority Programs) ‘Human Reproduction Reloaded’ (to G.S.) and ‘ITINERARE’ (to G.S. and M.K.), the PROMEDICA Foundation (to G.S.), the Swiss National Science Foundation (SNSF) grant numbers 185293 and 214936 (to G.S.) and grant number 201184 (to M.K.), the UZH PhD fellowship (to R.T.), the ETH PhD fellowship (to L.S. and K.F.M.), and the German Research Foundation (CRC 1597-A05 to T.C.).
Footnotes
Author Contribution Statement:
K.F.M. performed numerous biological experiments throughout the study, analyzed data, and prepared figures. N.M. performed bioinformatic analysis of all target-matched library experiments. A.M. designed and developed machine learning models and implemented the web app for TEEP. S.M. prepared plasmids for TnpB/Fanzor and ωRNA expression, performed and analyzed endogenous DNA editing experiments, conducted HT-TAMDA assays, and performed western blotting experiments. L.K. and T.R. performed in vivo experiments, including intracerebroventricular and stereotactic injections and brain and hepatocyte isolation. L.S. prepared plasmids for ωRNA expression and conducted HT-TAMDA assays. P.I.K. performed and analyzed GUIDE-seq experiments. A.A. contributed to the design and development of machine learning models. M.M.K. performed CAST-seq experiments. M.M. assessed inflammation-linked cytokines. T.H. contributed to western blotting experiments. T.C., M.K., M.K., and G.S. supervised the research and provided field-specific expertise. K.F.M. and G.S. designed the study and wrote the manuscript. All authors reviewed the manuscript.
Competing interests:
K.F.M. and G.S. are co-inventors on a patent application filed by the University of Zurich relating to the work described in this paper. G.S. is an advisor to Prime Medicine Inc. The remaining authors declare no competing interests.
Data availability
All ωRNA and HTS primer sequences used for this study are provided in Supplementary Data S1. Deep amplicon sequencing data files are available from the National Center for Biotechnology Information’s Sequence Read Archive (accession ID: PRJNA1019264). Plasmid sequences are provided via https://benchling.com/marquark7/f_/FOdfdV1v-tnpb/. Additionally, key plasmids from this work are available from Addgene. All data is freely accessible to the public.
Code availability
Computer code for the analysis of the pooled libraries is available via https://github.com/Schwank-Lab/tnpb. The code for training the machine learning models is available on GitHub (https://github.com/uzh-dqbm-cmi/Tnpb). In addition, we have developed a publicly available web application (go.tnpb.app or https://www.tnpb.app) for predicting TnpB ωRNA efficiencies using our trained models. HTS data were collected and demultiplexed by Illumina NovaSeq Control software v1.7 and MiSeq Control software (v3.1 and v4.0). Pooled library analysis was performed using Python 3.9. Cutadapt (3.5) was used to trim sequencing reads. For characterization of indels and base edits at single sites (endogenous), CRISPResso2 (2.2.7) was used. For statistical analysis, SciPy (1.10.1) and Prism (9.0.0) were used.
References
- 1.Jinek M, Chylinski K, Fonfara I, Hauer MH, Doudna JA, Charpentier E. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science. 2012;337:816–821. doi: 10.1126/science.1225829. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Cong L, Ran FA, Cox D, Lin S, Barretto R, Habib N, Hsu PD, Wu X, Jiang W, Marraffini LA, Zhang F. Multiplex Genome Engineering Using CRISPR/Cas Systems. Science. 2013;339:819–823. doi: 10.1126/science.1231143. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Mali P, Yang L, Esvelt KM, Aach J, Guell M, DiCarlo JE, Norville JE, Church GM. RNA-Guided Human Genome Engineering via Cas9. Science. 2013;339:823–826. doi: 10.1126/science.1232033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Altae-Tran H, Kannan S, Demircioglu FE, Oshiro R, Nety SP, McKay LJ, Dlakić M, Inskeep WP, Makarova KS, Macrae RK, Koonin EV, et al. The widespread IS200/IS605 transposon family encodes diverse programmable RNA-guided endonucleases. Science. 2021;374:57–65. doi: 10.1126/science.abj6856. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Karvelis T, Druteika G, Bigelyte G, Budre K, Zedaveinyte R, Silanskas A, Kazlauskas D, Venclovas Č, Siksnys V. Transposon-associated TnpB is a programmable RNA-guided DNA endonuclease. Nature. 2021;599:692–696. doi: 10.1038/s41586-021-04058-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Nakagawa R, Hirano H, Omura SN, Nety S, Kannan S, Altae-Tran H, Yao X, Sakaguchi Y, Ohira T, Wu WY, Nakayama H, et al. Cryo-EM structure of the transposon-associated TnpB enzyme. Nature. 2023;616:390. doi: 10.1038/s41586-023-05933-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Sasnauskas G, Tamulaitiene G, Druteika G, Carabias A, Silanskas A, Kazlauskas D, Venclovas Č, Montoya G, Karvelis T, Siksnys V. TnpB structure reveals minimal functional core of Cas12 nuclease family. Nature. 2023;616:384. doi: 10.1038/s41586-023-05826-x. [DOI] [PubMed] [Google Scholar]
- 8.Schmidheini L, Mathis N, Marquart KF, Rothgangl T, Kissling L, Böck D, Chanez C, Wang JP, Jinek M, Schwank G. Continuous directed evolution of a compact CjCas9 variant with broad PAM compatibility. Nat Chem Biol. 2023:1–11. doi: 10.1038/s41589-023-01427-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Koblan LW, Doman JL, Wilson C, Levy JM, Tay T, Newby GA, Maianti JP, Raguram A, Liu DR. Improving cytidine and adenine base editors by expression optimization and ancestral reconstruction. Nat Biotechnol. 2018;36:843–846. doi: 10.1038/nbt.4172. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Suzuki K, Tsunekawa Y, Hernandez-Benitez R, Wu J, Zhu J, Kim EJ, Hatanaka F, Yamamoto M, Araoka T, Li Z, Kurita M, et al. In vivo genome editing via CRISPR/Cas9 mediated homology-independent targeted integration. Nature. 2016;540:144–149. doi: 10.1038/nature20565. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Xiang G, Li Y, Sun J, Huo Y, Cao S, Cao Y, Guo Y, Yang L, Cai Y, Zhang YE, Wang H. Evolutionary mining and functional characterization of TnpB nucleases identify efficient miniature genome editors. Nat Biotechnol. 2023:1–13. doi: 10.1038/s41587-023-01857-x. [DOI] [PubMed] [Google Scholar]
- 12.Saito M, Xu P, Faure G, Maguire S, Kannan S, Altae-Tran H, Vo S, Desimone A, Macrae RK, Zhang F. Fanzor is a eukaryotic programmable RNA-guided endonuclease. Nature. 2023 doi: 10.1038/s41586-023-06356-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Koblan LW, Arbab M, Shen MW, Hussmann JA, Anzalone AV, Doman JL, Newby GA, Yang D, Mok B, Replogle JM, Xu A, et al. Efficient C•G-to-G•C base editors developed using CRISPRi screens, target-library analysis, and machine learning. Nat Biotechnol. 2021;39:1414–1425. doi: 10.1038/s41587-021-00938-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Komor AC, Kim Y, Kim Y, Packer MS, Zuris JA, Liu DR. Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature. 2016;533:420–424. doi: 10.1038/nature17946. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Gaudelli N, Komor AC, Rees HA, Packer MS, Badran AH, Bryson DI, Liu DR. Programmable base editing of A•T to G•C in genomic DNA without DNA cleavage. Nature. 2017;551:464–471. doi: 10.1038/nature24644. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Richter MF, Zhao KT, Eton E, Lapinaite A, Newby GA, Thuronyi BW, Wilson C, Koblan LW, Zeng J, et al. Phage-assisted evolution of an adenine base editor with improved Cas domain compatibility and activity. Nat Biotechnol. 2020;38:883–891. doi: 10.1038/s41587-020-0453-z.
- 17. Tsai SQ, Zheng Z, Nguyen NT, Liebers M, Topkar VV, Thapar V, Wyvekens N, Khayter C, Iafrate AJ, Le LP, Aryee MJ, et al. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat Biotechnol. 2015;33:187–197. doi: 10.1038/nbt.3117.
- 18. Walton RT, Hsu JY, Joung JK, Kleinstiver BP. Scalable characterization of the PAM requirements of CRISPR-Cas enzymes using HT-PAMDA. Nat Protoc. 2021;16:1511–1547. doi: 10.1038/s41596-020-00465-2.
- 19. Marquart KF, Allam A, Janjuha S, Sintsova A, Villiger L, Frey N, Krauthammer M, Schwank G. Predicting base editing outcomes with an attention-based deep learning algorithm trained on high-throughput target library screens. Nat Commun. 2021;12:5114. doi: 10.1038/s41467-021-25375-z.
- 20. Clement K, Rees H, Canver MC, Gehrke JM, Farouni R, Hsu JY, Cole MA, Liu DR, Joung JK, Bauer DE, Pinello L. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat Biotechnol. 2019;37:224–226. doi: 10.1038/s41587-019-0032-3.
- 21. Vriend LEM, Jasin M, Krawczyk PM. Assaying break and nick-induced homologous recombination in mammalian cells using the DR-GFP reporter and Cas9 nucleases. Methods Enzymol. 2014;546:175–191. doi: 10.1016/B978-0-12-801185-0.00009-X.
- 22. Tsai SQ, Topkar VV, Joung JK, Aryee MJ. Open-source guideseq software for analysis of GUIDE-seq data. Nat Biotechnol. 2016;34:483. doi: 10.1038/nbt.3534.
- 23. Turchiano G, Andrieux G, Klermund J, Blattner G, Pennucci V, el Gaz M, Monaco G, Poddar S, Mussolino C, Cornu TI, Boerries M, et al. Quantitative evaluation of chromosomal rearrangements in gene-edited human stem cells by CAST-Seq. Cell Stem Cell. 2021;28:1136–1147.e5. doi: 10.1016/j.stem.2021.02.002.
- 24. Klermund J, Rhiel M, Kocher T, Chmielewski KO, Bischof J, Andrieux G, el Gaz M, Hainzl S, Boerries M, Cornu TI, Koller U, et al. On- and off-target effects of paired CRISPR-Cas nickase in primary human cells. Mol Ther. 2024. doi: 10.1016/j.ymthe.2024.03.006.
- 25. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J Mach Learn Res. 2014;15:1929–1958.
- 26. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. arXiv [Preprint]. 2017. arXiv:1412.6980. doi: 10.48550/arXiv.1412.6980.
Data Availability Statement
All ωRNA and HTS primer sequences used in this study are provided in Supplementary Data S1. Deep amplicon sequencing data files are available from the National Center for Biotechnology Information’s Sequence Read Archive (accession ID: PRJNA1019264). Plasmid sequences are provided via https://benchling.com/marquark7/f_/FOdfdV1v-tnpb/. Additionally, key plasmids from this work are available from Addgene. All data are freely accessible to the public.
Computer code for the analysis of the pooled libraries is available via https://github.com/Schwank-Lab/tnpb. The code for training the machine learning models is available on GitHub (https://github.com/uzh-dqbm-cmi/Tnpb). In addition, we have developed a publicly available web application (go.tnpb.app or https://www.tnpb.app) for predicting TnpB ωRNA efficiencies using our trained models. HTS data were collected and demultiplexed using Illumina NovaSeq Control Software (v1.7) and MiSeq Control Software (v3.1 and v4.0). Pooled library analysis was performed using Python 3.9. Cutadapt (3.5) was used to trim sequencing reads. Indels and base edits at individual endogenous sites were characterized using CRISPResso2 (2.2.7). Statistical analysis was performed using SciPy (1.10.1) and Prism (9.0.0).