Abstract
Biological signals occur over time in living cells. Yet most current approaches to interrogate biology, particularly gene expression, use destructive techniques that quantify signals only at a single point in time. A recent technological advance, termed the Retro-Cascorder, overcomes this limitation by molecularly logging a record of gene expression events in a temporally organized genomic ledger. The Retro-Cascorder works by converting a transcriptional event into a DNA barcode using a retron reverse transcriptase and then storing that event in a unidirectionally-expanding CRISPR array via acquisition by CRISPR-Cas integrases. This CRISPR array-based ledger of gene expression can be retrieved at a later point in time by sequencing. Here we describe an implementation of the Retro-Cascorder in which the relative timing of transcriptional events from multiple promoters of interest is recorded chronologically in Escherichia coli populations over multiple days. We detail the molecular components required for this technology, provide a step-by-step guide to generate the recording and retrieve the data by Illumina sequencing, and give instructions for how to use custom software to infer the relative transcriptional timing from the sequencing data. The example recording is generated in two days, preparation of sequencing libraries and sequencing can be accomplished in two to three days, and analysis of data takes up to several hours. This protocol can be implemented by someone familiar with basic bacterial culture, molecular biology, and bioinformatics. Analysis can be minimally run on a personal computer.
Editor Summary
This protocol describes Retro-Cascorder to make temporally resolved transcriptional recordings in E. coli DNA, using a retron reverse transcriptase to store transcriptional events in a unidirectionally-expanding CRISPR array via acquisition by CRISPR-Cas integrases.
Introduction
Cells often react to internal and external stimuli through a change in gene expression. These changes can range from simple and isolated, like a single-gene response to a chemical or metabolic stimulus, to complex, like a multi-gene transcriptional cascade during cell differentiation. Biologists have long sought insight into cells and their environments by analyzing gene expression. This analysis requires measuring the abundance of specific RNA transcripts, usually achieved by disrupting cell membranes to physically collect RNA that can be quantified. But destroying cells for analysis has an unfortunate ramification: the same cell or cell lineage cannot be analyzed at multiple points over time, and the cell must be harvested while the event is ongoing and the RNA remains. Therefore, experimenters cannot easily collect temporal data on a gene expression cascade or reconstruct a stimulus-driven event that occurred in the cell’s past.
Molecular recorders are an alternative to destructive analyses of gene expression. These molecular technologies continuously record biological activity over time and store that data in DNA. By linking a biological event, like the expression of a gene, to a permanent genomic modification within the same cell, molecular recorders enable data collection throughout an entire biological process. Different molecular recorders vary in the mechanism of data recording, using recombinases1–8, nucleases9–11, prime editors12–14, or integrases15–18 (for more information on DNA-based molecular recorders, refer to refs. 19,20). Here we focus on the Retro-Cascorder21, which uses E. coli Cas1-Cas2 integrases to write events, leveraging the natural directionality of these integrases to encode events in the order that they occur.
Mechanism of Cas1-Cas2 integration of DNA sequences
Cas1-Cas2 integrases are an essential component of the CRISPR bacterial immune system. Under phage invasion, these integrases capture small fragments of phage DNA and integrate them into a genomic repository, called the CRISPR array, where they serve as an immunological memory of the phage. The CRISPR array consists of a leader sequence followed by a series of repetitive elements, or repeats, which are separated from each other by the fragments of phage DNA, called spacers. The Cas1-Cas2 integrases are both repetitive, inserting new phage spacers without deleting old spacers, and directional, always integrating new spacers directly adjacent to the leader sequence in the CRISPR array22,23. Since the insertion of the spacer always occurs adjacent to the leader sequence, the chronology of spacer integration within an array can be inferred because the spacers progress from oldest to newest as they approach the leader. This property has previously enabled molecular recorders using type I-E Cas1-Cas2 from E. coli to decipher the ordering of multiple rounds of exogenously delivered pre-spacers in bacteria15,24.
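The chronology rule described above can be illustrated with a minimal Python sketch. The list representation, the `acquire` and `chronology` helpers, and the spacer names are illustrative assumptions and are not part of the Retro-Cascorder software; index 0 stands for the position adjacent to the leader.

```python
def acquire(array, spacer):
    """Return a new array with `spacer` inserted next to the leader (index 0).

    Existing spacers are kept, mirroring the non-deleting, directional
    behavior of Cas1-Cas2 described in the text.
    """
    return [spacer] + array


def chronology(array):
    """Read spacers from leader-distal to leader-proximal, i.e. oldest first."""
    return list(reversed(array))


# Three hypothetical acquisition events, in the order they occurred.
array = []
for event in ["phage_1", "phage_2", "phage_3"]:
    array = acquire(array, event)

print(array)              # leader-proximal spacer first
print(chronology(array))  # oldest spacer first
```

Because every new spacer lands at index 0, reversing the stored list is enough to recover the order of acquisition in this simplified model.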
Recording RNA-derived spacers using Cas1-Cas2
One limitation of early Cas1-Cas2-based molecular recorders is that Cas1-Cas2 have only been found to integrate DNA, not the RNA that would be required to record transcriptional events. This limitation can be addressed by adding additional components to the recording system. For instance, a reverse transcriptase (RT) can reverse-transcribe RNA into DNA that Cas1-Cas2 can integrate. A natural RT-Cas1 fusion was found in Marinomonas mediterranea (MMB-1) that mediates the integration of RNA-derived spacers via a reverse transcribed intermediate25. Another RT-Cas1 fusion was found in Fusicatenibacter saccharivorans and used as the basis of a recording technology in E. coli, enabling a global transcriptome to be recorded in a cell population and retrieved at a later point in time by sequencing the CRISPR array17,26,27. In both cases, the chosen RT was promiscuous and enabled the acquisition of a diverse set of RNA-derived spacers. However, this distributed acquisition makes tracking the order of events nearly impossible as the likelihood of two biologically relevant spacers being acquired into the same array – which is required to recover temporal information – is exceptionally low. To track the timing of specific transcriptional signals using the Retro-Cascorder, we instead used RTs from a separate bacterial immune system, the retron system, whose RTs specifically reverse-transcribe from RNA sequences that contain retron recognition elements. This specificity focuses the recording on a subset of transcripts, which have a higher likelihood of being recorded into the same array.
Development of Retro-Cascorder
The retron RT acts specifically on a structured noncoding RNA (ncRNA), which the RT recognizes and partially reverse-transcribes into a short fragment of single-stranded DNA. We modified the retron ncRNA to generate a fragment of DNA that can be captured and integrated into the CRISPR array found within the bacterial genome by Cas1-Cas2 integrases. If this modified retron ncRNA is driven by a promoter of interest, retron-derived spacers accumulate in CRISPR arrays only when that promoter is active. These retron ncRNAs were further modified to create sequence diversity by changing internal bases that are not involved in reverse transcription or integration. These changes create distinct barcoded retron ncRNAs that can operate within a single cell. When the expression of distinctly barcoded retron ncRNAs is driven by different promoters of interest, barcoded retron-derived spacers accumulate in the CRISPR array according to the order in which the different promoters are active. The relative order of transcriptional events can be reconstructed by sequencing the CRISPR array (ledger) at a later point in time. Thus, by marrying the unique features of Cas1-Cas2 integrases and the retron RT, Retro-Cascorder captures order information of specific biological events within living cells.
Applications of Retro-Cascorder
We previously demonstrated that Retro-Cascorder can successfully reconstruct the temporal relationship of induced transcriptional events in bacteria over multiple days21. While we decoded the order of expression from promoters responsive to multiple inducers, including anhydrotetracycline, choline chloride, and sodium salicylate, these promoters can, in principle, be replaced with other promoters of interest. Thus, this technology could enable the construction of biosensors to monitor the occurrence and order of different stimuli, such as pollutants or pathogens, in the environment. With increased engineering to improve acquisition efficiency, Retro-Cascorder may also have the resolution to track endogenous gene expression within bacteria.
Limitations of the protocol
Retro-Cascorder uses a common set of molecular components for molecular recording and relies only upon variable nucleotide sequences to encode different transcriptional signals. This design theoretically allows for substantially more biological events to be simultaneously recorded as compared to recombinases, whose scaling relies on a limited set of orthogonal proteins28. However, a reliance upon multiple plasmids to express all components of Retro-Cascorder currently limits the number of transcriptional signals that can be tested. One of these plasmids must always be the high-copy-number expression plasmid pSBK.079, which overexpresses the retron RT, Cas1, and Cas2. Another necessary plasmid is a signal plasmid, whose architecture currently enables two different transcriptional events to be recorded. Although the introduction of another signal plasmid to include more transcriptional events is possible, we have found that the host burden from propagating more than two plasmids inhibits bacterial growth and prevents successful recording on relevant time scales.
The low acquisition efficiency of Cas1-Cas2, even when expressed from a high copy number expression plasmid, may also limit temporal or signal resolution because of the paucity of CRISPR arrays that will contain barcodes. While the order of transcriptional events inferred from recording analyses is typically correct when the promoters of interest express strongly and at a time scale of at least 24 hours, ordering confidence will decrease with weaker promoters and shorter experimental time scales. This issue can be mitigated by increasing the sequencing depth or the number of biological replicates. These additional measurements ensure enough CRISPR arrays containing barcodes of interest are sequenced, as long as the number of reads does not exceed the number of initial bacterial genomes harvested (see Figure 1 and Box 1, which discusses how to pick a starting sequencing depth).
Fig. 1.
Plots summarizing the effect of the number of simulated informative arrays on ordering score accuracy. The X axis (number of informative arrays) is binned in groups of 8 and each label represents the center number of each bin. For clarity and to reflect both informative array sparsity and abundance, we simulated N=18 biological replicates for panels (a)-(c), with a range from 3,000 to 6 million arrays total. (a) Ordering scores calculated from simulated arrays, demonstrating how the number of informative arrays acquired over a simulated sequencing run impacts the A/N and B/N ordering scores. As the number of informative arrays increases, the calculated ordering score converges towards its “true” value in both magnitude and direction, suggesting that more accurate ordering scores occur when there are higher numbers of informative arrays. (b) Effect of the increase in informative arrays on the standard deviation of A/N and B/N ordering scores. With increasing informative arrays, the standard deviation decreases, thus increasing the confidence in the ordering score. (c) Effect of the increase in informative arrays on the percentage of calculated A/N, B/N and A/B ordering scores that are “dropouts.” As the number of informative arrays increases, the percentage of “dropout” scores decreases, suggesting that ordering information can be more reliably found when there are higher numbers of informative arrays. Error bars represent a 95% confidence interval.
BOX 1: Estimating initial sequencing depth.
How accurately Retro-Cascorder’s recording mirrors the true transcriptional activity of the bacterial population is mainly determined by the acquisition efficiency of Cas1-Cas2 and the number of cells used to reconstruct the recording. Each cell harbors one CRISPR array. Depending on the spacer sequence and promoter strength, around 0.5–5% of CRISPR arrays are expanded with a retron-derived spacer over the course of 24 hours. The majority of acquired spacers will instead be derived from either the bacterial genome or the plasmids harbored in the cell. These background spacers serve as a pseudo-internal timer and can even help to deduce the ordering of transcriptional activity of interest (see Box 3), assuming that background spacer integration is constant and independent from signal transcriptional activity. However, transcriptional order inference relies on informative CRISPR arrays, which we define as arrays that (a) contain at least two spacers, (b) contain at least one retron-derived spacer sequence, and (c) contain at least two spacer sequences that are different from each other. Given both the rarity of retron-derived spacer acquisition and the exponentially diminishing probability that a given CRISPR array will contain more than one new spacer, many reads are often necessary to sequence enough informative arrays to calculate ordering scores. Furthermore, the greater the number of spacers, the higher the confidence will be that the scores reflect the true biological signal rather than estimation error from undersampling.
As a starting point, we suggest a sequencing depth of 1 million reads per biological sample, which is what we aimed for in our prior work for reconstructing the order of two discrete transcriptional events21. However, using a weaker promoter or shortening the duration of the recording would ultimately decrease the percentage of informative CRISPR arrays, which means a greater sequencing depth would be needed to find enough arrays to be used for ordering inference. Fig. 1a shows how the accuracy of ordering scores calculated from simulated CRISPR arrays changes depending upon the number of informative arrays available.
The increased confidence in calling the order of transcriptional signals at deeper sequencing depths is primarily dependent on the number of informative arrays. Accordingly, we found that the standard deviation of the ordering scores decreased with higher numbers of informative arrays sequenced (Fig. 1b). Further, we found that, at low numbers of informative arrays, a majority of the ordering scores were “dropouts”, or incalculable, due to the paucity of observed A and B spacers used to calculate the scores (i.e., a “dropout” A/B ordering score would imply that there are no arrays with both A and B spacers sequenced) (Fig. 1c). Based on these findings, we suggest that recordings are most accurate when at least 40 informative arrays can be used for ordering calculations per biological replicate. In prior work21, where we recorded the ordering of two strong promoters over a 48-hour recording using a sequencing depth of 1 million reads, we regularly acquired over 40 informative arrays, and often closer to 100 informative arrays, per sample (10 of 12 biological replicates). However, since changes in the underlying transcriptional program, promoter strength, and/or leakiness may affect the sequencing depths necessary to observe enough informative arrays to enable robust calculation of ordering scores, we suggest that users empirically determine how many sequencing reads are necessary to reliably average around 40 informative arrays over several biological replicates. Generally, we consider increasing sequencing depth to be the quickest and most affordable method to optimize the performance of Retro-Cascorder, before attempting to tweak other experimental variables, such as promoter strength.
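As a rough planning aid, the empirical determination suggested above can be sketched in Python. This is a minimal sketch that assumes the number of informative arrays scales linearly with read count, which is our simplifying assumption and only plausible while reads remain well below the number of harvested genomes. The function name `reads_for_target` and the pilot numbers are hypothetical.

```python
import math


def reads_for_target(pilot_reads, pilot_informative, target_informative=40):
    """Estimate the reads needed to observe `target_informative` arrays.

    Assumes informative arrays accumulate in proportion to sequencing depth
    (a simplification, not a property established in the text).
    """
    if pilot_informative <= 0:
        raise ValueError("pilot run contained no informative arrays; "
                         "a deeper pilot or more replicates are needed")
    return math.ceil(target_informative * pilot_reads / pilot_informative)


# Hypothetical pilot: 250,000 reads yielded 10 informative arrays.
estimate = reads_for_target(pilot_reads=250_000, pilot_informative=10)
print(estimate)
```

Averaging the pilot counts over several biological replicates before calling the function should give a steadier estimate than relying on a single sample.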
Otherwise, if the user finds that the scores from their biological replicates are inconsistent or consistently result in dropouts after increasing sequencing depth, they can alternatively increase the number of biological replicates to account for the additional noise and variance. Another possibility is to use an approach like SENECA, described in another Nature Protocols paper26, which enriches for expanded CRISPR arrays and thus would decrease the sequencing depth necessary to find informative arrays.
Although increased sequencing depth usually increases recording fidelity, some care has to be taken to ensure that sequencing depth is not higher than the number of original CRISPR sequences in the original sample. For this reason, we recommend that the number of genomes harvested be in large excess compared to the number of reads. Assuming perfect lysis and experimental conditions, we estimate that there should be no more than 400,000 unique genomes per sample, so we suggest users sequence using no more than 250,000 reads per sample, to be conservative. For example, to sequence at a depth of 1 million reads per biological replicate, we typically collect 4 separate samples from the same culture to prepare for sequencing, index them separately, then sequence each indexed sample at a depth of 250,000 reads before pooling all the reads together. By collecting multiple samples from the same culture, the ratio of the number of starting CRISPR arrays to the eventual number of sequencing reads is increased to minimize any risk of re-sequencing the same CRISPR array more than once.
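The arithmetic in this paragraph can be written as a short Python sketch. The 250,000-read cap per sample comes from the text above; the function name `sequencing_plan` and the even split of reads across samples are our own illustrative choices.

```python
import math

MAX_READS_PER_SAMPLE = 250_000  # conservative cap suggested in the text


def sequencing_plan(target_reads, max_reads_per_sample=MAX_READS_PER_SAMPLE):
    """Return (number of separately indexed samples, reads per sample).

    Splits the target depth across enough samples from the same culture
    that no single sample exceeds the per-sample read cap.
    """
    n_samples = math.ceil(target_reads / max_reads_per_sample)
    reads_each = math.ceil(target_reads / n_samples)
    return n_samples, reads_each


plan = sequencing_plan(1_000_000)
print(plan)
```

For a target of 1 million reads per biological replicate, this reproduces the four samples at 250,000 reads each described above.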
Finally, Retro-Cascorder is currently constrained to bacteria. This limitation is because type I-E Cas1-Cas2 integrase functionality has been restricted to prokaryotes. One potential explanation for its host-specific activity is its reliance on other bacterial host factors like IHF29,30. Further screening for additional host factors may therefore be necessary before Cas1-Cas2 acquisition is used to record transcriptional signals within eukaryotic cells. Fortunately, the other required component, the retron RT, has already been successfully used for genomic editing in multiple eukaryotic species, including yeast and mammalian cells31–34. Nonetheless, even if Retro-Cascorder is constrained to bacteria, this technology may still be used within sentinel cells for translational advances, including within the mammalian gut27.
We expect that users may be interested in porting Retro-Cascorder to other bacterial species. Since we have not yet attempted to port this technology into other strains, we cannot guarantee its use in other organisms is possible. However, for those interested, the essential components that we currently know about for Retro-Cascorder include the components to create retron-derived spacers, namely a retron and RNase H, which is necessary for the retron to produce correctly sized RT-DNA35. Additionally, for successful acquisition of retron-derived spacers into a CRISPR array, Cas1 and Cas2, additional host factor IHF29,30, and a CRISPR array are needed. While our protocol contains the CRISPR array in the bacterial genome, spacers can be acquired into CRISPR arrays contained on plasmids instead. However, plasmid-based acquisition efficiency is typically lower than genome-based acquisition efficiency15, so a greater sequencing depth may be necessary to find arrays that contain information about the order of different biological events.
Comparison with other technologies
Adjacent technologies TRACE16 and Record-seq17,26 both utilize CRISPR-Cas acquisition and have also been used as biosensors to track cellular activity36,27. TRACE differs from Retro-Cascorder in that it uses plasmid DNA as the source of its spacers, thus making this technology most useful as a way to track or identify DNA, such as horizontal gene transfer in gut microbiota36. In contrast, Record-seq, similarly to Retro-Cascorder, uses an RT to convert a transcriptional signal into spacers that can be acquired by Cas1-Cas2. As discussed above, Record-seq captures a global transcriptomic profile sensitive to transient, transcriptional changes that occurred earlier in the cell’s lifetime17,27. However, Record-seq cannot provide information about the ordering of specific transcripts of interest. Another adjacent approach called a DNA Typewriter14 uses a prime editing strategy to modify a pre-constructed genomic locus that in principle resembles a CRISPR array. This technology generates time-ordered DNA data similar to Retro-Cascorder, but it is implemented in mammalian rather than bacterial cells. However, DNA Typewriter has thus far only been shown to resolve the relative order of transfection events, not biological signals. Given that Retro-Cascorder logs specific, pre-defined transcripts of interest, it is most useful when the desired application aims to record the transcriptional order of specifically tagged promoters of interest in bacteria.
Experimental design
The Retro-Cascorder protocol (Fig. 2) is divided into three main parts: (a) growth of bacteria containing the necessary plasmids to perform transcriptional recording, (b) preparation and deep sequencing of CRISPR arrays containing the transcriptional record, and (c) analysis of sequencing results using logical rules to infer the order of transcriptional events.
Fig. 2.
Retro-Cascorder experimental and computational workflow. (a) Experimental procedure for Retro-Cascorder-based temporal recording. Barcoded retron-derived spacers are integrated by Cas1-Cas2 integrases into a CRISPR array only when a signal (inducer) is in the medium. The position of the spacers in the array enables reconstruction of the relative order of transcriptional events. (b) Preparation of CRISPR arrays for multiplexed sequencing. After acquiring genomic DNA from the samples, CRISPR arrays are selectively amplified using a forward primer binding the leader region and a reverse primer (SPCR_MiSeq3_rev) binding an endogenous, invariant spacer always contained within the genomic CRISPR array. Amplicons are indexed using a qPCR reaction, and indexed samples are cleaned up using SPRI beads. (c) Multiplexed sequencing of CRISPR arrays. Eluted samples are diluted (1:40,000) and a second qPCR reaction is used to quantify the molarity of each sample using the KAPA Library Quantification Kit. After preparing a sample sheet with the specifications of each sample, Retro-Cascorder libraries are deep sequenced using an Illumina MiSeq. (d) Computational pipeline for processing Retro-Cascorder data, beginning with installing JupyterLab, downloading the Shipman lab GitHub repository, and importing all required Python packages and dependencies. FASTQ files from the Illumina MiSeq are processed following a Python-based workflow to obtain ordering scores that may be used to generate plots to visualize both real and simulated Retro-Cascorder data. Open circles correspond to biological replicates.
Bacterial growth
In the first part of the protocol, BL21-AI E. coli containing two plasmids expressing Retro-Cascorder are grown over some specified time scale to acquire transcriptional records. Although our original publication uses a modified BL21-AI E. coli strain called bSLS.114 in which BL21-AI E. coli’s endogenous retron was removed, we find that retron-based recordings still occur in the parental line. As a result, we have chosen to use the commercially available BL21-AI in the protocol due to its wider accessibility. In the case the user would like to use bSLS.114 instead, we have made it available on Addgene (catalog #191530).
The recording plasmids consist of: (a) an expression plasmid pSBK.079 constitutively expressing Eco1 RT and expressing Cas1-Cas2 under a T7 promoter, which can be induced using IPTG and l-arabinose in BL21-AI E. coli, and (b) a signal plasmid (e.g. pSBK.134) containing up to two promoters of interest (see Box 2, which discusses how to design signal plasmids). Each promoter expresses a modified Eco1 noncoding RNA, which can be reverse-transcribed by Eco1 RT into a barcoded DNA that Cas1-Cas2 may integrate into the BL21-AI endogenous genomic CRISPR array. Biological replicates are cultures that originated from separate, single bacterial colonies containing both plasmids.
BOX 2: Designing the signal plasmid.
The signal plasmid, which can link up to two promoters of interest with barcoded retron noncoding RNA, is one of two essential plasmids for Retro-Cascorder. For users to clone a signal plasmid which contains their own promoters of interest, we recommend using our signal plasmid pSBK.134 as the backbone. This plasmid contains two inducible promoters which face in opposite directions to prevent expression leakage or crossover. Each promoter expresses one of two barcoded Eco1 noncoding RNAs.
The two barcodes on pSBK.134 were found to have similar acquisition efficiencies to each other. However, we have also previously tested other barcodes, which result in similar acquisition rates to those used in pSBK.134 (see Fig. 2b in our original publication21 for the observed acquisition efficiency for each barcode) and can replace the barcodes used in pSBK.134 in case the user needs a different sequence. All barcode sequences whose acquisition efficiency has been previously validated are listed in the table below:
Number of barcode in Fig 2b of ref. 21 | Barcode |
---|---|
1 | CCTAGG |
2 | GCTAGC |
3 | CTGCAG |
4 | GTGCAC |
5 | ACGCGT |
6 | CAGTAG |
7 | GAGCTC |
8 | GCATGC |
In cases where users would prefer a different plasmid backbone or architecture, they can also generate different plasmid designs in which a barcoded Eco1 noncoding RNA is under the control of a chosen promoter. However, we find that the choice of plasmid and backbone architecture alters acquisition rates. Other architectures may result in acquisition rates similar to the pSBK.134 architecture, but they must be tested first. Ideally, the expression from a retron noncoding RNA over 24 hours should result in CRISPR arrays expanded with new, retron-derived spacers at a rate of between 0.5–5%. If not, a different plasmid architecture should be picked or the promoter of interest may be too weak for Retro-Cascorder to accurately resolve temporal recordings that use it without significantly increasing sequencing depth or the number of biological replicates.
The experimental protocol describes how to perform a transcriptional recording over 48 hours (i.e., two inducible promoters are each expressed for 24 hours). However, the protocol can be adjusted to instead record over a shorter or longer amount of time, depending on need. For longer recordings, we recommend diluting the bacterial sample into fresh LB after no more than 16 hours of growth so bacteria are kept in exponential growth conditions. Additionally, given that the strength of a promoter’s signal, time scale, and sequencing depth all impact the fidelity of recordings generated using Retro-Cascorder, we recommend including a positive control in which the correct order of multiple transcriptional events is known. Our lab has previously used the signal plasmid pSBK.134, which includes two inducible promoters, pTet* and pBetI, that can be turned on in either order. We recommend that users perform a 48-hour recording where they induce pTet* then pBetI for 24 hours each, or vice versa, and ensure the protocol works in their hands before moving on to perform their own experiments of interest. Although we find variability in the exact acquisition efficiency of Retro-Cascorder between different runs, this variability does not typically alter the overall trends inferred from our analysis method. However, the expression of both the signal and expression plasmids in our strains over multiple days is burdensome on the cells and can occasionally lead to loss or corruption of the components (see Fig. 4g in our original publication21). As such, we recommend that users always perform multiple biological replicates and be wary when interpreting results from replicates in which there is no acquisition activity or very few informative arrays (see Box 1).
Following a recording experiment, the genomic records are extracted by harvesting, diluting, and lysing the bacteria. Bacterial DNA can then be stored for up to six months at −20°C until the user is ready to perform multiplexed sequencing.
Multiplexed sequencing
Following the recording experiment, BL21-AI CRISPR arrays are selectively amplified using PCR. Although most CRISPR arrays only contain old spacers already present within the array before the recording experiment occurred, a fraction of these arrays should also contain between one and three new spacers acquired during the recording. Some of these new spacers may be retron-derived, thus allowing event ordering to be inferred from a whole bacterial population. This first round of PCR adds Illumina adapters to the amplicons, making them compatible with downstream indexing reactions for deep sequencing. It uses a pool of primers with varied nucleotide lengths to diversify the samples that are eventually sequenced by the Illumina MiSeq instrument (see Table 1). After this first round of PCR, the array-containing amplicons for each experimental condition are separately indexed, cleaned up, quantified, diluted and pooled, then finally sequenced on an Illumina MiSeq instrument. Indexing occurs using a mixture of two different primer sets; the P5 & P7 primers (see Table 1) are added to prevent the other, longer indexing primer sets from creating too many unwanted byproducts. The MiSeq was specifically chosen because its read lengths are long enough to sequence multiple spacers in a single CRISPR array.
Table 1.
Primer sequences required for deep sequencing
Primer name | Nucleotide sequence (5’ to 3’) | Purpose | Procedure step |
---|---|---|---|
SPCR_MiSeq3_fow1 | CTTTCCCTACACGACGCTCTTCCGATCTNCATTAATTAATAATAGGTTATGTTTAGAGTGTTCC | First round PCR | 26, 31 |
SPCR_MiSeq3_fow2 | CTTTCCCTACACGACGCTCTTCCGATCTNNCATTAATTAATAATAGGTTATGTTTAGAGTGTTCC | First round PCR | 26, 31 |
SPCR_MiSeq3_fow3 | CTTTCCCTACACGACGCTCTTCCGATCTNNNCATTAATTAATAATAGGTTATGTTTAGAGTGTTCC | First round PCR | 26, 31 |
SPCR_MiSeq3_fow4 | CTTTCCCTACACGACGCTCTTCCGATCTNNNNCATTAATTAATAATAGGTTATGTTTAGAGTGTTCC | First round PCR | 26, 31 |
SPCR_MiSeq3_fow5 | TTTCCCTACACGACGCTCTTCCGATCTNNNNNCATTAATTAATAATAGGTTATGTTTAGAGTGTTCC | First round PCR | 26, 31 |
SPCR_MiSeq3_rev | GGAGTTCAGACGTGTGCTCTTCCGATCTGTGTCAACAATCGTTCCCTGATTGTC | First round PCR | 26, 31 |
P5 | AATGATACGGCGACCACCGA | Indexing PCR | 37 |
P7 | CAAGCAGAAGACGGCATACGAGAT | Indexing PCR | 37 |
Given our experimental parameters (promoter strength, time resolution, and retron-derived spacer acquisition rate), we have found that sequencing each biological sample at a depth of 1 million reads allows us to reliably infer the order of transcriptional events from expanded arrays. However, in cases where the promoters are weaker or experiments are run at a short time scale, more sequencing reads may be necessary to find enough spacers to run ordering analyses (see Box 1).
Analysis
All analyses are performed using scripts written in Python 3. After quality-based read trimming, the first part of the analysis consists of extracting new spacers found in the sequenced CRISPR arrays, and storing both the new spacer sequences and the sequence of the read containing them. These reads and spacers are binned according to their characteristics, including the number of new spacers per read. Next, the order of the spacers in each newly-expanded CRISPR array is determined, as well as whether these new spacers are derived from either of the barcoded retron noncoding RNAs (referred to as “A” and “B” spacers) or not (referred to as “N” spacers, likely genome- or plasmid-derived).
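The labeling step can be sketched in Python as follows. This is a simplified illustration rather than the published pipeline: it assumes that a retron-derived spacer can be recognized by the presence of its 6-nt barcode (in either orientation), and the assignment of barcodes to “A” and “B” is hypothetical. A short barcode can occur by chance in genome- or plasmid-derived spacers, so a real analysis would likely need to match more of the expected retron-derived spacer sequence.

```python
# Hypothetical barcode assignment, chosen from the validated barcode table.
BARCODES = {"A": "CCTAGG", "B": "GCTAGC"}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_spacer(spacer, barcodes=BARCODES):
    """Label a new spacer as 'A', 'B', or 'N'.

    A spacer is given a barcode label only if exactly one barcode is found
    in either orientation; spacers with no match or an ambiguous match are
    labeled 'N' in this sketch.
    """
    spacer = spacer.upper()
    hits = [label for label, barcode in barcodes.items()
            if barcode in spacer or reverse_complement(barcode) in spacer]
    return hits[0] if len(hits) == 1 else "N"


# Made-up spacer sequences for illustration only.
labels = [classify_spacer(s) for s in
          ["TTGACCTAGGTTACA", "AAGCTAGCAA", "ACGTACGTACGT"]]
print(labels)
```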
For a CRISPR array to be informative for transcriptional order, it must meet three criteria: (a) the array should contain at least two new spacers, (b) at least one of the spacers should contain a barcoded retron-derived spacer, and (c) at least two spacers must have different identities. Explicitly, the number of A → B → Leader, A → N → Leader, ..., etc. CRISPR arrays are counted and used for the calculation of ordering scores, described below.
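The three criteria and the counting step can be expressed as a short Python sketch. Here each array is represented as a list of labels ordered from oldest to newest (leader-distal to leader-proximal), matching the A → B → Leader notation. The representation, the function names, and the choice to count every ordered pair of differently labeled spacers in arrays with more than two new spacers are our assumptions for illustration. The sketch also assumes that all spacers with the same barcode label share one sequence, so comparing labels stands in for comparing sequences in criterion (c).

```python
from collections import Counter
from itertools import combinations


def is_informative(labels):
    """Apply criteria (a)-(c) to one array's new-spacer labels."""
    has_two_spacers = len(labels) >= 2
    has_retron_spacer = any(label in ("A", "B") for label in labels)
    has_distinct_spacers = len(set(labels)) >= 2
    return has_two_spacers and has_retron_spacer and has_distinct_spacers


def count_orderings(arrays):
    """Count (older, newer) label pairs across all informative arrays."""
    counts = Counter()
    for labels in arrays:
        if not is_informative(labels):
            continue
        # combinations() preserves list order, so `older` precedes `newer`.
        for older, newer in combinations(labels, 2):
            if older != newer:
                counts[(older, newer)] += 1
    return counts


# Toy example: only the first, second, and last arrays are informative.
arrays = [["A", "B"], ["A", "N"], ["N", "N"], ["A", "A"], ["B"],
          ["N", "A", "B"]]
counts = count_orderings(arrays)
print(dict(counts))
```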
Following the count of each spacer ordering possibility, we calculate three ordering scores, which describe and help us infer the ordering of transcriptional events. These scores make an assumption about the biology: that “N” spacers are acquired at a constant rate during the course of the recording experiment. Moreover, this analysis is designed to reconstruct a transcriptional history into two epochs, one early and one late. Further subdivision of the temporal signal would require a substantially more complicated analysis than is provided here.
In cases such as ours where two promoters are under study, the CRISPR arrays are analyzed for order using three scores, each of which ranges from −1 to +1: (1) the A/B score, (2) the A/N score, and (3) the B/N score.
(1). A/B score.
The A/B score captures both the order of the “A” and “B” transcriptional events and the magnitude of the temporal separation between them. Positive scores suggest that transcriptional event “A” occurred before “B”, and thus that “B” spacers are found in the Leader-proximal position more often than “A” spacers; negative scores suggest the opposite, namely that event “B” occurred before “A”. The magnitude of the score reflects the temporal separation between A and B, that is, how little their transcriptional activity overlaps in time. The more their activity overlaps in time, the closer to zero this score will be.
(2). A/N score.
The A/N score describes when “A” was expressed relative to the constant signal “N” over the duration of the recording experiment. It takes into account the relative frequencies of Leader-distal vs. Leader-adjacent “A”-expanded arrays: a positive score suggests that “A” was more strongly expressed in the first epoch than in the second, and a negative score suggests the converse.
(3). B/N score.
The same interpretation as for the A/N score applies here, except in relation to “B” rather than “A”. However, by arbitrary convention, the B/N score is reversed relative to the A/N score and is calculated from the relative frequencies of Leader-adjacent vs. Leader-distal “B”-expanded arrays: positive scores suggest that “B” occurred in the second epoch rather than the first, and conversely for a negative score.
Fig. 3 (see Anticipated Results) gives hypothetical examples of different transcriptional activity for two promoters across two epochs and what scores might be expected in such cases. See Box 3 for more information about the exact mathematical calculations and interpretation of a given composite score.
Fig. 3.
Simulated ordering score results from different transcriptional programs. Colors correspond to transcriptional events: event “A” in blue; event “B” in red; events A and B together in purple. (a) Key graphically illustrating how to interpret the magnitude and sign of different ordering scores. Left, ordering score A/N: a positive score suggests that event A happened, on average, before event N. The inverse is true for negative scores of A/N. Middle, ordering score B/N: a positive score suggests that event B happened, on average, after event N. Right, ordering score A/B: a positive score suggests that event A happened, on average, before event B. Panels (b)-(e) illustrate four transcriptional programs (left) and simulated ordering score plots (right). The left of each panel shows a series of “real”, non-constant transcriptional signals (top), illustrated by wave-shaped curves. Below each “real” transcriptional program is a “reduced” version of that program, which makes its inference compatible with the assumptions of our analysis. On the right of panels (b)-(e) are ordering scores generated by simulating the transcriptional program described in each panel. The simulated spacer acquisition rates for “A” and “N” are equivalent to those determined in our experiments, although we have chosen to make the A and B signals matched in terms of strength and leakiness. Open circles correspond to N=6 simulated biological replicates. (b) Transcriptional program A→B. Signal A occurs during the early epoch; signal B occurs during the late epoch (left). (c) Transcriptional program A→AB. Signal A is present during both the early and the late epoch; signal B occurs only during the late epoch (left). (d) Transcriptional program None→AB. Signals A and B occur only during the late epoch (left). (e) Transcriptional program AB→AB. Signals A and B occur during the early and the late epoch (left).
Box 3: Calculating and interpreting ordering scores.
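The exact formulas used in each of the three scores are shown below:
$$\text{A/B score} = \frac{n_{A \to B \to L} - n_{B \to A \to L}}{n_{A \to B \to L} + n_{B \to A \to L}}$$
$$\text{A/N score} = \frac{n_{A \to N \to L} - n_{N \to A \to L}}{n_{A \to N \to L} + n_{N \to A \to L}}$$
$$\text{B/N score} = \frac{n_{N \to B \to L} - n_{B \to N \to L}}{n_{N \to B \to L} + n_{B \to N \to L}}$$
Here, $n_{X \to Y \to L}$ is the count of arrays that have the spacers ordered as X → Y → Leader.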
Although the A/N score and B/N score can mathematically range from −1 to +1, under the assumption that N spacer acquisition is constant, their absolute values should not exceed 0.5. More precisely, the average value of the A/N and B/N scores over n replicates should fall between −0.5 and +0.5, with the actual scores for each replicate being normally distributed around that average. The spread of the distribution around this average score, or how far the A/N and B/N scores deviate from the expected range, is a reflection of biological noise and variability in the recording system, as well as any deviation from the assumption that “N” is a constant signal independent of “A” and “B”. If this assumption does not hold, or the CRISPR arrays sequenced are sparse (i.e., few new retron-derived spacers are observed and usable to calculate the ordering scores; see Box 1), the ordering scores are more likely to fall outside this range. Although individual replicates may fall beyond the expected values, the average of a large enough sample should nonetheless fall within −0.5 to +0.5. Thus, if a user observes that the average score falls outside of this range, we suggest increasing the sequencing depth and, if the result persists, re-evaluating the assumption regarding uniform and continuous “N” spacer acquisition.
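As a worked illustration of how the three scores follow from array counts, the sketch below assumes that each score is the normalized difference between the counts of the two opposite spacer orders, with signs chosen to match the conventions described for each score (A/B and A/N positive when “A” is Leader-distal; B/N positive when “B” is Leader-proximal). The function names and the `(distal, proximal)` dictionary keys are illustrative choices, not the notebook's interface.

```python
def ordering_score(n_first, n_second):
    """Normalized difference of two counts; None if there are no arrays."""
    total = n_first + n_second
    if total == 0:
        return None
    return (n_first - n_second) / total


def ordering_scores(counts):
    """counts maps (Leader-distal, Leader-proximal) label pairs to array counts."""
    def n(distal, proximal):
        return counts.get((distal, proximal), 0)

    return {
        "A/B": ordering_score(n("A", "B"), n("B", "A")),
        "A/N": ordering_score(n("A", "N"), n("N", "A")),
        # Reversed by convention: positive when B is Leader-proximal.
        "B/N": ordering_score(n("N", "B"), n("B", "N")),
    }
```

With made-up counts of 75 arrays ordered A → N → Leader and 25 ordered N → A → Leader, the A/N score is (75 − 25) / 100 = 0.5, the upper limit expected when “A” is confined to the early epoch and “N” is constant.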
The command line code used in our original publication21 is available on GitHub (https://github.com/Shipman-Lab/Spacer-Seq). We have also compiled the necessary functions with additional comments on how to perform the analysis into a Jupyter notebook, available here: https://github.com/Shipman-Lab/Spacer-Seq_Nat-Protocols/tree/main. We have made minimal changes to the original code, with the intention of simplifying user experience and making it more widely deployable. These changes are:
Using sickle-trim (https://github.com/najoshi/sickle), a quality-based trimming package, due to its wide availability through all Python dependency managers (pip and Anaconda channels, including Bioconda);
Adapting the spacer extraction function to be parallelizable, which substantially reduces the computing time;
Reducing the number of intermediate files generated, and placing emphasis on data visualization;
Implementing an in-notebook calculation of the ordering scores;
Implementing a series of simulations meant to illustrate how scores would vary under different transcriptional programs, and giving users a starting point for interpreting their data and generating their own hypotheses.
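To give a feel for what such a simulation involves, the following is a deliberately simplified sketch and not the notebook's simulation code. It assumes two epochs split into discrete time steps, at most one spacer acquired per step, equal per-step acquisition rates for every active signal and for “N”, and no promoter leakiness. All names and parameter values are illustrative.

```python
import random
from collections import Counter


def simulate_counts(program, n_arrays=30000, steps_per_epoch=20,
                    rate=0.0125, seed=0):
    """program: two sets naming the signals active in the early and late epoch."""
    rng = random.Random(seed)
    # 'N' is acquired at the same constant rate in both epochs.
    epoch_labels = [sorted(active) + ["N"] for active in program]
    counts = Counter()
    for _array in range(n_arrays):
        spacers = []  # oldest (Leader-distal) to newest (Leader-proximal)
        for labels in epoch_labels:
            for _step in range(steps_per_epoch):
                # Each label is drawn with probability `rate`; otherwise nothing.
                idx = int(rng.random() / rate)
                if idx < len(labels):
                    spacers.append(labels[idx])
        # Keep doubly expanded arrays whose two spacers differ in identity.
        if len(spacers) == 2 and spacers[0] != spacers[1]:
            counts[tuple(spacers)] += 1
    return counts


def score(counts, first, second):
    """Normalized difference between the counts of two opposite orders."""
    a, b = counts[first], counts[second]
    return (a - b) / (a + b) if a + b else None


def simulate_scores(program, **kwargs):
    counts = simulate_counts(program, **kwargs)
    return {
        "A/B": score(counts, ("A", "B"), ("B", "A")),
        "A/N": score(counts, ("A", "N"), ("N", "A")),
        "B/N": score(counts, ("N", "B"), ("B", "N")),
    }
```

Calling `simulate_scores([{"A"}, {"B"}])` models the A→B program, and `simulate_scores([{"A", "B"}, {"A", "B"}])` models AB→AB. Under these assumptions the A/N and B/N scores for A→B settle near 0.5, while the A/B score for AB→AB settles near zero.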
Materials
Reagents
General
UltraPure™ Distilled Water (Thermo Fisher Scientific, cat. no. 10977015)
Plasmid backbones
Expression plasmid pSBK.079 (Addgene, cat. no. 187218)
Signal plasmid pSBK.134 (Addgene, cat. no. 187219)
Biological Materials
BL21-AI™ One Shot™ Chemically Competent E. coli (Thermo Fisher Scientific, cat. no. C607003) CRITICAL BL21-AI E. coli is widely available but contains an endogenous retron. We do not find that this endogenous retron interferes with Retro-Cascorder, but users can instead use a modified BL21-AI E. coli strain, bSLS.114, that lacks this endogenous retron (Addgene, cat. no. 191530).
Recording experiment
LB broth (Miller; 10 g/L tryptone, 5 g/L yeast extract, 10 g/L NaCl, UltraPure™ Distilled Water, pH 7)
L-arabinose (200 mg/mL in UltraPure™ Distilled Water, sterile-filtered; GoldBio, cat. no. A-300)
IPTG (100 mM in UltraPure™ Distilled Water, sterile-filtered; GoldBio, cat. no. I2481C)
Kanamycin (35 mg/mL in UltraPure™ Distilled Water, sterile-filtered; GoldBio, cat. no. K-120)
Carbenicillin (100 mg/mL in 50% vol. UltraPure™ Distilled Water/50% vol absolute ethanol; GoldBio, cat. no. C-103) !CAUTION Carbenicillin is a respiratory and skin sensitizer; avoid breathing or skin contact. When handling carbenicillin, wear gloves and eye protection.
Anhydrotetracycline (100 ng/mL in 50% vol. UltraPure™ Distilled Water/50% vol absolute ethanol, sterile-filtered; Cayman Chemical, cat. no. 10009542) !CAUTION Anhydrotetracycline is harmful if swallowed and causes skin and eye irritation. When handling anhydrotetracycline, wear gloves and eye protection.
Choline chloride (100 μM in UltraPure™ Distilled Water, sterile-filtered; Sigma-Aldrich, cat. no. C7017)
Sample preparation for deep sequencing
AMPure XP Reagent (Beckman Coulter, cat. no. A63881) CRITICAL To decrease cost, this reagent can be replaced with Sera-Mag™ beads following a series of short washes and resuspension in homemade nucleic acid binding buffer. Refer to Supplementary Methods for details.
Q5 High-Fidelity DNA Polymerase (NEB, cat. no. M0318L) CRITICAL Use of a high-fidelity polymerase minimizes errors in amplification of high-diversity libraries for multiplexed sequencing.
Q5 Reaction Buffer 5x (NEB, cat. no. B9027S)
Deoxynucleotides (dNTPs) Solution Mix, Nucleotide 10 mM, 40 μmol each nucleotide (New England Biolabs, cat. no. N0447L)
SYBR Green I Nucleic Acid Gel Stain, 10,000X concentrate in DMSO (Thermo Fisher Scientific, cat. no. S7585)
ROX Low Reference Dye (Kapa Biosystems, cat. no. KD 4601) CRITICAL This reference dye is used with the StepOnePlus Real-Time PCR machine mentioned in Equipment below. If using a different machine, use the appropriate reference dye as provided by the machine’s instructions.
1 Kb Plus DNA Ladder (Thermo Fisher Scientific, cat. no. 10787018)
Deep sequencing
KAPA Library Quantification Kit – Complete Kit (Universal) (KK4824, Roche cat. no. 07960140001)
KAPA Library Quantification DNA Control Standard, Illumina (KK4906, Roche cat. no. 7960417001)
PhiX Control Kit v3 (Illumina, cat. no. FC-110–3001)
MiSeq Reagent Kit v2 (300 cycle, Illumina, cat. no. MS-102–2002)
Primers for deep sequencing (Table 1) CRITICAL All listed DNA primers can be purchased from DNA synthesis companies, such as IDT.
96-well PCR plate containing DNA indexing primer pair sets for MiSeq CRITICAL 96-well plates containing synthesized DNA primers can be purchased from DNA synthesis companies, such as IDT. Many different sequences can be used; for greater accessibility, we provide a table of indexing primer sequences for an entire 96-well plate in Supplementary Table 1.
Equipment
Plate Centrifuge (SouthwestScience, cat. no. SC20-PLATE)
Adhesive PCR Plate Foils (Thermo Fisher Scientific, cat. no. AB0626)
Bacterial culture tubes (VWR, cat. no. 60818–725)
Water bath (VWR, cat. no. 1202)
Bacterial shaker Innova S44i (Eppendorf, cat. no. S44I3100001)
Disposable Pasteur Pipets, Flint Glass, 9” (VWR, cat. no. 14672–380)
1.5 and 2.0 mL Microcentrifuge tubes (Axygen, cat. no. MCT-150-C-S)
Easy Reader Conical Polypropylene Centrifuge Tubes 15 and 50 ml (Thermo Fisher Scientific, cat. no. 07–200-886 and 05–539-8, respectively)
1-mm-gap cuvette (Bio-Rad, cat. no. 1652089)
Benchtop microcentrifuge (Eppendorf, cat. no. 5425)
Gene Pulser Xcell Electroporation System (Bio-Rad, cat. no. 1652666)
PCR strip tubes, 0.2 ml (USA Scientific, cat. no. 102–4700)
MiniAmp Plus Thermal Cycler (Thermo Fisher Scientific, cat. no. A37835)
MicroAmp Fast Optical 96-Well Reaction Plate, 0.1 mL (Thermo Fisher Scientific, cat. no. 4346907)
StepOnePlus Real-Time PCR System (Applied Biosystems, cat. no. 4376600)
E-Gel EX agarose gels, 2% (Thermo Fisher Scientific, cat. no. G402002)
E-Gel Power Snap Electrophoresis Device (Thermo Fisher Scientific, cat. no. G8100) CRITICAL In place of E-Gel EX agarose gels and E-Gel Power Snap Electrophoresis Device, a 2% agarose gel can be poured in the lab and run on a standard gel electrophoresis system. For more information on specific equipment and protocols to both make and run agarose gels, see https://www.jove.com/t/3923/agarose-gel-electrophoresis-for-the-separation-of-dna-fragments37
Magnetic separator for 96-well PCR plate (DynaMag-96 Side Skirted Magnet; Thermo Fisher Scientific, cat. no. 12027)
MiSeq system (Illumina, cat. no. SY-410–1003)
Software
The computational methods described in this protocol have been implemented for a Unix-like operating system with a bash shell. The Jupyter notebook (written in Python 3) serves as a self-contained, interactive walkthrough of the analysis of the deep sequencing data generated during our experiments. It requires Jupyter to be installed; the remaining dependencies are handled internally. Although the analysis pipeline is meant to be run on a Unix-like operating system, it can be adapted to run on Windows-based operating systems with minimal changes to the notebook.
Python dependencies are listed below. For new Python users, we strongly recommend beginning with the Anaconda Distribution (https://www.anaconda.com/distribution), which includes Python and many other commonly used packages for scientific computing and data science. Anaconda also enables easy installation of dependencies. See Troubleshooting for more information.
Python ≥ v3.0, available here (https://www.python.org/downloads/)
Jupyterlab, to run the notebook, available here (https://jupyter.org/install)38
Biopython, a set of freely available tools for biological computation, available here (https://biopython.org/wiki/Download)
Fuzzysearch, a python package for string matching, available here (https://pypi.org/project/fuzzysearch/)
sickle-trim, a python package for read trimming, available here (https://github.com/najoshi/sickle)39
seaborn, a data visualization library based on matplotlib, available here (https://seaborn.pydata.org/installing.html)
numpy, a package for scientific computing, available here (https://numpy.org/install/)
pandas, a powerful data analysis and manipulation tool, available here (https://pandas.pydata.org/getting_started.html)
matplotlib, a comprehensive library for creating static, animated, and interactive visualizations, available here (https://matplotlib.org/stable/users/getting_started/)
multiprocess, a library that enables multiprocessing and multithreading in python, available here (https://pypi.org/project/multiprocess/#files)
We recommend installing these packages through pip or conda (Anaconda’s package manager) with the following commands:
pip install jupyterlab biopython fuzzysearch seaborn numpy multiprocess pandas matplotlib
conda install -c bioconda sickle-trim
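Before opening the notebook, it can be useful to confirm that the installation succeeded. The snippet below is a small, optional check and not part of the published pipeline. It assumes the import names shown (Biopython is imported as `Bio`) and that the bioconda `sickle-trim` package places an executable named `sickle` on the PATH.

```python
import importlib.util
import shutil

# Import names, which can differ from the names used with pip.
REQUIRED_MODULES = [
    "jupyterlab", "Bio", "fuzzysearch", "seaborn",
    "numpy", "pandas", "matplotlib", "multiprocess",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the modules that cannot be found in the current environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


def sickle_available():
    """True if an executable named 'sickle' is on the PATH (assumed name)."""
    return shutil.which("sickle") is not None


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules() or "none")
    print("sickle found on PATH:", sickle_available())
```

An empty list from `missing_modules()` together with `True` from `sickle_available()` suggests the environment is ready.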
Reagent Setup
Preparing primers for deep sequencing (Table 1)
Synthesize primers as single-stranded oligonucleotides. Dissolve each oligo in UltraPure™ Distilled Water to a final concentration of 100 μM. To store, keep at −20°C for up to 1 year.
Stock 96-well plate of indexing primer (100 μM each)
Starting from a 96-well PCR plate containing a pair of dried forward and reverse indexing primers per well (see Supplementary Table 1 for potential primer sequences), spin the plate in a plate spinner to collect dried primers at the bottom of each well. Add UltraPure™ Distilled Water to each well to a final concentration of 100 μM of each primer. To store, seal the 96-well PCR plate with an adhesive plate foil and keep at −20°C for up to 1 year.
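The volume of water needed to reach 100 μM depends on the amount of dried oligo supplied, which the synthesis company usually reports in nanomoles. A minimal helper for this arithmetic is sketched below; the function name is illustrative and the 25 nmol figure in the usage note is only an example amount.

```python
def resuspension_volume_ul(amount_nmol, target_um=100.0):
    """Water volume (μL) that brings a dried oligo to the target concentration (μM)."""
    if amount_nmol <= 0 or target_um <= 0:
        raise ValueError("amount and target concentration must be positive")
    # 1 nmol dissolved in 1 μL is 1 mM, i.e. 1000 μM.
    return amount_nmol * 1000.0 / target_um
```

For instance, `resuspension_volume_ul(25)` gives 250 μL for 25 nmol of oligo at 100 μM.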
Procedure
Temporal recording procedure; transformation of signal and expression plasmids into expression strain
● Timing 4 d, 1.5 h hands-on
-
1
Place one aliquot of E. coli BL21-AI cells on ice to thaw. When the aliquot is fully thawed (5–10 min), transfer 15–50 μL of cells per transformation into a clean, pre-chilled bacterial culture tube.
CRITICAL STEP Commercially available BL21-AI E. coli contains its own endogenous retron, whose RT-DNA appears on a PAGE gel. However, we find that its presence does not prevent Retro-Cascorder from making the transcriptional recordings described in this protocol. If the presence of the endogenous retron impacts the user’s experiments, an alternative BL21-AI E. coli strain, bSLS.114, that lacks this endogenous retron is available through Addgene (cat. no. 191530).
-
2
On ice, add 1 pg–100 ng signal plasmid (i.e., pSBK.134) DNA in 1–3 μL total volume to the tube containing BL21-AI and gently swirl the solution with the pipette tip. Keep on ice for 15 minutes.
CRITICAL STEP The order in which the plasmids are transformed matters. If the expression plasmid is added first, Cas1-Cas2 may acquire spacers before the transcriptional recording experiment begins. To guard against this risk, we recommend always adding the expression plasmid last (see Step 10).
-
3
Heat shock cells by placing the tube in a water bath heated to 42°C for 30 s. Place the tube back on ice for 1 minute.
-
4
Add 250 μL SOC media to the tube and place in the bacterial shaker at 37°C at 250 r.p.m. for 1 hour to allow cells to recover and express antibiotic resistance gene.
-
5
Plate entire transformation on a pre-warmed LB agar plate (10 cm, 35 μg ml–1 kanamycin). Spread bacteria with flamed Pasteur pipet in a dilution series to promote formation of individual colonies (if unfamiliar with how to perform a dilution series, see https://www.jove.com/v/10507/serial-dilutions-and-plating-microbial-enumeration40).
-
6
Incubate LB agar plate overnight at 37°C for 16 hours.
-
7
The following morning, check the LB agar plate for the presence of bacterial colonies. Take LB agar plates from 37°C and leave them at 4°C or room temperature (20°C) until the evening.
PAUSE POINT LB agar plates containing colonies of BL21-AI containing signal plasmid can be kept at 4°C for up to 1 week before transforming with the expression plasmid.
-
8
Add 3 mL of LB media containing kanamycin (35 μg/mL) into a bacterial culture tube. Inoculate one tube with one bacterial colony from the LB agar plate. Transfer tube to bacterial shaker set at 37°C at 250 r.p.m. overnight for 16 hours.
-
9
The following morning, dilute 60 μL of culture into a bacterial culture tube with 3 mL LB containing the antibiotic kanamycin. Place tube in bacterial shaker set at 37°C at 250 r.p.m. and let incubate for 2 h.
-
10
During the 2 h incubation, prepare a 10 pg/μL solution of expression plasmid by diluting the expression plasmid (i.e. pSBK.079) in water to a total volume of 50 μL in a microcentrifuge tube. Place the expression plasmid solution, a 1-mm-gap cuvette, a microcentrifuge tube, and a 15 mL conical tube filled with water on ice to pre-chill.
-
11
After 2 h, transfer 1 mL of culture from Step 9 to the pre-chilled microcentrifuge tube from Step 10. Centrifuge the microcentrifuge tube in a microcentrifuge at 4°C for 30 s at 10,000 × g to pellet the culture.
CRITICAL STEP After 2 h, the culture should be barely cloudy. Electroporation efficiency will be lower for a denser culture.
CRITICAL STEP For Steps 11–14, the bacteria should be kept cold at 4°C and all steps performed quickly to increase acquisition efficiency and avoid as much cell death as possible.
-
12
Remove supernatant by pipetting and resuspend cells in 1 mL chilled water from Step 10. Centrifuge the microcentrifuge tube in a microcentrifuge at 4°C for 30 s at 10,000 × g to pellet the culture.
-
13
Repeat Step 12 two more times for a total of three washes. If the bacterial pellet becomes loose once cells are in water, the microcentrifugation duration can be increased to 1 minute.
-
14
Remove supernatant and resuspend cells in 50 μL of expression plasmid solution from Step 10. Transfer 50 μL of the mixture to a pre-chilled cuvette from Step 10.
-
15
Dry the cuvette using a paper towel and place in the electroporation system. Electroporate using the following parameters: 1.8 kV, 25 μF, and 200 Ω.
CRITICAL STEP The time constant (τ) is the time, in milliseconds, that it takes for the voltage to decay to roughly one-third (1/e, about 37%) of the initial set voltage. It should ideally be between 4.7 and 5.2 ms and can be found on the display screen of the electroporation system after each electroporation. If any time constant is below 4 ms, we recommend discarding the electroporation and trying again.
? TROUBLESHOOTING
-
16
Quickly recover the electroporated cells by pipetting 250 μL of SOC media into the cuvette and mixing with the cells. Then transfer the mixed solution from the cuvette into a bacterial culture tube. Place the tube in the bacterial shaker at 37°C at 250 r.p.m. for 1 hour to allow cells to recover and express the antibiotic resistance gene.
-
17
Plate entire transformation on a pre-warmed LB agar plate (10 cm, 100 μg ml–1 carbenicillin and 35 μg ml–1 kanamycin). Spread bacteria with flamed Pasteur pipet in a dilution series to promote formation of individual colonies.
-
18
Repeat Steps 6–7. There should be bacterial colonies containing both the signal and expression plasmid on the LB agar plate.
PAUSE POINT LB agar plates containing colonies of BL21-AI containing expression and signal plasmid can be kept at room temperature for up to 24 hours or at 4°C for up to 3 weeks before starting the recording experiment.
Temporal recording procedure; recording transcriptional activity for 48 hours
● Timing 2 d, 40 min hands-on
-
19
For each biological replicate, add 3 mL of LB media containing carbenicillin (100 μg/mL) and kanamycin into a bacterial culture tube. Inoculate each tube with one bacterial colony from the LB agar plate from Step 18. Transfer tubes to bacterial shaker set at 37°C at 250 r.p.m. overnight for 16 hours.
-
20
The following morning, dilute 150 μL of culture into a bacterial culture tube with 3 mL LB containing the antibiotics carbenicillin and kanamycin and the inducers IPTG (1 mM) and L-arabinose (2 mg/mL) to induce expression of Cas1-Cas2. If appropriate, add a relevant compound (e.g., 3 μL of 100 ng/mL anhydrotetracycline to turn on the pTet* promoter in pSBK.134) to induce expression of the first transcriptional event on the signal plasmid. Place the tube in a bacterial shaker set at 37°C at 250 r.p.m. and incubate for 8 h.
-
21
After 8 h, dilute 60 μL of culture into a new bacterial culture tube with 3 mL LB containing the same antibiotics and inducers as in Step 20. Transfer tubes to a bacterial shaker set at 37°C at 250 r.p.m. and incubate overnight for 16 hours.
CRITICAL STEP Bacteria are diluted after 8 h of growth to prevent them from reaching stationary phase and to allow cells to continue log-phase growth. Additionally, always add fresh inducers to ensure continual expression of Cas1 and Cas2.
-
22
Repeat Steps 20 and 21 but, if appropriate, add a relevant compound (e.g., 30 μL of 100 μM choline chloride to turn on the pBetI promoter in pSBK.134) to induce expression of the second transcriptional event on the signal plasmid.
? TROUBLESHOOTING
-
23
After 48 h of bacterial growth and recording, collect 25 μL sample and mix with 25 μL water in a PCR tube. Boil at 95°C in a thermocycler for 5 min to lyse cells then allow to cool on benchtop (~5 min) before freezing at −20°C for later analysis.
PAUSE POINT Boiled bacterial samples can be stored at −20°C for at least 6 months.
? TROUBLESHOOTING
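The additions made during the 48 h recording reduce to simple dilution arithmetic, sketched below for a 3 mL culture. The helper names and the `SCHEDULE` list are illustrative conveniences, not part of the protocol software, and both calculations ignore the small volume contributed by the stock itself.

```python
def stock_volume_ul(stock_conc, final_conc, culture_ml=3.0):
    """μL of stock needed to reach final_conc (same units as stock_conc)."""
    return final_conc / stock_conc * culture_ml * 1000.0


def final_concentration(stock_conc, added_ul, culture_ml=3.0):
    """Approximate final concentration after adding `added_ul` of stock."""
    return stock_conc * added_ul / (culture_ml * 1000.0)


# Hours from the start of recording, paired with the corresponding step.
SCHEDULE = [
    (0, "Step 20: dilute 150 μL overnight culture into 3 mL LB with "
        "antibiotics, IPTG, L-arabinose and the first inducer"),
    (8, "Step 21: dilute 60 μL into 3 mL fresh LB with the same additions"),
    (24, "Step 22: repeat Step 20 with the second inducer"),
    (32, "Step 22: repeat Step 21"),
    (48, "Step 23: collect 25 μL, mix with 25 μL water, boil and freeze"),
]

if __name__ == "__main__":
    print("IPTG (100 mM stock, 1 mM final):",
          round(stock_volume_ul(100, 1), 2), "μL")
    print("L-arabinose (200 mg/mL stock, 2 mg/mL final):",
          round(stock_volume_ul(200, 2), 2), "μL")
    for hour, action in SCHEDULE:
        print(f"{hour:>2} h  {action}")
```

Under the same approximation, `final_concentration(100, 3)` shows that 3 μL of 100 ng/mL anhydrotetracycline in 3 mL corresponds to about 0.1 ng/mL, and `final_concentration(100, 30)` shows that 30 μL of 100 μM choline chloride corresponds to about 1 μM.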
Preparation of CRISPR arrays for deep sequencing: Determine appropriate cycle number for first round PCR amplification
● Timing 2 h, 2 h hands-on
CRITICAL: During first round PCR amplification, CRISPR arrays are amplified from the bacterial genome. Ideally, the PCR should stop near the end of exponential (log-phase) amplification, before reaching the plateau. Minimizing excess cycles is important to reduce the potential for crossover events during the later cycles of a PCR. For each experimental paradigm or different signal plasmid, we recommend performing at least one qPCR amplification to determine the ideal cycle number at which to stop the PCR. After performing this step once, it should not have to be repeated for a given signal plasmid unless the user is attempting to troubleshoot potential issues downstream.
-
24
Thaw frozen bacterial samples from Step 23.
-
25
Dilute SYBR Green I in water to a final concentration of 5X for qPCR amplification. Given the size of the dilution, we recommend performing a serial dilution, i.e. add 1 μL SYBR Green I to 99 μL water and mix well. Then, add 15 μL of the mixture to 285 μL of water and mix well.
-
26
Prepare first round PCR primer mix by combining primers (see Table 1) as follows:
Component | Amount (μL) | Final concentration (μM) |
H2O | 90 | - |
SPCR_MiSeq3_fow1, 100 μM | 2 | 2 |
SPCR_MiSeq3_fow2, 100 μM | 2 | 2 |
SPCR_MiSeq3_fow3, 100 μM | 2 | 2 |
SPCR_MiSeq3_fow4, 100 μM | 2 | 2 |
SPCR_MiSeq3_fow5, 100 μM | 2 | 2 |
Total | 100 | - |
PAUSE POINT Aliquots of first round PCR primer mix can be stored at −20°C for at least one year.
CRITICAL STEP Since the Illumina MiSeq system calibrates on a diverse set of nucleotides, we recommend using a mixture of at least 5 forward primers with varied lengths and nucleotides to ensure accuracy of the subsequent reads, assuming that the majority of the run will include CRISPR arrays. In the case that a diverse set of amplicons will be run alongside the arrays, there is no need to use varied forward primers.
-
27
Prepare the qPCR reaction as follows on ice. We recommend creating a master mix with all reagents except DNA template. Dispense 24.2 μL of master mix per well, then add 0.8 μL DNA template per well.
Component | Amount per reaction (μL) | Final concentration |
Q5 Reaction Buffer (5X) | 5 | 1X |
dNTPs (10 mM) | 0.5 | 200 μM |
Forward Primer mix from Step 26 (10 μM) | 1.25 | 0.5 μM |
SPCR_MiSeq3_rev (Table 1) (10 μM) | 1.25 | 0.5 μM |
Template DNA from Step 24 | 0.8 | - |
Q5 High-Fidelity DNA Polymerase | 0.25 | - |
SYBR Green I from Step 25 (5X) | 5 | 1X |
H2O | 10.95 | - |
Total | 25 | - |
-
28
Begin the qPCR reaction using the following qPCR protocol:
Cycle number | Denature | Anneal | Extend |
1 | 98°C, 3 min | - | - |
2–46 | 98°C, 20 s | 60°C, 15 s | 72°C, 20 s |
-
29
Note the cycle, between cycles 2 and 46, at which sample traces begin to plateau. Use this cycle number for the subsequent first round PCR amplifications of all biological replicates with the same signal plasmid and experimental paradigm.
? TROUBLESHOOTING
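The reagent arithmetic in Steps 25 and 27 can be checked and scaled with a few lines of Python. The sketch below copies the per-reaction volumes from the Step 27 table. The 10% overage used when scaling the master mix is an assumption added here to cover pipetting loss, not a value from the protocol, and the function names are illustrative.

```python
# Per-reaction volumes (μL) from Step 27, excluding the 0.8 μL of template.
QPCR_MIX_UL = {
    "Q5 Reaction Buffer (5X)": 5.0,
    "dNTPs (10 mM)": 0.5,
    "Forward primer mix (10 μM)": 1.25,
    "SPCR_MiSeq3_rev (10 μM)": 1.25,
    "Q5 High-Fidelity DNA Polymerase": 0.25,
    "SYBR Green I (5X)": 5.0,
    "H2O": 10.95,
}
TEMPLATE_UL = 0.8


def master_mix(n_reactions, overage=0.1, recipe=QPCR_MIX_UL):
    """Scale each component for n reactions plus a fractional overage."""
    scale = n_reactions * (1 + overage)
    return {name: round(vol * scale, 2) for name, vol in recipe.items()}


def serial_dilution_factor(steps):
    """Overall fold dilution for (sample_ul, diluent_ul) steps in series."""
    factor = 1.0
    for sample_ul, diluent_ul in steps:
        factor *= (sample_ul + diluent_ul) / sample_ul
    return factor


if __name__ == "__main__":
    # Step 25: 1 μL into 99 μL, then 15 μL into 285 μL, from a 10,000X stock.
    fold = serial_dilution_factor([(1, 99), (15, 285)])
    print("SYBR Green I working concentration:", 10000 / fold, "X")
    for name, vol in master_mix(12).items():
        print(f"{name}: {vol} μL")
```

The two dilutions in Step 25 multiply to a 2,000-fold dilution, which takes the 10,000X SYBR Green I stock to 5X, and the master mix components sum to the 24.2 μL dispensed per well.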
Preparation of CRISPR arrays for deep sequencing: amplifying and indexing samples
● Timing 6 h, 4 h hands-on
CRITICAL: This part of the Procedure amplifies the BL21-AI CRISPR array within the genome and attaches a primer extension that will be used for indexing.
-
30
Thaw frozen bacterial samples from Step 23.
-
31
First round PCR amplification. Prepare the PCR reaction as follows on ice. We recommend first preparing a master mix without the template DNA. Dispense 24.2 μL of master mix per well, then add 0.8 μL boiled bacterial sample per well.
Component | Amount per reaction (μL) | Final concentration |
Q5 Reaction Buffer (5X) | 5 | 1X |
dNTPs (10 mM) | 0.5 | 200 μM |
Forward Primer mix from Step 26 (10 μM) | 1.25 | 0.5 μM |
SPCR_MiSeq3_rev (Table 1) (10 μM) | 1.25 | 0.5 μM |
Template DNA from Step 30 | 0.8 | - |
Q5 High-Fidelity DNA Polymerase | 0.25 | - |
H2O | 15.95 | - |
Total | 25 | - |
-
32
Perform the PCR reaction using the following thermocycling protocol:
Cycle number | Denature | Anneal | Extend |
1 | 98°C, 30 s | - | - |
2–[cycle number determined in Step 29] | 98°C, 10 s | 72°C, 30 s | 72°C, 30 s |
[cycle number determined in Step 29] + 1 | - | - | 72°C, 2 min |
Following PCR amplification, freeze each PCR reaction at −20°C or continue immediately to indexing.
CRITICAL STEP We typically find that there is no need to purify or clean up the products of the first round of PCR before moving on to second round PCR amplification. However, if the user does not obtain the expected product during the indexing reaction, a DNA clean-up may help during troubleshooting.
PAUSE POINT First round PCR reactions can be stored at −20°C for at least 6 months.
-
33
Run samples out on a 2% agarose E-Gel EX by loading 3.5 μL first round PCR product and 16.5 μL water into each well. Use the 1 Kb Plus DNA Ladder as a reference by adding 2 μL undiluted ladder and 18 μL water to the marker lane. Run the gel for 10 minutes. Validate that the brightest PCR band is 265 nt. This band corresponds to the unexpanded CRISPR array, although larger bands may also be visible. These larger bands should exceed the unexpanded array by a multiple of 61 nucleotides and correspond to expanded CRISPR arrays containing 1 or more additional spacers.
CRITICAL STEP As mentioned in the Materials section, an alternative option to the 2% agarose E-gel EX is to create and run an agarose gel made in lab, as described in https://www.jove.com/t/3923/agarose-gel-electrophoresis-for-the-separation-of-dna-fragments37. For a homemade 2% gel, we recommend running the gel for around 30 min at 100V.
? TROUBLESHOOTING
-
34
Create a 10 μM stock indexing plate from the 100 μM stock indexing plate (see Reagent Setup). We recommend first spinning the 100 μM stock indexing plate in a plate spinner to collect liquid at the bottom of each well before mixing 20 μL of each original 100 μM primer solution with 180 μL water in a new 96-well PCR plate. Afterwards, make a working plate at 100 nM by adding 2 μL of primer solution from the 10 μM stock plate with 198 μL water in another 96-well PCR plate. Both stock and working plates can be stored at −20°C for at least 1 year.
CRITICAL STEP Be careful not to cross-contaminate primers between wells by always spinning the plate to collect liquid before opening and never reusing pipette tips. In case of unexpected results downstream, making a new working plate is always safest.
-
35
Dilute DNA template for indexing by adding 5 μL first round PCR reaction from Step 32 into water for a total volume of 75 μL. Pipette up and down vigorously to mix the reaction.
-
36
Dilute SYBR Green I in water to a final concentration of 5X for qPCR amplification. Given the size of the dilution, we recommend performing a serial dilution, i.e. add 1 μL SYBR Green I to 99 μL water and mix well. Then, add 15 μL of the mixture to 285 μL of water and mix well.
-
37
Indexing. This step indexes each amplicon with a unique, sample-specific barcode necessary for subsequent deep sequencing. Prepare the qPCR reaction as follows on ice. We recommend creating a master mix with all reagents except DNA template and the forward and reverse P5 & P7 indexing primers (Table 1). Dispense 23 μL of master mix per well then add 1 μL of each indexing primer and 5 μL DNA template per well.
Component | Amount per reaction (μL) | Final concentration |
H2O | 12.18 | - |
Q5 Reaction Buffer (5X) | 6 | 1X |
SYBR Green I from Step 36 (5X) | 3 | 0.5X |
dNTPs (10 mM) | 0.9 | 300 μM |
P5 primer (25 μM) | 0.36 | 0.3 μM |
P7 primer (25 μM) | 0.36 | 0.3 μM |
Q5 Hotstart Polymerase | 0.6 | - |
Rox (50X) | 0.6 | 1X |
Indexing primers (F & R) from Step 34 (100 nM each) | 1 | 3.33 nM each |
DNA template from Step 32 | 5 | - |
Total | 30 | - |
CRITICAL STEP Do not substitute a qPCR polymerase for a high-fidelity DNA polymerase like Q5. Use of a qPCR polymerase may result in higher error rates, which decrease the fidelity and accuracy of reads in the deep sequencing library.
CRITICAL STEP This reaction is modified from a normal PCR to include two sets of distinct primers rather than a single primer set. The indexing primers are very long and have a propensity to create unwanted products such as primer dimers, so they are added at a low concentration. Meanwhile, the P5 and P7 primers are added at a higher concentration. The intention is to add indices based on the indexing primers to the amplicons in the initial cycles and then to avoid later unwanted products by allowing the P5 and P7 primers to amplify the indexed molecules throughout the rest of the cycles.
-
38
Begin the qPCR reaction using the following qPCR protocol:
Cycle number | Denature | Anneal | Extend |
1 | 98°C, 3 min | - | - |
2–46 | 98°C, 20 s | 60°C, 15 s | 72°C, 20 s |
-
39
If sample amplification traces begin to approach plateau during cycles 2–46, stop the machine and move 25 μL of these samples to a new, clean 96-well PCR plate. Preferably, samples should be moved before their traces reach plateau phase, to avoid overamplification and subsequent artifacts. We recommend removing the samples as the amplification curves begin to flatten, which ideally corresponds to 2–3 cycles before the plateau.
Start a new qPCR run with the same protocol as Step 38 to resume amplification of the remaining samples in the original plate.
? TROUBLESHOOTING
-
40
Repeat Step 39 until all indexed samples have plateaued and been collected into the same 96-well plate. Continue immediately to DNA clean-up or seal the 96-well PCR plate with an adhesive plate foil before freezing at −20°C.
PAUSE POINT Indexed amplicons can be stored at −20°C for at least one month.
CRITICAL STEP Collecting indexing product before plateauing minimizes the chance for chimeric amplicons and index swapping. However, to save time, we do not recommend stopping, removing samples, and restarting a plate more than two times. To avoid too many restarts, we recommend pulling the whole plate after multiple samples have plateaued and choosing to collect samples in batches, rather than waiting for and collecting each individual sample as it leaves its exponential phase.
-
41
PCR clean-up using beads. This step purifies indexed amplicons without biasing the amplicons based on their size using beads. To save on reagent costs, we typically prepare and wash our own SPRI beads. Refer to Supplementary Methods Steps 1–14 for details. However, to save time, commercially available XP AMPure beads may also be used interchangeably without any additional preparations or wash steps. If using XP AMPure beads, add a volume of beads at a 1.8X bead-to-DNA dilution to each well in the 96-well plate containing indexed product. For example, for a 1.8X bead-to-DNA dilution, add 45 μL beads to 25 μL indexed product per well. Otherwise, if using homemade SPRI beads, determining the optimal ratio of beads to add is explained in Supplementary Methods Steps 15–24.
-
42
Mix reaction thoroughly by pipetting between 10–15 times. Incubate reaction for 5 minutes at room temperature. Place 96-well plate on magnet until beads migrate near the magnet and the solution is clear.
-
43
Remove the supernatant by pipetting and discard.
-
44
Wash the DNA by adding 200 μL fresh 70% ethanol and allow to incubate on the magnet for one minute until the solution is clear. Remove supernatant by pipetting and discard.
-
45
Repeat Step 44 to wash DNA one more time.
-
46
Let DNA dry for <3 minutes.
CRITICAL STEP DNA needs to dry to remove contaminant ethanol but allowing the beads to dry for too long will result in low recovery. The presence of cracks appearing in the beads is a sign that the DNA has been allowed to dry for too long.
-
47
Remove the plate from the magnet and resuspend the DNA with 25 μL water. Pipette up and down between 10–15 times to mix. Incubate for 5 minutes at room temperature.
-
48
Place the plate back on the magnet and wait until the solution is clear. Collect supernatant and move into a new, clean 96-well PCR plate. Continue immediately to deep sequencing or seal the 96-well PCR plate with an adhesive plate foil before freezing at −20°C.
PAUSE POINT Purified, indexed samples can be stored at −20°C for at least a month.
-
49
Run samples out on a 2% agarose E-gel EX by loading 3.5 μL indexed product and 16.5 μL water into each well. Use the 1 kB+ ladder as a reference by adding 2 μL undiluted ladder and 18 μL water to the marker lane. Run gel for 10 minutes. Validate that the brightest PCR band is 402 nt. This band corresponds to the unexpanded CRISPR array, although higher-level bands may also be visible. These bands should be some multiple of 61 nucleotides larger than the unexpanded array, and correspond to expanded CRISPR arrays containing 1 or more additional spacers.
? TROUBLESHOOTING
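As a quick reference when reading the gel in Step 49, the expected band sizes can be listed from the two numbers given there (402 nt for the unexpanded array, 61 nt per added spacer). This is a minimal sketch; the function name is illustrative.

```python
UNEXPANDED_NT = 402  # amplicon from the unexpanded CRISPR array (Step 49)
EXPANSION_NT = 61    # size added by each newly acquired spacer (Step 49)


def expected_band_sizes(max_new_spacers=3):
    """Return expected amplicon sizes for 0..max_new_spacers new spacers."""
    return [UNEXPANDED_NT + EXPANSION_NT * k for k in range(max_new_spacers + 1)]


if __name__ == "__main__":
    print(expected_band_sizes())  # [402, 463, 524, 585]
```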
Multiplexed sequencing of CRISPR arrays
● Timing 3.5 h, 2 h hands-on
-
50
Dilute cleaned-up and indexed samples from Step 48 1:40,000 in water for quantification. We recommend performing this step through serial dilution, i.e. add 1 μL sample from Step 48 to 499 μL of water and mix well. Then, add 10 μL of the mixture to 790 μL of water and mix well.
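The two-step dilution in Step 50 multiplies to the stated 1:40,000 (1:500 followed by 1:80). A small sketch to confirm this, or to check an alternative dilution scheme, is given here; the function name is an illustrative assumption.

```python
from fractions import Fraction


def dilution_factor(steps):
    """Return the overall fold dilution for a serial dilution.

    `steps` is a list of (sample_ul, diluent_ul) pairs given as integers.
    """
    total = Fraction(1)
    for sample_ul, diluent_ul in steps:
        total *= Fraction(sample_ul + diluent_ul, sample_ul)
    return total


if __name__ == "__main__":
    # Step 50: 1 uL into 499 uL water, then 10 uL into 790 uL water.
    print(dilution_factor([(1, 499), (10, 790)]))  # 40000
```

The concentration of the undiluted library is then the qPCR-measured concentration multiplied by this factor.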
-
51
Set up qPCR to quantify amount of each diluted sample using the KAPA Library Quantification Kit along with their DNA Control Standard. Each standard should be run in duplicate. In a 96-well plate, prepare reactions as follows on ice:
Component | Amount per reaction (μL) |
---|---|
KAPA Mastermix (with primers and ROX added previously, according to manufacturer’s instructions) | 6.2 |
1:40,000 diluted sample from Step 50 OR undiluted standard | 4 |
-
52
Run qPCR according to the protocol included with the KAPA Library Quantification kit.
-
53
Using qPCR results, calculate molar concentrations of cleaned-up, indexed samples using the KAPA Library Quantification Data Analysis Template provided by the KAPA Library Quantification Kit (Supplementary Table 2).
-
54
After determining the molarity of each indexed sample, normalize the samples by adding the appropriate volume of each sample, along with water, to a single microcentrifuge tube to produce a multiplexed library that yields the desired number of reads for each sample. We recommend using the software tool “Pipette-Guide-96” (https://github.com/tamilieberman/Pipette-Guide-96) when pipetting samples to help save time and keep track of the work.
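The normalization in Step 54 can be worked out with simple arithmetic, because 1 nM equals 1 fmol/μL. The sketch below assumes each sample contributes moles in proportion to the number of reads wanted from it. The function names, the total of 100 fmol, and the 4 nM final concentration in the usage example are illustrative assumptions, not values prescribed by this protocol.

```python
def pooling_volumes(conc_nM, reads_wanted, total_fmol=100.0):
    """Return the volume (uL) of each sample to pool.

    conc_nM: sample name -> molar concentration in nM (== fmol/uL).
    reads_wanted: sample name -> desired number (or share) of reads.
    total_fmol: total amount of library in the pool (assumed value).
    """
    total_reads = sum(reads_wanted.values())
    volumes = {}
    for sample, conc in conc_nM.items():
        share_fmol = total_fmol * reads_wanted[sample] / total_reads
        volumes[sample] = share_fmol / conc
    return volumes


def water_to_add(volumes, total_fmol, final_nM):
    """Return the water volume (uL) needed to bring the pool to final_nM."""
    final_volume = total_fmol / final_nM
    water = final_volume - sum(volumes.values())
    if water < 0:
        raise ValueError("Samples are too dilute to reach the requested concentration.")
    return water


if __name__ == "__main__":
    conc = {"sample_1": 10.0, "sample_2": 20.0}
    reads = {"sample_1": 1, "sample_2": 1}
    vols = pooling_volumes(conc, reads, total_fmol=100.0)
    print(vols)                              # {'sample_1': 5.0, 'sample_2': 2.5}
    print(water_to_add(vols, 100.0, 4.0))    # 17.5
```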
-
55
Dilute and denature multiplexed library according to the MiSeq System Denature and Dilute Libraries Guide (https://support.illumina.com/sequencing/sequencing_instruments/miseq/documentation.html) from Illumina.
CRITICAL STEP Due to the low diversity of templates present in the library, use a PhiX control spike-in of 10% as directed on page 10 of the MiSeq System Denature and Dilute Libraries Guide. Libraries with low diversity often have an unbalanced nucleotide composition (that is, the relative proportions of the four nucleotide bases are unequal). In such cases, the MiSeq instrument may fail to accurately sequence the samples. To compensate, the PhiX control spike-in provides a more balanced base composition to improve the sequencing run quality.
-
56
Prepare the sample sheet for the MiSeq run with the appropriate information, such as chemistry, number of cycles, and indices.
-
57
Load the MiSeq instrument and run. For additional information on deep sequencing, we recommend referencing the MiSeq System Guide from Illumina.
Data analysis
● Timing 1–3 h (depending on number of CPU cores available), 30 min hands-on
CRITICAL We have adapted the scripts pertaining to our original publication, available on our GitHub (https://github.com/Shipman-Lab/Spacer-Seq), into a Jupyter notebook38 (https://docs.jupyter.org/en/latest/), written in Python. This notebook serves as a self-contained, interactive walkthrough of the deep sequencing data generated during our experiments and can also be used to analyze users’ own data by running each notebook cell in order with their FASTQ files from the deep sequencing run in Step 57. The notebook requires JupyterLab or a similar ipython-notebook handler to be installed; the rest of the dependencies are handled internally within the notebook. Note that the analysis pipeline is meant to be run on a Unix-like operating system; nonetheless, it can be adapted to run on Windows-based OSs with minimal changes to the notebook, which are pointed out in the notebook where relevant. This notebook focuses on recreating figure 4L from our original publication21, which shows the ordering analysis of recording experiments with signal plasmid pSBK.134 (as detailed in Steps 58–73). Hence, the data downloaded will be that pertaining to figure 4L from ref. 21.
-
58
If necessary, using terminal, install JupyterLab with pip:
pip install jupyterlab
-
59
Download the GitHub repository (https://github.com/Shipman-Lab/Spacer-Seq_Nat-Protocols). The simplest way is to download it as a .zip file and uncompress it.
-
60
Using terminal, go to the GitHub repository directory:
cd Spacer-Seq_Nat-Protocols-main
-
61
The notebook uses the following dependencies:
fuzzysearch
Biopython
seaborn
numpy
sickle-trim
These can be installed by running a cell in the notebook which verifies that the required dependencies are installed, or installs them if need be.
-
62
Import the necessary Python packages and dependencies.
-
63
Run the relevant “Step 63” cell in the notebook to load a dataframe with metadata relevant to the FASTQ files that the user wants to analyze.
If users would like to use example FASTQ files we’ve provided to recreate Fig. 4L from ref. 21, our Jupyter notebook is set up to load the relevant Sequence Read Archives (SRA) run table by running the “Step 63(a)” cell, which contains metadata describing the sequencing files. This file is provided in the GitHub repository with the notebook, but can be accessed through the NCBI Sequence Read Archive (PRJNA838025).
-
Alternatively, if users would like to analyze their own FASTQ files from the sequencing run in Step 57, they can perform the following steps:
Download the FASTQ files from the sequencing run in Step 57, and save the FASTQ files into a directory called “fastqs” located in the same directory as the Jupyter notebook.
Users should create a spreadsheet containing necessary metadata about their samples. This file should be a tab-delimited, spreadsheet-style table with columns that include “Library Name”, “Condition”, “Replicate”, “PCR”, and “Order”. Column “Library Name” should contain the name of the FASTQ file to be analyzed (e.g., “msSBK-2–35_S35_L001_R1_001.fastq.gz”); “Condition” is a description of the experiment run (e.g., “BA_PCR2”); “Replicate” is the biological replicate (e.g., 1); “PCR” is the technical PCR replicate (e.g., 3); and “Order” is the order of the experiment run (e.g., “AB” for an experiment where signal “A” is expected to have been present before signal “B”). We recommend creating the file in Microsoft Excel with the aforementioned columns and saving the output as a .txt file.
Save the metadata spreadsheet as “SraRunTable.txt” in the same directory as the Jupyter notebook.
Run the “Step 63(b)” cell in the notebook to load the metadata dataframe.
CRITICAL STEP There are a number of ways that the user can retrieve FASTQs from previous sequencing runs available through the NCBI SRA. This step can be performed manually. However, we recommend using ‘SRA-tools’, a collection of tools and libraries, developed by NCBI for the purpose of interacting with the SRA. This collection allows reasonably quick querying and downloading of the FASTQs. Of note, the most recent release of ‘SRA-tools’ is not available through ‘pip’ or Python’s usual dependency managers. Instead, it should be installed manually and interactively. To circumvent this, we have written a snippet of code that allows users to download the most recent release of ‘SRA-tools’ and use its packages locally. With this, users can specifically query and download the FASTQ files relevant to the analysis to be performed (i.e., the data pertaining to figure 4L of ref. 21).
CRITICAL STEP The snippet of bash code used to download and run ‘SRA-tools’ is written for a Unix-like OS – in the notebook, we have illustrated how to adapt it to run on MacOSX, and have suggested how users can adapt this to work on other OSs.
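For users analyzing their own FASTQ files, the tab-delimited metadata table described in Step 63 can also be written and checked programmatically rather than in a spreadsheet program. This is a minimal sketch using only the Python standard library; the helper names and the example row (including the file name) are hypothetical, and only the column names and the “SraRunTable.txt” file name come from the protocol.

```python
import csv
import os
import tempfile

# Column names required by the notebook (Step 63).
COLUMNS = ["Library Name", "Condition", "Replicate", "PCR", "Order"]


def write_run_table(rows, path="SraRunTable.txt"):
    """Write a list of row dictionaries as a tab-delimited table."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def read_run_table(path="SraRunTable.txt"):
    """Read the table back as a list of dictionaries (all values are strings)."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    # Hypothetical example row; replace with your own samples.
    example = [{
        "Library Name": "sample1_S1_L001_R1_001.fastq.gz",
        "Condition": "BA_PCR2",
        "Replicate": 1,
        "PCR": 2,
        "Order": "BA",
    }]
    # Written to a temporary directory here so that no existing file is overwritten.
    with tempfile.TemporaryDirectory() as tmp:
        table_path = os.path.join(tmp, "SraRunTable.txt")
        write_run_table(example, table_path)
        print(read_run_table(table_path))
```

For a real analysis, the file would be written next to the notebook, as Step 63 requires.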
-
64
Run the “Step 64” cell in the notebook to trim the FASTQs using ‘sickle-trim’, a Python package that uses sliding windows along with quality and length thresholds to determine where quality is low enough to trim the 3’-end of reads and where quality is high enough to trim the 5’-end of reads39.
-
65
Run the “Step 65” cell in the notebook to set global variables, such as the “Repeat” sequence and the “old” spacers’ sequences. These are the spacers found in the CRISPR array of BL21AI E. coli. Additionally, we define how stringently the query sequences have to match the references (i.e., how closely a putative repeat has to match the actual repeat sequence). This allows some tolerance for sequencing errors. By default, we set the repeat fuzziness to 4 (i.e., allowing 4 mismatches between query and reference) and the old spacers fuzziness to 5.
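To give a sense of what the “fuzziness” settings in Step 65 mean, the sketch below locates a reference sequence in a read while tolerating a fixed number of mismatches. It is a simplification that counts substitutions only; the notebook relies on the ‘fuzzysearch’ package, and this is not its implementation. The reference used in the example is a placeholder, not the actual repeat sequence.

```python
def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Sequences must have the same length.")
    return sum(x != y for x, y in zip(a, b))


def find_fuzzy(pattern, read, max_mismatches):
    """Return start positions where `pattern` occurs in `read`
    with at most `max_mismatches` substitutions."""
    hits = []
    width = len(pattern)
    for start in range(len(read) - width + 1):
        if hamming(pattern, read[start:start + width]) <= max_mismatches:
            hits.append(start)
    return hits


if __name__ == "__main__":
    placeholder_repeat = "ACGTACGTAC"          # placeholder, not the real repeat
    read = "TTT" + "ACGTTCGTAC" + "GGG"        # one substitution in the repeat
    print(find_fuzzy(placeholder_repeat, read, 0))  # []
    print(find_fuzzy(placeholder_repeat, read, 1))  # [3]
```

A higher tolerance recovers more reads that carry sequencing errors, at the cost of a greater chance of spurious matches.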
-
66
Run the “Step 66” cell to define the following functions that will be used for the analysis. These functions perform most of the analysis, and work as follows:
‘get_spcrs(sequence)’: takes as input a sequence (typically a single read), and returns a list of spacers extracted from said read.
‘not_existing(spacer)’: takes as input a sequence (typically a putative spacer), and determines whether this sequence resembles (≥83% similar) an old spacer or a repeat. Returns ‘False’ if so; if not, returns ‘True’ -- this is how new spacers (i.e., the results of new CRISPR array expansions) are identified.
‘get_spcrs_11BC(sequence)’: takes as input a sequence (typically a single read), and returns a list of spacers extracted from said read. This function works analogously to ‘get_spcrs’, with one important difference, as described in our original publication21. ‘get_spcrs_11BC’ is an implementation of the “lenient analysis” used in Figs. 4 and 5 of ref. 21, where a retron-derived spacer was defined to be a spacer that contained an 11-base region of the hypothetical prespacer consisting of the 7-base barcode region and 2 bases on either side (with one mismatch or indel allowed).
‘matchesTarget(target, seq)’: takes as input a target and reference sequence, and returns ‘True’ if the sequences are the same (with an allowance of 1 mismatch or change); returns ‘False’ otherwise.
‘double_order(double)’: takes as input two spacers from a double expansion, and returns a tuple of coded spacers, e.g. (‘A’, ‘B’) or (‘B’, ‘N’).
‘triple_order(triplet)’: takes as input three spacers from a triple expansion, and returns a tuple of coded spacers, e.g. (‘A’, ‘B’, ‘N’).
‘multiprocess_spr(file)’: this function will:
setup a temporary dictionary, ‘ddd’, to store the new spacer data;
generate a counter of the reads in the input FASTQ, for the sake of expediting the analysis;
-
iterate through each read in the counter, extract and determine the characteristics of the read and its spacer(s), such as:
does the read contain one or more spacers;
are the spacers “old” (one of the spacers found in the endogenous CRISPR array) or “new”;
store the read and spacer information in the temp dictionary ‘ddd’ as a dictionary ~‘{“FASTQ_i”: ddd}’, where ‘ddd’ is the dictionary with the information collected on all of the FASTQ reads;
-
return the dictionary for downstream analysis.
Note that the function called to extract the spacers is ‘get_spcrs’, which takes as input a read, and outputs a list of spacers. This list of spacers is then processed by the rest of the ‘multiprocess_spr’ function and the features detailed above are extracted and used to bin the spacers and reads, which are finally added to the temporary dictionary ‘ddd’, as discussed above.
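The read counter mentioned in the description of ‘multiprocess_spr’ speeds things up because identical reads need to be analyzed only once. The sketch below illustrates that idea under stated assumptions: it reads plain 4-line FASTQ records with the standard library rather than Biopython, and `analyse` stands in for whatever per-read function is applied (for example, spacer extraction). The function names are hypothetical and this is not the notebook's code.

```python
import gzip
from collections import Counter


def read_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record.

    Assumes unwrapped records (sequence on a single line).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


def tally_unique_reads(sequences, analyse):
    """Analyse each distinct read once and keep its multiplicity."""
    counts = Counter(sequences)
    results = {}
    for seq, n in counts.items():
        results[seq] = {"count": n, "result": analyse(seq)}
    return results


if __name__ == "__main__":
    reads = ["AAA", "AAA", "CC"]
    # `len` is a stand-in for a real per-read analysis function.
    print(tally_unique_reads(reads, len))
```

Downstream tallies then weight each result by its stored count rather than re-running the analysis for every duplicate read.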
-
67
For each read in each FASTQ, run the “Step 67” cell to extract new spacers and store them according to their characteristics and the characteristics of the CRISPR arrays from which they were extracted. The idea is to execute the function defined above as ‘multiprocess_spr’, which relies on two functions defined in Step 66 as follows:
Uses ‘get_spcrs’ to extract spacers from each read
Uses ‘not_existing’ to check whether an extracted spacer is an old, preexisting spacer in the array (to qualify, the spacer has to be ≥83% similar to an old spacer) or a repeat. If a spacer meets neither criterion, it is instead considered a new spacer that was acquired over the course of the experiment.
To speed things up, this analysis uses multiprocessing to offload tasks to worker processes, and enables the analysis of multiple FASTQs in parallel. The number of processes run will be ‘cpu_count - 1’, where ‘cpu_count’ is the number of CPUs in the system (i.e., on your laptop or cluster).
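The ‘cpu_count - 1’ pattern described in Step 67 can be sketched with Python's standard ‘multiprocessing’ module. The names `worker_count` and `analyse_in_parallel` are hypothetical, and `analyse_file` stands in for a per-FASTQ function such as ‘multiprocess_spr’; this is an illustration of the approach rather than the notebook's code.

```python
import multiprocessing as mp


def worker_count():
    """Leave one CPU free, but always use at least one worker."""
    return max(1, mp.cpu_count() - 1)


def analyse_in_parallel(fastq_paths, analyse_file):
    """Apply `analyse_file` to each FASTQ path in a pool of worker processes.

    `analyse_file` must be defined at module level so that it can be
    sent to the workers.
    """
    with mp.Pool(processes=worker_count()) as pool:
        results = pool.map(analyse_file, fastq_paths)
    return dict(zip(fastq_paths, results))


if __name__ == "__main__":
    print(f"Would run with {worker_count()} worker process(es).")
```

When run as a script, the call to `analyse_in_parallel` should sit under the `if __name__ == "__main__":` guard, which is required on platforms that start worker processes by re-importing the module.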
-
68
Store the data collected (information about the FASTQs, their reads, and spacers) in a dictionary by running the “Step 68” cell. If users are using the example FASTQ files provided in Step 63(a), this dictionary, ‘dict_data’, will contain a lot of useful information, most of which will not be used to re-create Fig. 4L from ref. 21, but can be explored by users.
-
69
Determine the order of spacers in each sequenced CRISPR array by running the “Step 69” cell. This cell works by running the following functions:
‘get_spcrs_11BC’ to extract potential retron-derived sequences from spacers in each read, similarly to the ‘get_spcrs’ function. However, this function defines a retron-derived spacer as one that contains an 11-base region of the hypothetical prespacer, consisting of the 7-base barcode region and 2 bases on either side (with one mismatch or indel allowed). For instance, an “A” retron-derived spacer would have an 11-bp core region consisting of the following sequence: “GTTGCAGCAAC”. Similarly, a “B” retron-derived spacer would have an 11-bp core region consisting of the following sequence: “GTCAGACTGAC”.
‘matchesTarget’ to determine if the potential retron-derived spacer sequences are “A”, “B”, or “N” spacers, as specified in the ‘Target_dict.’
-
‘double_order’ and ‘triple_order’, which iterate through every FASTQ, generating a dictionary of the counts of every possible permutation of “ABN” spacers, both for double expansions and triple expansions. For instance, in the case of double expansions, the possibilities are:
A, A
A, B
A, N
B, B
B, A
B, N
N, N
N, A
N, B
These counts are stored in the dictionaries ‘double_dict’ and ‘triple_dict’. Note that the function called is ‘get_spcrs_11BC’, because it involves a more ‘relaxed’ search for retron-derived spacers, as mentioned above.
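The coding and counting logic of Step 69 can be illustrated with a short sketch. It uses the two 11-base core sequences given above, but it is a simplification: only substitutions are tolerated (the notebook also allows an indel), only one strand is searched, and the function names are hypothetical rather than those of the notebook.

```python
from collections import Counter

# 11-base core regions of the "A" and "B" retron-derived spacers (Step 69).
CORES = {"A": "GTTGCAGCAAC", "B": "GTCAGACTGAC"}


def within_one_mismatch(target, seq):
    """True if two equal-length sequences differ at no more than one position."""
    return len(target) == len(seq) and sum(x != y for x, y in zip(target, seq)) <= 1


def code_spacer(spacer):
    """Return 'A' or 'B' if the spacer contains that core (<=1 substitution), else 'N'."""
    for label, core in CORES.items():
        width = len(core)
        for start in range(len(spacer) - width + 1):
            if within_one_mismatch(core, spacer[start:start + width]):
                return label
    return "N"


def count_double_orders(doubles):
    """Tally coded (first, second) spacer pairs from double expansions."""
    return Counter((code_spacer(first), code_spacer(second)) for first, second in doubles)


if __name__ == "__main__":
    # Flanking bases are arbitrary and for illustration only.
    spacer_a = "ACGT" + CORES["A"] + "TTTT"
    spacer_b = "ACGT" + CORES["B"] + "TTTT"
    spacer_n = "T" * 20
    pairs = [(spacer_a, spacer_b), (spacer_a, spacer_b), (spacer_b, spacer_n)]
    print(count_double_orders(pairs))
```

The resulting counter plays the role of one entry in ‘double_dict’: a mapping from each ordered pair of codes to the number of arrays in which it was observed.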
-
70
Run the “Step 70” cell to generate a dataframe with the data collected in Step 69. Specifically, the code in this cell generates a dataframe ‘ordering_df’ by merging the dictionaries of double and triple spacer expansion ordering counts created in Step 69. Then, the code merges the ‘ordering_df’ dataframe with the metadata dataframe generated in Step 63. This cell also adds two columns to this new dataframe: ‘Order’, the experimental order of the signals (A → B or B → A), and ‘PCR’, which will allow us to average scores within biological replicates.
-
71
Run the “Step 71” cell to sum the number of informative arrays (i.e., (A, N), (A, B) ...) for each biological replicate, which is stored in the ‘summed_counts’ dataframe.
-
72
Calculate the “Ordering Scores” by running the “Step 72” cell. The A/N score is calculated by subtracting the total number of (A, N) arrays from the total number of (N, A) arrays, then dividing that value by the sum of the total number (A, N) and (N, A) arrays. The B/N score is calculated by subtracting the total number of (N, B) arrays from the total number of (B, N) arrays, then dividing that value by the sum of the total number of (N, B) and (B, N) arrays. The A/B score is calculated by subtracting the total number of (A, B) arrays from the total number of (B, A) arrays, then dividing that value by the sum of the total number of (A, B) and (B, A) arrays. As discussed in the Experimental Design section and Box 3, these logical rules should govern the ordering of spacers in the CRISPR arrays and assist with inferring the order of transcription of tagged genes (in this case, of distinct ncRNAs).
CRITICAL STEP Because spacers are acquired unidirectionally, with newer spacers closer to the leader sequence, we propose that, if transcript “A” is expressed before transcript “B”, A → B → Leader arrays should be more numerous than B → A → Leader arrays. Conversely, if “B” is expressed before “A”, the number of B → A → Leader arrays should be greater than the number of A → B → Leader arrays. For a more extensive discussion of the scores, refer to “Analysis” section.
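The three ordering scores of Step 72 share one form, (x − y) / (x + y), and differ only in which pair of counts is used. The sketch below writes out the formulas exactly as worded in Step 72. The function names, the dictionary layout, and the choice to return NaN when no informative arrays are present are illustrative assumptions.

```python
import math


def ordering_score(minuend, subtrahend):
    """Return (minuend - subtrahend) / (minuend + subtrahend), or NaN if both are zero."""
    total = minuend + subtrahend
    if total == 0:
        return math.nan
    return (minuend - subtrahend) / total


def ordering_scores(counts):
    """Compute the A/N, B/N and A/B scores from double-expansion counts.

    `counts` maps ordered pairs such as ("A", "N") to numbers of arrays.
    """
    def get(pair):
        return counts.get(pair, 0)

    return {
        "A/N": ordering_score(get(("N", "A")), get(("A", "N"))),
        "B/N": ordering_score(get(("B", "N")), get(("N", "B"))),
        "A/B": ordering_score(get(("B", "A")), get(("A", "B"))),
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    example = {
        ("N", "A"): 30, ("A", "N"): 10,
        ("B", "N"): 15, ("N", "B"): 5,
        ("B", "A"): 8, ("A", "B"): 2,
    }
    print(ordering_scores(example))  # {'A/N': 0.5, 'B/N': 0.5, 'A/B': 0.6}
```

Each score is bounded between −1 and 1, and a score of 0 means the two orientations were observed equally often.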
-
73
To visualize the data, the ordering scores for a given experiment can be plotted as a strip or swarm plot. We also recommend users add a horizontal line at 0 to separate scores corresponding to “A happened before B” (positive ordering score values) from scores corresponding to “B happened before A” (negative ordering score values). An example of such a plot is provided in Anticipated Results (see Fig. 4) and how to generate such a plot from the calculated ordering scores is shown in the two cells that follow the “Step 73” heading. Users first generate a smaller dataframe, ‘summarized_df’, that contains information about the filename, the order, the biological replicate, the PCR, the score, and type of score (i.e. A/N, B/N, or A/B). Afterwards, the ‘seaborn’ package is used to generate two overlaid plots:
A swarmplot, showing the mean value of each score per biological replicate;
A violinplot, to give a sense of the distribution of the scores.
CRITICAL: This is the end of the pipeline to calculate ordering scores for two transcriptional events and create plots similar to Fig. 4L of ref. 21 (see also Fig. 4). To infer what the ordering scores suggest about the underlying order and transcriptional profile of two different transcriptional events, users can also run a series of simulations, such as those used to generate the plots for section “Anticipated Results” (Fig. 3b–e), by following Steps 74–75:
-
74
Define the functions that will be used for the simulation by running the “Step 74” cell. These functions work as follows:
‘gen_arrays(p_A_on, p_A_off, p_B_on, p_B_off, p_N, n_arrays, p, f)’: takes as input the A-derived “on” and “off” rates, the B-derived “on” and “off” rates, the N spacer rate, the number of arrays to expand in silico, the PCR number, and the correction factor. The rates are in numbers of expansions per epoch (~24 h); the PCR number and the “biological replicate” number (not a parameter of the function) allow users to generate simulations that correspond to experimental conditions. A correction factor (1 in our implementation) acts as a scaling factor that increases the promoter “on” and “off” rates by the same factor as it decreases the number of simulated arrays, allowing for faster iteration times.
To simulate array expansion, N unexpanded arrays are generated, where N is user-specified. Given user-specified “on” and “off” rates for promoters A and B, as well as the (assumed constant) rate of N spacer acquisitions, each array samples three different Poisson distributions (one each for signals A, B and N) to determine the number of spacers of each type that are added to its array during the epoch. The order of these spacers is then randomized and appended to the array.
‘run_sim(double_options, order, bioreps, pcrs, count_dict, double_dict, p_A_on, p_A_off, p_B_on, p_B_off, p_N, n_arrays, corr_factor)’: a wrapper function that runs the simulations by calling ‘gen_arrays’ on a series of global variables. As such, it takes as input the same parameters as ‘gen_arrays’, as well as a couple of other global variables: dictionaries to store the collected data (‘count_dict’ and ‘double_dict’), a list of double expansion ordering possibilities (‘double_options’), and the user-specified transcriptional program (‘order’). For instance, ‘order’ can be “AB”, which specifies a transcriptional program in which promoter A is on during epoch 1, then promoter B is on during epoch 2.
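The simulation logic described in Step 74 (Poisson-distributed numbers of A, B and N spacers per epoch, shuffled within the epoch and appended to the array) can be sketched with the standard library alone. This is a simplified illustration, not the notebook's ‘gen_arrays’: promoters A and B share the same “on” and “off” rates, there is no correction factor, and all names are hypothetical. Pairs are reported with the newer (leader-proximal) spacer first, which is an assumption about the tuple convention.

```python
import math
import random
from collections import Counter


def poisson(rng, lam):
    """Draw one Poisson(lam) sample with Knuth's algorithm (fine for small rates)."""
    if lam <= 0:
        return 0
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_arrays(order, on_rate, off_rate, n_rate, n_arrays, seed=0):
    """Simulate array expansion; each character of `order` is one epoch
    in which that promoter is on (e.g. "AB": A on, then B on)."""
    rng = random.Random(seed)
    arrays = []
    for _ in range(n_arrays):
        array = []  # oldest spacer first
        for active in order:
            new = []
            for label in ("A", "B"):
                rate = on_rate if label == active else off_rate
                new += [label] * poisson(rng, rate)
            new += ["N"] * poisson(rng, n_rate)
            rng.shuffle(new)  # order within an epoch is random
            array += new
        arrays.append(array)
    return arrays


def double_counts(arrays):
    """Tally (newer, older) pairs for arrays with exactly two new spacers."""
    counts = Counter()
    for array in arrays:
        if len(array) == 2:
            older, newer = array
            counts[(newer, older)] += 1
    return counts


if __name__ == "__main__":
    sim = simulate_arrays("AB", on_rate=0.3, off_rate=0.01, n_rate=0.05, n_arrays=20000)
    print(double_counts(sim))
```

The rates in the usage example are arbitrary; in practice they would be varied and the resulting counts compared with the experimental ordering scores, as described in Step 75.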
-
75
Run simulations by running the cells below the “Step 75” heading after inserting the relevant parameters as defined by the user’s specific experiment. Since users will likely not know what the on and off rates and leakiness for their promoters are, they can iteratively guess and check different rates and compare the simulation results to their own plots made in Step 73 to make inferences about what the transcriptional profiles, including rates, of their promoters are. The provided cells work by running the ‘gen_arrays’ and ‘run_sim’ functions defined in Step 74. To demonstrate how this simulation can be done, we’ve provided code for how to run four different transcriptional program scenarios, described below, in the Jupyter notebook (see Fig. 3b–e), which can also be easily adapted by users for their own specific experiments and inputs. The Jupyter notebook also provides detailed explanations to the rationales for why certain parameters were chosen for each of the four simulations.
Scenario 1: similar transcriptional recording experiment as in Figure 4L (pSBK134 A → B // B → A)
-
Order of events: A turned on during epoch 1; B turned on during epoch 2
Scenario 2: A; A+B
-
Order of events: A turned on during epoch 1; A and B turned on during epoch 2
Scenario 3: none; A+B
-
Order of events: neither is turned on during epoch 1; A and B turned on during epoch 2
Scenario 4: A+B; A+B
Order of events: A and B turned on during epoch 1; A and B turned on during epoch 2
-
Fig 4.
Illustrative ordering analysis of a recording experiment. Ordering scores for 48-hour transcriptional recording using sequential 24-hour expression of anhydrotetracycline (“A”) and choline chloride (“B”)-induced promoters. When A occurs before B, the calculation of ordering scores results in positive values, suggesting that the analysis pipeline appropriately identified the order of expression where A occurs before B. Likewise, when B occurs before A, the calculation of ordering scores results in negative values, suggesting again that the analysis pipeline appropriately identified the opposite order of expression where B occurs before A. This figure is a re-analysis, using our updated computational pipeline, of the same raw sequencing reads used to create Fig. 4L of ref. 21. Open circles correspond to N=6 biological replicates.
Troubleshooting
Troubleshooting advice can be found in Table 2.
Table 2.
Troubleshooting steps
Step | Problem | Possible reason | Solution |
---|---|---|---|
15 | Low time constants following electroporation; arcing during electroporation | Solution is too conductive, potentially due to too much salt or DNA. | Ensure solution is salt-free by performing extra washes and removing all supernatant during wash steps; decrease the concentration of DNA added to the cuvette |
22, 23 | Bacteria not growing; low culture density or OD | Inducible compounds or signal plasmid inhibits growth | Optimize dilution amount while passaging or length of transcriptional recording to allow bacteria to near stationary phase before passaging or collecting for harvest |
29, 39 | Trace shows no clear log-based amplification; slope is very shallow throughout the entire qPCR | Too much template | Dilute template between 10–1000X and redo qPCR to determine template amount that results in a normal qPCR trace |
29, 39 | Trace shows humps before log-based amplification | qPCR traces will not always look flat at the beginning | Wait; if traces eventually show normal log-based amplification, no modifications or changes are needed |
29, 39 | Template only amplifies >40 cycles | Indexing primers degraded | Purchase or use new indexing primers; make a new working indexing primer plate from the stock |
33 | No PCR bands | PCR requires more cycles | Redo qPCR in Steps 24–29 to determine cycle number |
33 | No PCR bands | PCR reaction may have been inhibited due to too much template or salt content | Decrease the amount of template added; perform a genomic DNA extraction to get rid of excess salt |
49, Supp. Method Step 24 | Bands do not run straight or are slightly curved | Too much ethanol | Increase the amount of time beads are allowed to dry in Step 46 or Supp. Method Step 20. |
49, Supp. Method Step 24 | No bands; beads did not bind to DNA | Nucleic acid buffer was not made correctly; not enough beads | Ensure all reagents are fresh and pH is correct; always make incomplete binding buffer in Supp. Method Step 8 right before use and ensure proportions of reagents are correct; resuspend or mix beads well in Supp. Method Step 1; check that there is no aspiration of beads during washes |
58–62 | Issues with installing packages/loading dependencies | Conflicting Python dependencies/conflicts with Python package managers | Most of the dependencies can be installed through pip, the package installer for Python. For instance, Jupyter-lab, the web-based interactive development environment for Python notebooks used to run our notebook, can be installed through pip. However, we strongly recommend users, especially those less familiar with wrangling Python environments and package installation, to use the Anaconda Distribution (https://www.anaconda.com/distribution) - it includes Python and many other commonly used packages for scientific computing and data science. Anaconda also enables easy installation and handling of dependencies. In particular, we recommend users start with Miniconda (https://docs.conda.io/en/latest/miniconda.html), a minimal installer for conda. It is a small version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages. However, it allows users to use the conda install command to install effectively any package from the Anaconda repository and its various channels, including the Bioconda channel. |
Timing
Temporal recording procedure: 6 d, 2 h 10 min hands-on
Transformation of plasmids into expression strain (steps 1–18): 4 d, 1.5 h hands-on.
Recording transcriptional activity for 48 hours (steps 19–23): 2 d, 40 min hands-on.
Preparation of CRISPR arrays for deep sequencing and deep sequencing: 13.5 h, 9.5 h hands-on, including optional step
Determine cycle number for first round PCR amplification (steps 24–29): 2 h, 2h hands-on. Note: should only have to be performed once for a given experimental paradigm
Optional step: Preparation and cleaning of Sera-Mag beads for DNA clean-up (Supplementary Methods): 2 h, 1.5 h hands-on.
Amplifying and indexing samples (steps 30–49): 6 h, 4 h hands-on.
Deep sequencing of CRISPR array (steps 50–57): 3.5 h, 2 h hands-on.
Data analysis: 1–3 h (depending on number of CPU cores available), 30 min hands-on
Installing dependencies (steps 58–62): 10 min hands-on.
Loading experiment sheet; downloading FASTQs from SRA and trimming them (steps 63–64): 30 min, 5 min hands-on.
Extracting new spacers and storing them according to their characteristics and the characteristics of the CRISPR arrays from which they were extracted; storing the data in a data-frame (steps 65–68): 1–3h, 2 min hands-on.
Determining the order of spacers in each sequenced CRISPR array; storing the data in a data-frame (steps 69–70): 5 min, 1 min hands-on.
Calculating the ordering scores; storing the data in a data-frame (steps 71–72): 1 min, 1 min hands-on.
Plotting the data (step 73): 1 min, 1 min hands-on.
Run simulations for different transcriptional programs (steps 74–75): 1h, 1 min hands-on.
Anticipated results
Following this protocol, we expect users to be able to generate plots depicting the ordering scores from different transcriptional programs. As an example, we provide the outcome plot using our updated computational pipeline on raw sequencing reads previously obtained21 from a 48-hour transcriptional recording experiment, where an anhydrotetracycline-induced promoter (“A”) was turned on for 24 hours then a choline chloride-induced promoter (“B”) was turned on another 24 hours, or vice versa (Fig. 4).
Although Fig. 4 depicts a simple transcriptional program, the three ordering scores we describe also enable the representation of more complex transcriptional programs. The only requirement is that each program must be separated into two distinct epochs where the acquisition rate of a given transcriptional signal in each epoch is assumed to be constant. A key to the meaning of different ordering score plots is provided (Fig. 3a), followed by four hypothetical transcriptional programs (Fig. 3b–e). For each transcriptional program, we simulated the expected ordering score results from six replicates using 2.1 million reads per replicate to give users an intuition of the types of results they can expect for different transcriptional programs. Although we chose programs where transcriptional signals A and B each have the same acquisition rate, the exact magnitude of ordering scores can also reflect differences in the strengths, and therefore the acquisition rates, of different signals.
We anticipate that users could perform an experiment akin to our 48 h recordings, run the ordering score analysis, and plot the resulting scores. By comparing the distribution of their scores with the key in Fig. 3a and with the different possibilities illustrated in Fig. 3b–e, together with the interactive simulations provided in the accompanying notebook, we believe that users will be able to make inferences regarding the underlying transcriptional programs that took place during the recording experiment.
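To build intuition for the two-epoch assumption described above, the following is a minimal toy sketch in Python. It is not the protocol's analysis pipeline or its notebook simulation: the function names, the acquisition probabilities, and the simplified score (a normalized difference between the two possible orders in arrays carrying exactly one new A spacer and one new B spacer) are all illustrative assumptions. For simplicity, each simulated array lists new spacers oldest first, whereas in a real CRISPR array the newest spacer sits closest to the leader.

```python
import random


def simulate_arrays(epoch_rates, n_cells=20_000, seed=0):
    """Simulate new spacers acquired per cell across consecutive epochs.

    epoch_rates: list of dicts, one per epoch, mapping a signal label to a
    hypothetical per-epoch acquisition probability (constant within an epoch).
    Returns one list per cell with the new spacers ordered oldest first.
    """
    rng = random.Random(seed)
    arrays = []
    for _ in range(n_cells):
        array = []
        for rates in epoch_rates:
            # Each signal is acquired at most once per epoch in this toy model.
            acquired = [label for label, p in rates.items() if rng.random() < p]
            # Order within a single epoch carries no timing information.
            rng.shuffle(acquired)
            array.extend(acquired)
        arrays.append(array)
    return arrays


def ordering_score(arrays, first, second):
    """Simplified, illustrative ordering score in the range [-1, 1].

    Counts arrays with exactly two new spacers, one of each label, and returns
    (forward - backward) / (forward + backward), where 'forward' means `first`
    was acquired before `second`. Returns None if no array is informative.
    """
    forward = sum(1 for a in arrays if a == [first, second])
    backward = sum(1 for a in arrays if a == [second, first])
    total = forward + backward
    if total == 0:
        return None
    return (forward - backward) / total


# Hypothetical program: signal A dominant in epoch 1, signal B in epoch 2.
a_then_b = [{"A": 0.1, "B": 0.005}, {"A": 0.005, "B": 0.1}]
arrays = simulate_arrays(a_then_b, seed=1)
print("A-before-B score:", ordering_score(arrays, "A", "B"))
```

In this sketch, swapping the two epoch dictionaries reverses the sign of the score, and giving A and B equal probabilities in both epochs drives it toward zero, which mirrors the qualitative logic of the key in Fig. 3a without reproducing the protocol's exact score definitions.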
Supplementary Material
Supplementary Method: Preparation and cleaning of Sera-Mag beads for DNA clean-up
Supplementary Figure 1: Representative example of gel used to test bead-to-DNA ratios.
Supplementary Table 1: Indexing primer sequences for multiplexing samples during library preparation
Supplementary Table 2: KAPA Library Quantification Data Analysis Template as described for the KAPA Library Quantification Kit. Worksheet provides a Readme page, an analysis page for data input, and a summary page for data output.
Acknowledgements
This work was supported by funding from the National Science Foundation (2137692), the NIH/NIGMS (1DP2GM140917–01), and the Pew Biomedical Scholars Program. S.L.S. is a Chan Zuckerberg Biohub investigator and acknowledges additional funding support from the L.K. Whittier Foundation. S.K.L. was supported by an NSF Graduate Research Fellowship (2034836). S.C.L. was supported by a Berkeley Fellowship for Graduate Study.
Footnotes
Competing interests
S.L.S. is a named inventor on a patent application assigned to Harvard College, “Method of recording multiplexed biological information into a CRISPR array using a retron” (US20200115706A1). All other authors have no competing interests.
Code Availability
The latest version of the analysis code can be accessed through our lab GitHub (https://github.com/Shipman-Lab/Spacer-Seq_Nat-Protocols). The release at time of manuscript publishing is available at DOI: https://doi.org/10.5281/zenodo.7549326.
Data Availability
Sequencing data associated with this study are available in the NCBI Sequence Read Archive (PRJNA838025).
References
- 1. Siuti P, Yazbek J & Lu TK Synthetic circuits integrating logic and memory in living cells. Nat. Biotechnol. 31, 448–452 (2013).
- 2. Bonnet J, Yin P, Ortiz ME, Subsoontorn P & Endy D Amplifying Genetic Logic Gates. Science 340, 599–603 (2013).
- 3. Yang L et al. Permanent genetic memory with >1-byte capacity. Nat. Methods 11, 1261–1266 (2014).
- 4. Farzadfard F & Lu TK Genomically encoded analog memory with precise in vivo DNA writing in living cell populations. Science 346, 1256272 (2014).
- 5. Courbet A, Endy D, Renard E, Molina F & Bonnet J Detection of pathological biomarkers in human clinical samples via amplifying genetic switches and logic gates. Sci. Transl. Med. 7, 289ra83 (2015).
- 6. Roquet N, Soleimany AP, Ferris AC, Aaronson S & Lu TK Synthetic recombinase-based state machines in living cells. Science 353, aad8559 (2016).
- 7. Hsiao V, Hori Y, Rothermund PW & Murray MM A population-based temporal logic gate for timing and recording chemical events. Mol. Syst. Biol. 12, 869 (2016).
- 8. Weinberg BH et al. Large-scale design of robust genetic circuits with multiple inputs and outputs for mammalian cells. Nat. Biotechnol. 35, 453–462 (2017).
- 9. Perli SD, Cui CH & Lu TK Continuous genetic recording with self-targeting CRISPR-Cas in human cells. Science 353, aag0511 (2016).
- 10. Tang W & Liu DR Rewritable multi-event analog recording in bacterial and mammalian cells. Science 360, eaap8992 (2018).
- 11. Kempton HR, Love KS, Guo LY & Qi LS Scalable biological signal recording in mammalian cells using Cas12a base editors. Nat. Chem. Biol. 1–9 (2022) doi: 10.1038/s41589-022-01034-2.
- 12. Chen W et al. Multiplex genomic recording of enhancer and signal transduction activity in mammalian cells. Preprint at bioRxiv 10.1101/2021.11.05.467434 (2021).
- 13. Loveless TB et al. Molecular recording of sequential cellular events into DNA. Preprint at bioRxiv 10.1101/2021.11.05.467507 (2021).
- 14. Choi J et al. A time-resolved, multi-symbol molecular recorder via sequential genome editing. Nature 608, 98–107 (2022).
- 15. Shipman SL, Nivala J, Macklis JD & Church GM Molecular recordings by directed CRISPR spacer acquisition. Science 353, aaf1175 (2016).
- 16. Sheth RU, Yim SS, Wu FL & Wang HH Multiplex recording of cellular events over time on CRISPR biological tape. Science 358, 1457–1461 (2017).
- 17. Schmidt F, Cherepkova MY & Platt RJ Transcriptional recording by CRISPR spacer acquisition from RNA. Nature 562, 380–385 (2018).
- 18. Yim SS et al. Robust direct digital-to-biological data storage in living cells. Nat. Chem. Biol. 17, 246–253 (2021).
- 19. Sheth RU & Wang HH DNA-based memory devices for recording cellular events. Nat. Rev. Genet. 19, 718–732 (2018).
- 20. Lear SK & Shipman SL Molecular recording: transcriptional data collection into the genome. Curr. Opin. Biotechnol. 79, 102855 (2023).
- 21. Bhattarai-Kline S et al. Recording gene expression order in DNA by CRISPR addition of retron barcodes. Nature (2022) doi: 10.1038/s41586-022-04994-6.
- 22. Yosef I, Goren MG & Qimron U Proteins and DNA elements essential for the CRISPR adaptation process in Escherichia coli. Nucleic Acids Res. 40, 5569–5576 (2012).
- 23. Nuñez JK et al. Cas1–Cas2 complex formation mediates spacer acquisition during CRISPR–Cas adaptive immunity. Nat. Struct. Mol. Biol. 21, 528–534 (2014).
- 24. Shipman SL, Nivala J, Macklis JD & Church GM CRISPR–Cas encoding of a digital movie into the genomes of a population of living bacteria. Nature 547, 345–349 (2017).
- 25. Silas S et al. Direct CRISPR spacer acquisition from RNA by a natural reverse transcriptase–Cas1 fusion protein. Science 351, aad4234 (2016).
- 26. Tanna T, Schmidt F, Cherepkova MY, Okoniewski M & Platt RJ Recording transcriptional histories using Record-seq. Nat. Protoc. 1–27 (2020) doi: 10.1038/s41596-019-0253-4.
- 27. Schmidt F et al. Noninvasive assessment of gut function using transcriptional recording sentinel cells. Science 376, eabm6038 (2022).
- 28. Yehl K & Lu T Scaling computation and memory in living cells. Curr. Opin. Biomed. Eng. 4, 143–151 (2017).
- 29. Nuñez JK, Bai L, Harrington LB, Hinder TL & Doudna JA CRISPR Immunological Memory Requires a Host Factor for Specificity. Mol. Cell 62, 824–833 (2016).
- 30. Yoganand KNR, Sivathanu R, Nimkar S & Anand B Asymmetric positioning of Cas1–2 complex and Integration Host Factor induced DNA bending guide the unidirectional homing of protospacer in CRISPR-Cas type I-E system. Nucleic Acids Res. 45, 367–381 (2017).
- 31. Sharon E et al. Functional Genetic Variants Revealed by Massively Parallel Precise Genome Editing. Cell 175, 544–557.e16 (2018).
- 32. Kong X et al. Precise genome editing without exogenous donor DNA via retron editing system in human cells. Protein Cell 12, 899–902 (2021).
- 33. Lopez SC, Crawford KD, Lear SK, Bhattarai-Kline S & Shipman SL Precise genome editing across kingdoms of life using retron-derived DNA. Nat. Chem. Biol. 18, 199–206 (2022).
- 34. Zhao B, Chen S-AA, Lee J & Fraser HB Bacterial Retrons Enable Precise Gene Editing in Human Cells. CRISPR J. 5, 31–39 (2022).
- 35. Palka C, Fishman CB, Bhattarai-Kline S, Myers SA & Shipman SL Retron reverse transcriptase termination and phage defense are dependent on host RNase H1. Nucleic Acids Res. 50, 3490–3504 (2022).
- 36. Munck C, Sheth RU, Freedberg DE & Wang HH Recording mobile DNA in the gut microbiota using an Escherichia coli CRISPR-Cas spacer acquisition platform. Nat. Commun. 11, 95 (2020).
- 37. Lee PY, Costumbrado J, Hsu CY & Kim YH Agarose Gel Electrophoresis for the Separation of DNA Fragments. J. Vis. Exp. (62), e3923, doi: 10.3791/3923 (2012).
- 38. Kluyver T et al. Jupyter Notebooks – a publishing format for reproducible computational workflows. In: Loizides F & Schmidt B (eds.) Positioning and Power in Academic Publishing: Players, Agents and Agendas, 87–90 (IOS Press, 2016).
- 39. Joshi NA & Fass JN Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files (Version 1.33). Available at: https://github.com/najoshi/sickle (2011).
- 40. JoVE Science Education Database. Microbiology. Serial Dilutions and Plating: Microbial Enumeration. JoVE, Cambridge, MA (2022).