Abstract
Background:
Gibson assembly and assembly-in-yeast are strategies to create long synthetic DNAs from diverse fragments, for example, when engineering bacteriophage genomes. Design for these methods requires terminal sequence overlaps in the fragments, determining the order of assembly. Design to rebuild a genomic fragment that is too long for a single PCR presents a puzzle since some candidate joint regions cannot yield satisfactory primers for the overlap. No existing overlap assembly design software is open-source, and none explicitly supports rebuilding.
Methods:
We describe here bigDNA software that solves the rebuilding puzzle by recursive backtracking, with options to remove or introduce genes; it also tests for mispriming on the template DNA. BigDNA was tested with 3082 prophages and other genomic islands (GIs), from 20 to 100 kb, and the synthetic Mycoplasma genitalium genome.
Results:
Rebuilding assembly design succeeded for all but ∼1% of GIs.
Conclusion:
BigDNA will speed and standardize assembly design.
Keywords: DNA assembly, primer design, bacteriophage, synthetic biology
Introduction
A common approach to preparing large engineered DNAs, longer than a single long polymerase chain reaction (PCR), is to assemble them from a set of overlapping DNA fragments, perhaps with circularization. One such method is in vivo assembly in yeast.1–3 Another popular method that can build DNAs as long as 900 kb is the isothermal Gibson assembly.4
These methods avoid restriction digestion and ligation; assembly order is specified by effective recombination between homologous overlaps at the ends of the fragments. Early work assembled cloned DNA fragments of arbitrary size. More recent assembly projects have avoided cloning and used only PCR products to rebuild moderately long genomic regions (∼40–70 kb) that exceed the size limit (∼30 kb) of a single long PCR.5–9 Overlap assembly can yield novel sequences at splicing joints or reproduce sequences that already exist in the source genomes at rebuilding joints (Fig. 1A).
FIG. 1.
Qualitatively different assembly joint types, one solved by recursive backtracking. (A) Splicing versus rebuilding. Segments from different sources can be effectively spliced together, generating overlaps by cross-elongating the PCR primers. A final circularization, if desired, would also be a splicing joint. Alternatively, a single-source segment too long to amplify can be rebuilt from a series of PCR products prepared with appropriate simple overlaps. (B) The rebuilding puzzle. In our algorithm, large windows are allowed for PCR right ends (yellow boxes). However, the left end windows (orange) are small, set by the desired range of fragment overlap length, and can prevent simple serial design of a rebuild. Backtracking to new windows, set by different alternative right ends for the preceding PCR, increases completion. For PCR1 (recursion 1), the alternatives (1a, 1b) are tested sequentially for PCR2. Testing of 1a (Recursion 2) starts by setting the appropriate left end window; if 1a yields no PCR2 alternatives, the process backtracks to PCR1, and tries again (Recursion 3) with 1b. If 1b succeeds, recursion continues to PCR3 and so forth. PCR, polymerase chain reaction.
However, rebuilding presents a puzzle in that not all potential joint overlap sequences yield suitable primer sequences. Fortunately, alternative overlap sequences can often be found nearby to allow successful joining.
Tools aimed at Gibson assembly primer design are the online-only resources of New England Biolabs and Codex and the proprietary software SnapGene. These tools tackle the splicing problem encountered in plasmid generation or insertion/deletion engineering, but not the more challenging rebuilding problem. None are open-source, which may be acceptable for one-off splicing applications; however, based on our own work in engineering bacteriophages for therapy,6 there can be requirements to quickly design large numbers of assemblies, a use case better served by open-source software. Nor do the above tools cross-check primers against the entire source DNA, which may be as complex as a whole genomic DNA preparation, to avoid primers that may cause mispriming events.
Such a check against potential mispriming is provided by the single-PCR primer design tool Primer-BLAST,10 which tests each suggested primer pair by BLASTN against the full genome sequence. Primer-BLAST is not specifically aimed at assembly design, but we have employed it semi-manually for this purpose. With the appreciable failure rate (measured here as 12%) for rebuilding joints, this semi-manual sequential approach to rebuilding design was time-consuming and often required backtracking to earlier primer pairs in a growing assembly. That is, we were manually applying an algorithm of recursive backtracking for rebuilding design.
We have prepared new open-source software, bigDNA, that automatically designs primers for overlapping PCR products, in both splicing and rebuilding assembly applications. By explicitly addressing the rebuilding problem, bigDNA goes further than existing tools; rebuilding is accomplished with recursive backtracking, a common approach to speeding puzzle solution (Fig. 1B). The recursing subroutine calls the classic PCR primer design software Primer3 (Ref. 11) before checking each primer pair for potential mispriming on the source DNA using ThermonucleotideBLAST (tntBLAST).12
BLASTN uses a simple substitution matrix to evaluate mismatches, whereas tntBLAST uses PCR-pertinent thermodynamic calculations to identify potential mispriming events by a PCR primer pair. The combination of Primer3 and tntBLAST in the recursive subroutine enables bigDNA to automate the primer design process for large DNAs and return custom primers optimized to prevent off-target amplification.
Materials and Methods
DNA sequences for studying assembly
From our recently prepared list of 6415 genomic islands (GIs) from 2168 prokaryotic genomes,13 we took the subset of 3082 GIs that were between 20 and 100 kb, and contained no blocks of two or more ambiguous bases (Supplementary Data S1). By source, 58 were from Archaea, and 3024 from Bacteria. By type, 1698 were prophages, 295 integrative and conjugative elements, and 1098 of unknown biology. Breaking the set down by number of rebuilding fragments required, for fragment counts of 3, 4, 5, 6, 7, 8, 9, and 10, the numbers of GIs were 666, 811, 892, 360, 144, 81, 89, and 39, respectively.
For each GI, three mock GIs were created by taking a randomly located region of the same size from the same chromosome. Mock GIs were not allowed to overlap each other, known GIs, or rRNA genes, the latter to avoid tntBLAST problems with large, highly repetitive sequences. These GI and mock GI sets were specified in GFF3 format that pointed to their source among a set of 1211 RefSeq genome FASTA files. Abbreviated gene annotation files (covering only the GI regions) were prepared for the 1211 genomes.
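The placement of mock GIs can be illustrated with a short rejection-sampling sketch. This is only one plausible way to meet the stated constraints; the function names, the linear-chromosome simplification, and the retry limit are assumptions for illustration, not a description of the scripts used here.

```python
import random


def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def sample_mock_regions(chrom_len, size, excluded, n=3, rng=None, max_tries=10000):
    """Pick n non-overlapping regions of the given size that avoid 'excluded'.

    excluded: list of (start, end) intervals, e.g. known GIs and rRNA genes.
    Coordinates are 1-based and inclusive; the chromosome is treated as linear
    (a simplifying assumption).
    """
    rng = rng or random.Random()
    taken = list(excluded)      # everything a new region must avoid
    mocks = []
    tries = 0
    while len(mocks) < n:
        if tries == max_tries:
            raise RuntimeError("could not place all mock regions")
        tries += 1
        start = rng.randint(1, chrom_len - size + 1)
        region = (start, start + size - 1)
        if not any(overlaps(region, t) for t in taken):
            mocks.append(region)
            taken.append(region)  # later mocks must also avoid this one
    return mocks


if __name__ == "__main__":
    known = [(100000, 140000), (600000, 605000)]
    print(sample_mock_regions(1_000_000, 40000, known, rng=random.Random(1)))
```

Passing a seeded `random.Random` makes the selection reproducible, which is convenient when the same mock sets must be regenerated for validation.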
To speed testing of exhaustive rebuilding, a subset of 200 of the GIs was taken (Supplementary Data S1), all known to complete even when applying tntBLAST; for fragment counts of 3, 4, and 5, the numbers of these GIs were 85, 65, and 50, respectively. The eight “eighth-genome” segments used to build the synthetic Mycoplasma genitalium genome (CP000925.1) were tested for rebuilding, adding 100-bp tolerance windows to the termini, since acceptable primers could not always be found when forced to the original termini.
Use of third-party software
BigDNA makes essential use of Primer3 (Ref. 11) for finding PCR primers, overriding the defaults for the following Primer3 settings: PRIMER_FIRST_BASE_INDEX = 1, PRIMER_PAIR_MAX_DIFF_TM = 3, PRIMER_NUM_RETURN = 20, PRIMER_TASK = pick_pcr_primers (see Primer3 manual). The user can override defaults or add other relevant settings through the bigDNA configuration file.
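As an illustration of how the four overridden settings could be passed to Primer3, the sketch below builds an input record in Primer3's Boulder-IO format (TAG=VALUE lines closed by a lone "="). How bigDNA itself invokes Primer3 is not described here, so the helper name `boulder_record` and the override mechanism are assumptions.

```python
# Primer3 settings that bigDNA overrides, as listed in the text.
BIGDNA_PRIMER3_SETTINGS = {
    "PRIMER_FIRST_BASE_INDEX": 1,
    "PRIMER_PAIR_MAX_DIFF_TM": 3,
    "PRIMER_NUM_RETURN": 20,
    "PRIMER_TASK": "pick_pcr_primers",
}


def boulder_record(seq_id, template, overrides=None):
    """Return one Boulder-IO record for a template sequence (hypothetical helper).

    overrides: optional dict of extra or replacement Primer3 tags, standing in
    for user settings from a configuration file.
    """
    settings = dict(BIGDNA_PRIMER3_SETTINGS)
    settings.update(overrides or {})
    lines = [f"SEQUENCE_ID={seq_id}", f"SEQUENCE_TEMPLATE={template}"]
    lines += [f"{tag}={value}" for tag, value in settings.items()]
    lines.append("=")  # record terminator required by Boulder-IO
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example: limit Primer3 to seven alternative pairs for faster exhaustive runs.
    print(boulder_record("segment1", "ACGTACGTACGT", {"PRIMER_NUM_RETURN": 7}))
```

A record like this can be supplied to `primer3_core` on standard input; the example override mirrors the reduction to seven alternative PCRs used in some of the tests reported in this article.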
BigDNA makes optional use of tntBLAST12 for ruling out PCR primer pairs that may misprime. Four tntBLAST flags (see tntBLAST manual) are exposed as bigDNA configuration tags, and they are set at the following defaults: minimum Tm TNT_MIN_TM = 45, primer-clamp TNT_CLAMP = 4, maximum gaps allowed TNT_MAX_GAP = 6, and maximum mismatch count TNT_MAX_MISMATCH = 6. The maximum PCR size tntBLAST parameter is set at twice the size of the intended PCR product. tntBLAST operation is controlled by the TNT_USE tag, with four acceptable values, “off,” “per-recursion,” “per-solution,” and “post-exhaustive.”
BigDNA package and synopsis
The bigDNA package is available at https://github.com/sandialabs/bigdna. Some simple test systems are provided as well as the whole validation system described next.
BigDNA is controlled by a defaults file, containing the above third-party software tags and these additional tags:
maximum size for any PCR in the assembly, PCR_MAX_SIZE = 10,000;
size of window for a rebuild PCR right end, REBUILD_WINDOW = 3000;
maximum size for a rebuild PCR overlap, OVERLAP_MAX_SIZE = 80;
minimum size for a rebuild PCR overlap, OVERLAP_MIN_SIZE = 40;
choice for rebuild optimization (penalty or uniform), OPTIMIZE = penalty;
“first” or “exhaustive” solution mode, SOLUTION = first;
tntBLAST control (off, per-recursion, per-solution, or post-exhaustive), TNT_USE = per-solution;
a safety feature, FRAGMENT_MAX_NUM = 15.
A required configuration file allows override of the defaults. Moreover, it provides instructions for assembly, specifying the order, orientation, and end-handling of each segment to be spliced together. All segments longer than PCR_MAX_SIZE are rebuilt. Circularization of the assembled DNA (a splicing operation) is default behavior that can be overridden if it is desired to leave the assembly linear. A bigdna.log file summarizes progress and statistics. A primers.txt output file lists sequences of primers to be synthesized for the assembly, along with unique identifiers and genomic coordinates of the upper-case portions of primer sequences.
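The relationship between the defaults and a per-assembly configuration can be sketched as a simple merge with validation. The tag names and default values are those listed above; the dictionary representation, the function names, and the specific checks are assumptions for illustration, since the actual file syntax is documented with the package rather than here.

```python
# Default values for the bigDNA-specific tags listed in the text.
DEFAULTS = {
    "PCR_MAX_SIZE": 10000,
    "REBUILD_WINDOW": 3000,
    "OVERLAP_MAX_SIZE": 80,
    "OVERLAP_MIN_SIZE": 40,
    "OPTIMIZE": "penalty",
    "SOLUTION": "first",
    "TNT_USE": "per-solution",
    "FRAGMENT_MAX_NUM": 15,   # safety feature
}

# Acceptable values for the categorical tags, as stated in the text.
ALLOWED = {
    "OPTIMIZE": {"penalty", "uniform"},
    "SOLUTION": {"first", "exhaustive"},
    "TNT_USE": {"off", "per-recursion", "per-solution", "post-exhaustive"},
}


def effective_settings(overrides):
    """Merge configuration overrides onto the defaults and validate them."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # This sketch only covers the tags above; a real configuration may
        # also carry Primer3 and tntBLAST tags.
        raise KeyError(f"unknown tags: {sorted(unknown)}")
    settings = {**DEFAULTS, **overrides}
    for tag, allowed in ALLOWED.items():
        if settings[tag] not in allowed:
            raise ValueError(f"{tag} must be one of {sorted(allowed)}")
    if settings["OVERLAP_MIN_SIZE"] > settings["OVERLAP_MAX_SIZE"]:
        raise ValueError("OVERLAP_MIN_SIZE exceeds OVERLAP_MAX_SIZE")
    return settings


def needs_rebuild(segment_length, settings):
    """Segments longer than PCR_MAX_SIZE are rebuilt from several PCRs."""
    return segment_length > settings["PCR_MAX_SIZE"]


if __name__ == "__main__":
    cfg = effective_settings({"SOLUTION": "exhaustive", "TNT_USE": "post-exhaustive"})
    print(cfg["SOLUTION"], needs_rebuild(59125, cfg), needs_rebuild(8000, cfg))
```

Validating categorical tags at load time surfaces a mistyped mode before any primer design is attempted.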
Validation
Helper scripts, files, and instructions provided with the package allow the complete reproduction of all data presented here, and they can be adapted for managing other projects. One script prepares a list of jobs in jumbled order (to avoid per-job timing artifacts) and a directory structure populated with configuration files. The job list was run in a batch on 134 CPUs of a reserved node. For the primary set of 99,824 jobs, total memory usage for the node remained in the range of 40–110 Gbytes with all 134 CPUs in use over 62 min, then dropped off as the last jobs completed by 63 min. Summary lines from each job log file were collected, and a final script further summarized these in relevant sets with additional analyses.
Production of assembly fragments for an engineered prophage
TIGER13 software was used to predict prophages in Burkholderia cepacia MSMB648 (GCA_001532955.1). A resulting prophage call, Bce328.59.V (LPKL01000068.1:221593-280717), was selected for genome engineering design, namely deletion of the integrase gene at one terminus of the prophage, applying bigDNA to design both a four- and five-fragment assembly. Long PCR was performed using primers listed in Supplementary Table S1 with 1x Platinum SuperFi PCR master mix (Invitrogen), 0.5 μM forward and reverse primer, and 100 ng of B. cepacia MSMB648 genomic DNA.
The PCR conditions were as follows: (1) 95°C for 2 min; (2) 35 cycles of 95°C for 10 s, 55°C for 10 s, 68°C for 6 min (five-part), or 7.5 min (four-part); and (3) 68°C for 5 min. A 0.8% agarose gel stained with SYBR Safe (Invitrogen) was used to resolve products.
Results
Rebuilding prophages and other GIs
Our motivating use case was to engineer prophages that lie as GIs within bacterial chromosomes. We and others have done this by a rebuilding Gibson assembly from overlapping long PCRs, prepared using whole bacterial genomic DNA as a template.5,6 We began software development by focusing on the more challenging rebuilding process, studying a set of 3082 GIs13 ranging in size from 20 to 100 kb (Supplementary Data S1), and with wide-ranging G + C content (Supplementary Fig. S1).
Two metrics were used to evaluate each rebuild solution: penalty (the average Primer3 penalty score for all PCRs of the rebuild) and nonuniformity (the average absolute difference between each PCR's actual size and that preferred for uniform size of all PCR products). Favoring lower penalty should improve each individual PCR, whereas favoring uniformity may improve overall PCR production efficiency since large size differences can require multiple long PCR runs.
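The two metrics can be written out as a small calculation. This is a sketch under one stated assumption: the size "preferred for uniform size" is taken to be the mean product size of the rebuild, which may differ from the definition used inside bigDNA. The function name and input layout are hypothetical.

```python
from statistics import mean


def rebuild_metrics(pcrs):
    """Score one rebuild solution.

    pcrs: list of (product_size_bp, primer3_pair_penalty), one entry per PCR.
    Assumption: the preferred uniform size is the mean product size.
    """
    sizes = [size for size, _ in pcrs]
    penalties = [penalty for _, penalty in pcrs]
    preferred = mean(sizes)
    deviations = [abs(size - preferred) for size in sizes]
    return {
        "avg_penalty": mean(penalties),
        "max_penalty": max(penalties),
        "avg_nonuniformity": mean(deviations),
        "max_nonuniformity": max(deviations),
    }


if __name__ == "__main__":
    # Three PCRs of unequal size with made-up penalty scores.
    print(rebuild_metrics([(9000, 1.0), (10000, 2.0), (11000, 3.0)]))
```

Reporting both the average and the maximum of each metric separates a solution that is uniformly mediocre from one that is good overall but contains a single poor PCR.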
BigDNA performs rebuilding by recursion and backtracking (Fig. 1B and Supplementary Fig. S2), speeded by memoization, that is, recording PCR right ends that fail to complete rebuilding, to avoid repeating Primer3 runs. We began by testing a mode that stops recursion when it finds the first solution (as opposed to finding all possible solutions). When configured to optimize based on the Primer3 penalty scores, the testing order of alternative primer pairs for a fragment is determined by this penalty.
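The recursion, backtracking, and memoization of failed right ends can be sketched on a toy model. This is not bigDNA's code: `pick_pairs` stands in for a Primer3 call, `toy_picker` is an invented primer finder driven by sets of blocked positions, and the window arithmetic (1-based inclusive coordinates, the right-end window placed as far downstream as the maximum PCR size allows) is an assumption. The fragment-number safety cap and tntBLAST filtering are omitted.

```python
def design_rebuild(length, pick_pairs, max_size=10000, window=3000,
                   ov_min=40, ov_max=80, tol=200):
    """Return a list of (left, right) PCR coordinates, or None if none exists.

    pick_pairs(left_window, right_window) must return (left, right) pairs
    lying inside the two windows, ordered best first.
    """
    failed_right_ends = set()  # memo: right ends that cannot be completed

    def extend(left_lo, left_hi):
        # Large right-end window, limited by the maximum PCR size.
        right_hi = min(length, left_lo + max_size - 1)
        terminal = right_hi >= length - tol + 1
        right_lo = length - tol + 1 if terminal else right_hi - window + 1
        for left, right in pick_pairs((left_lo, left_hi), (right_lo, right_hi)):
            if terminal:
                return [(left, right)]        # reached the segment end
            if right in failed_right_ends:
                continue                      # known dead end, skip it
            # Small left-end window fixed by the allowed overlap range.
            rest = extend(right - ov_max + 1, right - ov_min + 1)
            if rest is not None:
                return [(left, right)] + rest
            failed_right_ends.add(right)      # backtrack to next alternative
        return None

    return extend(1, tol)


def toy_picker(bad_left, bad_right, n_return=20):
    """Invented stand-in for primer design: positions in the sets are unusable."""
    def pick(left_win, right_win):
        lefts = [p for p in range(left_win[0], left_win[1] + 1)
                 if p not in bad_left]
        rights = [p for p in range(right_win[1], right_win[0] - 1, -1)
                  if p not in bad_right]
        # First usable left end, paired with the longest products first.
        return [(lefts[0], r) for r in rights[:n_return]] if lefts else []
    return pick


if __name__ == "__main__":
    blocked = set(range(9921, 9962))  # no forward primer can start here
    print(design_rebuild(25000, toy_picker(blocked, set())))
```

In this example the first alternative for PCR1 (right end 10000) leaves a left-end window that is entirely blocked, so the design backtracks once to right end 9999 and then completes in three PCRs. With a 40 to 80 bp overlap range, each left-end window spans 41 positions.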
For initial analysis, we simply aimed at rebuilding each entire GI. To minimize challenges at segment termini, and thereby focus on the internal PCRs that are key to rebuilding, we allowed a 200-bp tolerance window at each extreme GI end in which to locate primer 5′ ends (i.e., up to 199 bp were allowed to be deleted at each end). Using an overlap window size of 21 bp, the completion rate of rebuilding for the 3082 GIs with standard thermodynamic parameters (Table 1) was 98.3%, with 1.2% failing at terminal PCRs (despite allowing 200-bp terminal tolerance windows) and 0.45% failing because of an inability to find an internal PCR after multiple attempts.
Table 1.
Overlap and Thermodynamic Settings
| Overlap window (bp) | Altered thermodynamic Primer3 parameter | Default thdyn value | Completion | Final term fail | Final int fail | Per term fail | Per int fail | Btrack reqd | Avg penalty | Max penalty | Avg nonuniformity | Max nonuniformity | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | None | — | 98.31 | 1.23 | 0.45 | 10.67 | 12.27 | 26.15 | 1.631 | 3.906 | 624.32 | 1191.64 | 2.97 |
| 41 | None | — | 98.70 | 1.23 | 0.06 | 3.34 | 3.67 | 9.53 | 1.190 | 2.957 | 623.78 | 1191.95 | 2.83 |
| 61 | None | — | 98.77 | 1.23 | 0.00 | 0.03 | 0.11 | 0.33 | 0.313 | 0.946 | 623.41 | 1192.80 | 2.75 |
| 41 | MAX_SIZE = 33 | 27 | 99.12 | 0.88 | 0.00 | 1.20 | 1.76 | 5.04 | 1.298 | 3.294 | 623.85 | 1192.41 | 4.45 |
| 41 | MIN_TM = 55 | 57 | 99.12 | 0.88 | 0.00 | 1.45 | 2.21 | 6.06 | 1.256 | 3.121 | 624.09 | 1192.95 | 2.80 |
| 41 | MAX_SIZE = 33, MIN_TM = 55 | 27, 57 | 99.12 | 0.88 | 0.00 | 1.10 | 1.65 | 4.62 | 1.281 | 3.214 | 624.10 | 1193.33 | 4.45 |
| 41 | MIN_GC = 5 | 20 | 98.70 | 1.23 | 0.06 | 3.34 | 3.67 | 9.53 | 1.190 | 2.957 | 623.78 | 1191.95 | 2.84 |
| 41 | MAX_TM = 75 | 63 | 98.70 | 1.23 | 0.06 | 2.84 | 3.13 | 8.02 | 1.214 | 3.055 | 624.01 | 1192.43 | 2.88 |
| 41 | MAX_HAIRPIN_TH = 80 | 47 | 98.70 | 1.14 | 0.16 | 1.59 | 3.66 | 6.04 | 1.038 | 2.666 | 622.95 | 1190.12 | 2.76 |
| 41 | MAX_POLY_X = 8 | 5 | 98.70 | 1.23 | 0.06 | 2.97 | 3.46 | 8.94 | 1.187 | 2.958 | 623.80 | 1191.26 | 2.82 |
The 3082 GIs (Supplementary Data S1) were treated with default parameters, varying only the overlap window size or the indicated Primer3 thermodynamic settings, and taking the first solution. Default thdyn value is the Primer3 default for the altered parameter. Avg penalty is the average Primer3 penalty score (see Primer3 manual) for the PCR set of the rebuild; Max penalty is the maximum for the PCR set. Avg nonuniformity is the average distance (bp) from perfect size uniformity for the PCR set; Max nonuniformity is the maximum for the PCR set. Penalty, nonuniformity, and Time values are averaged over all non-failing GIs. Completion, Final term fail, Final int fail, and Btrack reqd are percentages of GIs: Completion, design of the whole assembly succeeded; Final term fail, design failed at a terminal segment; Final int fail, design failed at an internal segment; Btrack reqd, backtracking was required. Per term fail and Per int fail are failure rates (%) per PCR attempt: Per term fail, failed Primer3 calls per attempt for the final segment; Per int fail, failed Primer3 calls per attempt for any internal segment. Overlap window = 41 and MAX_SIZE = 33 were selected for use in the tests of the following tables.
GI, genomic islands; PCR, polymerase chain reaction.
The low rate of final internal failure was due to backtracking; the failure rate over all internal PCR attempts was 12.3%, such that 26% of successful rebuilds required at least one backtrack. Increasing the overlap window size further improved the completion rate; 41 bp was chosen for remaining tests. Altering Primer3 thermodynamic parameters showed that raising the maximum primer length improved design completion, as did lowering the minimum Tm, and these effects were not additive (Table 1). We therefore raised the maximum primer length default to 33, as this was expected to have little effect on PCR success (lowering Tm can instead reduce long PCR quality).
Rebuilding statistics were similar for sets of mock GIs of the same sizes and from the same genomes as the GIs (Table 2), showing that GIs are not particularly special to rebuild, although for unknown reasons they required less backtracking than the mock GIs.
Table 2.
Optimizers and Three Independent Sets of Mock Genomic Islands
| Optimizer | Dataset | Completion | Final term fail | Final int fail | Per term fail | Per int fail | Btrack reqd | Avg penalty | Max penalty | Avg nonuniformity | Max nonuniformity | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Penalty | 3082 GIs | 99.12 | 0.88 | 0.00 | 1.20 | 1.76 | 5.04 | 1.298 | 3.294 | 623.85 | 1192.41 | 4.45 |
| Uniform | 3082 GIs | 99.12 | 0.88 | 0.00 | 1.42 | 2.18 | 5.37 | 1.449 | 3.578 | 128.77 | 297.10 | 4.46 |
| Penalty | mock1 | 99.38 | 0.62 | 0.00 | 2.39 | 3.27 | 7.48 | 1.121 | 2.897 | 616.71 | 1191.47 | 4.47 |
| Penalty | mock2 | 99.42 | 0.58 | 0.00 | 2.85 | 2.70 | 7.25 | 1.168 | 2.993 | 620.78 | 1195.90 | 4.47 |
| Penalty | mock3 | 99.32 | 0.68 | 0.00 | 2.52 | 2.98 | 7.64 | 1.160 | 2.989 | 620.29 | 1195.00 | 4.48 |
In the tests described above, the alternative PCRs from each Primer3 run were ranked for inclusion according to penalty score (see Primer3 manual), which optimizes each individual component PCR of the rebuild. Alternatively, users can optimize for size uniformity of the rebuild PCRs (Table 2), a metric uncorrelated with penalty (Supplementary Fig. S3), which may improve overall PCR production efficiency since large size differences might otherwise require multiple long PCR runs. Another setting controls whether bigDNA stops after producing the first acceptable solution or exhaustively produces all acceptable solutions (Table 3). The exhaustive mode is greatly speeded by allowing fewer alternative PCRs.
Table 3.
Exhaustive Search and Alternative Polymerase Chain Reaction Set Size
| Solution mode | Alt. PCRs | Completion | Per term fail | Per int fail | Btrack reqd | Avg penalty | Max penalty | Avg nonuniformity | Max nonuniformity | Time (s) | Solutions |
|---|---|---|---|---|---|---|---|---|---|---|---|
| First | 20 | 100.00 | 0.99 | 2.41 | 4.50 | 0.989 | 2.454 | 613.84 | 1142.78 | 2.92 | 1.0 |
| Exhaustive | 20 | 100.00 | 0.54 | 1.21 | 23.50 | 0.226 | 0.501 | 100.10 | 213.41 | 81.03 | 823371.4 |
| First | 7 | 100.00 | 0.99 | 2.41 | 4.50 | 0.989 | 2.454 | 613.84 | 1142.78 | 2.70 | 1.0 |
| Exhaustive | 7 | 100.00 | 0.35 | 1.24 | 11.50 | 0.279 | 0.660 | 218.12 | 479.67 | 11.77 | 4941.4 |
The set of 200 GIs (Supplementary Data S1) was used. For exhaustive search, the lowest value among all solutions was taken for each of avg penalty, max penalty, avg nonuniformity, and max nonuniformity; these values are averaged over all GIs.
Application of tntBLAST, to avoid mispriming on a complex template,12 was turned off in the tests described above. Three options were tested for its application: per-recursion, per-solution, or (for exhaustive mode) post-exhaustive (Table 4). We generally recommend use of the exhaustive mode, with post-exhaustive application of tntBLAST; the penalty- or uniformity-optimal solution can then be selected from the output. As for Primer3 results, tntBLAST results were memoized for efficiency.
Table 4.
Filtering Out Mispriming
| Solution | tntBLAST use | Completion | Avg penalty | Max penalty | Avg nonuniformity | Max nonuniformity | Time (s) | Solutions | Solutions rejected (%) | Tnt fail (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| First | Off | 100.00 | 0.989 | 2.454 | 613.84 | 1142.78 | 2.70 | 1.0 | 0.00 | NA |
| First | Per-recursion | 100.00 | 0.969 | 2.408 | 617.78 | 1143.64 | 8.81 | 1.0 | 0.00 | 4.17 |
| First | Per-solution | 100.00 | 0.967 | 2.397 | 614.93 | 1142.23 | 4.54 | 1.5 | 31.03 | 11.43 |
| Exhaustive | Off | 100.00 | 0.279 | 0.660 | 218.12 | 479.67 | 11.77 | 4941.4 | 0.00 | NA |
| Exhaustive | Per-recursion | 100.00 | 0.286 | 0.680 | 229.45 | 499.01 | 35.55 | 2389.2 | 0.00 | 4.03 |
| Exhaustive | Per-solution | 100.00 | 0.285 | 0.679 | 227.80 | 496.39 | 65.99 | 2461.9 | 0.41 | 4.21 |
| Exhaustive | Post-exhaustive | 100.00 | 0.279 | 0.660 | 218.12 | 479.67 | 13.05 | 4941.4 | 0.25 | 8.72 |
The set of 200 GIs (Supplementary Data S1) was used, limiting alternative PCR set sizes to 7.
NA, not applicable; tntBLAST, ThermonucleotideBLAST.
All testing described earlier used end-handling option (1), allowing a tolerance window of a specified size (200 bp) at each end in which to locate the terminus of the rebuild. Other options for end-handling are (2) forcing to the extreme ends of the template sequence and (3) reading a gene/feature annotation file for source DNAs and setting a custom tolerance window that will include in the PCR the closest feature to each terminus. A final specialized end-handling procedure is aimed at our original use case, (4) deleting the integrase gene and attachment site from the GI termini.
Completion rates dropped with these more stringent end-handling treatments, to as low as 45.8% for the forcing option (Table 5). For unknown reasons, GIs were more amenable to terminus-forcing than mock GIs.
Table 5.
Terminus Treatments
| Terminus handling | Dataset | Completion | Avg penalty | Max penalty | Avg nonuniformity | Max nonuniformity | Time (s) | Tnt fail |
|---|---|---|---|---|---|---|---|---|
| 200 bp tolerance | GIs | 93.15 | 1.295 | 3.273 | 623.44 | 1192.95 | 12.14 | 8.01 |
| Gene | GIs | 92.31 | 1.386 | 3.574 | 633.82 | 1197.29 | 12.21 | 7.98 |
| delta_int | GIs | 84.85 | 1.306 | 3.281 | 862.09 | 1481.17 | 11.83 | 8.48 |
| Force | GIs | 45.78 | 3.231 | 7.979 | 597.65 | 1192.76 | 11.72 | 8.02 |
| Force | mock1 | 37.87 | 2.908 | 6.988 | 588.22 | 1184.75 | 11.44 | 6.56 |
| Force | mock2 | 38.45 | 3.055 | 7.266 | 603.82 | 1191.85 | 11.11 | 6.69 |
| Force | mock3 | 38.74 | 3.013 | 7.188 | 605.59 | 1199.29 | 11.19 | 6.17 |
200 bp tolerance, standard window allowed to omit from termini; gene, tolerance windows set to include first annotation feature at each terminus; delta_int, delete the integrase gene and attP of the GI; force, extreme termini of GI must be included. TntBLAST use was per-recursion, stopping after reaching the first solution.
The rebuilding subroutine is run on each segment to be spliced in the final assembly. Remaining splicing junctions are trivial to prepare once termini have been determined for each segment to be spliced; they are chimeras where each junction primer is prepended by the reverse complement of the other. One type of splicing, circularization, was already tested in the cases cited earlier. We explicitly tested deletion and insertion at internal sites for the 1411 circularizable GIs, selecting the centermost gene for deletion.
A window between a terminus of the deleted gene and the next neighboring gene was used to locate PCR termini. With occasional shortness of the windows separating the target gene from its neighbor, 87.0% of the deletions succeeded. We then inserted a green fluorescent protein promoter/gene in place of the deleted gene, with identical success.
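The chimeric primers at a splicing junction, in which each junction primer is prepended by the reverse complement of the other, can be sketched as follows. The function names are hypothetical, and writing the added tail in lower case is an assumption suggested by the mention of upper-case primer portions in the primers.txt output.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA sequence, preserving letter case."""
    return seq.translate(_COMPLEMENT)[::-1]


def splice_primers(left_reverse, right_forward):
    """Build the two chimeric primers for one splicing junction.

    left_reverse:  reverse primer at the right end of the upstream segment
    right_forward: forward primer at the left end of the downstream segment
    Returns (chimeric_reverse, chimeric_forward); the prepended tail is lower
    case and the template-annealing portion is upper case (an assumed convention).
    """
    chimeric_forward = revcomp(left_reverse).lower() + right_forward.upper()
    chimeric_reverse = revcomp(right_forward).lower() + left_reverse.upper()
    return chimeric_reverse, chimeric_forward


if __name__ == "__main__":
    # Short made-up primers, for illustration only.
    rev, fwd = splice_primers("TTGCA", "GGATC")
    print(rev, fwd)
```

Because each chimeric primer is the full reverse complement of the other, the two PCR products share an overlap equal to the combined length of the two original primers.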
Rebuilding a bacterial genome
The first synthetic cellular genome, M. genitalium JCVI-1.0, was built by overlap assembly.1,14 At one stage in the assembly, there were eight “eighth-genome” subassemblies ranging in size from 64 to 93 kb. Here, we design the rebuild of each eighth-genome segment from long PCRs. Challenges for primer-finding and mispriming may be expected from the relatively low complexity of the genome at 31.67% GC; however, mispriming may be mitigated by the small total genome size (583 kb).
This exercise would hypothetically use the existing engineered genome as a template; additional design work would be needed to introduce the “watermarks” and other engineered sequences in JCVI-1.0 if a native Mycoplasma genome were the template. Quicker first-solution runs established that each assembly was feasible. The final runs used exhaustive mode so that the penalty-optimized and the uniformity-optimized solution could be identified for each segment; tntBLAST was applied post-exhaustive, and runs were speeded by returning no more than seven alternative PCRs in each Primer3 call.
This generated a total of 2.00E8 solutions, from which the penalty-optimized and the uniformity-optimized solution could be selected for each segment (Table 6). This included the biggest assembly tested here (10 PCRs for segment F), with a concomitantly long runtime (3.2 h) and large number of solutions collected. This number of PCRs exceeds the limit typically recommended for Gibson assembly; in practice, we would likely increase the maximum PCR size for this segment to reduce the number of PCRs.
Table 6.
Rebuilding the Eighths of the Synthetic Mycoplasma Genome
| Eighth | Cassettes | Length | GC % | PCRs | Lowest avg penalty | Lowest max penalty | Lowest avg nonuni-formity | Lowest max nonuni-formity | Time (s) | Solutions |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 1..12 | 73,514 | 30.03 | 8 | 4.097 | 6.035 | 259.9 | 713 | 389.56 | 5,764,801 |
| B | 13..24 | 70,017 | 32.52 | 8 | 2.197 | 4.463 | 367.5 | 1285 | 341.06 | 5,764,801 |
| C | 25..36 | 71,402 | 34.61 | 8 | 2.205 | 5.446 | 344.9 | 1376 | 157.61 | 2,235,331 |
| D | 37..49 | 64,299 | 35.28 | 7 | 1.159 | 4.970 | 274.4 | 840 | 62.98 | 823,543 |
| E | 50..61 | 73,553 | 32.46 | 8 | 1.870 | 5.590 | 369.9 | 563 | 372.98 | 5,764,801 |
| F | 62..77 | 92,684 | 31.00 | 10 | 1.444 | 4.317 | 489.8 | 1003 | 11435.17 | 172,944,030 |
| G | 78..89 | 73,244 | 28.91 | 8 | 3.585 | 6.963 | 501.9 | 957 | 338.37 | 5,764,801 |
| H | 90..101 | 65,166 | 28.93 | 7 | 3.723 | 12.797 | 235.1 | 565 | 64.28 | 823,543 |
Eighths and cassettes as specified in Ref.14 Run in exhaustive mode, limiting alternative PCR sets to 7, with post-exhaustive tntBLAST.
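One way to read the Solutions column of Table 6 is against a simple combinatorial bound: if every solution for a segment uses the listed number of PCRs n, and each Primer3 call returns at most seven alternatives, then no more than 7^n solutions can exist. The short calculation below, with values copied from the table, makes that comparison; the interpretation is ours and is offered as a sketch, not as a statement of how bigDNA counts solutions.

```python
ALTERNATIVES = 7  # alternative PCRs allowed per Primer3 call in these runs

# (number of PCRs, solutions reported) for each eighth-genome segment, Table 6.
TABLE6 = {
    "A": (8, 5_764_801), "B": (8, 5_764_801), "C": (8, 2_235_331),
    "D": (7, 823_543), "E": (8, 5_764_801), "F": (10, 172_944_030),
    "G": (8, 5_764_801), "H": (7, 823_543),
}


def fraction_of_bound(n_pcrs, solutions, alternatives=ALTERNATIVES):
    """Solutions found as a fraction of the bound alternatives ** n_pcrs."""
    return solutions / alternatives ** n_pcrs


if __name__ == "__main__":
    for segment, (n_pcrs, solutions) in TABLE6.items():
        bound = ALTERNATIVES ** n_pcrs
        share = fraction_of_bound(n_pcrs, solutions)
        print(f"{segment}: {solutions:>11,} of {bound:>11,} ({share:.1%})")
```

Segments A, B, D, E, G, and H sit exactly at the bound (7^8 = 5,764,801 and 7^7 = 823,543), whereas C and F fall below it, which would be consistent with some alternatives being unavailable or rejected for those two segments.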
bigDNA for prophage engineering
The work described earlier was all performed in silico. To show practical value in preparing a set of long PCRs for prophage engineering, we used bigDNA to design PCR primer pairs for a prophage, Bce328.59.V, predicted by TIGER13 in the B. cepacia MSMB648 (GCA_001532955.1) genome, engineering the prophage genome to delete its integrase gene found at one terminus. Both a four-fragment and a five-fragment Gibson assembly were designed. All the bigDNA primer pairs yielded long PCR fragments of the expected size (Fig. 2). Additional smaller fragments are commonly generated by long PCR; from here, gel extraction could be used as needed to purify the desired fragment before Gibson assembly.
FIG. 2.
Long PCRs for assembly of circularized prophage Bce328.59.V, engineered for integrase gene deletion. The PCR primer pairs were designed using bigDNA, for either a five-part (expected sizes ranging from 11,196 to 11,550 bp) or a four-part (expected sizes ranging from 14,064 to 14,590 bp) assembly. The PCR products were resolved in a 0.8% agarose gel; relevant marker (lanes M) bands from a 1-kb ladder are indicated.
Discussion
Preparing long engineered DNAs by overlap assembly of long PCRs, including both splicing and rebuilding operations, has proven valuable in diverse applications.5–9 We present open-source software that designs PCR primers for such overlap assemblies, with a unique focus on the rebuilding problem. BigDNA is quick, allows design with circularization or gene insertion or deletion, and scores the final assembly design.
We validate the software with several assembly design problems for a large database of GIs, including many prophages, and show the effectiveness of the resulting primer pairs in generating long PCR products for prophage engineering, our original use case for phage therapy.6 BigDNA was also validated by designing pieces of a synthetic M. genitalium genome, thus showing utility beyond phage applications. Indeed, this software could be used to rebuild any large DNA, including integrative conjugative elements, prophages, viruses, plasmids, whole bacterial genomes, or synthetic cassettes.
Supplementary Material
Authors' Contributions
I.V. performed software development and data analysis. C.M.M. performed experiments, data analysis, and results interpretation. K.P.W. performed software development, data analysis, results interpretation, and acquired funding. All authors contributed to writing and approved the manuscript.
Author Disclosure Statement
No competing financial interests exist.
Funding Information
This work was supported by the U.S. Department of Energy Summer Undergraduate Laboratory Internship program; by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under the Secure Biosystems Design Initiative; and by the Laboratory Directed Research and Development program at Sandia National Laboratories (project 222466), which is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.
References
- 1. Gibson DG, Benders GA, Axelrod KC, et al. One-step assembly in yeast of 25 overlapping DNA fragments to form a complete synthetic Mycoplasma genitalium genome. Proc Natl Acad Sci U S A 2008;105:20404–20409.
- 2. Ma H, Kunes S, Schatz PJ, et al. Plasmid construction by homologous recombination in yeast. Gene 1987;58:201–216.
- 3. Ando H, Lemire S, Pires DP, et al. Engineering modular viral scaffolds for targeted bacterial population editing. Cell Syst 2015;1:187–196.
- 4. Gibson DG, Young L, Chuang R-Y, et al. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat Methods 2009;6:343–345.
- 5. Kilcher S, Studer P, Muessner C, et al. Cross-genus rebooting of custom-made, synthetic bacteriophage genomes in L-form bacteria. Proc Natl Acad Sci U S A 2018;115:567–572.
- 6. Mageeney CM, Sinha A, Mosesso RA, et al. Computational basis for on-demand production of diversified therapeutic phage cocktails. mSystems 2020;5:e00659-20.
- 7. Bordat A, Houvenaghel MC, German-Retana S. Gibson assembly: An easy way to clone potyviral full-length infectious cDNA clones expressing an ectopic VPg. Virol J 2015;12:89.
- 8. De Munter S, Van Parys A, Bral L, et al. Rapid and effective generation of nanobody based CARs using PCR and Gibson assembly. Int J Mol Sci 2020;21:883.
- 9. Silayeva O, Barnes AC. Gibson assembly facilitates bacterial allelic exchange mutagenesis. J Microbiol Methods 2018;144:157–163.
- 10. Ye J, Coulouris G, Zaretskaya I, et al. Primer-BLAST: A tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics 2012;13:134.
- 11. Koressaar T, Remm M. Enhancements and modifications of primer design program Primer3. Bioinformatics 2007;23:1289–1291.
- 12. Gans JD, Wolinsky M. Improved assay-dependent searching of nucleic acid sequence databases. Nucleic Acids Res 2008;36:e74.
- 13. Mageeney CM, Lau BY, Wagner JM, et al. New candidates for regulated gene integrity revealed through precise mapping of integrative genetic elements. Nucleic Acids Res 2020;48:4052–4065.
- 14. Gibson DG, Benders GA, Andrews-Pfannkock C, et al. Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome. Science 2008;319:1215–1220.