Author manuscript; available in PMC: 2025 Dec 20.
Published in final edited form as: ACS Synth Biol. 2024 Nov 7;13(12):4099–4109. doi: 10.1021/acssynbio.4c00539

Sequencing Strategy to Ensure Accurate Plasmid Assembly

Sarah I Hernandez 1, Casey-Tyler Berezin 1, Katie M Miller 1, Samuel J Peccoud 1, Jean Peccoud 1,*
PMCID: PMC11706207  NIHMSID: NIHMS2045286  PMID: 39508818

Abstract

Despite the wide use of plasmids in research and clinical production, the need to verify plasmid sequences is a bottleneck that is too often underestimated in the manufacturing process. Although sequencing platforms continue to improve, the method and assembly pipeline chosen still influence the final plasmid assembly sequence. Furthermore, few dedicated tools exist for plasmid assembly, especially for de novo assembly. Here, we evaluated short-read, long-read, and hybrid (both short and long reads) de novo assembly pipelines across three replicates of a 24-plasmid library. Consistent with previous characterizations of each sequencing technology, short-read assemblies had issues resolving GC-rich regions, and long-read assemblies commonly had small insertions and deletions, especially in repetitive regions. The hybrid approach facilitated the most accurate, consistent assembly generation and identified mutations relative to the reference sequence. Although Sanger sequencing can be used to verify specific regions, some GC-rich and repetitive regions were difficult to resolve using any method, suggesting that easily sequenced genetic parts should be prioritized in the design of new genetic constructs.

Keywords: Whole-plasmid sequencing, NGS, Nanopore, Assembly, Workflows, Reproducibility

INTRODUCTION

Plasmids are critical tools used in research, industrial, and clinical settings for applications such as recombinant gene expression, designing genetic circuits, and the generation of clinical products like vaccines 1–6. Sequence verification of these increasingly large plasmid libraries is critical to ensuring the expected product is made and to evaluating the biological effects of spontaneous and intentional mutations, including single nucleotide polymorphisms (SNPs) 1–3, 7, 8. Although DNA sequences are generally designed and documented digitally, and the necessity of openly providing DNA sequences and thorough gene annotations has been discussed many times, it is not uncommon for researchers to have only a vague plasmid map and/or no reference sequence 9–12. Many plasmids are generated by inserting a gene of interest into a plasmid backbone, and sequence verification is often overlooked in favor of simpler “confirmation” methods, such as PCR amplification or restriction digests. Yet, without confirming a plasmid’s sequence, unrecognized deviations from an expected sequence threaten the accuracy of biological insights gained using such a plasmid.

We have proposed cryptographic algorithms to secure the exchange of plasmids in the life science community 13–15. This cyberbiosecurity solution 16, 17 makes it possible to retrieve the plasmid’s and its developer’s identities, retrieve the plasmid documentation, and detect the possible presence of mutations in the plasmid sequence from sequencing data without prior knowledge of the plasmid sequenced. The value of this technology depends on the availability of a streamlined and robust sequencing workflow. First, we developed a de novo assembly pipeline to produce the plasmid’s physical sequence from short sequencing reads 18. To streamline the execution of this bioinformatics pipeline, we deployed it on Amazon Web Services. We also developed a user interface allowing laboratory personnel to analyze plasmid sequencing data without installing software or requiring a bioinformatician’s assistance. This tool, called PlasCAT 19, is available at sequencing.genofab.com. Its modular architecture makes it possible to improve the underlying bioinformatics pipeline without significant modifications to the user interface. Here, we present an improved de novo assembly pipeline, analyze its reproducibility, and compare its performance to Epi2ME, a tool from Oxford Nanopore Technologies (ONT) for long-read assembly 20.

The lack of dedicated plasmid assembly tools can explain people’s reluctance to perform sequence verification of their plasmids. Amidst the many tools for genome and metagenome assembly, some are designed to identify plasmids in the assembly of these larger datasets 21–29. However, many of these methods can struggle to accurately reconstruct small (<25 kbp) plasmids or miss them altogether 30, 31. To accelerate plasmid verification in high-throughput settings, assembly pipelines should avoid the need for a reference sequence or manual intervention, as required by some methods 1, 32. De novo sequence assembly is preferred to reference-based assembly as it can also help overcome reference bias and identify unexpected mutations 7, 10, 33–36.

Although Sanger sequencing has long been the gold standard for sequencing, it requires a reference sequence to design primers and is limited to short (~800 bp) sequences, necessitating many reactions to verify a whole plasmid 38–42. The need to sequence unknown DNA templates led to the introduction of next-generation sequencing (NGS) technology, namely the Illumina fragmentation-based approach 3, 38–40, 43. While the short-read sequencing fragments (~250 bp) generally provide good template coverage and high sequence accuracy, biases introduced in PCR steps result in the underrepresentation of GC-rich, GC-poor, and repetitive regions 38, 39, 44–47. Short-read sequencing can also underperform with low-diversity libraries 44, 46. Thus, third-generation sequencing methods that allow the sequencing of reads that are thousands of nucleotides long (as long as a plasmid itself) have been developed 1, 40. These long-read sequencing methods can be faster and cheaper than fragmentation-based approaches and better resolve long, complex sequences, but they have historically had lower accuracy than their predecessors and can struggle with smaller templates 7, 31, 36, 48, 49. Recent advancements in genome assembly tools have indicated that a hybrid approach, using both short and long reads, can produce improved assemblies 27, 31, 49–51. Nevertheless, the ability to sequence growing libraries of DNA sequences, including genomes, with base-level precision is a continued pursuit 52. To our knowledge, a hybrid approach to de novo plasmid assembly using PlasCAT or other tools has not yet been systematically interrogated.

Here, we evaluated the ability of the short- and long-read sequencing methods available from Illumina and ONT, respectively, to generate accurate de novo assemblies of plasmid sequences. We found that a hybrid assembly approach, using both short and long reads, produced the best assemblies. The short-read assemblies were limited by the quality and quantity of DNA used and struggled to assemble GC-rich regions, whereas the long-read assemblies had a higher incidence of insertions and deletions (collectively, indels) and mutations, as has been previously suggested 30, 49, 53. We used Sanger sequencing to confirm several discrepancies between the assembly sequences and the plasmid reference sequences and found several cases where the assemblies consistently differed from the expected reference sequence. Importantly, de novo assembly outperformed reference-based assembly, which frequently showed reference bias and often did not match Sanger data. Thus, de novo hybrid assembly is the preferred method for high-throughput plasmid sequencing.

MATERIALS AND METHODS

Reagents

The Zyppy-96 Plasmid MagBead Miniprep Kit was purchased from Zymo Research (Irving, CA, USA; #D4102). The long-read library preparation kit and R10.4 flow cell were purchased from Oxford Nanopore (UK; #SQK-RBK114.96 and #FLO-MIN114). For the short reads, the ILMN DNA LP (M) Tagmentation 96 library preparation kit, MiSeq cartridges, and iSeq cartridges were purchased from Illumina (San Diego, CA, USA; #20060059, #MS-103-1003, and #20031374, respectively). Sanger sequencing primers were designed and ordered from IDT (Coralville, IA). The BigDye Terminator V3.1 and BigDyeXTerminator purification kits were both purchased from ThermoFisher (Waltham, MA, USA; #4337454 and #4376486, respectively).

Biological Resources

The plasmids used for library preparation, sequencing, and analysis were obtained from three vendors. Twelve plasmids were synthesized and sequence-verified by Twist Biosciences (San Francisco, CA), and eleven were procured from Addgene (Watertown, MA, USA). One plasmid solution was taken from a transfection kit where the mixture contains two plasmids with the same vector backbone, roughly 6000 bp, and two different inserts, around 1350 bp and 650 bp (Gibco #A14635) 54, 55. The array of plasmids was given unique identifiers (e.g., Plasmid 1234) to anonymize the data set. This allowed for true de novo assembly, where the different methods could be compared on overall accuracy among the generated datasets.

Plasmid Isolation and Sequencing

Plasmid Isolation:

The plasmid DNA was extracted from each of the 24 isolates on an epMotion 5075 TC liquid handler (Eppendorf, Hamburg, DE) using the Zyppy-96 Plasmid MagBead Miniprep Kit (Zymo Research, Irving, CA, USA), according to the manufacturer’s instructions with the modification of pipet mixing during the lysis and neutralization steps and an extended elution time of 10 minutes. All samples were quantified on a Synergy LX plate reader to determine their quality and quantity after extraction via miniprep. All samples were required to have a concentration of at least 35 ng/µL and an A260/280 purity reading of >1.8.

Oxford Nanopore Sequencing:

Post isolation, 50 ng of each isolate was used for sequencing with the MinION Sequencer (Oxford Nanopore, UK). These sequencing libraries were prepared using the Rapid Barcoding Kit (#SQK-RBK114.96) with the Flow Cell (#FLO-MIN114) according to the manufacturer’s instructions. Samples were run on the MinION with a maximum read length kept of 25 kbp. FASTQ files were generated from the super-high accuracy method of the Dorado basecaller within the MinKNOW software and were used for sequence validation, comparison, and evaluation.

Illumina Sequencing:

After isolation, 200 ng of each isolate was used for sequencing on the MiSeq and iSeq (Illumina, CA). Both sequencing libraries were prepared on an epMotion 5075 TC liquid handler (Eppendorf, Hamburg, DE) using the ILMN DNA LP (M) Tagmentation 96 IPB kit protocol as described by the manufacturer. The pooled libraries were spiked with 1% v/v PhiX Control V3 (Illumina, San Diego, CA) and were diluted to a final loading concentration of 10 pM and 100 pM for the MiSeq and iSeq, respectively. The diluted libraries (600 µL and 20 µL) were loaded onto a MiSeq Reagent Nano Kit v2 (500-cycles) and iSeq 100 i1 Reagent v2 (300-cycle). FASTQ files generated were used for sequence validation, comparison, and evaluation.

Sanger Sequencing:

Sequence validation was performed as needed for templates with discrepancies between generated sequences. Primers were designed and ordered through IDT to be 18 to 25 bp long and to satisfy the following requirements: a GC content of 50% or higher, a melting temperature around 70 °C, and no secondary structure (Supplementary Table S1). Fragments were prepared following the BigDye Terminator V3.1 kit as described, and the 10 µL reactions were diluted to 0.5X using the BigDyeXTerminator purification kit where described. Samples were sequenced using the LongFrag_BDX protocol. Results were immediately uploaded to SnapGene and compared to the assembly data.

De Novo Sequence Assembly:

De novo sequence assembly was primarily performed using PlasCAT, an open-source plasmid assembly pipeline that was recently adapted to a web application (sequencing.genofab.com) 37. Short-read assembly through this pipeline has been previously validated 7, while the experimental data used in developing the long-read and hybrid pipelines have not been published. As such, this work represents the first comparative assessment of the accuracy of each pipeline implemented in PlasCAT. In brief, the pipeline generates assemblies from short reads, from long reads, or from a hybrid approach (i.e., both short and long reads) via the gold-standard genome assembly tool, Unicycler 27. The pipeline also performs some pre-processing of the data, either through Trimmomatic for short reads 56 or Filtlong for long reads (https://github.com/rrwick/Filtlong), and subsets the long reads to a particular coverage using Rasusa 57, 58. The short reads were filtered to a minimum length of 50 bp and a minimum quality score of 35. Filtlong was used to keep the best 80% of long reads (based on quality and length) and remove reads above the maximum read length of 20,000 bp 59. Subsetting with Rasusa was done using the default estimated size of 5,000 bp and 500X coverage, which gave better long-read assemblies than the default 1000X coverage. Long-read and hybrid assemblies were polished with Racon 60. Long-read assemblies were also generated using Oxford Nanopore’s Epi2ME platform to serve as a method comparison (https://labs.epi2me.io/). This pipeline uses Flye for long-read assembly 61, the Medaka polisher (https://github.com/nanoporetech/medaka) 20, 62, and Trycycler to generate a consensus assembly 32. For Epi2ME, we used the default estimated size of 7000 bp, 60X coverage, and end trimming of 150 bp.

RESULTS

Plasmid Sequencing and Assembly Pipelines

We sequenced a set of 24 plasmids and performed de novo assembly using short-reads from the Illumina MiSeq and long-reads from the ONT MinION. Three technical replicates were sequenced from each plasmid, all pulling from the same initial purified plasmid solution. The open-source tool PlasCAT was used to generate assemblies in three ways: short-read only, long-read only, and a hybrid approach using short and long reads (Figure 1). This design allowed us to compare both the reproducibility of each sequencing technology across repeated library preparations and evaluate different plasmid assembly approaches.

Figure 1: Overview of sequencing workflow.

Multiple sequencing runs were performed on 24 plasmid samples. For short-read sequencing, plasmids were fragmented and chemically indexed before being loaded onto the Illumina MiSeq. Forward and reverse reads were generated and used with PlasCAT to generate both short-read and hybrid assemblies. For long-read sequencing, the plasmids were linearized, chemically indexed, and loaded onto the Oxford Nanopore MinION. FASTQ files were generated and used for both long-read and hybrid assemblies. Sanger sequencing was used to confirm regions with discrepancies between assemblies.

Of the 72 total samples prepared, all generated data on the Nanopore sequencer, while only 69 samples generated FASTQ files with data on the MiSeq. Of the three failures, two were replicates of the same sample. Issues with sample dropouts prohibited us from generating assemblies for these three samples and were attributed to the library preparation procedure and not the assembly process. The short reads were trimmed to maintain a per-base quality score of at least 35 and reads shorter than 50 bp were removed (Supplementary Figure S1). Before filtering, there were roughly 50,000 reads (combined forward and reverse) for each sample, representing at least 500 million bases per run. After filtering, only about 5,000 reads were retained per sample (Supplementary Figure S1). The long reads were filtered with Filtlong. There were initially about 50,000 reads per sample, representing over 3 billion bases per run. Filtlong retains the best 80% of the data, based on both quality and length. This is evidenced by an increase in average read length (from ~4,000 to ~6,000) and minimum read length (from 500 to ~1,000 bp) in post-filtered samples, despite the decrease in maximum read length to 20,000 bp. After filtering, there were about 25,000 reads per sample (Supplementary Figure S1).
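The short-read filtering criteria described above (a per-base quality of at least 35 and a minimum length of 50 bp) can be illustrated with a minimal Python sketch. It is not the Trimmomatic implementation and does not reproduce the exact Trimmomatic steps used by PlasCAT, which are not stated here; it assumes simple end trimming on Phred+33 quality strings, and the function names `phred_scores` and `trim_and_filter` are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_and_filter(sequence, quality_string, min_quality=35, min_length=50):
    """Trim low-quality bases from both ends, then apply a length cutoff.

    Returns the trimmed (sequence, quality_string) pair, or None if the
    read is shorter than min_length after trimming. This is an assumed,
    simplified stand-in for the pre-processing step, not Trimmomatic itself.
    """
    scores = phred_scores(quality_string)
    start, end = 0, len(scores)
    # Advance past leading bases below the quality threshold.
    while start < end and scores[start] < min_quality:
        start += 1
    # Step back past trailing bases below the quality threshold.
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    if end - start < min_length:
        return None
    return sequence[start:end], quality_string[start:end]
```

With a threshold this strict, a read loses every low-quality base at its ends and is then dropped if fewer than 50 bases remain, which is in line with the large drop in read count after filtering reported above.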

Hybrid De Novo Plasmid Assembly Outperforms Short-Read or Long-Read Assembly

To quantify the robustness of each assembly method, we devised two scoring methods: an assembly score to represent the success of a particular assembly, and a sequence agreement score to assess the reproducibility of assemblies across multiple runs (Figure 2). An assembly score of 1 indicates a single contig was returned (a success), while a 0 was given if no assembly or if multiple contigs were returned (a failure). An overall assembly score was obtained by summing the scores of each of the three runs. A sequence agreement score of 1 indicates that the sequences of all successful (non-fragmented) assemblies were the same, or a 0 if not. Plasmids that only had one successful assembly were excluded from this scoring. The overall assembly and sequence agreement scores were converted into percentages based on the number of included runs and samples, respectively. We included one sample that was a mixture of two plasmids (Plasmid 3589) to see whether the plasmid assembly pipelines would generally return one contig, or if it could resolve mixtures of similar plasmids, but it was excluded from our formal data analysis.
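The two scores can be written down compactly. The following Python sketch follows the definitions above under stated assumptions: each run is represented as a list of contig sequences (an empty list meaning no assembly), sequences are compared as exact strings that are assumed to share the same start position and strand, and every run passed in counts toward the denominator. The function names are hypothetical and are not part of PlasCAT.

```python
def assembly_score(contigs):
    """1 if exactly one contig was returned, else 0 (no assembly or fragmented)."""
    return 1 if len(contigs) == 1 else 0


def sequence_agreement_score(runs):
    """Score reproducibility across replicate runs of one plasmid.

    runs: list of contig lists, one per replicate (empty list = no assembly).
    Returns 1 if all successful assemblies are identical, 0 if they differ,
    and None if fewer than two runs succeeded (excluded from scoring).
    """
    successful = [contigs[0] for contigs in runs if assembly_score(contigs) == 1]
    if len(successful) < 2:
        return None
    return 1 if len(set(successful)) == 1 else 0


def summarize(plasmids):
    """plasmids: dict of plasmid id -> list of runs. Returns two percentages."""
    run_scores = [assembly_score(c) for runs in plasmids.values() for c in runs]
    agreement = [sequence_agreement_score(runs) for runs in plasmids.values()]
    scored = [s for s in agreement if s is not None]  # drop excluded plasmids
    assembly_pct = 100 * sum(run_scores) / len(run_scores)
    agreement_pct = 100 * sum(scored) / len(scored) if scored else None
    return assembly_pct, agreement_pct
```

Because plasmids are circular, two assemblies of the same molecule can start at different positions or be on opposite strands; a real comparison would need to normalize for rotation and orientation before the exact-match test used in this sketch.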

Figure 2: Hybrid assemblies outperform short- and long-read assemblies.

De novo assemblies were generated from short-reads (A), long-reads (B), or from both (hybrid, C). All assembly pipelines produced some fragmented assemblies (>1 contig) which were considered failures (red). Only the length of the longest contig is reported in these cases. Some short-read sequencing preparations did not produce sufficient data for assembly (n/a, red). An assembly score of 1 indicates a successful assembly (non-fragmented), and these are summed across the three replicates to generate an overall assembly score (maximum of 3). A sequence agreement score of 1 indicates that all successful assemblies were exact matches for one another, and the corresponding assembly lengths are bolded. Samples with only one successful assembly were not given a sequence agreement score and were excluded from the percentage calculation. The sample containing a mixture of two plasmids (Plasmid 3589) was also excluded from this analysis, as it was not expected to return only one contig. The hybrid assemblies had the highest sequence agreement scores, followed by short-read and then long-read assemblies. The long-read assemblies had the highest overall assembly score but failed to produce high sequence agreement scores, indicating lower reproducibility of assembly results.

We generated short-read assemblies for all 69 of the samples that produced sufficient data. Most of the assemblies were single contigs, however the pipeline returned 13 assemblies with multiple contigs (Figure 2A). Given that we expected all samples to result in a single contig representing the plasmid, these 13 assemblies were considered failures and given an assembly score of 0. There was only one sample that received an overall assembly score of 0, for which all three assemblies were fragmented (Plasmid 3135). Nevertheless, nearly 77% of runs resulted in a successful assembly. Furthermore, 76% of samples had good sequence agreement scores, indicating that the assembly of most plasmids by short-read sequencing is reproducible across repeated library preparations. The sequence agreement score is ultimately more important than the assembly score, since the consistency provides researchers with a higher level of confidence that their assemblies are correct. Notably, there were 5 samples, including most of the plasmids larger than 10 kb, which received a sequence agreement score of 0 for their short-read assemblies and for which no consensus could be reached (Figure 2A).

We compared the assemblies generated from one sequencing run using the iSeq, which produces 151 bp reads, to the MiSeq, which produces 251 bp reads. Overall, the assemblies generated by the iSeq were similar to the assemblies from the MiSeq (Supplementary Table S2). Four of the iSeq assemblies contained multiple fragments, which was comparable to the failure rate of the MiSeq assemblies. Although the assemblies generated by the iSeq reads were not considerably better or worse, the shorter length of the reads may result in worse resolution of repetitive regions. Thus, we continued only with MiSeq for short-read data for further analysis.

Compared to the short-read assemblies, the long-read assemblies had a higher overall assembly score but a lower sequence agreement score (Figure 2). Nearly 99% of long-read assemblies contained a single contig, likely due, at least in part, to the length of the reads generated and the absence of any fragmentation steps. However, the long-read assemblies were not consistent across runs, with only 30% of samples having a sequence agreement score of 1. Most of the samples with sequence agreement scores of 0 appeared to vary only by small (≤5 bp) insertions and deletions (collectively, indels), but in a few cases, there appeared to be a multiplicity issue where the length of one assembly was 2–3x longer than the assemblies from other runs (Plasmids 3121 and 3132). If a researcher’s goal is only a high-level structural confirmation of the plasmid (i.e., was my gene of interest inserted?), then such assemblies may be sufficient. However, the inconsistency of the runs is a matter of concern both in terms of overall trust in the accuracy of long-read assemblies and when detailed sequence verification is required.

The hybrid assembly pipeline produced the best assemblies overall. The overall assembly score (81%) was slightly lower than the long-read assemblies, due to the three failed short-read library preparations and the presence of more fragmented assemblies. This similarity to the short-read assemblies is consistent with the hybrid approach relying on the short-reads to establish an initial scaffold onto which the long-reads are assembled. Nevertheless, nearly 87% of samples had a good sequence agreement score, significantly higher than both short-read and long-read assemblies, highlighting the ability of the hybrid approach to reproducibly generate high quality assemblies that leverage the strengths of both technologies. There were only 3 samples that received a sequence agreement score of 0, and these had not been resolved by short-read or long-read methods either (Plasmids 3127, 3130, and 3132).

Of note, short-read assembly of the two-plasmid mixture (Plasmid 3589) resulted in either a single contig of about 6600 bp or an assembly containing three contigs, seemingly representing a backbone and two inserts (Figure 2A). Several of the long-read and hybrid assemblies returned plasmids around 7400 bp and 6700 bp. Each method seemed to struggle with resolving two highly similar structures.

Given that the long-read assemblies were less robust than short-read and hybrid assemblies, we compared the results we obtained from PlasCAT to assemblies generated from Epi2ME, a long-read de novo assembly tool recommended by Oxford Nanopore. Epi2ME failed to produce an assembly in two cases (Supplementary Table S3), while PlasCAT always returned an assembly, albeit sometimes fragmented. Aside from these runs, 97% of assemblies were successful; however, it appears that Epi2ME is restricted to returning only a single contig, which could have inflated this score. With almost 55% of samples having sequence agreement scores of 1, Epi2ME performed slightly better than PlasCAT for long-read assemblies but did not perform as well as the short-read or hybrid assemblies.

Increasing sequencing depth improves assembly quality until it begins to plateau around 50X depth 63. None of our samples had such low long-read sequencing depth that increasing the depth would have improved the results; almost all samples had initial coverage depth over 20,000X (assuming plasmid size of 5000 bp) with the lowest at roughly 1500X. Yet, depth that is too high also impedes assembly quality, thus both PlasCAT and Epi2ME subset the long-read data to a particular coverage level to improve assembly results 29, 49, 58, 64–66. Indeed, running the PlasCAT pipeline on the first MinION run without subsampling the data did not produce any successful assemblies; they either failed or were highly fragmented, consisting of anywhere from 3 to 80 different contigs (Supplementary Table S4). The one exception was the two-plasmid mixture (Plasmid 3589), which produced two reasonable contigs sized 7447 bp and 6532 bp. To subset the data, an estimated size for the plasmid must be provided to establish the number of reads needed to achieve a particular coverage level, presenting a potential barrier to de novo assembly. However, the default size parameters of PlasCAT and Epi2ME seemed to work well for samples of all sizes. While PlasCAT takes a single subset of the data to produce an assembly, Epi2ME uses Trycycler to generate a consensus assembly from 3 assemblies generated from 3 separate subsample sets, which likely improved its sequence agreement scores. Interestingly, re-analysis of the same data with the same parameters on Epi2ME occasionally returned different assemblies, suggesting that a truly random seed is used for subset generation. Although Trycycler was likely meant to accommodate this randomness, it was not intended to be a fully automated platform and will fail to produce a consensus if any assemblies are too different from one another 32. On the other hand, PlasCAT’s long-read analysis returned the same assembly each time, suggesting a deterministic procedure for subsetting the data and an increased ability to reliably reproduce assemblies.
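The relationship between read count, estimated plasmid size, and coverage, and the effect of the random seed on reproducibility, can be illustrated with a short Python sketch. It is a conceptual stand-in and not the Rasusa or Epi2ME implementation: the function names are hypothetical, the target is taken to be the estimated size multiplied by the desired coverage, and reads are drawn in shuffled order until that number of bases is reached.

```python
import random


def coverage_depth(read_lengths, plasmid_size=5000):
    """Average depth = total sequenced bases / assumed plasmid size."""
    return sum(read_lengths) / plasmid_size


def subsample_to_coverage(read_lengths, plasmid_size=5000, coverage=500, seed=None):
    """Randomly pick read indices until the target number of bases is reached.

    A fixed seed gives the same subset on every call; seed=None draws a
    different subset each time (assumed behavior for illustration only).
    """
    target_bases = plasmid_size * coverage
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)
    chosen, total = [], 0
    for index in order:
        if total >= target_bases:
            break
        chosen.append(index)
        total += read_lengths[index]
    return chosen
```

For example, 25,000 reads of 6,000 bases each correspond to a depth of 30,000X at an assumed size of 5,000 bp, and reaching 500X requires only 417 of those reads. Calling `subsample_to_coverage` twice with the same `seed` returns the same reads, which is one way a pipeline could return identical assemblies on re-analysis.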

Subsetting the data also led to more practical return times, cutting down the average time per PlasCAT assembly from about 30 minutes to 2 minutes. All samples from the same PlasCAT run are executed in parallel, resulting in a fast turnaround time for processing large datasets. On the other hand, each Epi2ME assembly took a few minutes per sample, and they are run in succession, so each set of 24 assemblies took two to three hours to complete. Within PlasCAT, long-read assembly was the fastest at only about 2 minutes per assembly, while each short-read assembly took, on average, 11 minutes to complete, ranging from 90 seconds to nearly 25 minutes. The hybrid assemblies took significantly longer: at least five minutes and up to 48 minutes.

De Novo Assembly Reveals Deviations from Reference Sequence

Some plasmids were easily assembled by any method and matched exactly to the reference sequence (Figure 3A). However, short-read sequencers are known to have biases associated with highly repetitive and/or GC-rich regions 38, 44, thus we expected that some short-read assemblies may not be representative of the sample. Indeed, aligning the assemblies generated from the short-reads to the reference sequence revealed significant gaps in some assembly sequences (Figure 3B). While long-read sequencers are better able to resolve repetitive and GC-rich regions, they have historically been marred by high error rates, can introduce indels, and appear to depend greatly on how the data is processed (i.e., subsetting, choice of tool) 53, 67. Thus, hybrid assemblies are expected to resolve the gaps seen in short-read assemblies by using long-read data, while also leveraging the high accuracy of short reads to prevent indels and mismatches.

Figure 3: Alignment of assembly sequences to a reference sequence.

(A) Representative example of a plasmid that is correctly assembled by any method (Plasmid 3125). (B) Short-read assemblies consistently miss GC-rich regions that are resolved by long-read or hybrid assemblies (Plasmid 3131). (C) Some deviations between assembly sequences and the reference sequence are consistent across methods (Plasmid 3133), indicating true mutations (red asterisks). Only successful assemblies were aligned to the reference sequence (1–3 per method). Each red line represents 1 assembly. Insertions, relative to the reference, appear as lines above the red line, while deletions and mismatches appear as gaps in the red line.

We found that all assembly methods resulted in some assemblies that had indels or mismatches compared to the reference sequence (Supplementary Table S5, Table 1). The short-read assemblies, and therefore the hybrid assemblies, were more likely to fail or be fragmented than the long-read assemblies. The short-read assemblies were also the most likely to have large indels, typically entire fragments missing (Table 1). However, long-read assemblies had more small indels (≤ 5 bp) and mismatches than the other methods. There were 10 plasmids where at least one long-read assembly had a deviation from the reference sequence, although the short-read and hybrid assemblies matched the reference perfectly. In all cases except one, these assemblies contained indels, mostly small (Supplementary Table S5). There were also two samples that had assemblies that matched the reference length exactly but had several mismatches, which we did not encounter in the short-read or hybrid assemblies.

Table 1.

Summary of errors in plasmid assemblies compared to the reference sequence.

Assembly Type | # Successful Assemblies | # Assemblies with Indels ≤ 5 bp | # Assemblies with Indels > 5 bp | # Assemblies with Mismatches
Short-Read | 53 | 18 | 13 | 17
Long-Read | 68 | 35 | 8 | 29
Hybrid | 56 | 19 | 5 | 19

There were 9 plasmids that had deviations from the reference sequence in all three types of assemblies (Supplementary Table S5). Of these, 5 plasmids consistently deviated from the reference sequence (Figure 3C). For example, the reference for Plasmid 3133 was 14194 bp, but all 9 assemblies had 6 identical mismatches and 2 small deletions, resulting in a 14191 bp plasmid. In addition, 8 of the 9 assemblies for Plasmid 3121 showed a 19 bp deletion corresponding to the T7 promoter. For all 5 samples, any successful assemblies that did not exactly match the others were long-read assemblies that supported the deviations but had additional indels.

In the 4 other cases, there were deviations from the reference sequence, but it was not clear what the true sequence was. For example, Plasmid 3131 usually returned a 6077 or 6078 bp assembly from long-read or hybrid data; some errors were consistent, resulting in a plasmid roughly 13 bp larger than the reference, but a consensus could not be reached (Supplementary Table S5). All 4 plasmids showed large (>750 bp) gaps in the short-read assemblies corresponding to the chicken β-actin (CAG) promoter 68 and adjacent chimeric intron (Figure 3B, Supplementary Table S5). This region is highly repetitive and has a GC content of 73%. This region also contained the discrepancies that could not be resolved between the long-read and hybrid assemblies.

Sanger sequencing primers were designed to target a few regions containing discrepancies (Supplementary Table S1). The Sanger reactions allowed us to confirm several deviations from the reference sequence (Figure 4). For example, it confirmed a 1 bp deletion near the chicken β-actin promoter in Plasmid 3132 (Figure 4A). It also confirmed a 2 bp insertion in a section of repeated Gs in the same promoter in Plasmid 3127, which was found in one hybrid assembly; the other assemblies had a variable number of Gs and the short-read assemblies missed the region completely (Figure 4B). Given the high level of support across methods for certain discrepancies along with our Sanger data, we excluded what seemed to be true deviations, recalculated the number of assemblies with errors, and found that hybrid assemblies had the fewest remaining errors of all types (Table 2).

Figure 4: Sanger confirmation of discrepancies compared to reference sequence.

All successful assemblies were aligned to the reference sequence. (A) Sanger sequencing confirmed a 1 bp deletion in the sequence of Plasmid 3132 that was found in all assemblies. (B) Sanger sequencing confirmed a 2 bp insertion in the sequence of Plasmid 3127, which was found in only one hybrid assembly. The 2 short-read assemblies missed this region entirely.

Table 2.

Summary of errors in plasmid assemblies compared to the reference sequence, excluding true deviations.

| Assembly Type | # Successful Assemblies | # Assemblies with Indels ≤ 5 bp | # Assemblies with Indels > 5 bp | # Assemblies with Mismatches |
| --- | --- | --- | --- | --- |
| Short-Read | 53 | 9 | 10 | 8 |
| Long-Read | 68 | 25 | 6 | 19 |
| Hybrid | 56 | 8 | 2 | 8 |
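
Because each assembly type produced a different number of successful assemblies, the raw counts in Table 2 are easier to compare as fractions. The short Python sketch below normalizes each error count by the number of successful assemblies. The counts are copied from Table 2; the dictionary keys and function name are illustrative only. An assembly may carry more than one type of error, so the columns should not be summed.

```python
# Counts copied from Table 2 (true deviations already excluded).
# Key names are illustrative labels, not identifiers from the study.
table2 = {
    "Short-Read": {"successful": 53, "indels_le_5bp": 9, "indels_gt_5bp": 10, "mismatches": 8},
    "Long-Read": {"successful": 68, "indels_le_5bp": 25, "indels_gt_5bp": 6, "mismatches": 19},
    "Hybrid": {"successful": 56, "indels_le_5bp": 8, "indels_gt_5bp": 2, "mismatches": 8},
}


def error_fractions(row):
    """Express each error count as a fraction of the successful assemblies."""
    n = row["successful"]
    return {key: count / n for key, count in row.items() if key != "successful"}


fractions = {name: error_fractions(row) for name, row in table2.items()}

for name, fr in fractions.items():
    summary = ", ".join(f"{key}: {value:.1%}" for key, value in fr.items())
    print(f"{name:<10} {summary}")
```

Expressed this way, the hybrid assemblies have the lowest fraction in every error category, in line with the raw counts.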

One may assume that if a reference sequence is available, performing a reference-based assembly would lead to improved results over a de novo assembly. We used MIRA 42, 69–71 to perform reference-based assembly of short reads for the 9 plasmids where the de novo assemblies deviated from the reference. A detailed discussion of these results is provided in the supplementary data (Supplementary Discussion). Briefly, the de novo assemblies were much closer to the expected size for all samples (Supplementary Table S6). Some of the deviations found in de novo assemblies were supported by the reference-based assemblies; however, there were several cases where the reference-based assembly showed evidence of reference bias: the assembly matched the reference sequence even when Sanger data supported the deviations found in the de novo assemblies (Supplementary Figure S2). In addition, reference-based assemblies frequently contained non-standard nucleotides, even when Sanger data showed clean peaks. These findings suggest that de novo assemblies are more accurate than reference-based assemblies, which is especially powerful since accurate reference sequences may not always be available.

DISCUSSION

Calls for consistent, accurate sequence verification have been left unanswered for too long 10, 12. Even when researchers do not expect single base pair accuracy to matter, single base pair changes can have unintended biological effects, as exemplified by SNPs, various diseases, and the sequence similarity between certain fluorescent proteins 72, 73. We evaluated several approaches for plasmid assembly in terms of their ability to produce a successful (single contig) assembly as well as to reproducibly assemble a plasmid across three technical replicates of a 24-plasmid library. In several cases, the de novo assemblies for a plasmid consistently differed from the reference sequence, regardless of sequencing method, and several of these deviations were confirmed by Sanger sequencing. Importantly, we found de novo assembly to be more accurate than traditional reference-based assembly. Providing a reference sequence sometimes led to reference bias wherein the reference-based assembly preferentially matched the reference sequence, even when Sanger sequencing confirmed a true deviation. Amongst de novo assemblies, the hybrid approach, leveraging the high accuracy of short reads with the ability of long reads to resolve complex regions, led to the best, most reproducible assemblies.

This workflow was developed with the large-scale production and verification of high-throughput plasmid libraries in mind. With the increasing ease and availability of DNA synthesis and sequencing technologies 74, there is increasing demand for high-throughput methods to produce hundreds to thousands of plasmids. The major bottleneck in the production of these libraries is the sequence verification step 7, 9, 12. While one may wish to argue that short- or long-read sequencing on its own can provide sufficient sequence verification, the hybrid approach leverages not only the high accuracy of short reads but also the ability of long reads to resolve complex regions. The cost associated with this choice will be the highest, but in return, reproducibility and confidence will skyrocket. When deviations from the expected sequence arise during hybrid assembly, confidence can be gained by sequencing additional independent replicates of the plasmid. Given the propensity of DNA to mutate, errors could arise in one bacterial colony and not another, and frequent sequencing can help detect which mutations arose and when. Alternatively, sequencing multiple replicates from the outset as we did here, assuming there will be some failure, may save time in the long run despite incurring higher costs. Sanger sequencing can provide additional confirmation of potential errors.
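
Comparing replicate assemblies of a circular plasmid is complicated by the fact that an assembler may start a contig at any position and report either strand. The Python sketch below shows one way such a comparison could be made, by reducing each assembly to a canonical rotation before testing for equality. It is an illustration under stated assumptions, not the comparison used in this study: the function names are hypothetical, the sequences are toy examples, and inputs are assumed to be non-empty, circular, and free of ambiguity codes other than N.

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string containing A, C, G, T, or N."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def canonical_form(seq):
    """Return the lexicographically smallest rotation over both strands.

    This brute-force search is quadratic in sequence length, which is
    acceptable for plasmid-sized sequences but not for whole genomes.
    """
    seq = seq.upper()
    candidates = []
    for strand in (seq, reverse_complement(seq)):
        doubled = strand + strand  # every rotation is a substring of this
        candidates.append(
            min(doubled[i:i + len(strand)] for i in range(len(strand)))
        )
    return min(candidates)


def replicates_agree(assemblies):
    """True if all assemblies are the same circular molecule."""
    return len({canonical_form(a) for a in assemblies}) == 1


# Toy replicates: the same circle reported with different start points/strands.
rep1 = "ATGGCCTTA"
rep2 = "CCTTAATGG"                        # rotation of rep1
rep3 = reverse_complement("GGCCTTAAT")    # other strand of a rotation of rep1
print(replicates_agree([rep1, rep2, rep3]))   # identical circles
print(replicates_agree([rep1, "ATGGCCTA"]))   # 1 bp deletion in one replicate
```

Exact agreement is a strict criterion. When replicates disagree, an alignment is still needed to locate and classify the differences.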

This work aimed to combine different sequencing technologies and bioinformatics tools to develop a high-performance plasmid sequencing pipeline that minimizes the risk of sequencing errors. This risk should always be considered when reviewing plasmid sequencing data. The decision to collect additional data using different technologies and replicate the sequencing experiment should be motivated by an economic analysis considering two factors: the consequences of sequencing errors and the cost of sequencing. The consequences of sequencing errors can be evaluated as the potential economic loss resulting from working with an incorrect plasmid. For example, a company using a plasmid in a regulated biomanufacturing process can justify higher quality control costs than a graduate student using plasmids in relatively cheap phenotyping assays. The cost structure of sequencing data can hinder additional data collection. It is easier to order additional data when paying on a per-sample basis, as when working with specialized service providers. Users operating their own instruments may find it more difficult to justify the cost of a sequencing run if they only have a few plasmids to sequence.
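
The economic reasoning above can be written as a simple expected-loss comparison. The sketch below is a minimal illustration only: the function names, error probabilities, and dollar amounts are hypothetical placeholders chosen for the example, not values measured or recommended in this study.

```python
def expected_loss(p_error, cost_of_error):
    """Expected economic loss from working with an incorrect plasmid."""
    return p_error * cost_of_error


def worth_extra_sequencing(p_error_before, p_error_after, cost_of_error, sequencing_cost):
    """True if the reduction in expected loss exceeds the cost of the extra data."""
    saved = expected_loss(p_error_before, cost_of_error) - expected_loss(
        p_error_after, cost_of_error
    )
    return saved > sequencing_cost


def per_plasmid_cost(run_cost, n_plasmids):
    """Cost per plasmid when a whole instrument run is shared by n_plasmids."""
    return run_cost / n_plasmids


# Hypothetical scenarios: extra data lowers the chance of an undetected
# sequence error from 5% to 1% and costs 500 per plasmid.
regulated_process = worth_extra_sequencing(0.05, 0.01, 1_000_000, 500)
cheap_assay = worth_extra_sequencing(0.05, 0.01, 200, 500)
print(regulated_process, cheap_assay)

# Hypothetical run cost: few plasmids per run means a high cost per plasmid.
print(per_plasmid_cost(1000, 4), per_plasmid_cost(1000, 96))
```

With these placeholder numbers, the additional sequencing is justified for the regulated process but not for the inexpensive assay, and the per-plasmid cost of an in-house run falls sharply as more plasmids share the run.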

The analysis of the risks of sequencing error and the cost structure of NGS may even lead an investigator to use Sanger sequencing. An increasing number of Sanger reactions is required to cover large sequences, which becomes costly and time-consuming for whole-plasmid sequencing (Supplementary Table S7). However, Sanger sequencing may still make sense for users who need to accurately verify the inserts of only a limited number of plasmids at a time. While Sanger sequencing is the gold standard for accurately resolving known short sequences 75, it requires a reference sequence to generate primers, making it impractical for de novo assembly. Thus, a Sanger-based approach may be compatible with using long-read sequencing to obtain a high-level overview of the plasmid’s structure in order to design primers.
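
To illustrate why Sanger coverage scales poorly with plasmid size, the sketch below estimates the number of reactions needed to tile a sequence with overlapping reads. The usable read length (800 bp) and overlap (100 bp) are assumptions made for the example, not parameters from this study, and the estimate covers a single strand at 1x depth. The actual cost comparison is given in Supplementary Table S7.

```python
import math


def sanger_reactions(length_bp, read_len=800, overlap=100, circular=True):
    """Estimate reads needed to tile a sequence once on one strand.

    read_len and overlap are illustrative assumptions. Each read after the
    first contributes (read_len - overlap) new bases. On a circle, the last
    read must also overlap the first, so every read contributes that amount.
    """
    new_bases_per_read = read_len - overlap
    if circular:
        n = math.ceil(length_bp / new_bases_per_read)
    else:
        n = math.ceil((length_bp - overlap) / new_bases_per_read)
    return max(1, n)


# Whole plasmid: the 14194 bp reference of Plasmid 3133.
print(sanger_reactions(14194))

# A single insert: a hypothetical 1500 bp linear region.
print(sanger_reactions(1500, circular=False))
```

Under these assumptions, whole-plasmid coverage of a 14194 bp plasmid requires about ten times as many reactions as verifying a 1500 bp insert, before accounting for failed reactions or second-strand confirmation.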

Similarly, long-read plasmid assembly can be sufficient for quickly and cheaply eliminating constructs or colonies that will not produce the desired genotype or phenotype from high-throughput screening experiments. For example, when we perform ligation reactions, we expect that some plasmids will not take up the insert(s); thus, we perform long-read sequencing to identify samples where ligation failed and continue on to hybrid sequencing with only the promising candidates. These initial long-read assemblies can have issues like indels that may make them less trustworthy but provide a reasonable starting point for further analysis. On the other hand, short-read plasmid assembly may be sufficient for plasmids that are not highly repetitive and that have an overall and parts-level GC content around 50%; otherwise, large fragments may be missing or the assemblies may be fragmented. Although short-read assemblies can outperform long-read assemblies, it is worth noting that short-read sequencing on the Illumina MiSeq is more expensive per sample than long-read sequencing on the ONT MinION (Supplementary Table S7).

It should be emphasized that certain highly repetitive, GC-rich sequences remained difficult to resolve to single base pair accuracy by any of the bioinformatics pipelines, with the short-read data performing the worst in these areas. Specifically, there were 4 plasmids containing a GC-rich region encompassing a CAG promoter and adjacent chimeric intron, which could not be resolved by any assembly method. Although targeted Sanger sequencing may resolve these sequences with good confidence (Figure 4), regions with GC content outside the typical (40–60%) range as well as repetitive regions that can form hairpins can present challenges for Sanger, short-read, and long-read sequencing alike 7, 47, 75. Optimization of sequencing library preparation kits that can overcome the GC bias would make short-read sequencing a more desirable approach. For example, the ExpressPlex kit from SeqWell is more resilient to a range of GC values and to lower inputs of DNA, which mitigates some of the issues we found with short-read sequencing 76. If some genetic parts are unreliable to sequence regardless of the method, especially to single base pair accuracy, these parts must be cataloged and flagged as difficult sequences. Ideally, these parts would be replaced with others that can perform the same function but are more readily sequenced. This is especially important for developing new plasmid backbones that can be used to produce a wide variety of plasmids that are easily sequenced.
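
One simple way to flag potentially difficult parts at the design stage is to scan a sequence for windows whose GC content falls outside the 40–60% range mentioned above. The Python sketch below is a minimal illustration, not a tool from this study: the function names, the window and step sizes, and the toy sequence are assumptions. It treats the sequence as linear, so windows spanning the plasmid origin are not evaluated, and it does not detect repeats or hairpins.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty DNA string."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_gc_windows(seq, window=100, step=50, low=0.40, high=0.60):
    """Return (start, end, gc) for windows with GC fraction outside [low, high].

    window and step are illustrative defaults; low and high follow the
    40-60% range cited in the text. Coordinates are 0-based, end-exclusive.
    """
    flagged = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        gc = gc_fraction(chunk)
        if gc < low or gc > high:
            flagged.append((start, start + len(chunk), gc))
    return flagged


# Toy sequence: an AT-only block, a GC-only block, then a balanced block.
toy = "AT" * 50 + "GC" * 50 + "ATGC" * 25
for start, end, gc in flag_gc_windows(toy, window=100, step=100):
    print(f"{start}-{end}: GC = {gc:.0%}")
```

For the toy sequence, the first two 100 bp windows are flagged and the balanced third window is not. A region such as the CAG promoter described above, with a GC content of 73%, would be flagged by the same thresholds.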

The pipelines described in this manuscript address a critical need for better plasmid library validation, generating reliable data faster, and they make it easier for non-technical users to carry out complex bioinformatics analyses. Tools such as Trycycler that require manual intervention for users to reliably generate good assemblies may be useful for difficult-to-assemble plasmids but become impractical to scale up 1, 77. Both PlasCAT and Epi2ME are easy-to-use, full-service workflows suitable for high-throughput de novo plasmid sequence verification. Given that a hybrid approach is superior for de novo plasmid assembly, it is critical that assembly tools can accommodate data from multiple sources, making a vendor-independent solution like PlasCAT appealing. As sequencing technologies and bioinformatics tools continue to improve, further work is needed to optimize analysis pipelines for de novo plasmid assembly as well as improve verification and documentation practices surrounding plasmids. Recent work has led to the development of DNA signatures that can embed identifying information directly into plasmid sequences to facilitate simpler plasmid verification using only de novo assembly 36, 78. By incorporating a compressed version of a sequence into a signature and inserting this into a plasmid, plasmids can be instantly verified against the original reference sequence even with no prior knowledge of the sequence, emphasizing the need for and power of accurate de novo assembly methods 36, 78.

Supplementary Material

Supplemental Data and Figures

ACKNOWLEDGEMENTS

We thank Dr. Stephen Coleman for access to the Illumina iSeq sequencing machine. Figure 1 was created with BioRender.com.

FUNDING

This work was supported by the National Institutes of Health R01GM147816 and R21AI168482, the National Science Foundation 2123367, and the Suzanne and Walter Scott Foundation. Funding for open access charge: National Institutes of Health.

Footnotes

Supplementary Information and Data are available at NAR online.

CONFLICT OF INTEREST

J.P., K.M., and S.P. have financial interests in GenoFAB, Inc., a company that may benefit or be perceived as benefiting from this publication.

SUPPORTING INFORMATION

All supporting information, including additional methods, data, and references on reference-based assembly; Sanger primer sequences; iSeq short-read assemblies; Epi2ME assemblies; long-read assemblies without subsampling; summary of errors found in assemblies; cost analysis; summary of read statistics before and after filtering (PDF)

DATA AVAILABILITY

All data generated for this manuscript are publicly available in a repository and can be accessed at https://figshare.com/s/cb61b237859049e68e52.

REFERENCES

  • 1. Brown SD; Dreolini L; Wilson JF; Balasundaram M; Holt RA Complete sequence verification of plasmid DNA using the Oxford Nanopore Technologies’ MinION device. BMC Bioinformatics 2023, 24 (116). DOI: 10.1186/s12859-023-05226-y.
  • 2. Rozwandowicz M; Brouwer MSM; Fischer J; Wagenaar JA; Gonzalez-Zorn B; Guerra B; Mevius DJ; Hordijk J Plasmids carrying antimicrobial resistance genes in Enterobacteriaceae. Journal of Antimicrobial Chemotherapy 2018, 73, 1121–1137. DOI: 10.1093/jac/dkx488.
  • 3. Cameron DE; Bashor CJ; Collins JJ A brief history of synthetic biology. Nature Reviews Microbiology 2014, 12 (5), 381–390.
  • 4. Munnelly K Engineering for the 21st Century: Synthetic Biology. ACS Synth. Biol 2013,(2), 213–215. DOI: 10.1021/sb400039g.
  • 5. Peccoud J Synthetic Biology: fostering the cyber-biological revolution. Synthetic Biology 2016, 1 (1). DOI: 10.1093/synbio/ysw001.
  • 6. Ghaffarifar F Plasmid DNA vaccines: where are we now. Drugs Today 2018, 54 (5), 315–333.
  • 7. Gallegos JE; Rogers MF; Cialek CA; Peccoud J Rapid, robust plasmid verification by de novo assembly of short sequencing reads. Nucleic Acids Research 2020, 48 (18). DOI: 10.1093/nar/gkaa727.
  • 8. Shapland EB; Holmes V; Reeves CD Low-Cost, High-Throughput Sequencing of DNA Assemblies Using a Highly Multiplexed Nextera Process. ACS Synthetic Biology 2015, 4 (7), 860–866.
  • 9. Peccoud J Data sharing policies: share well and you shall be rewarded. Synthetic Biology 2021, 6 (1), ysab028.
  • 10. Peccoud J; Anderson JC; Chandran D; Densmore D; Galdzicki M; Lux MW; Rodriguez CA; Stan G-B; Sauro HM Essential information for synthetic DNA sequences. Nature Biotechnology 2011, 29 (1), 22–22. DOI: 10.1038/nbt.1753.
  • 11. Peccoud J; Gallegos JE; Murch R; Buchholz WG; Raman S Cyberbiosecurity: from naive trust to risk awareness. Trends in biotechnology 2018, 36 (1), 4–7.
  • 12. Thuronyi BW; DeBenedictis EA; Barrick JE No assembly required: Time for stronger, simpler publishing standards for DNA sequences. Plos Biology 2023, 21 (11), e3002376.
  • 13. Kar DM; Ray I; Gallegos J; Peccoud J; Assoc Comp, M. Digital Signatures to Ensure the Authenticity and Integrity of Synthetic DNA Molecules. Nspw ‘18: Proceedings of the New Security Paradigms Workshop 2018, 110–122. DOI: 10.1145/3285002.3285007.
  • 14. Kar DM; Ray I; Gallegos J; Peccoud J; Ray I Synthesizing DNA molecules with identity-based digital signatures to prevent malicious tampering and enabling source attribution. Journal of Computer Security 2020, 28 (4), 437–467. DOI: 10.3233/JCS-191383.
  • 15. Gallegos JE; Kar DM; Ray I; Ray I; Peccoud J Securing the Exchange of Synthetic Genetic Constructs Using Digital Signatures. Acs Synth Biol 2020, 9 (10), 2656–2664. DOI: 10.1021/acssynbio.0c00401.
  • 16. Peccoud J; Gallegos JE; Murch R; Buchholz WG; Raman S Cyberbiosecurity: From Naive Trust to Risk Awareness. Trends Biotechnol 2018, 36 (1), 4–7. DOI: 10.1016/j.tibtech.2017.10.012.
  • 17. Murch RS; So WK; Buchholz WG; Raman S; Peccoud J Cyberbiosecurity: An Emerging New Discipline to Help Safeguard the Bioeconomy. Front Bioeng Biotechnol 2018, 6, 39. DOI: 10.3389/fbioe.2018.00039.
  • 18. Gallegos JE; Rogers MF; Cialek CA; Peccoud J Rapid, robust plasmid verification by de novo assembly of short sequencing reads. Nucleic Acids Res 2020, 48 (18), e106. DOI: 10.1093/nar/gkaa727.
  • 19. Peccoud S; Berezin CT; Hernandez SI; Peccoud J PlasCAT: Plasmid Cloud Assembly Tool. Bioinformatics 2024, 40 (5). DOI: 10.1093/bioinformatics/btae299.
  • 20. Epi2ME Labs. Epi2ME Labs Blog. Epi2ME, 2023. (accessed 3/11/24).
  • 21. Antipov D; Hartwick N; Shen M; Raiko M; Lapidus A; Pevzner PA plasmidSPAdes: assembling plasmids from whole genome sequencing data. Bioinformatics 2016, 32 (22), 3380–3387. DOI: 10.1093/bioinformatics/btw493 (accessed 2/9/2024).
  • 22. Rozov R; Brown Kav A; Bogumil D; Shterzer N; Halperin E; Mizrahi I; Shamir R Recycler: an algorithm for detecting plasmids from de novo assembly graphs. Bioinformatics 2017, 33 (4), 475–482. DOI: 10.1093/bioinformatics/btw651.
  • 23. Antipov D; Raiko M; Lapidus A; Pevzner PA Plasmid detection and assembly in genomic and metagenomic data sets. Genome Res 2019, 29 (6), 961–968. DOI: 10.1101/gr.241299.118.
  • 24. Gomi R; Wyres KL; Holt KE Detection of plasmid contigs in draft genome assemblies using customized Kraken databases. Microb Genom 2021, 7 (4). DOI: 10.1099/mgen.0.000550.
  • 25. Pellow D; Zorea A; Probst M; Furman O; Segal A; Mizrahi I; Shamir R SCAPP: an algorithm for improved plasmid assembly in metagenomes. Microbiome 2021, 9 (1), 144. DOI: 10.1186/s40168-021-01068-z.
  • 26. Gupta SK; Raza S; Unno T Comparison of de-novo assembly tools for plasmid metagenome analysis. Genes & Genomics 2019, 41 (9), 1077–1083. DOI: 10.1007/s13258-019-00839-1.
  • 27. Wick RR; Judd LM; Gorrie CL; Holt KE Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS computational biology 2017, 13 (6), e1005595.
  • 28. Tang X; Shang J; Ji Y; Sun Y PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer. Nucleic Acids Research 2023, 51 (15), e83–e83. DOI: 10.1093/nar/gkad578 (accessed 3/5/2024).
  • 29. Bouras G; Sheppard AE; Mallawaarachchi V; Vreugde S Plassembler: an automated bacterial plasmid assembly tool. Bioinformatics 2023, 39 (7). DOI: 10.1093/bioinformatics/btad409 (accessed 2/14/2024).
  • 30. Berbers B; Saltykova A; Garcia-Graells C; Philipp P; Arella F; Marchal K; Winard R; Vanneste K; Roosens NHC; De Keersmaecker SC J. Combining short and long read sequencing to characterize antimicrobial resistance genes on plasmids applied to an unauthorized genetically modified Bacillus. Nature Research 2020, 10. DOI: 10.1038/s41598-020-61158-0.
  • 31. Johnson J; Soehnlen M; Blankenship HM Long read genome assemblers struggle with small plasmids. Microbial Genomics 2023, 9 (5).
  • 32. Wick RR; Judd LM; Cerdeira LT; Hawkey J; Méric G; Vezina B; Wyres KL; Holt KE Trycycler: consensus long-read assemblies for bacterial genomes. Genome Biology 2021, 22. DOI: 10.1186/s13059-021-02483-z.
  • 33. Chen N-C; Solomon B; Mun T; Iyer S; Langmead B Reference flow: reducing reference bias using multiple population genomes. Genome Biology 2021, 22 (1). DOI: 10.1186/s13059-020-02229-3.
  • 34. Valiente-Mullor C; Beamud B; Ansari I; Francés-Cuesta C; García-González N; Mejía L; Ruiz-Hueso P; González-Candelas F One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads. PLOS Computational Biology 2021, 17 (1), e1008678. DOI: 10.1371/journal.pcbi.1008678.
  • 35. Lau J Reference bias: Challenges and solutions. In SevenBridges Blog, 2017.
  • 36. Gallegos JE; Kar DM; Ray I; Peccoud J Securing the Exchange of Synthetic Genetic Constructs Using Digital Signatures. ACS Synth. Biol 2020, 9 (10), 2656–2664. DOI: 10.1021/acssynbio.0c00401.
  • 37. Peccoud S; Berezin C-T; Hernandez SI; Peccoud J PlasCAT: Plasmid Cloud Assembly Tool. Bioinformatics 2024.
  • 38. Mardis ER Next-Generation Sequencing Platforms. Annu. Rev. Anal. Chem 2013, 6, 287–303. DOI: 10.1146/annurev-anchem-062012-092628.
  • 39. Hu T; Chitnis N; Monos D; Dinh A Next-generation sequencing technologies: An overview. Human Immunology 2021, 82 (11), 801–811. DOI: 10.1016/j.humimm.2021.02.012.
  • 40. Heather JM; Chain B The sequence of sequencers: The history of sequencing DNA. Genomics 2016, 107 (1), 1–8. DOI: 10.1016/j.ygeno.2015.11.003.
  • 41. Peccoud J; Blauvelt MF; Cai Y; Cooper KL; Crasta O; DeLalla EC; Evans C; Folkerts O; Lyons BM; Mane SP Targeted development of registries of biological parts. Plos one 2008, 3 (7), e2671.
  • 42. Wilson ML; Cai Y; Hanlon R; Taylor S; Chevreux B; Setubal JC; Tyler BM; Peccoud J Sequence verification of synthetic DNA by assembly of sequencing reads. Nucleic Acids Research 2012, 41 (1), e25–e25. DOI: 10.1093/nar/gks908 (accessed 3/5/2024).
  • 43. Buermans HPJ; Dunnen J. T. d. Next generation sequencing technology: Advances and applications. Biochimica et Biophysica Acta 2014, 1842 (10), 1932–1941. DOI: 10.1016/j.bbadis.2014.06.015.
  • 44. Tilak M-K; Botero-Castro F; Galtier N; Nabholz B Illumina Library Preparation for Sequencing the GC-Rich Fraction of Heterogeneous Genomic DNA. Genome Biology and Evolution 2018, 10 (2), 616–622. DOI: 10.1093/gbe/evy022.
  • 45. Liao X; Li M; Zou Y; Wu F-X; Wang J Current challenges and solutions of de novo assembly. Quantitative Biology 2019, 7, 90–109.
  • 46. Aird D; Ross MG; Chen W-S; Danielsson M; Fennell T; Russ C; Jaffe DB; Nusbaum C; Gnirke A Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biology 2011, 12. DOI: 10.1186/gb-2011-12-2-r18.
  • 47. Browne PD; Nielsen TK; Kot W; Aggerholm A; Gilbert MTP; Puetz L; Rasmussen M; Zervas A; Hansen LH GC bias affects genomic and metagenomic reconstructions, underrepresenting GC-poor organisms. GigaScience 2020, 9 (2). DOI: 10.1093/gigascience/giaa008 (accessed 5/23/2024).
  • 48. Zhao W; Zeng W; Pang B; Luo M; Peng Y; Xu J; Kan B; Li Z; Lu X Oxford nanopore long-read sequencing enables the generation of complete bacterial and plasmid genomes without short-read sequencing. Front. Microbiol. Sec. Evolutionary and Genomic Microbiology 2023, 14. DOI: 10.3389/fmicb.2023.1179966.
  • 49. De Maio N; Shaw LP; Hubbard A; George S; Sanderson ND; Swann J; Wick R; AbuOun M; Stubberfield E; Hoosdally SJ Comparison of long-read sequencing technologies in the hybrid assembly of complex bacterial genomes. Microbial genomics 2019, 5 (9), e000294.
  • 50. Xia Y; Li X; Wu Z; Nie C; Cheng Z; Sun Y; Liu L; Zhang T Strategies and tools in illumina and nanopore-integrated metagenomic analysis of microbiome data. iMeta 2023, 2 (1), e72.
  • 51. Khrenova MG; Panova TV; Rodin VA; Kryakvin MA; Lukyanov DA; Osterman IA; Zvereva MI Nanopore sequencing for de novo bacterial genome assembly and search for single-nucleotide polymorphism. International Journal of Molecular Sciences 2022, 23 (15), 8569.
  • 52. Gallegos JE; Hayrynen S; Adames NR; Peccoud J Challenges and opportunities for strain verification by whole-genome sequencing. Scientific Reports 2020, 10 (1), 5873. DOI: 10.1038/s41598-020-62364-6.
  • 53. Amarasinghe SL; Su S; Dong X; Zappia L; Ritchie ME Opportunities and challenges in long-read sequencing data analysis. Genome Biology 2020, 21 (30). DOI: 10.1186/s13059-020-1935-5.
  • 54. ThermoFisher Scientific. Expi293 Expression System USER GUIDE. ThermoFisher Scientific: 2020; pp 1–32.
  • 55. Janeway CA TP Jr, Walport M, et al. The structure of a typical antibody molecule. In Immunobiology: The Immune System in Health and Disease, New York: Garland Science, 2001.
  • 56. Bolger AM; Lohse M; Usadel B Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014, 30 (15), 2114–2120. DOI: 10.1093/bioinformatics/btu170 (accessed 11/3/2023).
  • 57. Hall MB Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software 2022, 7 (69), 3941.
  • 58. Lonardi S; Mirebrahim H; Wanamaker S; Alpert M; Ciardo G; Duma D; Close TJ When less is more: ‘slicing’ sequencing data improves read decoding accuracy and de novo assembly quality. Bioinformatics 2015, 31 (18), 2972–2980. DOI: 10.1093/bioinformatics/btv311.
  • 59. Wick RR; Menzel P Filtlong. 2021. (accessed 3/11/24).
  • 60. Vaser R; Sovic I; Nagarajan N; Šikic M Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 2017, 27 (5), 737–746. DOI: 10.1101/gr.214270.116.
  • 61. Kolmogorov M; Yuan J; Lin Y; Pevzner PA Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology 2019, 37, 540–546. DOI: 10.1038/s41587-019-0072-8.
  • 62. Wright C; Griffiths S; Nicholls S; Parker M; Horner N epi2me-labs/wf-denovo-assembly. 2022. (accessed 3/11/24).
  • 63. Zhang T; Xing W; Wang A; Zhang N; Jia L; Ma S; Xia Q Comparison of long-read methods for sequencing and assembly of lepidopteran pest genomes. International Journal of Molecular Sciences 2022, 24 (1), 649.
  • 64. Murigneux V; Rai SK; Furtado A; Bruxner TJC; Tian W; Harliwong I; Wei H; Yang B; Ye Q; Anderson E; et al. Comparison of long-read methods for sequencing and assembly of a plant genome. GigaScience 2020, 9 (12). DOI: 10.1093/gigascience/giaa146 (accessed 1/24/2024).
  • 65. Wick RR; Holt KE Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research 2021, 8, 2138. DOI: 10.12688/f1000research.21782.4.
  • 66. Hall MB Rasusa: Randomly subsample sequencing reads to a specified coverage. Journal of Open Source Software 2022, 7 (69). DOI: 10.21105/joss.03941.
  • 67. Boostrom I; Portal EA; Spiller OB; Walsh TR; Sands K Comparing long-read assemblers to explore the potential of a sustainable low-cost, low-infrastructure approach to sequence antimicrobial resistant bacteria with oxford nanopore sequencing. Frontiers in Microbiology 2022, 13, 796465.
  • 68. Alexopoulou AN; Couchman JR; Whiteford JR The CMV early enhancer/chicken β actin (CAG) promoter can be used to drive transgene expression during the differentiation of murine embryonic stem cells into vascular progenitors. BMC Cell Biology 2008, 9 (1), 2. DOI: 10.1186/1471-2121-9-2.
  • 69. Chevreux B; Wetter T; Suhai S Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. In German Conference on Bioinformatics, 1999; Vol. 99, pp 45–56.
  • 70. Cock PJ; Grüning BA; Paszkiewicz K; Pritchard L Galaxy tools and workflows for sequence analysis with applications in molecular plant pathology. PeerJ 2013, 1, e167.
  • 71. The Galaxy Community. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Research 2022, 50 (W1), W345–W351. DOI: 10.1093/nar/gkac247 (accessed 5/20/2024).
  • 72. Morise H; Shimomura O; Johnson FH; Winant J Intermolecular Energy Transfer in the Bioluminescent System of Aequorea. Biochemistry 1974, 13 (12), 2656–2662. DOI: 10.1021/bi00709a028.
  • 73. Weiner MP; Hudson TJ Introduction to SNPs: discovery of markers for disease. Biotechniques 2002, 32 (Sup), S4–S13.
  • 74. Hughes RA; Ellington AD Synthetic DNA Synthesis and Assembly: Putting the Synthetic in Synthetic Biology. Cold Spring Harbor Perspectives in Biology 2017, 9 (1). DOI: 10.1101/cshperspect.a023812.
  • 75. Crossley BM; Bai J; Glaser A; Maes R; Porter E; Killian ML; Clement T; Toohey-Kurth K Guidelines for Sanger sequencing and molecular assay monitoring. 2020, 32 (6), 767–775. DOI: 10.1177/1040638720905833.
  • 76. SeqWell. ExpressPlex Library Prep Kit. 2024. https://seqwell.com/expressplex-library-prep-kit/ (accessed 9/24/24).
  • 77. Wick RR; Judd LM; Holt KE Assembling the perfect bacterial genome using Oxford Nanopore and Illumina sequencing. PLOS Computational Biology 2023, 19 (3), e1010905. DOI: 10.1371/journal.pcbi.1010905.
  • 78. Berezin C-T; Peccoud S; Kar DM; Peccoud J Cryptographic approaches to authenticating synthetic DNA sequences. Trends in Biotechnology 2024. DOI: 10.1016/j.tibtech.2024.02.002.
