A fast machine-learning-guided primer design pipeline for selective whole genome amplification

Jane A Dwivedi-Yu; Zachary J Oppler; Matthew W Mitchell; Yun S Song; Dustin Brisson

doi:10.1371/journal.pcbi.1010137

2023 Apr 17;19(4):e1010137.

Editor: Manja Marz

PMCID: PMC10138271 PMID: 37068103

Abstract

Addressing many of the major outstanding questions in the fields of microbial evolution and pathogenesis will require analyses of populations of microbial genomes. Although population genomic studies provide the analytical resolution to investigate evolutionary and mechanistic processes at fine spatial and temporal scales (precisely the scales at which these processes occur), microbial population genomic research is currently hindered by the practicalities of obtaining sufficient quantities of the relatively pure microbial genomic DNA necessary for next-generation sequencing. Here we present swga2.0, an optimized and parallelized pipeline to design selective whole genome amplification (SWGA) primer sets. Unlike previous methods, swga2.0 incorporates active and machine learning methods to evaluate the amplification efficacy of individual primers and primer sets. Additionally, swga2.0 optimizes primer set search and evaluation strategies, including parallelization at each stage of the pipeline, to dramatically decrease program runtime. We describe the swga2.0 pipeline, including the empirical data used to identify primer and primer set characteristics that improve amplification performance. We then evaluate the novel swga2.0 pipeline by designing primer sets that successfully amplify Prevotella melaninogenica, an important component of the lung microbiome in cystic fibrosis patients, from samples dominated by human DNA.

Author summary

Population genomics enables the inference of evolutionary and ecological processes that are critical to understanding and eventually controlling many infectious diseases. The promise of microbial population genomics is tempered, however, by difficulties in isolating and preparing pathogens for next-generation sequencing. Here we present swga2.0, an optimized pipeline that designs the selective whole genome amplification (SWGA) primer sets needed to amplify and sequence microbial genomic DNA from complex biological specimens.

Introduction

The rapidly expanding field of population genomics is transforming our understanding of the evolutionary forces shaping genomic diversity within and among species [1]. In microbial systems in particular, population genomic studies are increasingly feasible due to the minimal cost of sequencing small genomes [2–5]. These studies can identify the origins of adaptive traits, map range expansions and migration patterns, and clarify epidemiological processes. A principal obstacle to sequencing specific microbial genomes from natural samples is isolating the target microbial DNA from the DNA of contaminating organisms [6]. Although laboratory culture is the standard practice, the overwhelming majority of microbes cannot be cultured, and direct sequencing is problematic as the microbial genome constitutes only a minuscule fraction of the total DNA [7–9]. Thus, a primary hindrance to collecting populations of microbial genomes is the lack of innovative, cost-effective, and practical methods to collect sufficient amounts of target microbial genomic DNA with limited contaminating DNA.

Several technologies have been developed and utilized to overcome this obstacle including genome capture, single-cell sequencing, and selective whole genome amplification (SWGA) [10–12]. Of these, SWGA is the most inexpensive, flexible, and shareable culture-free technology [13]. SWGA takes advantage of the inherent differences in the frequencies of sequence motifs (k-mers) among species in order to build primer sets that bind often in the target genome but rarely in the contaminating genomes. These selective primer sets are used to selectively amplify the target microbial genomes using Φ29 multi-displacement amplification technology [12, 14]. The Φ29 DNA polymerase amplifies DNA from primers with high processivity (up to 70-kbp fragments) and is 100 times less error-prone than Taq, making it the standard for genome amplification prior to sequencing [14–16]. By coupling Φ29 amplification with selective priming, researchers can selectively amplify a target microbial genome, thus separating the metaphorical baby (target microbial genomes) from the bathwater (off-target DNA from vectors, hosts, or other microbes). SWGA is a powerful and cost-effective tool for researchers looking to generate genomic data for microbial systems. Effective SWGA protocols have resulted in next-generation sequencing (NGS)-ready samples that are enriched for specific target microbial genomes and have been used to address biologically important questions in several microorganisms, including Mycobacterium tuberculosis, Wolbachia spp, Plasmodium spp, Neisseria meningitidis, Coxiella burnetii, Wuchereria bancrofti, and Treponema pallidum [12, 17–30].

The most recent SWGA development pipeline (swga1.0) improved on the concept and existing tools available for SWGA primer selection [17]. Whereas the first SWGA tool used only differential binding ratios of k-mers and melting temperature to build primer sets [12], swga1.0 incorporated a larger a priori set of optimality criteria when selecting both individual primers (i.e., primer binding frequency, improved melting temperature, evenness) and potential primer sets (i.e., evenness, primer binding site density on the target genome) [17]. Evaluation of the amplification and sequencing data from this study revealed that primer sets that prioritized binding site density on the target genome, along with binding site evenness as a secondary factor, yielded the most consistent amplification success.

The swga1.0 pipeline substantially improved the available SWGA development tools and identified several potential enhancements for future studies [17]. First, swga1.0 uses only marginally-effective optimality criteria to evaluate individual primers and primer sets due to a lack of empirical data of the characteristics that result in effective SWGA. While primer binding site density and evenness appear broadly important, the majority of primer sets chosen using these criteria resulted in limited amplification success. Thus, additional primer characteristics correlated with efficient selective amplification were not included in the primer selection process. Second, swga1.0 uses a computationally-expensive algorithm to search for primer sets. swga1.0 evaluated 1–5 million primer sets, which is only a very limited proportion of all potential primer sets, yet still required more time than is available for research projects. Evaluating primer sets could be vastly improved by an informed objective function and by pruning unpromising search paths.

The computational time and experimental cost and effort needed to develop an effective protocol to amplify a target microbial genome has hindered the broad adoption of SWGA for microbial population genomic studies. Here we present the next generation pipeline for SWGA protocol development, swga2.0, that improves the state-of-the-art methods in three areas. First, active learning and machine learning are incorporated into the pipeline to predict the effectiveness of primers and primer sets. Second, novel features including thermodynamically-principled binding affinities are included in primer and primer set evaluation models. Lastly, the computational efficiency of the primer set search algorithm is improved by multiprocessing and caching computationally expensive information. swga2.0 is a fast SWGA optimization software that allows researchers to rapidly identify primer sets that are likely to amplify a specific microbial genome from a complex, heterogeneous sample. We test the novel pipeline by designing and evaluating primer sets to selectively amplify Prevotella melaninogenica, an important pathogen in cystic fibrosis patients.

Methods

`swga2.0` pipeline

The swga2.0 pipeline incorporates metrics on the efficacy of individual primers and a computationally efficient, multiprocessing algorithm to identify sets of primers to selectively amplify a target microbial genome from samples dominated by background DNA (https://anaconda.org/janedwivedi/swga2). The swga2.0 pipeline consists of four major stages, illustrated in Fig 1: (1) cataloging DNA sequence motifs in the target genome and identifying the locations of each motif in both the target genome and background DNA, (2) removing DNA sequence motifs that are either too rare or too unevenly distributed in the target genome, are too common in the background DNA, or have calculated melting temperatures that are outside the acceptable range, (3) predicting the amplification efficacy of the remaining primers (see Amplification efficacy section below), and (4) searching and evaluating aggregations of primers as candidate primer sets. A summary of similarities and differences between the swga2.0 pipeline and prior published methods is presented in Table 1.

Fig 1 — The process is broken into four stages: 1) preprocessing of locations in the target and off-target genomes, 2) filtering motifs in the target genome based on individual primer properties and frequencies in the genomes, 3) scoring the remaining primers for amplification efficacy using a machine learning model, and 4) searching and evaluating aggregations of primers as candidate primer sets.

Table 1. Differences between `swga1.0` and `swga2.0`.

	Feature	`swga1.0`	`swga2.0`
All stages	Time to complete (minutes)^*	1176	93
	Multiprocessing		✓
	Incorporates forward strand	✓	✓
	Incorporates reverse strand		✓
Stage 1 (k-mer preprocessing)	Time to complete (minutes)^*	100	20
	Process k-mers that exact-match the target	✓	✓
	Process all k-mers in the target		✓
Stage 2 (Candidate primer filtering)	Time to complete (minutes)^*	0.1	37.5
	Filter by melting temp	✓	✓
	Filter by self-dimerization	✓	✓
	Filter by target binding count	✓	✓
	Filter by background binding count	✓	✓
	Filter by target Gini index	✓	✓
	Filter by target to background freq. binding ratio	✓	✓
Stage 3 (Primer efficacy filtering)	Time to complete (minutes)^*	NA	5
	Filter by predicted amplification using a trained model		✓
Stage 4 (Primer set search and eval.)	Time to complete (minutes)^*	1076	31
	Filters based on heterodimer formation	✓	✓
	Score sets using heuristic function	✓
	Score sets using fitted regression model		✓
	Uses min distance between binding sites in the background	✓
	Uses max distance between binding sites on the target	✓
	Uses min ratio (target:off) of mean binding site distances		✓
	Uses max ratio (target:off) of binding site frequency		✓
	Uses min ratio (target:off) of coverage approximation		✓
	Uses min mean Gini index of target binding		✓
	Uses max mean Gini index of background binding		✓
	Branch-and-bound search based on heterodimer cliques	✓	✓
	Branch-and-bound based on predicted set score		✓

*Primer sets designed to selectively amplify P. melaninogenica (H. sapiens background) were built on a MacBookPro14,1 (using all 4 threads: 2.5GHz Dual-Core Intel Core i7, 2 threads per core; Memory 16GB 2133MHz LPDDR3).

Stage 1 (k-mer preprocessing)

swga2.0 first identifies all 6 bp to 12 bp k-mers in the target genome that serve as candidate SWGA primers. The number and location of each k-mer in the target genome and background DNA are computed using jellyfish [31]—a fast, parallel k-mer counter—and stored in h5py files. This stage of the pipeline is parallelized and does not need to be re-run when modifying parameters or generating new primer sets (Stage 4).
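
The cataloging step can be illustrated with a minimal pure-Python sketch. It is not the pipeline's implementation (swga2.0 delegates counting to jellyfish and stores results in h5py files); the function names `catalog_kmers` and `binding_sites` are hypothetical, and the in-memory dictionary would be far too slow and large for a real genome.

```python
from collections import defaultdict

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def catalog_kmers(genome, k_min=6, k_max=12):
    """Map every k-mer with k_min <= k <= k_max to the 0-based start
    positions where it occurs on the forward strand.
    Windows containing non-ACGT characters are skipped."""
    positions = defaultdict(list)
    valid = set("ACGT")
    for k in range(k_min, k_max + 1):
        for i in range(len(genome) - k + 1):
            kmer = genome[i:i + k]
            if set(kmer) <= valid:
                positions[kmer].append(i)
    return positions

def binding_sites(primer, positions):
    """Return (forward_sites, reverse_sites) for a primer.
    A primer binds the reverse strand wherever its reverse complement
    occurs on the forward strand."""
    forward = positions.get(primer, [])
    reverse = positions.get(reverse_complement(primer), [])
    return forward, reverse
```

Running the same catalog over the background DNA gives the per-primer counts and positions that the later stages consume.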

Stage 2 (Candidate primer filtering)

The k-mer motifs in the h5py files are filtered according to their binding frequency in the target genome and background DNA, the evenness of their distribution in the target genome, their calculated melting temperatures, GC content, homodimerization probability, and number of single and di-nucleotide repeats. These motif characteristics were used previously to identify candidate SWGA primers [17]. Briefly, binding frequency is the number of exact matches of a primer in the genome normalized by the total genome length; motifs that occur too rarely in the target genome (< min_fg_freq) or too frequently in the background DNA (> max_bg_freq) are removed. Primers that bind unevenly across the target genome (> max_gini), as measured by the Gini index of the distances between motifs, are removed. The melting temperature of potential primers [32] must fall between min_tm (default 15°C) and max_tm (default 45°C), as established in [12]. Motifs with GC content less than min_GC or greater than max_GC (default 0.375 and 0.625, respectively), as well as motifs with three or more G/C in the last five base pairs of the 3′-end, are eliminated. Candidate primers that could self-dimerize, estimated from subsequences that have a reverse complement within the primer (≥ default_max_self_dimer_bp, default = 4), are eliminated. Finally, motifs with runs of single and di-nucleotide repeats are eliminated. The ratio of the binding frequency in the target genome to that in the background DNA is computed for the candidate primers that remain after the filtering steps above. The primers are sorted by the target-to-background ratio and the primers with the largest ratios are retained for downstream evaluation (max_primers; default 500 primers). Multi-processed tasks use the user-specified number of CPUs (default = all).
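
A few of these filters are simple enough to sketch directly. The following is an illustrative approximation, not the swga2.0 source: the function names are hypothetical, the melting temperature, Gini, and repeat filters are omitted, and the self-dimer test is one assumed reading of the criterion (reject when a subsequence of at least `max_self_dimer_bp` bases has its reverse complement elsewhere in the primer).

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def gc_fraction(primer):
    """Proportion of G and C bases in the primer."""
    return sum(b in "GC" for b in primer) / len(primer)

def three_prime_gc(primer, window=5):
    """Number of G/C bases among the last `window` bases (3' end)."""
    return sum(b in "GC" for b in primer[-window:])

def longest_self_complement(primer):
    """Length of the longest subsequence whose reverse complement
    also occurs somewhere in the primer."""
    n = len(primer)
    for length in range(n, 0, -1):
        for i in range(n - length + 1):
            if reverse_complement(primer[i:i + length]) in primer:
                return length
    return 0

def passes_filters(primer, fg_count, fg_len, bg_count, bg_len,
                   min_fg_freq, max_bg_freq,
                   min_gc=0.375, max_gc=0.625, max_self_dimer_bp=4):
    """Apply the frequency, GC, 3'-end, and self-dimer filters."""
    if fg_count / fg_len < min_fg_freq:      # too rare in the target
        return False
    if bg_count / bg_len > max_bg_freq:      # too common in the background
        return False
    if not (min_gc <= gc_fraction(primer) <= max_gc):
        return False
    if three_prime_gc(primer) >= 3:          # three or more G/C at the 3' end
        return False
    if longest_self_complement(primer) >= max_self_dimer_bp:
        return False
    return True
```

Primers that pass would then be ranked by `(fg_count / fg_len) / (bg_count / bg_len)` and truncated to `max_primers`.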

Stage 3 (Primer efficacy filter)

Decreasing the number of candidate primers reduces the computational effort necessary to search and evaluate primer sets in Stage 4. Thus, the motifs retained from Stage 2 are individually evaluated for their potential to bind and amplify the target genome, using the random forest regressor model trained on the experimental data described in the Amplification efficacy section below. Briefly, this non-linear regression model predicts amplification efficacy from experimentally-identified primer properties including thermodynamically-principled features that correlate with binding affinity. Candidate primers are selected according to the minimum predicted on-target amplification threshold parameter (min_amp_pred, default = 5) in this regression model.
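
The thresholding itself is a one-line filter once predictions are available. In this sketch the dictionary of predicted scores stands in for the output of the trained regressor, the function name is hypothetical, and treating the threshold as inclusive is an assumption.

```python
def filter_by_predicted_amplification(predictions, min_amp_pred=5.0):
    """Return primers whose predicted on-target amplification is at least
    `min_amp_pred`, ordered from highest to lowest predicted score.

    `predictions` maps primer sequence -> predicted amplification score."""
    kept = [p for p, score in predictions.items() if score >= min_amp_pred]
    return sorted(kept, key=lambda p: predictions[p], reverse=True)
```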

Stage 4 (Primer set search and evaluation)

swga2.0 searches for and evaluates primer sets using a machine-learning-guided scoring function within a breadth-first, greedy search. The desired number of primer sets (max_sets) is built in parallel, primer by primer, by adding the primers that cause the greatest increase in evaluation score (see S1 Appendix). Briefly, the first primer in each of the max_sets sets is chosen at random from the candidate primer list, allowing for broad exploration of the search space (Fig 2). Primers that are not expected to dimerize with any primer already in the set are added, one at a time, to each set, and the primer set is evaluated as

$$\text{score} = \beta_{0} + \beta_{1}\,\text{freq\_ratio} + \beta_{2}\,\text{mean\_gap\_ratio} + \beta_{3}\,\text{coverage\_ratio} + \beta_{4}\,\text{on\_gap\_gini} + \beta_{5}\,\text{off\_gap\_gini}$$

where the score predicts the percent of target genome coverage at 1×. The features for this regression are described in Table 2 and explained further in the section Primer set search and scoring function. New candidate primers are added sequentially to primer sets and retained if the addition improves the computed score. This process iterates until primer sets of a desired size are generated or the maximum number of iterations is reached. The primer set search process also utilizes a drop-out step in which a set of size n is reduced to the subset of size n − 1 with the highest computed score. “Dropping” the weakest primer also provides the possibility of adding a primer that would otherwise be excluded due to the risk of dimerizing with the dropped primer. The drop-out step can be re-run multiple times and includes the option of temporarily withholding frequently used primers until after the drop-out layer.

Fig 2 — Stage 4 begins with one randomly selected primer for each primer set. Each primer set is built in parallel until the improvements in evaluation score no longer exceed a user-defined parameter (ϵ) or until the maximum number of iterations is reached. A drop-out iteration forces each of the highest-scoring primer sets of size n to reduce to the subset of size n − 1 with the highest computed score.
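
The greedy build and the drop-out step can be sketched for a single primer set as follows. This is a simplified illustration under stated assumptions: `score_fn` and `dimerizes` are hypothetical callables standing in for the regression score and the heterodimer check, `first` and `seed` are added only to make the example reproducible, and the pipeline's parallel construction of max_sets sets and its withholding option are not shown.

```python
import random

def build_primer_set(candidates, score_fn, dimerizes, max_size,
                     epsilon=0.0, first=None, seed=0):
    """Greedily grow one primer set.

    Starts from `first` (or a random candidate) and repeatedly adds the
    compatible primer that raises the score the most, stopping when the
    gain does not exceed `epsilon` or the set reaches `max_size`."""
    rng = random.Random(seed)
    current = [first if first is not None else rng.choice(candidates)]
    best = score_fn(current)
    while len(current) < max_size:
        # Only primers that do not dimerize with any current member qualify.
        options = [p for p in candidates
                   if p not in current
                   and not any(dimerizes(p, q) for q in current)]
        if not options:
            break
        new_score, primer = max((score_fn(current + [p]), p) for p in options)
        if new_score - best <= epsilon:
            break
        current.append(primer)
        best = new_score
    return current, best

def drop_out(primer_set, score_fn):
    """Reduce a set of size n to its best-scoring subset of size n - 1."""
    subsets = [primer_set[:i] + primer_set[i + 1:]
               for i in range(len(primer_set))]
    return max(subsets, key=score_fn)
```

Calling `build_primer_set` again on the output of `drop_out` gives a previously blocked primer the chance to enter the set, which is the purpose of the drop-out step described above.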

Table 2. Ridge regression variable descriptions and coefficient values for primer set evaluation.

Variable name	Variable description	Coef.	Coef. value
intercept	Intercept	$\hat{\beta}_{0}$	−3.14 × 10⁻¹⁵
`freq_ratio`	Ratio of the binding site rate in the on-target to off-target genome.	$\hat{\beta}_{1}$	0.321
`mean_gap_ratio`	Ratio of the mean distance between binding sites of the on-target to off-target genome, aggregating across strands.	$\hat{\beta}_{2}$	−0.0368
`coverage_ratio`	Ratio of the coverage approximation of the on-target to that of the off-target.	$\hat{\beta}_{3}$	−0.0318
`on_gap_gini`	Mean Gini index of on-target binding site gap sizes, averaging across strands.	$\hat{\beta}_{4}$	−0.0131
`off_gap_gini`	Mean Gini index of off-target binding site gap sizes, averaging across strands.	$\hat{\beta}_{5}$	0.281
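
As a worked illustration, the Table 2 coefficients can be applied as a plain linear combination. The function name is hypothetical, and the sketch assumes the features are already on the scale used when the model was fit; the near-zero intercept suggests the training data may have been centered or standardized, which the text does not specify.

```python
# Coefficient values copied from Table 2.
COEFFICIENTS = {
    "intercept": -3.14e-15,
    "freq_ratio": 0.321,
    "mean_gap_ratio": -0.0368,
    "coverage_ratio": -0.0318,
    "on_gap_gini": -0.0131,
    "off_gap_gini": 0.281,
}

def primer_set_score(features):
    """Linear score for one primer set.

    `features` maps each Table 2 variable name to its value; any
    preprocessing (e.g. standardization) is assumed to be done already."""
    score = COEFFICIENTS["intercept"]
    for name, coef in COEFFICIENTS.items():
        if name != "intercept":
            score += coef * features[name]
    return score
```

The signs match the intended behavior: the score rises with a higher target-to-background binding frequency ratio and with more uneven background binding, and falls as target binding sites become more widely or unevenly spaced.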

Amplification efficacy

The amplification efficacy of individual primers (Stage 3) is predicted using a regression model built from a series of rolling circle amplification (RCA) [14, 33] experiments in which plasmids were amplified with individual primers. SWGA typically utilizes multiple primers, each with multiple priming sites, making it challenging to isolate the impact of individual primers. Single-primer amplification reactions assess amplification efficacy of individual primers binding from plasmids with or without an exact-match binding site.

The model of primer amplification efficacy included the features delineated in Table 3 as parameters. These include properties such as the number of base pair repeats, melting temperature, and G/C proportion, all of which are thought to impact the efficacy of accurate primer binding and amplification in PCR and Φ29 reactions [34]. swga2.0 includes additional features that estimate the likelihood of the primer binding to the target using a unified thermodynamic nearest-neighbor DNA model [35, 36]. Empirical thermodynamic parameters ($\Delta G_{T}^{\circ}$) are available for most primer exact-match and single-mismatch scenarios. Empirical thermodynamic data for terminal mismatches are not publicly available [35, 36] and were not captured in our predictive model. This thermodynamic nearest-neighbor model was incorporated by computing $\Delta G_{T}^{\circ}$ values for each primer at each genome position, a smoother metric for primer binding propensity than the number of exact-match binding sites. The $\Delta G_{T}^{\circ}$ values were binned over the range −20 to 3, and the resulting histogram was normalized by genome length. The normalized $\Delta G_{T}^{\circ}$ frequency values are used as features in the primer amplification efficacy regression model.
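
The binning and normalization step might look like the sketch below, which takes precomputed free-energy values as input and does not implement the nearest-neighbor model itself. The bin edges are an assumption inferred from the 27 binding affinity labels in Table 3 (read here as upper bin edges over the stated range of −20 to 3); the actual edges and edge-inclusion rule used by swga2.0 may differ.

```python
from bisect import bisect_left

# Assumed edges: steps of 2 from -20 to -10, steps of 1 to -6,
# then steps of 0.5 up to 3 (28 edges, 27 bins).
EDGES = [-20, -18, -16, -14, -12, -10, -9, -8, -7] + [x / 2 for x in range(-12, 7)]
BIN_LABELS = EDGES[1:]  # upper edge of each bin, matching Table 3 labels

def delta_g_features(delta_g_values, genome_length):
    """Histogram of free-energy values, normalized by genome length.

    Each bin covers (lower, upper]; the lowest bin also includes its
    lower edge. Values outside the range are ignored."""
    counts = [0] * (len(EDGES) - 1)
    for value in delta_g_values:
        if value < EDGES[0] or value > EDGES[-1]:
            continue
        index = max(bisect_left(EDGES, value) - 1, 0)
        counts[index] += 1
    return [count / genome_length for count in counts]
```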

Table 3. Feature importances based on the random forest regressor model.

Subset Description	Feature Description	Feature Importance (%)	Subset Feature Importance (%)
G/C content features	number/proportion of G’s	11.8/2.68	27.9
	number/proportion of C’s	2.76/6.54
	GC content	4.08
repeat features	GG repeat number	7.30	19.2
	longest G repeat	4.10
	CC repeat number	2.37
	longest C repeat	3.51
	TT repeat number	0.458
	longest T repeat	0.613
	AA repeat number	0.333
	longest A repeat	0.431
Binding affinity features	3	1.42	18.1
	2.5	1.70
	2	0.942
	1.5	1.84
	1	2.07
	0.5	1.21
	0	1.27
	−0.5	0.778
	−1	0.616
	−1.5	0.532
	−2	0.368
	−2.5	0.413
	−3	1.25
	−3.5	0.633
	−4	0.611
	−4.5	0.105
	−5	0.0120
	−5.5	0.0143
	−6	0.134
	−7	0.0768
	−8	0.259
	−9	0.0799
	−10	0.215
	−12	0.531
	−14	0.667
	−16	0.199
	−18	0.0802
molarity	molarity	10.2	10.2
last 5 bases near 3′-end	GC-clamp	1.92	10.2
	first base from 3′ end	1.08
	second base from 3′ end	2.48
	third base from 3′ end	1.32
	fourth base from 3′ end	1.41
	fifth base from 3′ end	1.96
A/T content features	number/proportion of A’s	1.16/1.25	6.42
	number/proportion of T’s	1.03/2.98
melting temperature	melting temperature	6.33	6.33
sequence length	sequence length	2.56	2.56

An active learning approach

Active learning, a type of iterative supervised machine learning, was used to maximize information gain in three rounds of single-primer amplification experimentation. The previously published SWGA Perl script [12] was used to generate a list of all primers with one exact-match binding site on one of the plasmids and no exact-match binding site on the other plasmid (pcDNA3-EGFP and pLTR-RD114A from Addgene). The first round of experimental amplification used 204 primers from this list that maximized the variability in the 22 primer attributes in Table 3, excluding the thermodynamic binding affinity features. A random forest regressor was fit to the target and off-target amplification data from the first round of experimental amplifications because it performed best among the tested models (linear, logistic, random forest, gradient boosting, and support vector machine). The optimal parameters according to a hyperparameter search were n_estimators=1500, min_samples_split=10, min_samples_leaf=4, max_depth=50, bootstrap=False.
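
The reported hyperparameter names coincide with the keyword arguments of scikit-learn's `RandomForestRegressor`, so a model with this configuration could plausibly be constructed as below. The text does not name the library, so the use of scikit-learn is an assumption, and `build_amplification_model` and `random_state` are hypothetical additions for illustration.

```python
# Hyperparameters as reported in the text.
RF_PARAMS = {
    "n_estimators": 1500,
    "min_samples_split": 10,
    "min_samples_leaf": 4,
    "max_depth": 50,
    "bootstrap": False,
}

def build_amplification_model(random_state=0):
    """Construct an unfitted random forest with the reported settings.

    scikit-learn is imported inside the function so that the parameter
    dictionary can be inspected without the library installed."""
    from sklearn.ensemble import RandomForestRegressor
    return RandomForestRegressor(random_state=random_state, **RF_PARAMS)
```

The returned model would be trained with `model.fit(X, y)`, where each row of `X` holds one primer's features and `y` holds its measured amplification.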

The random forest regression model was used to predict the amplification efficacy of all primers in the original list. The 96 primers predicted to have the greatest amplification efficacy were chosen for the second round of experimental assessment. The experimental amplification data from round 2 were used to update the random forest regression model. An additional 96 primers predicted to have the greatest amplification efficacy were chosen for a third round of experimental evaluation. The final random forest regression model, built on the experimental results from three rounds of single-primer amplification experiments, is included in the swga2.0 primer design pipeline (Stage 3).
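
One round of this selection can be sketched in a few lines. Here `predict` is a hypothetical callable wrapping the current model, and excluding primers that were already tested is an assumption; the text states only that the 96 primers with the greatest predicted efficacy were chosen.

```python
def select_next_round(all_primers, tested, predict, batch_size=96):
    """Choose the next batch of primers to test experimentally.

    Ranks untested primers by the current model's predicted amplification
    and returns the top `batch_size`."""
    untested = [p for p in all_primers if p not in tested]
    return sorted(untested, key=predict, reverse=True)[:batch_size]
```

After each batch is measured, the new observations are appended to the training data, the model is refit, and the function is called again with the updated `predict` and `tested`.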

Primer set scoring function

Multiple analytical frameworks were explored to construct a primer set scoring function that would accurately predict amplification efficacy and evenness from an individual primer set, determined by the proportion of the target genome with at least 1× sequencing coverage. Multiple primer set scoring functions were trained on data from 46 sets of published SWGA and sequencing data from Mycobacterium tuberculosis and Homo sapiens [17]. Model features were selected from a set of variables thought to influence amplification of the target genome (Gini index, nucleotide distance between target binding sites, entropy and generalized entropy of the binding site distribution), amplification of background DNA (kurtosis, skewness, bimodality, and variance among binding sites), and combinations of target to background amplification (the frequency of binding sites, average distance between binding sites on the same strand, average distance between binding sites on the opposite strands). The best model was selected using 10-fold cross-validation error among ridge regression candidate models. Ridge regression, which uses regularization to reduce model variance, was favored in order to reduce the risk of overfitting given the limited data.
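
For reference, the standard ridge regression estimator has the form below. The notation is introduced here for illustration and is not taken from the source: $y_i$ is the proportion of the target genome with at least 1× coverage for primer set $i$, $\mathbf{x}_i$ is its feature vector, $n$ is the number of training sets (46 here), and $\lambda \ge 0$ is the regularization strength, which would be the quantity tuned by cross-validation. The value of $\lambda$ is not reported in the text, and leaving the intercept unpenalized is the usual convention rather than a stated detail.

```latex
\hat{\boldsymbol{\beta}}
  = \arg\min_{\beta_0,\,\boldsymbol{\beta}}
    \sum_{i=1}^{n} \left( y_i - \beta_0 - \mathbf{x}_i^{\top}\boldsymbol{\beta} \right)^{2}
    + \lambda \,\lVert \boldsymbol{\beta} \rVert_2^{2}
```

Larger values of $\lambda$ shrink the coefficients toward zero, trading a small increase in bias for lower variance, which is the motivation given above for preferring ridge regression with limited data.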

Empirical evaluation of primer sets

Primer sets designed to amplify Prevotella melaninogenica from samples dominated by human DNA were empirically tested to evaluate the efficacy of swga2.0. Six primer sets were created using P. melaninogenica strain ATCC 25845 (GCF_000144405.1) as the target genome and the human genome (GRCh38.p13) as the background DNA. The six primer sets were evaluated in duplicate on purified P. melaninogenica DNA (strain ATCC 25845), diluted to 1% in purified human genomic DNA (Promega, female, catalog No. G1521). Briefly, the 1:99 target:background sample was digested with FspEI (New England Biolabs) according to the manufacturer's protocol (incubation at 37°C for 90 minutes, 20-minute heat inactivation at 80°C). Although digestion has reduced mitochondrial DNA amplification or increased target amplification in some prior studies [12, 20, 37], experiments amplifying non-digested DNA using two of the Prevotella primer sets resulted in amplification success comparable to that of the digested samples. The digested sample was purified using AmpureXP beads (Beckman Coulter) prior to performing selective whole-genome amplification as previously described [18] with slight modifications. Reactions were performed in a volume of 50 μL using 50 ng of digested DNA, SWGA primers (total concentration of all primers together = 3.5 mM), 1× Φ29 buffer (New England Biolabs), 1 mM dNTPs, and 30 units Φ29 polymerase (New England Biolabs). Amplification conditions included a ramp-down from 35 to 31°C (5 min at 35°C, 10 min at 34°C, 15 min at 33°C, 20 min at 32°C, 25 min at 31°C), followed by a 16 h amplification step at 30°C. The polymerase was then denatured for 15 min at 65°C. Amplified samples were purified using AmpureXP beads, prepared for Illumina sequencing [38], and sequenced on an Illumina MiSeq (150 bp, paired end). The unamplified sample was also sequenced to assess changes in sequencing coverage due to SWGA.

Illumina-specific adapter and primer sequences were removed from the reads using Cutadapt (Martin, 2011), and reads were aligned to the target genome using BWA mem (v0.7.1). Analysis of sequence coverage of the target genome was performed using samtools (v1.9) [39].
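
The coverage statistic used throughout (proportion of the target genome with at least 1× coverage) can be computed from per-base depth. The sketch below assumes the three-column, tab-separated output of `samtools depth -a` (reference name, position, depth); the text states only that samtools was used, so the specific subcommand and the function name are assumptions.

```python
def breadth_of_coverage(depth_lines, min_depth=1):
    """Fraction of reported positions with depth >= min_depth.

    `depth_lines` is an iterable of tab-separated lines of the form
    "<reference>\t<position>\t<depth>", one line per genome position."""
    total = 0
    covered = 0
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        _reference, _position, depth = line.split("\t")[:3]
        total += 1
        if int(depth) >= min_depth:
            covered += 1
    return covered / total if total else 0.0
```

The input must include zero-depth positions (the `-a` flag); otherwise the denominator counts only covered bases and the statistic is inflated.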

Results

Primer amplification efficacy

Individual primers were evaluated in amplification experiments using plasmids that either did or did not have exact-match binding sites. The active learning approach, in which data were collected to train a random forest regression model that was then used to choose additional primers, maximized the information gained from each of the three rounds of amplification experiments. The goal of the first round of experiments was to maximize the exploration of the feature space across 22 primer characteristics (Table 3, excluding binding affinity features). Only a few of the 204 tested primers resulted in strong amplification of the target plasmid and weak amplification of the off-target plasmid (Fig 3; S1 Table).

Fig 3 — Amplification of the target plasmid was weak in the majority of the randomly selected primers experimentally investigated in Round 1 (blue points represent the 204 primers evaluated). Target plasmid amplification was equally poor for the primers selected for Round 2 experimentation (96 orange points) by the random forest regressor model trained on the data from Round 1. The majority of primers selected for Round 3 experimentation (96 green points) by the updated random forest regressor model trained on the data from Rounds 1 and 2 resulted in moderate and high amplification of the target plasmid. Points are adjusted along the x-axis so that they do not overlap.

The data collected from the first round of amplification experiments were used to build a random forest regression model and predict the amplification potential of an additional 96 primers (S1 Table). Although a greater proportion of the primers utilized in the second experimental round resulted in strong amplification of the target plasmid, the majority still performed poorly (Fig 3), likely because the initial model was trained on few high-performing primers, which limited extrapolation to the high-amplification regime. Training the random forest regression model with data from rounds 1 and 2 resulted in a model that predicted primarily high-amplification primers.

The high variance in the amplification efficacy of primers selected by the final random forest model iteration makes accurate prediction of high-amplification primers difficult. Fortunately, however, this model accurately predicts poor amplification (amplification scores < 10; Fig 4). As the utility of this model is to exclude low-amplification primers from SWGA primer sets, the swga2.0 pipeline uses the random forest regressor to filter low-amplification primers in Stage 3 (Primer efficacy filter). The random forest regressor excludes a high proportion of low-performing primers and rarely excludes higher-performing primers as determined by testing the model predictions on out-of-sample data (Table 4). The minimum predicted amplification parameter (min_amp_pred, default threshold value = 5) can be modified to exclude a greater proportion of poor-performing primers. Excluding greater numbers of poor-performing primers will reduce the computational complexity of finding primer sets and reduce the probability of experimentally evaluating poor primer sets.

Fig 4 — The amplification predicted for Round 2 amplification was accurate only for low-performing primers (orange). Updating the model with Round 2 amplification data resulted in a model that predicted highly effective primers (green). The amplification efficacy of the primers selected for Round 3 was highly variable despite similar predictions. Nevertheless, the final random forest regression model did not select any poor-performing primers. The model was not used to predict the primers used in Round 1.

Table 4. Poor-performing primers are filtered out in Stage 3 of `swga2.0`.

Higher threshold values in the random forest regression model filter greater proportions of lower-amplification primers but few moderate or efficient primers.

Threshold parameter	Total primers filtered	High-amplification primers filtered
2	6.5%	0.04%
5	26.5%	1.6%
10	47.1%	3.8%
15	58.9%	4.4%
20	66.6%	5.4%

Feature importance in the random forest regression model

Feature importance, computed from the variance reduction at each split in each tree, identifies the impact of each variable on primer performance predictions (Table 3). Features involving GC content are the most important subset of features (27.9%), followed by repeat features (19.2%) and binding affinities calculated from the thermodynamic binding model (18.1%). The features involving GC content and the thermodynamic binding affinity features are in agreement, as expected, because these sets of features are correlated. For example, primers with greater GC content tend to have more negative $\Delta G_{T}^{\circ}$ values calculated in the thermodynamic model [35].

Primer set search and scoring function

The optimal primer set scoring function selected uses five variables (freq_ratio, mean_gap_ratio, coverage_ratio, on_gap_gini and off_gap_gini) within a ridge regression framework (Table 2). The variables freq_ratio and mean_gap_ratio are summary statistics previously found to correlate with SWGA success [17]. The freq_ratio variable is the ratio of binding site frequency in the on-target genome to that in the off-target genome. The mean_gap_ratio variable is based on the average distance in base pairs between primer positions on both the forward and reverse strands, where the computation is indifferent to the strand on which a binding position lies. coverage_ratio is computed by first identifying the number of binding sites on the opposite strand that are within 70 kbp of an exact primer binding site (70 kbp is the maximum length of DNA synthesized by Φ29 [40]). Exponential amplification using Φ29 requires priming positions in the opposite direction within 70 kbp of each other. The number of binding sites within 70 kbp of each binding site on the opposite strand is then normalized by the total genome length as a proxy for primer ‘coverage’. The coverage_ratio is the ratio of the coverage in the target genome to the coverage in the background DNA, which is minimized in the ridge regression model (negative regression coefficient value). Primer evenness is represented in the model for each strand of the target and off-target genomes by on_gap_gini and off_gap_gini, respectively, and is computed using the Gini index of distances between binding sites.
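
Two of these per-genome quantities can be sketched from sorted binding positions. This is an interpretation of the description above rather than the swga2.0 code: the function names are hypothetical, opposite-strand sites are counted on either side of each forward site, and each forward/reverse pair is counted once (counting both directions would double the value without changing the target-to-background ratio).

```python
from bisect import bisect_left, bisect_right

def coverage_approximation(fwd_sites, rev_sites, genome_length, max_dist=70_000):
    """Number of forward/reverse site pairs within `max_dist` bp of each
    other, normalized by genome length."""
    rev = sorted(rev_sites)
    pairs = 0
    for pos in fwd_sites:
        lo = bisect_left(rev, pos - max_dist)
        hi = bisect_right(rev, pos + max_dist)
        pairs += hi - lo
    return pairs / genome_length

def mean_gap(fwd_sites, rev_sites):
    """Mean distance between consecutive binding sites, pooling strands."""
    sites = sorted(list(fwd_sites) + list(rev_sites))
    if len(sites) < 2:
        raise ValueError("need at least two binding sites")
    gaps = [b - a for a, b in zip(sites, sites[1:])]
    return sum(gaps) / len(gaps)
```

Computing each function once for the target genome and once for the background, then dividing the target value by the background value, yields coverage_ratio and mean_gap_ratio as defined in Table 2.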

Computational costs are reduced more than ten-fold

The computational time needed to build primer sets using swga2.0 is significantly lower than for the original swga1.0 program (Table 1). Both programs use the same base framework to create files to store the catalog of 6–12-mers found in the target and background genomes. In swga2.0, however, these files are stored efficiently for future use and need not be re-created if additional primer sets are desired. Using a 2013 MacBook Pro (2 GHz Intel Core i7 with 16 GB of memory; max_sets = 5; max_primers = 200), the swga2.0 pipeline completes its primer set design for P. melaninogenica and H. sapiens in 93 minutes, a process that takes more than ten times as long using swga1.0 (Table 1). The increase in computational efficiency is the result of multi-processing in many aspects of the pipeline and the use of an efficient rather than exhaustive search algorithm. Additionally, HDF5 files, accessed through the h5py library, allow O(1) read access to primer binding positions, and data structures that cache primer set scores eliminate unnecessary recalculations in later iterations. Lastly, swga2.0 uses an efficient expression to calculate the exact Gini score and takes advantage of efficiencies in array computations in the Python library numpy.
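The score-caching idea can be illustrated with a small memoizing wrapper. This is a sketch under stated assumptions, not the data structure used in swga2.0: `CachedSetScorer` and `toy_score` are hypothetical names, and the toy scoring function stands in for the real set-scoring function.

```python
class CachedSetScorer:
    """Memoize primer-set scores so each distinct set is evaluated at most once."""

    def __init__(self, score_fn):
        self._score_fn = score_fn
        self._cache = {}
        self.evaluations = 0  # number of real (uncached) evaluations

    def score(self, primers):
        key = frozenset(primers)  # order-independent, hashable key
        if key not in self._cache:
            self._cache[key] = self._score_fn(key)
            self.evaluations += 1
        return self._cache[key]


def toy_score(primers):
    """Hypothetical stand-in for the real set-scoring function."""
    return sum(len(p) for p in primers)
```

Keying the cache on a `frozenset` means that a set reached by adding the same primers in a different order during the search is recognized as already scored.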

Evaluation of primer sets to selectively amplify Prevotella melaninogenica

Prevotella melaninogenica is an important pathogen in cystic fibrosis patients that is difficult to culture from human-derived samples [41, 42]. P. melaninogenica was chosen to evaluate the improvements in the swga2.0 primer design pipeline as it was expected to be more challenging to selectively amplify for several reasons. For example, both the human genome and P. melaninogenica have a GC-content of 41%, while the GC-content in the M. tuberculosis genome is 65.6% [43, 44]. Due to the similarities between P. melaninogenica and the human genome, only 14 candidate primers passed the most restrictive filters in Stage 2 (min_fg_freq = 1/33,333; max_bg_freq = 1/500,000). By contrast, 114 primers passed the same set of filters for M. tuberculosis (Table 5). The number and quality of primers retained in Stages 1–3 of the swga2.0 pipeline are critical to building effective primer sets to selectively amplify a target genome in Stage 4. We built three primer sets for P. melaninogenica from the 19 primers retained using a less restrictive background DNA filter (Primer sets 1–3; min_fg_freq = 1/33,333 bp; max_bg_freq = 1/333,333 bp) and three primer sets from the 48 primers retained using a less restrictive target genome filter (Primer sets 4–6; min_fg_freq = 1/40,000 bp; max_bg_freq = 1/500,000 bp; Table 5). The primer sets and their associated statistics are presented in S2 Table.

Table 5. P. melaninogenica is a more difficult genome to design primers for than M. tuberculosis.

Searching the two genomes with the same parameters during Stages 2 and 3 produces much larger lists of candidate primers for M. tuberculosis than for P. melaninogenica. Values are the number of candidate primers remaining after Stage 3 (parenthetical numbers are the candidate primers remaining after Stage 2).

Mean Target Frequency	Mean Background Frequency	Prevotella melaninogenica	Mycobacterium tuberculosis
< 1/33.3 kbp	> 1/500.0 kbp	14 (84)	114 (676)
< 1/33.3 kbp	> 1/333.3 kbp	19 (143)	162 (884)
< 1/33.3 kbp	> 1/300.0 kbp	22 (174)	169 (939)
< 1/40.0 kbp	> 1/500.0 kbp	48 (217)	168 (920)
< 1/40.0 kbp	> 1/333.3 kbp	93 (396)	250 (1238)
< 1/40.0 kbp	> 1/300.0 kbp	103 (448)	260 (1308)
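The Stage 2 frequency thresholds can be read as a simple pass/fail test on per-base binding frequencies. The sketch below assumes that min_fg_freq is a lower bound on binding sites per base in the target genome and max_bg_freq an upper bound in the background; the function name, argument names, and the toy counts in the usage note are hypothetical.

```python
def passes_frequency_filter(fg_count, fg_length, bg_count, bg_length,
                            min_fg_freq=1 / 33_333, max_bg_freq=1 / 500_000):
    """Keep a primer that binds often enough in the target genome and
    rarely enough in the background genome (frequencies are per base)."""
    fg_freq = fg_count / fg_length
    bg_freq = bg_count / bg_length
    return fg_freq >= min_fg_freq and bg_freq <= max_bg_freq
```

For example, a primer with 100 sites in a 3 Mbp target and 7,000 sites in a 3 Gbp background fails the default (most restrictive) thresholds but passes once max_bg_freq is relaxed to 1/333,333, which mirrors how loosening the background filter enlarges the candidate list in Table 5.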

Although the sequence motif similarities between P. melaninogenica and the human genome made primer set design more difficult, three of the six primer sets built by swga2.0 were highly effective at selectively amplifying P. melaninogenica from a sample dominated by human DNA (99%) (S3 Table; https://doi.org/10.5061/dryad.3n5tb2rm2). By comparison, 40% of the primer sets designed to amplify M. tuberculosis using the prior swga1.0 pipeline performed much better than the unamplified controls [17]. High-throughput sequencing of the amplification products from the three effective P. melaninogenica primer sets reached 1× coverage across 25–64% of the target genome with 50 Mbp of sequencing effort (Fig 5). The most effective primer set designed by swga1.0 in a prior study reached 1× coverage across only 27% of the M. tuberculosis genome with 50 Mbp of sequencing effort [17]. Similarly, deeper sequencing of the amplification products of the effective primer sets reached 10× coverage at 33–82% of the P. melaninogenica genome after 700 Mbp of sequencing effort, compared with just 0.2% for the unamplified control (Fig 6). The most effective M. tuberculosis primers reached 10× coverage at less than 30% of the target genome at similar sequencing effort [17].

Fig 5. Red and yellow lines indicate the percent of the target genome covered at 1× depth from sequencing each of the two replicate amplification experiments. The black dashed lines represent sequencing coverage of the unamplified samples. While five of the six sets resulted in greater sequencing coverage of the target genome compared to unamplified controls, two were only marginally better. By contrast, sequencing coverage of the amplicons from three of the sets was substantially better than the coverage of the unamplified samples.

Fig 6. The solid colored lines indicate individual replicates and the green dashed line represents the pooled total. Each of the three primer sets yields dramatic increases in sequencing depth compared to the unamplified samples (black dashed line). Each primer set reached 10× coverage across 23–74% of the target genome, while the unamplified samples reached 10× coverage at <1% of the target genome, with 500 Mbp of sequencing effort.
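The coverage figures quoted above (percent of the target genome covered at 1× or 10×) are breadth-of-coverage statistics. A minimal sketch of that calculation follows, assuming a per-base depth list for the target genome is already available (for example, exported from aligned reads with a tool such as SAMtools [39]); the function name and the toy depth values are hypothetical.

```python
def breadth_at_depth(depths, min_depth):
    """Fraction of genome positions sequenced to at least `min_depth`."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Toy per-base depths for a 10 bp "genome" (illustrative values only).
toy_depths = [0, 1, 12, 30, 0, 9, 10, 2, 0, 15]
breadth_1x = breadth_at_depth(toy_depths, 1)
breadth_10x = breadth_at_depth(toy_depths, 10)
```

Computing the same statistic on reads subsampled to a fixed sequencing effort (for example 50 Mbp) is what makes amplified and unamplified samples comparable.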

Discussion

Analyses of populations of microbial genomes have the power to address many major outstanding questions in evolution and pathogenesis. Obtaining populations of genome sequence data has been aided by practical and cost-effective method advancements like Selective Whole Genome Amplification (SWGA). However, developing and verifying an effective SWGA primer set to selectively amplify a target genome from a heterogeneous sample has been both computationally and experimentally challenging. Prior applications of SWGA required considerable computational investment followed by experimental assessments of numerous primer sets, of which only a few worked sufficiently well. The swga2.0 pipeline efficiently identifies primer sets, of which half are highly effective. These improvements were achieved by experimentally identifying characteristics of individual primers that result in effective amplification and limit mispriming and by using efficient data structures and search algorithms to identify primer set characteristics that are correlated with strong and even amplification. The resulting swga2.0 program reduces both the computational and experimental investment necessary to design and subsequently utilize protocols to effectively transform a complex sample containing mostly DNA from nontarget species to a sample dominated by the genomic DNA of the target species.

Computational identification of SWGA primer sets is a highly complex optimization problem involving scalability challenges and limited, noisy prior data. The previously published swga1.0 program identified over five million candidate primers that needed to be filtered into a reasonable working catalog prior to designing primer sets to selectively amplify M. tuberculosis. The computational effort needed to assess each primer set combination requires data structures that efficiently store the information needed for evaluation without repeated searches across multiple genomes for potential binding locations. Further, identifying search and evaluation strategies to identify primer sets demands combinatorial optimization techniques that could be aided through analyses of prior SWGA data. However, learning from prior data has its own challenges due to both limited data availability and experimental noise associated with short-read sequencing data. Despite these challenges, the swga2.0 program provides the necessary pipeline to rapidly identify primer sets that have a high probability of selectively amplifying any target genome from any background DNA sample [45].

The swga2.0 program offers a number of improvements including dramatically reduced computation time due to the implementation of parallelization, caching, branch-and-bound searching, and efficient data storage formats (Table 1). Each of the four stages of the computational pipeline improves speed and accuracy. For example, the jellyfish program [31] used in k-mer preprocessing (Stage 1) is much faster than the DSK program [46] used in swga1.0; novel filters improve primer identification and prevent self-dimerization in filtering candidate primers (Stage 2); a novel machine-learning model for scoring individual primers based on amplification efficacy, which includes thermodynamically-principled binding affinities, improves primer efficacy estimation and reduces the candidate primer list (Stage 3); and the implementation of branch-and-bound techniques with randomized starting locations, drop-out techniques, and additional primer set evaluation functions learned from prior data improve both speed and accuracy of primer set searches and evaluation (Stage 4). Most importantly, optimization is no longer done by hand or via exhaustive search [12, 17].
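The Stage 4 search itself is specified in S1 Appendix and is not reproduced here. As a rough illustration of two of the ideas named above (randomized starting locations and a drop-out step), the sketch below uses a much simpler greedy search rather than branch-and-bound; the function name, the toy primer labels, and the additive toy score are all hypothetical.

```python
import random


def greedy_search_with_dropout(candidates, score_fn, max_size, rng):
    """Grow a primer set greedily from a random start, allowing removal of a
    primer whenever dropping it strictly improves the score."""
    current = [rng.choice(candidates)]
    best = score_fn(current)
    improved = True
    while improved:
        improved = False
        # Growth step: add the single primer that raises the score the most.
        if len(current) < max_size:
            options = [(score_fn(current + [p]), p)
                       for p in candidates if p not in current]
            if options:
                new_score, primer = max(options)
                if new_score > best:
                    current.append(primer)
                    best, improved = new_score, True
        # Drop-out step: remove a primer if the set scores higher without it.
        if len(current) > 1:
            options = [(score_fn([q for q in current if q != p]), p)
                       for p in current]
            new_score, primer = max(options)
            if new_score > best:
                current.remove(primer)
                best, improved = new_score, True
    return sorted(current), best


# Toy additive score: each hypothetical primer contributes a fixed value.
TOY_VALUES = {"A": 3, "B": 2, "C": -1, "D": 1}


def toy_set_score(primers):
    return sum(TOY_VALUES[p] for p in primers)
```

Because the score must strictly increase at every accepted change, the loop always terminates. Running the search from several random seeds and keeping the distinct high-scoring results is the role that randomized starts play in this simplified picture.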

Amplification from 6 to 12 bp primers is prone to mispriming. The random forest model presented here, built on iteratively collected experimental data, accurately identifies low-amplification primers with very few false positive identifications. Applying this filter to the candidate primers from swga1.0 identifies more than 85% as having low-amplification potential. Further, all ten of the top-rated primer sets designed by swga1.0 are made exclusively of primers with low-amplification potential. The random forest model is generalizable to all SWGA projects as it identifies amplification efficacy from primer sequence properties such as GC content and thermodynamically-principled binding affinities [35, 36]. Thus, we expect that this model can be used to predict effective primers when using alternative polymerases such as EquiPhi29 or Bst. However, empirical validation of the primer efficacy model using alternative polymerases would be prudent. Regardless, this novel model allows researchers to remove low-amplification primers from the primer set search and evaluation (Stage 4), resulting in a considerable reduction in computational time complexity and reducing the number of experimentally-tested primer sets that perform poorly.

The primer sets designed by swga2.0 to selectively amplify P. melaninogenica from a sample dominated by human DNA were highly successful. Out of the six primer sets tested, three sets amplified the target genome significantly more than the negative controls. Similar to the conclusion from prior primer set design algorithms [12, 17], the primer sets with lower mean binding distance in the target genome (Prev03—1/4.1 kbp, Prev04—1/2.0 kbp, Prev06—1/4.9 kbp) generally outperformed the other sets (Prev01—1/7.4 kbp, Prev02—1/5.0 kbp, Prev05—1/2.8 kbp). However, all of the sets have similar mean binding distances suggesting that other primer or set attributes that have not been identified likely account for much of the observed variation in amplification success (S2 Table). Nevertheless, it is expected that approximately half of all designed sets are likely to give strong and even selective amplification results such that few sets need to be experimentally evaluated by researchers. Also similar to prior results [20], protocols developed with swga2.0 are expected to retain the ability to investigate within-host microbial diversity using SWGA enrichment [45], without introducing errors, as the SWGA biochemistry remains identical (S4 Table). These advances reduce the up-front cost of developing the SWGA primer sets necessary for population genomic studies.

The data and analyses from this and prior SWGA projects suggest the following as a general primer identification and primer set design workflow for future projects:

Stage 2 (Candidate primer filtering): Set the min_fg_freq parameter as high as possible, and the max_bg_freq parameter as low as possible but not below 3 × 10⁻⁶, while retaining approximately 100 primers.
Stage 3 (Primer efficacy filter): Set the min_amp_pred to 10 or higher to eliminate low-efficiency primers. If too few candidate primers are retained (< 20), increase the number of primers that pass Stage 2 filtration by reducing min_fg_freq as opposed to reducing the min_amp_pred parameter value.
Stage 4 (Primer set search and evaluation): Select 5–10 primer sets with the highest scores for experimental evaluation. However, it is not recommended to choose the top 5–10 sets as they often differ by only one or two primers. It is prudent to choose three of the top-scoring sets and several others with high scores that share few primers with the top-scoring sets. Experimentally amplify the target genome from a mixed sample (≈ 1% target DNA), barcode amplicons from each primer set separately, then pool and sequence at low depth to assess performance. Sequence amplicons from high-performing sets to ensure quality and evenness.
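The Stage 4 advice (take a few top-scoring sets, then add high-scoring sets that share few primers with them) can be sketched as follows. This is an illustration under stated assumptions rather than part of swga2.0: the function names, the Jaccard overlap measure, the 0.5 overlap threshold, and the toy primer labels are all hypothetical choices.

```python
def jaccard(a, b):
    """Shared fraction of primers between two sets (0 = disjoint, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def pick_sets_for_testing(scored_sets, n_top=3, n_total=6, max_overlap=0.5):
    """Keep the n_top best sets, then add lower-ranked sets that overlap
    little with every one of those top sets."""
    ranked = sorted(scored_sets, key=lambda item: item[0], reverse=True)
    top = ranked[:n_top]
    chosen = list(top)
    for score, primers in ranked[n_top:]:
        if len(chosen) >= n_total:
            break
        if all(jaccard(primers, t[1]) <= max_overlap for t in top):
            chosen.append((score, primers))
    return chosen


# Toy (score, primer set) pairs with placeholder primer labels.
toy_sets = [
    (0.95, ("p1", "p2", "p3", "p4")),
    (0.94, ("p1", "p2", "p3", "p5")),
    (0.93, ("p1", "p2", "p3", "p6")),
    (0.92, ("p1", "p2", "p4", "p5")),   # near-duplicate of the top set
    (0.90, ("p7", "p8", "p9", "p1")),
    (0.85, ("p7", "p8", "p2", "p10")),
]
picked = pick_sets_for_testing(toy_sets, n_top=3, n_total=5)
```

In this toy example the fourth-ranked set is skipped because it shares three of its four primers with the top-ranked set, so the two more distinct sets ranked below it are chosen instead.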

Best practices are likely to evolve as SWGA is used more frequently. To facilitate this, the project source repository web page contains a tutorial on the program’s operation and more extensive documentation on each parameter and module as well as a link to a user mailing list (https://github.com/songlab-cal/swga2).

Dryad DOI

https://doi.org/10.5061/dryad.3n5tb2rm2.

Supporting information

S1 Appendix. Primer set search algorithm.

(PDF)

S1 Table. Primer amplification data.

(PDF)

S2 Table. Primer set statistics and sequences.

(PDF)

S3 Table. Percent of reads mapping to Prevotella and Human (background) sequences (15 Mbp sequenced).

(PDF)

S4 Table. Proportion of called bases matching the Prevotella reference genome demonstrates that SWGA does not introduce sequencing errors.

(PDF)

Acknowledgments

We are grateful to Paul Planet and Prioty Sarwar for providing genomic DNA.

Data Availability

The swga2.0 program and all dependencies can be downloaded into a conda environment from https://anaconda.org/janedwivedi/swga2. The source repository, tutorials, and documentation can be found at https://github.com/songlab-cal/swga2. All amplification and sequencing data presented can be found in the supplemental materials or in the Dryad data repository https://doi.org/10.5061/dryad.3n5tb2rm2.

Funding Statement

This research is supported in part by the National Institutes of Health (R21-AI137433 (DB,YSS), R35-GM134922 (YSS), and R01- AI142572 (DB)), and the Burroughs Wellcome Fund (1012376 (DB)). The funding agencies had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Nosil P, Buerkle A. Population Genomics. Nature Education Knowledge. 2010;3(10):8.
2. Lasken RS, McLean JS. Recent advances in genomic DNA sequencing of microbial species from single cells. Nature Reviews Genetics. 2014;15:577–584. doi: 10.1038/nrg3785
3. Seth-Smith HMB, Harris SR, Skilton RJ, Radebe FM, Golparian D, Shipitsyna E, et al. Whole-genome sequences of Chlamydia trachomatis directly from clinical samples without culture. Genome Research. 2013;23(5):855–866. doi: 10.1101/gr.150037.112
4. Richardson MF, Weinert LA, Welch JJ, Linheiro RS, Magwire MM, Jiggins FM, et al. Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster. PLOS Genetics. 2012;8(12):e1003129. doi: 10.1371/journal.pgen.1003129
5. Pain A, Böhme U, Berry AE, Mungall K, Finn RD, Jackson AP, et al. The genome of the simian and human malaria parasite Plasmodium knowlesi. Nature. 2008;455:799–803. doi: 10.1038/nature07306
6. Mardis ER. Next-Generation DNA Sequencing Methods. Annual Review of Genomics and Human Genetics. 2008;9(1):387–402. doi: 10.1146/annurev.genom.9.081307.164359
7. Schmeisser C, Steele H, Streit WR. Metagenomics, biotechnology with non-culturable microbes. Applied Microbiology and Biotechnology. 2007;75(5):955–962. doi: 10.1007/s00253-007-0945-5
8. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A Bioinformatician’s Guide to Metagenomics. Microbiology and Molecular Biology Reviews. 2008;72(4):557. doi: 10.1128/MMBR.00009-08
9. Eisen JA. Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLOS Biology. 2007;5(3):e82. doi: 10.1371/journal.pbio.0050082
10. Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, et al. Target-enrichment strategies for next-generation sequencing. Nature Methods. 2010;7:111–118. doi: 10.1038/nmeth0610-479c
11. Blainey PC. The future is now: single-cell genomics of bacteria and archaea. FEMS Microbiology Reviews. 2013;37(3):407–427. doi: 10.1111/1574-6976.12015
12. Leichty AR, Brisson D. Selective Whole Genome Amplification for Resequencing Target Microbial Species from Complex Natural Samples. Genetics. 2014;198(2):473–481. doi: 10.1534/genetics.114.165498
13. Rutledge GG, Ariani CV. Finding the needle in the haystack. Nature Reviews Microbiology. 2017;15:136. doi: 10.1038/nrmicro.2017.7
14. Dean FB, Nelson JR, Giesler TL, Lasken RS. Rapid Amplification of Plasmid and Phage DNA Using Phi29 DNA Polymerase and Multiply-Primed Rolling Circle Amplification. 2001;11(6):1095–1099.
15. Pinard R, de Winter A, Sarkis GJ, Gerstein MB, Tartaro KR, Plant RN, et al. Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics. 2006;7(1):216. doi: 10.1186/1471-2164-7-216
16. Banér J, Mendel-Hartvig M, Nilsson M, Landegren U. Signal amplification of padlock probes by rolling circle replication. Nucleic Acids Research. 1998;26(22):5073–5078. doi: 10.1093/nar/26.22.5073
17. Clarke EL, Sundararaman SA, Seifert SN, Bushman FD, Hahn BH, Brisson D. swga: a primer design toolkit for selective whole genome amplification. Bioinformatics. 2017;33(14):2071–2077. doi: 10.1093/bioinformatics/btx118
18. Sundararaman SA, Plenderleith LJ, Liu W, Loy DE, Learn GH, Li Y, et al. Genomes of cryptic chimpanzee Plasmodium species reveal key evolutionary events leading to human malaria. Nature Communications. 2016;7(7):11078. doi: 10.1038/ncomms11078
19. Guggisberg AM, Sundararaman SA, Lanaspa M, Moraleda C, González R, Mayor A, et al. Whole-Genome Sequencing to Evaluate the Resistance Landscape Following Antimalarial Treatment Failure With Fosmidomycin-Clindamycin. The Journal of Infectious Diseases. 2016;214(7):1085–1091. doi: 10.1093/infdis/jiw304
20. Oyola SO, Ariani CV, Hamilton WL, Kekre M, Amenga-Etego LN, Ghansah A, et al. Whole genome sequencing of Plasmodium falciparum from dried blood spots using selective whole genome amplification. Malaria Journal. 2016;15(1):597. doi: 10.1186/s12936-016-1641-7
21. Cowell AN, Loy DE, Sundararaman SA, Valdivia H, Fisch K, Lescano AG, et al. Selective Whole-Genome Amplification Is a Robust Method That Enables Scalable Whole-Genome Sequencing of Plasmodium vivax from Unprocessed Clinical Samples. mBio. 2017;8(1):e02257–16. doi: 10.1128/mBio.02257-16
22. Loy DE, Plenderleith LJ, Sundararaman SA, Liu W, Gruszczyk J, Chen YJ, et al. Evolutionary history of human Plasmodium vivax revealed by genome-wide analyses of related ape parasites. Proceedings of the National Academy of Sciences. 2018;115(36):8450–8459. doi: 10.1073/pnas.1810053115
23. Small ST, Labbé F, Coulibaly YI, Nutman TB, King CL, Serre D, et al. Human Migration and the Spread of the Nematode Parasite Wuchereria bancrofti. Molecular Biology and Evolution. 2013;36(9):1931–1941. doi: 10.1093/molbev/msz116
24. Morgan AP, Brazeau NF, Ngasala B, Mhamilawa LE, Denton M, Msellem M, et al. Falciparum malaria from coastal Tanzania and Zanzibar remains highly connected despite effective control efforts on the archipelago. Malaria Journal. 2020;19(1):47. doi: 10.1186/s12936-020-3137-8
25. Cocking JH, Deberg M, Schupp J, Sahl J, Wiggins K, Porty A, et al. Selective whole genome amplification and sequencing of Coxiella burnetii directly from environmental samples. Genomics. 2020;112(2):1872–1878. doi: 10.1016/j.ygeno.2019.10.022
26. Itsko M, Retchless AC, Joseph SJ, Turner AN, Bazan JA, Sadji AY, et al. Full molecular typing of Neisseria meningitidis directly from clinical specimens for outbreak investigation. Journal of Clinical Microbiology. 2020;58(12). doi: 10.1128/JCM.01780-20
27. Osborne A, Manko E, Takeda M, Kaneko A, Kagaya W, Chan C, et al. Characterizing the genomic variation and population dynamics of Plasmodium falciparum malaria parasites in and around Lake Victoria, Kenya. Scientific Reports. 2021;11(1):19809. doi: 10.1038/s41598-021-99192-1
28. Ibrahim A, Benavente ED, Nolder D, Proux S, Higgins M, Muwanguzi J, et al. Selective whole genome amplification of Plasmodium malariae DNA from clinical samples reveals insights into population structure. Scientific Reports. 2020;10(1). doi: 10.1038/s41598-020-67568-4
29. Benavente ED, Gomes AR, Silva JRD, Grigg M, Walker H, Barber BE, et al. Whole genome sequencing of amplified Plasmodium knowlesi DNA from unprocessed blood reveals genetic exchange events between Malaysian Peninsular and Borneo subpopulations. Scientific Reports. 2019;9(1):9873. doi: 10.1038/s41598-019-46398-z
30. Thurlow CM, Joseph SJ, Ganova-Raeva L, Katz SS, Pereira L, Chen C, et al. Selective Whole-Genome Amplification as a Tool to Enrich Specimens with Low Treponema pallidum Genomic DNA Copies for Whole-Genome Sequencing. mSphere. 2022;7(3):e0000922. doi: 10.1128/msphere.00009-22
31. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27(6):764–770. doi: 10.1093/bioinformatics/btr011
32. Allawi HT, SantaLucia J. Thermodynamics and NMR of internal G.T mismatches in DNA. Biochemistry. 1997;36(34):10581–10594. doi: 10.1021/bi962590c
33. Ali MM, Li F, Zhang Z, Zhang K, Kang DK, Ankrum JA, et al. Rolling circle amplification: a versatile tool for chemical biology, materials science and medicine. Chemical Society Reviews. 2014;43(10):3324–3341. doi: 10.1039/c3cs60439j
34. Dieffenbach CW, Lowe TM, Dveksler GS. General concepts for PCR primer design. Genome Research. 1993;3(3):S30–S37. doi: 10.1101/gr.3.3.S30
35. SantaLucia J Jr, Hicks D. The Thermodynamics of DNA Structural Motifs. Annual Review of Biophysics and Biomolecular Structure. 2004;33(1):415–440. doi: 10.1146/annurev.biophys.32.110601.141800
36. SantaLucia J Jr. Physical Principles and Visual-OMP Software for Optimal PCR Design. In: Yuryev A, editor. PCR Primer Design. Totowa, NJ: Humana Press; 2007. p. 3–33.
37. Teyssier NB, Chen A, Duarte EM, Sit R, Greenhouse B, Tessema SK. Optimization of whole-genome sequencing of Plasmodium falciparum from low-density dried blood spot samples. Malaria Journal. 2021;20:116. doi: 10.1186/s12936-021-03630-4
38. Kryazhimskiy S, Rice DP, Jerison ER, Desai MM. Global epistasis makes adaptation predictable despite sequence-level stochasticity. Science. 2014;344(6191):1519–1522. doi: 10.1126/science.1250939
39. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079. doi: 10.1093/bioinformatics/btp352
40. Blanco L, Bernad A, Salas M. Phi29 DNA Polymerase; 1993.
41. Field TR, Sibley CD, Parkins MD, Rabin HR, Surette MG. The genus Prevotella in cystic fibrosis airways. Anaerobe. 2010;16(4):337–344. doi: 10.1016/j.anaerobe.2010.04.002
42. Rogers GB, Carroll MP, Serisier DJ, Hockey PM, Jones G, Bruce KD. Characterization of bacterial community diversity in cystic fibrosis lung infections by use of 16S ribosomal DNA terminal restriction fragment length polymorphism profiling. Journal of Clinical Microbiology. 2004;42(11):5176–5183. doi: 10.1128/JCM.42.11.5176-5183.2004
43. Ibrahim M, Subramanian A, Anishetty S. Comparative pan genome analysis of oral Prevotella species implicated in periodontitis. Functional and Integrative Genomics. 2017;17(5). doi: 10.1007/s10142-017-0550-3
44. Fleischmann RD, Alland D, Eisen JA, Carpenter L, White O, Peterson J, et al. Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. Journal of Bacteriology. 2002;184(19). doi: 10.1128/JB.184.19.5479-5490.2002
45. Pilling OA, Grace CA, Reis-Cunha JL, Berry AS, Mitchell MW, Yu JA, et al. Selective whole-genome amplification reveals population genetics of Leishmania braziliensis directly from patient skin biopsies. medRxiv. 2022. doi: 10.1101/2022.09.06.22279552
46. Rizk G, Lavenier D, Chikhi R. DSK: k-mer counting with very low memory usage. Bioinformatics. 2013;29(5):652–653. doi: 10.1093/bioinformatics/btt020

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010137.r001

Decision Letter 0

Manja Marz

18 Jul 2022

Dear Dr. Brisson,

Thank you very much for submitting your manuscript "A fast machine-learning-guided primer design pipeline for selective whole genome amplification" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Manja Marz

Software Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this manuscript the authors introduce version 2 of the SWGA pipeline, which optimizes primer design for selective whole genome amplification. This is a very valuable tool for enriching target DNA for sequencing in cases where samples would typically be dominated by background or host DNA. In the new version, the authors have added several speed-ups to the original algorithm and used a machine learning approach to identify efficient primer features. The new additions to the pipeline are substantial and warrant a second paper to introduce these enhancements. The authors demonstrate the performance of swga2.0 by designing and testing 6 primer sets for amplifying Prevotella melaninogenica within a human background. The manuscript is well organized and clearly written in a way that is understandable to a non-computational audience, and the best practices section at the end is very helpful.

The authors provide publicly available, version-controlled, and well-documented code with appropriate documentation on installation and workflow. A local copy of the codebase can easily be installed using pip; however, it would be accessible to a wider audience if it could be installed using Conda (or similar) with all dependencies. The name of the software is different from what is reported in the paper, which may be confusing. There do not appear to be any unit tests.

The sequencing and raw data were not provided with the manuscript and would be required to reproduce the results. The primer sequences were also not provided as far as I can tell.

There are not many other tools to directly compare to, so the authors mostly refer to previous results of swga1.0, but they did not do a direct comparison. Because the focus of this paper is on improvements from version 1.0, it would be helpful if the authors provided a summary table that compares the features between the old and new versions. To drive home the benefit of the newly added primer efficacy filter, it would be good to build sets using swga1.0 and evaluate how often the “best” sets contain primers that are predicted to be low efficiency and at what proportion. This would lend support to the claims in the paper and would be fairly quick to run without requiring any additional sequencing.

Furthermore, given that the multiprocessing speedups are another focus of the paper, I recommend that a more rigorous comparison be performed between the two versions. A time was provided for a single run of v2.0 and the time for v1.0 was given as over two weeks. The number of processors and the memory requirements were not provided. It would be good to know how many processors were used and whether there are diminishing returns. Was swga1.0 able to find a result after running for 2 weeks or did it get stuck? I believe that I have been able to design SWGA primers in less than two weeks using swga1.0, so it is unclear why it took so long. Perhaps it would be good to try a less complicated design to compare the efficiency of the two versions so that we can better evaluate the improvement from each of the speedups. It would be great to see a breakdown of the estimated time saved due to each of the improvements listed in the section titled “Computational costs are reduced from weeks to minutes.”
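One way to report the scaling behaviour asked about here is to tabulate speedup and parallel efficiency from wall-clock times measured at several process counts. The sketch below is a generic illustration; the function name `scaling_table` and the timings are hypothetical and are not measurements of swga1.0 or swga2.0.

```python
from typing import Dict


def scaling_table(wall_seconds: Dict[int, float]) -> Dict[int, Dict[str, float]]:
    """Compute speedup and parallel efficiency relative to the 1-process run."""
    baseline = wall_seconds[1]
    table = {}
    for n_procs in sorted(wall_seconds):
        speedup = baseline / wall_seconds[n_procs]
        # Efficiency of 1.0 means perfect scaling; lower values mean diminishing returns.
        table[n_procs] = {"speedup": speedup, "efficiency": speedup / n_procs}
    return table


# Hypothetical timings in seconds; NOT measurements from either swga version.
timings = {1: 800.0, 2: 400.0, 4: 250.0, 8: 200.0}
table = scaling_table(timings)
for n, row in table.items():
    print(f"{n} processes: speedup {row['speedup']:.2f}x, efficiency {row['efficiency']:.2f}")
```

Efficiency that falls well below 1 as processes are added would indicate diminishing returns from further parallelization.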

About half of the primer sets failed using swga2.0. There was some mention in the discussion about a relationship between the successful sets and lower mean binding distance in the target genome, but it would be nice to have a table of all the observed stats for the sets in addition to the parameter settings.

The discussion mentions that this approach would work on metagenomic samples (last paragraph). Has this been done and how would swga2.0 be able to design primer sets for such complex backgrounds? How many background genomes could reasonably be added? How would you handle all the potential plasmids in such a sample?

The last paragraph of the discussion abruptly introduces a number of new ideas that are difficult to follow. These should be expanded on to provide clarity if they are important or eliminated.

The EquiPhi29 enzyme is more thermally stable and produces higher yield than Phi29. Using this enzyme may allow longer primers with higher Tm. Could the authors provide some discussion about whether swga2.0 can design appropriate primer sets for use with this enzyme and some recommended parameter adjustments.

Mispriming in the rolling circle plasmid amplification experiment was mentioned but there were no results, and it was unclear if that was incorporated into the model.

It would be good to have the number of reads that aligned to the background genome.

The methods do not describe exactly how the target amplification efficiency was calculated.

Minor

Line 10 Abstract: Change “Evaluate” to “evaluates”

Page 3 line 5 and 6: Change “ie” to “i.e.,”

Page 3 second paragraph: change “as well as identifying” to “and identified”

Page 3, 4th line of last paragraph in introduction, beginning with “First, active learning”: this sentence should be reworded, for example, “First, active machine learning was used to identify features that predict primer and primer set efficiency.”

Figure 3 caption, last line: change “axis” to “x-axis”

Page 8, Section “Empirical evaluation of primer sets”: Can you clarify that the FspEI digestion was to suppress mitochondrial amplification and note whether or not swga2.0 has the option to omit mitochondrial sequences?

Page 9 first line: Missing reference and version for samtools.

Page 9, second paragraph, last line: Change “data from the rounds” to “data from rounds”

The text references Figure 3A but there is only a single panel in Figure 3.

The variables described in Stage 4 of the methods are not described until much later in the paper. I would move them up to where they are first mentioned.

Page 13, line 8: Change “eliminate” to “eliminating”

Page 14: Sentence “Highthroughput sequencing of the amplification products from the three effective P. melaninogenica primer sets reached 1× coverage across 25–64% the target genome with 50Mbp sequencing effort while only 27% of the M. tuberculosis was covered at 1× at similar sequencing effort after amplification with the most effective primer set (Figure 5)” is confusing. Figure 5 does not contain data about M. tuberculosis so I would move the figure reference closer to the actual result and clarify again that you are referring to M. tuberculosis results from the previous version of swga.

Page 17, line 9: Change “utilizing using” to “using”

Reviewer #2: Yu et al. describe swga2.0, a pipeline that improves upon the original swga program described by Clarke et al. to identify, screen, and choose primers for use in selective whole-genome amplification. The authors apply swga2.0 to identify candidate sWGA primer sets for enrichment of P. melaninogenica and then test the best performing primer sets experimentally.

The manuscript is well-written and easy to follow. Major advances of the new pipeline are greatly reduced computational time (the prior version required weeks on a high-powered computing cluster, the current version purports to require hours on a personal laptop) and use of machine-learning methods to select primers that are likely to perform well. The model used to predict amplification efficiency (or more accurately, probability of poor amplification) could be very useful for a variety of applications, though experimental data for enzymes other than phi29 would be required. Several companies offer services for highly multiplexed primer design but do not make their code publicly available. The approaches described here could be useful not only for sWGA primer design but for other applications as well, including design of highly multiplexed amplicon panels. The manuscript could be improved in several ways as outlined below, namely application to more than one pathogen and explicit comparison of swga2.0 to the original swga program.

MAJOR

1. Table 2: It is important to emphasize clearly that these features are specific for phi29. Some groups have begun using EquiPhi29 to achieve higher reaction temperatures (enabling longer/more specific primers). Others may want to adapt these methods for other enzymes such as Bst. If the features employed by the program are phi29-specific, then the model will probably not perform well for these applications.

2. Page 4: It would be useful to delineate more clearly how swga2.0 differs from the original swga program published by Clarke et al., both in terms of the approach and empirical differences in output. As for approach, in the figure on page 4, there appear to be new steps in boxes 1 (jellyfish), 2 (homodimerization probability, runs of repeats), and most importantly in boxes 3 (random forest model) and 4 (coverage ratio). Some of these differences are covered on page 17 in the Discussion, but a table highlighting differences/improvements would be helpful. As for empirical differences in output, the authors include some discussion of differences on page 18 but are not comparing apples-to-apples. Differences between P. melaninogenica swga2.0 output and M. tuberculosis swga output are interesting, but direct comparison of swga vs swga2.0 for the same organism is needed to delineate their differences clearly. Did the authors run the original sWGA program on P. melaninogenica? If so, how did the output compare? If not, can they run it using similar parameters and comment on differences?

3. How did variant calls in the P. melaninogenica sequencing data compare to the reference and to the non-sWGA sample? Phi29 is reported to have a very low error rate, but it is important to confirm with their own data if possible. Another issue of relevance in some cases is the ability to investigate within-host diversity after sWGA, specifically whether allele frequencies can still be estimated with confidence after sWGA enrichment (e.g., Oyola et al., PMID 27998271). Though it is probably out of the scope of this manuscript to test experimentally, commenting on these effects in the discussion is probably worthwhile.

4. Confidence in the program’s ability to produce effective sWGA primer sets would be increased if more than one pathogen were tested experimentally. In the original swga publication, Clarke et al. reported differences in sWGA performance across several different organisms. I wonder if the new swga2.0 algorithm might overcome some of these differences, or if they will persist?

MINOR

Page 2: sWGA has now been used successfully for Treponema pallidum subsp. pallidum (causative agent of syphilis); see Thurlow CM et al., mSphere 2022, PMID 35491834.

Page 3: Unclear why ‘SWG amplification’ is used instead of ‘sWGA’.

Page 5: Please define each of the ratios included in the score equation listed at the bottom of the page.

Pages 6-7 and Table 2: The explicit discussion of the thermodynamically principled features included in the model is helpful.

Page 8: Why was the sample digested? I assume this is to linearize circular bacterial DNA. Has this been shown to improve the efficiency of MDA/phi29 reactions? In this paper, DNA was digested with FspEI, whereas the previous study used NarI (Clarke et al., 2017). Methylation digestion with different enzymes and experimental procedures can change the sWGA results, such as the proportion of reads mapping to the target organism and genome coverage (see Teyssier NB et al., PMID 33637093).

Page 8: For clarity, please specify exactly what 3.5mM total of sWGA primers means so that there is no ambiguity. Ex: “…(e.g. in a set of 10 primers, each primer was added at 0.35mM so that the total molarity of primers was 3.5mM)” if that is the case, or other wording depending on the actual conditions. There has been variability in published sWGA methods on this topic, due in part to ambiguity in the description of the exact conditions used during sWGA in some past publications.
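The ambiguity raised in this comment comes down to simple arithmetic, sketched below. The two helper names are hypothetical, the numbers are the ones used in the reviewer's example, and the unit is whatever the protocol states. Which of the two readings matches the actual protocol is for the authors to specify.

```python
def per_primer_concentration(total_concentration: float, n_primers: int) -> float:
    """Reading 1: the stated value is the TOTAL, split equally among primers."""
    if n_primers <= 0:
        raise ValueError("n_primers must be positive")
    return total_concentration / n_primers


def total_from_per_primer(per_primer: float, n_primers: int) -> float:
    """Reading 2: the stated value applies to EACH primer, so totals add up."""
    if n_primers <= 0:
        raise ValueError("n_primers must be positive")
    return per_primer * n_primers


# Reviewer's example: a set of 10 primers and a stated value of 3.5.
each = per_primer_concentration(3.5, 10)   # 0.35 per primer, 3.5 in total
total = total_from_per_primer(3.5, 10)     # 3.5 per primer, 35 in total
print(f"reading 1: {each:.2f} per primer; reading 2: {total:.1f} total")
```

The two readings differ by a factor equal to the number of primers in the set, which is why an explicit statement of the convention matters.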

Page 18: The general suggestions provided are very helpful for scientists looking to use this methodology to design and test their own custom sets.

Page 19: Mention of CRISPR and gRNA design in the concluding sentence doesn’t follow from the manuscript and was surprising. Consider removing, or moving up into the discussion.

Methods: In swga2.0, is it possible to visualize/export exact locations of primer binding sites?

Methods: Is it possible to set the number of mismatched bases permitted in the primers themselves? Can the allowable length of mononucleotide repeats be modified?

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No: The code is available and well documented but I do not see the sequencing reads, the raw data, or the primer sequences.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Jonathan B. Parr

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology, see http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. 2023 Apr 17;19(4):e1010137. doi: 10.1371/journal.pcbi.1010137.r002

Author response to Decision Letter 0

13 Sep 2022

Attachment

Submitted filename: ReviewerComments.pdf (PDF, 274.3 KB)

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010137.r003

Decision Letter 1

Manja Marz

9 Feb 2023

Dear Dr. Brisson,

Thank you very much for submitting your manuscript "A fast machine-learning-guided primer design pipeline for selective whole genome amplification" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response that, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Manja Marz

Software Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: This resubmission is much improved, and the authors have addressed my major concerns. I have a few minor comments.

The authors added Table 1, which helpfully summarizes the improvements in swga2.0 compared to swga1.0 and breaks down the performance improvements for all stages. However, the benchmarking details could be a bit more explicit in a computational biology article. For example, I had to go to the GitHub page to find swga2.0’s default setting for number of CPUs (which I assume is what was used). After finding out that it defaults to “all”, I then needed to look up what that might be on the specific model of MacBook Pro that was used (4, maybe?). I think that this is important information to have in the article to evaluate the 10-fold improvement in speed. Another minor point on Table 1: the first column could include a short name/description of each of the stages in addition to just the stage number so that the reader does not have to refer to the text to understand the table.
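The hardware details requested here can be captured alongside any timing run with the Python standard library. The sketch below is illustrative only: `benchmark_environment` is a hypothetical helper, and how swga2.0 itself resolves its “all” CPU setting is not shown in this review, so no claim is made that it uses these calls.

```python
import os
import platform


def benchmark_environment() -> dict:
    """Collect basic hardware/software details worth reporting with timings."""
    return {
        # Logical CPUs visible to the OS; may be None if it cannot be determined,
        # and may exceed the number of physical cores.
        "logical_cpus": os.cpu_count(),
        "machine": platform.machine(),
        "system": platform.system(),
        "python_version": platform.python_version(),
    }


env = benchmark_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

Reporting these values next to each benchmark would let readers judge a stated speedup without looking up the test machine.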

My previous comment about the method for calculating amplification efficiency was referring to the empirical values, not the predictions using the model. Some detail about the method used to measure primer amplification of the plasmids would be helpful, even if it is only included in the caption for Supplemental Table 1.

All other issues that I have raised have been addressed by the authors.

Reviewer #2: I appreciate the authors’ responses to my comments and the associated manuscript changes. One minor comment as below:

Discussion: The authors include the following statement: “Also similar to prior results (Oyola et al., 2016), protocols developed with swga2.0 retain the ability to investigate within-host microbial diversity using SWGA enrichment (Pilling et al., 2022), without introducing errors, as the SWGA biochemistry remains identical (Table S4).”

We have found that reliable estimation of allele frequencies after sWGA is not straightforward, due in part to exceptionally high depth of coverage in some regions (“jackpotting”) but not others. If the authors do not have empirical data to support the statement above, I would suggest softening the language to indicate that this is their expectation but not an observation (i.e., replace “retain” with “are expected” or similar).

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Jonathan Parr

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. 2023 Apr 17;19(4):e1010137. doi: 10.1371/journal.pcbi.1010137.r004

Author response to Decision Letter 1

16 Feb 2023

Attachment

Submitted filename: Response_To_Reviewers.pdf (PDF, 191.1 KB)

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010137.r005

Decision Letter 2

Manja Marz

23 Mar 2023

Dear Dr. Brisson,

We are pleased to inform you that your manuscript 'A fast machine-learning-guided primer design pipeline for selective whole genome amplification' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Manja Marz

Software Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1010137.r006

Acceptance letter

Manja Marz

3 Apr 2023

PCOMPBIOL-D-22-00648R2

A fast machine-learning-guided primer design pipeline for selective whole genome amplification

Dear Dr Brisson,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Timea Kemeri-Szekernyes

Associated Data

Supplementary Materials

S1 Appendix. Primer set search algorithm. (PDF, 159.4 KB)

S1 Table. Primer amplification data. (PDF, 167.7 KB)

S2 Table. Primer set statistics and sequences. (PDF, 48.7 KB)

S3 Table. Percent of reads mapping to Prevotella and Human (background) sequences (15 Mbp sequenced). (PDF, 46.4 KB)

S4 Table. Proportion of called bases matching the Prevotella reference genome demonstrates that SWGA does not introduce sequencing errors. (PDF, 46.5 KB)

Data Availability Statement

1. Nosil P, Buerkle A. Population Genomics. Nature Education Knowledge. 2010;3(10):8.

2. Lasken RS, McLean JS. Recent advances in genomic DNA sequencing of microbial species from single cells. Nature Reviews Genetics. 2014;15:577–584. doi: 10.1038/nrg3785

3. Seth-Smith HMB, Harris SR, Skilton RJ, Radebe FM, Golparian D, Shipitsyna E, et al. Whole-genome sequences of Chlamydia trachomatis directly from clinical samples without culture. Genome Research. 2013;23(5):855–866. doi: 10.1101/gr.150037.112

4. Richardson MF, Weinert LA, Welch JJ, Linheiro RS, Magwire MM, Jiggins FM, et al. Population genomics of the Wolbachia endosymbiont in Drosophila melanogaster. PLOS Genetics. 2012;8(12):e1003129. doi: 10.1371/journal.pgen.1003129

5. Pain A, Böhme U, Berry AE, Mungall K, Finn RD, Jackson AP, et al. The genome of the simian and human malaria parasite Plasmodium knowlesi. Nature. 2008;455:799–803. doi: 10.1038/nature07306

6. Mardis ER. Next-Generation DNA Sequencing Methods. Annual Review of Genomics and Human Genetics. 2008;9(1):387–402. doi: 10.1146/annurev.genom.9.081307.164359

7. Schmeisser C, Steele H, Streit WR. Metagenomics, biotechnology with non-culturable microbes. Applied Microbiology and Biotechnology. 2007;75(5):955–962. doi: 10.1007/s00253-007-0945-5

8. Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A Bioinformatician’s Guide to Metagenomics. Microbiology and Molecular Biology Reviews. 2008;72(4):557. doi: 10.1128/MMBR.00009-08

9. Eisen JA. Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLOS Biology. 2007;5(3):e82. doi: 10.1371/journal.pbio.0050082

10. Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, et al. Target-enrichment strategies for next-generation sequencing. Nature Methods. 2010;7:111–118. doi: 10.1038/nmeth0610-479c

11. Blainey PC. The future is now: single-cell genomics of bacteria and archaea. FEMS Microbiology Reviews. 2013;37(3):407–427. doi: 10.1111/1574-6976.12015

12. Leichty AR, Brisson D. Selective Whole Genome Amplification for Resequencing Target Microbial Species from Complex Natural Samples. Genetics. 2014;198(2):473–481. doi: 10.1534/genetics.114.165498

13. Rutledge GG, Ariani CV. Finding the needle in the haystack. Nature Reviews Microbiology. 2017;15:136. doi: 10.1038/nrmicro.2017.7

14. Dean FB, Nelson JR, Giesler TL, Lasken RS. Rapid Amplification of Plasmid and Phage DNA Using Phi29 DNA Polymerase and Multiply-Primed Rolling Circle Amplification. 2001;11(6):1095–1099.

15. Pinard R, de Winter A, Sarkis GJ, Gerstein MB, Tartaro KR, Plant RN, et al. Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing. BMC Genomics. 2006;7(1):216. doi: 10.1186/1471-2164-7-216

16. Banér J, Mendel-Hartvig M, Nilsson M, Landegren U. Signal amplification of padlock probes by rolling circle replication. Nucleic Acids Research. 1998;26(22):5073–5078. doi: 10.1093/nar/26.22.5073

17. Clarke EL, Sundararaman SA, Seifert SN, Bushman FD, Hahn BH, Brisson D. swga: a primer design toolkit for selective whole genome amplification. Bioinformatics. 2017;33(14):2071–2077. doi: 10.1093/bioinformatics/btx118

18. Sundararaman SA, Plenderleith LJ, Liu W, Loy DE, Learn GH, Li Y, et al. Genomes of cryptic chimpanzee Plasmodium species reveal key evolutionary events leading to human malaria. Nature Communications. 2016;7(7):11078. doi: 10.1038/ncomms11078

19. Guggisberg AM, Sundararaman SA, Lanaspa M, Moraleda C, González R, Mayor A, et al. Whole-Genome Sequencing to Evaluate the Resistance Landscape Following Antimalarial Treatment Failure With Fosmidomycin-Clindamycin. The Journal of Infectious Diseases. 2016;214(7):1085–1091. doi: 10.1093/infdis/jiw304

20. Oyola SO, Ariani CV, Hamilton WL, Kekre M, Amenga-Etego LN, Ghansah A, et al. Whole genome sequencing of Plasmodium falciparum from dried blood spots using selective whole genome amplification. Malaria Journal. 2016;15(1):597. doi: 10.1186/s12936-016-1641-7

21. Cowell AN, Loy DE, Sundararaman SA, Valdivia H, Fisch K, Lescano AG, et al. Selective Whole-Genome Amplification Is a Robust Method That Enables Scalable Whole-Genome Sequencing of Plasmodium vivax from Unprocessed Clinical Samples. mBio. 2017;8(1):e02257–16. doi: 10.1128/mBio.02257-16

22. Loy DE, Plenderleith LJ, Sundararaman SA, Liu W, Gruszczyk J, Chen YJ, et al. Evolutionary history of human Plasmodium vivax revealed by genome-wide analyses of related ape parasites. Proceedings of the National Academy of Sciences. 2018;115(36):8450–8459. doi: 10.1073/pnas.1810053115

23. Small ST, Labbé F, Coulibaly YI, Nutman TB, King CL, Serre D, et al. Human Migration and the Spread of the Nematode Parasite Wuchereria bancrofti. Molecular Biology and Evolution. 2019;36(9):1931–1941. doi: 10.1093/molbev/msz116

[pcbi.1010137.ref024] 24. Morgan AP, Brazeau NF, Ngasala B, Mhamilawa LE, Denton M, Msellem M, et al. Falciparum malaria from coastal Tanzania and Zanzibar remains highly connected despite effective control efforts on the archipelago. Malaria Journal. 2020;19(1):47. doi: 10.1186/s12936-020-3137-8 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010137.ref025] 25. Cocking JH, Deberg M, Schupp J, Sahl J, Wiggins K, Porty A, et al. Selective whole genome amplification and sequencing of Coxiella burnetii directly from environmental samples. Genomics. 2020;112(2):1872–1878. doi: 10.1016/j.ygeno.2019.10.022 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010137.ref026] 26. Itsko M, Retchless AC, Joseph SJ, Turner AN, Bazan JA, Sadji AY, et al. Full molecular typing of Neisseria meningitidis directly from clinical specimens for outbreak investigation. Journal of Clinical Microbiology. 2020;58(12). doi: 10.1128/JCM.01780-20 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010137.ref027] 27. Osborne A, Manko E, Takeda M, Kaneko A, Kagaya W, Chan C, et al. Characterizing the genomic variation and population dynamics of Plasmodium falciparum malaria parasites in and around Lake Victoria, Kenya. Scientific Reports. 2021;11(1):19809. doi: 10.1038/s41598-021-99192-1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1010137.ref028] 28. Ibrahim A, Benavente ED, Nolder D, Proux S, Higgins M, Muwanguzi J, et al. Selective whole genome amplification of Plasmodium malariae DNA from clinical samples reveals insights into population structure. Scientific Reports. 2020;10(1). doi: 10.1038/s41598-020-67568-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

29. Benavente ED, Gomes AR, Silva JRD, Grigg M, Walker H, Barber BE, et al. Whole genome sequencing of amplified Plasmodium knowlesi DNA from unprocessed blood reveals genetic exchange events between Malaysian Peninsular and Borneo subpopulations. Scientific Reports. 2019;9(1):9873. doi: 10.1038/s41598-019-46398-z

30. Thurlow CM, Joseph SJ, Ganova-Raeva L, Katz SS, Pereira L, Chen C, et al. Selective Whole-Genome Amplification as a Tool to Enrich Specimens with Low Treponema pallidum Genomic DNA Copies for Whole-Genome Sequencing. mSphere. 2022;7(3):e0000922. doi: 10.1128/msphere.00009-22

31. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27(6):764–770. doi: 10.1093/bioinformatics/btr011

32. Allawi HT, SantaLucia J. Thermodynamics and NMR of internal G.T mismatches in DNA. Biochemistry. 1997;36(34):10581–10594. doi: 10.1021/bi962590c

33. Ali MM, Li F, Zhang Z, Zhang K, Kang DK, Ankrum JA, et al. Rolling circle amplification: a versatile tool for chemical biology, materials science and medicine. Chemical Society Reviews. 2014;43(10):3324–3341. doi: 10.1039/c3cs60439j

34. Dieffenbach CW, Lowe TM, Dveksler GS. General concepts for PCR primer design. Genome Research. 1993;3(3):S30–S37. doi: 10.1101/gr.3.3.S30

35. SantaLucia J Jr, Hicks D. The Thermodynamics of DNA Structural Motifs. Annual Review of Biophysics and Biomolecular Structure. 2004;33(1):415–440. doi: 10.1146/annurev.biophys.32.110601.141800

36. SantaLucia J Jr. Physical Principles and Visual-OMP Software for Optimal PCR Design. In: Yuryev A, editor. PCR Primer Design. Totowa, NJ: Humana Press; 2007. p. 3–33.

37. Teyssier NB, Chen A, Duarte EM, Sit R, Greenhouse B, Tessema SK. Optimization of whole-genome sequencing of Plasmodium falciparum from low-density dried blood spot samples. Malaria Journal. 2021;20:116. doi: 10.1186/s12936-021-03630-4

38. Kryazhimskiy S, Rice DP, Jerison ER, Desai MM. Global epistasis makes adaptation predictable despite sequence-level stochasticity. Science. 2014;344(6191):1519–1522. doi: 10.1126/science.1250939

39. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–2079. doi: 10.1093/bioinformatics/btp352

40. Blanco L, Bernad A, Salas M. Phi29 DNA Polymerase; 1993.

41. Field TR, Sibley CD, Parkins MD, Rabin HR, Surette MG. The genus Prevotella in cystic fibrosis airways. Anaerobe. 2010;16(4):337–344. doi: 10.1016/j.anaerobe.2010.04.002

42. Rogers GB, Carroll MP, Serisier DJ, Hockey PM, Jones G, Bruce KD. Characterization of bacterial community diversity in cystic fibrosis lung infections by use of 16S ribosomal DNA terminal restriction fragment length polymorphism profiling. Journal of Clinical Microbiology. 2004;42(11):5176–5183. doi: 10.1128/JCM.42.11.5176-5183.2004

43. Ibrahim M, Subramanian A, Anishetty S. Comparative pan genome analysis of oral Prevotella species implicated in periodontitis. Functional and Integrative Genomics. 2017;17(5). doi: 10.1007/s10142-017-0550-3

44. Fleischmann RD, Alland D, Eisen JA, Carpenter L, White O, Peterson J, et al. Whole-genome comparison of Mycobacterium tuberculosis clinical and laboratory strains. Journal of Bacteriology. 2002;184(19). doi: 10.1128/JB.184.19.5479-5490.2002

45. Pilling OA, Grace CA, Reis-Cunha JL, Berry AS, Mitchell MW, Yu JA, et al. Selective whole-genome amplification reveals population genetics of Leishmania braziliensis directly from patient skin biopsies. medRxiv. 2022. doi: 10.1101/2022.09.06.22279552

46. Rizk G, Lavenier D, Chikhi R. DSK: k-mer counting with very low memory usage. Bioinformatics. 2013;29(5):652–653. doi: 10.1093/bioinformatics/btt020
