Abstract
Summary
A number of methods have been devised to address the need for targeted genomic resequencing. One of these methods, region-specific extraction (RSE), is characterized by the capture of long DNA fragments (15–20 kb) by magnetic beads, after enzymatic extension of oligonucleotides hybridized to selected genomic regions. It is therefore necessary to facilitate the selection of the most appropriate capture oligos for a region of interest: oligos that satisfy constraints on melting temperature (Tm) and free energy (ΔG) while minimizing the formation of primer-dimers in a pooled experiment. Manual design and selection of oligos becomes very challenging as the length and number of targeted regions grow. Here we describe AnthOligo, a web-based application developed to automate the generation of oligo sequences used to target and capture the continuum of large and complex genomic regions. Apart from generating oligos for RSE, this program may have wider applications in the design of customizable internal oligos to be used as baits for gene panel analysis or even probes for large-scale comparative genomic hybridization array processes. AnthOligo was tested by capturing the Major Histocompatibility Complex (MHC) of a random sample.
The application provides users with a simple interface to upload an input file in BED format and customize parameters for each task. The task of probe design in AnthOligo commences when a user uploads an input file and concludes with the generation of a result-set containing an optimal set of region-specific oligos. AnthOligo is currently available as a public web application with URL: http://antholigo.chop.edu.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Massively parallel sequencing, in particular short-read technologies such as exome sequencing, has become an important milestone in genomic diagnosis. Newer technologies (Gnirke et al., 2009; Okou et al., 2007; Tewhey et al., 2009) have emerged, such as long-read sequencing using the linked-read strategy from 10x Genomics (Zheng et al., 2016) and the single-molecule real-time sequencing approach from PacBio (Wenger et al., 2019). They primarily focus on improving coverage over complex genomic regions to achieve finer resolution of sequence and structural rearrangements. Combining an approach that provides long DNA fragments for sequencing with a low-cost targeted enrichment method would result in an optimal capture methodology.
Region-specific extraction (RSE) of DNA is a solution-based technique for enrichment of defined genomic regions of interest. The method’s cost-effective target-enrichment approach allows for the capture of longer sequence templates up to 20 kb and a uniform depth of coverage across a region of interest (ROI).
Probe design for targeted enrichment is a requirement for any NGS test development. Although many stand-alone tools and web applications exist to address the requirements of various target-enrichment approaches, none can be applied directly to the RSE method (Ben Zakour, 2004; Francis et al., 2017; Ilie et al., 2013; Jabado et al., 2006; Nordberg, 2005; Rouillard et al., 2003; Rychlik, 2007; Shen et al., 2010; Vallone and Butler, 2004; Wingo et al., 2017). While tools such as PROBER (Navin et al., 2006) design paired oligonucleotide probes and can account for spacing between the probes, their design criteria are limited to probe spacing of up to 2000 bp and target regions requiring a 10 kb unique subsequence. This is of limited use when oligonucleotide probes are needed to capture extremely large, complex regions of up to millions of bases with target capture methods that may require probes spaced more than 2 kb apart. The advantage of the RSE oligonucleotide design method is the ability to ‘space’ the oligos evenly at a pre-selected distance that can be thousands of bases. Thus, the tool achieves equivalent target specificity for large complex regions with fewer probes. Additionally, tools such as OligoArray (Rouillard et al., 2003) and its upgraded counterpart OligoMiner (Beliveau et al., 2018) seem to satisfy most of the criteria required to design oligos compatible with RSE but have not been shown to be applicable outside the FISH method. Only a limited set of probes is available for each chromosome, and any customization seems to require considerable programming knowledge to generate a new set of probes. iFISH (Gelali et al., 2019) is another valuable resource, a publicly available repository of genome-wide FISH oligo probes.
However, AnthOligo’s ability to analyze oligo interactions across multiple target regions for pooled, multiplexed reactions, together with the customization of parameters for each task, makes this application rather unique. Prior to automation of oligonucleotide design for capture/enrichment, an analyst would have to painstakingly filter the oligonucleotides to create sets of oligos manually by scanning a large matrix of primer-dimer interactions. The complexity and duration of this task could increase exponentially as the size or number of target regions increased. By streamlining the process of oligo design via an automated, statistically motivated downstream processing algorithm (Ben Zakour, 2004; Francis et al., 2017; Shen et al., 2010), the tool reduces hands-on time by at least 10-fold, according to the time analysis from the validation dataset (Supplementary Table S1). In summary, we present AnthOligo, an automated application to design evenly spaced capture oligos when provided with coordinates for genome-specific regions of interest. We have successfully used AnthOligo to design viable capture oligos for the zebrafish genome (Gupta et al., 2010) and additionally targeted and captured a 4 Mb section of the highly complex MHC region in the human genome (Dapprich et al., 2016) in a solution-based capture. Most recently, additional sets of oligos have been designed, enriching the MHC by including publicly available MHC reference sequences from other cell lines that were either partially known or fully completed (Horton et al., 2008). The newest set of oligos has been successfully used in our new study (unpublished results).
2 Implementation
2.1. Step 1
A region in the input file can range from a single exon to multiple megabases. A sliding window approach spanning 2 kb overlapping every 100 bp ensures thorough coverage of the region (Fig. 1). Primer3 (Untergasser et al., 2012) is used to generate internal oligos within each window using a repeat-masked reference sequence (Chen, 2004). UCSC BLAT (Kent, 2002) is used to inspect the sequence specificity of the oligos at a percent-identity threshold set to 95%. The ‘susceptibility’ of the oligos to form hairpins and duplexes is estimated from Tm and ΔG predictions by Mfold (Zuker et al., 1999) and UNAFold (Markham and Zuker, 2008), with dimer stability based on the parameters of SantaLucia (Owczarzy et al., 2008; Rouillard et al., 2003; SantaLucia, 1998; Vallone and Butler, 2004).
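The window tiling described above can be sketched as follows. This is a minimal illustration, not the AnthOligo implementation: the function name `sliding_windows` is hypothetical, and reading "2 kb overlapping every 100 bp" as a 2000 bp window advanced in 100 bp steps is an assumption, so both values are exposed as parameters.

```python
from typing import Iterator, Tuple


def sliding_windows(start: int, end: int, size: int = 2000,
                    step: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield half-open (win_start, win_end) windows covering [start, end).

    Assumption: `size` is the window length and `step` the offset between
    consecutive window starts; the final window is clipped to the region end.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    pos = start
    while pos < end:
        yield pos, min(pos + size, end)
        if pos + size >= end:  # this window already reaches the region end
            break
        pos += step


# Hypothetical 2.3 kb region in 0-based coordinates.
windows = list(sliding_windows(0, 2300))
```

Each window would then be handed to the oligo generator; a region shorter than one window simply yields a single clipped window.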
Fig. 1.
AnthOligo workflow. Step 1: For each ROI, this step displays the process of design and filtration of oligos that pass established thresholds. Step 2: A graph-based approach is used to create candidate oligo sets based on free energy thresholds from primer–dimer interaction. Step 3: The resulting candidate sets of oligos from each region are then tested for primer–dimer interactions across other regions to generate the final collection of oligo sets.
2.2. Step 2
For each ROI, oligos that pass applicable thresholds from Step 1 are considered ‘candidates’. The algorithm models the storage of oligos and specific properties like ‘primer–dimer interactions’ and ‘association by distance’ in a directed acyclic graph (Francis et al., 2017) (Fig. 1). For the RSE method to be able to capture the entire ROI, the first few ‘seed’ oligos must lie within a short window at the start of the region. In the graph object, seed oligos are the ‘root nodes’ and the oligos associated with them become ‘child nodes’. Each ‘edge’ represents the user-defined distance between the root and child nodes. A depth-first search is then carried out to walk through ‘completed paths’ in each graph. A path is ‘complete’ when the ‘leaf’ oligo is found within the end of the target region. Each completed path forms a ‘set’ of oligos for the given region.
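The graph construction and path search described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `build_graph` and `complete_paths`, the `tolerance` around the target spacing, the `edge_window` that defines seeds and leaves, and the representation of primer–dimer interactions as a set of oligo pairs are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Set


def build_graph(positions: Dict[str, int], spacing: int,
                tolerance: int) -> Dict[str, List[str]]:
    """Link oligo a -> b when b lies `spacing` +/- `tolerance` bp downstream."""
    if tolerance >= spacing:
        raise ValueError("tolerance must be smaller than spacing")
    graph: Dict[str, List[str]] = {name: [] for name in positions}
    for a, pos_a in positions.items():
        for b, pos_b in positions.items():
            # Edges only point downstream, so the graph is acyclic.
            if abs((pos_b - pos_a) - spacing) <= tolerance:
                graph[a].append(b)
    return graph


def complete_paths(positions: Dict[str, int], graph: Dict[str, List[str]],
                   region_start: int, region_end: int, edge_window: int,
                   dimers: Set[FrozenSet[str]]) -> List[List[str]]:
    """Depth-first search from seed oligos to leaves near the region end."""
    paths: List[List[str]] = []

    def dfs(node: str, path: List[str]) -> None:
        if region_end - positions[node] <= edge_window:  # leaf reached
            paths.append(list(path))
            return
        for child in graph[node]:
            # Reject a child that dimerizes with any oligo already chosen.
            if all(frozenset((child, prev)) not in dimers for prev in path):
                path.append(child)
                dfs(child, path)
                path.pop()

    for name, pos in positions.items():
        if pos - region_start <= edge_window:  # seed oligo
            dfs(name, [name])
    return paths


# Hypothetical oligos (name -> start position) in a 20 kb region.
positions = {"A": 200, "B": 5100, "C": 5500, "D": 10300, "E": 15200, "F": 19900}
dimers = {frozenset(("B", "E"))}  # assumed primer-dimer pair
graph = build_graph(positions, spacing=5000, tolerance=500)
sets = complete_paths(positions, graph, 0, 20000, 1000, dimers)
```

In this toy input the path through B is abandoned because B and E are flagged as a dimer pair, so the only completed set runs through C.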
2.3. Step 3
Design of optimal collection of oligos for target capture using multiplex PCR requires combinatorial optimization solutions (Nordberg, 2005; Wingo et al., 2017) (Fig. 1). The number of heterodimer combinations C for n oligos for each input region can be calculated as:
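As a worked illustration, assuming C counts the unordered pairs of distinct oligos within one region (an assumption; self-dimers and hairpins would be evaluated separately), the count is the binomial coefficient:

```latex
C = \binom{n}{2} = \frac{n!}{2!\,(n-2)!} = \frac{n(n-1)}{2}
```

For example, n = 10 candidate oligos give C = 45 pairs to evaluate.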
To get a resultant ‘combination of set of oligos’ across all of the user-provided input regions, region-specific oligo sets are cross-compared across the input regions to ensure that oligos across regions do not dimerize with each other in solution. Every m, k, p number of oligos across M, K, P additional input regions increases this number of combinations somewhat exponentially:
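One way to make this growth concrete, under the same pairwise-counting assumption, is to pool n, m, k and p oligos from four regions. The symbols C_pooled, C_cross, S_r and R are introduced here for illustration only:

```latex
C_{\text{pooled}} = \binom{n+m+k+p}{2}, \qquad
C_{\text{cross}} = nm + nk + np + mk + mp + kp, \qquad
N_{\text{combos}} = \prod_{r=1}^{R} S_r
```

Here C_cross counts only the pairs that span two regions, S_r is the number of candidate oligo sets for region r, and R is the number of regions. With four regions of 10 oligos each, C_pooled = 780 and C_cross = 600. If every region offers S candidate sets, the number of ways to choose one set per region is S^R, which grows exponentially in R.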
With increasing region size and number of regions, this becomes computationally intensive, akin to the NP-complete ‘knapsack problem’. Heuristic optimization allows for scalability without sacrificing quality of the capture design by returning the first available combination of oligo sets that satisfies our thresholds.
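The "first available combination" idea can be sketched as a backtracking search that stops at the first selection of one oligo set per region with no cross-region dimers. This is a sketch under stated assumptions rather than the authors' heuristic, whose details are not given here: the function name `first_compatible_combination` and the representation of dimers as a set of oligo pairs are hypothetical.

```python
from typing import FrozenSet, List, Optional, Sequence, Set

OligoSet = Sequence[str]


def first_compatible_combination(
        region_sets: Sequence[Sequence[OligoSet]],
        dimers: Set[FrozenSet[str]]) -> Optional[List[OligoSet]]:
    """Pick one oligo set per region; return the first dimer-free choice."""
    chosen: List[OligoSet] = []

    def compatible(candidate: OligoSet) -> bool:
        # No oligo in the candidate may dimerize with an already chosen oligo.
        return all(frozenset((a, b)) not in dimers
                   for picked in chosen for a in picked for b in candidate)

    def search(region_index: int) -> bool:
        if region_index == len(region_sets):
            return True  # every region has a set: stop at this first solution
        for candidate in region_sets[region_index]:
            if compatible(candidate):
                chosen.append(candidate)
                if search(region_index + 1):
                    return True
                chosen.pop()  # backtrack and try the next set
        return False

    return list(chosen) if search(0) else None


# Hypothetical candidate sets for two regions and assumed dimer pairs.
region_sets = [
    [["a1", "a2"], ["a3", "a4"]],
    [["b1", "b2"], ["b3", "b4"]],
]
dimers = {frozenset(("a1", "b1")), frozenset(("a2", "b3"))}
combination = first_compatible_combination(region_sets, dimers)
```

Stopping at the first solution avoids enumerating every combination, at the cost of not comparing alternative solutions against each other.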
3 Results
Besides the published work (Dapprich et al., 2016; Gupta et al., 2010), oligos have been designed for capturing several genomic regions associated with Noonan Syndrome (8 genes), Type 1 Diabetes (9 genomic regions), Crohn’s Disease (10 genomic regions) and retinitis pigmentosa (37 genomic regions) (available upon request). In each case, the oligonucleotides performed well as judged by uniformity, sensitivity and average depth of coverage (Dapprich et al., 2016; Gupta et al., 2010). To additionally validate the tool, the MHC of a random sample was captured and sequenced on the Illumina MiSeq. Alignment was performed using COX as the reference, since the sample showed a closer match to COX than to the PGF reference. The average depth of coverage was estimated at 100× with 98.4% of positions >20× (Supplementary Fig. S1). We attempted another capture of the MHC region, besides the one published earlier (Dapprich et al., 2016), because we needed to assess the success of the design using a random sample with unknown MHC sequence. The previously published capture (Dapprich et al., 2016) involved the PGF cell line, which has a known MHC sequence, and the oligos were designed based on this known reference sequence. This time, AnthOligo was run on a number of different reference MHC sequences (Horton et al., 2008) to generate a new set of oligos that presumably can target the MHC of any random DNA sample. To give a better sense of the expected outcomes of using AnthOligo, we have calculated some statistics from the obtained outputs and documented them in the Supplementary Files (Supplementary Figs S2, S3, S4 and Supplementary Table S2).
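Coverage metrics of the kind reported above (average depth and the percentage of positions above 20×) can be computed from per-base depths as in the following sketch. The function name `coverage_summary` and the toy depth values are hypothetical; this is not the pipeline used for the reported figures.

```python
from typing import Sequence, Tuple


def coverage_summary(depths: Sequence[int],
                     min_depth: int = 20) -> Tuple[float, float]:
    """Return (mean depth, percent of positions with depth > min_depth)."""
    if not depths:
        raise ValueError("depths must not be empty")
    mean_depth = sum(depths) / len(depths)
    pct_above = 100.0 * sum(d > min_depth for d in depths) / len(depths)
    return mean_depth, pct_above


# Toy per-base depths for five positions (illustrative values only).
mean_depth, pct_above = coverage_summary([0, 15, 25, 120, 140])
```

Per-base depths of this form can be exported from an alignment with standard depth-reporting tools before being summarized.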
To capture sequence with an acceptable range of accuracy and uniform representation across all the regions in multiplexed reactions, oligonucleotides must meet certain specifications in terms of sequence specificity, efficient oligo design with minimal interaction between the probes and optimized process time (Francis et al., 2017; Mulle et al., 2010; Shen et al., 2010; Wingo et al., 2017). AnthOligo is implemented to satisfy these requirements with the RSE method. It is well understood that target capture design for multiplexed reactions is an NP-complete problem (Nicodeme and Steyaert, 1997; Shen et al., 2010). Heuristic optimization is necessary to process large regions, upwards of 1 Mb, while identifying sets of evenly spaced, target-specific capture oligonucleotides throughout the target region (Hysom et al., 2012). Combinatorial approaches along with the MapReduce framework help such multi-threaded, memory-intensive and data-intensive tasks to run within an optimized time frame.
Sequence specificity is governed by multiple factors, the majority of which are repeats in the genome and the presence of pseudogenes (Claes and Leeneer, 2014; Koressaar et al., 2018; Mertes et al., 2011; Nordberg, 2005; Treangen and Salzberg, 2012; Vallone and Butler, 2004). AnthOligo’s use of a hard-masked reference file for generating oligos addresses this by avoiding possible repeat regions in the sequence. BLAT results are filtered by focusing on the specificity of the 3′ subsequence (Miura et al., 2005). The publicly available version of AnthOligo limits the cumulative size of all targeted regions provided by the user to 1 Mb and will notify the user when that limit has been exceeded. This is because of the resource limitations of our production environment. However, if a user wishes to submit regions whose cumulative size is larger than this limit, the request can be accommodated using the internal instance of the application. Please contact us using the information on the help page to arrange this.
Although AnthOligo is developed to support the RSE method, its current abilities and flexibility for future enhancements may have wider applications in designing internal oligos that can be used to target the MHC using CRISPR-Cas9, baits for gene panel analysis or even probes for CGH array processes. AnthOligo is thus a unique tool for a previously unaddressed domain, and results show that it achieves the desired objectives.
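A pre-submission check against the cumulative size limit can be sketched as follows. The function names, the constant `MAX_TOTAL_BP` and the example coordinates are hypothetical, and the sketch assumes standard BED conventions (0-based, half-open intervals, so a region's size is end minus start); it does not reproduce how the web application performs its own check.

```python
from typing import Iterable, List, Tuple

MAX_TOTAL_BP = 1_000_000  # 1 Mb cumulative limit stated for the public instance

Region = Tuple[str, int, int]


def parse_bed(lines: Iterable[str]) -> List[Region]:
    """Parse the first three BED columns: chrom, start, end."""
    regions: List[Region] = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        if int(end) <= int(start):
            raise ValueError(f"end must exceed start: {line}")
        regions.append((chrom, int(start), int(end)))
    return regions


def total_size(regions: Iterable[Region]) -> int:
    """Sum region lengths; overlapping regions are counted twice."""
    return sum(end - start for _, start, end in regions)


def within_limit(regions: Iterable[Region], limit: int = MAX_TOTAL_BP) -> bool:
    return total_size(regions) <= limit


# Hypothetical input with two regions totalling 1.1 Mb.
bed_lines = ["chr6\t29000000\t29600000", "chr6\t31000000\t31500000"]
regions = parse_bed(bed_lines)
ok = within_limit(regions)
```

In this example the two regions together exceed 1 Mb, so `ok` is false and the input would need to be split or submitted through the internal instance.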
Supplementary Material
Acknowledgements
The authors thank Juan Carlos Perin for suggesting the name AnthOligo and Dr. Kajia Cao for help with the NP-complete optimization problem.
Funding
The project described was supported by Award Number P30DK019525 from the National Institute of Diabetes and Digestive and Kidney Diseases to D.M.
Conflict of Interest: none declared.
References
- Beliveau B.J. et al. (2018) OligoMiner provides a rapid, flexible environment for the design of genome-scale oligonucleotide in situ hybridization probes. Proc. Natl. Acad. Sci. USA, 115, E2183–E2192.
- Ben Zakour N. (2004) GenoFrag: software to design primers optimized for whole genome scanning by long-range PCR amplification. Nucleic Acids Res., 32, 17–24.
- Chen N. (2004) Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics, 5, 4.10.1–4.10.14.
- Claes K.B., Leeneer K.D. (2014) Dealing with pseudogenes in molecular diagnostics in the next-generation sequencing era. Methods Mol. Biol., 1167, 303–315.
- Dapprich J. et al. (2016) The next generation of target capture technologies – large DNA fragment enrichment and sequencing determines regional genomic variation of high complexity. BMC Genomics, 17, 486.
- Francis F. et al. (2017) ThermoAlign: a genome-aware primer design tool for tiled amplicon resequencing. Sci. Rep., 7, 44437.
- Gelali E. et al. (2019) iFISH is a publically available resource enabling versatile DNA FISH to study genome architecture. Nat. Commun., 10, 1636.
- Gnirke A. et al. (2009) Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat. Biotechnol., 27, 182–189.
- Gupta T. et al. (2010) Microtubule actin crosslinking factor 1 regulates the Balbiani body and animal-vegetal polarity of the zebrafish oocyte. PLoS Genet., 6, e1001073.
- Horton R. et al. (2008) Variation analysis and gene annotation of eight MHC haplotypes: the MHC Haplotype Project. Immunogenetics, 60, 1–18.
- Hysom D.A. et al. (2012) Skip the alignment: degenerate, multiplex primer and probe design using K-mer matching instead of alignments. PLoS One, 7, e34560.
- Ilie L. et al. (2013) BOND: Basic OligoNucleotide Design. BMC Bioinformatics, 14, 69.
- Jabado O.J. et al. (2006) Greene SCPrimer: a rapid comprehensive tool for designing degenerate primers from multiple sequence alignments. Nucleic Acids Res., 34, 6605–6611.
- Kent W.J. (2002) BLAT—the BLAST-like alignment tool. Genome Res., 12, 656–664.
- Koressaar T. et al. (2018) Primer3_masker: integrating masking of template sequence with primer design software. Bioinformatics, 34, 1937–1938.
- Markham N.R., Zuker M. (2008) UNAFold: software for nucleic acid folding and hybridization. Methods Mol. Biol., 453, 3–31.
- Mertes F. et al. (2011) Targeted enrichment of genomic DNA regions for next-generation sequencing. Brief. Funct. Genomics, 10, 374–386.
- Miura F. et al. (2005) A novel strategy to design highly specific PCR primers based on the stability and uniqueness of 3′-end subsequences. Bioinformatics, 21, 4363–4370.
- Mulle J.G. et al. (2010) Empirical evaluation of oligonucleotide probe selection for DNA microarrays. PLoS One, 5, e9921.
- Navin N. et al. (2006) PROBER: oligonucleotide FISH probe design software. Bioinformatics, 22, 2437–2438.
- Nicodeme P., Steyaert J.M. (1997) Selecting optimal oligonucleotide primers for multiplex PCR. Proc. Int. Conf. Intell. Syst. Mol. Biol., 5, 210–213.
- Nordberg E.K. (2005) YODA: selecting signature oligonucleotides. Bioinformatics, 21, 1365–1370.
- Okou D.T. et al. (2007) Microarray-based genomic selection for high-throughput resequencing. Nat. Methods, 4, 907–909.
- Owczarzy R. et al. (2008) IDT SciTools: a suite for analysis and design of nucleic acid oligomers. Nucleic Acids Res., 36, W163–W169.
- Rouillard J.-M. et al. (2003) OligoArray 2.0: design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Res., 31, 3057–3062.
- Rychlik W. (2007) OLIGO 7 primer analysis software. Methods Mol. Biol., 402, 35–60.
- SantaLucia J., Jr., (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. USA, 95, 1460–1465.
- Shen Z. et al. (2010) MPprimer: a program for reliable multiplex PCR primer design. BMC Bioinformatics, 11, 143.
- Tewhey R. et al. (2009) Enrichment of sequencing targets from the human genome by solution hybridization. Genome Biol., 10, R116.
- Treangen T.J., Salzberg S.L. (2012) Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet., 13, 36–46.
- Untergasser A. et al. (2012) Primer3—new capabilities and interfaces. Nucleic Acids Res., 40, e115.
- Vallone P.M., Butler J.M. (2004) AutoDimer: a screening tool for primer-dimer and hairpin structures. Biotechniques, 37, 226–231.
- Wenger A.M. et al. (2019) Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol., 37, 1155–1162.
- Wingo T.S. et al. (2017) MPD: multiplex primer design for next-generation targeted sequencing. BMC Bioinformatics, 18, 14.
- Zheng G.X.Y. et al. (2016) Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat. Biotechnol., 34, 303–311.
- Zuker M. et al. (1999) Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide In: Barciszewski J., Clark B.F.C. (eds.) RNA Biochemistry and Biotechnology. Springer Netherlands, Dordrecht: pp. 11–43.