GIL: a python package for designing custom indexing primers

Nicholas Mateyko; Omar Tariq; Xinyi E Chen; Will Cheney; Asfar Lathif Salaudeen; Ishika Luthra; Najmeh Nikpour; Abdul Muntakim Rafi; Hadis Kamali Dehghan; Cassandra Jensen; Carl de Boer

Bioinformatics. 2023 May 19;39(6):btad328. doi: 10.1093/bioinformatics/btad328

Editor: Can Alkan

PMCID: PMC10246578 PMID: 37208164

Abstract

Summary

Generate Indexes for Libraries (GIL) is a software tool for generating primers to be used in the production of multiplexed sequencing libraries. GIL can be customized in numerous ways to meet user specifications, including length, sequencing modality, color balancing, and compatibility with existing primers, and produces ordering and demultiplexing-ready outputs.

Availability and implementation

GIL is written in Python and is freely available on GitHub under the MIT license (https://github.com/de-Boer-Lab/GIL). It can also be accessed as a web application implemented in Streamlit at https://dbl-gil.streamlitapp.com.

1 Introduction

Next-generation sequencing (NGS) has become a cornerstone of biology as a crucial method for data collection. As sequencing technologies have been refined, the cost of sequencing a whole human genome has decreased by a factor of nearly 3 million since 2003 (Check Hayden 2014). While advances in NGS technologies have facilitated remarkably low-cost acquisition of massive sequencing data, the minimum cost of NGS remains quite high (Schwarze et al. 2020). Multiplexing samples into a single pooled library to reduce costs is a solution that has been established since the earliest days of DNA sequencing (Church and Kieffer-Higgins 1988, Chee 1991). Multiplexing allows many samples to be sequenced in a single run, enabling researchers to take advantage of NGS's low costs even when the desired sequencing depth for a single sample falls well below the scale of an NGS run (Smith et al. 2010). Multiplexing is accomplished by appending unique barcodes, or indexes, to each sample. By 2016, the incorporation of these indexes by appropriately modified amplification primers had become standard (O'Donnell et al. 2016). Compatible primers can be purchased as part of commercially distributed sample preparation kits, but at a considerable cost when large numbers of samples are prepared. Designing and ordering primers independently can reduce costs but remains challenging because testing an indexing primer set is expensive and many considerations enter into primer design. A set of compatible indexing primers must be sufficiently dissimilar from one another for demultiplexing, avoid self-priming interactions, and be appropriately color balanced for the sequencing modality (Illumina 2018). Custom indexes would facilitate efficient large-scale sequencing at low cost.

Here we present Generate Indexes for Libraries (GIL), a user-friendly Python package that produces customizable sequencing primers in a ready-to-order and ready-to-demultiplex format. Users can provide custom adapter sequences, enabling index generation for any sequencing system or modality, and can create indexes of any length (e.g. greater than the standard 8 nt). Users can customize filtering to eliminate indexes that may cause issues in their setup (e.g. have repetitive sequences or match existing primer sets). The generated order sheets can then be used to purchase primers at a fraction of the cost per sample of commercial library preparation kits. GIL is available at https://github.com/de-Boer-Lab/GIL and can be accessed as a web-application implemented in Streamlit at https://dbl-gil.streamlitapp.com.

2 Materials and methods

2.1 Implementation

GIL is written in Python 3 and can be run both from the command line and from an online Streamlit application (Fig. 1). We designed GIL to create primers that add indexes to libraries by PCR. Input libraries must share a common adapter sequence on the ends, which can be added in multiple ways, including PCR for targeted sequencing, adapter ligation, or tagmentation, similar to the NEBNext® Multiplex Oligos for Illumina® and iTru approaches (Glenn et al. 2019).

Figure 1. Overview of the GIL index generation pipeline.

We first generate a set of candidate indexes: all k-mers for index length k ≤ 9, or a random sample of k-mers for k > 9 (by default sampled from a pool of 5000, with the option for the user to request a larger pool). We filter out indexes with undesirable qualities, select from those that remain a set of sequences that are all sufficiently dissimilar to one another, and create a 96-well plate layout in which the order of the barcodes on the plate maintains color balance. The default filtering steps are as follows, although the parameters and filtering steps can all be customized:

1. Remove indexes that start with G. In Illumina sequencers that use 2-channel chemistry, G is not labeled with a fluorescent dye. Illumina warns that if the first two bases in the index are both G, “intensity is not generated” (Illumina 2018; https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/experiment-design/index-adapters-pooling-guide-1000000041074-05.pdf), and so indexes where the index read starts with G are removed.
2. Remove indexes with extreme GC content. We removed indexes with GC content ≤25% or ≥75%.
3. Remove homopolymer and dinucleotide repeats. We removed indexes with >2 homopolymer or dinucleotide repeats since simple repeats are associated with DNA synthesis errors.
4. Remove indexes too similar to an existing index set (optional). For two sets of indexes to be compatible, they must be sufficiently distinct to uniquely identify the samples when demultiplexing. Users can provide an existing set of indexing primers (e.g. those that they already use or that are commonly used at their sequencing facility), and all indexes within a Levenshtein distance of 3 of any existing index are removed.
5. Remove self-priming sequences. We compute the Hamming distance between the reverse complement of the 8 bases on the 3′ end of the primer and all 8 nt windows that include the index sequence. If any of these distances is <3, the index is filtered out.
6. Color balancing. When only a few indexes are used in a sequencing run, color balance must be considered (Illumina 2018). We place indexes within a 96-well plate such that groups of four indexes along the rows of the plate are color balanced. This allows for multiplexing with as few as four consecutive indexes without having to consider color balance issues.

We then place the generated indexes into the primer sequence context, which, by default, is designed to work with Illumina TruSeq. We designed the primers to have a Tm of 65°C with NEB Q5 polymerase.
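As an illustration of candidate generation and the sequence-content filters (starting G, GC content, simple repeats), the following is a minimal sketch and not GIL's actual code. All function names are hypothetical. It assumes that the "pool of 5000" means 5000 distinct random k-mers, that the sequence passed in is the index as it is read, and that ">2 repeats" means a base or dinucleotide unit occurring three or more times in a row.

```python
import itertools
import random
import re

BASES = "ACGT"


def candidate_pool(k, pool_size=5000, seed=0):
    """Return every k-mer for k <= 9, otherwise a random sample of distinct k-mers."""
    if k <= 9:
        return ["".join(p) for p in itertools.product(BASES, repeat=k)]
    rng = random.Random(seed)
    target = min(pool_size, 4 ** k)  # cannot draw more distinct k-mers than exist
    pool = set()
    while len(pool) < target:
        pool.add("".join(rng.choice(BASES) for _ in range(k)))
    return sorted(pool)


def starts_with_g(index):
    """True if the first base of the index is G."""
    return index.startswith("G")


def extreme_gc(index, low=0.25, high=0.75):
    """True if the GC fraction is <= low or >= high."""
    gc = sum(base in "GC" for base in index) / len(index)
    return gc <= low or gc >= high


# Assumed reading of ">2 repeats": a base or dinucleotide unit occurring
# three or more times in a row (e.g. AAA or ACACAC).
_HOMOPOLYMER = re.compile(r"(.)\1{2,}")
_DINUCLEOTIDE = re.compile(r"(..)\1{2,}")


def has_simple_repeat(index):
    """True if the index contains a homopolymer or dinucleotide repeat."""
    return bool(_HOMOPOLYMER.search(index) or _DINUCLEOTIDE.search(index))


def passes_content_filters(index):
    """Apply the three sequence-content filters with their default thresholds."""
    return not (starts_with_g(index) or extreme_gc(index) or has_simple_repeat(index))


if __name__ == "__main__":
    pool = candidate_pool(8)
    kept = [idx for idx in pool if passes_content_filters(idx)]
    print(f"{len(kept)} of {len(pool)} 8-mers pass the content filters")
```

Keeping each filter as a separate predicate mirrors the statement that the filtering steps can be customized individually: a filter can be dropped or given different thresholds without touching the others.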
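The two distance-based filters (exclusion of indexes near an existing set, and the self-priming check) could be sketched as below. This is an illustrative reading of the description, not GIL's implementation. The function names are hypothetical, "within 3" is taken to mean a Levenshtein distance ≤ 3, and "windows that include the index sequence" is assumed to mean windows overlapping at least one base of the index.

```python
def levenshtein(a, b):
    """Edit distance (substitutions, insertions, deletions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def too_close_to_existing(index, existing, max_distance=3):
    """True if the index is within max_distance edits of any existing index."""
    return any(levenshtein(index, other) <= max_distance for other in existing)


def self_primes(left_flank, index, right_flank, window=8, min_distance=3):
    """True if the primer's 3' end could anneal to a window overlapping the index.

    Assumption: a window "includes the index" if it overlaps at least one
    index base.
    """
    primer = left_flank + index + right_flank
    probe = reverse_complement(primer[-window:])
    start, end = len(left_flank), len(left_flank) + len(index)
    for i in range(len(primer) - window + 1):
        overlaps_index = i < end and i + window > start
        if overlaps_index and hamming(probe, primer[i:i + window]) < min_distance:
            return True
    return False
```

Levenshtein distance is used for the comparison with existing indexes because it also counts insertions and deletions, whereas the self-priming check compares fixed-length windows, for which Hamming distance suffices.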

GIL has two main outputs. For each plate of indexes, GIL generates (i) an order sheet (CSV) that contains the well, primer name, and primer sequence columns in an order-ready format and (ii) a demultiplexing sample sheet in the standard Illumina format. Because some sequencers read index 2 in the reverse complement direction, two sample sheets are generated for each index plate, one with the index 2 column reverse complemented. Since a sample can be uniquely identified by the combination of index 1 and index 2 sequences, a single 96-well plate of each of index 1 and index 2 primers can be used to index 96² (9216) samples that are compatible for pooling and sequencing together. However, most users will not require so many indexing primers, so by default we create demultiplexing sample sheets in which index 1 and index 2 are each unique to a sample. The two indexes are therefore redundant: either could be used on its own to demultiplex samples, which enables the detection and exclusion of index-hopping reads (Farouni et al. 2020, van der Valk et al. 2020).
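To make the sample-sheet logic concrete, here is a small sketch of unique dual-index pairing and the two index 2 orientations. It is an assumption-laden illustration rather than GIL's output code: the helper names and sample names are hypothetical, and the column names `Sample_ID`, `index`, and `index2` are simplified stand-ins for the data columns of an Illumina-style sample sheet.

```python
import csv
import io

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def combinatorial_capacity(n_index1, n_index2):
    """Number of samples distinguishable when every index pair is allowed."""
    return n_index1 * n_index2


def unique_dual_pairs(index1_list, index2_list):
    """Pair the n-th index 1 with the n-th index 2 so each index is used once."""
    if len(index1_list) != len(index2_list):
        raise ValueError("need the same number of index 1 and index 2 sequences")
    return list(zip(index1_list, index2_list))


def data_rows(pairs, revcomp_index2=False):
    """Build sample-sheet rows; optionally reverse complement the index 2 column."""
    rows = []
    for n, (idx1, idx2) in enumerate(pairs, start=1):
        if revcomp_index2:
            idx2 = reverse_complement(idx2)
        rows.append({"Sample_ID": f"sample_{n}", "index": idx1, "index2": idx2})
    return rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["Sample_ID", "index", "index2"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    pairs = unique_dual_pairs(["ACGTTGCA", "TTGCAACG"], ["AAACCCGG", "AGTCCTGA"])
    print(to_csv(data_rows(pairs)))                       # index 2 as designed
    print(to_csv(data_rows(pairs, revcomp_index2=True)))  # index 2 reverse complemented
    print(combinatorial_capacity(96, 96))                 # samples per pair of plates
```

With unique dual pairing, a read whose index 1 and index 2 belong to different samples cannot be a valid combination, which is what allows index-hopped reads to be identified and discarded.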

2.2 Ordering primers

GIL was run with default parameters and generated three 96-well plates of TruSeq index primers. The 96 i5 and 96 i7 primers from the first generated plate were ordered as 100 nmol oligos with standard desalting from IDT, including a single phosphorothioate bond between the last two bases on the 3′ end of the primers to prevent 3′-to-5′ degradation by DNA polymerase (Skerra 1992). The oligo plates, sample sheets and order sheets from this order are available on Zenodo (https://doi.org/10.5281/zenodo.7922539).

3 Conclusion

After generating three plates of mutually compatible primers with default settings through GIL, we ordered primers, indexed samples (n = 44), and sequenced them on the Illumina MiSeq Nano platform. Sequencing BCL files were demultiplexed successfully using the generated sample sheet with bcl2fastq software. All samples were present. Approximately 0.2% of reads within an index were thrown out due to deletions in the indexes compared to 0.02% for a PhiX control. Further primer purification could alleviate loss due to mismatching barcode sequences.

3.1 Comparison to existing software

Several programs for designing and testing sequencing indexes exist. EDITTAG (Faircloth and Glenn 2012), BARCRAWL and BARTAB (Frank 2009), and DNABarcodes (Buschmann and Bystrykh 2013) are all freely available tools for producing custom indexes for multiplexed sequencing. Each of these solutions provides some of the functionality found in GIL; however, GIL has several advantages over existing solutions. For instance, GIL allows the user to consider self-priming interactions with the constant flanking primer sequence, the presence of a 5′ G, and dinucleotide repeats. Furthermore, GIL can produce primers with arbitrary constant regions flanking the indexes, enabling atypical uses or non-Illumina platforms. Finally, GIL is much easier to use, providing primers in an order-ready format, the files needed for demultiplexing (bcl2fastq), and a graphical interface (Table 1).

Table 1. Comparison of GIL to existing software (✓, supported; -, not supported).

Feature                         BARCRAWL  EDITTAG  DNABarcodes  GIL
Homopolymers                    ✓         ✓        ✓            ✓
Custom length                   ✓         ✓        ✓            ✓
Hairpins                        ✓         ✓        ✓            ✓
Edit distance                   ✓         ✓        ✓            ✓
GC content                      ✓         ✓        -            ✓
Color balance                   -         -        ✓            ✓
Index exclusion                 -         -        ✓            ✓
Specify initial index pool      -         -        ✓            -
Dinucleotide repeats            -         -        -            ✓
Starting G                      -         -        -            ✓
Self-priming interactions       -         -        -            ✓
Construct primer from indexes   -         -        -            ✓
Order ready format              -         -        -            ✓
Demultiplexing                  -         -        -            ✓


Acknowledgements

We thank Marjan Barazandeh for their help in testing GIL’s implementation.

Contributor Information

Nicholas Mateyko, School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Omar Tariq, School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Xinyi E Chen, School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Will Cheney, School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Asfar Lathif Salaudeen, School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Ishika Luthra, School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Najmeh Nikpour, School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Abdul Muntakim Rafi, School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Hadis Kamali Dehghan, School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Cassandra Jensen, School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Carl de Boer, School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z3, Canada.

Conflict of interest

None declared.

Funding

This work was supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2020-05425), the Stem Cell Network (ECR-C4R1-7), the Canadian Institutes of Health Research (PJT-180537), WestGrid, Compute Canada (www.computecanada.ca), and Advanced Research Computing at the University of British Columbia.

Data availability

The data underlying this article are available in Zenodo, at https://doi.org/10.5281/zenodo.7922539.

References

Buschmann T, Bystrykh LV. Levenshtein error-correcting barcodes for multiplexed DNA sequencing. BMC Bioinform 2013;14:272. 10.1186/1471-2105-14-272.
Check Hayden E. Technology: The $1,000 genome. Nature 2014;507:294–5. 10.1038/507294a.
Chee M. Enzymatic multiplex DNA sequencing. Nucleic Acids Res 1991;19:3301–5. 10.1093/nar/19.12.3301.
Church GM, Kieffer-Higgins S. Multiplex DNA sequencing. Science 1988;240:185–8. 10.1126/science.3353714.
Faircloth BC, Glenn TC. Not all sequence tags are created equal: designing and validating sequence identification tags robust to indels. PLoS One 2012;7:e42543. 10.1371/journal.pone.0042543.
Farouni R, Djambazian H, Ferri LE et al. Model-based analysis of sample index hopping reveals its widespread artifacts in multiplexed single-cell RNA-sequencing. Nat Commun 2020;11:2704. 10.1038/s41467-020-16522-z.
Frank DN. BARCRAWL and BARTAB: software tools for the design and implementation of barcoded primers for highly multiplexed DNA sequencing. BMC Bioinform 2009;10:362. 10.1186/1471-2105-10-362.
Glenn TC, Nilsen RA, Kieran TJ et al. Adapterama I: universal stubs and primers for 384 unique dual-indexed or 147,456 combinatorially-indexed Illumina libraries (iTru & iNext). PeerJ 2019;7:e7755. 10.7717/peerj.7755.
Illumina. Index Adapters Pooling Guide (1000000041074). Illumina (San Francisco, 2018).
O'Donnell JL, Kelly RP, Lowell NC et al. Indexed PCR primers induce template-specific bias in large-scale DNA sequencing studies. PLoS One 2016;11:e0148698. 10.1371/journal.pone.0148698.
Schwarze K, Buchanan J, Fermont JM et al. The complete costs of genome sequencing: a microcosting study in cancer and rare diseases from a single center in the United Kingdom. Genet Med 2020;22:85–94. 10.1038/s41436-019-0618-7.
Skerra A. Phosphorothioate primers improve the amplification of DNA sequences by DNA polymerases with proofreading activity. Nucleic Acids Res 1992;20:3551–4. 10.1093/nar/20.14.3551.
Smith AM, Heisler LE, St.Onge RP et al. Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples. Nucleic Acids Res 2010;38:e142. 10.1093/nar/gkq368.
van der Valk T, Vezzi F, Ormestad M et al. Index hopping on the Illumina HiseqX platform and its consequences for ancient DNA studies. Mol Ecol Resour 2020;20:1171–81. 10.1111/1755-0998.13009.
