LOCUST: a custom sequence locus typer for classifying microbial isolates

Lauren M Brinkac; Erin Beck; Jason Inman; Pratap Venepally; Derrick E Fouts; Granger Sutton

doi:10.1093/bioinformatics/btx045

Bioinformatics. 2017 Jan 25;33(11):1725–1726. doi: 10.1093/bioinformatics/btx045


Editor: John Hancock

PMCID: PMC5860141 PMID: 28130240

Abstract

Summary

LOCUST is a custom sequence locus typer for classifying microbial genomes. It provides a fully automated way to customize the classification of genome-wide nucleotide variant data so that the typing scheme reflects the variation most relevant to the biological question under study.

Availability and Implementation

Source code, demo data, and detailed documentation are freely available at http://sourceforge.net/projects/locustyper.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

An increased prevalence of hospital-acquired infections and foodborne illnesses caused by pathogenic and multidrug-resistant bacteria has created a pressing need for computational methods that rapidly and accurately classify bacteria from genomic sequence data. This is particularly important for early detection of outbreaks (Joensen, et al., 2014), diagnosis of infections (Punina, et al., 2015), and source tracking (Ruan and Feng, 2016). The traditional sequence-based method of classifying bacterial species is Multilocus Sequence Typing (MLST) (Maiden, et al., 1998). It defines a profile, or sequence type (ST), for each isolate by identifying a conserved allelic variant or a unique combination of conserved allelic variants. In other methods, such as Multiple Locus Variable-number Tandem Repeat Analysis (MLVA) (Keim, et al., 2000), profiles are generated from the calculated number of tandemly repeated DNA sequence variants at different loci. While amplicon-based typing methods have been extremely useful for point-of-source classification, whole-genome sequencing (WGS) has become increasingly economical and, with the maturation of the MinION from Oxford Nanopore Technologies (Quick, et al., 2015), more portable. Thus, there is a critical need for an easy-to-use, multi-platform, publicly available, and flexible WGS sequence typer that can type genomes using a variety of DNA sequence markers.

To meet this need, we present LOCUST, our LOcus CUstom Sequence Typer. LOCUST is an automated classification software tool that enables users to customize the typing of microbial isolates from WGS data. Several key features distinguish LOCUST from other computational typing tools (Didelot, et al., 2009; Gilmour, et al., 2007; Inouye, et al., 2014; Kruczkiewicz, 2013; Nascimento, et al., 2016; Pightling, et al., 2015; Yoshida, et al., 2016):

Automated processing from genome sequence download to genotypic classification and phylogenetic inferences.
User customization enabling the processing of existing, or the generation of novel, classification schemes.
Genomic assertions based on genotypic classifications.
Database and annotation independent.
User-friendly, intuitive, and easy to install and run locally.

2 Design and implementation

LOCUST is implemented as a collection of Perl scripts called by the top-level typer.pl program. Two input types are required: allele sequences and genome sequences. Genome sequences can be supplied either as a tabular list file of genome names and their FASTA file locations or by downloading FASTA files directly from GenBank. Users can download by search term, such as ‘Mycobacterium abscessus’, or supply a list of either NCBI BioSample IDs or Assembly Accessions. Additional filters include a minimum contig N50 size, a maximum allowed number of contigs per genome, and whether complete and/or draft quality genomes should be downloaded. Genomes that match the download criteria but fail these filters are excluded.
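The two assembly-quality filters can be illustrated with a minimal Python sketch. It is not taken from the LOCUST source (which is written in Perl); the function names, parameter names, and example genomes are hypothetical, and the N50 is computed with the usual definition.

```python
def contig_n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def passes_filters(contig_lengths, min_n50=0, max_contigs=None):
    """Apply the contig-count and N50 filters to one genome.

    Parameter names are illustrative, not LOCUST option names."""
    if max_contigs is not None and len(contig_lengths) > max_contigs:
        return False
    return contig_n50(contig_lengths) >= min_n50


# Hypothetical genomes described only by their contig lengths.
genomes = {
    "isolate_A": [500_000, 300_000, 200_000],
    "isolate_B": [50_000] * 20,
}
kept = [name for name, lengths in genomes.items()
        if passes_filters(lengths, min_n50=100_000, max_contigs=10)]
print(kept)
```

Here isolate_B is dropped because it has more contigs than the chosen maximum, which mirrors how a genome that matches the search term can still be excluded from the results.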

The basic approach (Supplementary Fig. S1) is to BLAST a set of alleles in a seed file against each genome to identify the allele present at each locus. The best BLAST match, based on bitscore, is chosen for each locus. If the match is full length, it is extracted directly. If the match falls short of full length but the number of unmatched bases at the ends is within the BLAST length threshold (default 5), the extracted nucleotide sequence is extended by the length difference on one or both ends. If the number of unmatched bases exceeds the threshold, the allele is not extracted.
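The extraction rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the LOCUST implementation: the `Hit` record and function names are hypothetical, the match is assumed to lie on the forward strand, the threshold is assumed to apply to each end separately (with a shortfall equal to the threshold still accepted), and an extension that would run past the contig boundary is assumed to fail.

```python
from typing import NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    allele_len: int   # length of the seed allele (the BLAST query)
    q_start: int      # 1-based start of the match on the allele
    q_end: int        # 1-based end of the match on the allele
    s_start: int      # 1-based start of the match on the genome contig
    s_end: int        # 1-based end of the match on the genome contig
    bitscore: float


def best_hit(hits):
    """Pick the hit with the highest bitscore for a locus, or None."""
    return max(hits, key=lambda h: h.bitscore) if hits else None


def extraction_coords(hit: Hit, contig_len: int,
                      threshold: int = 5) -> Optional[Tuple[int, int]]:
    """Return genome coordinates to extract, or None if not extractable."""
    missing_5 = hit.q_start - 1               # unmatched bases at the start
    missing_3 = hit.allele_len - hit.q_end    # unmatched bases at the end
    if missing_5 > threshold or missing_3 > threshold:
        return None                           # too much of the allele is unmatched
    start = hit.s_start - missing_5           # extend by the length difference
    end = hit.s_end + missing_3
    if start < 1 or end > contig_len:
        return None                           # assumption: cannot extend off the contig
    return start, end


hits = [Hit(100, 4, 98, 504, 598, 170.0), Hit(100, 20, 90, 900, 970, 120.0)]
print(extraction_coords(best_hit(hits), contig_len=10_000))
```

A full-length match has no unmatched bases, so the same function returns its coordinates unchanged.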

A novel aspect of LOCUST is the generation of a seed file from a given set of alleles. To minimize BLAST search time while retaining maximal sensitivity for finding full-length allele matches, a minimal set of alleles representing the diversity of sequence variation is required. This minimal set of representative alleles is chosen for each locus using an implementation of a well-known heuristic for the ‘set cover’ problem (Cormen, et al., 2001). If we define the subset of each allele to be the alleles redundant with it, then choosing the fewest alleles whose subsets together contain all the alleles is an instance of the ‘set cover’ problem, which is NP-hard, so no polynomial-time algorithm for an exact solution is known. Instead, a greedy heuristic is used to achieve a good solution. Specifically, the allele with the most redundant alleles is chosen, and all its redundant alleles are removed from consideration and reserved for later use in generating the final list of alleles. For the remaining alleles, the number of redundant alleles is recalculated, and the process is iterated until no alleles are left to be chosen. The result is a set cover that may not be minimal but is very nearly so. Because the allele with the largest number of redundant alleles is chosen at each iteration, the chosen alleles should be central to their clusters of removed redundant alleles.
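The greedy selection described above can be sketched in Python. This is a minimal illustration, not the LOCUST code: the function name, the input mapping, and the allele names are hypothetical, and ties are broken alphabetically, which is an assumption made only to keep the result deterministic.

```python
def greedy_seed_alleles(redundant):
    """Choose representative (seed) alleles for one locus.

    redundant maps each allele name to the set of alleles redundant with it.
    Returns a dict: chosen allele -> the alleles it represents (itself included).
    """
    remaining = set(redundant)  # alleles still under consideration
    seeds = {}
    while remaining:
        # Redundancy is recounted each round against the remaining alleles only.
        def coverage(allele):
            return len((redundant[allele] | {allele}) & remaining)

        # sorted() makes tie-breaking deterministic (first name wins).
        best = max(sorted(remaining), key=coverage)
        covered = (redundant[best] | {best}) & remaining
        seeds[best] = covered   # reserved for generating the final allele list
        remaining -= covered    # chosen allele and its redundant alleles leave the pool
    return seeds


# Hypothetical redundancy relationships among six alleles of one locus.
redundant = {
    "a1": {"a2", "a3"},
    "a2": {"a1"},
    "a3": {"a1"},
    "a4": {"a5"},
    "a5": {"a4"},
    "a6": set(),
}
seeds = greedy_seed_alleles(redundant)
print(list(seeds))
```

In this example three seed alleles represent all six: a1 is picked first because it has the most redundant alleles, and a6 is kept because nothing else covers it.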

All-versus-all BLASTN searches of the alleles are used to determine allele redundancy. An allele is redundant with another allele if the pairwise BLAST match is full length and has at least 90% identity. An added benefit of the all-versus-all BLAST is that LOCUST can check for poorly defined alleles. LOCUST checks for, and warns about, three things: alleles that match a locus other than their own, alleles that do not appear to be trimmed to the same endpoints, and alleles that appear to be on the opposite strand from other alleles of the same locus.
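As an illustration of the redundancy test, the following Python sketch reads all-versus-all results in the standard 12-column BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bitscore). It is not LOCUST code: the function name and allele names are hypothetical, ‘full length’ is assumed to mean that the match spans both alleles end to end, and the locus and strand warnings are left out.

```python
def redundancy_from_blast(tab_lines, allele_len, min_identity=90.0):
    """Build the allele -> redundant-alleles mapping from tabular BLAST hits.

    tab_lines: iterable of tab-separated lines in the 12-column layout.
    allele_len: dict mapping allele name to its sequence length.
    """
    redundant = {name: set() for name in allele_len}
    for line in tab_lines:
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # self hits carry no information
        identity = float(fields[2])
        q_start, q_end = int(fields[6]), int(fields[7])
        # Subject coordinates are reversed for minus-strand hits; normalise them.
        s_start, s_end = sorted((int(fields[8]), int(fields[9])))
        full_query = q_start == 1 and q_end == allele_len[query]
        full_subject = s_start == 1 and s_end == allele_len[subject]
        if full_query and full_subject and identity >= min_identity:
            redundant[query].add(subject)
            redundant[subject].add(query)
    return redundant


# Hypothetical hits among three alleles of one locus, each 100 bases long.
lengths = {"adk_1": 100, "adk_2": 100, "adk_3": 100}
hits = [
    "adk_1\tadk_2\t95.00\t100\t5\t0\t1\t100\t1\t100\t1e-40\t150",   # redundant
    "adk_1\tadk_3\t85.00\t100\t15\t0\t1\t100\t1\t100\t1e-20\t90",   # identity too low
    "adk_2\tadk_3\t99.00\t60\t1\t0\t1\t60\t1\t60\t1e-20\t100",      # not full length
]
print(redundancy_from_blast(hits, lengths))
```

The resulting mapping has the shape expected by a greedy set-cover selection of seed alleles.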

Phylogenetic trees generated by either FastTree (Price, et al., 2009, 2010) or RAxML (Stamatakis, 2014) can be built from multi-genome data as an available user option. Multi-FASTA sequences from a group of conserved alleles (e.g. MLST or universally conserved bacterial markers (Brown, et al., 2001)), present in all genomes in the dataset, are aligned by MUSCLE (Edgar, 2004), and gapped columns in the alignment are removed by trimAl (Capella-Gutierrez, et al., 2009). Annotated alleles are concatenated, maintaining allelic order, into a single multi-allelic sequence for each genome and used to generate input for building FastTree or RAxML trees.
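The concatenation step can be sketched in Python. This sketch assumes that alignment and trimming have already been done (by MUSCLE and trimAl in the pipeline described above) and that the per-locus alignments are held in a nested dictionary; the function names, data layout, and example sequences are hypothetical.

```python
def concatenate_alleles(aligned, locus_order):
    """Join per-locus aligned sequences into one sequence per genome.

    aligned: {locus: {genome: aligned_sequence}}
    locus_order: list of loci giving the fixed allelic order (non-empty).
    Only genomes present at every locus are kept.
    """
    shared = set.intersection(*(set(aligned[locus]) for locus in locus_order))
    return {
        genome: "".join(aligned[locus][genome] for locus in locus_order)
        for genome in sorted(shared)
    }


def to_fasta(sequences):
    """Render {name: sequence} as multi-FASTA text for a tree builder."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())


# Hypothetical trimmed alignments for two loci and three genomes.
aligned = {
    "adk": {"g1": "ACGT", "g2": "ACGA", "g3": "ACGG"},
    "gyrB": {"g1": "TT", "g2": "TC"},  # g3 lacks this locus and is dropped
}
supermatrix = concatenate_alleles(aligned, ["adk", "gyrB"])
print(to_fasta(supermatrix), end="")
```

Keeping the locus order identical for every genome is what makes each column of the concatenated alignment comparable across genomes.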

3 Summary

The key advantage of LOCUST is the ability to type bacterial genomes using any user-supplied DNA sequence. This includes the ability to auto-generate a novel typing scheme as output that can be reused and added to. In addition to typing, LOCUST can download genome sequences directly from GenBank and auto-generate phylogenetic trees of publication quality.

Funding

This project has been funded in whole or part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under Award Number U19AI110819.

Conflict of Interest: none declared.

References

Brown J.R. et al. (2001) Universal trees based on large combined protein sequence data sets. Nat. Genet., 28, 281–285.
Capella-Gutierrez S. et al. (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25, 1972–1973.
Cormen T.H. et al. (2001) Introduction to Algorithms. MIT Press, Cambridge, MA.
Didelot X. et al. (2009) SimMLST: simulation of multi-locus sequence typing data under a neutral model. Bioinformatics, 25, 1442–1444.
Edgar R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
Gilmour M.W. et al. (2007) Sequence-based typing of genetic targets encoded outside of the O-antigen gene cluster is indicative of Shiga toxin-producing Escherichia coli serogroup lineages. J. Med. Microbiol., 56, 620–628.
Inouye M. et al. (2014) SRST2: Rapid genomic surveillance for public health and hospital microbiology labs. Genome Med., 6, 90.
Joensen K.G. et al. (2014) Real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic Escherichia coli. J. Clin. Microbiol., 52, 1501–1510.
Keim P. et al. (2000) Multiple-locus variable-number tandem repeat analysis reveals genetic relationships within Bacillus anthracis. J. Bacteriol., 182, 2928–2936.
Kruczkiewicz PaM. et al. (2013) MIST: a tool for rapid in silico generation of molecular data from bacterial genome sequences. In: Proceedings of Bioinformatics 2013: 4th International Conference on Bioinformatics Models, Methods and Algorithms, pp. 316–323.
Maiden M.C. et al. (1998) Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc. Natl. Acad. Sci. U. S. A., 95, 3140–3145.
Nascimento M. et al. (2016) PHYLOViZ 2.0: providing scalable data integration and visualization for multiple phylogenetic inference methods. Bioinformatics, [Epub ahead of print].
Pightling A.W. et al. (2015) The Listeria monocytogenes Core-Genome Sequence Typer (LmCGST): a bioinformatic pipeline for molecular characterization with next-generation sequence data. BMC Microbiol., 15, 224.
Price M.N. et al. (2009) FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol., 26, 1641–1650.
Price M.N. et al. (2010) FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One, 5, e9490.
Punina N.V. et al. (2015) Whole-genome sequencing targets drug-resistant bacterial infections. Hum. Genomics, 9, 19.
Quick J. et al. (2015) Rapid draft sequencing and real-time nanopore sequencing in a hospital outbreak of Salmonella. Genome Biol., 16, 114.
Ruan Z., Feng Y. (2016) BacWGSTdb, a database for genotyping and source tracking bacterial pathogens. Nucleic Acids Res., 44, D682–D687.
Stamatakis A. (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30, 1312–1313.
Yoshida C.E. et al. (2016) The Salmonella In Silico Typing Resource (SISTR): an open web-accessible tool for rapidly typing and subtyping draft Salmonella genome assemblies. PLoS One, 11, e0147101.
