Abstract
Background
Co-localized sets of genes that encode specialized functions are common across microbial genomes and occur in genomes of larger eukaryotes as well. Important examples include Biosynthetic Gene Clusters (BGCs) that produce specialized metabolites with medicinal, agricultural, and industrial value (e.g. antimicrobials). Comparative analysis of BGCs can aid in the discovery of novel metabolites by highlighting distribution and identifying variants in public genomes. Unfortunately, gene-cluster-level homology detection remains inaccessible, time-consuming and difficult to interpret.
Results
The comparative gene cluster analysis toolbox (CAGECAT) is a rapid and user-friendly platform to mitigate difficulties in comparative analysis of whole gene clusters. The software provides homology searches and downstream analyses without the need for command-line or programming expertise. By leveraging remote BLAST databases, which always provide up-to-date results, CAGECAT can yield relevant matches that aid in the comparison, taxonomic distribution, or evolution of an unknown query. The service is extensible and interoperable and implements the cblaster and clinker pipelines to perform homology search, filtering, gene neighbourhood estimation, and dynamic visualisation of resulting variant BGCs. With the visualisation module, publication-quality figures can be customized directly from a web-browser, which greatly accelerates their interpretation via informative overlays to identify conserved genes in a BGC query.
Conclusion
Overall, CAGECAT is an extensible software that can be interfaced via a standard web-browser for whole region homology searches and comparison on continually updated genomes from NCBI. The public web server and installable docker image are open source and freely available without registration at: https://cagecat.bioinformatics.nl.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-023-05311-2.
Keywords: Gene cluster, Secondary metabolite, Homology search, Colocalized, Biosynthetic, Comparative analysis
Background
Genes working cooperatively in a metabolic pathway are often physically co-localized in prokaryotic and fungal genomes. These gene clusters are commonly observed in specialized metabolism involved in ecological adaptations, such as nutrient utilization and production of virulence factors. In particular, Biosynthetic Gene Clusters (BGCs) that code for specialized metabolites have gained significant interest due to their major role in modern society as a source of pharmaceutical drugs (e.g. antibiotics) and crop protection chemicals [1, 2]. These loci not only contain genes responsible for biosynthesis but often include auxiliary regions coding for regulatory and transporter proteins [2, 3]. Several computational frameworks that rely on signature genes and machine-learning-based methods, such as ClusterFinder, PRISM, DeepBGC, and antiSMASH, have been developed to effectively detect hypothetical BGCs from genomic data [4–7]. With these mature pipelines and the increase in publicly available genomes, a vast number of BGCs, both experimentally verified and hypothetical, have been catalogued in several databases. These include MIBiG, antiSMASH-DB, BiG-FAM, ARTS-DB, and IMG–ABC [8–12]. Unfortunately, much of this data remains unannotated. For instance, as few as 0.3% of the ~400,000 BGCs in IMG–ABC v5 are experimentally validated. Comparative genomic analysis can shed light on the functions of BGCs and their underlying genes. However, accessible online tools that allow scientists to perform custom comparative genomic analyses are lacking.
Gene cluster analysis methods for homology grouping, search, and visualisation are essential for effectively leveraging the available public resources. While tools such as BiG-SCAPE, BiG-SLiCE, MultiGeneBlast and cblaster aid in gene cluster analysis, these demand local computational resources or require command-line experience [13–16]. Due to this technological barrier, there is a need for a user-friendly and accessible platform for performing these analyses. Additionally, downstream methods for interpreting these results are often required. Visualisation and comparative genomic tools such as clinker and CORASON are capable of highlighting synteny or evolutionary relationships between BGCs; however, these also require expertise to operate and are not easily connected to homology search results [13, 17]. To remedy this problem and provide an accessible, “BLAST-like” web server for gene clusters, we present CAGECAT (the CompArative GEne Cluster Analysis Toolbox).
The CAGECAT web server enables researchers to execute a full gene cluster analysis pipeline using customizable BLAST searches on up-to-date genomic databases. The service provides seamless connections between the search and visualisation modules, enabling execution, inspection, and fine-tuning of relevant search results. While some multi-gene search portals exist, such as ClusterScout and antiSMASH-DB, these only provide for model-based searching (e.g. Pfam) on predefined genome datasets, which often lag behind rapidly growing public genomic databases [9, 18]. In addition to providing more up-to-date results, leveraging BLAST homology allows for refined control compared with model searches (e.g. identity and coverage), which can lead to more specific matches that aid in annotation, taxonomic distribution, or gene cluster evolution. Furthermore, with the interconnection of modules a user can accelerate result curation and downstream analysis, e.g. using gene neighbourhood estimation output to adjust intergenic distance thresholds to obtain more relevant matches. To our knowledge, we present the first free and publicly available web server for accelerated curation of homologous gene clusters with integrated downstream interpretation. By broadening accessibility of gene cluster analysis methods we hope this will lead to accelerated analysis and annotation of BGCs and contribute to the general knowledge of their subsequent products.
Implementation and available tools
The aim of CAGECAT is to provide a platform that seamlessly connects gene cluster analysis tools in an accessible web server for search and interpretation of results. To provide this service, CAGECAT implements a queue system that allows parallel job submissions, supported by the Python ‘rq’ library and a Flask web server (see Additional file 1). The search module leverages the cblaster pipeline, which utilises remote BLAST searches via NCBI’s servers as well as accelerated local Hidden Markov Model (HMM) based searches. Besides rapid similarity searches of entire BGC regions, cblaster provides several functions for gene neighbourhood estimation (GNE), sequence extraction, and visualisation (see Gilchrist et al. for a detailed description of methods) [16]. The clinker pipeline is currently used for the visualisation module, which provides automated cluster alignment and homology annotations. CAGECAT has been designed to provide rapid interoperability between these functions, so that homologous clusters of interest can be selected and used in subsequent analysis. A graphical summary of tool interoperability is given in Fig. 1.
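The idea behind gene neighbourhood estimation can be illustrated with a toy example. The sketch below is not cblaster's implementation; it assumes that hits on a single scaffold are given as hypothetical (start, end) coordinates, groups neighbouring hits whenever the intergenic gap between them is at most a chosen threshold, and reports how the number of groups changes as that threshold grows. The function names and coordinates are illustrative only.

```python
from typing import Dict, List, Tuple

Hit = Tuple[int, int]  # hypothetical (start, end) coordinates of one hit gene


def group_by_gap(hits: List[Hit], max_gap: int) -> List[List[Hit]]:
    """Group hits whose distance to the previous hit is at most max_gap."""
    groups: List[List[Hit]] = []
    for hit in sorted(hits):
        # Gap = start of this hit minus end of the last hit in the current group.
        if groups and hit[0] - groups[-1][-1][1] <= max_gap:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups


def neighbourhood_profile(hits: List[Hit], gaps: List[int]) -> Dict[int, int]:
    """Number of groups obtained for each candidate gap threshold."""
    return {gap: len(group_by_gap(hits, gap)) for gap in gaps}


# Illustrative coordinates (not real data).
hits = [(100, 900), (1200, 2000), (9000, 9800), (30000, 31000)]
profile = neighbourhood_profile(hits, [500, 10000, 25000])
print(profile)
```

With these made-up coordinates, a larger threshold merges more hits into a single group; inspecting where the group count levels off is one way to reason about a sensible intergenic distance setting.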
Databases for Hidden Markov Model (HMM) searches
Searches for homologous gene clusters based on HMM profiles using cblaster require cblaster-generated HMM databases. Genus-specific Pfam databases were generated as detailed in supplemental methods, resulting in 70 genera with 10 or more genomes for fungi, and 43 genera with 50 or more genomes. A custom script to fetch representative and reference genomes of prokaryotes and fungi was made using NCBI’s e-search utilities [19]. To keep CAGECAT freely accessible and to limit storage requirements, researchers who wish to use custom HMM databases will need to use the command line version of cblaster or a local installation of CAGECAT.
Job management
CAGECAT manages job submissions through a queue submission system, which processes jobs in a parallelizable first-in-first-out manner. Remote BLASTp queries are submitted to the NCBI API which leverages a scalable infrastructure allowing for multiple simultaneous searches (~ 10 requests/sec with an API key). By default, up to 15 jobs can be run in parallel to ensure stability and throughput. Upon job execution, the job command is constructed with the user-defined values of the input parameters and the appropriate pipelines are executed via Python. All output files are then stored and saved using a uniquely generated job ID. See supplemental methods for further technical details.
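The queueing behaviour described above can be sketched with the Python standard library. This is a minimal illustration and not CAGECAT's code, which builds on the ‘rq’ library: it assumes a thread pool as the worker backend, and the tool name `search-tool` and its option names are hypothetical placeholders rather than real cblaster flags. Jobs are started in submission order, at most 15 run at once, and each job is tracked under a uniquely generated ID.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List

MAX_PARALLEL_JOBS = 15  # default cap stated in the text


def run_job(tool: str, params: Dict[str, object]) -> Dict[str, List[str]]:
    """Placeholder worker: only builds the argument list a real worker would run."""
    command = [tool]
    for name, value in sorted(params.items()):
        command += [f"--{name}", str(value)]  # hypothetical option spelling
    return {"command": command}


class JobQueue:
    """First-in-first-out queue with a bounded number of parallel workers."""

    def __init__(self, max_workers: int = MAX_PARALLEL_JOBS) -> None:
        # ThreadPoolExecutor hands queued tasks to workers in submission order.
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, tool: str, params: Dict[str, object]) -> str:
        job_id = uuid.uuid4().hex  # unique ID used to look up results later
        self._jobs[job_id] = self._pool.submit(run_job, tool, params)
        return job_id

    def result(self, job_id: str) -> Dict[str, List[str]]:
        return self._jobs[job_id].result()  # blocks until the job has finished

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


queue = JobQueue()
first = queue.submit("search-tool", {"gap": 20000, "min_hits": 3})
second = queue.submit("search-tool", {"gap": 50000, "min_hits": 2})
outcome = queue.result(first)
queue.shutdown()
print(first, outcome["command"])
```

A production setup would typically persist the queue and results outside the web process so that jobs survive restarts; the in-memory dictionary here only keeps the example self-contained.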
Results and user interface
Input and output
Two entry points for queries are currently implemented in CAGECAT for either gene cluster search via cblaster (search module) or visualisation via clinker (visualisation module). Input and output for other implemented modules are shown in Table 1.
Table 1.
Tool | Purpose | Input | Output |
---|---|---|---|
cblaster (Search module) | Search homologous gene clusters | FASTA (protein); GenBank; NCBI accessions; HMM profile identifiers | Interactive hit visualisation (HTML); Summary table; Session file (JSON) |
clinker (Visualisation module) | Visualise genomes containing homologous gene clusters (all vs. all) | GenBank | Interactive cluster visualisation (HTML) |
Recompute | Re-filter a previous search with new parameters | Session file output of cblaster search (JSON format) | Interactive hit visualisation (HTML); Summary table; Session file (JSON) |
Extract sequences | Extract sequences which contain a certain query | Session file output of cblaster search (JSON format) | Protein sequences (FASTA); Protein headers (TXT) |
Extract clusters | Extract selected clusters | Session file output of cblaster search (JSON format) | GenBank |
Gene Neighborhood Estimation (GNE) | Check intergenic gap from results to optimize parameter | Session file output of cblaster search (JSON format) | Summary file (TXT); Interactive visualisation (HTML) |
Plot clusters | Visualise a search session (align queries to clusters) | Session file output of cblaster search (JSON format) | Interactive visualisation (HTML) |
cblaster enables gene cluster searches and clinker creates publication-ready gene cluster visualisations. Additional downstream functions can be executed directly from the results of a previous session.
The search module allows local files in either GenBank or FASTA format (protein sequences) to be uploaded and processed by the cblaster pipeline. Additionally, NCBI accession numbers can be used to submit a search query on the NCBI database, which can be combined with local searches using HMM profiles in predefined databases on CAGECAT. The input page (Additional file 1: Figure S1) also contains optional parameters for selection of remote databases, search behaviour, and clustering of results. For the visualisation module, users can upload several GenBank files or directly use outputs from the search module.
After completion of remote NCBI searches, users are presented with a cluster heatmap, which displays the absence/presence of each query protein sequence across the genomic hits (Fig. 2A). As in the original cblaster, the results are sorted and colored based on BLAST similarity and number of matching proteins to the query cluster for rapid identification and comparison of homologous gene clusters across genomes. For the visualisation module, clinker will generate interactive gene cluster comparison figures with links drawn between similar genes on neighbouring clusters and shaded based on sequence identity (Fig. 2B). Further details of these modules can be found at https://cagecat.bioinformatics.nl/tools/explanation and several example case studies for the cblaster output can be found in Gilchrist et al.
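The presence/absence view described above can be thought of as a matrix with one row per hit cluster and one column per query protein. The following sketch shows one way to assemble and rank such a matrix; it is not cblaster's code, and the `Hit` record, the cluster labels, the identity values, and the 30% threshold are all hypothetical choices made for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class Hit(NamedTuple):
    query: str       # name of the query protein
    cluster: str     # label of the hit cluster (hypothetical)
    identity: float  # percent identity of the alignment


def presence_matrix(
    queries: List[str], hits: List[Hit], min_identity: float = 30.0
) -> List[Tuple[str, List[float]]]:
    """Rows of (cluster, best identity per query); 0.0 marks an absent query."""
    best: Dict[str, Dict[str, float]] = {}
    for hit in hits:
        if hit.identity < min_identity:
            continue  # discard hits below the identity threshold
        row = best.setdefault(hit.cluster, {})
        row[hit.query] = max(row.get(hit.query, 0.0), hit.identity)
    # Rank by number of matched queries first, then by summed identity.
    ranked = sorted(
        best, key=lambda c: (len(best[c]), sum(best[c].values())), reverse=True
    )
    return [(c, [best[c].get(q, 0.0) for q in queries]) for c in ranked]


# Illustrative values (not real search results).
queries = ["geneA", "geneB", "geneC"]
hits = [
    Hit("geneA", "cluster1", 92.0), Hit("geneB", "cluster1", 75.5),
    Hit("geneC", "cluster1", 60.0), Hit("geneA", "cluster2", 88.0),
    Hit("geneC", "cluster2", 25.0), Hit("geneB", "cluster3", 40.0),
    Hit("geneC", "cluster3", 55.0),
]
matrix = presence_matrix(queries, hits)
for cluster, row in matrix:
    print(cluster, row)
```

Raising or lowering `min_identity` in this toy setting changes which cells count as present, which mirrors how stricter or looser homology thresholds change the set of reported clusters.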
Features and interoperability
Users can download job results to their local computer within 30 days, and output HTML files are displayed in-browser, allowing for interactive inspection of results. The search module output allows for manual gene cluster selection to further curate results, which can be directly exported as GenBank sequences. To accelerate analysis, CAGECAT provides interoperation between results and the available modules. Selections of output from the search module can be directly used as input for downstream analysis (e.g. to selectively visualise some results) or to recompute a search using different parameters (Fig. 3). Notably, when genomic regions from the search module are used for analysis in the visualisation module, the visualisation will include all genes present within each genomic region, including those that were not specified in the search query.
Runtime and scalability
Remote search times are largely dependent on NCBI services which cannot be definitively benchmarked due to dependency on service traffic. However, processing of 346 queries over the 5-month user testing period showed an average search completion time under 8 min. Other functions such as clinker visualisation, recompute, gene cluster neighbourhood estimation, and cluster extraction all showed negligible processing time under 30 s (Additional file 1: Table S1).
Conclusions and future directions
With CAGECAT, we aim to lower the technical barrier to executing gene cluster analysis. Downstream analyses can be rapidly performed using the results of a previously executed job, which accelerates curation and comparative visualisation. This service enables a quick search of whole gene cluster sequences against NCBI non-redundant or RefSeq databases that can be confined to a selected genus. Currently, two entry points exist to start an analysis on CAGECAT: (I) finding homologous gene clusters using a query cluster and the cblaster search module, and (II) visualising gene clusters using a set of query clusters and the clinker module. CAGECAT does not impact or interfere with the analysis capabilities of the implemented tools and acts as a bridge to allow for rapid retrieval of homologous gene clusters from continually updated public databases. We foresee CAGECAT being used by a wide audience to easily uncover homologous BGCs and provide publication-quality visualisations without the need for computational resources or programming expertise. The service is also built to be extensible so that additional downstream analyses can be connected in future versions. Suggestions and comments sent via the contact page will be carefully considered during development. Furthermore, CAGECAT is also useful for comparative analysis and discovery of gene clusters beyond those that encode the production of specialized metabolites, such as xenobiotic degradation pathways [20]. Because the remote database is not restricted to any particular taxa, this service can also be used for general homology searches beyond those detailed in this manuscript on a variety of genomes (e.g. human, mouse). Inter-taxa results are also possible with lower homology thresholds set in the advanced options.
With this web server, we aim to accelerate comparative analysis of gene clusters and provide an easy-to-use interface to help uncover clues for further study of BGCs encoding useful specialized metabolites as well as a starting point for investigating gene cluster evolution.
Availability
Project name: Comparative Gene Cluster Analysis Toolbox (CAGECAT).
Project home page: https://cagecat.bioinformatics.nl
Operating system(s): Linux / Platform independent via Docker.
Programming language: Python.
Other requirements: Python 3.8, Docker.
License: MIT.
Source code: https://github.com/malanjary-wur/CAGECAT
Acknowledgements
We thank all researchers involved in beta testing from within the Bioinformatics group, Wageningen University, and the School of Molecular Sciences, The University of Western Australia.
Abbreviations
- API
Application programming interface
- BGC
Biosynthetic Gene Cluster
- CAGECAT
Comparative Gene Cluster Analysis Toolbox
- CORASON
Core Analysis of Syntenic Orthologs to prioritize Natural Product BGCs
- HMM
Hidden Markov Model
- IMG–ABC
Integrated Microbial Genomes–Atlas of Biosynthetic Gene Clusters
- MIBiG
Minimum information about a biosynthetic gene cluster
- NCBI
National Center for Biotechnology Information
Author contributions
M.B. developed and maintained the web and core Python architecture for CAGECAT. C.L.M.G. provided cblaster / clinker integration support and product testing. Y-H.C. and T.J.B. contributed to testing and manuscript preparation. M.H.M. and M.A. supervised and coordinated project development. All authors read and approved the final manuscript.
Funding
M.A is supported by the NWO Talent programme Veni science domain (VI.Veni.202.130). C.L.M.G is supported by the Australian Government Research Training Project (RTP) Ph.D. scholarship, the National Research Foundation of Korea (NRF) [2021R1C1C1012065, 2019R1A6A1A10073437], the Samsung DS research fund program and the Creative-Pioneering Researchers Program through Seoul National University. Y-H.C is supported by an Australian Research Council Future Fellowship (FT160100233). M.H.M. is supported by an ERC Starting Grant (948770-DECIPHER to M.H.M.).
Availability of data and materials
All data and materials are freely available via the updated git repository: https://github.com/malanjary-wur/CAGECAT as well as the release version used in this manuscript: https://github.com/malanjary-wur/CAGECAT/releases.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
MHM is a co-founder of Design Pharmaceuticals and a member of the scientific advisory board of Hexagon Bio. All other authors have no conflict of interest.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Laich F, Fierro F, Cardoza RE, Martin JF. Organization of the gene cluster for biosynthesis of penicillin in Penicillium nalgiovense and antibiotic production in cured dry sausages. Appl Environ Microbiol. 1999;65:1236–1240. doi: 10.1128/AEM.65.3.1236-1240.1999.
- 2. Medema MH, Fischbach MA. Computational approaches to natural product discovery. Nat Chem Biol. 2015;11:639–648. doi: 10.1038/nchembio.1884.
- 3. Crits-Christoph A, Bhattacharya N, Olm MR, Song YS, Banfield JF. Transporter genes in biosynthetic gene clusters predict metabolite characteristics and siderophore activity. Genome Res. 2020. doi: 10.1101/gr.268169.120.
- 4. Cimermancic P, Medema MH, Claesen J, Kurita K, Wieland Brown LC, Mavrommatis K, Pati A, Godfrey PA, Koehrsen M, Clardy J, et al. Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell. 2014;158:412–421. doi: 10.1016/j.cell.2014.06.034.
- 5. Skinnider MA, Merwin NJ, Johnston CW, Magarvey NA. PRISM 3: expanded prediction of natural product chemical structures from microbial genomes. Nucleic Acids Res. 2017;45:W49–W54. doi: 10.1093/nar/gkx320.
- 6. Hannigan GD, Prihoda D, Palicka A, Soukup J, Klempir O, Rampula L, Durcak J, Wurst M, Kotowski J, Chang D, et al. A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res. 2019;47:e110. doi: 10.1093/nar/gkz654.
- 7. Blin K, Shaw S, Steinke K, Villebro R, Ziemert N, Lee SY, Medema MH, Weber T. antiSMASH 5.0: updates to the secondary metabolite genome mining pipeline. Nucleic Acids Res. 2019;47:W81–W87. doi: 10.1093/nar/gkz310.
- 8. Kautsar SA, Blin K, Shaw S, Navarro-Muñoz JC, Terlouw BR, van der Hooft JJJ, van Santen JA, Tracanna V, Suarez Duran HG, Pascal Andreu V, et al. MIBiG 2.0: a repository for biosynthetic gene clusters of known function. Nucleic Acids Res. 2020;48:D454–D458. doi: 10.1093/nar/gkz882.
- 9. Blin K, Shaw S, Kautsar SA, Medema MH, Weber T. The antiSMASH database version 3: increased taxonomic coverage and new query features for modular enzymes. Nucleic Acids Res. 2021;49:D639–D643. doi: 10.1093/nar/gkaa978.
- 10. Kautsar SA, Blin K, Shaw S, Weber T, Medema MH. BiG-FAM: the biosynthetic gene cluster families database. Nucleic Acids Res. 2021;49:D490–D497. doi: 10.1093/nar/gkaa812.
- 11. Mungan MD, Blin K, Ziemert N. ARTS-DB: a database for antibiotic resistant targets. Nucleic Acids Res. 2022;50:D736–D740. doi: 10.1093/nar/gkab940.
- 12. Palaniappan K, Chen I-MA, Chu K, Ratner A, Seshadri R, Kyrpides NC, Ivanova NN, Mouncey NJ. IMG-ABC vol 5.0: an update to the IMG/Atlas of Biosynthetic Gene Clusters Knowledgebase. Nucleic Acids Res. 2020;48:D422–D430. doi: 10.1093/nar/gkz932.
- 13. Navarro-Muñoz JC, Selem-Mojica N, Mullowney MW, Kautsar SA, Tryon JH, Parkinson EI, De Los Santos ELC, Yeong M, Cruz-Morales P, Abubucker S, et al. A computational framework to explore large-scale biosynthetic diversity. Nat Chem Biol. 2020;16:60–68. doi: 10.1038/s41589-019-0400-9.
- 14. Kautsar SA, van der Hooft JJJ, de Ridder D, Medema MH. BiG-SLiCE: a highly scalable tool maps the diversity of 12 million biosynthetic gene clusters. Gigascience. 2021;10:giaa154. doi: 10.1093/gigascience/giaa154.
- 15. Medema MH, Takano E, Breitling R. Detecting sequence homology at the gene cluster level with MultiGeneBlast. Mol Biol Evol. 2013;30:1218–1223. doi: 10.1093/molbev/mst025.
- 16. Gilchrist CLM, Booth TJ, van Wersch B, van Grieken L, Medema MH, Chooi Y-H. cblaster: a remote search tool for rapid identification and visualization of homologous gene clusters. Bioinf Adv. 2021;1.
- 17. Gilchrist CLM, Chooi Y-H. Clinker & clustermap.js: automatic generation of gene cluster comparison figures. Bioinformatics. 2021. doi: 10.1093/bioinformatics/btab007.
- 18. Hadjithomas M, Chen I-MA, Chu K, Huang J, Ratner A, Palaniappan K, Andersen E, Markowitz V, Kyrpides NC, Ivanova NN. IMG-ABC: new features for bacterial secondary metabolism analysis and targeted biosynthetic gene cluster discovery in thousands of microbial genomes. Nucleic Acids Res. 2017;45:D560–D565. doi: 10.1093/nar/gkw1103.
- 19. Entrez Programming Utilities Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2010.
- 20. Wisecaver JH, Rokas A. Fungal metabolic gene clusters—caravans traveling across genomes and environments. Front Microbiol. 2015;6.