Abstract
The determination of core genes in viral and bacterial genomes is crucial for a better understanding of their relatedness and for their classification. CoreGenes5.0 is an updated, user-friendly, web-based software tool for identifying core genes in, and mining data from, viral and bacterial genomes. This tool has been useful in the resolution of several issues arising in the taxonomic analysis of bacteriophages and has incorporated many suggestions from researchers in that community. The webserver displays results in a format that is easy to understand and allows for automated batch processing, without the need for any user-installed bioinformatics software. CoreGenes5.0 uses group protein clustering of genomes with one of three algorithm options to output a table of core genes from the input genomes. Previously annotated “unknown genes” may be identified with homologues in the output. The updated version of CoreGenes is able to handle more genomes, is faster, and is more robust, providing easier analysis of custom or proprietary datasets. CoreGenes5.0 is accessible at coregenes.org, having migrated from a previous site.
Keywords: coregenes, webserver, bioinformatics, genomics, viruses, bacteria
1. Introduction
Core genes are the set of common genes in a set of genomes, in contrast to genes which are not common amongst these genomes (accessory genes) [1]. A better understanding of these core genes has led to the design and synthesis of a minimal bacterial genome [2] to address basic and applied (biotechnological) questions and needs. Core genes have also revealed insights into carbon cycling and carbohydrate metabolism in soil metagenomes [3]. The variation in core genes can also be used for epidemiological typing in bacteria [4]. Core genes have also been used in evolutionary studies of nucleo-cytoplasmic large DNA viruses of eukaryotes [5].
With the growth of bacterial and viral genome databases, there has been an increase in the demand for easily accessible and user-friendly software for genomic analyses. Since its original development in 2002 [6], CoreGenes has been continuously updated in order to increase its ease of use, add functionality, and incorporate additional user suggestions [7]. In particular, it has been used extensively for characterizing and classifying bacteriophages, for example resolving taxonomic issues in the Podoviridae, Myoviridae, and Siphoviridae families [8,9,10] and characterizing newly sequenced bacteriophages [11]. The taxonomic approach pioneered by the CoreGenes application for the Podoviridae, Myoviridae, and Siphoviridae families has been extended by other researchers [12]. CoreGenes has demonstrated utility in the data mining of pathogenic viral and bacterial genomes [13]. However, a limiting factor with older versions of the software was its slower processing speed, which restricted the number of accession numbers allowed in each run and greatly limited the size of the input genomes. In this iteration of CoreGenes, these limitations have been greatly reduced. Additionally, coding sequence retrieval can now be used on the web interface, allowing the user to easily view the genome or retrieve new coding sequence information quickly, bypassing the NCBI webpage.
2. Methods
CoreGenes5.0 is written mainly in Python. The webpage implementation is written using the Django web framework for Python, HTML, CSS, and JavaScript. The reimplementation of the CoreGenes3.5 algorithm uses the same iterative algorithm to process a query genome against a reference genome, and to then create a new consensus genome as described previously [7]. The updated 3.5 version uses MMseqs2 [14] for rapid protein searches instead of Washington University BLAST (WU-BLAST). The genomes are retrieved from GenBank using the accession numbers entered by the user. The former option for a user-input BLAST score has been updated to accept e-values. While the former iteration of CoreGenes only allowed for up to five input accession numbers, both versions of the updated CoreGenes algorithm now allow for the quick input of twenty accession numbers or, using the file upload option, an uploaded .txt file of up to two hundred comma-separated accession numbers. It is recommended that queries of bacterial genomes not exceed 50 input genomes, because their larger size requires longer processing.
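The input limits described above can be illustrated with a minimal sketch of how a comma-separated accession list might be checked before submission. This is not the CoreGenes5.0 source code: the function name `parse_accessions`, the constants, and the simplified accession pattern are all assumptions made for illustration, and only the limits of 20 and 200 come from the text.

```python
import re

# Limits stated in the text: 20 accessions via the web form,
# up to 200 via an uploaded comma-separated .txt file.
MAX_FORM_ACCESSIONS = 20
MAX_FILE_ACCESSIONS = 200

# Assumed, simplified accession pattern (hypothetical): a short letter prefix,
# an optional underscore, at least five digits, and an optional ".version".
ACCESSION_PATTERN = re.compile(r"^[A-Z]{1,6}_?\d{5,}(\.\d+)?$")


def parse_accessions(text, limit=MAX_FILE_ACCESSIONS):
    """Split comma-separated accession numbers, validate them, and enforce a limit."""
    accessions = []
    # Treat line breaks like commas so multi-line files are also accepted.
    for token in text.replace("\n", ",").split(","):
        token = token.strip()
        if not token:
            continue  # ignore empty fields such as a trailing comma
        if not ACCESSION_PATTERN.match(token):
            raise ValueError(f"not a recognisable accession number: {token!r}")
        if token not in accessions:
            accessions.append(token)  # keep first occurrence, drop duplicates
    if len(accessions) > limit:
        raise ValueError(f"{len(accessions)} accessions given; the limit is {limit}")
    return accessions


# Example with three accession numbers taken from Table 2.
example = parse_accessions("NUHQ01000006.1, CAIT01000004.1,\nASWA01000004.1")
```

Passing `limit=MAX_FORM_ACCESSIONS` would model the smaller web-form limit. A real submission would still depend on GenBank recognising each accession, which this sketch does not check.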
CoreGenes5.0 uses the GET_HOMOLOGUES package [15,16] with BLAST+ [17,18] to perform group protein clustering using default options. The group protein clustering is supported by three common clustering algorithms: OrthoMCL [19], bidirectional best hit, and COGtriangles [20]. In this updated version, an optional input for the inclusion or exclusion of paralogs is available. This option excludes sequences that have significant matches in the same genome and is implemented by the GET_HOMOLOGUES package. The GET_HOMOLOGUES manual defines “inparalogs” as “sequences with best hits in its own genome and excludes clusters with these sequences”. In addition, the user can also input a BLAST e-value in the web interface. The output table of either version is displayed on the webserver, with a link that is available for up to one month. A .csv file with a similar table formatting is available for download through the optional email input. For queries which may take longer to process, an optional email notification is now available with a link to the results page.
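To make the clustering idea concrete, the following is a minimal sketch of the bidirectional best hit criterion and of selecting core clusters, that is, clusters with a member in every input genome. It is an illustration only and not the GET_HOMOLOGUES implementation: the function names and the input structures (precomputed best-hit dictionaries and lists of genome/protein pairs) are assumptions.

```python
def bidirectional_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where b is the best hit of a and a is the best hit of b.

    Both arguments are assumed to map a protein ID to the ID of its
    best-scoring hit in the other genome (hypothetical, precomputed input).
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )


def core_clusters(clusters, genomes):
    """Keep only clusters that contain at least one protein from every genome.

    clusters: list of clusters, each a list of (genome_id, protein_id) tuples.
    genomes:  iterable of all genome IDs in the comparison.
    """
    required = set(genomes)
    return [c for c in clusters if required <= {genome for genome, _ in c}]


# Toy example: a3 points to b2, but b2 points back to a2, so a3 has no partner.
pairs = bidirectional_best_hits(
    {"a1": "b1", "a2": "b2", "a3": "b2"},
    {"b1": "a1", "b2": "a2", "b3": "a3"},
)

# Toy example: only the first cluster has members in all three genomes.
core = core_clusters(
    [
        [("G1", "p1"), ("G2", "p7"), ("G3", "p4")],
        [("G1", "p2"), ("G2", "p9")],
    ],
    genomes=["G1", "G2", "G3"],
)
```

In this sketch, excluding paralogs would correspond to an extra filter that drops clusters containing two proteins from the same genome. That filter is left out here because the exact behaviour is defined by GET_HOMOLOGUES.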
CoreGenes now includes a coding sequence (CDS) retrieval option. Input genome accession numbers are used to retrieve and parse coding sequence data from the GenBank database. The parsed CDS files can be downloaded via email as a zipped folder containing the FASTA files (with a .fasta extension) of the input genomes. The Custom Dataset upload option has been updated to accept a standardized FASTA file, with a space-separated accession identifier and protein title.
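The header layout described for custom datasets (an accession identifier, a space, then the protein title) can be read with a few lines of Python. The sketch below is a hypothetical reader written for illustration, not the parser used by the webserver. The function name `parse_custom_fasta` and the example identifiers are assumptions.

```python
def parse_custom_fasta(text):
    """Parse FASTA text whose headers look like '>ACCESSION protein title'.

    Returns a list of (accession, title, sequence) tuples in file order.
    """
    records = []
    accession, title, chunks = None, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if accession is not None:
                records.append((accession, title, "".join(chunks)))
            header = line[1:].strip()
            # Everything before the first space is the accession identifier.
            accession, _, title = header.partition(" ")
            if not accession:
                raise ValueError("FASTA header is missing an accession identifier")
            title = title.strip()
            chunks = []
        else:
            if accession is None:
                raise ValueError("sequence data found before the first header")
            chunks.append(line)
    if accession is not None:
        records.append((accession, title, "".join(chunks)))
    return records


# Made-up identifiers and sequences, for illustration only.
sample = ">P1.1 hexon protein\nMKT\nAAL\n>P2.1 hypothetical protein\nMSS\n"
parsed = parse_custom_fasta(sample)
```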
The rewritten Iterative Comparison Algorithm with MMseqs2 allows a larger number of accession numbers to be processed. While the iterative comparison algorithm is able to handle larger genomes than the previous version of CoreGenes, it still performs fastest with viral and small bacterial genomes. The display table is presented in an easy-to-read format, with each accession number hyperlinked to its NCBI page. Hypothetical proteins are highlighted in red so that they can be easily located and compared with annotated homologues. For larger bacterial genomes, the CoreGenes5.0 webserver is able to process multiple 5 Mb genomes in minutes and is able to handle genomes as large as 10 Mb. The main differences between CoreGenes5.0 and the previous version, CoreGenes4.0, are shown in Table 1 below.
Table 1.
Functionality | CoreGenes4.0 | CoreGenes5.0 |
---|---|---|
Additional clustering algorithms such as Bidirectional best hit, OrthoMCL and COGtriangles made available through the GET_HOMOLOGUES package | X | ✓ |
Faster protein searches using MMseqs2 in the Iterative Comparison Algorithm | X | ✓ |
Email results to user | X | ✓ |
Easy CDS retrieval from GenBank | X | ✓ |
More robust custom data input | X | ✓ |
3. Results
The web interface of CoreGenes5.0 (Figure 1) enables the input of 20 genome accession numbers. We have designed the user interface to be intuitive and easy to use, with only a small number of options. Links on the left of the page lead to the file upload of accession numbers for batch processing of more than 20 genomes. Custom datasets can also be uploaded using a link on the left. The “old” CoreGenes3.5 algorithm can also be accessed by a link on the left (Iterative Comparison Algorithm). Figure 2 shows the partial CoreGenes5.0 output for five human adenovirus genomes. The output is clean and easy to read. Links can be clicked to access the complete genome or individual proteins in GenBank.
CoreGenes5.0 can process small or large bacterial genomes in minutes, as shown in Table 2. Three 5 Mb bacterial genomes take less than 10 minutes to process, while three 1 Mb genomes take just over a minute. It must be noted that these times will increase as the number of queried genomes increases. A partial core gene output of three 5 Mb bacterial genomes is shown in Figure 3.
Table 2.
Genome Size | 1 Mb | 2 Mb | 3 Mb | 4 Mb | 5 Mb |
---|---|---|---|---|---|
Accession #s | NUHQ01000006.1, CAIT01000004.1, ASWA01000004.1 | MTBP01000002.1, CP033822.1, UHGI01000001.1 | UGNN01000001.1, CP035563.1, UKAD01000001.1 | CP021892.1, CP012872.1, UGNN01000001.1 | UGBR01000009.1, SILS01000001.1, RDRU01000001.1 |
Run Time | 00 h:01 m:10 s | 00 h:02 m:22 s | 00 h:05 m:00 s | 00 h:05 m:54 s | 00 h:09 m:56 s |
Number of Homologues | 41 | 372 | 499 | 732 | 1213 |
The annotation of hypothetical or previously annotated “unknown” proteins is made possible by highlighting all hypothetical proteins in red and by providing putative homologues across the output table. For example, in Figure 4, a protein in the bacterium Rhizobium leguminosarum is annotated as a hypothetical protein and labeled in red in the right-hand column. The homologous protein in E. coli, which is annotated as a 2’-5’ RNA ligase (e-value threshold ≤1 × 10⁻⁵), is located on the same row in the left-hand column. It is very likely that this hypothetical protein is also a 2’-5’ RNA ligase. In principle, a genome with hundreds of hypothetical proteins may be annotated using a closely related reference genome and CoreGenes5.0.
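The row-wise reasoning in this example can be sketched in a few lines of Python: for each row of homologues, any protein labeled hypothetical is paired with the informative annotations found on the same row. This is an illustrative sketch under assumed data structures, not part of CoreGenes5.0. The function names and the dictionary layout of a row are hypothetical, and any suggestion produced this way is putative and would need independent confirmation.

```python
def is_hypothetical(annotation):
    """Assumed rule: treat 'hypothetical' or 'unknown' labels as uninformative."""
    lowered = annotation.lower()
    return "hypothetical" in lowered or "unknown" in lowered


def suggest_annotations(rows):
    """Suggest putative annotations for hypothetical proteins.

    rows: list of dicts, one per homologue row of the output table, mapping
          a genome name to the annotation of its protein on that row.
    Returns a list of (genome, [candidate annotations]) tuples.
    """
    suggestions = []
    for row in rows:
        # Collect the informative annotations present on this row.
        informative = sorted(
            {ann for ann in row.values() if not is_hypothetical(ann)}
        )
        if not informative:
            continue  # every homologue on the row is uninformative
        for genome, annotation in row.items():
            if is_hypothetical(annotation):
                suggestions.append((genome, informative))
    return suggestions


# Rows modeled on the Figure 4 example described in the text.
table_rows = [
    {"E. coli": "2'-5' RNA ligase", "R. leguminosarum": "hypothetical protein"},
    {"E. coli": "hypothetical protein", "R. leguminosarum": "hypothetical protein"},
]
putative = suggest_annotations(table_rows)
```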
4. Discussion
CoreGenes5.0 is a vastly improved version of its predecessor, CoreGenes3.5 [7], which is no longer supported at its previous home (http://binf.gmu.edu/genometools.html, accessed on 9 November 2022). It is strongly recommended that users migrate to this updated version. This version is more user-friendly, faster, more robust, and able to handle more genomes, all of which were suggested by users. The bacteriophage research community has used CoreGenes extensively to resolve taxonomic issues that remained in question when based on traditional methods. Nucleotide sequence analysis for taxonomy has been improved with the application of this tool [21]. CoreGenes has been cited in 334 publications, as per Google Scholar. Frequent citations are in the International Committee on Taxonomy of Viruses (ICTV) Working Group publications, which are not recorded by PubMed (https://pubmed.ncbi.nlm.nih.gov, accessed on 9 November 2022). For example, Kropinski A.M., Turner D., Tolstoy I., Moraru C., Adriaenssens E.M., and Mahony J. cited results from CoreGenes 3.5 in “Code assigned: 2022.001 B” as a report from the Bacterial Viruses Subcommittee, Caudoviricetes Study Group in 2022. This publication notes that “the genera Sfi21dtunalikevirus (2013.036 a-dB) and Sfi1unalikevirus (2013.034 a-dB) were renamed Moineauvirus and Brussowvirus, respectively through Taxonomy Proposal 2015.025 aB”. These examples serve to underscore the usefulness and current application of CoreGenes.
CoreGenes5.0 also provides analyses of large bacterial genomes in a shorter timeframe. In addition, results can be downloaded in .csv format for offline use. Batch processing is available by uploading a list of accession numbers, with the results emailed to the user. Hypothetical proteins can also be readily identified and annotated using reference genomes. An added option of including or excluding paralogs may be of particular interest to CoreGenes5.0 users. This option is conveniently provided by the GET_HOMOLOGUES package [15,16], and its definition of “paralogs” is used, as noted in Methods. All of these features make CoreGenes5.0 an easy-to-use software tool for users without computational expertise.
Future work will involve replacing the BLAST+ portions of the pipeline with MMseqs2 [14] or DIAMOND [22] to perform fast protein searches. This will enable even more genomes to be processed at a faster rate. Eventually, we hope to transfer the application to a more powerful computer server or a cloud computing environment for higher throughput processing. It should be noted that CoreGenes has been in continuous use since the earliest version in 2002 [6,23].
Acknowledgments
We thank Venkat Mahadevan for helpful comments on this manuscript.
Author Contributions
P.D. and P.M. implemented the webserver and wrote the manuscript. P.M. conceived the idea for the webserver. D.S. contributed to manuscript writing. All authors have read and agreed to the published version of the manuscript.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The genome data analyzed in this paper are available from GenBank at ncbi.nlm.nih.gov.
Conflicts of Interest
The authors declare no conflict of interest.
Availability and Requirements
Server’s homepage: coregenes.org. Operating system(s): Platform-independent. Other requirements: A web browser such as Chrome, Firefox, Safari, or Microsoft Edge is needed to access the webserver.
Funding Statement
This research received no external funding.
Footnotes
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Tettelin H., Masignani V., Cieslewicz M.J., Donati C., Medini D., Ward N.L., Angiuoli S.V., Crabtree J., Jones A.L., Durkin A.S., et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial “pan-genome”. Proc. Natl. Acad. Sci. USA. 2005;102:13950–13955. doi: 10.1073/pnas.0506758102.
2. Hutchison C.A., 3rd, Chuang R.Y., Noskov V.N., Assad-Garcia N., Deerinck T.J., Ellisman M.H., Gill J., Kannan K., Karas B.J., Ma L., et al. Design and synthesis of a minimal bacterial genome. Science. 2016;351:aad6253. doi: 10.1126/science.aad6253.
3. Howe A., Yang F., Williams R.J., Meyer F., Hofmockel K.S. Identification of the Core Set of Carbon-Associated Genes in a Bioenergy Grassland Soil. PLoS ONE. 2016;11:e0166578. doi: 10.1371/journal.pone.0166578.
4. Leekitcharoenphon P., Lukjancenko O., Friis C., Aarestrup F.M., Ussery D.W. Genomic variation in Salmonella enterica core genes for epidemiological typing. BMC Genom. 2012;13:88. doi: 10.1186/1471-2164-13-88.
5. Yutin N., Koonin E.V. Hidden evolutionary complexity of Nucleo-Cytoplasmic Large DNA viruses of eukaryotes. Virol. J. 2012;9:161. doi: 10.1186/1743-422X-9-161.
6. Zafar N., Mazumder R., Seto D. CoreGenes: A computational tool for identifying and cataloging “core” genes in a set of small genomes. BMC Bioinform. 2002;3:12. doi: 10.1186/1471-2105-3-12.
7. Turner D., Reynolds D., Seto D., Mahadevan P. CoreGenes3.5: A webserver for the determination of core genes from sets of viral and small bacterial genomes. BMC Res. Notes. 2013;6:140. doi: 10.1186/1756-0500-6-140.
8. Lavigne R., Seto D., Mahadevan P., Ackermann H.W., Kropinski A.M. Unifying classical and molecular taxonomic classification: Analysis of the Podoviridae using BLASTP-based tools. Res. Microbiol. 2008;159:406–414. doi: 10.1016/j.resmic.2008.03.005.
9. Lavigne R., Darius P., Summer E.J., Seto D., Mahadevan P., Nilsson A.S., Ackermann H.W., Kropinski A.M. Classification of Myoviridae bacteriophages using protein sequence similarity. BMC Microbiol. 2009;9:224. doi: 10.1186/1471-2180-9-224.
10. Adriaenssens E.M., Edwards R., Nash J.H., Mahadevan P., Seto D., Ackermann H.-W., Lavigne R., Kropinski A.M. Integration of genomic and proteomic analyses in the classification of the Siphoviridae family. Virology. 2015;477:144–154. doi: 10.1016/j.virol.2014.10.016.
11. Zhou W., Feng Y., Zong Z. Two New Lytic Bacteriophages of the Myoviridae Family Against Carbapenem-Resistant Acinetobacter baumannii. Front. Microbiol. 2018;9:850. doi: 10.3389/fmicb.2018.00850.
12. Bin Jang H., Bolduc B., Zablocki O., Kuhn J.H., Roux S., Adriaenssens E.M., Brister J.R., Kropinski A.M., Krupovic M., Lavigne R., et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 2019;37:632–639. doi: 10.1038/s41587-019-0100-8.
13. Mahadevan P., King J.F., Seto D. Data mining pathogen genomes using GeneOrder and CoreGenes and CGUG: Gene order, synteny and in silico proteomes. Int. J. Comput. Biol. Drug Des. 2009;2:100–114. doi: 10.1504/IJCBDD.2009.027586.
14. Steinegger M., Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 2017;35:1026–1028. doi: 10.1038/nbt.3988.
15. Contreras-Moreira B., Vinuesa P. GET_HOMOLOGUES, a versatile software package for scalable and robust microbial pangenome analysis. Appl. Environ. Microbiol. 2013;79:7696–7701. doi: 10.1128/AEM.02411-13.
16. Vinuesa P., Contreras-Moreira B. Robust Identification of Orthologues and Paralogues for Microbial Pan-Genomics Using GET_HOMOLOGUES: A Case Study of pIncA/C Plasmids. In: Mengoni A., Galardini M., Fondi M., editors. Bacterial Pangenomics, Methods in Molecular Biology. Volume 1231. Humana Press; New York, NY, USA: 2015. pp. 203–232.
17. Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., Madden T.L. BLAST+: Architecture and applications. BMC Bioinform. 2009;10:421. doi: 10.1186/1471-2105-10-421.
18. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucl. Acids Res. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389.
19. Li L., Stoeckert C.J., Roos D.S. OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. doi: 10.1101/gr.1224503.
20. Kristensen D.M., Kannan L., Coleman M.K., Wolf Y.I., Sorokin A., Koonin E.V., Mushegian A. A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics. 2010;26:1481–1487. doi: 10.1093/bioinformatics/btq229.
21. Kropinski A.M., Lingohr E.J., Ackermann H.W. The genome sequence of enterobacterial phage 7–11, which possesses an unusually elongated head. Arch Virol. 2011;156:149–151. doi: 10.1007/s00705-010-0835-5.
22. Buchfink B., Xie C., Huson D.H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 2015;12:59–60. doi: 10.1038/nmeth.3176.
23. Mazumder R., Kolaskar A., Seto D. GeneOrder: Comparing the order of genes in small genomes. Bioinformatics. 2001;17:162–166. doi: 10.1093/bioinformatics/17.2.162.