Abstract
Background
Bacterial infections pose a global health threat across clinical and community settings. Over the past decade, the alarming expansion of antimicrobial resistance (AMR) has progressively narrowed therapeutic options, particularly for healthcare-associated infections. This critical situation has been formally recognized by the World Health Organization as a major public health concern. Epidemiological studies have demonstrated that the dissemination of AMR is frequently mediated by specific high-risk bacterial lineages, often designated as “global clones” or “clonal complexes.” Consequently, surveillance of these epidemic clones and elucidation of their pathogenic mechanisms and AMR acquisition pathways have become essential research priorities.
The advent of whole genome sequencing has revolutionized these investigations, enabling comprehensive epidemiological tracking and detailed analysis of mobile genetic elements responsible for resistance gene transfer. However, despite the exponential increase in available bacterial genome sequences, significant challenges persist. Current genomic datasets often suffer from uneven representation of clinically relevant strains and inconsistent availability of accompanying metadata. These limitations create substantial obstacles for large-scale comparative studies and hinder effective surveillance efforts.
Description
This database represents a comprehensive genomic analysis of 98,950 Staphylococcus aureus isolates, a high-priority bacterial pathogen of global clinical significance. We provide detailed isolate characterization through several established typing schemes including multilocus sequence typing (MLST), clonal complex (CC) assignments, spa typing results, and core genome MLST (cgMLST) profiles. The dataset also documents the presence of CRISPR-Cas systems in these isolates.
Beyond fundamental typing data, our resource incorporates the distribution of antimicrobial resistance determinants, virulence factors, and plasmid replicons. These systematically curated genomic features offer researchers valuable insights into isolate epidemiology, resistance mechanisms, and horizontal gene transfer patterns in this highly concerning pathogen.
Conclusion
This database is freely available under CC BY-NC-SA at 10.5281/zenodo.14833440. The data provided enables researchers to identify optimal reference isolates for various genomic studies, supporting critical investigations into S. aureus epidemiology and antimicrobial resistance evolution. This resource will ultimately inform the development of more effective prevention and control measures against this high-priority pathogen.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12863-025-01363-w.
Keywords: Staphylococcus aureus, Healthcare-associated infections, Genomic epidemiology, Whole genome sequencing, CgMLST, Clonal complexes, Antibiotic resistance, Virulence factors
Background
Bacterial infections are one of the most significant challenges in global healthcare. The treatment of these infections is becoming increasingly difficult due to the rise of antimicrobial resistance (AMR) among clinical pathogens, which limits available therapeutic options [1]. However, AMR is not confined to hospital settings; it can also spread among community-based bacterial populations, making it a global issue [2].
The transmission of AMR within specific bacterial species is often driven by the presence of “global clones”, “clonal groups” or “clonal complexes” representing epidemiologically relevant sets of genetically similar strains. Such clones play a crucial role in the spread of resistance, particularly among notorious hospital-associated ESKAPE pathogens such as Klebsiella pneumoniae [3], Acinetobacter baumannii [4], Pseudomonas aeruginosa [5, 6] and Staphylococcus aureus [7]. Thus, effective epidemiological surveillance of the infections caused by multidrug-resistant (MDR) bacteria and the development of strategies to prevent their spread should include identifying whether particular isolates belong to these global clones.
Whole genome sequencing (WGS) has become the preferred method for clonal complex detection and many other types of analysis because it generates large amounts of data efficiently and at a relatively low cost [8]. Public databases, such as NCBI GenBank (https://www.ncbi.nlm.nih.gov/genbank/), contain hundreds of thousands of bacterial genomes, and the increasing availability of WGS data will continue to accelerate the growth of these resources.
The proportion of a bacterial species’ genome sequences in public databases depends on its relevance to public health and the incidence of infections caused by it. Thus, S. aureus, being a major cause of both hospital and community infections globally, accounts for a substantial portion of sequenced genomes [9]. The World Health Organization (WHO) has recently confirmed a high priority in searching for novel therapy options for methicillin-resistant S. aureus (MRSA) [10].
At present, about 100,000 draft S. aureus genomes are available at NCBI (https://www.ncbi.nlm.nih.gov/datasets/taxonomy/1280/, accessed on 10 May 2025). However, this repository usually does not supply metadata such as typing results or AMR gene presence. These features require additional processing using separate pipelines, which sometimes demand substantial bioinformatics skills from their users. Another well-known database, PubMLST [11], contains strain identification information and various epidemiological metadata for more than 30,000 genomes of S. aureus [12] (https://pubmlst.org/bigsdb?db=pubmlst_saureus_isolates&page=query&genomes=1, accessed on 10 May 2025). Retrieving antimicrobial resistance, virulence, and plasmid data from this resource is challenging, especially for high-throughput analyses. Additionally, there is no efficient way to download multiple genomes at once using advanced filtering options. Previously, S. aureus genomes from GenBank were analyzed using the Staphopia analysis pipeline [13], but only about 40,000 genomes were available at that time, less than half of the number currently available.
Taking into account all of the above, we developed a database containing typing information, including the determination of clonal complex and SPA-type, for the whole set of S. aureus genomes available in the NCBI genome assembly database (https://www.ncbi.nlm.nih.gov/datasets/taxonomy/1280/, accessed on 23 December 2024), which included 98,950 genomes after quality filtering. The data provided in the database include multilocus sequence typing (MLST)-based sequence types (STs, based on 7 loci), SPA-types (based on the polymorphic X region of the protein A gene), core genome MLST (cgMLST, based on 1861 loci) profiles, and information on the presence and type of CRISPR-Cas systems in S. aureus isolates from GenBank. The dataset also includes antibiotic resistance markers (potentially conferring resistance to different drug classes), virulence-associated genes, and plasmid replicon elements.
In our prior studies, this database proved invaluable for selecting optimal reference genomes for comparative analyses [14]. Given its proven utility for S. aureus genomic research, we made the database publicly accessible.
Construction and content
We extracted 99,191 genomic sequences of S. aureus from GenBank (https://www.ncbi.nlm.nih.gov/genbank/, accessed on 23 December 2024) for which the assembly level was indicated as ‘Complete Genome’, ‘Chromosome’, or ‘Scaffold’. We then excluded 241 isolates (about 0.24%) for which MLST indicated an inexact scheme match, or which showed contamination by other species or incorrect species identification. All remaining isolates (n = 98,950) passed the 95% ANI threshold in pairwise comparisons, meaning that no additional genomes required exclusion based on this parameter. This value was recently confirmed to represent a reliable threshold for species differentiation [15].
Detection of MLST-based STs was performed using the Institut Pasteur database [16] (https://pubmlst.org/bigsdb?db=pubmlst_saureus_seqdef, accessed on 26 December 2024) using the typing scheme described previously [17].
The emergence of resistance to multiple drug classes in S. aureus primarily occurs through plasmid-encoded resistance gene transfer [18]. Notably, specific hospital-associated epidemic lineages, classified into distinct clonal complexes, have been demonstrated as key contributors to this phenomenon [19, 20]. Consequently, classifying clinical isolates into established clonal complexes serves as a critical component in both epidemiological monitoring and tracking AMR dissemination. In our database, such an assignment was based on MLST ST according to previously published data [7]. Given the differential involvement of plasmid types in AMR gene dissemination [21], detection of clinically significant ‘epidemic’ plasmids through replicon typing represents a critical analysis step. Our database includes these typing results to facilitate such evaluations.
The core genome multilocus sequence typing (cgMLST) profiles were generated using the MentaLiST software [22] (https://github.com/WGS-TB/MentaLiST, version 0.2.4, default settings, accessed on December 24, 2024) with a scheme of 1861 loci downloaded from cgmlst.org (https://www.cgmlst.org/ncs/schema/schema/Saureus3315/, accessed on December 23, 2024).
AMR genes were revealed by means of Resfinder 4.6.0 software [23] (https://genepi.food.dtu.dk/resfinder, accessed on January 10, 2025, using default parameters).
Virulence factors in S. aureus genomes were identified using the Virulence Factor Database (VFDB) [24] (http://www.mgc.ac.cn/VFs/main.htm, accessed on January 12, 2025) with default settings.
Plasmid replicons were detected via PlasmidFinder 2.1, using its standard parameters [25] (https://cge.food.dtu.dk/services/PlasmidFinder/, accessed on January 10, 2025).
The presence of functional CRISPR-Cas systems in the genomes was analyzed with CRISPRCasFinder [26] version 4.2.20 using the options ‘-fast -rcfowce -ccvRep -vicinity 1000 -cas -useProkka’.
Further data processing, formatting of output files, and preparation of tables for the database were carried out using a computational pipeline that we previously developed and have applied in several prior studies [4, 27].
The database includes five tables provided in different formats: xlsx (for all tables except cgMLST), tab-delimited txt, and pdf (only for the summary table). The xlsx files can be processed further by users in spreadsheet software, as outlined in previous studies [28, 29]. The txt format is suitable for computational processing with bioinformatics tools, while the pdf format contains a summary of the key typing results in a ready-to-read form.
The tables included in the database are as follows:
Summary table (table_summary): Contains typing information for all isolates, such as MLST sequence type (ST), SPA-types, clonal complex assignments, and the presence and type of CRISPR-Cas systems in the genomes.
AMR gene table (table_amr): Lists the AMR genes found in the genomes of all isolates, specifically those known to confer resistance to various classes of antimicrobial drugs.
Virulence gene table (table_vfdb): Provides information on the presence of genes encoding virulence factors as identified in the VFDB (Virulence Factor Database).
Plasmid replicon table (table_plasmid): Details the plasmid replicons detected in the genomes of the isolates.
cgMLST profile table (table_cgmlst): Includes the cgMLST profiles for the isolates, offering higher resolution and more detailed comparison than traditional MLST.
Utility and discussion
Table structure and content description
The format and exemplary data for the summary table are presented in Table 1.
Table 1.
Exemplary data for the typing summary table
| Assembly code | Clonal complex | MLST ST | SPA-formula | SPA-type | CRISPR-Cas |
|---|---|---|---|---|---|
| GCA_000025145.2 | CC5 | ST225 | T-M-D-M-G-M-M-K | t003 | NF³ |
| GCA_001307235.1 | NA¹ | ST1093 | U-J-G-F-M-B-B-P-B | t359 | III-A |
| GCA_001349975.1 | NA | ND² | no enriched sequence | ND | UNKN |

¹ ‘NA’ (not available) indicates that the isolate did not belong to any known clonal complex
² ‘ND’ (not determined) indicates that SPA-type or ST could not be determined
³ ‘NF’ (not found) indicates that no CRISPR-Cas system was found in the genome
The first column indicates the assembly code from Genbank, which identifies a particular S. aureus genome assembly in a unique way and is used throughout all tables.
‘CC’ stands for ‘clonal complex’ and shows the assignment of a particular isolate to known clonal complexes, which represent the sets of closely related STs. Such an assignment is based on the ST according to the previously published data [7]. If an isolate could not be assigned to any defined CC, this column contains ‘NA’ (not available) designation. The composition of CCs is presented in Table S1 for reference.
The third column contains the ST determined by the combination of alleles at seven housekeeping gene loci (arcC, aroE, glpF, gmk, pta, tpi and yqiL) of the typing scheme [17]. Each variant of a particular locus is numbered sequentially, and each unique combination of seven locus variants constitutes an ST, which is assigned its own number. For instance, the combination of the arcC_3, aroE_3, glpF_1, gmk_1, pta_4, tpi_4 and yqiL_3 alleles designates ST8. The definitions of STs can be found in the Institut Pasteur database (https://pubmlst.org/bigsdb?db=pubmlst_saureus_seqdef, accessed on 23 May 2025). ‘ND’ in this column indicates that the ST of the corresponding isolate could not be determined because of low sequencing quality, insufficient coverage of the genome regions corresponding to MLST alleles, the presence of a previously unreported MLST allele, or a novel allele combination not yet uploaded to the database.
The fourth and fifth columns contain the SPA-formula and SPA-type of the isolate, respectively. The spa typing method is based on determining the sequence of the polymorphic X region of the protein A gene of S. aureus, which is present in all isolates and thus represents a useful marker for epidemiological surveillance [30]. The X region contains a variable number of 24-bp repeats, each of which can be assigned a numerical (or letter) code, and the combination of such codes (SPA-formula) defines a SPA-type. The regularly updated database of SPA-types is available at the Ridom Spa Server (http://spaserver.ridom.de/, 22,221 types are available; accessed on 23 May 2025).
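The mapping from an allelic profile to an ST can be illustrated with a minimal Python sketch. It is not the typing procedure used to build the database: the function name `assign_st` and the one-entry profile table are hypothetical, and only the ST8 definition quoted above is included, whereas a real lookup would load the full profile list from the typing database.

```python
# Order of the seven housekeeping loci in the MLST scheme described above.
MLST_LOCI = ("arcC", "aroE", "glpF", "gmk", "pta", "tpi", "yqiL")

# Toy profile table (assumption): only the ST8 definition quoted in the text.
PROFILES = {
    (3, 3, 1, 1, 4, 4, 3): "ST8",
}


def assign_st(alleles):
    """Return the ST for a dict of locus -> allele number, or 'ND'.

    'ND' is returned when a locus is missing or unreadable, or when the
    allele combination is absent from the profile table.
    """
    try:
        key = tuple(int(alleles[locus]) for locus in MLST_LOCI)
    except (KeyError, TypeError, ValueError):
        return "ND"
    return PROFILES.get(key, "ND")


example = {"arcC": 3, "aroE": 3, "glpF": 1, "gmk": 1, "pta": 4, "tpi": 4, "yqiL": 3}
print(assign_st(example))
```

A profile with a missing locus or an unlisted allele combination falls through to ‘ND’, which mirrors the way undetermined STs are reported in the summary table.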
The final column of the database indicates the presence and type of CRISPR-Cas systems in each isolate. Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) arrays and CRISPR-associated genes (cas) function as a dynamic genetic element, forming adaptive immune systems in bacteria. CRISPR-Cas systems are categorized into six primary types (I-VI) and multiple subtypes (A-I, K, U) based on structural and genomic analyses [31, 32]. The following notations are used in this column: ‘NF’ indicates that no CRISPR-Cas system was identified in the genome, ‘UNKN’ shows an incomplete or untypeable system, and other entries specify the identified system type.
Another section of the database contains data on AMR genes detected in the S. aureus genomes analyzed. The first column lists the assembly code, corresponding to the identifier in the typing summary table. The second column specifies the total number of AMR genes identified in each isolate. Subsequent columns indicate the presence of specific AMR genes, with each gene listed in the header row, and provide the sequence similarity between the detected gene and the corresponding allele from the ResFinder database. For clarity, the absence of a particular gene in an isolate is marked with a dot.
An example of the AMR table records is provided in Table 2, where just a few genes are shown.
Table 2.
Exemplary data for the AMR gene table
| Assembly code | NUM_FOUND | aac(3)-IIa | blaZ | … | mecA |
|---|---|---|---|---|---|
| GCA_000307695.1 | 11 | 100.00 | . | … | 99.8 |
| GCA_000262835.1 | 13 | 100.00 | 100.00 | … | . |
It is important to note that the mere presence of a gene associated with resistance to a particular antimicrobial drug does not guarantee phenotypic resistance. For instance, the gene might not be expressed or could be expressed at an insufficient level to confer resistance [33, 34]. The GenBank database does not contain gene expression or phenotypic data for the bacterial genomes analyzed, and thus our database is limited to in silico genome-based AMR gene predictions. Nevertheless, this information is critical for understanding the AMR repertoire and its distribution within the bacterial population under study.
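The table layout described above (assembly code, gene count, then one column per gene with a dot for absence) can be read with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the sample rows are copied from Table 2, and it assumes that every non-dot cell holds a single numeric identity value, which may not hold for every cell of the real files.

```python
import csv
import io


def read_gene_table(handle):
    """Parse a tab-delimited gene table into {assembly: {gene: identity}}.

    Cells holding a dot (gene absent) are skipped. The first two columns
    are assumed to be the assembly code and the total gene count.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    genes = header[2:]
    table = {}
    for row in reader:
        if not row:
            continue
        hits = {}
        for gene, cell in zip(genes, row[2:]):
            cell = cell.strip()
            if cell and cell != ".":
                hits[gene] = float(cell)
        table[row[0]] = hits
    return table


def prevalence(table, gene):
    """Fraction of isolates in the table that carry the given gene."""
    if not table:
        return 0.0
    return sum(gene in hits for hits in table.values()) / len(table)


# Two rows modelled on Table 2 (only three of the gene columns are shown).
SAMPLE = (
    "Assembly code\tNUM_FOUND\taac(3)-IIa\tblaZ\tmecA\n"
    "GCA_000307695.1\t11\t100.00\t.\t99.8\n"
    "GCA_000262835.1\t13\t100.00\t100.00\t.\n"
)

amr = read_gene_table(io.StringIO(SAMPLE))
print(sorted(amr["GCA_000307695.1"]))
print(prevalence(amr, "mecA"))
```

Because the virulence and plasmid replicon tables follow the same structure, the same reader should apply to them under the same assumptions.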
The format of the virulence gene table mirrors that of the AMR gene table, differing only in the set of genes included. Similarly, the plasmid replicon table follows the same structure.
The data provided in the AMR and virulence gene tables are valuable for comparative analyses of pathogenicity within specific groups of isolates. Even genetically and epidemiologically related isolates may exhibit differences in their AMR and virulence gene profiles, making this information essential for selecting appropriate reference sets. Additionally, the co-occurrence of specific plasmid replicons, such as rep16, rep5a, or IncR, alongside AMR or virulence genes, could suggest a plasmid-borne origin of these genes. However, further studies are always required to confirm such hypotheses.
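A first look at such co-occurrence can be taken by comparing the sets of assemblies that carry a replicon and a gene. The following Python sketch is an assumption-laden illustration: the function name and the assembly identifiers are hypothetical placeholders, not records from the database, and overlap counts alone do not show that a gene is plasmid-borne.

```python
def cooccurrence(replicon_carriers, gene_carriers):
    """Summarise the overlap between two collections of assembly codes."""
    replicon_carriers = set(replicon_carriers)
    gene_carriers = set(gene_carriers)
    both = replicon_carriers & gene_carriers
    return {
        "both": len(both),
        "replicon_only": len(replicon_carriers - gene_carriers),
        "gene_only": len(gene_carriers - replicon_carriers),
        # Share of gene carriers that also carry the replicon.
        "gene_with_replicon": len(both) / len(gene_carriers) if gene_carriers else 0.0,
    }


# Hypothetical assembly identifiers used purely as placeholders.
with_replicon = {"asm_A", "asm_B", "asm_C"}
with_gene = {"asm_B", "asm_C", "asm_D"}

result = cooccurrence(with_replicon, with_gene)
print(result)
```

In practice the two sets would be built from the plasmid replicon table and the AMR or virulence gene table, using the assembly code as the shared key.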
The fourth section of the database contains cgMLST profiles for every isolate. Unlike traditional MLST, which assigns profiles based on a limited number of gene alleles, cgMLST analyzes a comprehensive set of conserved genes—typically, those found in over 90% of known isolates belonging to a particular species. Although both methods classify strains by allele combinations, cgMLST offers higher resolution by examining a significantly larger number of loci. For S. aureus, the cgMLST scheme comprises 1,861 loci (in comparison to seven loci in MLST), with allele variants cataloged in a dynamically maintained database accessible at cgmlst.org (https://www.cgmlst.org/ncs/schema/schema/Saureus3315/, accessed on 29 December 2024).
The cgMLST profiles can be utilized for cluster analysis of a selected group of isolates to assess their genomic similarity and identify epidemiologically or clinically significant groups. A threshold of 15 allele differences in cgMLST profiles has been suggested for determining whether two S. aureus isolates belong to the same strain or different strains [35]. However, depending on the specific objectives of the investigation, more or less stringent criteria may be applied. In this database, the cgMLST profiles are presented in a table format. The first column includes the assembly identifier corresponding to the one from previously described tables, while the remaining columns display numbers (or some special symbols described below) that represent the gene variants listed in the header row. Several special designations may appear in these columns:
‘N’: Indicates a novel allele variant that has not yet been recorded in the database.
‘0?’: Indicates a locus that is missing in the assembly, which could be due to poor sample quality or sequencing issues.
‘-’: Signifies that the allele is only partially covered.
‘+’: Represents multiple possible alleles; in this case, the allele with the highest probability is reported.
A part of the cgMLST profiles is presented in Table 3.
Table 3.
A part of exemplary CgMLST profiles for two S. aureus isolates
| Assembly code | SACOL0013 | SACOL0014 | SACOL0015 | SACOL0077 | SACOL0086 | SACOL0088 |
|---|---|---|---|---|---|---|
| GCA_000009005.1 | 8 | 9 | 9+ | N | 9 | 8 |
| GCA_000009585.1 | 4 | 4 | 300- | 4 | 15 | 4 |

Here, the top row shows the names of the genes from the cgMLST set, and the numbers in the other rows represent the alleles of these genes found in the corresponding genomes. For example, allele 8 was found for SACOL0013 in the first isolate, while allele 4 was found in the second. More than one allele was possible for SACOL0015 in GCA_000009005.1, and allele 9 had the highest probability; allele 300 was only partially covered for this gene in the second isolate. In addition, a novel allele not yet uploaded to the cgMLST database was found for SACOL0077 in the first isolate.
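The special designations and the allele-difference comparison can be sketched in Python as follows. This is an illustration under stated assumptions rather than the clustering method of any particular tool: the function names are hypothetical, and loci without an exact call in both isolates are simply skipped, which is only one of several possible conventions for handling uncertain calls.

```python
# Trailing symbols that qualify an allele number, as described in the text.
SUFFIX_FLAGS = {"-": "partial", "+": "multiple"}


def parse_call(cell):
    """Split one cgMLST cell into (allele, flag).

    allele is an int, or None when no known allele number is available.
    flag is 'exact', 'novel', 'missing', 'partial' or 'multiple'.
    """
    cell = cell.strip()
    if cell == "N":
        return None, "novel"
    if cell in ("", "0?"):
        return None, "missing"
    flag = SUFFIX_FLAGS.get(cell[-1], "exact")
    digits = cell[:-1] if flag != "exact" else cell
    return int(digits), flag


def allele_distance(profile_a, profile_b):
    """Return (differences, compared) over loci with exact calls in both."""
    differences = compared = 0
    for call_a, call_b in zip(profile_a, profile_b):
        allele_a, flag_a = parse_call(call_a)
        allele_b, flag_b = parse_call(call_b)
        if flag_a != "exact" or flag_b != "exact":
            continue  # assumption: uncertain loci are ignored
        compared += 1
        if allele_a != allele_b:
            differences += 1
    return differences, compared


# The six loci shown in Table 3.
first = ["8", "9", "9+", "N", "9", "8"]
second = ["4", "4", "300-", "4", "15", "4"]
print(allele_distance(first, second))
```

The suggested threshold of 15 allele differences refers to full profiles of 1,861 loci, so the count obtained from this six-locus excerpt is not meaningful for strain assignment; the second returned value shows how many loci were actually compared.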
General descriptive statistics
This section provides general descriptive statistics for the database. Previous studies have demonstrated that the GenBank set of genomes cannot serve as a representative of the global S. aureus population, as it was heavily skewed towards MDR and other clinically relevant strains from specific regions [13]. Nonetheless, descriptive statistics regarding the distribution of specific sequence types (STs), clonal complexes (CCs), AMR genes, virulence factors, and other characteristics can offer valuable insights for selecting reference sets and making comparative analyses.
For the sake of simplicity, we refer to any GenBank assembly record containing either a complete or partial genome as an “isolate.” It is important to note that some assemblies may represent the same isolate, or certain genomic records may only contain a part of the genome.
The summary of top three dominating characteristics in each feature category is given in Table 4.
Table 4.
Top three representatives dominating the studied features in S. aureus database
| Feature | Top three representatives |
|---|---|
| Clonal complex | CC8, CC5, NA |
| Sequence type | ST8, ST5, ST22 |
| SPA-type | t008, t002, NF |
| AMR genes | blaZ, mecA, ant(9)-Ia |
| Virulence genes | isdI, cap8P, sspC |
| Plasmid replicons | rep16, rep5a, rep7c |
| CRISPR-Cas system | NF, UNKN, III-A |
In total, 82.3% of the isolates belonged to 13 known CCs, and the four most frequent ones (CC8, CC5, CC22 and CC1) accounted for more than half (56.3%) of the genomes. The largest, CC8, comprised 20,415 (20.6%) of the isolates, and the second largest was CC5 (20.3%). CC22 took third place with 8%. The central STs of these CCs were, in turn, the most frequent in the ST category, accounting for 15.2% (ST8), 13.7% (ST5) and 7.7% (ST22), respectively. Interestingly, almost all members of CC22 had ST22, even though this complex includes 70 STs.
The distribution of the main CCs and STs of S. aureus in Genbank is shown in Fig. 1.
Fig. 1.
The distribution of the top clonal complexes (CC (%), panel A) and sequence types (ST (numbers), panel B) for S. aureus genomes from Genbank
The total number of distinct STs available in GenBank for S. aureus was 2173, and the ST could not be determined for 5527 (5.6%) of the isolates. Among the distinct STs, 1242 were represented by a single isolate each, and 242 were assigned to 10 or more isolates each. The top 10 STs accounted for 62% of the genomes. The general distribution of CCs and STs was very similar to that in a previous analysis of 44,012 S. aureus genomes in 2018 [13], but the number of different STs has doubled since then.
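Distributions such as the CC and ST frequencies reported above can be recomputed by counting the values in one column of the summary table. The Python sketch below shows the counting step only; the function name is hypothetical, the three input values are the clonal complex entries of the example rows in Table 1, and the final line merely rechecks the CC8 share from the counts given in the text.

```python
from collections import Counter


def distribution(values):
    """Return (value, count, percent) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, n, 100.0 * n / total) for value, n in counts.most_common()]


# Clonal complex column of the three example rows in Table 1.
clonal_complexes = ["CC5", "NA", "NA"]
for value, n, percent in distribution(clonal_complexes):
    print(f"{value}\t{n}\t{percent:.1f}%")

# Recheck of the CC8 share quoted in the text: 20,415 of 98,950 isolates.
print(round(100 * 20415 / 98950, 1))
```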
CC8 is a globally distributed clone known for its ability to spread within clinical populations [36], which is the reason for its overrepresentation in the genomic databases. CC5 is another global clone, which, together with CC8, includes a large fraction of MRSA strains [19]. The MRSA CC22 is an epidemic clone often causing outbreaks in healthcare settings [37].
Thus, it is evident that the majority of the S. aureus genomes in GenBank represent clinically relevant strains, which was also the case for other species such as A. baumannii [28] and K. pneumoniae [29]. Therefore, the genome distribution is skewed and does not necessarily reflect the global population of S. aureus. Interestingly, more than 80% of S. aureus isolates from GenBank were assigned to known CCs, while this fraction was much lower for its Gram-negative counterpart K. pneumoniae (only 37.3% of the isolates belonged to clonal groups of K. pneumoniae [29]). At the same time, the distribution for A. baumannii was very similar, with 78.5% of the isolates belonging to the international clones of high risk [28].
Another useful typing element of S. aureus is a polymorphic X region of the protein A gene (spa), the repeats in which can be used for surveillance purposes. The most widespread SPA-types in Genbank were t008 (10%), t002 (8.5%) and NF (6.87%), the latter indicating the lack of revealed X-region sequence. Previously t008 and t002 were found to dominate in various regions of the world [38]. In total, 4393 SPA-types were revealed, which makes this marker more precise than ST detection.
The number of AMR genes possessed by the isolates was in a range from 1 to 23 with a median equal to 3. The dominating genes were blaZ (carried by 84% of the isolates), mecA (61%) and ant(9)-Ia (20%). The first one is a gene encoding beta-lactamase known to be present in a significant fraction of S. aureus isolates. This beta-lactamase confers resistance to several β-lactam antibiotics. The second gene, mecA, is a key determinant of methicillin resistance in S. aureus. Recent meta-analysis revealed the average prevalence of the mecA to be about 22% [39], but in clinical settings this rate was significantly higher and reached 62% [40]. This disproportion again points to the skewness of Genbank S. aureus population towards more dangerous clinical strains possessing higher rates of AMR. The third gene provides aminoglycoside resistance and is usually localized on plasmids and transposons [18].
Another interesting point is that the van operon, which provides vancomycin resistance, was detected in only 41 genomes, although its detection rate in clinical isolates was recently reported to be about 1% [41].
The number of virulence genes per isolate in the database varied from 28 to 93, with a median of 77. Among these, the genes from the isd cluster (encoding iron acquisition proteins), the cap cluster (encoding capsular polysaccharide synthesis enzymes) and sspC (encoding the protein responsible for bacterial DNA protection) were found in virtually all isolates, which agrees with previous studies [42]. In total, 142 distinct virulence genes were found, but only 98 of them were present in a significant number (≥ 10) of the isolates.
Among the analyzed isolates, 86,318 (87%) were found to carry at least one plasmid replicon, highlighting the widespread presence of plasmids in the dataset. The number of replicons per isolate varied significantly, ranging from 0 to 16, with a median of 2, indicating that while most isolates possess a small number of replicons, some harbor a much higher count. However, it is worth noting that replicon predictions do not directly equate to the number of physical plasmids for two key reasons. First, multiple replicons can reside on a single plasmid—some plasmids contain more than one replication module, either naturally or through recombination [21]. Second, the computational predictions may introduce redundancy—closely related replicon sequences or variants of the same replicon might be counted separately due to sequence divergence or database overlaps. The most abundant plasmid replicons were rep16 (41%), rep5a (27%) and rep7c (26%). The total number of distinct replicons revealed was 88, or 214 when counting additional replicon variants.
Apparently functional CRISPR-Cas systems were detected in 784 isolates (0.79%), all of which were assigned to type III-A. These findings agree with other recent reports showing a low prevalence of CRISPR-Cas in S. aureus, whereas such systems can be found in 40% of the isolates belonging to other bacterial species [43, 44]. Interestingly, more than half of the isolates possessing CRISPR-Cas systems (413) belonged to CC45. In addition, about 25,000 isolates carried CRISPR-Cas systems that were incomplete or of undeterminable type, but most of these may simply be artifacts of the homology search.
The full data describing the distribution of all the characteristics designated above is presented in Table S2.
Applications and future development
The database is available for academic use under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) 4.0 International License. Updates to the database are planned to be released at least once a year.
This database serves three primary aims to support S. aureus genomic research and epidemiological surveillance:
(1) Statistical Analysis and Epidemiological Insights
The database enables comprehensive statistical evaluation of S. aureus genomes, including: prevalence tracking of AMR genes across different lineages (e.g., STs, CCs, SPA-types); detection of novel correlations, such as variations in virulence/AMR gene content among specific CCs or plasmid replicons; trend analysis, including the distribution of key resistance determinants (e.g., β-lactamases) within defined genetic backgrounds.
Such analyses can identify emerging resistance patterns and inform public health interventions.
(2) Reference Genome Selection for Comparative Genomics
Bacterial genomic studies frequently require comparisons with well-characterized reference strains of the same ST, CC, or similar cgMLST profile. However, public databases (e.g., GenBank) often lack detailed metadata on strain lineages or AMR/virulence gene content. PubMLST permits searches by single ST/CC or AMR gene, but lacks batch-processing capabilities and imposes restrictions on simultaneous genome retrieval. Our database overcomes these limitations by enabling effortless filtering (via Excel or command-line tools) and bulk downloads without restrictive query limits.
This facilitates efficient selection of reference genomes for downstream analyses.
(3) Tracking Epidemic Clones and Defining Novel Clonal Complexes
The database supports global surveillance of epidemic S. aureus clones, aiding in the identification of novel CCs composed of emerging epidemic strains based on shared genomic features.
Instructions for database use-case scenarios, including UNIX/Linux command-line workflows, are given in the ‘how_to_use.txt’ file on the accompanying webpage. The commands provided there can be used to output the typing information and AMR/virulence gene content for the isolates belonging to a given ST or CC, export cgMLST profiles for such isolates, select the isolates carrying a particular AMR gene, count the occurrences of a particular gene, etc.
Excel filtering features can be used in the xlsx versions of the database files as follows. The summary database file table_summary.xlsx can be used to select the isolates having particular ST, CC, SPA-type and the type of CRISPR-Cas system (or the set or any combination of these parameters). The distributions of these parameters (e.g., the number of the isolates per ST/CC) can be calculated and presented on the corresponding graphs. The files with AMR/virulence/plasmid content information can be used to select the subsets of isolates carrying particular gene/factor/replicon, respectively, and to calculate and visualize the statistics.
However, the advanced processing like cgMLST clustering, selecting cgMLST profiles based on other genomic features (ST, AMR gene presence etc.) can only be performed for txt files using custom scripts or the commands provided in ‘how_to_use.txt’.
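As an alternative to spreadsheet filters, the selection step can also be scripted. The Python sketch below is independent of the commands in ‘how_to_use.txt’, which are not reproduced here. The function name is hypothetical, the column headers are assumed to match those shown in Table 1 (the real tab-delimited file may use different names), and the sample rows are the Table 1 examples with the footnote marks removed.

```python
import csv
import io


def select_assemblies(handle, criteria):
    """Return assembly codes of rows matching every column/value pair.

    `criteria` maps column names of the summary table to required values.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["Assembly code"]
        for row in reader
        if all(row.get(column) == value for column, value in criteria.items())
    ]


# Rows modelled on Table 1; header names are an assumption.
SUMMARY = (
    "Assembly code\tClonal complex\tMLST ST\tSPA-formula\tSPA-type\tCRISPR-Cas\n"
    "GCA_000025145.2\tCC5\tST225\tT-M-D-M-G-M-M-K\tt003\tNF\n"
    "GCA_001307235.1\tNA\tST1093\tU-J-G-F-M-B-B-P-B\tt359\tIII-A\n"
    "GCA_001349975.1\tNA\tND\tno enriched sequence\tND\tUNKN\n"
)

cc5 = select_assemblies(io.StringIO(SUMMARY), {"Clonal complex": "CC5"})
typed_crispr = select_assemblies(
    io.StringIO(SUMMARY), {"Clonal complex": "NA", "CRISPR-Cas": "III-A"}
)
print(cc5)
print(typed_crispr)
```

Since the assembly code is shared by all tables, the returned list can then be used to pull the matching rows from the AMR, virulence, plasmid replicon, or cgMLST tables.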
Conclusions
We have developed a comprehensive genomic database for S. aureus, a globally significant bacterial pathogen, by systematically curating and organizing publicly available GenBank records. The database integrates several layers of clinically and epidemiologically relevant information, including cgMLST profiles, the presence of AMR genes and virulence factors, and plasmid replicon distribution patterns.
This resource was designed to address key challenges in contemporary S. aureus research. By providing standardized, precomputed genomic metadata, it facilitates efficient selection of reference isolate sets for comparative genomic analyses and supports large-scale epidemiological investigations. The database will be particularly valuable for researchers working in the emerging field of genomic epidemiology, as it eliminates the need for time- and resource-consuming manual curation of genomic features from primary sequence data.
The inclusion of multiple typing schemes alongside comprehensive AMR and virulence gene profiles enables researchers to rapidly identify relevant isolate collections for specific study objectives. Furthermore, the plasmid replicon data provide important insights into horizontal gene transfer dynamics that contribute to the spread of AMR in this clinically relevant pathogen.
Supplementary Information
Authors' contributions
Conceptualization, A.Sh., A.S., and Y.M.; data curation, A.Sh. and M.Y.; formal analysis, A.Sh., Y.M., and V.A.; funding acquisition, V.A.; investigation, A.Sh., A.S., M.Y., V.K., Y.M., and V.A.; methodology, A.Sh. and Y.M.; project administration, V.A.; resources, A.Sh., A.S., and V.K.; software, A.Sh.; supervision, V.A.; validation, A.Sh., A.S., and M.Y.; visualization, A.Sh. and V.K.; writing – original draft, A.Sh.; writing – review and editing, A.Sh., A.S., Y.M., and V.A. All authors have read and agreed to the published version of the manuscript.
Funding
This work did not receive any external funding.
Data availability
The datasets generated and/or analyzed during the current study are available in the Zenodo repository, [https://doi.org/10.5281/zenodo.14833440](https://doi.org/10.5281/zenodo.14833440). The page can be accessed with any web browser, and all tables can be downloaded from it. The analysis results and descriptive statistics are included in this manuscript and its supplementary information files.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Salam MA, Al-Amin MY, Salam MT, Pawar JS, Akhter N, Rabaan AA, et al. Antimicrobial resistance: a growing serious threat for global public health. Healthcare. 2023. 10.3390/healthcare11131946.
- 2. Denissen J, Reyneke B, Waso-Reyneke M, Havenga B, Barnard T, Khan S, et al. Prevalence of ESKAPE pathogens in the environment: antibiotic resistance status, community-acquired infection and risk to human health. Int J Hyg Environ Health. 2022. 10.1016/j.ijheh.2022.114006.
- 3. Arcari G, Carattoli A. Global spread and evolutionary convergence of multidrug-resistant and hypervirulent Klebsiella pneumoniae high-risk clones. Pathog Glob Health. 2023;117(4):328–41.
- 4. Shelenkov A, Akimkin V, Mikhaylova Y. International clones of high risk of Acinetobacter baumannii-definitions, history, properties and perspectives. Microorganisms. 2023. 10.3390/microorganisms11082115.
- 5. Zurita J, Sevillano G, Solis MB, Paz YMA, Alves BR, Changuan J, et al. Pseudomonas aeruginosa epidemic high-risk clones and their association with multidrug-resistant. J Glob Antimicrob Resist. 2024;38:332–8.
- 6. Del Barrio-Tofino E, Lopez-Causape C, Oliver A. Pseudomonas aeruginosa epidemic high-risk clones and their association with horizontally-acquired beta-lactamases: 2020 update. Int J Antimicrob Agents. 2020;56(6):106196.
- 7. Pennone V, Prieto M, Alvarez-Ordonez A, Cobo-Diaz JF. Antimicrobial resistance genes analysis of publicly available Staphylococcus aureus genomes. Antibiotics. 2022. 10.3390/antibiotics11111632.
- 8. Price V, Ngwira LG, Lewis JM, Baker KS, Peacock SJ, Jauneikaite E, et al. A systematic review of economic evaluations of whole-genome sequencing for the surveillance of bacterial pathogens. Microb Genom. 2023;9(2):mgen000947.
- 9. Adeiza SS, Islam MA, Shittu A. Global, regional, and national burdens: an overlapping meta-analysis on Staphylococcus aureus and its drug-resistant strains. One Health Bull. 2024;4(4):164–80.
- 10. WHO Bacterial Priority Pathogens List. 2024: bacterial pathogens of public health importance to guide research, development and strategies to prevent and control antimicrobial resistance. Geneva: World Health Organization; 2024. https://iris.who.int/bitstream/handle/10665/376776/9789240093461-eng.pdf?sequence=1. Accessed 25 July 2025.
- 11. Jolley KA, Bray JE, Maiden MCJ. Open-access bacterial population genomics: BIGSdb software, the pubmlst.org website and their applications. Wellcome Open Res. 2018;3:124.
- 12. PubMLST S. aureus isolate database 2025. https://pubmlst.org/bigsdb?db=pubmlst_saureus_isolates&page=query&genomes=1. Accessed 10 June 2025.
- 13. Petit RA 3rd, Read TD. Staphylococcus aureus viewed from the perspective of 40,000+ genomes. PeerJ. 2018;6:e5261.
- 14. Mikhaylova Y, Shelenkov A, Chernyshkov A, Tyumentseva M, Saenko S, Egorova A, et al. Whole-genome analysis of Staphylococcus aureus isolates from ready-to-eat food in Russia. Foods. 2022. 10.3390/foods11172574.
- 15. Rodriguez RL, Conrad RE, Viver T, Feistel DJ, Lindner BG, Venter SN, et al. An ANI gap within bacterial species that advances the definitions of intra-species units. mBio. 2024;15(1):e0269623.
- 16. PubMLST S. aureus MLST profile database 2024. Available from: https://pubmlst.org/bigsdb?db=pubmlst_saureus_seqdef
- 17. Enright MC, Day NP, Davies CE, Peacock SJ, Spratt BG. Multilocus sequence typing for characterization of methicillin-resistant and methicillin-susceptible clones of Staphylococcus aureus. J Clin Microbiol. 2000;38(3):1008–15.
- 18. Brdova D, Ruml T, Viktorova J. Mechanism of staphylococcal resistance to clinically relevant antibiotics. Drug Resist Updat. 2024;77:101147.
- 19. Smith JT, Eckhardt EM, Hansel NB, Eliato TR, Martin IW, Andam CP. Genomic epidemiology of methicillin-resistant and -susceptible Staphylococcus aureus from bloodstream infections. BMC Infect Dis. 2021;21(1):589.
- 20. Ibrahim RA, Mekuria Z, Wang SH, Mediavilla JR, Kreiswirth B, Seyoum ET, et al. Clonal diversity of Staphylococcus aureus isolates in clinical specimens from selected health facilities in Ethiopia. BMC Infect Dis. 2023;23(1):399.
- 21. Neyaz L, Rajagopal N, Wells H, Fakhr MK. Molecular characterization of Staphylococcus aureus plasmids associated with strains isolated from various retail meats. Front Microbiol. 2020;11:223.
- 22. Feijao P, Yao HT, Fornika D, Gardy J, Hsiao W, Chauve C, et al. Mentalist - a fast MLST caller for large MLST schemes. Microb Genom. 2018. 10.1099/mgen.0.000146.
- 23. Bortolaia V, Kaas RS, Ruppe E, Roberts MC, Schwarz S, Cattoir V, et al. Resfinder 4.0 for predictions of phenotypes from genotypes. J Antimicrob Chemother. 2020;75(12):3491–500.
- 24. Liu B, Zheng D, Zhou S, Chen L, Yang J. VFDB 2022: a general classification scheme for bacterial virulence factors. Nucleic Acids Res. 2022;50(D1):D912-7.
- 25. Carattoli A, Zankari E, Garcia-Fernandez A, Voldby Larsen M, Lund O, Villa L, et al. In silico detection and typing of plasmids using plasmidfinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother. 2014;58(7):3895–903.
- 26. Couvin D, Bernheim A, Toffano-Nioche C, Touchon M, Michalik J, Neron B, et al. CRISPRCasFinder, an update of crisrfinder, includes a portable version, enhanced performance and integrates search for Cas proteins. Nucleic Acids Res. 2018;46(W1):W246–51.
- 27. Egorova A, Mikhaylova Y, Saenko S, Tyumentseva M, Tyumentsev A, Karbyshev K, et al. Comparative whole-genome analysis of Russian foodborne multidrug-resistant Salmonella infantis isolates. Microorganisms. 2021. 10.3390/microorganisms10010089.
- 28. Shelenkov A, Mikhaylova Y, Akimkin V. Genomic epidemiology dataset for the important nosocomial pathogenic bacterium Acinetobacter baumannii. Data. 2024;9(2):22.
- 29. Shelenkov A, Slavokhotova A, Mikhaylova Y, Akimkin V. Genomic typing, antimicrobial resistance gene, virulence factor and plasmid replicon database for the important pathogenic bacteria Klebsiella pneumoniae. BMC Microbiol. 2025;25(1):3.
- 30. Hallin M, Friedrich AW, Struelens MJ. Spa typing for epidemiological surveillance of Staphylococcus aureus. Methods Mol Biol. 2009;551:189–202.
- 31. Hryhorowicz M, Lipinski D, Zeyland J. Evolution of CRISPR/Cas systems for precise genome editing. Int J Mol Sci. 2023;24(18):14233.
- 32. Makarova KS, Koonin EV. Annotation and classification of CRISPR-Cas systems. Methods Mol Biol. 2015;1311:47–75.
- 33. Shelenkov A, Mikhaylova Y, Yanushevich Y, Samoilov A, Petrova L, Fomina V, et al. Molecular typing, characterization of antimicrobial resistance, virulence profiling and analysis of whole-genome sequence of clinical Klebsiella pneumoniae isolates. Antibiotics. 2020. 10.3390/antibiotics9050261.
- 34. Stasiak M, Mackiw E, Kowalska J, Kucharek K, Postupolski J. Silent genes: antimicrobial resistance and antibiotic production. Pol J Microbiol. 2021;70(4):421–9.
- 35. Schurch AC, Arredondo-Alonso S, Willems RJL, Goering RV. Whole genome sequencing options for bacterial strain typing and epidemiologic analysis based on single nucleotide polymorphism versus gene-by-gene-based approaches. Clin Microbiol Infect. 2018;24(4):350–4.
- 36. Bowers JR, Driebe EM, Albrecht V, McDougal LK, Granade M, Roe CC, et al. Improved subtyping of Staphylococcus aureus clonal complex 8 strains based on whole-genome phylogenetic analysis. mSphere. 2018. 10.1128/mSphere.00464-17.
- 37. Tkadlec J, Le AV, Brajerova M, Soltesova A, Marcisin J, Drevinek P, et al. Epidemiology of methicillin-resistant Staphylococcus aureus in Slovakia, 2020 - emergence of an epidemic USA300 clone in community and hospitals. Microbiol Spectr. 2023;11(4):e0126423.
- 38. Asadollahi P, Farahani NN, Mirzaii M, Khoramrooz SS, van Belkum A, Asadollahi K, et al. Distribution of the most prevalent spa types among clinical isolates of Methicillin-resistant and -susceptible Staphylococcus aureus around the world: a review. Front Microbiol. 2018;9:163.
- 39. Suleiman AS, Bhattacharya P, Islam MA. Global prevalence and dynamics of mecA and mecC genes in MRSA: meta-meta-analysis, meta-regression, and temporal investigation. J Infect Public Health. 2025;18(7):102802.
- 40. Idrees MM, Saeed K, Shahid MA, Akhtar M, Qammar K, Hassan J, et al. Prevalence of mecA- and mecC-associated methicillin-resistant Staphylococcus aureus in clinical specimens, punjab, Pakistan. Biomedicines. 2023. 10.3390/biomedicines11030878.
- 41. Tawfeek CE, Khattab S, Elmaraghy N, Heiba AA, Nageeb WM. Reduced vancomycin susceptibility in Staphylococcus aureus clinical isolates: a spectrum of less investigated uncertainties. BMC Infect Dis. 2024;24(1):1218.
- 42. Alkuraythi DM, Alkhulaifi MM, Binjomah AZ, Alarwi M, Mujallad MI, Alharbi SA, et al. Comparative genomic analysis of antibiotic resistance and virulence genes in Staphylococcus aureus isolates from patients and retail meat. Front Cell Infect Microbiol. 2023;13:1339339.
- 43. Mikkelsen K, Bowring JZ, Ng YK, Svanberg Frisinger F, Maglegaard JK, Li Q, et al. An endogenous Staphylococcus aureus CRISPR-Cas system limits phage proliferation and is efficiently excised from the genome as part of the SCCmec cassette. Microbiol Spectr. 2023;11(4):e0127723.
- 44. Cruz-Lopez EA, Rivera G, Cruz-Hernandez MA, Martinez-Vazquez AV, Castro-Escarpulli G, Flores-Magallon R, et al. Identification and characterization of the crispr/cas system in Staphylococcus aureus strains from diverse sources. Front Microbiol. 2021;12:656996.