DRAMMA: a multifaceted machine learning approach for novel antimicrobial resistance gene detection in metagenomic data

Ella Rannon; Sagi Shaashua; David Burstein

doi:10.1186/s40168-025-02055-4

2025 Mar 7;13:67.

PMCID: PMC11887096 PMID: 40055840

Abstract

Background

Antibiotics are essential for medical procedures, food security, and public health. However, ill-advised usage leads to increased pathogen resistance to antimicrobial substances, posing a threat of fatal infections and limiting the benefits of antibiotics. Therefore, early detection of antimicrobial resistance genes (ARGs), especially in pathogens, is crucial for human health. Most computational methods for ARG detection rely on homology to a predefined gene database and therefore are limited in their ability to discover novel genes.

Results

We introduce DRAMMA, a machine learning method for predicting new ARGs with no sequence similarity to known ARGs or any annotated gene. DRAMMA utilizes various features, including protein properties, genomic context, and evolutionary patterns. The model demonstrated robust predictive performance both in cross-validation and on an external validation set annotated by an empirical ARG database. Analyses of the high-ranking model-generated candidates revealed a significant enrichment of candidates within the Bacteroidetes/Chlorobi and Betaproteobacteria taxonomic groups.

Conclusions

DRAMMA enables rapid ARG identification in global-scale genomic and metagenomic samples, thus holding promise for the discovery of novel ARGs that lack sequence similarity to any known resistance genes. Further, our model could facilitate early detection of specific ARGs, potentially informing the selection of antibiotics administered to patients.

Video Abstract

Supplementary Information

The online version contains supplementary material available at 10.1186/s40168-025-02055-4.

Introduction

Antibiotics, drugs that target bacteria, have drastically reduced the threat of infections and have become essential for many medical procedures such as surgeries, organ transplants, and cancer treatment [1–3]. These drugs are invaluable thanks to their ability to kill or inhibit the bacteria causing an infection without harming the host’s cells [1]. Nevertheless, prolonged overuse of antibiotics has resulted in the emergence and worldwide spread of antibiotic-resistant pathogens, which threatens the continued benefits of antibiotics [4]. This resistance, known as antimicrobial resistance (AMR), allows resistant strains to grow and spread under the strong selective pressure of antibiotics, becoming dominant in their environment [5].

It is estimated that in 2019, 4.95 million deaths globally were associated with drug-resistant infections, with approximately 1.27 million of these deaths directly attributable to antibiotic-resistant bacteria [6]. The annual death toll is projected to reach ten million by 2050 if no solutions are devised to slow the emergence of antibiotic-resistant bacteria [5]. This projection is probably an underestimate, as key medical procedures, such as surgeries and chemotherapy, may become too dangerous to perform if antibiotics lose their effectiveness. Drug resistance has also already had an economic impact: resistance to first-line antibiotic treatments costs the US health system 20 billion USD per year [7], and the global cost of antibiotic resistance is predicted to exceed 100 trillion USD over the next few decades [5].

Most of the antibiotics used today are either compounds discovered during the “golden era” of antibiotic discovery between the 1940s and the 1960s, or their derivatives [8]. Since the 1980s, the rate of antibiotic discovery has fallen drastically, and only a few antibiotics have reached the market in the last two decades [5]. Further, since the 1960s, no new class of broad-spectrum compounds has been discovered [8]. Hence, we currently lack new drugs to battle antibiotic-resistant bacteria [5]. Therefore, other approaches are required to combat antimicrobial resistance, one of which is reliable surveillance. Surveillance data can help improve patient treatment and health, inform health policies, shape responses to health emergencies, provide early warnings of emerging threats, and help identify long-term trends [5]. Anticipating the emergence of drug-resistant bacteria should enable proactive development of next-generation treatment strategies before resistance threats disseminate [1].

Metagenomic data are essential for characterizing the global antibiotic resistome, as many relevant AMR genes evolved in environmental microorganisms [1]. Metagenomic research focusing on environmental samples from soil, sewage, and other sources in urban and rural areas worldwide has highlighted regional differences in the diversity, abundance, and distribution of ARGs, their classes, and their resistance mechanisms [9–12]. These studies also reveal correlations between ARG abundance and socio-economic, health, and environmental factors, as well as associations with mobile genetic elements. Compared to human-derived samples, environmental samples offer significant advantages: they are readily accessible, enable real-time analysis, and are cost-effective, with no ethical constraints [9]. Resistance genes in human pathogens have been shown to share, in some cases, more than 99% nucleotide identity with resistance genes from soil bacteria [13]. Moreover, the synteny of resistance genes with mobile genetic elements suggests that these genes have likely undergone horizontal gene transfer [1]. Therefore, inferring the mobility of a gene across a wide variety of taxonomic groups can help detect antibiotic resistance genes and assess the threat they might pose.

Numerous bioinformatic tools have been developed for ARG annotation in metagenomic datasets, most of which are based on sequence similarity to a predefined gene database [14–25]. The gene repertoire these methods can discover is thus limited to the current, incomplete ARG knowledge base, and they cannot generalize to identify novel ARGs. Recently, a few machine learning models were developed for ARG detection without the need for a predefined database. For example, HMD-ARG [26], a hierarchical deep-learning framework, uses the protein’s raw amino acid sequence to predict multiple ARG properties: whether the gene is an ARG, the antibiotic family to which it confers resistance, its resistance mechanism, whether the ARG is intrinsic or acquired, and, for beta-lactamases, the specific subclass. Another recently developed algorithm, PLM-ARG [27], combines the publicly available pre-trained protein language model ESM-1b [28] with two consecutive XGBoost [29] models for ARG identification and resistance category prediction. An additional deep-learning model, ARGNet [30], takes either short reads or complete genes as input and applies an autoencoder model to identify ARGs and a convolutional neural network (CNN) multiclass classifier to predict ARG categories. However, since these models only use the sequence of the ARG or its product, they cannot take into account biological knowledge beyond the sequence, which can be crucial for ARG detection.

Here, we present DRAMMA, a Random Forest model trained on global-scale metagenomic data. The model was trained on a wide variety of tailored features based on biological knowledge and understanding of ARG characteristics. These features take into account protein biochemical, physical, and structural properties, as well as genomic and evolutionary context. This approach allows us to predict new ARGs that are genuinely unknown and thus present no detectable sequence similarity to any known resistance gene. The model demonstrated strong performance both in cross-validation on the training set and on an independent validation set annotated using an external ARG database based on functional metagenomics experiments. This led to the identification of novel ARG candidates, which were subsequently subjected to rigorous analysis. We anticipate that further investigation of the top candidates identified by DRAMMA, along with the application of our model to newly generated metagenomic datasets, will facilitate the early detection of previously unknown antimicrobial resistance genes. This, in turn, has the potential to significantly advance our understanding of antibiotic resistance and inform strategies to combat this growing global health threat.

Results

Dataset compilation

An extensive dataset of genes from genomic and metagenomic sources was compiled [31]. These data were acquired from various ecosystems, including human and animal microbiomes, groundwater, sewage, marine, and soil (Table 2). Only protein-coding genes in large contigs (≥ 10 kbp) were used. Overall, 492.1 million proteins were retrieved from assemblies of 22,241 metagenomes (Table 2).

Table 2.

Metagenomic assemblies from various ecosystems were downloaded from NCBI and EBI. The protein and base-pair counts for informative contigs (≥ 10 kbp) were retrieved from assemblies of 22,241 metagenomes from different environments

Assembly sample type	Proteins in assemblies	Base-pairs in assemblies
Human microbiome	316,897,729	311.25 Gbp
Unclassified / Other	45,293,055	40.56 Gbp
Plants	420,898	0.39 Gbp
Sediment	6,811,745	6.25 Gbp
Fermentation	3,160,683	2.98 Gbp
Animal microbiome	10,420,960	10.12 Gbp
Marine	47,648,243	45.02 Gbp
Bioreactor	13,704,124	13.06 Gbp
Wastewater	6,527,929	5.74 Gbp
Sludge	13,847,044	13.64 Gbp
Compost	2,510,130	2.32 Gbp
Human/animal microbiome	6,193,314	6.17 Gbp
Soil	12,492,613	11.18 Gbp
Groundwater	2,950,435	2.73 Gbp
Freshwater	3,253,136	2.97 Gbp
Total	492,132,038	474.38 Gbp
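As a quick arithmetic check of Table 2, the per-ecosystem rows sum to the reported totals (492,132,038 proteins and 474.38 Gbp). The Python sketch below, with values copied from the table, reproduces these totals and ranks ecosystems by their share of proteins; it is an illustration only and not part of the DRAMMA pipeline.

```python
# Per-ecosystem counts copied from Table 2: (proteins, base pairs in Gbp).
TABLE_2 = {
    "Human microbiome": (316_897_729, 311.25),
    "Unclassified / Other": (45_293_055, 40.56),
    "Plants": (420_898, 0.39),
    "Sediment": (6_811_745, 6.25),
    "Fermentation": (3_160_683, 2.98),
    "Animal microbiome": (10_420_960, 10.12),
    "Marine": (47_648_243, 45.02),
    "Bioreactor": (13_704_124, 13.06),
    "Wastewater": (6_527_929, 5.74),
    "Sludge": (13_847_044, 13.64),
    "Compost": (2_510_130, 2.32),
    "Human/animal microbiome": (6_193_314, 6.17),
    "Soil": (12_492_613, 11.18),
    "Groundwater": (2_950_435, 2.73),
    "Freshwater": (3_253_136, 2.97),
}

# Column totals; Gbp values are rounded to two decimals as in the table.
total_proteins = sum(proteins for proteins, _ in TABLE_2.values())
total_gbp = round(sum(gbp for _, gbp in TABLE_2.values()), 2)

# Each ecosystem's share of all proteins, sorted from largest to smallest.
shares = sorted(
    ((name, proteins / total_proteins) for name, (proteins, _) in TABLE_2.items()),
    key=lambda item: item[1],
    reverse=True,
)

if __name__ == "__main__":
    print(f"{total_proteins:,} proteins, {total_gbp} Gbp")
    for name, share in shares[:3]:
        print(f"{name}: {share:.1%}")
```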

The known ARGs in the dataset were annotated using DRAMMA-HMM-DB, a database of profile Hidden Markov Models (HMMs, Supplementary Datasets 1–3) that we compiled from several AMR databases (Resfams [22], CARD [23], and HMD-ARG-DB [26]). Positive examples (ARGs) for our classification scheme were genes with high similarity to known ARG families. Negative examples (non-AMR proteins) were randomly sampled from the gene pool to obtain a 1:10 ratio of resistance to non-resistance genes, and duplicate (highly similar) proteins were then removed.
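The labeling scheme above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors’ implementation: `build_training_set` is a hypothetical helper, negatives are drawn uniformly at random from the non-ARG pool at a 1:10 ratio, and the removal of highly similar proteins is simplified to exact-duplicate removal (in practice this step would typically rely on a sequence clustering tool).

```python
import random


def build_training_set(positives, pool, ratio=10, seed=0):
    """Hypothetical sketch: label known ARGs as 1, sample `ratio` negatives
    per positive from `pool` as 0, then drop exact duplicate sequences.
    Exact-match deduplication is a simplified stand-in for removing
    highly similar proteins."""
    rng = random.Random(seed)
    positive_set = set(positives)
    candidates = [seq for seq in pool if seq not in positive_set]
    n_negatives = min(len(candidates), ratio * len(positives))
    negatives = rng.sample(candidates, n_negatives)

    labeled = [(seq, 1) for seq in positives] + [(seq, 0) for seq in negatives]
    seen, deduplicated = set(), []
    for seq, label in labeled:
        if seq not in seen:  # keep only the first occurrence of each sequence
            seen.add(seq)
            deduplicated.append((seq, label))
    return deduplicated
```

Keeping the first occurrence means a positive is never replaced by a negative copy of the same sequence, since positives are listed first.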

Feature extraction

We extracted 512 features for each protein. The features can be divided into four main categories: (1) amino acid properties, such as gene and contig length, physical and chemical attributes of the protein, the proportion of each amino acid in the protein, the proportion of groups of amino acids sharing similar attributes, and averages of amino acid indices that represent different physicochemical and biochemical characteristics of each amino acid [32]; (2) amino acid patterns, including 8-mers of hydrophilic/hydrophobic residues, Helix-Turn-Helix (HTH) domains, DNA-binding domains, and transmembrane domains; (3) horizontal gene transfer (HGT) signals, e.g., the GC content difference between the gene and the contig on which it is encoded, the distance between the DNA k-mer distribution vectors of the gene and its contig, and the distribution of each gene across diverse taxonomic groups; (4) genomic context, including the presence of known ARGs and mobile genetic element genes in the genomic region of the analyzed gene (Fig. 7B, Supplementary Table 1).
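Two of the HGT signals in category (3) can be illustrated with a short sketch. The paper does not specify the k-mer length or the distance metric here, so k = 4 and Euclidean distance between normalized k-mer frequency vectors are assumptions, as are the helper names `gc_content`, `kmer_vector`, and `hgt_signals`.

```python
import math
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_vector(seq, k=4):
    """Relative frequencies of all 4**k DNA k-mers, in a fixed order.
    k-mers containing characters other than A/C/G/T are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in kmers) or 1  # avoid division by zero
    return [counts[kmer] / total for kmer in kmers]


def hgt_signals(gene, contig, k=4):
    """Hypothetical helper returning (GC difference, k-mer distance).
    k=4 and Euclidean distance are illustrative assumptions."""
    gc_difference = gc_content(gene) - gc_content(contig)
    distance = math.dist(kmer_vector(gene, k), kmer_vector(contig, k))
    return gc_difference, distance
```

A gene whose composition departs from that of its host contig yields a large GC difference and k-mer distance, which is the intuition behind using these values as HGT signals.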

Fig. 7 — Dataset compilation pipeline. A Gene annotation and selection. ARGs are collected from different public AMR databases and then filtered and merged to create DRAMMA-HMM-DB, an HMM ARG database used for ARG annotation of the metagenomic data. B Feature extraction. Illustration of the four main feature categories used by our model and examples of the features from each category

Model and feature selection

Following a comparison of several machine learning models (see Hyperparameter optimization in Methods), Random Forest was chosen due to its favorable trade-off between predictive accuracy and computational efficiency. To choose the optimal subset of features for the classification model, we utilized Random Forest’s feature importance, known as impurity-based importance or Gini importance, and selected the best features according to these scores. These scores reflect the reduction in node impurity, weighted by the proportion of samples reaching the node, and averaged across all trees in the ensemble [33]. To choose the optimal number of selected features, we examined how varying the number of features impacted the mean performance (measured as the area under the precision-recall curve, PR-AUC) of the model across a five-fold cross-validation on the development set. The optimal number of features was approximately 30 (Fig. 8).
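The impurity-based importance described above can be sketched for a single split. Each node that splits on a feature contributes its drop in Gini impurity, weighted by the fraction of training samples that reach the node; summing these contributions per feature and averaging over trees yields the importance score. The helpers below are illustrative and hypothetical; in scikit-learn, the corresponding values are exposed as `feature_importances_` on a fitted `RandomForestClassifier`.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def weighted_impurity_decrease(parent, left, right, n_total):
    """Contribution of one split to its feature's importance: the impurity
    drop at the node, weighted by the fraction of all n_total training
    samples that reach it."""
    n = len(parent)
    child_impurity = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - child_impurity)


def top_k_features(importances, k):
    """Names of the k features with the highest importance scores."""
    return sorted(importances, key=importances.get, reverse=True)[:k]
```

Given per-feature scores, `top_k_features(importances, 30)` would mirror the choice of roughly 30 features reported above.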

Fig. 8 — Seeking the optimal number of features to select. Measurement of five-fold cross-validation classification performance (measured as ROC-AUC and PR-AUC) of a Random Forest algorithm trained on different numbers of features. Dark blue is the mean score across folds, light blue is the standard deviation. There is a decrease in the classifier’s performance with models utilizing more than ~ 30 features

The model’s features were selected based on the feature importance values of the Random Forest model trained on the entire training set. Prominent features included information about the presence of the proteins in different taxonomic groups; amino acid composition and patterns (percentage of different amino acids or groups of amino acids, presence of HTH domains, and frequency of hydrophilic-hydrophobic signatures); the proteins’ physical and biochemical properties (gene product size; grand average of hydropathy (GRAVY), where negative values indicate hydrophilicity and positive values indicate hydrophobicity; and molar extinction coefficient); and the presence of ARGs within the gene’s genomic region (see Supplementary Fig. 1A, Supplementary Table 1). We observed a distinct difference between the distributions of these feature values in the AMR and non-AMR populations (Supplementary Fig. 2). This distinction is especially evident for the most important feature, gene product size: ARGs typically consisted of 500–1500 amino acids, making them, on average, larger than non-ARGs. This observation is supported by the SHAP values associated with this feature (Supplementary Fig. 1B), which tended to be negative for smaller values and thus contributed to negative classification. A similar pattern was observed for the taxonomic distribution features.
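For reference, GRAVY is the mean Kyte-Doolittle hydropathy value over the residues of a protein. A minimal sketch follows; skipping non-standard residues is an assumption, and Biopython’s `ProteinAnalysis.gravy()` (in `Bio.SeqUtils.ProtParam`) provides a maintained implementation.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue.
    Residues outside the 20 standard amino acids are skipped (assumption)."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("no standard amino acids in sequence")
    return sum(values) / len(values)
```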

Model performance evaluation

The trained model, which we named DRAMMA for Detection of Resistance to AntiMicrobials using Machine-learning Approaches, was evaluated using the mean ROC-AUC and mean PR-AUC in a five-fold cross-validation process. The results indicate highly accurate classification (Fig. 1A), with a mean ROC-AUC of 0.98 and a mean PR-AUC of 0.857.
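The two evaluation metrics can be computed per fold and then averaged across the five folds. The sketch below uses the rank-based (Mann-Whitney) formulation of ROC-AUC and step-wise average precision as the PR-AUC estimate; whether the authors used exactly this PR-AUC estimator is not stated, and the function names are illustrative. scikit-learn’s `roc_auc_score` and `average_precision_score` are standard alternatives.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outscores a random
    negative (ties count 1/2). O(P*N), suitable for small examples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise PR-AUC: mean precision at the rank of each positive, scanning
    from highest to lowest score. Requires at least one positive; ties are
    broken by input order, a simplification."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos
```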

Fig. 1 — Classification performance of DRAMMA in five-fold cross-validation, measured as ROC-AUC and PR-AUC. A Performance on a genomic and metagenomic dataset split into folds by contig. The dataset comprises 30.9M proteins, 5% of which are AMR genes annotated using HMMs from the DRAMMA-HMM-DB database. B Performance on a genomic dataset split into folds by taxonomic group. The frequency of positive proteins in each fold is noted in brackets

To ensure that the high performance is not the result of data leakage between genes from the same taxonomic groups, which share multiple genomic properties, we re-evaluated DRAMMA’s performance using an NCBI WGS genome dataset divided into five folds according to major taxonomic groups: Actinobacteria, Gammaproteobacteria, Firmicutes, Alphaproteobacteria, and Bacteroidetes. Performance was expected to be lower, since in this evaluation the model was tested on genomes from taxa that were evolutionarily distant from any species in the training set. Indeed, performance was lower but remained high, with a mean ROC-AUC of 0.938 and a mean PR-AUC of 0.668 (Fig. 1B).
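Splitting by taxonomic group, rather than at random, guarantees that no group contributes proteins to both the training and test portions of a split. A minimal sketch, assuming each record carries exactly one group label; the helper names are hypothetical, and scikit-learn’s `LeaveOneGroupOut` implements the same idea.

```python
from collections import defaultdict


def folds_by_group(records, group_of):
    """Place all records sharing a group label into the same fold."""
    folds = defaultdict(list)
    for record in records:
        folds[group_of(record)].append(record)
    return dict(folds)


def leave_one_group_out(folds):
    """Yield (held-out group, training records, test records) per fold, so
    the test group never appears in the training records."""
    for held_out, test in folds.items():
        train = [r for group, recs in folds.items() if group != held_out for r in recs]
        yield held_out, train, test
```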

We further tested DRAMMA’s performance on an external validation set from the Global Sewage Surveillance Project [9], which comprises sewage metagenomic samples collected worldwide (see Dataset compilation in the Methods section). The microbial communities in these samples are expected to include both environmental microbes and microorganisms prevalent in human microbiomes. This dataset was assembled from read data and annotated using two ARG databases: our DRAMMA-HMM-DB and ResFinderFG v2.0, an external database of ARGs obtained experimentally by functional metagenomics. The model was evaluated against each annotation scheme. Its performance on this dataset was high as well, with a mean ROC-AUC of 0.99 and a mean PR-AUC of 0.91 on the DRAMMA-HMM-DB annotation, and a mean ROC-AUC of 0.91 and a mean PR-AUC of 0.59 on the experimental ResFinderFG annotation (Fig. 2). Reassuringly, DRAMMA’s scores tended to be low for proteins annotated as non-ARGs by ResFinderFG (Supplementary Fig. 3).

Fig. 2 — Classification performance of DRAMMA on a validation set of sewage samples, measured as ROC-AUC and PR-AUC. A The model’s performance on the dataset annotated by our ARG HMM database, DRAMMA-HMM-DB. B The model’s performance on the dataset annotated by the experimental ResFinderFG v2.0 database

In addition, we assessed the runtime for feature extraction and model prediction on a 64-CPU machine, using the E. coli K-12 MG1655 genome and a random selection of approximately 100,000 metagenomic proteins from the Global Sewage Surveillance Project [9]. This process took 11.1 min for the 4329 proteins in the E. coli genome and 21.23 min for the 100,532 proteins in the sewage dataset.
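Expressed as throughput, the two reported runtimes correspond to roughly 390 proteins per minute for the E. coli genome and roughly 4,700 proteins per minute for the sewage subset, so in these measurements runtime did not scale linearly with the number of proteins. The arithmetic, using only the figures above:

```python
# Runtimes reported above on a 64-CPU machine: (number of proteins, minutes).
RUNTIMES = {
    "E. coli K-12 MG1655": (4329, 11.1),
    "Sewage subset": (100_532, 21.23),
}


def proteins_per_minute(n_proteins, minutes):
    """Average throughput over a whole run: proteins divided by wall-clock minutes."""
    return n_proteins / minutes


throughput = {name: proteins_per_minute(n, m) for name, (n, m) in RUNTIMES.items()}

if __name__ == "__main__":
    for name, rate in throughput.items():
        print(f"{name}: {rate:,.0f} proteins/min")
```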

Impact of sequence disruption on DRAMMA predictions

Given that DRAMMA uses a range of biological features rather than relying on sequence homology to predefined ARG sequences, we sought to evaluate the impact of the protein sequence on model predictions. To investigate this, we scrambled the protein sequences from the E. coli genome and the ~ 100,000 proteins selected from the Global Sewage Surveillance Project [9] that were used for the runtime evaluation. As expected, after scrambling, none of the proteins in these datasets were labeled as ARGs by our DRAMMA-HMM-DB database (in contrast to 149 and 2780 of the original proteins in the E. coli genome and sewage samples, respectively). Despite DRAMMA’s strong reliance on contextual and content features, its misclassification rate was low: only 0.02% (one of 4329) of the scrambled E. coli proteins and less than 0.1% (96 of 100,532) of the scrambled sewage proteins were classified as ARGs. An analysis of the SHAP values of the scrambled sequences’ features showed that the features contributing to positive classifications are indeed those unaffected by sequence order, namely features capturing amino acid composition irrespective of order, and context-dependent features (see Supplementary Fig. 4).
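Scrambling in this sense can be sketched as a random permutation of residues: amino acid composition is preserved exactly, while order-dependent signals such as motifs, domains, and homology are destroyed. The paper does not detail the scrambling procedure, so a uniform shuffle is an assumption, and `scramble` and `composition` are hypothetical helpers.

```python
import random
from collections import Counter


def scramble(protein, seed=None):
    """Randomly permute the residues of a protein sequence (uniform shuffle)."""
    residues = list(protein)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def composition(protein):
    """Fraction of each amino acid: an order-independent feature that a
    shuffle leaves unchanged."""
    counts = Counter(protein)
    return {aa: n / len(protein) for aa, n in counts.items()}
```

Because `composition(scramble(seq)) == composition(seq)` for any sequence, composition-based features carry over unchanged to scrambled proteins, consistent with the SHAP analysis above.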

Benchmarking

The performance of DRAMMA was compared with that of previous ARG prediction algorithms: (1) Resfams [22], (2) DeepARG [21], (3) ARGNet [30], (4) PLM-ARG [27], (5) CARD October 2020 release [23], and (6) CARD October 2023 release [34]. These two CARD releases were selected because our DRAMMA-HMM-DB database comprised ARGs in the 2020 release, while the October 2023 release was the latest available at the time of benchmarking. The evaluation was conducted on the sewage test set annotated based on the ResFinderFG database, an external ARG database collecting information from functional metagenomic experiments. In our pipeline, regulatory genes, efflux pumps, and resistance conferred via point mutations are not considered positive cases of ARGs, whereas some of the approaches to which we compared DRAMMA do consider these genes positive. This difference might bias the results toward DRAMMA by unjustifiably increasing the false positive rates of the other algorithms. To ensure a fair comparison, proteins that our approach labeled as non-ARGs but that other approaches considered ARGs were excluded from the test set. The performance of each algorithm was assessed by the Matthews correlation coefficient (MCC), true positive rate (TPR), false positive rate (FPR), and macro-averaged precision, recall, and F1, with DRAMMA evaluated at two score thresholds (0.75 and 0.95) corresponding to expected precision values. DRAMMA achieved the best recall (75.1%), whereas CARD strict reached the highest precision (94.4%), which is expected given that it relies on strict sequence comparisons and is thus unlikely to yield many false positives. Notably, CARD strict also had a low recall (50.6%). Our approach, on the other hand, achieved the best balance between precision and recall, as indicated by its F1 and MCC scores (0.78 and 0.567, respectively; see Table 1).

Although CARD loose achieved the highest TPR scores (0.975 and 0.95 for the 2020 and 2023 releases, respectively), it also had a high FPR (0.903 and 0.807 for 2020 and 2023, respectively). This indicates that while it correctly classifies many ARGs, it also frequently misclassifies non-ARGs as ARGs, as reflected in its relatively low precision (0.537 and 0.54 for the 2020 and 2023 releases, respectively). Conversely, CARD strict achieved the lowest FPR (0.00003) but exhibited low TPR (0.013) and recall (0.506).
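The benchmark metrics follow from confusion-matrix counts. The sketch below computes TPR, FPR, MCC, and macro-averaged precision, recall, and F1 over the ARG and non-ARG classes. Treating “macro” as the unweighted mean over these two classes is an assumption, though it is consistent with Table 1, where macro recall equals (TPR + 1 − FPR)/2; for CARD 2020 Loose, (0.975 + 1 − 0.903)/2 = 0.536. The function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """TPR, FPR, MCC, and macro-averaged precision/recall/F1 over the ARG
    (positive) and non-ARG (negative) classes."""
    def safe_div(a, b):
        return a / b if b else 0.0

    tpr = safe_div(tp, tp + fn)
    fpr = safe_div(fp, fp + tn)
    # Per-class precision and recall; the negative class swaps the roles.
    prec_pos, rec_pos = safe_div(tp, tp + fp), tpr
    prec_neg, rec_neg = safe_div(tn, tn + fn), safe_div(tn, tn + fp)
    f1_pos = safe_div(2 * prec_pos * rec_pos, prec_pos + rec_pos)
    f1_neg = safe_div(2 * prec_neg * rec_neg, prec_neg + rec_neg)
    # Matthews correlation coefficient from the full confusion matrix.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, denom)
    return {
        "TPR": tpr,
        "FPR": fpr,
        "MCC": mcc,
        "precision": (prec_pos + prec_neg) / 2,
        "recall": (rec_pos + rec_neg) / 2,
        "F1": (f1_pos + f1_neg) / 2,
    }
```

Unlike macro precision or recall alone, MCC uses all four cells of the confusion matrix, which is why it separates methods that trade precision for recall.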

Table 1.

Benchmarking on a test set comprising sewage metagenomic samples, with ARGs annotated using ResFinderFG v2.0. The number in parentheses corresponds to the expected precision determined by the chosen DRAMMA model score. The high score threshold for DeepARG was selected based on the DeepARG authors’ recommended setting. “Strict” and “Loose” refer to CARD’s RGI (Resistance Gene Identifier) search parameters. TPR true positive rate, FPR false positive rate, MCC Matthews correlation coefficient

Algorithm	TPR	FPR	Precision	Recall	F1	MCC
DRAMMA (0.75)	0.527	0.024	0.82	0.751	0.78	0.567
DRAMMA (0.95)	0.408	0.013	0.853	0.698	0.748	0.529
Resfams	0.418	0.021	0.804	0.698	0.737	0.491
DeepARG High Score (≥ 0.8)	0.027	0.0002	0.913	0.513	0.502	0.147
DeepARG All Scores	0.039	0.0003	0.923	0.519	0.514	0.18
ARGNet	0.056	0.077	0.488	0.49	0.488	-0.023
PLM-ARG	0.335	0.012	0.837	0.662	0.711	0.467
CARD 2020 Strict	0.013	0.00003	0.944	0.506	0.489	0.106
CARD 2020 Loose	0.975	0.903	0.537	0.536	0.177	0.073
CARD 2023 Strict	0.028	0.0002	0.915	0.514	0.503	0.151
CARD 2023 Loose	0.95	0.807	0.54	0.571	0.256	0.107

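Best score for each performance measure: TPR, CARD 2020 Loose (0.975); FPR and precision, CARD 2020 Strict (0.00003 and 0.944); recall, F1, and MCC, DRAMMA (0.75) (0.751, 0.78, and 0.567).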

Candidate analysis

To identify genuinely novel ARGs, we first used the DRAMMA model trained on the entire metagenomic training set to classify approximately 650 million proteins from genomic and metagenomic data. We focused on the top-ranking 18.1 million proteins, which received scores corresponding to a precision of > 95% on the training set. We then categorized the candidates by source, differentiating between those originating from genomic and metagenomic samples, and assessed the distribution of novel candidates (high-scoring genes annotated as non-ARGs) across taxonomic groups and environments (Fig. 3). The most prevalent taxonomic groups among predicted ARGs were Gammaproteobacteria (20.66%), Firmicutes (18.46%), Bacteroidetes/Chlorobi group (14.45%), Alphaproteobacteria (13.35%), and Actinobacteria (12.14%). Among the metagenomes, predicted ARGs were detected primarily in samples originating from the human microbiome, which accounted for 73.02% of the metagenomic ARG candidates.

Fig. 3 — Distribution of novel ARG candidates. A Distribution of novel candidates from genomic samples across different taxonomic groups. B Distribution of novel candidates from metagenomic samples across different ecosystems

The genomic databases are highly biased toward specific bacteria (mostly pathogens and model organisms) and ecosystems (human-associated bacteria). Therefore, to properly assess enrichment, we normalized the number of novel candidates in the different groups according to the total number of non-ARG proteins within the same taxonomic groups and environments: $\mathrm{enrichment}(g) = \log_{2} \frac{\text{percentage of novel candidates from group } g}{\text{percentage of non-ARGs from group } g}$ (Fig. 4 and Supplementary Fig. 5). Our observations revealed that only the Bacteroidetes/Chlorobi and Betaproteobacteria groups displayed a notable enrichment of ARG candidates (with enrichment scores of 0.947 and 0.535, respectively), whereas the Firmicutes, Alphaproteobacteria, and Bacteria Candidate Phyla groups exhibited a more modest enrichment (with enrichment scores of 0.221, 0.199, and 0.121, respectively). In contrast, Stenosarchaea and Euryarchaeota exhibited notable depletions (with an enrichment score of −1.237 for both groups). In addition, within metagenomic samples, ARG candidates were highly enriched in human and animal microbiomes and, to a lesser extent, in plant-associated bacteria.
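The enrichment score can be computed directly from per-group counts. In the sketch below the counts are hypothetical; the function assumes every group has at least one novel candidate and one non-ARG protein (otherwise the logarithm is undefined and a pseudocount would be needed). Because the score is a base-2 logarithm, the reported 0.947 for Bacteroidetes/Chlorobi corresponds to about 1.9-fold over-representation, and −1.237 to about 2.4-fold depletion.

```python
import math


def enrichment(novel_counts, non_arg_counts):
    """log2 of each group's share of novel candidates divided by its share of
    non-ARG proteins; positive values indicate enrichment, negative depletion.
    Assumes nonzero counts for every group in both dictionaries."""
    total_novel = sum(novel_counts.values())
    total_non_arg = sum(non_arg_counts.values())
    scores = {}
    for group, n_novel in novel_counts.items():
        pct_novel = 100 * n_novel / total_novel
        pct_non_arg = 100 * non_arg_counts[group] / total_non_arg
        scores[group] = math.log2(pct_novel / pct_non_arg)
    return scores


if __name__ == "__main__":
    # Hypothetical counts for two groups, for illustration only.
    print(enrichment({"group_A": 50, "group_B": 50}, {"group_A": 25, "group_B": 75}))
```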

Fig. 4 — Enrichment analysis of novel candidates across different groups. The enrichment of high-ranking candidates for each group $g$ is calculated as $\mathrm{enrichment}(g) = \log_{2} \frac{\text{percentage of novel candidates from group } g}{\text{percentage of non-ARGs from group } g}$. A Enrichment of novel candidates from genomic samples in different taxonomic groups. B Enrichment of novel candidates from metagenomic samples from the ecosystem analyzed

We also investigated the distributions of the antibiotic drugs to which the candidates are predicted to confer resistance and of their predicted resistance mechanisms across the various taxonomic groups and environments (Fig. 5, Supplementary Fig. 6). Beta-lactam antibiotics were the most common drugs to which the candidates were predicted to confer resistance, and the most frequent predicted resistance mechanisms were target alteration and antibiotic inactivation. Notably, a similar distribution of resistance mechanisms and antibiotic drugs was predicted for the ARGs across the different taxonomic groups and environments. Betaproteobacteria is the only taxonomic group to exhibit a significant percentage of the “reduced permeability to antibiotics” mechanism. Finally, we examined the novel candidates for enrichment of known Pfam domains [35]; however, the vast majority of these proteins (99.6%) did not contain a well-characterized domain. Furthermore, none of the few domains found was prevalent among the novel candidates (Supplementary Dataset 4).

Fig. 5 — The distributions of antibiotic drugs the high-ranking ARG candidates confer resistance to. The distribution of antibiotic drugs to which high-ranking ARG candidates confer resistance, across different taxonomic groups (A), and ecosystems (B). The number at the top of each bar indicates the total count of ARG candidates in the respective group

Candidate selection

We aimed to highlight ARG candidates of most interest among the analyzed genes, focusing on genes that could not have been identified using traditional sequence-based approaches. We thus identified predictions with no annotation across several databases, as well as top-ranking predictions with partial annotations or annotations not indicative of resistance. To that end, we first removed all the known ARGs from the list of model candidates, resulting in 1.28 million proteins. We then performed several other annotation steps, in which the ARG candidates were compared to different databases, namely the Kyoto Encyclopedia of Genes and Genomes (KEGG) [36], NCBI’s non-redundant protein database [31], PDB [37], and Pfam [35] (Fig. 6).

Fig. 6 — Description of the annotation and filtration pipeline. The number of proteins that had no annotation in each step and were thus passed to downstream annotation is indicated below each step

In each step, the candidates were assigned to one of the following categories: (1) Known ARGs, (2) ARG-related, (3) A possible target of antibiotics, (4) Annotated, but not known to be associated with AMR, (5) Unknown function (Supplementary Fig. 7). Following each step, only the proteins of unknown function were used as input for the next annotation step, to characterize them as much as possible. Since increasingly sensitive searches were applied, the number of proteins with unknown function decreased with each step of the pipeline. However, the percentage of each step’s input that remained unknown increased along the pipeline, reaching 55.9% (213 unannotated proteins) in the final step (HH-suite remote homology search [38]).
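The cascading logic can be sketched as a loop in which only proteins left unannotated by one step are passed to the next. The annotators in the usage example are hypothetical placeholders, not calls to KEGG, NCBI NR, PDB, Pfam, or HH-suite; reporting the unknown percentage relative to each step’s input is an assumption about how the 55.9% figure is defined.

```python
def cascade(proteins, steps):
    """Apply annotation steps in order, passing only still-unannotated proteins
    downstream. Each step is (name, annotate), where annotate(protein) returns
    an annotation or None. Returns the final unknowns and, per step, the number
    of input proteins, the number left unknown, and the unknown percentage."""
    unknown = list(proteins)
    stats = []
    for name, annotate in steps:
        remaining = [p for p in unknown if annotate(p) is None]
        pct_unknown = 100 * len(remaining) / len(unknown) if unknown else 0.0
        stats.append((name, len(unknown), len(remaining), pct_unknown))
        unknown = remaining
    return unknown, stats


if __name__ == "__main__":
    # Hypothetical placeholder annotators standing in for real database searches.
    toy_steps = [
        ("step_1", lambda p: "hit" if p == "p1" else None),
        ("step_2", lambda p: "hit" if p == "p2" else None),
    ]
    final_unknown, per_step = cascade(["p1", "p2", "p3", "p4"], toy_steps)
    print(final_unknown, per_step)
```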

Our candidates of interest thus included the 213 predicted ARGs that remained completely unannotated, without even remote homology to characterized proteins. In addition, we included in our pool of potential ARGs of interest candidates with some annotations: (1) proteins annotated only by the HH-suite remote homology search, with no annotation from the BLAST search against NCBI NR or the HMM search against KEGG; (2) the 100 top-scoring predictions that had a BLAST annotation but no KEGG annotation; and (3) the top 200 candidates that had a KEGG annotation. This resulted in a total of 681 novel ARG candidates.

Finally, we sought to pinpoint the top candidates that would be the most straightforward to test experimentally. First, we removed candidates with Helix-Turn-Helix (HTH) domains, which are indicative of regulatory function. We then assigned taxonomy to the remaining candidates either by comparing all the genes on the relevant contigs to organisms with known taxonomy using an MMseqs2 taxonomy search against UniRef100 [39] or by extracting the taxonomy of the DIAMOND hit with the lowest e-value against UniRef100 [39]. To focus on genes that are likely to be relevant for screening in E. coli, we filtered out genes originating from Gram-positive bacteria. To facilitate experimental testing of top-ranking candidates, we aimed to pinpoint “standalone” ARGs, i.e., genes that confer resistance by themselves. We therefore examined the neighboring genes of each candidate and filtered out candidates that consistently appeared with the same neighboring genes, assuming that such genes may not function properly without their neighbors. Using an additional machine learning classifier that we developed (see Supplementary Fig. 8), we also predicted the most likely resistance mechanisms of the potential ARGs. Subsequently, we used AlphaFold3 [40] to predict the structures of the candidate proteins and searched AlphaFold’s protein structure database [41, 42] to obtain potential functional annotations. As a last step, we tested for similarity among the predictions to avoid testing two relatively similar proteins. The top candidates after filtering, together with their structural, syntenic, and taxonomic information, are detailed in Supplementary Table 2. We anticipate that this rich source of information on potential ARGs will contribute to a better understanding of antimicrobial resistance mechanisms and their dissemination.

Discussion

The emergence and worldwide spread of antibiotic-resistant pathogens is a rising threat to public health [4]. The detection of ARGs plays a pivotal role in enhancing patient well-being, issuing early alerts about emerging threats, and informing administration policies [5]. This study introduced a novel machine learning approach designed to identify previously unknown antibiotic resistance genes (ARGs) within genomic and metagenomic data using a wide array of biological features. Most existing bioinformatic tools for ARG prediction rely solely on sequence similarity to a predefined ARG database [14–25, 43–48], which restricts detection to genes that are sufficiently similar to those in the database. By utilizing prior knowledge and biological properties characterizing ARGs, our method demonstrates the potential to discover genuinely novel ARGs exhibiting no sequence similarity to any known resistance gene.

DRAMMA was trained on an extensive dataset encompassing a wide array of genes originating from diverse environments and organisms. The ARG database compiled in this research spans a diverse spectrum of ARG families. Together, these allow DRAMMA to generalize robustly, effectively detecting ARGs in different environments, including genes conferring resistance to a variety of antibiotics through different resistance mechanisms. Our analysis of the ARG candidates identified by the model supports this capability, revealing a consistent distribution of resistance mechanisms, and of the antibiotics to which these genes confer resistance, across diverse taxonomic groups and ecosystems (Fig. 3). This implies that the model identifies ARGs in well-studied as well as less-explored environments and taxonomic groups. The analysis also highlighted beta-lactams as the most common antibiotic class to which the candidates were predicted to confer resistance, in line with beta-lactams being one of the most prescribed antibiotic classes [49].

The machine learning algorithm presented in this work demonstrated promising performance both in cross-validation and on an independent dataset. On the training set, DRAMMA achieved mean ROC-AUC scores of 0.98 and 0.938 and mean PR-AUC scores of 0.857 and 0.668 in the metagenomic five-fold cross-validation and the taxonomic five-fold cross-validation, respectively. The consistent performance across the folds of the metagenomic cross-validation can be attributed to the large size of the training data. The lower performance on the taxonomic folds reflects the inherent difficulty of this task: unlike the metagenomic folds, in which test organisms are expected to resemble those encountered during training, the taxonomic folds test the model on organisms that are evolutionarily distant from those seen in training. Despite this challenge, the model’s performance on this task remains relatively high, especially when its PR-AUC is compared with that expected from a random classifier (approximately 0.06, the fraction of positive samples). Performance also varied across folds, with the Actinobacteria and Firmicutes folds showing the lowest performance. This is probably because these two taxonomic groups predominantly comprise Gram-positive bacteria, making them highly dissimilar from the organisms in the other folds. DRAMMA also performed well on the external validation sewage samples. When assessed with labels retrieved from DRAMMA-HMM-DB, the database used to label the training set, the model achieved a ROC-AUC score of 0.99 and a PR-AUC score of 0.91. When the same data were labeled using ResFinderFG, a database of ARGs identified by functional metagenomic experiments, the model achieved a ROC-AUC score of 0.91 and a PR-AUC score of 0.59. The latter PR-AUC score, although lower than the former, remains far higher than would be expected from a random classifier, as the frequency of positive genes in this case is only 0.0083. Furthermore, non-ARG proteins received considerably lower scores than ARGs (Supplementary Fig. 3). Among the metagenomic samples, only the human, animal, and plant microbiomes were enriched in ARG candidates (Fig. 4B). The enrichment in the plant microbiome was modest and could be attributed to plants being exposed to antibiotics through waste disposal in soil and groundwater, thereby creating selective pressure for the emergence of ARGs. The considerable enrichment observed in the human and animal microbiomes can be explained by their continuous exposure to antimicrobial substances, which exerts significant selective pressure, driving the emergence of resistance genes that remain undiscovered even though these environments are well studied. Among the taxonomic groups, Bacteroidetes/Chlorobi, Betaproteobacteria, Firmicutes, Alphaproteobacteria, and Bacteria Candidate Phyla exhibited an enrichment of ARG candidates (Fig. 4A). This enrichment can be attributed to the prevalence of Bacteroidetes/Chlorobi, Proteobacteria, and Firmicutes in human and animal intestinal microbiomes [50, 51]. These groups are thus exposed to antimicrobial substances, which could result in selective pressure for the emergence of ARGs. The enrichment observed in Bacteria Candidate Phyla, which consists of uncultured bacteria [52], suggests that this group may serve as a reservoir of undiscovered ARGs. In contrast, Stenosarchaea and Euryarchaeota exhibited markedly negative enrichment values, which may suggest that the model is less adept at predicting ARGs originating from Archaea. Archaea are naturally resistant to particular classes of antibiotics, including those that target the synthesis or cross-linking of the peptide subunit of murein, while being sensitive to other types of antibiotics [53–55]. The depletion could be attributed to the limited investigation of ARGs in Archaea, to differences in their biophysical properties, or to their evolutionary distance from most well-characterized ARGs.
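The random-classifier baselines quoted in the performance comparison can be checked numerically. A classifier that scores genes at random has an expected ROC-AUC of 0.5 and an expected PR-AUC close to the fraction of positives; for scale, a PR-AUC of 0.668 against a baseline of about 0.06 is roughly 11-fold higher, and 0.59 against 0.0083 is roughly 71-fold higher. The sketch below is a pure-Python illustration on simulated data. It approximates PR-AUC by average precision, which may differ slightly from the exact PR-AUC computation used for DRAMMA, and the function names are our own.

```python
import random


def roc_auc(labels, scores):
    """ROC-AUC via the Mann-Whitney rank-sum statistic (ties get average ranks)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(labels, scores):
    """Average precision: mean of precision@k over the ranks k of the positives."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / k
    return total / sum(labels)


if __name__ == "__main__":
    rng = random.Random(0)
    n, prevalence = 100_000, 0.06  # ~6% positives, as in the training set
    labels = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    scores = [rng.random() for _ in range(n)]  # a classifier that ignores the data
    print(f"random ROC-AUC ~ {roc_auc(labels, scores):.3f}")
    print(f"random AP ~ {average_precision(labels, scores):.3f} "
          f"(positive fraction {sum(labels) / n:.3f})")
```

Running it should print a ROC-AUC near 0.5 and an average precision near the simulated positive fraction, which is why PR-AUC, unlike ROC-AUC, must be read against the class balance of each evaluation set.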

DRAMMA’s predictions were primarily driven by patterns in the taxonomic distribution of genes. In general, ARGs tended to have highly similar homologs across a wide range of taxonomic groups and had more significant hits within specific taxonomic groups, such as Firmicutes and Bacteroidetes/Chlorobi (Supplementary Fig. 2). These findings align with the notion that ARGs confer an adaptive advantage and thus tend to disseminate to other organisms through HGT.
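To make this kind of signal concrete, the following sketch summarizes one gene’s homolog hits by the number of distinct taxonomic groups that contain a highly similar homolog. It is an illustration only: the input format, the 90% identity cutoff, and the function names are hypothetical and are not the definitions of DRAMMA’s phylogenetic distribution features.

```python
from collections import defaultdict


def best_identity_per_group(hits):
    """Map each taxonomic group to the highest percent identity among its hits.

    `hits` is a hypothetical list of (taxonomic_group, percent_identity) pairs,
    e.g., parsed from DIAMOND or MMseqs2 results for a single query protein.
    """
    best = defaultdict(float)
    for group, pid in hits:
        best[group] = max(best[group], pid)
    return dict(best)


def taxonomic_spread(hits, min_identity=90.0):
    """Count groups containing at least one homolog at >= min_identity (assumed cutoff)."""
    return sum(1 for pid in best_identity_per_group(hits).values() if pid >= min_identity)


if __name__ == "__main__":
    hits = [
        ("Firmicutes", 98.0),
        ("Firmicutes", 95.0),
        ("Bacteroidetes/Chlorobi", 92.5),
        ("Actinobacteria", 60.0),
    ]
    print(best_identity_per_group(hits))
    print(taxonomic_spread(hits))  # 2
```

Under this toy definition, a gene spread by horizontal transfer would tend to score high, because near-identical copies occur in distantly related groups, whereas a vertically inherited gene would typically have high-identity homologs only within its own group.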

This study is subject to several limitations. First, inherent biases within the training data can affect the model’s ability to generalize to new examples of resistance gene families, especially if resistance genes in underrepresented taxa or ecosystems have unique patterns in terms of the features we measured. Second, the ARGs were labeled using HMM profiles of resistance genes, i.e., statistical models, rather than through manual curation. Despite the stringent e-value threshold we used to label ARGs, these profiles may generate false positives, which, when integrated into model training, may reduce its overall accuracy. Notably, applying the HMM profiles used in this study to label ARGs led to the discovery of a novel ARG in Nocardia, which was experimentally validated [56]. Consequently, this approach not only enables model training on established ARGs but also allows potentially novel ARGs to be identified and incorporated into training.

By design, DRAMMA is not trained to identify ARG-related genes: efflux pumps and genes that confer resistance through point mutations are not detected by our approach, as our focus is on discovering novel resistance genes. Given the critical role of these mechanisms in resistance, future work could benefit from integrating DRAMMA with models that specifically address these types of resistance genes. Additionally, this study only included contigs longer than 10 kbp, because the model relies on genomic context features derived from neighboring genes. On shorter contigs, these features would be incomplete or inaccurate for the encoded proteins, obscuring potential signals. Consequently, although shorter contigs can also encode ARGs, our methodology is less suited to their analysis. The current features designed to detect HGT signals rely on profile HMM searches of neighboring genes against mobility-related genes, such as plasmid and phage genes, from the Pfam database. However, alternative databases or genome-based mobile genetic element (MGE) detection tools, such as VIBRANT [57] and geNomad [58], could be employed in future studies to improve the detection of HGT signals. Finally, this study employed a range of bioinformatics tools, each with inherent limitations that could influence downstream analyses. Sequence similarity tools such as DIAMOND and MMseqs2 rely on local alignment, which might yield matches based on irrelevant regions of the proteins. While these tools are significantly faster than BLAST, they are slightly less sensitive, which could affect the accuracy of the phylogenetic distribution features and potentially result in missed homology or erroneous signals. This limitation may also affect the candidate filtration process, which depends on sequence similarity to previously annotated proteins. Additionally, CD-HIT, the clustering tool used to de-duplicate the training data, employs a greedy incremental clustering algorithm that may generate suboptimal clusters. The clustering results can also depend on the order of the input sequences, potentially leaving a partially redundant training set. Such issues could affect both model training and evaluation.
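The order dependence of greedy incremental clustering can be illustrated with a toy example. The sketch below mimics the general CD-HIT scheme (sort sequences by decreasing length, then assign each sequence to the first representative it matches at or above an identity threshold, or make it a new representative), but it replaces CD-HIT’s alignment and short-word filtering with a simple per-position identity on equal-length strings. The sequences and the threshold are made up for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions; a toy stand-in for alignment identity."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def greedy_cluster(seqs, threshold):
    """CD-HIT-style greedy incremental clustering (simplified).

    Sequences are processed longest first; Python's sort is stable, so among
    equal-length sequences the input order decides who becomes a representative.
    """
    reps = []
    clusters = {}
    for s in sorted(seqs, key=len, reverse=True):
        for r in reps:
            if identity(s, r) >= threshold:
                clusters[r].append(s)  # join the first matching representative
                break
        else:
            reps.append(s)  # no representative matched: start a new cluster
            clusters[s] = [s]
    return clusters


if __name__ == "__main__":
    # AAAA~AAAT and AAAT~AATT at 75% identity, but AAAA vs AATT is only 50%.
    print(greedy_cluster(["AAAA", "AAAT", "AATT"], 0.75))  # 2 clusters
    print(greedy_cluster(["AAAT", "AAAA", "AATT"], 0.75))  # 1 cluster
```

The same three sequences yield two clusters or one depending only on which sequence is seen first, because similarity at a fixed threshold is not transitive; in a de-duplication setting, this is how near-duplicates can end up in different clusters and remain in the training set.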

As part of DRAMMA, we developed a pipeline to compute novel biologically based features. Beyond their contribution to predicting novel antimicrobial resistance genes, these features can be computed across diverse metagenomic samples and used to train machine learning models addressing a variety of biological questions, thereby providing new insights into, and characterizations of, various genes.

In conclusion, DRAMMA enables rapid identification of both known and novel ARGs in large-scale genomic and metagenomic samples. By detecting sequences with low or no similarity to known ARGs, DRAMMA extends beyond the limitations of traditional sequence-based approaches. The model has the potential to expand current ARG knowledge to include genes with lower sequence similarity to those in existing databases. Additionally, DRAMMA could enable early detection of novel ARGs, or of genes associated with distinct resistance mechanisms, before they emerge widely in clinical settings. This could provide a crucial window for preventive intervention, allowing healthcare systems to implement modified treatment protocols before these resistance genes become widely distributed in bacterial populations. In the long term, our approach could contribute to better and more effective use of existing and newly developed antimicrobial treatments.