PLOS Computational Biology. 2025 Sep 11;21(9):e1013470. doi: 10.1371/journal.pcbi.1013470

kMermaid: Ultrafast metagenomic read assignment to protein clusters by hashing of amino acid k-mer frequencies

Anastasia Lucas 1,2, Daniel E Schäffer 2,3, Jayamanna Wickramasinghe 4, Noam Auslander 2,5,*
Editor: Sarath Chandra Janga 6
PMCID: PMC12507277  PMID: 40934284

Abstract

Shotgun metagenomic sequencing can determine both the taxonomic and functional content of microbiomes. However, functional classification for metagenomic reads remains highly challenging as protein mapping tools require substantial computational resources and yield ambiguous classifications when short reads map to homologous proteins originating from different bacteria. Here we introduce kMermaid for the purpose of uniquely mapping bacterial short reads to taxa-agnostic clusters of homologous proteins, which can then be used for downstream analysis tasks such as read quantification and pathway or global functional analysis. Using a nested hash map containing amino acid k-mer profiles as a model for protein assignment, kMermaid achieves the sensitivity of popular existing protein mapping tools while remaining highly resource efficient. We evaluate kMermaid on simulated data and data from human fecal samples as well as demonstrate the utility of kMermaid for classifying reads originating from new, unseen proteins. kMermaid allows for highly accurate, unambiguous and ultrafast metagenomic read assignment into protein clusters, with a fixed memory usage, and can easily be employed on a typical computer.

Author summary

Whole-genome shotgun sequencing has allowed for the collection of a wealth of metagenomic data. Evidence that microbiomes play key roles in human health and disease is growing, but approaches for studying functional metagenomic content are still limited. Current protein mapping approaches do not allow for direct quantification of protein coding potential because short reads commonly map to similar proteins in different bacteria. Mapping metagenomic sequencing reads to proteins in such a way that a microbiome’s coding potential can be quantified is a key first step to pinpointing specific functional mechanisms or associations of disease. Here, we present a framework to first group similar proteins together, then uniquely map reads directly to these homologous protein groups. Our results show that by using k-mer frequencies stored in a two-layer hash map, we can sensitively classify metagenomic reads from high-depth sequencing data in only a few hours. We present our protein mapping method in an easy-to-use, resource efficient Python package, kMermaid. kMermaid results can be directly quantified which in turn will enable linkage of microbiome amino acid content to numerous health and disease phenotypes.

Introduction

The gut microbiome has recently emerged as a new frontier for non-invasive biomarker discovery and new therapeutic intervention. As the field of metagenomics has matured in terms of popularity and technical advancement [1,2], there has been increasing recognition of the importance of functional analysis of microbiomes [3]. Whole-genome shotgun sequencing has allowed for the collection of vast amounts of metagenomic data which can be used to gain insights about both the taxonomic and functional composition of microbiomes. Functional profiling and quantitative comparisons of microbial proteins have immense potential to reveal microbe-microbe and host-microbe interactions, establish new microbial biomarkers, and provide predictions based on microbiomes [4,5]. However, the quantitative analyses required for these tasks are contingent on the functional classification of shotgun metagenomic reads as a preprocessing step, which remains a computational challenge.

Functional read classification is a broad notion that can encompass mapping reads directly to proteins or mapping to higher level functional classes, such as ortholog protein groups or pathways. Several methods and pipelines, such as eggNOG-mapper [6], PANNZER2 [7], BlastKOALA [8], HUMAnN3 [9], and fmh-funprofiler [10], have recently been developed to perform these higher-level classifications. Such tools use a variety of computational algorithms, including alignment and sketching techniques, to report mappings to functional reference databases, such as KEGG Ortholog or eggNOG [11]. In contrast, direct protein mapping is often performed using alignment-based methods, which rely on homology between a metagenomic sequence and microbial proteins in reference databases [5]. BLASTX [12] remains the gold standard for sensitivity despite being infeasibly slow for typical metagenomic experiments producing tens of millions of reads [13]. DIAMOND [14,15] was developed in part to address computational challenges associated with BLASTX and allows for ultrafast read-to-protein alignments, making it one of the most widely used metagenomic protein mapping tools. Another popular method, MMseqs2 [16], was primarily developed to cluster metagenomic nucleotide and protein sequences, but also has translated protein search capabilities. While ultra-fast, neither DIAMOND nor MMseqs2 addresses BLASTX’s other challenge of multimapping, i.e., when a single read aligns to more than one protein. Multimapping often occurs due to homologous proteins or domains originating in different taxa, and can be problematic for downstream read counting and subsequent analysis [17-19]. Mapping to higher-level functional classes can resolve multimapping, but not without the loss of the granular information protein annotation provides. Therefore, there is a need for resource-efficient methods that can provide unique read-to-protein level maps.

Methods for efficient and sensitive taxonomic classification of metagenomic sequences have addressed similar challenges in computational efficiency of taxonomic assignment by using k-mer based approaches, which are faster than alignments [20]. Most notably, Kraken [21] introduced accurate and highly efficient taxonomic classification by mapping k-mers to lowest common ancestors. This approach was later provided with improved resource usage through Kraken2 [22] and improved precision through KrakenUniq [23]. Other high speed taxonomic classification methods are CLARK [24], another k-mer based approach, as well as Centrifuge [25] and Kaiju [26], which are based on FM-indexing. Importantly, Kaiju demonstrated that the use of protein-level sequence comparisons substantially improves taxonomic classification. k-mer based approaches could offer similar advancements for protein mapping; however, such methods are currently lacking.

Here, we introduce kMermaid, a new method for ultrafast and resource-efficient protein mapping of metagenomic reads (Fig 1, S1 Methods). kMermaid uniquely maps query nucleotide sequences into taxa-agnostic clusters of highly homologous proteins using a precomputed k-mer frequency model (S1 Fig, Methods). The underlying rationale for kMermaid is that proteins with high sequence homology have similar biological functions and thus should be grouped together for downstream analysis. To this end, mapping the sequences to clusters representing homologous groups of proteins irrespective of taxa addresses issues along both computational and biological axes. The read-to-cluster approach resolves the problem of alignment ambiguity, referred to here as multi-mapping, i.e., when a single read similarly aligns to multiple proteins. The resulting aligned proteins are often functionally similar but originate in different species. By aggregating at the protein cluster level, kMermaid can capture novel biological effects that may be overlooked when performing analyses conditioned on taxa or when aggregating reads into broader functional categories, such as ortholog groups or pathways. kMermaid can classify tens of millions of sequences in just a few hours, providing the computational speed and resource efficiency needed for the large volumes of data generated through metagenomic sequencing experiments, while matching the sensitivity of BLASTX. Through comprehensive benchmarking against other widely used metagenomic protein mapping tools, we show that kMermaid achieves fast, resource-efficient, and sensitive metagenomic read classification into functional units expected to improve downstream quantitative analysis.

Fig 1. kMermaid unambiguously maps nucleotide sequences to functionally homogeneous protein clusters.

To classify a metagenomic read, a nucleotide query sequence undergoes a six-frame translation (Step 1), and frames containing stop codons are removed. Each amino acid k-mer in a non-truncated coding frame (Step 2) is then mapped to the protein clusters that contain that k-mer in the database. An assignment score is calculated to evaluate a match between the query sequence and each protein cluster, by summing frequencies of k-mers in the query sequence in every protein cluster (Step 3). The query is classified to the protein cluster assigned with the highest assignment score which corresponds to the cluster in which the k-mers of the query are most frequently observed (Step 4). A description of and pseudo-code for kMermaid’s implementation is provided in the S1 Methods.
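
The four steps in Fig 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not kMermaid's actual implementation (the authors' pseudo-code is in the S1 Methods). The layout of `model` (k-mer to `{cluster: frequency}`), the function names, the choice to sum scores over all retained frames, and the toy values are all hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translate(read):
    """Step 1: translate a nucleotide read in all six reading frames."""
    read = read.upper()
    frames = []
    for strand in (read, read.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons containing ambiguous bases (e.g. N) are translated to 'X'.
            frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def classify(read, model, k=5):
    """Steps 2-4: sum k-mer frequencies per cluster and return the top cluster."""
    scores = defaultdict(float)
    for frame in six_frame_translate(read):
        if "*" in frame:  # discard frames that contain a stop codon
            continue
        for i in range(len(frame) - k + 1):
            for cluster, freq in model.get(frame[i:i + k], {}).items():
                scores[cluster] += freq
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy model: 5-mer -> {cluster ID: frequency}.
toy_model = {"MKTAY": {"c1": 0.5, "c2": 0.25}, "KTAYI": {"c1": 0.25}}
best_cluster, best_score = classify("ATGAAAACCGCGTATATTGCGAAA", toy_model)
```

In this toy example the first forward frame translates to MKTAYIAK, so both model k-mers contribute to cluster `c1`, which wins over `c2`.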

Results and discussion

Using k-mer frequencies to map reads to homologous protein clusters

The main motivation for kMermaid is that short metagenomic reads are rarely mapped to a single protein and instead often map to multiple functionally similar proteins. Obtaining a unique read-to-protein mapping requires grouping these homologous proteins into some broader functional unit. Such functional units become especially critical for downstream quantitative analyses to prevent issues such as double counting multi-mapped reads. We find that most reads map to at least five protein hits using BLASTX, which is the minimum recommended value for the number of hits reported, while only 7% of the reads can be uniquely mapped by alignment to a single protein by either BLASTX or DIAMOND in BLASTX mode (Fig 2a). In contrast, by aggregating multi-mapped hits from single proteins into homologous protein clusters (see Methods), we find that more than 93% of the reads can be uniquely mapped to a single cluster or functional unit. In other words, for more than 93% of the reads, all BLASTX hits for the read belong to a single cluster. DIAMOND follows a similar trend to BLASTX. Together, this demonstrates that our clusters resolve the majority of ambiguous alignments without loss of information from multimapping.
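
The comparison between protein-level and cluster-level uniqueness can be illustrated with a short sketch. The data structures (`read_hits`, `protein_to_cluster`), the function name, and the toy identifiers are assumptions made for illustration; they are not taken from the kMermaid code base or from the aligner output formats.

```python
def fraction_unique(read_hits, protein_to_cluster=None):
    """Fraction of reads whose hits collapse to one protein (or one cluster)."""
    unique = 0
    for hits in read_hits.values():
        targets = set(hits)
        if protein_to_cluster is not None:
            # Aggregate protein hits into their homologous clusters.
            targets = {protein_to_cluster[p] for p in targets}
        unique += len(targets) == 1
    return unique / len(read_hits)


# Hypothetical alignment hits per read and protein-to-cluster assignments.
read_hits = {
    "r1": ["pA", "pB"],
    "r2": ["pC"],
    "r3": ["pA", "pD"],
    "r4": ["pB", "pA", "pE"],
}
protein_to_cluster = {"pA": "c1", "pB": "c1", "pC": "c2", "pD": "c3", "pE": "c1"}

protein_level = fraction_unique(read_hits)
cluster_level = fraction_unique(read_hits, protein_to_cluster)
```

Here only one of four reads is unique at the protein level, whereas three of four become unique once hits are aggregated by cluster.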

Fig 2. Protein clusters underlying kMermaid mitigate multi-mapping and allow cluster-specific prevalences of k-mers.

(a) The reduction in the number of reads mapped to >1 protein using default configurations of BLASTX and DIAMOND compared to cluster assignments, when employed to the BLASTX and DIAMOND outputs, respectively, across 29 human fecal samples. (b) The percent of co-clustered proteins using our clustering process also co-clustered in the NCBI PCLA prokaryotic protein clusters for all overlapping proteins, plotted as bars. The x-axis contains individual proteins ordered by cluster, i.e., proteins in the same cluster are plotted next to each other, and the color of the bar corresponds to the percent similarity. (c) The distribution (Kernel density estimation, KDE) of keyword percentage, i.e., percent of cluster members with the most common word from all names of proteins in the cluster, across all clusters (yellow), and for clusters with specific common keywords of interest. (d) Visual representation of the kMermaid cluster frequency model sorted by the number of clusters a k-mer is present in. Colors represent the number of clusters in which a unique k-mer is found, and the y-axis corresponds to the k-mer frequency in the cluster. The panel shows a representative random 50K subset of all k-mers.

We developed kMermaid to uniquely and efficiently map microbial short read sequencing into homologous protein clusters using this underlying clustering framework. A user can provide a file with nucleotide sequences from whole-genome shotgun sequencing to kMermaid, for querying against our precomputed model of 1,793,361 proteins aggregated into 32,308 clusters. kMermaid will uniquely assign each nucleotide read to a cluster and provide a readable functional annotation, i.e., protein label, based on its cluster representative. Because approximately 25% of our cluster representatives had non-descriptive names, e.g., “hypothetical protein,” in RefSeq, we employed HH-suite3 [27] remote homology detection to produce descriptive cluster names. In total, we reannotated 8,617 proteins, of which 6,488 were with high confidence (S1 Table). These include 601 phage proteins, 436 membrane proteins, 230 transcriptional proteins, and 205 lipoproteins (S1 Table). The composition of kMermaid’s clusters is also highly consistent with preexisting, smaller-scale cluster annotations. We verified that 96% of proteins share 100% similarity with existing NCBI protein clusters [28], i.e., all proteins in the kMermaid cluster also co-occur in the broader NCBI-derived clusters (Fig 2b). In addition, using keywords of NCBI-assigned protein names, we show that the functional annotations of the proteins are highly homogenous within clusters (Fig 2c). Therefore, we concluded that kMermaid’s cluster model and corresponding annotations are sufficiently biologically accurate and highly reflective of the cluster content.

kMermaid’s internal pipeline assigns sequencing reads to clusters according to a frequency-based assignment score calculated from amino acid (AA) k-mer frequencies. A higher assignment score indicates that k-mers within a query sequence are more frequently observed in the assigned cluster than other clusters in the underlying model. For some query sequences, a k-mer may uniquely determine cluster assignment, while other query sequences may contain multiple k-mers that have a higher combined frequency in the assigned (maximal) cluster compared with any other cluster. With this in mind, we reasoned that a k-mer found in many clusters may be less informative and its contribution to the assignment score for that cluster is noisy. In the improbable case that all k-mers were present in all clusters at a similar frequency, our cluster assignment would be close to random chance. On the other hand, k-mers that are only found in one or two clusters allow for deterministic classification and enhance our confidence in use for read assignments. Out of approximately 2.5 million AA 5-mers in our model, presence in a single cluster was the most common scenario (22%) and 81% of all AA 5-mers were found in <10 clusters (Fig 2d). We refer to these 5-mers as “deterministic,” such that the presence of a deterministic k-mer in a query sequence is highly informative for cluster assignment. We also found that the AA 5-mers that are present in many clusters have relatively similar frequencies across the clusters when examining the top 10 clusters where they are most frequently found, implying that common 5-mers should not bias the assignment score. As anticipated, given the relative percentage of 5-mers that tend to be deterministic and that the assignment score considers multiple k-mers for each query sequence, kMermaid clusters fully resolve read alignment ambiguity or multi-mapping in more than 95% of the cases (Fig 2a).
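
The notion of how widely a k-mer is shared across clusters can be made concrete with a small sketch. It assumes the frequency model is available as a nested dictionary mapping each k-mer to `{cluster: frequency}`; that layout, the function names, and the toy values are hypothetical.

```python
from collections import Counter


def cluster_spread(model):
    """Histogram: number of clusters a k-mer occurs in -> number of such k-mers."""
    return Counter(len(clusters) for clusters in model.values())


def fraction_below(model, max_clusters=10):
    """Fraction of k-mers that occur in fewer than `max_clusters` clusters."""
    spread = cluster_spread(model)
    return sum(n for c, n in spread.items() if c < max_clusters) / len(model)


# Hypothetical toy model with four 5-mers.
toy_model = {
    "MKTAY": {"c1": 1.0},
    "KTAYI": {"c1": 0.5, "c2": 0.25},
    "GGMKT": {"c3": 1.0},
    "AAAAA": {f"c{i}": 0.1 for i in range(12)},
}
spread = cluster_spread(toy_model)
share = fraction_below(toy_model)
```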

Performance evaluation on simulated data with known labels

To provide validation for our model, we comprehensively benchmarked the accuracy and sensitivity of kMermaid against the most widely used metagenomic protein mapping methods: BLASTX, DIAMOND, and MMseqs2. We first benchmarked kMermaid using simulated data sets with varying rates of point mutations selected to mimic biological mutation rates and sequencing error rates. To compare against ground truth, the reads were simulated from the RefSeq data used in the second step of the clustering procedure (see Methods) so that their true cluster labels would be known. kMermaid, DIAMOND, and MMseqs2 were almost always able to map a read to the correct protein at the level of their reporting, i.e., either the single protein (BLASTX, DIAMOND, MMseqs2) or the protein cluster (kMermaid) (Fig 3a). As expected, when BLASTX and DIAMOND were restricted to reporting only one match per read, their performance dropped considerably as highly homologous proteins will have the same alignment scores. The performance of methods that align to a single protein increases substantially when viewing the singly-aligned results at the cluster level in concordance with the notion that clustering proteins by homology resolves ambiguous alignments in most cases (Fig 2a). Encouragingly, kMermaid was able to assign reads to the correct cluster in nearly all cases, albeit with a slight decrease in coverage (percentage of reads classified) when query sequences are > 500 nucleotides and the mutation rates are high (Fig 3b). Because kMermaid uses a cumulative scoring model, it is expected that the assignments for long, highly mutated sequences are noisier, especially at the default scoring threshold which is tuned to short reads. Similar trends were observed when we simulated reads guaranteed to have a certain number of mutations rather than using a probabilistic rate (S2a and S2b Fig). Thus, by resolving multi-mapped reads at the cluster level, kMermaid shows improved sensitivity compared to methods that assign reads only to individual proteins.
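
A probabilistic point-mutation simulation of the kind described above might look like the following sketch. It assumes a nucleotide coding sequence is available for each source protein and that substitutions are drawn uniformly from the three alternative bases; the simulation procedure actually used for the benchmarks may differ, and all names here are hypothetical.

```python
import random


def mutate(seq, rate, rng):
    """Substitute each base, with probability `rate`, by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_read(cds, length, rate, rng):
    """Draw a random window of `length` nucleotides and apply point mutations."""
    start = rng.randrange(len(cds) - length + 1)
    return mutate(cds[start:start + length], rate, rng)


rng = random.Random(0)  # fixed seed for reproducibility
cds = "ATGAAAACCGCGTATATTGCGAAA" * 10  # hypothetical coding sequence
read = simulate_read(cds, length=150, rate=0.05, rng=rng)
```

Because each read is drawn from a sequence with a known cluster label, the label can be carried along with the read and compared with the assigned cluster afterwards.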

Fig 3. kMermaid sensitivity and resource benchmarking on simulated microbial protein data.

(a) The percentage of reads classified correctly by kMermaid compared with leading read-to-protein mapping tools averaged across 10 simulated datasets per each combination of read length and mutation rate. (b) The number of reads classified by each tool normalized by the number of input reads averaged across 10 simulated datasets per each combination of read length and mutation rate. (c) kMermaid (green) provides up to a 25-fold decrease in runtimes (in seconds, log-transformed) compared to BLASTX and has comparable runtimes to DIAMOND (blue). The y-axis has been truncated and tools that exceeded a 24-hour run time for larger input sizes are denoted with an asterisk. (d) kMermaid (green) requires a fixed, low memory allocation in comparison to other read-to-protein mapping tools. BLASTX was excluded from comparisons with more than 1 million sequences due to the infeasible running times. Methods exceeding 16GB of RAM are denoted with an asterisk.

Computational efficiency of kMermaid

Shotgun metagenomics experiments commonly yield tens of millions of sequences per sample, and each must be queried against a large reference database for classification purposes. Computational efficiency remains a challenge. BLASTX is perhaps the most established nucleotide-to-protein aligner, but its computational time is infeasible for typical large files, necessitating the development of alternatives that can process these reads in a reasonable timeframe. We benchmarked kMermaid’s single-CPU runtime and RAM usage against BLASTX, DIAMOND, and MMseqs2 (Fig 3c, d). DIAMOND is regarded as one of the fastest accurate approaches for protein mapping of metagenomic reads and was in part developed to address the runtime limitations of BLASTX. kMermaid ran 1,600 times faster than BLASTX on files with 100,000 sequences while DIAMOND ran 1,000 times faster. Both methods also provided substantial reductions in running time over MMseqs2. kMermaid and DIAMOND classified 500K sequences in as little as 2.7 minutes and 3.7 minutes, respectively. The same input when given to BLASTX took over six days to complete. Further, classifying 40M sequences took kMermaid 3.3 hours (compared to 1.9-2.2 hours for DIAMOND), which highlights its usability for experimental shotgun metagenomic sequencing files. Because kMermaid, like some other methods, classifies each read independently of the other reads in the same input file, it lends itself to parallelization by splitting input files into smaller chunks, allowing for further speed improvement when resources are available.
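
Since reads are classified independently, an input can be divided into fixed-size chunks that are processed separately. The helper below is a minimal sketch of such chunking; the function name and chunk size are hypothetical, and it is not part of the kMermaid package.

```python
from itertools import islice


def chunked(records, size):
    """Yield successive lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical read identifiers split into chunks of two.
chunks = list(chunked(["r1", "r2", "r3", "r4", "r5"], size=2))
```

Each chunk could then be written to its own file or handed to a separate worker process, for example through `concurrent.futures.ProcessPoolExecutor`, with the per-chunk results concatenated at the end.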

Along with speed, RAM usage is another potentially limiting factor when input files are large. We have developed kMermaid to be highly memory efficient. We therefore compared kMermaid to leading tools including DIAMOND, which has excellent running times, but achieves this performance in speed at the expense of higher memory and multiple CPU utilization. With these limitations in mind, kMermaid performs read assignments in a way that only requires the precomputed k-mer frequency model, and not the input data, to be loaded into memory. As such, kMermaid requires a fixed amount (2GB) of memory per run regardless of file size, whereas DIAMOND and MMseqs2 generally require memory to scale with the increasing input file size (Fig 3d). BLASTX was excluded from comparisons with more than 1 million sequences due to long running times.
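
One way to keep memory independent of input size is to stream reads one record at a time rather than loading the whole file. The generator below is a sketch of that idea, assuming FASTA-formatted input; the input formats and reader actually used by kMermaid are not described here, and the function name is hypothetical.

```python
import io


def stream_fasta(handle):
    """Yield (header, sequence) pairs one record at a time from a FASTA handle."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


# An in-memory handle stands in for a large file opened with open(path).
records = list(stream_fasta(io.StringIO(">r1\nACGT\nAC\n>r2\nGGTT\n")))
```

With this pattern only the current record and the precomputed model need to reside in memory at any time.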

Using kMermaid to map new or unseen proteins

A key challenge in metagenomics is the classification of unknown microbial sequences that are not present in existing reference databases. To understand how kMermaid performs on unknown microbial proteins, we classified segments of 22,435 new RefSeq protein sequences deposited between January and May 2025, after our frequency model was developed. We compared the resulting mappings to BLASTX alignments with the same RefSeq database used to construct kMermaid’s underlying database. Importantly, we observed that the kMermaid assignment score is correlated with BLASTX percent identity (Spearman r = 0.83, 0.82, and 0.80 for reads of length 125, 150, and 200, respectively; S3a, b Fig), highlighting the importance of choosing a more stringent score threshold when classifying reads which are likely to originate from unseen microbes. We further used the area under the receiver operating characteristic curve (AUROC) to assess kMermaid’s ability to correctly classify reads, where a correct classification was based on BLASTX alignment. kMermaid achieved AUROCs of 0.93 and ≥0.96 for reads matching BLASTX results and higher confidence BLASTX results filtered at a more stringent percent-identity threshold of 66.6%, respectively (Fig 4a). We observed that kMermaid scores ranging from 6.3 to 8, depending on the read length with longer reads requiring higher scores, achieved false positive rates ≤0.05 while still maintaining true positive rates around 80% (S2 Table). BLASTX classified a slightly higher percentage of reads (13.7-14.9%) compared to kMermaid (11.7-13.6%), but kMermaid was highly concordant with BLASTX results, with around 95% of assignments correct assuming BLASTX as the gold standard (S2 Table).
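
The AUROC and the true and false positive rates at a given score threshold can be computed from assignment scores and BLASTX-derived correctness labels as sketched below. The pairwise (Mann-Whitney) formulation, the function names, and the toy scores are assumptions for illustration and do not reproduce the analysis code behind Fig 4a or S2 Table.

```python
def auroc(scores, labels):
    """Probability that a correct assignment outscores an incorrect one."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    # Count pairwise wins, giving half credit to ties.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rates_at(scores, labels, threshold):
    """True and false positive rates when accepting scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    n_pos = sum(1 for y in labels if y)
    return tp / n_pos, fp / (len(labels) - n_pos)


# Hypothetical assignment scores; label 1 means the assignment agrees with BLASTX.
scores = [9.0, 7.5, 6.0, 4.0, 3.5, 2.0]
labels = [1, 1, 0, 1, 0, 0]
area = auroc(scores, labels)
tpr, fpr = rates_at(scores, labels, threshold=6.3)
```

Scanning `rates_at` over candidate thresholds is one simple way to pick the lowest score that keeps the false positive rate under a chosen bound.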

Fig 4. Biological applications, function-specific performance, and evidence of remote homology detection of kMermaid.

(a) Receiver operating characteristic curves demonstrating the ability of kMermaid’s assignment score to correctly classify reverse-translated nucleotide segments of varying lengths from unseen protein sequences that were added to RefSeq in early 2025. (b) Agreement of BLASTX alignments and kMermaid protein assignments on 29 fecal samples from ulcerative colitis patients. (c) Boxplot showing kMermaid agreement with BLASTX for clusters with specific functional annotations. (d) Violin plots showing the k-mer frequency scores for six clusters of reads unclassified with BLASTX that were correctly functionally classified by kMermaid.

kMermaid is highly sensitive for protein cluster mapping of human fecal samples

Even though kMermaid performed well on simulated reads, it is difficult to account for the additional challenges and noise associated with real, experimental data by simulations alone. Therefore, we performed additional testing using real sequencing data from 29 publicly available human fecal samples of ulcerative colitis patients [29], comparing kMermaid assignments against BLASTX. On average, kMermaid results agreed with BLASTX alignments 83.3% of the time and the agreement rate was highly consistent across the 29 samples (Fig 4b). Assuming BLASTX hits to be the ground truth, kMermaid was able to maintain a balance between retaining a high percentage of the assignments that agree with BLASTX hits as well as a high ratio of assignments that agree with BLASTX to assignments that disagree with BLASTX (Figs 4b and S3c) at the default kMermaid assignment score ≥ 3. Given that the overwhelming majority of BLASTX hits belong to a single cluster (Fig 2a), the consistent agreement between BLASTX and kMermaid provides strong evidence for kMermaid’s ability to accurately and sensitively classify short reads from human fecal metagenomic sequencing.

Cluster specific results

A primary objective of kMermaid is to achieve high performance for classification at the read level. To comprehensively assess kMermaid’s performance, we also evaluated its cluster specific performance. We compared the kMermaid read assignments of experimental metagenomic sequencing input samples used for benchmarking to the assignment by BLASTX, within each cluster. Interestingly, we find that clusters related to restriction, toxins, transposons, and those in the GCN5-related N-acetyltransferases family (GNAT) had high agreement with BLASTX, whereas clusters related to ABC transporters tended to have relatively low agreement with BLASTX, and therefore likely lower accuracy (Fig 4c). We also confirmed that the proportion of reads concordant with BLASTX was correlated with the mean kMermaid assignment score for all reads mapping to the cluster, a trend that was not confounded by the number of reads mapped to the cluster (S3d Fig). Importantly, by investigating reads that were assigned with a high kMermaid assignment score but were not classified by BLASTX, we identified reads with remote homology to proteins within kMermaid clusters. We verified a correct functional classification of reads assigned to six such kMermaid clusters using both PSI-BLAST [30] and HHblits3 from HH-suite3 [27] and validated these kMermaid functional annotations (Fig 4d and S1 File).

Conclusions

Metagenomic sequencing allows for the functional profiling of diverse microbes facilitating numerous biomedical applications, such as biomarker discovery and disease prediction [31-33]. To date, k-mer and binning based methods have been immensely useful in allowing efficient and sensitive classification of short read metagenomic sequencing into taxonomic units [21-24,34-37]. However, to the best of our knowledge, analogous methods that achieve sensitive and efficient classification at the protein level have not been developed. As a result, there remain critical limitations in our ability to classify short microbial reads for the ultimate task of downstream analysis and biological inference. Notable limitations of functional read assignment methods are ambiguous alignments and computational costs which are prohibitive for the large volumes of data typically generated by next generation sequencing experiments. The loss of granularity from higher-level functional mappings into pathways additionally prohibits analyses where amino acid sequences may be needed such as microbial peptide binding prediction [38,39]. As such, there is a need for methods that can capture and retain the underlying biology of microbial function in a granular, computationally feasible, and accessible manner.

To address these challenges, we have developed kMermaid, a novel ultrafast method for the unambiguous and sensitive classification of short reads into functional units consisting of protein clusters. We show that by using a well-known concept of clustering homologous proteins into a single functional unit, kMermaid rapidly resolves the majority of ambiguous BLASTX protein alignments while retaining granular amino acid level information. kMermaid uses a precomputed k-mer frequency model based on high-confidence protein clusters encompassing almost two million microbial proteins from RefSeq. Our designated clustering allows the assignment of diverse proteins into clusters with mostly homogenous k-mers or words, enhancing the potential of the model to correctly capture distinct functions. Using both simulated short reads and sequencing data from real human fecal samples, we demonstrate that kMermaid classifies reads with high accuracy and sensitivity compared to BLASTX but runs up to 2,500% faster on typical files with tens of millions of sequences. Additionally, we were able to verify kMermaid’s ability to correctly classify reads that were unclassified by BLASTX, for sequences sharing remote homology to proteins within kMermaid clusters. This striking performance is likely achieved through kMermaid’s composite assignment scoring which uses information on a set of proteins in a cluster to classify each read in contrast to BLASTX, which is based on pairwise comparisons.

kMermaid assigns a short read to a single protein cluster from a fixed set (database) based on a global maximum k-mer frequency assignment score, which implies certain limitations. First, in the case of ties it will arbitrarily assign the sequence to the first cluster that reaches the maximum score. Despite this, we have shown that ties should not be a widespread issue since most multi-mapped reads are mapped to a single cluster (Fig 2a) and many amino acid 5-mers are unique (Fig 2d). Second, it is possible that there exist additional clusters of proteins which are not homologous or functionally similar to any provided through our precomputed kMermaid database. Because of this, it is recommended that researchers use our method as a first pass means of dealing with the computational burden of BLASTX and perform alignment-based verification for select proteins of interest. Orthogonally, additional kMermaid databases/models can be (re)trained to reflect periodic increases in the quantity and diversity of available sequences. Third, although we did find evidence supporting some ability for remote homology detection, like any method that relies on sequence comparisons, kMermaid cannot classify truly novel or unseen proteins. Last, there are several biological limitations of inferring function from short reads that kMermaid does not address, including misclassification of multi-domain proteins and operons. These limitations will remain for any metagenomic protein mapping tool designed for short reads and researchers wanting a more thorough view of the functional content of metagenomes should consider methods that include some degree of contig assembly prior to classification [40-42].
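
The tie behavior described above resembles how a first-maximum selection works in Python, where `max` returns the first maximal item it encounters. The sketch below illustrates this and shows one way a user could flag tied reads for follow-up; it is an illustration under that assumption, not a description of kMermaid's internals, and the cluster names are hypothetical.

```python
def top_clusters(scores, tol=1e-9):
    """Return every cluster whose score is within `tol` of the maximum."""
    best = max(scores.values())
    return [c for c, s in scores.items() if best - s <= tol]


# Hypothetical per-cluster scores for one read; two clusters tie at the top.
scores = {"cluster_7": 4.0, "cluster_2": 4.0, "cluster_9": 1.5}

winner = max(scores, key=scores.get)  # first key that attains the maximum
tied = top_clusters(scores)
is_ambiguous = len(tied) > 1
```

Reads flagged as ambiguous in this way are natural candidates for the alignment-based verification recommended above.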

In summary, we present kMermaid, a novel, sensitive, and runtime and memory efficient approach for the task of assigning protein identities to short microbial sequences. Future studies can utilize kMermaid for the discovery of microbial functional biomarkers and as a precursor to downstream quantitative functional analyses.

Methods

Forming functionally similar microbial protein clusters using a two-step clustering procedure

kMermaid is foremost designed to address ambiguity in read alignment, where most shotgun metagenomic reads align against multiple microbial proteins with similar alignment scores and are therefore classified as multiple functionally related proteins by the alignment (Fig 2a). To circumvent this issue, kMermaid uses a well-defined concept and groups functionally related proteins into functional units prior to the assignment such that a read can be uniquely classified into a single functional unit. We therefore constructed a set of comprehensive and high confidence clusters of microbial proteins by employing a two-step clustering procedure using CD-HIT [43,44]. CD-HIT uses an incremental greedy algorithm to identify representative sequences and cluster remaining sequences by sequence similarity using short word filtering. We first clustered 43,176 proteins from the NCBI RefSeq non-redundant microbial protein database [45] using CD-HIT with a similarity threshold of 65% and a word size of 5. This first clustering step resulted in 32,308 clusters allowing non-redundant clusters. In the second step, to further expand and diversify protein members within these clusters, we applied CD-HIT-2D [44] with a 70% sequence similarity threshold to cluster all RefSeq microbial proteins against the previously selected cluster representatives (N = 1,797,426 proteins from RefSeq, dataset downloaded in May 2023). A set of expanded clusters was created from this process such that the final dataset clusters 1,793,361 proteins into 32,308 functional groups.
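
CD-HIT reports cluster membership in a plain-text `.clstr` file, in which each cluster starts with a `>Cluster N` line and the representative sequence is marked with a trailing asterisk. The parser below is a minimal sketch for turning such output into cluster-to-member mappings; the function name, the returned structure, and the accession numbers in the example are hypothetical, and the authors' own post-processing is not described in the text.

```python
import re

# Member lines look like: "0<TAB>350aa, >NAME... *" or "... at 91.50%".
MEMBER = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster ID: {"members": [...], "representative": name}}."""
    clusters, current = {}, None
    for line in lines:
        line = line.rstrip()
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = {"members": [], "representative": None}
        elif line and current is not None:
            match = MEMBER.search(line)
            if match is None:
                continue
            name = match.group(1)
            clusters[current]["members"].append(name)
            if line.endswith("*"):  # asterisk marks the representative
                clusters[current]["representative"] = name
    return clusters


# Made-up example in the .clstr layout.
example = """>Cluster 0
0\t350aa, >WP_000001.1... *
1\t342aa, >WP_000002.1... at 91.50%
>Cluster 1
0\t120aa, >WP_000003.1... *
""".splitlines()
clusters = parse_clstr(example)
```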

Assigning functions to clusters of hypothetical or uncharacterized microbial proteins

kMermaid annotates its underlying protein clusters based on the name of the representative sequence as determined in the initial CD-HIT phase. Even though these proteins are in the representative microbial protein database, 8,617 (approximately 25%) of the cluster representatives are best annotated by NCBI or RefSeq as hypothetical proteins, i.e., proteins of unknown or unverified function. To annotate these protein clusters and assign them protein names, we used HHblits from HH-suite3 [27] for remote homology detection against two databases derived from the Protein Data Bank and UniProtKB (PDB70 [46] and Uniclust30 [47] v2023, respectively) and selected the match with the lowest e-value across the databases that did not map to hypothetical, unknown, or uncharacterized proteins. We were able to confidently assign a protein name or function to 6,488 of the 8,617 hypothetical clusters (e-value < 0.01); the remaining clusters are assigned with lower confidence.
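The selection rule described above (lowest e-value among matches that are not themselves hypothetical, unknown, or uncharacterized) can be sketched as follows. This is only an illustrative sketch: the `Hit` record, the `pick_annotation` function, and the assumption that hits have already been parsed from the HHblits output are hypothetical and not taken from the kMermaid code base.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """One parsed remote-homology match (hypothetical record layout)."""
    database: str
    name: str
    evalue: float


# Names containing any of these words are treated as uninformative.
UNINFORMATIVE = ("hypothetical", "unknown", "uncharacterized")


def pick_annotation(hits: list[Hit],
                    confident_evalue: float = 0.01) -> Optional[tuple[Hit, bool]]:
    """Return (best informative hit, is_confident), or None if no hit is informative."""
    informative = [
        h for h in hits
        if not any(word in h.name.lower() for word in UNINFORMATIVE)
    ]
    if not informative:
        return None
    # Lowest e-value across all databases wins.
    best = min(informative, key=lambda h: h.evalue)
    return best, best.evalue < confident_evalue
```

Under this sketch, a cluster whose best informative match has an e-value of 0.01 or higher would still receive a name, but flagged as lower confidence.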

Creating kMermaid’s k-mer frequency cluster model using nested hashing

The goal of kMermaid is to functionally classify short reads using the previously defined clusters of functionally similar microbial proteins. To this end, we built a k-mer frequency model by obtaining all amino acid (AA) k-mers of all protein sequences in each of 32,308 clusters and computing the cluster-level frequency of each AA k-mer (S1 Fig). The value of k was chosen through hyperparameter search, by simulating truncations of each protein in the database into 50 overlapping AA segments. We evaluated k values of 3, 4, 5 and 6 for classifying truncated protein sequences, measuring accuracy as the fraction of truncated sequences which were correctly assigned to their cluster. Both k = 5 and k = 6 achieved similarly high accuracy (>0.99), but the number of k-mers increased sharply from 2,574,615 for k = 5 to 12,043,100 for k = 6. Therefore, k = 5 was selected, achieving high accuracy with substantially fewer parameters.

We then obtained overlapping 5-mers of each protein amino acid sequence. The k-mer frequencies for each cluster C were defined by the count of the k-mer in the cluster C divided by the total number of proteins in the cluster (note that frequency can be > 1 if k-mers appear multiple times on average in the proteins of a cluster). The underlying model is then stored in a two-level hash map where the top-level map stores, for each k-mer w, a map of clusters containing w, and the second level maps the clusters C containing w to the frequency of w in C, $\mathrm{frequency}_C^w$ (S1 Methods, Algorithm 1). The resulting nested hash map can be written as $\mathrm{Model}[w][C] \rightarrow \mathrm{frequency}_C^w$. This precomputed model, consisting of 2,574,615 unique 5-mers (all the naturally occurring AA 5-mers in the underlying protein cluster database), is then used to determine the cluster to which a query sequence is assigned. The map is distributed along with the kMermaid package and is implemented as the default frequency model.
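As a minimal sketch of the nested hash map described above, the following Python builds a k-mer to cluster to frequency mapping from a dictionary of clusters. The function names and the input layout (cluster identifier mapped to a list of amino acid sequences) are assumptions made for illustration; the authors' own procedure is given in S1 Methods, Algorithm 1.

```python
from collections import defaultdict


def kmers(seq: str, k: int = 5):
    """Yield every overlapping k-mer of an amino acid sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_model(clusters: dict[str, list[str]],
                k: int = 5) -> dict[str, dict[str, float]]:
    """Return model[w][C] = (count of k-mer w in cluster C) / (number of proteins in C)."""
    model: dict[str, dict[str, float]] = defaultdict(dict)
    for cluster_id, proteins in clusters.items():
        counts: dict[str, int] = defaultdict(int)
        for protein in proteins:
            for w in kmers(protein, k):
                counts[w] += 1
        # Normalise by cluster size; values above 1 are possible for repeated k-mers.
        for w, count in counts.items():
            model[w][cluster_id] = count / len(proteins)
    return dict(model)
```

Keying the outer map by k-mer means that scoring a read only touches the clusters that actually contain one of its k-mers, rather than iterating over every cluster.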

Assigning protein maps to reads using the pre-computed k-mer model

To assign a read to a protein cluster, a six-frame translation is applied to a query sequence r and translations containing a stop codon (truncated frames) are discarded. Next, all overlapping AA 5-mers are extracted from the non-truncated frames. A score representing the strength of a match between r and each cluster C is then calculated by the summation of the precomputed model k-mer frequencies for each k-mer w in the query sequence. The score $s_{r,C}$ for query sequence r and each cluster C is computed as:

$$s_{r,C} = \sum_{w \in r} \mathrm{frequency}_C^w$$

where $\mathrm{frequency}_C^w$ is the model frequency of each k-mer w from r in cluster C, i.e., the average occurrence of w in proteins of C. Finally, the sequence is assigned to the cluster with the global maximum score across all clusters (Fig 1; S1 Methods, Algorithm 2). kMermaid annotates the query sequence by the cluster representative for the cluster corresponding to this maximum assignment score. This scoring approach effectively assigns higher confidence to scores when k-mers within a query sequence are more frequently observed in a cluster. kMermaid uses a k of 5 AA, chosen via the hyperparameter search described previously, for both the base model construction and the assignment procedure, and reports assignments with an assignment score >3.
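The assignment procedure can be illustrated with the sketch below. It is an assumption-laden reading of the text rather than the package's implementation: the standard genetic code is assumed, k-mers from all non-truncated frames are pooled into one score per cluster, ties go to the first cluster to reach the maximum, and the names `six_frame_translations` and `assign_read` are hypothetical.

```python
from collections import defaultdict
from typing import Optional

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + n]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for n, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(read: str) -> list[str]:
    """Translate three forward and three reverse-complement frames."""
    read = read.upper()
    frames = []
    for strand in (read, read.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons with ambiguous bases (e.g. N) become 'X'.
            frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def assign_read(read: str, model: dict[str, dict[str, float]],
                k: int = 5, min_score: float = 3.0) -> Optional[tuple[str, float]]:
    """Return (cluster, score) for the best-scoring cluster, or None if score <= min_score."""
    scores: dict[str, float] = defaultdict(float)
    for frame in six_frame_translations(read):
        if "*" in frame:  # discard truncated frames
            continue
        for i in range(len(frame) - k + 1):
            for cluster, freq in model.get(frame[i:i + k], {}).items():
                scores[cluster] += freq
    if not scores:
        return None
    best = max(scores, key=scores.get)  # first maximum wins on ties
    return (best, scores[best]) if scores[best] > min_score else None
```

Here `model` is expected to have the nested k-mer to cluster to frequency layout described in the previous section.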

Evaluating the clusters of functionally similar microbial proteins

As the protein clusters lie at the base of kMermaid’s approach, we validated their correctness using two orthogonal analyses aimed at verifying that the clusters produced contain homologous proteins with shared biological function.

  1. Compatibility with NCBI protein clusters. To demonstrate kMermaid’s ability to construct biologically relevant clusters, we compared the results of the two-step CD-HIT clustering to the datasets from the NCBI Protein Clusters [28], which groups together proteins by sequence similarity. A subset of 102,380 of the proteins contained in our expanded cluster model was also clustered through the prokaryotic PCLA protein clusters dataset within this database. Proteins that were in this overlapping subset and were also in a non-singleton kMermaid cluster were used for comparison (N = 102,367, mapped to 9,984 kMermaid clusters). For each of these proteins, we evaluated their tendency to be co-clustered with the same proteins in both PCLA and kMermaid clusters by computing the percent of co-clustered proteins by kMermaid clustering that were also co-clustered in PCLA. The number of kMermaid clusters was chosen as the denominator for this evaluation metric to verify the correctness of kMermaid clusters rather than to assess its ability to maximize clusters, which is not an objective of this approach.

  2. Within-cluster keyword similarity. High-throughput text analysis was performed on the protein name annotations to further investigate the similarity and homogeneity of the clusters. Trends and frequencies of word presence in clusters were used to evaluate the functional homogeneity of clusters. After removing ubiquitous and generic words (e.g., “bacteria” or “protein”), we computed the frequency of the most common keyword found in each cluster, i.e., the fraction of proteins in the cluster containing the most common keyword in that cluster. Most clusters demonstrated a common keyword frequency of 1, indicating that our clusters are highly homogeneous in key functions.
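The keyword frequency from the second analysis can be sketched as below. The tokenisation rule and the list of generic words are assumptions for illustration (the text names only “bacteria” and “protein” as examples), and `top_keyword_frequency` is a hypothetical helper name.

```python
import re
from collections import Counter

# Illustrative stop list; only "bacteria" and "protein" are named in the text.
GENERIC_WORDS = frozenset({"bacteria", "protein"})


def top_keyword_frequency(protein_names, generic_words=GENERIC_WORDS):
    """Return (keyword, fraction of proteins whose name contains it), or None."""
    counts = Counter()
    for name in protein_names:
        # Count each word at most once per protein name.
        words = set(re.findall(r"[a-z0-9]+", name.lower())) - generic_words
        counts.update(words)
    if not counts:
        return None
    keyword, n_proteins = counts.most_common(1)[0]
    return keyword, n_proteins / len(protein_names)
```

A value of 1 means every protein name in the cluster shares the same most common keyword.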

Benchmarking against established protein mapping tools

We benchmarked kMermaid against popular methods that can be used for protein mapping: BLASTX, DIAMOND, and MMseqs2. Each method was run against an underlying database containing the same 1,797,426 RefSeq protein sequences described previously. For consistency, we used the default or recommended configurations of each method. BLASTX and DIAMOND in BLASTX mode were run with e-values of 1e-4 and 1e-3 (default), respectively. MMseqs2 was run with a min-length set to 16 and e-value set to 1e-4. Each method except for BLASTX was set to report the default number of matches based on the e-value. BLASTX was set to report a maximum of 1 match for simulated data and 5 matches for experimental data due to its excessive running time when reporting all matches. DIAMOND was additionally run with a maximum of 1 match to demonstrate the consistency of mapping when using higher-level groups rather than exact matches, as well as to provide a fairer comparison for resource benchmarking.

Performance assessment using reads simulated from RefSeq sequences.

To demonstrate that kMermaid correctly assigns proteins to clusters, we benchmarked kMermaid using data simulated from nucleotide sequences of 1,383 microbial coding frames downloaded from RefSeq for which the true cluster identity is known, i.e., proteins that already exist in kMermaid’s model and thus have a ground-truth cluster label. From these, simulated data were generated with varying mutation rates and read lengths. Mutation rates were chosen to be representative of bacterial mutation rates [48] and error rates in next-generation sequencing data [49]. For continuous rates, the number of mutations per sequence was determined probabilistically using a binomial distribution and the location of each mutation in the sequence was determined by random sampling. A mutation was defined as a random assignment of any nucleotide that did not match the original nucleotide at that position. Query sequences were then created by segmenting the mutated sequence to the specified read length, l, starting at some random position x such that x + l ≤ length(read). Since low mutation rates could probabilistically result in no mutations, we additionally simulated reads guaranteed to have 1, 2, 3, or 4 mutations resulting in an amino acid change. To guarantee an amino acid change we employed the following procedure: 1) randomly select a substring of read length, l, starting at some random position x such that x + l ≤ length(read), 2) perform a translation on the protein nucleotide sequence for the correct open reading frame based on the original sequence, 3) randomly select an amino acid to change, 4) concatenate the original substring with a randomly selected reverse translation of the amino acid. Protein maps were assigned using the same reference database of 1,797,426 proteins that were used to develop the kMermaid database using default configurations of BLASTX, DIAMOND, kMermaid, and MMseqs2 or as described above. In lieu of the defaults, BLASTX was run with the maximum number of reported matches set to the recommended minimum of 1 to accommodate a reasonable running time. DIAMOND was additionally run with only (at most) 1 match as a point of comparison for unambiguous reporting. Results from the simulations were averaged across 10 replicate datasets generated with a different random seed for each combination of parameters.
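The continuous-rate simulation can be sketched as follows. This is an illustrative reading of the description, not the authors' simulation code: the function names are hypothetical, the binomial draw is implemented as a sum of independent Bernoulli trials, and the guaranteed amino acid change procedure is not covered.

```python
import random

NUCLEOTIDES = "ACGT"


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Substitute a Binomial(len(seq), rate) number of randomly placed nucleotides."""
    n_mutations = sum(rng.random() < rate for _ in range(len(seq)))
    positions = rng.sample(range(len(seq)), n_mutations)
    bases = list(seq)
    for pos in positions:
        # A mutation always changes the base at that position.
        bases[pos] = rng.choice([b for b in NUCLEOTIDES if b != bases[pos]])
    return "".join(bases)


def simulate_read(seq: str, read_len: int, rate: float,
                  rng: random.Random) -> tuple[str, int]:
    """Mutate seq, then cut a read of read_len starting at a random position x."""
    mutated = mutate(seq, rate, rng)
    # randint is inclusive, so x + read_len never exceeds the sequence length.
    start = rng.randint(0, len(mutated) - read_len)
    return mutated[start:start + read_len], start
```

Passing a seeded `random.Random` instance per replicate is one way to obtain the kind of reproducible replicate datasets described above; the sketch raises `ValueError` if the requested read length exceeds the sequence length.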

Benchmarking computational resource utilization.

We also compared the speed and maximum memory usage of kMermaid to BLASTX, DIAMOND, and MMseqs2. DIAMOND, which is up to 20,000 times faster than BLASTX, is widely considered the fastest protein aligner that can maintain the sensitivity of BLASTX results and is included in our benchmarking as a standard for efficient resource consumption. To compare the runtime of each method, we created random subsets of a single fasta file containing nucleotide metagenomic sequences from a published immunotherapy trial of melanoma patients [50]. For running time comparisons, we tested input files with a varying number of sequences ranging from 5,000–40 million with 10 replicates each to account for machine or algorithmic variability. BLASTX and DIAMOND sequence queries were performed against the same database that was used to create the kMermaid model described above. All comparisons were run on a Linux kernel using 1 task and 1 CPU per task. Because most methods can run tens of millions of sequences in under a day, we set an upper time limit of 72 hours and denote jobs that were unable to be completed in that time frame. Since no jobs took between 24 and 72 hours, we truncated the upper limit of the y-axis for visualizations to 24 hours.

Individual metagenomic sequencing experiments can yield large volumes of data and file sizes are commonly on the order of tens of gigabytes. As such, efficient memory usage is another important factor to consider when choosing analysis tools. We compared the memory (RAM) usage between all methods for input files containing 100,000, 1 million, 10 million, 20 million, 40 million, 60 million, and 80 million reads, with the latter numbers corresponding to the total number of reads commonly generated from a standard paired-end sequencing experiment (10-40M each, combined paired end). Because our SLURM cluster was not set up to report maximum memory usage, maximum RAM usage was inferred by submitting jobs with increasing amounts of memory in 1–2GB intervals (500MB, 1GB, 2GB, 3GB, 4GB, 6GB, 8GB, 10GB, 12GB, 14GB, 16GB) until the job completed successfully without reporting a memory-related error. For example, a job reported as requiring 12GB of memory used more than 10GB but less than 12GB of RAM. We flagged jobs that could not be completed with 16GB of memory, a limit meant to reflect the feasibility of analysis on a laptop. Runtime in seconds was computed manually using the date function in Linux. BLASTX was excluded from comparisons of over 1,000,000 sequences due to its infeasibly high running time, although we acknowledge that BLASTX memory usage is generally minimal.
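The stepwise inference of peak memory can be sketched as a simple search over the allocation ladder. In this sketch, `job_succeeds` is a hypothetical callback standing in for submitting a job with a given memory limit and checking whether it finished without a memory-related error; no scheduler commands are shown.

```python
from typing import Callable, Optional

# Allocation ladder in GB, as listed in the text.
MEMORY_LADDER_GB = [0.5, 1, 2, 3, 4, 6, 8, 10, 12, 14, 16]


def bracket_peak_memory(job_succeeds: Callable[[float], bool],
                        ladder=MEMORY_LADDER_GB) -> Optional[tuple[float, float]]:
    """Return (largest failing limit, smallest succeeding limit) in GB.

    Returns None when the job fails even at the largest allocation.
    """
    previous = 0.0
    for limit in ladder:
        if job_succeeds(limit):
            return previous, limit
        previous = limit
    return None
```

For a job whose true peak usage lies between 10 and 12GB, the function returns `(10, 12)`, which corresponds to the "requiring 12GB" label in the example above.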

Analysis of unseen RefSeq microbial protein data

Since kMermaid was first implemented in 2023, we were able to test the ability of our tool to classify unknown microbial proteins by using 22,435 annotated protein sequences deposited in RefSeq between January 2025 and May 2025 (obtained May 2025). Because our method uses nucleotide reads as input, we had to perform a reverse translation of the amino acid sequences. If an amino acid reverse mapped to multiple tri-nucleotide sequences without a stop codon, a tri-nucleotide sequence was selected at random, which created an even more challenging classification task. We also removed sequences with missing or non-standard amino acids at this stage. We then created datasets containing randomly selected 125, 150, and 200 base pair substrings of the reverse translated sequences with 10 different seeds each. We then ran BLASTX with max_target_seqs = 5 and evalue = 0.0001 and kMermaid with default configurations. Spearman correlation was computed on all BLASTX and kMermaid results; datapoints were down sampled to 150,000 across all 30 datasets for visualization (S2a Fig). We additionally performed post-hoc filtering of BLASTX results using percent identity ≥66.6 to obtain higher confidence hits for comparison with kMermaid. Because the reference databases for each method did not contain the truth assignment, we considered a correct hit, or true positive, to be a sequence for which the BLASTX and kMermaid protein map matched. We computed AUROC using the kMermaid scores for each read assigned by both BLASTX and kMermaid. We then determined the kMermaid scoring threshold at which the false positive rate fell below 0.05 for each read length. These thresholds were used for subsequent comparisons with BLASTX (S2 Table). We note that while kMermaid assignment scores generally correlate with higher confidence for data of consistent read length, the specific score thresholds reported are likely specific to this analysis given the cluster-specific trends and biases in recently deposited protein sequences, i.e., over-representation of specific proteins.
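One way to pick a score threshold at a target false positive rate is sketched below. The conventions here are assumptions, not the authors' procedure: a read counts as a negative when its kMermaid assignment disagrees with BLASTX, a read is retained when its score is greater than or equal to the threshold, and the function name is hypothetical.

```python
def score_threshold_at_fpr(scores, agrees_with_blastx, max_fpr=0.05):
    """Return the lowest observed score t such that the fraction of
    discordant reads with score >= t is below max_fpr, or None."""
    negatives = [s for s, ok in zip(scores, agrees_with_blastx) if not ok]
    for threshold in sorted(set(scores)):
        if negatives:
            fpr = sum(s >= threshold for s in negatives) / len(negatives)
        else:
            fpr = 0.0  # no discordant reads: every threshold qualifies
        if fpr < max_fpr:
            return threshold
    return None
```

Applying the function separately to the reads of each length (125, 150, and 200 base pairs) would give one threshold per read length, in the spirit of the read length-specific thresholds described above.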

Protein mapping of reads from fecal samples from ulcerative colitis patients

The goal of kMermaid is to efficiently map reads to proteins, while maintaining the accuracy and sensitivity of BLASTX. We benchmarked kMermaid’s functional read classification against BLASTX using paired end reads from ulcerative colitis patients enrolled in the LOTUS fecal matter transplant clinical trial [29] available on the NCBI’s Sequence Read Archive (SRA). Two samples were excluded from analysis based on data incompletion (low read depth). Due to the infeasibly long running times incurred by BLASTX, we randomly subsampled each fasta file to 100,000 reads. We used a small, representative subset (n = 3) of the available samples for in-depth follow-up analyses of all reads in the samples. BLASTX was set to max_target_seqs = 5, evalue = 1e-4 and to max_target_seqs = 3, evalue = 0.01 for these analyses, respectively. We additionally filtered BLASTX results at percent identity >66.6% where specified. We compared the overall coverage, defined as the percentage of reads that were able to be classified by both BLASTX and kMermaid, as well as the percentage of BLASTX hits that kMermaid was able to classify. We further investigated the correctness of kMermaid’s assignments using BLASTX results as a gold standard for correct read assignment, by examining reads which were assigned by both methods. Correct assignment by kMermaid was defined for reads as a non-empty overlap between proteins in the BLASTX hits and the assigned kMermaid clusters.
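The coverage and correctness definitions above can be written down compactly. The sketch assumes that BLASTX hits have been parsed into a set of protein identifiers per read and that cluster membership is available as a set of protein identifiers per cluster; the function and key names are hypothetical.

```python
def compare_assignments(blastx_hits: dict[str, set[str]],
                        kmermaid_clusters: dict[str, str],
                        cluster_members: dict[str, set[str]],
                        n_reads: int) -> dict[str, float]:
    """Summarise agreement between BLASTX hits and kMermaid cluster assignments."""
    # Reads classified by both methods.
    both = blastx_hits.keys() & kmermaid_clusters.keys()
    # Correct = non-empty overlap between BLASTX hit proteins and the
    # members of the cluster that kMermaid assigned.
    concordant = sum(
        1 for read in both
        if blastx_hits[read] & cluster_members[kmermaid_clusters[read]]
    )
    return {
        "coverage_both": len(both) / n_reads,
        "blastx_reads_recovered": len(both) / len(blastx_hits) if blastx_hits else 0.0,
        "concordance": concordant / len(both) if both else 0.0,
    }
```

Because correctness is defined at the cluster level, a read counts as correct when any one of its BLASTX hits belongs to the assigned cluster, even if the other hits do not.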

Cluster-specific results and remote homology detection

To evaluate kMermaid’s performance for specific biological functions, we calculated the accuracy across clusters with similar functional annotations. We used the LOTUS clinical trial data to evaluate function specific performance in a real metagenomic sequencing cohort. To this end, for every cluster we calculated the ratio of reads correctly assigned to that cluster (i.e., assigned to that cluster by BLASTX and kMermaid), out of all the reads assigned to that cluster by kMermaid. Then, we evaluated the distribution of cluster performances for distinct functions, i.e., clusters named with common keywords (S3 Table).
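The per-cluster ratio described above amounts to a grouped proportion. In this sketch, the input is assumed to be one (cluster, agrees with BLASTX) pair per read assigned by kMermaid, and `per_cluster_agreement` is a hypothetical name.

```python
from collections import defaultdict


def per_cluster_agreement(assignments) -> dict[str, float]:
    """Map each kMermaid cluster to the fraction of its reads that BLASTX also supports.

    assignments: iterable of (kmermaid_cluster, agrees_with_blastx) pairs.
    """
    total: dict[str, int] = defaultdict(int)
    agree: dict[str, int] = defaultdict(int)
    for cluster, ok in assignments:
        total[cluster] += 1
        agree[cluster] += bool(ok)
    return {cluster: agree[cluster] / total[cluster] for cluster in total}
```

The resulting per-cluster values could then be grouped by keywords in the cluster names to examine their distribution for distinct functions.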

To evaluate kMermaid’s ability to identify sequences with remote homology to proteins in the database, we examined reads from the LOTUS clinical trial data that were not classified by BLASTX and were assigned with high kMermaid assignment scores (>20). We explored reads that were confidently assigned to six clusters by kMermaid but failed to be classified with BLASTX, and carefully verified that these reads have remote homology to their kMermaid-assigned clusters using PSI-BLAST and HHblits from HH-suite3 [27] (S1 File).

Supporting information

S1 Fig. kMermaid’s internal k-mer frequency model with all steps outlined.

(TIF)

pcbi.1013470.s001.tif (940.5KB, tif)
S2 Fig. Performance on simulated reads with known labels, with 1–4 introduced mutations per read.

(a) Percent correct labels by each method for 3 typical read lengths evaluated. (b) Percent of input reads mapped by each method for 3 typical read lengths evaluated.

(TIF)

pcbi.1013470.s002.tif (748.2KB, tif)
S3 Fig. Biological applications.

(a) Spearman correlation between kMermaid’s assignment score and the maximum BLASTX percent identity per read for all reverse-translated nucleotide segments (lengths = 125, 150, 200 base pairs) of each unseen RefSeq protein that was mapped by both methods without thresholding. To prevent overplotting, reads were down sampled to 100,000 (40% of total). (b) Histograms showing the kMermaid score of all reverse-translated nucleotide segments (lengths = 125, 150, 200 base pairs) of each unseen RefSeq protein that was mapped by both kMermaid and BLASTX at > 66.6 percent identity. The dashed lines denote the read length-specific thresholds determined by maintaining a false positive rate < 0.05. (c) The percent of all input reads able to be classified by kMermaid compared to BLASTX for sequencing from 3 representative colitis samples, chosen randomly. kMermaid’s optimal scoring threshold was determined by maximizing the percentage of the assignments that agree with BLASTX hits (Sensitivity, dark blue) while retaining a high ratio of assignments that agree with BLASTX to assignments that disagree with BLASTX (Accuracy, light blue). (d) Correlation between the proportion of reads concordant with BLASTX and the mean assignment scores (log-transformed) for all proteins in the cluster. Distributions of these metrics broken down by number of reads mapped to the cluster where clusters in the bottom tertile have the lowest number of mapped reads and clusters in the top tertile contain the highest.

(TIF)

pcbi.1013470.s003.tif (1.2MB, tif)
S1 Table. Protein names and descriptions of cluster representatives of the protein clusters underlying the kMermaid model.

(CSV)

S2 Table. Performance evaluation for sequences from unseen RefSeq microbial protein data.

(XLSX)

pcbi.1013470.s005.xlsx (9.5KB, xlsx)
S3 Table. The median kMermaid model performance when classifying protein clusters containing different, specific keywords.

(CSV)

pcbi.1013470.s006.csv (87.3KB, csv)
S1 File. Selected reads that failed to be classified with BLASTX but were correctly classified with kMermaid share remote sequence homology with proteins in their associated clusters.

(TXT)

pcbi.1013470.s007.txt (6.2KB, txt)
S1 Methods. Pseudo-code for model training and read assignment.

(PDF)

pcbi.1013470.s008.pdf (123.1KB, pdf)

Data Availability

Availability and usage: kMermaid is freely available as an open-source command line program written in Python that requires a user-provided input file containing query sequences in either fastq or fasta format. The precomputed k-mer frequency model and files containing the protein cluster members and cluster representatives are both accessible and used as internal default parameters in kMermaid. Complete download and installation instructions are found at github.com/AuslanderLab/kmermaid. Availability of data and materials: All data used in this work are publicly available. The paired-end metagenomic sequencing reads (fastq files) from ulcerative colitis patients in the LOTUS trial that were used for benchmarking are available through SRA: PRJEB50699.

Funding Statement

This work was supported by the National Institutes of Health, R01 LM014503 (to N.A.) and the V foundation V2024-006 (to N.A.). N.A. received salary from R01 LM014503 and V2024-006. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Chiu CY, Miller SA. Clinical metagenomics. Nat Rev Genet. 2019;20(6):341–55. doi: 10.1038/s41576-019-0113-7
  • 2. Ko KKK, Chng KR, Nagarajan N. Metagenomics-enabled microbial surveillance. Nat Microbiol. 2022;7(4):486–96. doi: 10.1038/s41564-022-01089-w
  • 3. Gao Y, Li D, Liu Y-X. Microbiome research outlook: past, present, and future. Protein Cell. 2023;14(10):709–12. doi: 10.1093/procel/pwad031
  • 4. Nayfach S, Pollard KS. Toward Accurate and Quantitative Comparative Metagenomics. Cell. 2016;166(5):1103–16. doi: 10.1016/j.cell.2016.08.007
  • 5. Prakash T, Taylor TD. Functional assignment of metagenomic data: challenges and applications. Brief Bioinform. 2012;13(6):711–27. doi: 10.1093/bib/bbs033
  • 6. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale. Mol Biol Evol. 2021;38(12):5825–9. doi: 10.1093/molbev/msab293
  • 7. Törönen P, Medlar A, Holm L. PANNZER2: a rapid functional annotation web server. Nucleic Acids Res. 2018;46(W1):W84–8. doi: 10.1093/nar/gky350
  • 8. Kanehisa M, Sato Y, Morishima K. BlastKOALA and GhostKOALA: KEGG Tools for Functional Characterization of Genome and Metagenome Sequences. J Mol Biol. 2016;428(4):726–31. doi: 10.1016/j.jmb.2015.11.006
  • 9. Beghini F, McIver LJ, Blanco-Míguez A, Dubois L, Asnicar F, Maharjan S, et al. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. Elife. 2021;10:e65088. doi: 10.7554/eLife.65088
  • 10. Hera MR, Liu S, Wei W, Rodriguez JS, Ma C, Koslicki D. Metagenomic functional profiling: to sketch or not to sketch? Bioinformatics. 2024;40(Suppl 2):ii165–73. doi: 10.1093/bioinformatics/btae397
  • 11. Hernández-Plaza A, Szklarczyk D, Botas J, et al. eggNOG 6.0: enabling comparative genomics across 12 535 organisms. Nucleic Acids Res. 2022;51(D1):D389–D394. doi: 10.1093/nar/gkac1022
  • 12. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215(3):403–10. doi: 10.1016/s0022-2836(05)80360-2
  • 13. Gweon HS, Shaw LP, Swann J, De Maio N, AbuOun M, Niehus R, et al. The impact of sequencing depth on the inferred taxonomic composition and AMR gene content of metagenomic samples. Environ Microbiome. 2019;14(1):7. doi: 10.1186/s40793-019-0347-1
  • 14. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2014;12(1):59–60. doi: 10.1038/nmeth.3176
  • 15. Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021;18(4):366–8. doi: 10.1038/s41592-021-01101-x
  • 16. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35(11):1026–8. doi: 10.1038/nbt.3988
  • 17. Golob JL, Minot SS. In silico benchmarking of metagenomic tools for coding sequence detection reveals the limits of sensitivity and precision. BMC Bioinformatics. 2020;21(1):459. doi: 10.1186/s12859-020-03802-0
  • 18. Schaeffer L, Pimentel H, Bray N, Melsted P, Pachter L. Pseudoalignment for metagenomic read assignment. Bioinformatics. 2017;33(14):2082–8. doi: 10.1093/bioinformatics/btx106
  • 19. Raghupathy N, Choi K, Vincent MJ, Beane GL, Sheppard KS, Munger SC, et al. Hierarchical analysis of RNA-seq reads improves the accuracy of allele-specific expression. Bioinformatics. 2018;34(13):2177–84. doi: 10.1093/bioinformatics/bty078
  • 20. Moeckel C, Mareboina M, Konnaris MA, Chan CSY, Mouratidis I, Montgomery A, et al. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J. 2024;23:2289–303. doi: 10.1016/j.csbj.2024.05.025
  • 21. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15(3). doi: 10.1186/gb-2014-15-3-r46
  • 22. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. 2019;20(1):257. doi: 10.1186/s13059-019-1891-0
  • 23. Breitwieser FP, Baker DN, Salzberg SL. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts. Genome Biol. 2018;19(1):198. doi: 10.1186/s13059-018-1568-0
  • 24. Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics. 2015;16(1):236. doi: 10.1186/s12864-015-1419-2
  • 25. Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 2016;26(12):1721–9. doi: 10.1101/gr.210641.116
  • 26. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun. 2016;7:11257. doi: 10.1038/ncomms11257
  • 27. Steinegger M, Meier M, Mirdita M, Vöhringer H, Haunsberger SJ, Söding J. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics. 2019;20(1):473. doi: 10.1186/s12859-019-3019-7
  • 28. Klimke W, Agarwala R, Badretdin A, Chetvernin S, Ciufo S, Fedorov B, et al. The National Center for Biotechnology Information’s Protein Clusters Database. Nucleic Acids Res. 2009;37(Database issue):D216-23. doi: 10.1093/nar/gkn734
  • 29. Haifer C, Paramsothy S, Kaakoush NO, Saikal A, Ghaly S, Yang T, et al. Lyophilised oral faecal microbiota transplantation for ulcerative colitis (LOTUS): a randomised, double-blind, placebo-controlled trial. The Lancet Gastroenterology & Hepatology. 2022;7(2):141–51. doi: 10.1016/s2468-1253(21)00400-3
  • 30. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402. doi: 10.1093/nar/25.17.3389
  • 31. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, et al. Metagenomic biomarker discovery and explanation. Genome Biol. 2011;12(6). doi: 10.1186/gb-2011-12-6-r60
  • 32. Pasolli E, Truong DT, Malik F, Waldron L, Segata N. Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights. PLoS Comput Biol. 2016;12(7):e1004977. doi: 10.1371/journal.pcbi.1004977
  • 33. LaPierre N, Ju CJ-T, Zhou G, Wang W. MetaPheno: A critical evaluation of deep learning and machine learning in metagenome-based disease prediction. Methods. 2019;166:74–82. doi: 10.1016/j.ymeth.2019.03.003
  • 34. Břinda K, Sykulski M, Kucherov G. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics. 2015;31(22):3584–92. doi: 10.1093/bioinformatics/btv419
  • 35. Choi I, Ponsero AJ, Bomhoff M, Youens-Clark K, Hartman JH, Hurwitz BL. Libra: scalable k-mer-based tool for massive all-vs-all metagenome comparisons. Gigascience. 2019;8(2):giy165. doi: 10.1093/gigascience/giy165
  • 36. Shen W, Xiang H, Huang T, Tang H, Peng M, Cai D, et al. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics. 2022;39(1). doi: 10.1093/bioinformatics/btac845
  • 37. Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11(11):1144–6. doi: 10.1038/nmeth.3103
  • 38. Medzhitov R. Recognition of microorganisms and activation of the immune response. Nature. 2007;449(7164):819–26. doi: 10.1038/nature06246
  • 39. Cusick MF, Libbey JE, Fujinami RS. Molecular mimicry as a mechanism of autoimmune disease. Clin Rev Allergy Immunol. 2012;42(1):102–11. doi: 10.1007/s12016-011-8294-7
  • 40. Elbasir A, Ye Y, Schäffer DE, Hao X, Wickramasinghe J, Tsingas K, et al. A deep learning approach reveals unexplored landscape of viral expression in cancer. Nat Commun. 2023;14(1):785. doi: 10.1038/s41467-023-36336-z
  • 41. Shang J, Jiang J, Sun Y. Bacteriophage classification for assembled contigs using graph convolutional network. Bioinformatics. 2021;37(Suppl_1):i25–33. doi: 10.1093/bioinformatics/btab293
  • 42. Lugli GA, Milani C, Mancabelli L, van Sinderen D, Ventura M. MEGAnnotator: a user-friendly pipeline for microbial genomes assembly and annotation. FEMS Microbiol Lett. 2016;363(7):fnw049. doi: 10.1093/femsle/fnw049
  • 43. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9. doi: 10.1093/bioinformatics/btl158
  • 44. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2. doi: 10.1093/bioinformatics/bts565
  • 45. Tatusova T, Ciufo S, Fedorov B, O’Neill K, Tolstoy I. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 2014;42(Database issue):D553–9. doi: 10.1093/nar/gkt1274
  • 46. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–42. doi: 10.1093/nar/28.1.235
  • 47. Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45(D1):D170–6. doi: 10.1093/nar/gkw1081
  • 48. Watford S, Warrington SJ. Bacterial DNA Mutations. In: StatPearls. StatPearls Publishing; 2023. [Accessed August 10, 2023]. http://www.ncbi.nlm.nih.gov/books/NBK459274/
  • 49. Ma X, Shao Y, Tian L, Flasch DA, Mulder HL, Edmonson MN, et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biol. 2019;20(1):50. doi: 10.1186/s13059-019-1659-6
  • 50. Peters BA, Wilson M, Moran U, Pavlick A, Izsak A, Wechter T, et al. Relating the gut metagenome and metatranscriptome to immunotherapy responses in melanoma patients. Genome Med. 2019;11(1):61. doi: 10.1186/s13073-019-0672-4
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013470.r001

Decision Letter 0

Sarath Chandra Janga

15 Apr 2025

PCOMPBIOL-D-25-00398

kMermaid: Ultrafast functional classification of microbial reads

PLOS Computational Biology

Dear Dr. Auslander,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days, by Jun 15 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Sarath Chandra Janga, Ph.D

Academic Editor

PLOS Computational Biology

James Faeder

Section Editor

PLOS Computational Biology

Additional Editor Comments:

Although reviewers agreed on the simplicity and utility of the proposed metagenomic tool kMermaid, several critical concerns were raised by all three reviewers regarding the methodology, novelty, benchmarking, and clarity. In particular, I am willing to consider a significantly revised version of the manuscript which addresses the major concerns raised by the reviewers including those highlighted below.

• Improve the benchmarking of the tool by comparing speed/accuracy against MMseqs2, DIAMOND and other existing tools using standardized datasets.

• Clearly make the case for the novelty of the tool by differentiating kMermaid’s approach from existing tools (e.g., cluster definitions, ambiguity metrics).

• Provide details of the method by formalizing the algorithm, revise figures for clarity, and define terms such as "truncated reads" to avoid ambiguity.

• Justify the use of protein clusters as functional proxies or adopt orthology databases. For instance, established orthology databases (KEGG, EggNOG, Pfam) are preferred for functional annotation since clustering at 65–70% identity is a poor proxy for function.

Journal Requirements:

1) We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex. If you are providing a .tex file, please upload it under the item type ‘LaTeX Source File’ and leave your .pdf version as the item type ‘Manuscript’.

2) Please provide an Author Summary. This should appear in your manuscript between the Abstract (if applicable) and the Introduction, and should be 150-200 words long. The aim should be to make your findings accessible to a wide audience that includes both scientists and non-scientists. Sample summaries can be found on our website under Submission Guidelines:

https://journals.plos.org/ploscompbiol/s/submission-guidelines#loc-parts-of-a-submission

3) Please upload all main figures as separate Figure files in .tif or .eps format. For more information about how to convert and format your figure files please see our guidelines: 

https://journals.plos.org/ploscompbiol/s/figures

4) We have noticed that you have uploaded Supporting Information files, but you have not included a list of legends. Please add a full list of legends for your Supporting Information files after the references list.

5) Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.

1) State the initials, alongside each funding source, of each author to receive each grant. For example: "This work was supported by the National Institutes of Health (####### to AM; ###### to CJ) and the National Science Foundation (###### to AM)."

2) State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

6) Please amend the description of the manuscript in the online submission form to "Manuscript" rather than "Cover Letter."

Reviewers' comments:

Reviewer's Responses to Questions

Reviewer #1: This work describes a new metagenomic functional profiler tool. The key idea seems to be to first group proteins into clusters. However, this is a common approach used by all functional profilers in that they work with existing clusters of orthologous genes (e.g. KEGG OC, EggNOG, COG). Once the clusters are formed, kMermaid seems to use a very standard approach of matching k-mers. The approach seems pretty straightforward and perhaps therein lies its utility.

Specific Comments

1. Why does kMermaid have to rely on its own clusters of proteins? Why cannot it use existing databases?

2. The results in figure 2 do not seem very interesting or clear to me. For example, for part a, why would you not use BLASTX results at the cluster level as well? Also part d is hard to interpret. I am not sure I understand what is being conveyed here or that it is important.

3. Similarly figure 3 also does not seem to be conveying anything interesting in terms of a result.

4. Why does the comparison with DIAMOND also not include an evaluation of sensitivity and specificity? Also, why are other recently published tools (e.g. fmh-funprofiler: https://academic.oup.com/bioinformatics/article/40/Supplement_2/ii165/7749078) not included in this comparison?

Reviewer #2: This manuscript presents kMermaid, a tool to assign functions to short metagenomic reads. The ideas are interesting and the software is well implemented, but I feel that the authors are overselling the advantages of their approach compared to existing alternatives, mostly by artificially constraining the comparison to only one family of approaches (BLASTX/DIAMOND). The benchmarking is rather simplistic as well.

MAJOR

1. When describing the advantages of kMermaid, the authors claim that it leads to fewer ambiguous mappings, but partly this is achieved by using a different definition of ambiguity so that it applies more strictly when BLASTX is being evaluated than when kMermaid is being evaluated (Fig 2a). For example, if a read is multi-mapped to proteins A1, A2, and A3, but they all share the same cluster, then the read is ambiguous at the protein level, but not at the cluster level. This is not novel in the field, but it should be differentiated what is due to a different tool vs. a different definition of ambiguity.

BLASTX is designed to be very sensitive as well (compared to other alternatives, such as DIAMOND or mapping to nucleotide databases, such as gene catalogs), thus it will map reads to more proteins.

2. I disagree that "BLASTX remains the gold standard" for querying metagenomic reads against a large database. Despite having published many papers using metagenomics data, I have never used BLASTX for this purpose. If working in a reference-based manner, I would use nucleotide-alignment (such as bwa or strobealign) to a gene catalog (either the venerable IGC if working with the human gut [https://doi.org/10.1038/nbt.2942] or a more recent one [https://doi.org/10.1038/s41586-021-04233-4]) or a catalog of genomes (including MAGs, [https://doi.org/10.1038/s41587-020-0603-3]). Alternatively, the authors could use the set of 1,793,361 sequences they considered (after dereplicating at 97% or 95% nucleotide identity). I would use eggnogmapper [https://doi.org/10.1093/molbev/msab293] to assign functions to the genes. At the very least, the authors need to acknowledge that there are other approaches that are widely used in the field and, ideally, compare their tool to these alternatives.

3. The authors use the term "functional", but they are simply assigning to a protein cluster at 65%/70% identity. Why is this a good proxy for function? Normally, when discussing function, an orthologous group is used (KEGG being the most popular, but also eggNOG, Pfam, ... or function-specific databases such as CARD for antibiotic resistance, CAZy for carbohydrate enzymes, ...). The authors should clarify this point.

4. A typical test that is missing is to remove certain taxonomic groups from the database while using them in simulation (to mimic the case where the real world contains species/strains absent in the database). While still limited, this would be closer to a real-world scenario. I generally refrain from asking for specific experiments in reviews, but this paper is really lacking in having a more realistic benchmark.
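The distinction drawn in major point 1, between ambiguity at the protein level and at the cluster level, can be made concrete with a minimal sketch. The protein IDs (A1, A2, A3, B1), the cluster IDs, and the function name are hypothetical and only mirror the reviewer's example; this is not code from kMermaid or BLASTX.

```python
def ambiguity_levels(hits, protein_to_cluster):
    """Classify one read's hits as ambiguous at the protein and cluster level.

    hits: protein IDs the read mapped to (hypothetical IDs).
    protein_to_cluster: dict mapping protein ID -> cluster ID (hypothetical).
    """
    proteins = set(hits)
    clusters = {protein_to_cluster[p] for p in proteins}
    return {
        "protein_ambiguous": len(proteins) > 1,
        "cluster_ambiguous": len(clusters) > 1,
    }


# Hypothetical clustering: A1, A2, A3 are homologs in cluster "A"; B1 is in "B".
protein_to_cluster = {"A1": "A", "A2": "A", "A3": "A", "B1": "B"}

# The reviewer's case: multi-mapped, but all hits fall in one cluster.
same_cluster = ambiguity_levels(["A1", "A2", "A3"], protein_to_cluster)
# A read whose hits span two clusters stays ambiguous at both levels.
cross_cluster = ambiguity_levels(["A1", "B1"], protein_to_cluster)
print(same_cluster)
print(cross_cluster)
```

Applying the same cluster-level definition to every tool's output separates the effect of the tool from the effect of the ambiguity definition.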

MINOR

"human samples" -> "human gut samples"

Reviewer #3: The manuscript introduces kMermaid, which is capable of performing functional classification of metagenomes. It focuses on speed, low memory consumption and disambiguation when assigning metagenomic reads against a pre-calculated database of protein clusters.

It avoids the effort of taxonomic disambiguation by performing clustering of reads translated to their amino acid sequences and counting k-mer matches against precomputed protein clusters. The algorithm is rather simple: a single read (not assembled) is translated into all possible reading frames, and the contained k-mers are compared to a set of precomputed protein clusters in what looks like a rather brute-force approach (no indexing/hashing/sorting is mentioned). The cluster with most exact matches is chosen for functional annotation transfer.
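The procedure as summarized in the paragraph above can be sketched as follows. This is an illustration of the reviewer's description only, not kMermaid's implementation: the function names, the toy read, and the toy cluster contents are assumptions, and the standard genetic code is used for translation. Cluster k-mers are held in Python sets for brevity, and ties between clusters are not handled.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(read):
    """Translate a DNA read in three forward and three reverse-complement frames."""
    read = read.upper()
    strands = (read, read.translate(COMPLEMENT)[::-1])
    frames = []
    for strand in strands:
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons containing non-ACGT characters are translated as 'X'.
            frames.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return frames


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def best_cluster(read, cluster_kmers, k=5):
    """Return (cluster, match count) for the cluster with most exact k-mer matches."""
    matches = Counter()
    for frame in six_frame_translations(read):
        for kmer in kmers(frame, k):
            for cluster, members in cluster_kmers.items():
                if kmer in members:
                    matches[cluster] += 1
    return matches.most_common(1)[0] if matches else None


# Toy data (hypothetical): a read encoding MKTAYIAK and two tiny clusters.
read = "ATGAAAACCGCTTATATTGCGAAA"
cluster_kmers = {
    "cluster_1": set(kmers("MKTAYIAK", 5)),
    "cluster_2": {"GGGGG"},
}
print(best_cluster(read, cluster_kmers))
```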

The paper is well written and the code it is accompanied with is well structured and organized. However there are a number of major concerns I have with the manuscript.

MMseqs2 [1], DIAMOND, Super-FOCUS and FMAP (Fast Metagenomic Functional Profiling) also address the same research question and are established methods (several years old). The authors should carve out their novelty with respect to them and/or ideally quantify performance measures such as speed and accuracy (the current version only contains runtime comparison to DIAMOND). The introduction does not comprehensively summarize the state of the art (except for DIAMOND). The benchmark should be designed such that a more direct and comprehensive comparison is possible. Ideally, the same benchmarks from e.g. MMseqs2 could be used. Thus the benchmarking with the state of the art is not rigorous/comprehensive enough.

The main algorithm isn’t presented formally and Figure 1 lacks detail and rigor. Subfigures with labels a-e (like in some of the other figures) would be desirable, also for a more structured caption of Figure 1. Alternatively, pseudo-code could be accompanied. It is not entirely clear what is meant with ‘truncated reads’ (together with the X-shaped symbols) – it could be truncation due to instrumental limitations from sequencing technology, low quality bases, incomplete synthesis (premature termination) during sequencing or the presence of a stop codon – which one is it?

Figure 2 claims that BLASTX has a large amount of ambiguous functional maps. 1. It is not clear from the manuscript whether the BLASTX ambiguity is derived from raw scores or E-values (or something else altogether), as it seems that raw scores would also lead to less ambiguous maps. 2. How about the other state of the art methods? Do they also exhibit such high levels of ambiguity?

Fig 2a) how are the 3 metagenomic samples chosen?

2c) It seems odd that the x-axis exceeds 100% (probably an artefact from KDE)

2d) the x-axis has k-mers “sorted”. It would be best to make clear that the sorting is based on the number of clusters a k-mer appears in. When the text refers to Figure 2d) it mentions a single cluster being common (22%); this is not obvious from the Figure, but it could be marked. Likewise the 19% of k-mers falling into >10 clusters.

How did the authors arrive at 2.5M k-mers? There are approx. 20^5 = 3.2M possibilities for 5-mers. In general, the choice of k is not fully motivated.
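The arithmetic behind this question is easy to check. The sketch below assumes an alphabet of the 20 standard amino acids and treats the 2.5M figure quoted in the comment as approximate; the function and variable names are illustrative.

```python
AMINO_ACID_ALPHABET_SIZE = 20  # the 20 standard amino acids (assumption)


def possible_kmers(k, alphabet_size=AMINO_ACID_ALPHABET_SIZE):
    """Number of distinct k-mers over an alphabet of the given size."""
    return alphabet_size ** k


total_5mers = possible_kmers(5)
observed_5mers = 2_500_000  # approximate figure quoted in the comment
fraction_observed = observed_5mers / total_5mers
print(total_5mers, fraction_observed)
```

Under these assumptions, roughly three quarters of all possible 5-mers would be present in the model, which is the gap the reviewer asks the authors to explain.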

The method section does not motivate some of the design choices made:

• Why 65% (1. Step) and 70% (2. Step) were used as thresholds for clustering of the precomputed model

• Why are there 2 steps, with first some 43K proteins (how are they exactly selected? From NCBI’s Prokaryotic Reference Genomes?) and then with 1.7M?

Since the 2-step clustering is non-standard and differs from plain CD-HIT, it would be good to get a sense of clustering quality as measured by a clustering coefficient like the Silhouette or Dunn index or homogeneity/separation ratio.

Was the hyperparameter (k) evaluation done according to best practices? I.e., was it conducted with proper dataset splits (inner validation sets for model/hyperparameter selection and “outer” final test set(s), which were entirely unused(!) during hyperparameter selection)? How sensitive are the results to a change of k?

Limitations should be clearer. Avoiding assembly is obviously beneficial from a computational point of view, but lacks the uniqueness/predictive power stemming from assembled contigs. It would be desirable if that could be quantified. The selection of 43K reference proteins most likely contains multi-domain proteins. There are occasions when read assembly is important:

1. Multidomain Proteins & Functional Context:

o Many functional proteins, especially in signaling (e.g., two-component systems) and metabolism, derive their activity from domain interactions.

o If sequence reads only cover individual domains, assigning function may be incomplete or misleading (e.g., distinguishing a full enzymatic complex from fragments).

o Assembling longer contigs helps reconstruct the true architecture of multidomain proteins.

2. Pathway Reconstruction & Gene Clustering:

o Some pathways rely on synteny (gene order and co-localization), which is lost in fragmented reads.

o Functional units like operons in prokaryotes may be misclassified if only single reads are used.

3. Taxonomic and Functional Coupling:

o If a gene is part of a mobile genetic element (e.g., plasmid, transposon), assembly can clarify if it belongs to a specific species or is horizontally transferred.

Another limitation is that beyond the precomputed protein clusters, novel proteins in metagenomes will be ignored. While this is common to many methods, it should probably be mentioned.

It would be very useful if the presented work could assess the amount of misclassification coming from unassembled reads (vs assembled reads).

Assessing the quality of the initial clustering is also important:

Regarding memory consumption, DIAMOND's strategy seems a realistic approach (scaling with input, capping at 16GB). There should be a stronger motivation for a limitation to 2GB, an amount that is exceeded by nearly any reasonable machine in use these days. It is not clear how the 2GB memory requirement is achieved – streaming/generators?

A central point is that existing methods (particularly BLASTX, not so sure about the other abovementioned methods) have a high percentage of ambiguous mappings (Fig. 2). Have the authors

Regarding speed comparisons, indexing is an extremely common method in databases to perform fast lookup. It is also used in [1]. kMermaid does not use indexing or other fast information retrieval methods such as hashing/LSH, sorting, etc. It would be helpful to provide insights as to why such common/best practices were not deployed/necessary.

There are a few technical issues with the installation. Both on Ubuntu 20.04 (as recommended) and Mac OS 15.3, large file installation did not seem to work as described in the installation instructions: even with git-lfs installed, the file kmermaid/db/kmer_model.pkl is only 134 bytes and thus not useful (not readable with pickle). The alternative method using the wget command (as per the GitHub install instructions) does not give a larger file.
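A file of this size is consistent with a Git LFS pointer having been checked out instead of the real object, although the comment does not confirm this. Pointer files are small text files whose first line is `version https://git-lfs.github.com/spec/v1`. A minimal check, assuming that diagnosis, could look like the following; the helper name and the synthetic stand-in files are illustrative only.

```python
import tempfile
from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_git_lfs_pointer(path):
    """Return True if the file starts like a Git LFS pointer, not real content."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


# Demonstration on synthetic stand-in files (not the real model file).
with tempfile.TemporaryDirectory() as tmp:
    pointer_file = Path(tmp) / "pointer_example.pkl"
    pointer_file.write_bytes(
        LFS_POINTER_PREFIX + b"\noid sha256:" + b"0" * 64 + b"\nsize 123\n"
    )
    binary_file = Path(tmp) / "binary_example.pkl"
    binary_file.write_bytes(b"\x80\x04\x95" + b"\x00" * 32)

    pointer_detected = is_git_lfs_pointer(pointer_file)
    binary_detected = is_git_lfs_pointer(binary_file)

print(pointer_detected, binary_detected)
```

Running the same check on kmermaid/db/kmer_model.pkl after cloning would show whether the large file was actually fetched.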

[1] Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017). https://doi.org/10.1038/nbt.3988

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes:  Andreas Henschel

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

Figure resubmission:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. If there are other versions of figure files still present in your submission file inventory at resubmission, please replace them with the PACE-processed versions.

Reproducibility:

To enhance the reproducibility of your results, we recommend that authors of applicable studies deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013470.r003

Decision Letter 1

Sarath Chandra Janga

27 Jul 2025

PCOMPBIOL-D-25-00398R1

kMermaid: Ultrafast metagenomic read assignment to protein clusters by hashing of amino-acid k-mer frequencies

PLOS Computational Biology

Dear Dr. Auslander,

Thank you for submitting your manuscript to PLOS Computational Biology. After careful consideration, we feel that it has merit but does not fully meet PLOS Computational Biology's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 30 days, by Sep 26 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at ploscompbiol@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pcompbiol/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. This file does not need to include responses to formatting updates and technical items listed in the 'Journal Requirements' section below.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, competing interests statement, or data availability statement, please make these updates within the submission form at the time of resubmission. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Sarath Chandra Janga, Ph.D

Academic Editor

PLOS Computational Biology

James Faeder

Section Editor

PLOS Computational Biology

Additional Editor Comments:

Two of the reviewers raised minor comments concerning typos, small issues, and the clarity of the benchmarking details. The authors should submit a revised version of the manuscript addressing these comments.

Journal Requirements:

1) Please amend your detailed Financial Disclosure statement. This is published with the article. It must therefore be completed in full sentences and contain the exact wording you wish to be published.

1) State what role the funders took in the study. If the funders had no role in your study, please state: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

2) Please ensure that the Figure files are uploaded in the correct numerical order in the online submission form.

If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise.

Reviewers' comments:

Reviewer's Responses to Questions

Reviewer #1: I appreciate the authors’ considerable effort in improving the benchmark and the figures, and I think the authors have resolved most of my concerns.

Regarding my previous comments:

1. I agree with the authors that there is value in using the new database where proteins are clustered using AA words.

2. Thank you for your detailed clarification for the content in figure 2. There are some small typos: In Figure 2.a, “≤5” should be “≥5”, and in Figure 2.c, consider limiting the x-axis to be between 0 and 100%.

3. The benchmark in Figure 3 is clearer and shows that kMermaid is among the best performing tools in terms of sensitivity and coverage in most cases. The authors should specify how “% match” is calculated. Is it the same for all tools? If “% match” is the sensitivity at the cluster level for kMermaid, but at the protein level for the other tools, this might not be a fair comparison.

Some additional typos and minor issues that I spotted:

1. In the “Creating kMermaid’s k-mer frequency cluster model using nested hashing” section, the authors used “for each k-mer k, a map of clusters containing k”, which is confusing as k can both indicate a k-mer and the length of a k-mer. Consider changing it to “for each k-mer w”.

2. In the “Assigning protein maps to reads using the pre-computed k-mer model” section, the authors claimed that the score would be the sum of k-mer frequencies for the k-mers in the query sequence, but in the formula, the frequency is summed over $k_i\in C$ (all the k-mers in the cluster C), which is inconsistent.

3. Do the “composite score” and the “confidence score” refer to the same thing?

4. I’m still a bit confused by the description of hyperparameter search: “k=5 was selected as it optimized the performance on truncated reference AA sequences”. It might be better to specify which metric they are optimizing: is it based on cluster purity, or AUROC of the downstream classification task, or something else?

5. “The cluster frequency” is better written as “The frequency of k-mers in the cluster”.

6. “The k-mer frequencies for each cluster C were defined by the count of the k-mer in the cluster C divided by the total number of proteins in the cluster.” Can this k-mer frequency be >1 if this k-mer appears multiple times in the same protein?

7. In Figure 3.c, consider using log scale for the y-axis.
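Comments 2 and 6 above can be illustrated with a toy model. The sketch below is one reading of the quoted text, not the authors' code: it assumes a nested map from each k-mer w to the clusters containing w, the frequency definition quoted in comment 6, and a score summed over the k-mers of the query. All names and sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_model(clusters, k=5):
    """Build a nested map: k-mer w -> {cluster ID -> frequency of w in the cluster}.

    Frequency follows the definition quoted in comment 6: occurrences of w
    across the cluster's proteins divided by the number of proteins.
    """
    model = defaultdict(dict)
    for cluster_id, proteins in clusters.items():
        counts = defaultdict(int)
        for protein in proteins:
            for w in kmers(protein, k):
                counts[w] += 1
        for w, count in counts.items():
            model[w][cluster_id] = count / len(proteins)
    return dict(model)


def score_query(query, model, k=5):
    """Sum frequencies over the k-mers of the query, per cluster."""
    scores = defaultdict(float)
    for w in kmers(query, k):
        for cluster_id, freq in model.get(w, {}).items():
            scores[cluster_id] += freq
    return dict(scores)


# Toy clusters (hypothetical sequences).
clusters = {
    "C1": ["AAAAAA", "AAAAAC"],
    "C2": ["MKTAYIAK"],
}
model = build_model(clusters)
scores = score_query("KTAYIA", model)
print(model["AAAAA"], scores)
```

In this toy case the 5-mer AAAAA occurs twice in the first C1 protein and once in the second, so under the quoted definition its frequency is 3/2, which answers comment 6 in the affirmative unless counts are taken per protein. The score loops over query k-mers only, matching the prose description rather than a sum over every k-mer in the cluster.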

Reviewer #2: This version addresses my previous concerns.

Reviewer #3: Most of the previous issues have been fully addressed. One remaining issue is the argument for why the Silhouette and Dunn indices have not been calculated (the authors claim that it is impossible in the absence of centroids). This is not correct: neither the Silhouette nor the Dunn index requires the presence or calculation of centroids.
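The point about centroids can be shown directly: the silhouette value of a point depends only on its mean distance to members of its own cluster and its smallest mean distance to another cluster. The sketch below computes it from a pairwise distance matrix alone. The distance values are made up for illustration (for proteins, something like 1 minus sequence identity would be one possible choice), and the function name is an assumption.

```python
def silhouette_scores(dist, labels):
    """Per-point silhouette values from a pairwise distance matrix.

    dist: square list of lists of pairwise distances.
    labels: cluster label for each point.
    No centroid is computed anywhere.
    """
    n = len(labels)
    scores = []
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist[i][j])
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            # Convention: singleton clusters (or a single cluster) score 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Made-up distances for four points in two clusters.
dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
labels = ["A", "A", "B", "B"]
scores = silhouette_scores(dist, labels)
mean_silhouette = sum(scores) / len(scores)
print(scores, mean_silhouette)
```

The Dunn index in its original formulation is likewise defined from pairwise distances (smallest between-cluster distance over largest within-cluster diameter), so the same distance matrix would suffice for it.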

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes:  Andreas Henschel

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]


PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013470.r005

Decision Letter 2

Sarath Chandra Janga

26 Aug 2025

Dear Dr Auslander,

We are pleased to inform you that your manuscript 'kMermaid: Ultrafast metagenomic read assignment to protein clusters by hashing of amino-acid k-mer frequencies' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Sarath Chandra Janga, Ph.D

Academic Editor

PLOS Computational Biology

James Faeder

Section Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1013470.r006

Acceptance letter

Sarath Chandra Janga

PCOMPBIOL-D-25-00398R2

kMermaid: Ultrafast metagenomic read assignment to protein clusters by hashing of amino-acid k-mer frequencies

Dear Dr Auslander,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Zsofia Freund

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. kMermaid’s internal k-mer frequency model with all steps outlined.

    (TIF)

    pcbi.1013470.s001.tif (940.5KB, tif)
    S2 Fig. Performance on simulated reads with known labels, with 1–4 introduced mutations per read.

    (a) Percent correct labels by each method for 3 typical read lengths evaluated. (b) Percent of input reads mapped by each method for 3 typical read lengths evaluated.

    (TIF)

    pcbi.1013470.s002.tif (748.2KB, tif)
    S3 Fig. Biological applications.

    (a) Spearman correlation between kMermaid’s assignment score and the maximum BLASTX percent identity per read for all reverse-translated nucleotide segments (lengths = 125, 150, 200 base pairs) of each unseen RefSeq protein that was mapped by both methods without thresholding. To prevent overplotting, reads were downsampled to 100,000 (40% of total). (b) Histograms showing the kMermaid score of all reverse-translated nucleotide segments (lengths = 125, 150, 200 base pairs) of each unseen RefSeq protein that was mapped by both kMermaid and BLASTX at > 66.6 percent identity. The dashed lines denote the read length-specific thresholds determined by maintaining a false positive rate < 0.05. (c) The percent of all input reads able to be classified by kMermaid compared to BLASTX for sequencing from 3 representative colitis samples, chosen randomly. kMermaid’s optimal scoring threshold was determined by maximizing the percentage of the assignments that agree with BLASTX hits (Sensitivity, dark blue) while retaining a high ratio of assignments that agree with BLASTX to assignments that disagree with BLASTX (Accuracy, light blue). (d) Correlation between the proportion of reads concordant with BLASTX and the mean assignment scores (log-transformed) for all proteins in the cluster. Distributions of these metrics are broken down by the number of reads mapped to the cluster: clusters in the bottom tertile have the fewest mapped reads and clusters in the top tertile have the most.

    (TIF)

    pcbi.1013470.s003.tif (1.2MB, tif)
    S1 Table. Protein names and descriptions of cluster representatives of the protein clusters underlying the kMermaid model.

    (CSV)

    S2 Table. Performance evaluation for sequences from unseen RefSeq microbial protein data.

    (XLSX)

    pcbi.1013470.s005.xlsx (9.5KB, xlsx)
    S3 Table. The median kMermaid model performance when classifying protein clusters containing different, specific keywords.

    (CSV)

    pcbi.1013470.s006.csv (87.3KB, csv)
    S1 File. Selected reads that BLASTX failed to classify but kMermaid classified correctly share remote sequence homology with proteins in their associated clusters.

    (TXT)

    pcbi.1013470.s007.txt (6.2KB, txt)
    S1 Methods. Pseudo-code for model training and read assignment.

    (PDF)

    pcbi.1013470.s008.pdf (123.1KB, pdf)
    Attachment

    Submitted filename: Reviewers-response.docx

    pcbi.1013470.s009.docx (73KB, docx)
    Attachment

    Submitted filename: response_to_reviewers_plos_comp_bio.docx

    pcbi.1013470.s010.docx (186.4KB, docx)

    Data Availability Statement

    Availability and usage: kMermaid is freely available as an open-source command line program written in Python that requires a user-provided input file containing query sequences in either fastq or fasta format. The precomputed k-mer frequency model and the files containing the protein cluster members and cluster representatives are accessible and used as internal default parameters in kMermaid. Complete download and installation instructions are found at github.com/AuslanderLab/kmermaid.

    Availability of data and materials: All data used in this work are publicly available. The paired-end metagenomic sequencing reads (fastq files) from ulcerative colitis patients in the LOTUS trial, used for benchmarking, are available through SRA: PRJEB50699.


    Articles from PLOS Computational Biology are provided here courtesy of PLOS
