ABSTRACT
A key application of phylogenetics in ecological studies is identifying unknown sequences with respect to known ones. This goal can be formalised as assigning taxonomic labels or inserting sequences into a reference phylogenetic tree (phylogenetic placement). Much attention has been paid to the phylogenetic placement of short fragments used in amplicon sequencing or metagenomics. However, placing longer pieces of DNA, such as assembled genomes, contigs, or long reads, is less studied. Placing long sequences should be easier than placing short reads due to their increased signal. However, handling larger inputs poses its own challenges, including finding homologues and the computational burden. Here, we explore a phylogenetic placement method that uses k‐mer frequencies to measure distances between long query sequences and reference genomes. Our proposed method, kf2vec, requires no alignment and can work on any region of the genome (needs no marker genes), thus simplifying analysis pipelines. A rich literature exists on using short k‐mer frequencies to measure distances that correlate with phylogeny. Existing methods, however, have had moderate practical success despite enjoying strong theory. Instead of using predefined metrics, we train a deep neural network to estimate a distance from k‐mer frequency vectors such that those distances match the path lengths on the reference phylogeny. The trained model is then used to characterise new samples. We demonstrate that kf2vec outperforms existing k‐mer‐based approaches in distance calculation and allows accurate phylogenetic placement and taxonomic identification of new samples from various types of long sequences.
Keywords: assembly‐free and alignment‐free distance calculation, deep learning, genomic distance, k‐mer‐based distance calculation, metagenomics, phylogenetic placement
1. Introduction
Phylogenetic reconstruction is useful not just for understanding the evolutionary history but also for identifying the content of biological samples to help understand the ecology of their associated environments. These samples can be single individuals of unknown identity (Bohmann et al. 2020), or mixtures, as often used in microbiome studies (Matsen 2015). Either way, the contents of the sample are unknown, and we need to characterise the sample with respect to a database of known biological sequences from known species. The characterisation can be taxonomic—assigning names to samples. An even more ambitious goal is a phylogenetic analysis that precisely identifies the relationship between an unknown organism and those in a reference library (Washburne et al. 2018); this is especially useful when the library misses a close relative of the sample (e.g., another individual of the same species or population). However, de novo phylogenetic tree reconstruction has often been impractical, owing to the size of the available reference data. A solution that has emerged is phylogenetic placement—adding new sequences onto an existing reference tree (Matsen et al. 2010; Berger et al. 2011), which is often sufficient in microbiome and ecology applications. Placement can scale to very large datasets (Balaban et al. 2022; Wedell et al. 2023; Barbera et al. 2019; Turakhia et al. 2021) and enables incremental updates to phylogenies (Balaban et al. 2023). Interestingly, evidence shows that placement in some cases can be even more accurate than the more laborious de novo tree reconstruction (Janssen et al. 2018). A plethora of phylogenetic placement methods exist (Matsen et al. 2010; Berger et al. 2011; Barbera et al. 2019; Jiang, Balaban, et al. 2022; Wedell et al. 2021; Linard et al. 2019), but they have mostly focused on placing relatively short aligned sequences (e.g., reads or assembled marker genes).
Phylogenetic placement, as often used, is still far from trivial owing to the preprocessing needed. Many types of data (e.g., assembled (meta)genomes, contigs, scaffolds, unassembled genome skims, reads from marker genes and amplicon data) are available for sample identification. Ecological studies can benefit from uniting various data types by placing them onto the same tree (McDonald et al. 2023). To place an amplicon, the preprocessing is relatively simple, requiring only aligning the amplicon to the multiple sequence alignment of references. Placing reads directly from a metagenomic sample faces additional challenges in preprocessing including mapping genes to markers (Shah et al. 2021; Stark et al. 2010). To use full‐length marker genes, we need several steps—map reads to markers, (optionally) assemble reads and align them (Asnicar et al. 2020). Placing full genomes, contigs, or scaffolds is even more complex: after obtaining and checking the assembly, we need to identify marker genes (using annotation or alignment), align them, place sequences from each marker on a gene tree, and finally combine the results across genes (e.g., Mai and Mirarab 2022). While methods exist for each step, the whole pipeline becomes cumbersome, especially when many samples are to be analysed.
One attractive alternative is assembly‐free and alignment‐free placement of samples, and several methods have shown that unassembled or unaligned sequences can also be placed with acceptable accuracy using concepts such as k‐mer‐based distances (Balaban et al. 2020) or spaced word matches (Lau et al. 2019; Blanke and Morgenstern 2020). While these methods have not matched maximum likelihood using aligned sequences in accuracy (Balaban et al. 2022; Wedell et al. 2023), especially for placing short reads (Hasan et al. 2022), they dramatically simplify the analysis pipeline. Perhaps more interestingly, they hold the promise, unfulfilled as of yet, that they could enable placing disparate data types (e.g., long reads, genomes, contigs) onto the same tree. The goal of this paper is to enable such analyses, specifically for long input sequences in a marker‐free fashion.
Genomes tend to become more distant as they evolve, enabling us to define measures of sequence distance that relate to evolutionary history. Distance‐based phylogenetic inference has a long history (Warnow 2017), and while not as accurate as maximum likelihood inference (Höhl and Ragan 2007; Bogusz and Whelan 2016), it continues to have applications. In particular, distance‐based placement has been more scalable than alternatives (Balaban et al. 2022; Wedell et al. 2023). Two factors underpin this continued popularity: Distance‐based methods are much faster than alternatives and are more versatile (able to use various data types such as unassembled or unaligned data). As a result, calculating sequence distances in a phylogenetically meaningful way has gained increased attention (e.g., Tang et al. 2019; Sarmashghi et al. 2019; Ondov et al. 2016; Shaw and Yu 2023; Sapci and Mirarab 2025).
Many alignment‐free methods of distance calculation have been developed (Zielezinski et al. 2017; Ren et al. 2018), and most rely on k‐mers and related concepts such as spaced words (Lau et al. 2019). The k‐mer methods rely either on the presence/absence of long k‐mers or on the frequencies of short k‐mers. The wealth of available methods (Zielezinski et al. 2019) differ in the equations they use for translating statistics computed from the k‐mer composition to genomic distances, and much attention has been paid to their theoretical underpinning and empirical accuracy (Vinga and Almeida 2003; Reinert et al. 2009). Methods that work on long k‐mers include Skmer (Sarmashghi et al. 2019), AFNN (Tang et al. 2019) and Mash (Ondov et al. 2016). For short k‐mer frequencies, a software package called CAFE has implemented 28 alignment‐free dissimilarity measures based on various k‐mer statistics (Lu, Tang, et al. 2017). Among these metrics, dissimilarity measures based on background‐adjusted k‐mer counts have been found to most closely approximate tree distances and can be used for phylogenetic analysis in downstream applications (Lu et al. 2021; Bussi and Reich 2021). Nevertheless, these metrics have not been widely adopted for phylogenetic analyses, whether de novo or placement. The main reason is that, as we will see, these metrics provide distances with limited correlation with any given tree.
In this paper, we approach distance calculation in a different way. Instead of using pre‐defined equations for translating k‐mer statistics to distances, we train a machine learning model on a reference dataset to learn how to calculate distances. Our proposed approach, called kf2vec (k‐mer frequency to vectors), represents each input (genome, scaffold, genome skim, or long read) as a vector of canonical k‐mer frequencies. It then uses a reference tree, assumed to be available, to learn how to translate k‐mer frequencies into embeddings with the square of distances roughly matching the reference tree. Given a new query, the model can be used to compute its distance to reference sequences; these distances can subsequently be used to place queries on a tree using distance‐based methods (Balaban et al. 2022) or to assign taxonomic labels. Our approach is not the first machine learning method designed to learn sequence embeddings from k‐mer frequencies, a topic we revisit in the Discussion. Our method also has conceptual similarities to Deep‐learning Enabled Phylogenetic Placement (DEPP) (Jiang, Balaban, et al. 2022), as we will see. However, unlike DEPP, kf2vec requires minimal data preprocessing once the model is trained. It is agnostic to the origin of the reference tree and can be applied to a variety of different sequencing data types. In a number of validation experiments, we demonstrate the accuracy and versatility of kf2vec distances used for phylogenetic placement.
2. Materials and Methods
2.1. kf2vec: Mapping k‐Mer Frequencies to Phylogenetic Distances
2.1.1. Learning Distances
Let V = 4^k/2 be the number of all possible canonical k‐mers for some small odd k (e.g., k = 7). Given any sufficiently long sequence (which can be assembled, an unassembled set of short reads, or even a single long read), we extract its k‐mer frequencies using standard methods such as JellyFish (Marçais and Kingsford 2011). We then normalise k‐mer counts to add up to 1. Thus, each input is represented as a vector x on the (V − 1)‐dimensional simplex (i.e., its entries are non‐negative and sum to 1).
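As an illustration, the canonical k‐mer frequency vector described above can be computed in a few lines of pure Python. The paper uses JellyFish for this step; this sketch is for exposition only and is far slower than a real k‐mer counter.

```python
from collections import Counter
from itertools import product

def revcomp(s):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(s))

def canonical_kmer_freqs(seq, k=7):
    """Normalised canonical k-mer frequencies of a sequence.

    A canonical k-mer is the lexicographic minimum of a k-mer and its
    reverse complement; for odd k there are 4**k / 2 of them, so the
    result is a vector on the (V - 1)-dimensional simplex.
    """
    counts = Counter()
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i : i + k]
        if set(kmer) <= {"A", "C", "G", "T"}:  # skip ambiguous bases
            counts[min(kmer, revcomp(kmer))] += 1
    total = sum(counts.values())
    # fixed ordering over all canonical k-mers gives a comparable vector
    keys = sorted({min(km, revcomp(km)) for km in
                   ("".join(p) for p in product("ACGT", repeat=k))})
    return [counts[km] / total for km in keys]
```

For k = 7 this yields a 8192‐dimensional vector, matching the default input size of the model.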
The computational problem is as follows. Given is a set of n reference (a.k.a. backbone) genomes, represented as an n × V matrix where each row x_i represents a sequence as a vector giving the normalised frequency of each canonical k‐mer in that sequence. We also have a reference phylogenetic tree, T, leaf‐labelled by the reference taxa, and the patristic distance between any two leaves i and j is given by d_T(i, j) (Figure 1A). Fix a constant embedding dimension m. We seek a function f mapping k‐mer frequency vectors to R^m such that ||f(x_i) − f(x_j)||² matches d_T(i, j) for all 1 ≤ i < j ≤ n. This objective is similar to that of DEPP (Jiang, Balaban, et al. 2022). The main difference is that the input is k‐mer frequencies, obtained in an alignment‐free fashion, rather than the aligned sequences used in DEPP. Using k‐mer frequencies obviates the need for the multiple sequence alignment (MSA) required by DEPP (Jiang, Balaban, et al. 2022). It also eliminates the need to detect homology to marker genes. Thus, relying on k‐mer frequencies dramatically reduces preprocessing time and simplifies the workflow (though it may reduce the phylogenetic signal). Two aspects of this problem definition need clarification.
FIGURE 1.

Method workflow. (A) Training process. For every backbone sequence, normalised k‐mer frequency vectors are generated and summarised. A model is trained to produce an embedding vector for any input. The true distance matrix is obtained from patristic distances on the backbone tree. The loss function is the weighted mean squared error (MSE) between squared Euclidean distances of embedding vectors and the true tree distances. (B) Query step. The trained model is applied to the k‐mer frequency vectors of the query sequences. Once embeddings are obtained, all pairwise distances between query and backbone embeddings are computed to produce a pairwise distance matrix, which is used to perform placement using APPLES. (C) Cladding. The phylogeny is partitioned into subtrees, and an embedder model is trained independently for each subtree alongside a classifier to assign queries to subtrees. At query time, the classifier assigns the query to a subtree, after which the corresponding embedder is used to compute its embedding. (D) Chunking. Genomes are split into overlapping 10 kbp chunks, and random consecutive subsets of chunks are sampled per epoch during training.
The function f is useful only if it is also generalisable; if some of the sequences are hidden from us in the construction of f, the values ||f(x_i) − f(x_j)||² should still match d_T(i, j) for these hidden sequences. Having trained the function using a set of reference sequences, the goal is to then use it on unseen new query sequences to compute their distances to the reference sequences. For any query q with frequency vector x_q, we can compute an n‐dimensional distance vector comprising the values ||f(x_q) − f(x_i)||² for each 1 ≤ i ≤ n, giving us distances from queries to backbone sequences.
Another subtle point is the use of ||·||² instead of ||·||. Both de Vienne et al. (2012) and Layer and Rhodes (2017) have shown that for any tree T with n leaves, there exists a collection of n points in (n − 1)‐dimensional Euclidean space such that the squared distance between points i and j corresponds to d_T(i, j). Thus, for m ≥ n − 1, the squared norms are justified as long as d_T(i, j), and not its square root, is matched. For m < n − 1, there will be some distortion of distances, which can be tolerated when the distortion is small compared to the branch lengths.
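The Layer–Rhodes result can be checked numerically on a toy example: double‐centering the patristic distance matrix of a quartet tree itself (not its square) yields a positive semi‐definite Gram matrix, so points whose squared Euclidean distances equal the tree distances exist. A minimal NumPy sketch:

```python
import numpy as np

# Patristic distances on the quartet tree ((a:1,b:1):1,(c:1,d:1):1)
D = np.array([[0, 2, 3, 3],
              [2, 0, 3, 3],
              [3, 3, 0, 2],
              [3, 3, 2, 0]], dtype=float)

n = len(D)
J = np.eye(n) - np.ones((n, n)) / n
# Double centering is applied to D itself (not D**2): tree distances
# act directly as *squared* Euclidean distances (Layer & Rhodes 2017).
G = -0.5 * J @ D @ J
w, U = np.linalg.eigh(G)
assert (w > -1e-9).all()                       # G is PSD: an embedding exists
P = U @ np.diag(np.sqrt(np.clip(w, 0, None)))  # one point per leaf
sq = ((P[:, None, :] - P[None, :, :]) ** 2).sum(-1)
assert np.allclose(sq, D)                      # squared distances recover d_T
```

Here the exact embedding uses n − 1 = 3 dimensions; with m < n − 1 the same construction would truncate the smallest eigenvalues and introduce the distortion mentioned above.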
To solve the formulated problem, we use machine learning and a neural network model. To train the model, we can define the mean squared error (MSE) loss directly following the problem definition:
    L = (2 / (n(n − 1))) · Σ_{i<j} ( ||f(x_i) − f(x_j)||² − d_T(i, j) )²    (1)
However, longer distances are harder to estimate and are often down‐weighted in distance‐based phylogenetics (Fitch and Margoliash 1967; Beyer et al. 1974); moreover, they are often unimportant with regard to placement (Balaban et al. 2022). Therefore, following Jiang, Balaban, et al. (2022), we down‐weight long distances in the MSE loss function:
    L = (2 / (n(n − 1))) · Σ_{i<j} w_ij · ( ||f(x_i) − f(x_j)||² − d_T(i, j) )²    (2)
where w_ij = 1/(d_T(i, j) + ε)² is a weight inspired by Fitch and Margoliash (1967) that makes each term scale‐free, and ε is a pseudocount that avoids very high weights, including infinity, when d_T(i, j) = 0.
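A minimal NumPy sketch of the weighted loss in Equation (2); the actual implementation is in PyTorch, and the ε default here is illustrative:

```python
import numpy as np

def weighted_mse_loss(emb, D, eps=1e-3):
    """Weighted MSE between squared embedding distances and patristic
    distances D, with Fitch-Margoliash-style weights 1 / (D + eps)**2
    down-weighting long distances (eps is an illustrative pseudocount).
    emb: (n, m) embedding matrix; D: (n, n) tree distance matrix.
    """
    # squared Euclidean distances between all embedding pairs
    sq = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1)
    i, j = np.triu_indices(len(D), k=1)        # each unordered pair once
    err = (sq[i, j] - D[i, j]) ** 2
    w = 1.0 / (D[i, j] + eps) ** 2
    return (w * err).mean()
```

Averaging over the upper triangle matches the 2/(n(n − 1)) normalisation in the equation.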
To build the function f, we use a neural network. Our model is composed of two feed‐forward linear layers separated by a rectified linear unit (ReLU) activation function. The size of the second layer equals the embedding dimension m (default m = 1024), and the output of the last layer is the embedding.
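A forward-pass sketch of this architecture in NumPy (the paper's implementation is in PyTorch; the hidden-layer size h is an assumption, as the excerpt does not state it):

```python
import numpy as np

rng = np.random.default_rng(0)

class Embedder:
    """Minimal sketch of the kf2vec embedder: two linear layers with a
    ReLU in between. V is the number of canonical k-mers, h the hidden
    size (assumed), and m the embedding dimension (default 1024)."""

    def __init__(self, V, h=2048, m=1024):
        self.W1 = rng.normal(0, 1 / np.sqrt(V), (V, h))
        self.b1 = np.zeros(h)
        self.W2 = rng.normal(0, 1 / np.sqrt(h), (h, m))
        self.b2 = np.zeros(m)

    def __call__(self, x):
        """x: (batch, V) k-mer frequency vectors -> (batch, m) embeddings."""
        hidden = np.maximum(x @ self.W1 + self.b1, 0.0)  # ReLU
        return hidden @ self.W2 + self.b2
```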
For optimising the loss function in Equation (2), we use the Adam optimizer (Kingma and Ba 2014), an extension of stochastic gradient descent that adapts the learning rate for each weight of the neural network. The batch size used for stochastic gradient descent is set to 16 by default, leading to 120 pairs per batch. The optimization is allowed to run for a fixed number of epochs, after which the model with the lowest training error is selected. The learning rate is initialized at 0.000013 and is adjusted down gradually by step‐wise multiplication, but never below 0.000003. We implemented and trained our model using PyTorch (Paszke et al. 2019). Unless otherwise specified, we use the default kf2vec parameter k = 7 (8192 canonical k‐mers).
2.1.2. Using Distances for Sample Identification
2.1.2.1. Phylogenetic Placement
For each query q, the estimated distance vector can be used as input to the distance‐based placement method APPLES‐II (Balaban et al. 2022) for placement on T (Figure 1B). APPLES‐II uses dynamic programming to find the optimal placement of q on T according to the least squares criterion, again down‐weighting larger distances compared to small ones. We use APPLES‐II with the settings -f 0 -b 5, which ask it to ignore long distances and use only the smallest five distances between each query and the backbone sequences.
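The per-query distance vectors handed to APPLES‐II are simply squared Euclidean distances between embeddings; a vectorised NumPy sketch (the function name and the callable f are illustrative):

```python
import numpy as np

def query_distance_vectors(f, X_backbone, X_query):
    """Squared Euclidean distances from each query embedding to every
    backbone embedding. `f` maps k-mer frequency vectors to embeddings
    (e.g., a trained kf2vec model); each row of the result is the
    distance vector for one query.
    """
    E_b = f(X_backbone)            # (n, m) backbone embeddings
    E_q = f(X_query)               # (q, m) query embeddings
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed without loops
    sq = ((E_q ** 2).sum(1)[:, None] + (E_b ** 2).sum(1)[None, :]
          - 2 * E_q @ E_b.T)
    return np.maximum(sq, 0.0)     # clip tiny negatives from round-off
```

The resulting matrix can be written out in the distance-matrix format APPLES‐II consumes.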
2.1.2.2. Taxonomic Assignment
We developed a heuristic algorithm to assign taxonomic labels based on distances between embeddings. The algorithm seeks a subset of references with statistically indistinguishable distances to the query (denoted as its neighbours). To test indistinguishability, we assume that the estimated distance to a reference genome at true distance d, multiplied by the genome length, is a draw from a Poisson distribution with mean equal to d times the genome length. Then, for two references with estimated distances d_1 ≤ d_2, we can calculate a p‐value for the null hypothesis that the smaller d_1 is the true value and the larger d_2 is a noisy draw from the same distribution. We base our decisions on this model and an extra criterion. We iterate over reference genomes in ascending order of their distance to the query, comparing each d_i to the closest match d_1 and the previous one d_{i−1}. We stop at i and select the preceding references when one of two criteria is met: either the p‐value comparing d_i to d_1 is below 0.1 (thus d_i is statistically higher than d_1), or we observe a distance gradient d_i − d_{i−1} much higher than in previous iterations; precisely, when the increase far exceeds g, the maximum increase observed so far (the gradient criterion is skipped at the first comparison, since we have no prior gradients).
Taxonomic labels are assigned by analysing how frequently different labels appear among the selected neighbours. Beginning at the species level and progressing to higher taxonomic ranks (if necessary), we classify q to a taxonomic label if the label appears in at least a threshold fraction of the neighbours. Thus, the process moves up to classifying at higher ranks only if a consensus is not reached at a lower rank. If no rank meets the threshold, the query is classified under the root category.
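A sketch of this rank-consensus step; the 0.75 threshold is an assumed placeholder, as the excerpt does not state the paper's value:

```python
from collections import Counter

RANKS = ["species", "genus", "family", "order", "class", "phylum"]

def assign_label(neighbour_taxa, threshold=0.75):
    """Walk up the ranks until one label reaches the consensus
    threshold among the selected neighbours; otherwise fall back to
    the root. `neighbour_taxa` is a list of dicts mapping rank -> label
    for each neighbour (threshold=0.75 is illustrative)."""
    n = len(neighbour_taxa)
    for rank in RANKS:
        labels = [t[rank] for t in neighbour_taxa if rank in t]
        if not labels:
            continue
        top, count = Counter(labels).most_common(1)[0]
        if count / n >= threshold:
            return rank, top
    return "root", None
```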
2.1.3. Divide‐and‐Conquer Enables Scaling to Large Trees
Two difficulties face building a single model on the complete set of genomes. On the one hand, the tree is represented as an n × n distance matrix, and the quadratic growth of the size of this table means that, for large n, it becomes difficult to keep the table in memory and to optimise the neural network over all n(n − 1)/2 entries. Moreover, as datasets become more phylogenetically diverse, it may become harder for a single model to capture all the complexities of their evolution. Jiang et al. (2023) proposed solving a similar problem facing DEPP by dividing the tree into smaller subtrees, training a model for each subtree, and using a classifier to pick the best subtree for each query. We adopt a similar approach. To do so, we divide the tree into subsets using TreeCluster (Balaban et al. 2019), aiming for clusters of similar sizes with a user‐specified maximum (Figure 1C). Then, we train embedder functions separately for each subtree. We also need to train a classifier that probabilistically assigns a k‐mer frequency vector to the subtrees. To train the classifier, we use the standard cross‐entropy loss function.
At query time, we first apply the classifier to pick the subtree model with the highest probability (Figure 1C). We use that model to compute distances between the query and all reference sequences from that particular subtree. We leave the distances to all other reference genomes undefined, which is allowed by APPLES‐II. The classifier model shares its first layer with the embedders; this first layer is followed by a classification layer, which is a linear layer followed by the softmax function, to produce probabilities.
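The query-time routing can be sketched as follows; all names here are illustrative, and the per-subtree data structures are assumptions about how a concrete implementation might organise the trained models:

```python
import numpy as np

def route_and_embed(x_query, classifier_probs, embedders, clade_members, n):
    """Pick the most probable subtree for a query, embed it with that
    subtree's model, and leave distances to all other references
    undefined (NaN), as permitted by APPLES-II.

    classifier_probs: frequency vector -> subtree probabilities
    embedders[c]:     per-subtree embedding function
    clade_members[c]: (reference indices, reference embeddings) of subtree c
    n:                total number of backbone references
    """
    c = int(np.argmax(classifier_probs(x_query)))
    idx, ref_emb = clade_members[c]
    q = embedders[c](x_query)
    dists = np.full(n, np.nan)                  # undefined by default
    dists[idx] = ((ref_emb - q) ** 2).sum(1)    # squared distances in clade
    return dists
```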
When n is sufficiently small, building a single model is practical. However, when the dataset is diverse enough that a single model may not capture its full complexity, users can opt to use divide‐and‐conquer, which we call a 'cladded' model (though note that any subtrees can be used, and they do not have to be clades). By default, we perform divide‐and‐conquer only for sufficiently large reference sets.
2.1.4. Handling Incomplete Data (e.g., Contigs, Long Reads) Using Chunking
As we will see in the results, training a model on k‐mer frequencies computed from the entire genomes is not optimal for embedding and placing incomplete data such as long reads or contigs. Our proposed solution involves generating many subsequences of the genome and training the model on these subsequences. This setup allows the model to observe variable‐length regions of the original genomes and learn to map these subsequences to the same place in the embedding space. To do so in an efficient manner, we approximate generating subsequences using a chunking approach.
We begin by splitting each backbone genome into a set of fragments (chunks) spanning the sequence (Figure 1D), with chunk size set to R = 10,000 bp by default. The chunks are mostly non‐overlapping, but we do let chunks have short overlaps to accommodate genome sizes that are not divisible by R. We perform this step using seqkit (Wei et al. 2016), first computing the length of each contig, then calculating the appropriate overlap so that sliding windows evenly cover the sequence, and finally extracting each resulting chunk. We exclude genomes from the training set if their total length is < 50 kbp, because they must be highly incomplete. For each chunk of a genome, we precompute its unnormalised k‐mer frequencies and represent each backbone genome as a series of s ordered k‐mer frequency vectors, c_1, …, c_s. We use this sample representation as input for training the chunked versions of the classifier and embedder.
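The overlap calculation can be sketched as follows: choose the number of windows by ceiling division and spread their start positions so that the last window ends exactly at the sequence end. This mirrors, but does not reproduce, the seqkit-based step described above:

```python
def chunk_coordinates(length, window=10_000):
    """Start positions of fixed-width windows that cover a sequence of
    the given length evenly, allowing short overlaps when the length is
    not a multiple of the window size."""
    if length <= window:
        return [0]
    n_chunks = -(-length // window)             # ceil division
    # spread n_chunks windows so the last one ends exactly at `length`
    step = (length - window) / (n_chunks - 1)
    return [round(i * step) for i in range(n_chunks)]
```

For a 25 kbp contig with 10 kbp windows, this yields three windows overlapping by 2.5 kbp each; for a 20 kbp contig, two windows with no overlap.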
During training, for each genome at every epoch, a random consecutive subset of k‐mer frequency vectors is selected, with the first chunk, c_j, chosen uniformly at random. The number of sampled vectors is drawn from an exponential distribution with the mean set to s/5 (20% of all chunks) and truncated to lie within [1, s]. Note that the choice of the truncated exponential distribution ensures high variance in the length of the sampled subsequences. The sampled subset is then taken as a contiguous block of vectors starting at c_j within the chunk set. Then, k‐mer frequencies of the selected chunks are added up and renormalised to create a single input entry. This process is repeated at the time of constructing each batch so that each epoch is trained on a new subsequence of the original genome. This chunking strategy is used for training both the classifier and the embedder. For the embedder (but not the classifier), to encourage the model to map different subsequences onto the same embedding, we include in each batch two rounds of subsampling for each genome (i.e., two inputs constructed with different starting chunks and lengths), setting the distance between the two subsamples of the same genome to 0. As a result, the batch size of chunked models is twice as large as that of unchunked models, leading to roughly four times as many pairs (496 instead of 120, by default). When both cladding and chunking are used, all chunks of the same genome are assigned to the same true clade at the time of training the embedder.
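The per-epoch subsampling can be sketched as follows (truncating the block at the end of the chunk list is an assumption; the excerpt does not specify how blocks running past c_s are handled):

```python
import random

def sample_subsequence(chunk_counts, rng=random):
    """One training-time subsample: draw a block length from an
    exponential distribution with mean s/5, truncate it to [1, s],
    pick a uniformly random start chunk, then sum the chosen
    consecutive chunk count vectors and renormalise to frequencies."""
    s = len(chunk_counts)
    n_take = min(s, max(1, round(rng.expovariate(5.0 / s))))
    j = rng.randrange(s)                        # first chunk, uniform
    block = chunk_counts[j : j + n_take]        # truncated at the end
    summed = [sum(col) for col in zip(*block)]
    total = sum(summed)
    return [v / total for v in summed]
```

Calling this twice per genome per batch produces the two subsamples whose mutual distance is set to 0 during embedder training.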
At inference time, no chunking is performed. Each input genome, contig, or long read is used as input to the method in whole, with one k‐mer frequency vector calculated. Because inputs are not chunked at query time, the entire input is classified into one clade and embedded into one position. Thus, cladding and chunking can be used simultaneously.
2.2. Evaluation Experiments
We evaluated the proposed method in a series of experiments over nine datasets with different properties (Table 1), focusing on several questions. In each case, a reference tree was available. In many cases, we performed leave‐out experiments where tree leaves were held out from training and were used as queries. In these situations, we treated the placement of the query on the reference as ground truth. As a measure of error, we compared the placement obtained by a method and the ground truth and reported the number of nodes between the two, referring to it as the placement error. In some cases, queries are placed in a tree that is different from the reference tree T. In these cases, we compute the Robinson and Foulds (RF) (Robinson and Foulds 1981) distance between the updated tree and the ground truth tree (both induced to have the backbones and one query), subtract the RF distance between the ground truth tree and T, and report the result, referred to as the delta error; the delta error shows the additional tree distance created by the placement and would match the placement error when the ground truth tree is T itself. We next describe each experiment, leaving details and exact commands to Appendix S3.
TABLE 1.
Description of the datasets used in evaluation experiments.
| Set | Backbone (size) | Query (testing sequences) | Ground truth b | Filtering threshold |
|---|---|---|---|---|
| D1 | WoL19 a (10,075) | Full length genomes (500) left out of WoL19 | WoL19 (10,575) | — |
| D2 | WoL19 (10,575) | Full length genomes (500) from WoL23 | WoL23 (199,330), delta error | — |
| D3 | WoL19 (10,075) | Contigs from D1 (36,549) | WoL19 (10,575) | ≥ 5 kbp |
| D4 | WoL19 (10,075) | 10 best from D1, fragmented into 20 different sizes, 5 replicates for each size (1000) | WoL19 (10,575) | — |
| D5 | WoL19 (10,075) | PacBio HiFi reads (102,428) from D1 | WoL19 (10,575) | ≥ 5 kbp |
| D6 | WoL19 (10,075) | Nanopore reads (1810–1869) from D4 | WoL19 (10,575) | — |
| D7 | WoL23 (50,752) | CAMI2 GSA contigs (20,292) | Taxonomy CAMI | ≥ 10 kbp |
| D8 | Fungi (1564) | Left out full length challenging genomes (80) | Fungal (1672) | — |
| D9 | Insects (1118) | Left out full length challenging genomes (60) | Insects (49,358) (Chesters 2016) | — |
a WoL19/23 refers to the Web of Life phylogeny released in 2019 or 2023, respectively. Five genome IDs (G000633455, G000633415, G000633375, G000468535, G000633215) no longer exist in NCBI and were excluded.
b The reference set used to evaluate placement accuracy or error.
2.2.1. D1/D2: Full Length Microbial Genomes
To benchmark our tool on microbial data, we used the Web of Life dataset, consisting of a phylogenetic tree of 10,575 taxa (Zhu et al. 2019) (WoL19) and a new version with 199,330 genomes (Balaban et al. 2023) (WoL23), both constructed from around 380 gene trees. We selected queries not seen in the training data in two ways (D1 and D2 in Table 1). In initial analyses that focused on examining variations of the kf2vec algorithm, we removed a set of 500 randomly selected queries from WoL19 and used the rest as the reference (D1). In analyses comparing kf2vec to other methods, we used the full WoL19 tree as the backbone tree and selected 500 queries appearing in WoL23 but missing from WoL19 as queries (D2), reporting the delta error. These queries were selected to have a range of distances to the closest reference taxon in WoL19, allowing us to examine the impact of novelty. We define novelty as a minimum genomic nucleotide distance from a query to any genome in the reference dataset as approximated by Mash (Ondov et al. 2016).
2.2.1.1. Impact of kf2vec Parameters
By default, we use k = 7. The two layers in our neural network are separated by the Rectified Linear Unit (ReLU). We train a classifier for 2000 epochs and an embedder for 8000 epochs. Using the D1 set, we studied the performance of kf2vec with different hyperparameters. We examined several k‐mer lengths, training both the embedder and classifier with the same k. Very large k is impractical as the number of k‐mers grows exponentially, making inputs excessively large. We examined the number of neural network layers of the embedder (keeping the classifier fixed), adding one (hidden dimension 4096) or two (hidden dimensions 6144 and 4096) additional neural network layers (thus getting 3 or 4 layers). For training, we examined the number of epochs, the learning rate schedule, and related parameters such as the learning rate minimum, learning rate decay, and batch size, varying only the embedder while keeping the classifier settings unchanged.
We also evaluated the impact of the divide‐and‐conquer strategy, applied to the WoL19 phylogeny, which was divided into 15 subtrees, each with 507–846 taxa. We tested whether we can achieve better placement accuracy by computing distances only between the query and the subtree to which it was classified, rather than computing distances to all genomes of the backbone. We note that the cladded approach depends on the accuracy of both the classification and embedder models, whereas for the uncladded approach, only accurate distance computation is required.
2.2.1.2. CAFE vs. kf2vec
We compared the accuracy of distance estimation between kf2vec and aCcelerated Alignment‐FrEe sequence analysis (CAFE) (Lu, Tang, et al. 2017). CAFE is a state‐of‐the‐art tool that allows computing 28 different alignment‐free dissimilarity measures using k‐mer frequency statistics. For each subtree, we computed eight distance measures (CVTree, d2star [d2*], d2shepp [d2S], Cosine [Cos], Co‐phylog, Euclidean [Eu], Jensen–Shannon divergence [JS] and Manhattan [Ma]) previously shown to best approximate phylogenetic distances (Lu, Tang, et al. 2017). Here, we used the D1 query set (Table 1). Besides the placement error, we show how the computed distances correlate with the tree distances.
2.2.1.3. Alignment‐Based DEPP vs. kf2vec
We compared the performance of alignment‐free kf2vec with its alignment‐based counterpart, DEPP (Jiang, Balaban, et al. 2022), which uses a similar approach but with aligned marker genes as input. In this study, we used the D2 query set (Table 1). This eliminated the need to retrain DEPP for the leave‐out strategy, as models trained on the full WoL19 tree were already available (Jiang, Balaban, et al. 2022). Moreover, using a second query set not used during our hyperparameter tuning makes the comparisons fairer. We applied DEPP in four ways: randomly selected 30 marker genes and used each marker gene independently, used the first marker gene (p0000) of Zhu et al. (2019) which is among their longest, used all 381 marker genes, and used the commonly used 16S gene.
2.2.2. D3/D4: Partial Microbial Genomes
One potential application of kf2vec is to place contigs from metagenomic assemblies on a reference tree to enable taxonomic identification, sample differentiation, or binning. As metagenomic contigs are often small, a question arises: How short can a contig be for kf2vec to be placed with reasonable accuracy and for contigs of the same genome to have low distances to each other? This experiment attempted to answer that question using two query sets. To generate the D3 queries, we extracted all contigs of length at least 5000 bp from 500 genomes used in D1, computed their k‐mer frequency vectors, and treated each contig as a separate query. In total, we obtained 36,549 queries. To create the D4 dataset, we first selected 10 complete high‐quality genomes among D1, defined as those that were placed with no error by kf2vec in parameter validation experiments (Table S1) and had an estimated completeness of 100% as well as contamination and contamination portion of 0% reported by CheckM and GUNC quality evaluation tools (McDonald et al. 2023). Next, we used BBTools (Bushnell et al. 2017) to select random regions of size 1, 2, …, 1000 kbp (five replicates) to create inputs with controlled size (1000 queries in total: 50 queries per size, 20 sizes).
2.2.3. D5/D6: Long Reads
To check whether kf2vec is capable of placing error‐prone long sequencing reads directly on a phylogeny, we generated the D5 dataset. We attempt to place reads without the further preprocessing (error removal) that is often required by other tools. We used PBSIM3 to simulate PacBio HiFi (Sequel‐CCS) reads from genomes of the D1 data set (Table 1) (Ono et al. 2022). We used the simulation parameters used by Kim and Steinegger (2024) in their study. Reads below 5000 bp were filtered out. Each of the remaining 102,428 reads was treated as a separate query. We calculated distances from each read to WoL19 backbone genomes using all WoL‐based trained models and placed each sequence on the WoL19 tree. We tested chunked and unchunked models, both with and without cladding.
Additionally, to evaluate the effect of sequencing error, we generated the D6 dataset, consisting of long nanopore reads with variable accuracies (0.85, 0.90, 0.95, 0.99, 1.00; Table 1). We ran simulations on genomes from the D4 dataset using PBSIM3 (Table S1). Exact commands are provided in Appendix S3. We assessed the performance of chunked cladded and uncladded WoL19 models with these reads.
2.2.4. D7: CAMI Taxonomic Binning Challenge
Another application of kf2vec that we explored is taxonomic identification. We benchmarked the ability of kf2vec to perform taxonomic assignment on a dataset included in the CAMI2 challenge (D7; Table 1) (Meyer et al. 2022). CAMI is a community‐driven effort to benchmark tools that perform taxonomic binning on standardised datasets, and CAMI2 was the second such challenge organised by the community, in 2019 (Meyer et al. 2022; Yi Yue et al. 2020). CAMI2 organisers provided a set of NCBI reference genomes, the corresponding taxonomy, and a collection of gold standard assembly (GSA) contigs that were generated from short reads and corresponded to the query set.
To create the D7 dataset, we selected a subset of CAMI2 NCBI reference genomes in the following way. Initially, we randomly picked 500 unique genomes per species from the CAMI2 reference set and further selected only genomes that were present in the WoL23 phylogeny. This procedure provided us with 50,752 backbone sequences (out of 141,677 available to all methods that participated in CAMI2) for which we computed ground truth pairwise distances based on the WoL23 phylogeny. We split the backbone tree into 16 clades and trained the cladded chunked model and a corresponding classifier. As queries, we used all 20,292 GSA contigs of length at least 10 kbp present in the marine CAMI dataset. We computed distances from all queries to backbone genomes and assigned taxonomic labels to queries based on tax IDs of the reference sequences and distance information. The taxonomy used for this assessment was the NCBI taxonomy version provided to participants by the CAMI2 challenge organisers. We note that phylogenetic placement was not performed in this case.
We evaluated the results of taxonomic binning based on the average purity and completeness of the bins, as well as the accuracy per sample at different taxonomic ranks, by comparing the output of kf2vec with outputs from other methods that were benchmarked by CAMI2 (Meyer et al. 2022). To compare tools, we used AMBER (Meyer et al. 2018) with the same settings as were utilised in the original study (Meyer et al. 2022) and reported results for contigs of length ≥ 10 kbp.
2.2.5. D8: Fungal Dataset
Li et al. (2021) published a phylogeny of 1672 fungal genomes generated using 290 marker gene trees. Because marker gene trees are available for this dataset, we can use them to compare kf2vec with alignment‐based methods. To select challenging queries, we picked the 80 genomes with the largest distance to their closest taxon on the full tree (Table S2). We also excluded 28 outgroup taxa. We trained kf2vec on the remaining 1564 genomes. For EPA‐ng, we used a single marker gene, EOG092C19ZG (RNA polymerase Rpb1, domain 4); this marker had the highest taxon occupancy, 1608 genomes, of which seven had names that did not match the tree and were removed. To obtain its backbone, we removed query entries from the multiple sequence alignments obtained from the original study and recomputed model parameters using RAxML‐NG with the LG + R10 model, which was selected for this marker in the original study (Stamatakis 2014). We report placement results for 73 queries, as seven queries did not contain the selected marker gene. Exact commands for placement are provided in Appendix S3.
2.2.6. D9: Insect Dataset
To compare kf2vec to alternative alignment‐free methods, we used an insect phylogeny of 49,358 species compiled by Chesters (2016) based on the widely used single marker gene COI, extracted from transcriptomic and mitochondrial sequencing datasets. Among the species in this reference phylogeny, we were able to find the genomes of 1178 species on NCBI (Sayers et al. 2018). Among these, we selected 60 genomes with the highest distance to their closest match among the other genomes (Table S3). These 60 genomes were used as the query set and the rest as the backbone. We benchmarked kf2vec against Skmer, an assembly‐free and alignment‐free tool that estimates Hamming distances using the presence/absence of long k‐mers (Sarmashghi et al. 2019). We used the same reference set used for kf2vec to build the Skmer reference library with default parameters. We evaluated the accuracy of the placement as well as the running time. For kf2vec, library construction time encompasses the time needed to train the model, and the query time includes computing distances between each query sample and backbone genomes. Reported values incorporate the time necessary for k‐mer count computation. Model training was executed on NVIDIA‐GeForce‐RTX‐3090 GPUs.
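For context on what Skmer‐style estimators compute, the sketch below shows the related Mash formula, which converts the Jaccard similarity of two long‐k‐mer sets into a per‐site distance. This illustrates the general idea only; Skmer's actual estimator additionally corrects for sequencing coverage and error, which are omitted here.

```python
import math


def mash_distance(set_a, set_b, k=31):
    """Mash-style distance from the Jaccard index of two k-mer sets:
    d = -(1/k) * ln(2j / (1 + j)).

    Sketch of the idea underlying presence/absence-based estimators;
    not Skmer's exact formula (no coverage/error correction).
    """
    inter = len(set_a & set_b)
    union = len(set_a | set_b)
    j = inter / union if union else 0.0
    if j == 0.0:
        return 1.0  # saturated: no shared k-mers
    return -math.log(2 * j / (1 + j)) / k
```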
3. Results
3.1. Accuracy on Full‐Length Data
Starting from the D1 dataset, we observe that kf2vec in its default unchunked setting had low placement errors (Figure S1A), finding the correct placement in 43% of cases and no more than two edges of error in 80% of cases (average error = 1.63 edges). There is, however, a small tail of 3.2% of queries with high error, ranging from 8 to 20 edges.
Several factors impact query placement accuracy. A key factor impacting the error is query novelty (Figure 2A). The model produced the lowest placement error for queries within a low but non‐zero novelty range. Queries that are even less novel have slightly higher errors. This may be due to the high similarity of their k‐mer profiles to several genomes present in the training data; distinguishing among these highly similar genomes likely exceeds the resolution capacity of the embedder model. The placement error increases only slightly as the novelty increases from 0.01 to 0.2, showing robustness to substantial levels of novelty. However, for the most novel queries (i.e., the 0.2–1 range), the accuracy does decrease substantially, showing that the robustness to novelty has limits. Note that the novelty values are in units of expected numbers of substitutions per site (for marker genes); thus, a novelty of 0.2–1 compared to the closest reference species can be considered very high. Beyond novelty, genome quality mattered for the queries. We tested various genome quality metrics and found that three had a significant impact on the placement accuracy of a query (Figure S2); these metrics (mean hit identity, reference representation score, and genes retained index) measure assembly completeness and contamination.
FIGURE 2.

Results on full‐length genomes. All results are with unchunked models and on full‐length genomes (D1 for panels A–D and D2 for E and F). (A) Impact of divide‐and‐conquer and novelty. We show placement error versus the distance from a query to the closest taxon on the backbone phylogeny (novelty). ‘cladded (true)’ bypasses classification and places each query in the true subtree. (B) Impact of k on placement accuracy. Classifier and cladded embedder models were trained using variable k (default: k = 7). (C) Distance comparison between kf2vec and the best‐performing CAFE metric for backbone and queries, showing the Pearson correlation between true and estimated distances. Each dot is a backbone or a query genome compared to a backbone genome in the same subtree. (D) Placement error for different CAFE metrics in comparison to kf2vec, both applied to each clade. Placement errors for Co‐phylog and JS metrics were 22.9 and 12.1, respectively, and are not shown. CAFE metrics use either raw or background‐adjusted k‐mer frequencies; distance measures based on adjusted k‐mer counts are more phylogenetically accurate. (E) Distribution of the placement error, shown as an empirical cumulative distribution function (ecdf), divided into four levels of novelty (boxes). (F) Histograms of placement error for kf2vec and DEPP with the 16S rRNA gene and the set of 381 markers, computed on 128 queries for which 16S rRNA data were available.
We next examine how the design of kf2vec impacts its accuracy and then compare kf2vec to alternatives.
3.1.1. Design Choices of kf2vec
3.1.1.1. Model
The default parameters perform well for the D1 dataset (Figure 2B). Exploring alternative values, we observe that k = 7 yields the lowest placement error when tested on prokaryotes (Figure 2B). Both shorter and longer k‐mers result in higher placement errors. Although k = 8 is close in accuracy and may provide better accuracy for other genomes (e.g., larger eukaryotic genomes), we chose k = 7 as the default due to its lower running time and higher accuracy here. Adding one or two extra linear layers with a non‐linear activation does not improve accuracy but significantly increases training time (Figure S1B).
3.1.1.2. Training
We next examined the parameters of the training process. Both training and testing losses decrease dramatically over the first 100 epochs (Figure S3A). The training loss continues to drop for 8000 epochs. The testing loss, however, does not improve beyond around 1000 epochs. Nevertheless, despite the stable testing loss, the placement error continues to improve for at least 4000 epochs, with the most substantial gains in the first 1000 epochs (Figure S3B). Given this trend, we suggest that training for 4000–8000 epochs is sufficient. Moderately increasing the learning rate and/or the minimum learning rate has a small but negative impact on placement error, while a larger increase worsens results substantially (Figure S1B), showing that choosing appropriate learning rates does matter. We also tested whether increasing the learning rate decay and the batch size would impact model convergence, but these changes had no measurable effect on placement error.
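The roles of the learning rate, minimum learning rate, and decay can be pictured with a simple decaying schedule. The cosine form below is purely hypothetical: the paper tunes these three quantities but does not commit to this exact functional shape.

```python
import math


def cosine_lr(epoch, total_epochs, lr=1e-3, min_lr=1e-5):
    """Hypothetical cosine decay from lr down to min_lr over training.

    Only an illustration of how a learning rate, a floor (min_lr), and
    a decay interact; kf2vec's actual scheduler may differ.
    """
    t = min(epoch / total_epochs, 1.0)
    return min_lr + 0.5 * (lr - min_lr) * (1.0 + math.cos(math.pi * t))
```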
3.1.1.3. Divide‐and‐Conquer (Cladding)
Cladding is effective in reducing both error and running time. Overall, embedder models trained on subtrees outperform the uncladded model (mean error 1.63 vs. 1.81), regardless of query novelty (Figure 2A). The classifier used to assign queries to subtrees is accurate for 98.2% of the queries. The cases with wrong classifications are among the most novel queries, with the distance to the closest genome ranging from 0.21 to 1.6 (mean: 0.98). If we place each query within its true subtree to focus on the embedder, we see that the cladded model yields better placements in 13 out of 15 subtrees (Figure S4A). Placement error varies across subtrees, ranging from 1.1 to 2.5 edges, showing that some parts of the tree are more challenging than others. Notably, differences in error between clades are not always explained by clade size or subtree diameter (Table S4); the clades with higher mean errors have typical median errors, showing that their higher average error is mostly due to outliers. In addition to the slight improvement in accuracy, cladding also helps with scalability by allowing independent training of clades, as we will see.
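The divide‐and‐conquer pipeline can be summarised as: classify the query into a subtree, then compute distances with that subtree's embedder only. In the toy sketch below, `classifier`, `embed`, and the per‐clade reference‐embedding tables are hypothetical stand‐ins for kf2vec's trained neural models.

```python
def place_with_cladding(query_vec, clade_models, classifier):
    """Toy divide-and-conquer placement: route the query to one clade,
    then compute Euclidean distances only to that clade's references.

    All components here are illustrative stand-ins; kf2vec uses trained
    classifier and embedder networks and then APPLES-2-style placement.
    """
    clade = classifier(query_vec)
    model = clade_models[clade]
    q = model["embed"](query_vec)
    return clade, {
        ref: sum((a - b) ** 2 for a, b in zip(q, e)) ** 0.5
        for ref, e in model["references"].items()
    }
```

Because each clade's references are handled independently, the per‐clade models can also be trained in parallel, which is the scalability benefit noted above.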
3.1.1.4. Filtering References
Based on the impact of quality metrics for queries, one may suspect that quality filtering of reference genomes may also improve accuracy. We tested this idea and observed that the accuracy of full‐genome placement remained unchanged even when models were trained with 171 low‐quality reference genomes (flagged by contamination, clade separation score, or reference representation score thresholds) removed (Figure S4B); these results demonstrate the robustness of the training process and highlight that filtering can increase the novelty of queries, which may be detrimental. Thus, we chose to perform no filtering of references in subsequent analyses.
3.1.2. Accuracy vs. Other Methods
3.1.2.1. Comparison to Predefined Equations of CAFE
We next compared the distances calculated using kf2vec and multiple metrics from the CAFE suite (Figure 2C,D). Among CAFE distances, the two best‐performing metrics and Cosine most accurately represent phylogenetic distances and result in placement errors of 2.67, 2.88 and 2.87 edges, respectively. In contrast, kf2vec has a mean placement error of 1.50 edges (note that we use correct classifications in this comparison; the error would increase to 1.63 otherwise). Unsurprisingly, kf2vec has near‐perfect accuracy in computing distances between genomes in the reference tree (Figure 2C; median Pearson correlation coefficient: 0.99). Even for query genomes not directly observed by kf2vec, it obtains distances that correlate with phylogenetic patristic distances far better than the best CAFE metric (median correlation coefficients: 0.94 for kf2vec vs. 0.58). Moreover, kf2vec distances show no bias for either training or testing data, while the CAFE metric tends to underestimate larger distances.
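The correlations reported here are plain Pearson coefficients between true patristic distances and estimated distances. For concreteness, a small self‐contained implementation:

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two paired lists, e.g., true patristic
    distances (xs) and estimated distances (ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```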
It must be noted that CAFE distances are completely agnostic of the reference phylogeny. Thus, the fact that some of them could still place sequences with some level of accuracy when paired with APPLES‐2 is noteworthy. In contrast, kf2vec is directly trained to place on this tree and therefore enjoys an inherent advantage. By taking a given reference tree into account, kf2vec learns to compute distances compatible with that tree, as opposed to pre‐defined tree‐agnostic metrics such as those in CAFE. We emphasise that these results are not meant to suggest that kf2vec distances are universally better than tree‐agnostic metrics. Rather, we show these results to emphasise that if the goal is to place on a tree, our supervised metric learning approach is more accurate.
3.1.2.2. Comparison to the Alignment‐Based DEPP
Comparison to the similar marker‐based method, DEPP, depends on the marker genes used (Figure 2E,F and Figure S5). When all 381 marker genes are used, DEPP has better accuracy than kf2vec. Both DEPP using all 381 markers and kf2vec find the correct placement for roughly the same percentage of queries (41% and 44%). However, kf2vec has above 4 edges of error for 16% of queries, whereas DEPP with 381 markers rarely produces such high levels of error. We note that the reference tree is inferred using these 381 marker genes (not the k‐mer frequencies), and thus marker genes have an inherent advantage.
When a single marker gene is used, kf2vec performs comparably or better than DEPP, and the reduction in mean error depends on the choice of the marker. On average, kf2vec has 6% lower error when we randomly select a marker gene and shows nearly identical error if we use a marker with a very high signal (Figure S5). Notably, compared to the widely used 16S marker gene, kf2vec substantially reduces the error (Figure 2E,F). In particular, kf2vec never exceeds 11 edges of error (for the subset of 128 queries used here), whereas DEPP+16S occasionally leads to errors ranging from 11 to 27 edges, presumably indicating incorrect 16S annotation or horizontal transfer events. The comparison of DEPP and kf2vec is also impacted by novelty. kf2vec is substantially better than 16S for novelty below 0.4 but not for the most novel queries (Figure 2E and Figure S6). The higher accuracy of DEPP with 381 genes compared to our method is also mainly concentrated in queries with novelty levels of 0.2 or higher.
3.1.2.3. Comparison vs. Skmer and EPA‐ng
For both eukaryotic datasets, where genomes are significantly longer and k‐mer frequencies are presumably more information‐rich, kf2vec achieves near‐perfect accuracy (Table 2). In the insect dataset, only one species, Nesidiocoris tenuis, is placed incorrectly. Skmer, which uses the presence/absence of long k‐mers, misplaces the same species along with one additional query sample (Bemisia tabaci), producing slightly higher placement errors. The high placement error for both Skmer and kf2vec (6 edges) in these cases suggests potential mislabeling, misplacement in the reference phylogeny, or genome contamination/chimerism. In terms of running time (Table 2), while kf2vec has a slightly longer library construction time than Skmer, it is approximately 3× faster during query time.
TABLE 2.
Placement of eukaryotic assemblies.
| Dataset | Genome count | Query count | Method | Placement error | Correctly placed (%) | Library construction time | Query time |
|---|---|---|---|---|---|---|---|
| Fungi | 1601 | 73 | kf2vec | 0.055 | 97.3 | 152 m a | 4 m |
| Fungi | 1601 | 73 | EPA‐ng | 0.096 | 97.3 | NA | NA |
| Insects | 1178 | 60 | kf2vec | 0.100 | 98.3 | 295 m a | 20 m |
| Insects | 1178 | 60 | Skmer | 0.217 | 96.7 | 237 m | 65 m |
Note: Running time reported for kf2vec includes the time required for getting k‐mer frequencies. Runtime measured on an AMD EPYC 7742 2.25 GHz CPU with 256 GB of memory and 128 cores. Model training for kf2vec was assessed on 1 GPU NVIDIA‐GeForce‐RTX‐3090 with 128 GB of RAM and 4 CPU cores. We did not perform a running time comparison on the fungal dataset because we started EPA‐ng from existing gene alignments, and gene finding and alignment are a substantial part of the running time.
a Preprocessing step (248 m for insects, 76 m for fungi) was done on CPU (20 cores), and model training (46 m for insects, 76 m for fungi) was performed on GPU. Embedder models for different clades and the classifier were trained in parallel; we report the average training time.
On the fungal data, kf2vec correctly places 71 out of 73 queries, with the remaining two placed only one or two edges away from the correct position. This performance was comparable to EPA‐ng, which fails to correctly place the same two queries but with higher placement errors (3 and 4 edges, respectively). Note that EPA‐ng is an alignment‐based and marker‐based method that requires a much more extensive analysis pipeline; on the other hand, it uses a single marker gene, whereas kf2vec uses the full genome. Compared to insects, the fungal dataset took less time for library construction and querying, with the difference explained by the differences in genome size and the time needed to compute k‐mer frequencies.
3.2. Accuracy for Incomplete Data
We now turn from the placement of (mostly) complete genomes to long segments that nevertheless form a small fraction of the full genome.
3.2.1. Placement of Contigs
A controlled experiment on fragments created from high‐quality genomes demonstrates that placement error depends on both the query length and the models used for placement (Figure 3A, Figure S7A). As expected, shorter sequences are placed with higher error, with the most dramatic error increase observed at 10 kbp or lower. Whether we use cladding or not, chunking improves accuracy dramatically for inputs smaller than 10 kbp but is slightly detrimental for larger segments. For full‐size genomes, we observe an increase in error from 1.6 to 2.0 edges when using the chunked instead of the unchunked models, with error increasing at all levels of novelty (Figure S8). The effects of chunking are most relevant for sequences of length 5–10 kbp; above this range, chunking does not help, and below it, the error remains high (e.g., > 5 edges) even after the dramatic reductions afforded by chunking. Nevertheless, in some conditions, the error reductions are significant in practice, as they go from too high to acceptable (e.g., from 9.1 to 1.5 edges for 6 kbp queries with the cladded model).
FIGURE 3.

kf2vec performance on incomplete genomes and long reads. (A, B) Impact of training kf2vec on chunked backbone genomes, showing cladded and uncladded conditions, comparing chunked (arrow head) versus unchunked (arrow tail) treatments. Thus, each arrow indicates the change in mean placement error due to chunking versus using full sequences in training. Here, ‘cladded (true)’ refers to the hypothetical placement accuracy we could achieve if the true subtree of each query were known. Results are presented for (A) artificial chunks in the D4 dataset and (B) varying contig lengths of the D3 dataset. (C) On D3 dataset (WoL19) contigs (≥ 100 kbp), we measure whether contigs of the same genome are grouped together based on distances that are either trained using kf2vec (uncladded unchunked) or computed using CAFE distances that use either raw or background‐adjusted k‐mer frequencies, with the latter considered more phylogenetically accurate. For kf2vec, we compute distances either between embeddings or based on a phylogeny estimated from kf2vec distances using FastME2. Closest contig mismatch: how often the contig with the lowest distance to a contig from a multi‐contig genome is not one of the other contigs from the same genome. (pseudo) F statistic: the mean distance between contigs of a genome and contigs of other genomes over the mean distance among contigs of the same genome; we show the median statistic across all multi‐contig genomes. We exclude JS, CVtree, and Co‐phylog because their closest match errors were much higher than those of other methods (> 47%). (D) Results on long HiFi read queries of varying length (D5 dataset). We show the placement error of the chunked vs. unchunked models, with or without cladding.
When chunking was used, the impact of cladding was mixed due to the added difficulty of classifying queries into the right clade (Figure S7B). When we assign all queries to the correct clade, cladding dramatically improves accuracy compared to uncladded models (Figure 3A). However, classification accuracy drops below 90% at 6 kbp, going as low as 60% for 1 kbp regions. As a result, there is a substantial difference between the placement errors of cladded models used with the classifier and with the true clades. Overall, cladded/chunked and uncladded/chunked models are comparable in accuracy (averaged over all chunk lengths: 2.8 edges for uncladded/chunked vs. 2.7 for cladded/chunked).
We next computed k‐mer frequencies from each contig of an assembly instead of the full genome. Once again, the accuracy depends on contig length and the model (Figure 3B). For contigs of 50 kbp or shorter, the chunked model produces substantially better results. For example, contigs of only 10–20 kbp have an average error of 4.5 edges (median: 2), which is substantially lower than 7.5 (median: 3) without chunking. Note that the high average errors are due to a long tail of high errors, evident from median error values being lower than the averages (Figure S9). The error would be much lower if the classifier assigning contigs to clades had perfect accuracy, consistent with the observation that 9% of contigs are misclassified (with the error concentrated on the shortest ones; see Figure S10A) and that any misclassified contig would have high placement errors.
We next asked whether embeddings from kf2vec are able to cluster contigs of the same genome together, using a subset of the longest D3 contigs (≥ 100 kbp, n = 2262, of which 2122 had at least one other contig from the same genome). We evaluated within‐genome versus across‐genome distances using embeddings computed with the uncladded/unchunked WoL19 model and alternative metrics obtained from CAFE (Figure 3C). For kf2vec, we also inferred a phylogeny of contigs using FastME2 (Lefort et al. 2015) with kf2vec embedding distances as input and used the path lengths on the tree to compute distances between contigs. We first asked how often the closest match to a contig belongs to a different genome. These mismatches happen for only 4.3% (with FastME) or 4.5% (without FastME) of contigs with kf2vec, compared to 6.3% with the best‐performing CAFE measure and 7.7% for the next best, Cosine similarity. Moreover, the ratio of across‐genome to within‐genome distances (akin to the F statistic) from kf2vec without FastME was the highest, followed by Cosine similarity and kf2vec with FastME. Finally, we asked how often contigs form a monophyletic clade in the FastME tree inferred from kf2vec distances. Among 262 query genomes with more than one contig, 203 (77%) are monophyletic. Of the 59 non‐monophyletic cases, nine are a single extra taxon away from monophyly (Figure S11), five have two extra taxa, and ten have three extra taxa in their clade.
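The (pseudo) F statistic used above can be sketched for a single multi‐contig genome as follows; the pairwise‐distance input format (a mapping from contig pairs to distances) is a convenience chosen for this illustration.

```python
from itertools import combinations
from statistics import mean


def pseudo_f(distances, genome_of, genome):
    """(pseudo) F statistic for one multi-contig genome: mean distance
    from its contigs to other genomes' contigs, divided by the mean
    distance among its own contigs. Values > 1 indicate that the
    genome's contigs cluster together.

    `distances` maps frozenset contig pairs to distances and
    `genome_of` maps contig -> genome id (illustrative formats).
    """
    within, across = [], []
    for a, b in combinations(sorted(genome_of), 2):
        d = distances[frozenset((a, b))]
        in_a, in_b = genome_of[a] == genome, genome_of[b] == genome
        if in_a and in_b:
            within.append(d)
        elif in_a or in_b:
            across.append(d)
    return mean(across) / mean(within)
```

The figure reports the median of this statistic across all multi‐contig genomes.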
Another observation is that for a small subset of contigs (0.5%) that are unusually large (e.g., > 1.2 Mbp), the mean and eventually the median placement error increase. This is mostly due to a few queries exhibiting extremely high errors of 12, 16 and 18 edges, with several others showing errors of 5 edges or more (Figure S9A). We hypothesise that some of these unusually long contigs with high placement errors might be chimeric.
3.2.2. Placing Long Reads
We assessed how accurately kf2vec can place long sequencing reads without any additional preprocessing to remove errors (Figure 3D, Figure S10B). Patterns observed on contigs and chunks are also observed for long reads: the chunked model remains the most accurate for both the classifier and the embedder when used on HiFi reads, with classification accuracy higher in the chunked than in the unchunked setting (Figure S10B). Cladding is beneficial for reads ≥ 8 kbp, where sequencing reads presumably contained enough signal for the clade to be accurately detected (Figure 3D). However, short (< 8 kbp) reads have errors exceeding 5 edges with or without cladding, making the results less useful for downstream applications. Overall, kf2vec models can effectively compute distances from long sequencing reads for placement on a phylogenetic tree, performing essentially as well as when applied to contigs or fragmented genomes (Figure 3D, Figure S10B).
Evaluation with simulated nanopore long reads with variable error rates demonstrates that the WoL19 model is robust to moderate levels of sequencing noise, but not to very high error levels (Figure S12). One limitation is classification accuracy (into clades), which falls below 94% once the sequencing error exceeds 5%. The cladded model consistently outperforms the uncladded model in placement accuracy. Notably, for the cladded model, the placement error for long reads (≥ 10 kbp) remains within three edges of the correct placement when the sequencing error stays ≤ 5%; however, placement error increases substantially when the error rate reaches 10% or 15%, at which point every 7‐mer is expected to contain 0.7 or 1.05 errors, respectively.
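The expected error counts quoted above follow directly from the per‐base error rate: with independent errors at rate p, a k‐mer contains k·p errors in expectation and is error‐free with probability (1 − p)^k. A one‐line check:

```python
def expected_kmer_errors(k, error_rate):
    """Expected number of erroneous bases within one k-mer, and the
    probability that the k-mer is error-free, assuming independent
    per-base errors at the given rate."""
    return k * error_rate, (1 - error_rate) ** k
```

At k = 7, a 10% error rate gives 0.7 expected errors per 7‐mer and leaves only (0.9)^7 ≈ 48% of 7‐mers intact, which helps explain the sharp accuracy drop at 10%–15% error.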
3.2.3. Taxonomic Binning (CAMI2)
In the taxonomic classification task, kf2vec closely follows the accuracy of the best‐performing method, Kraken 2.0.8‐beta (Wood et al. 2019), with the exception of the species level, where kf2vec is clearly less accurate (Figure 4A,B). The third most accurate method is LSHVec, with all other methods trailing substantially behind. At the higher taxonomic levels, kf2vec and Kraken2 reach comparable accuracy, whereas all other methods remain lower (Figure 4A). In terms of purity, all methods except PhyloPythiaS+ produced highly pure bins at the species level, with purity increasing further at broader taxonomic ranks (Figure 4B). Completeness varies widely across methods. At the species level, kf2vec and LSHVec have similar performance (both around 91% pure) and are dominated by Kraken2. However, the completeness of kf2vec rapidly increases at the genus and family levels. Overall, kf2vec demonstrates the second‐best performance (next to Kraken2) across various metrics, achieving high accuracy while maintaining a good balance between purity and completeness. Note that kf2vec uses a smaller reference library here than Kraken2, representing approximately 36% of the original CAMI2 reference set. If we rebuild Kraken2 with the same subset of backbone genomes used to train the kf2vec model, its accuracy at the species level drops by 9% but still remains better than that of kf2vec; its purity remains essentially unchanged, with completeness slightly declining from 99% to 96%.
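For reference, the AMBER‐style purity and completeness reported here can be sketched per bin as follows; this is a simplification (real AMBER also handles unassigned bases and averages across bins and samples).

```python
def bin_purity(bin_bp_by_truth):
    """Purity of one predicted bin: base pairs of its dominant true
    label over all base pairs in the bin (simplified AMBER-style)."""
    total = sum(bin_bp_by_truth.values())
    return max(bin_bp_by_truth.values()) / total


def completeness(bin_bp_by_truth, truth, truth_total_bp):
    """Completeness: base pairs of `truth` recovered in this bin over
    the total base pairs belonging to `truth`."""
    return bin_bp_by_truth.get(truth, 0) / truth_total_bp
```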
FIGURE 4.

Taxonomic binning of CAMI2 marine GSA contigs. (A) Accuracy per sample at different taxonomic ranks for kf2vec compared to other benchmarked tools. (B) Average purity and completeness of bins at different taxonomic ranks. Labels were assigned using the NCBI taxonomy version provided to CAMI2 participants. Results are reported for contigs of length at least 10 kbp. Computation was performed over a filtered dataset (excluding the smallest bins by base pairs), as reported in the original study. Kraken* is Kraken2 v2.1.1 but using the same library as kf2vec (i.e., 50,752 genomes out of the 141,677 genomes used in CAMI2).
3.3. Scalability
We benchmarked kf2vec on the D1 dataset for all four variations of our model, measuring the running time and peak memory required to train classifier and embedder models (Table 3). Since each clade model is independent, we trained all clades simultaneously. The total training time for cladded models was 2–3× longer than for uncladded models. However, since clades can be trained in parallel, the wall‐clock time used to train the embedders was 6× lower with cladding than without. The memory required to train an uncladded model was comparable to the peak memory usage observed when training the corresponding classifier, in both the chunked and unchunked cases. Chunking also increased the running time: chunked models took roughly 4× longer for the classifier and 5–7× longer for an embedder than the unchunked models. This can be explained by batches of the chunked model including many more pairs, plus the extra time needed for computing sums and normalisation across chunks per batch. Chunking also required substantially more memory, reaching 60 G of RAM while training a classifier for cladded models or embedders for uncladded models. This is due to representing each reference genome as a matrix with one row per chunk (with the number of chunks on the order of hundreds) instead of a single k‐mer frequency vector. We added a --cap option that can be used to reduce memory usage by half by reading the input as unsigned 8‐bit integers instead of the unsigned 16‐bit format; this results in no loss of accuracy when all k‐mers appear at most 255 times in a chunk, which is reasonable for the chunk sizes used.
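The effect of the capping option can be illustrated with a simple clamp. This sketch only mirrors the idea (store per‐chunk counts as unsigned 8‐bit values, halving memory versus 16‐bit storage); kf2vec's actual I/O code may differ.

```python
def cap_counts(chunk_counts, cap=255):
    """Clamp per-chunk k-mer counts so they fit in unsigned 8-bit
    integers. Lossless whenever every k-mer occurs at most `cap` times
    in a chunk; otherwise large counts are saturated at `cap`.

    Illustrative sketch of the capping idea, not kf2vec's loader.
    """
    return [[min(c, cap) for c in row] for row in chunk_counts]
```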
TABLE 3.
Running time and memory consumption for training kf2vec models on the D1 dataset performed using 1 GPU NVIDIA‐GeForce‐RTX‐3090 with 128 GB of RAM and 4 CPU cores.
| Training time (h:min) | Peak memory (G) | |||
|---|---|---|---|---|
| Classifier | Embedder (mean/max/sum for cladded) | Classifier | Embedder (max for cladded) | |
| Cladded unchunked | 01:40 | 00:51/00:58/12:42 | 6.13 | 4.40 |
| Cladded chunked | 06:33 | 04:31/06:14/67:41 | 60.30 a | 9.62 |
| Uncladded unchunked | 0 | 05:51 | 0 | 7.95 |
| Uncladded chunked | 0 | 38:15 | 0 | 59.50 a |
Note: For cladded models, training for all clades is performed in parallel; for time, we report mean, max (wall‐clock time) and total (CPU time). For the uncladded model, a single number is reported for the embedder, and since uncladded models do not require a classifier, we leave those as 0.
When training is performed using the --cap parameter, memory consumption can be reduced to 32.90 G and 38.2 G for the classifier and embedder models, respectively. The effect of capping on the running time of classifier and embedder training is insignificant (05:49 and 37:03, respectively).
We also benchmarked the inference (i.e., query) time of kf2vec against Kraken2 on the CAMI2 (D7) dataset (Table S5). Kraken2 completed query taxonomic assignment approximately 4× faster than kf2vec (6.5 vs. 28.8 min). The running time of kf2vec is dominated (85%) by the k‐mer frequency calculation using JellyFish; the clade classification, inference and taxonomic classification steps are all very fast. The slower running time of kf2vec is offset by its dramatically lower memory usage: Kraken2 required 103 GB of RAM at query time, while kf2vec averaged only 2.2 GB with a peak usage of 4.2 GB. This represents more than a 20× reduction in memory consumption, making kf2vec accessible on systems where Kraken2 cannot be run due to memory constraints. Overall, these results highlight a clear trade‐off: while Kraken2 provides faster runtimes, kf2vec is markedly more memory‐efficient and better suited for large‐scale or resource‐limited environments where memory availability is a bottleneck (Table S5).
4. Discussion and Future Works
We proposed a new method for inferring phylogenetic distances from k‐mer frequencies. Unlike existing measures such as Co‐phylog and the CAFE metrics, we use a neural network to learn how to compute distances for a given phylogeny. For placing on a tree, the method was better than other k‐mer frequency measures, but not better than alignment‐based methods. Nevertheless, it is far easier to use than alignment‐based methods, and thus has practical value for placing long reads, contigs, genome skims and assembled genomes. Moreover, it can place all these disparate data types onto the same underlying tree using the chunking strategy. This ability to combine data from various sources onto the same tree can be powerful in downstream analyses (McDonald et al. 2023) and makes the method highly versatile in ecological studies, where samples and references may be available as different data types. The power of kf2vec compared to marker‐based, alignment‐based methods lies in its simplicity of use and versatility.
The accuracy of kf2vec depended on several factors. First, the most novel queries were, unsurprisingly, the most difficult to place in all scenarios. Additionally, the shortest contigs and reads (i.e., those shorter than 10 kbp) were not placed with high accuracy. Accuracy tended to improve with longer input, though gains were minimal beyond 30 kbp. Finally, sequencing errors, present in long reads but less prevalent in contigs, reduced placement accuracy only slightly when error rates were sufficiently small (e.g., see Figure S12 and compare Figure 3B vs. Figure 3D), but sequencing error at 10% or higher substantially reduced both classification and placement accuracy.
Our method can be used with or without cladding, to improve scalability, and with or without chunking, to improve accuracy on incomplete input. Cladding, while designed primarily for scalability, also impacted accuracy. If we could always assign queries to the correct clades, cladding would lead to consistent improvements in accuracy; however, due to classification errors, results were mixed, with cladding reducing accuracy for the shortest inputs (< 10 kbp) but improving it for others. Thus, for users mostly concerned with very short inputs and dealing with reference sets small enough that the distance matrix fits their available memory, we recommend uncladded chunked models. Chunking together with cladding is recommended when queries are moderately sized (e.g., 10–50 kbp) and the reference set is too large for the memory of the machine used for training (note that high memory consumption applies only to training, not inference). For sufficiently long inputs (> 50 kbp), chunking adds no benefit; it slightly degrades accuracy, increases the embedder training time by 5–7×, and adds the running time of the classification step at query time. The increased error arises because, during training on randomly sized chunks, the embedder must learn to map k‐mer frequencies from various chunks of the genome onto the same point in the embedding space, which reduces its precision for long chunks. Thus, for users interested in placing sufficiently large inputs (e.g., > 50 kbp), we recommend the unchunked version, with cladding to accommodate large reference sets and without cladding otherwise. Future work should explore an ensemble approach that automatically switches between ‘expert’ models trained on various chunk sizes to unburden users from making these decisions.
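These rules of thumb can be summarised as a simple decision rule. The helper below is purely illustrative (`recommend_model` is not part of the kf2vec interface), encoding the approximate thresholds discussed in the text:

```python
def recommend_model(query_len_bp, distance_matrix_fits_memory):
    """Hypothetical decision helper (not a kf2vec API); encodes the rules
    of thumb discussed above with their approximate thresholds."""
    # Chunking helps for short and moderate inputs but slightly hurts long ones.
    chunking = "chunked" if query_len_bp <= 50_000 else "unchunked"
    # Cladding is mainly needed when the reference distance matrix is too
    # large for the training machine's memory (inference needs little memory).
    cladding = "uncladded" if distance_matrix_fits_memory else "cladded"
    return f"{chunking}, {cladding}"

print(recommend_model(5_000, True))    # chunked, uncladded
print(recommend_model(30_000, False))  # chunked, cladded
print(recommend_model(80_000, False))  # unchunked, cladded
```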
The use of k‐mers for phylogenetic placement is not entirely new. RAPPAS (Linard et al. 2019) and IPK/EPIK (Romashchenko et al. 2023) solve the phylogenetic placement problem using k‐mers. However, both require the computation of informative phylo‐k‐mers, which in turn depend on a reference alignment of the backbone—an additional step that our method does not need. The recently developed method krepp (Sapci and Mirarab 2025) also uses k‐mers for placement, but relies on the presence/absence of long k‐mers as opposed to the frequency of short ones, as we do. None of these methods employs machine learning. Interestingly, the simple Kraken2 method (which is based on the presence/absence of long k‐mers) had better accuracy than our machine learning method on the CAMI2 data for taxonomic binning. The main advantages of kf2vec over Kraken2 are its much lower memory usage at inference time and the fact that it provides distances, which can be used for downstream tasks such as phylogeny inference (e.g., Figure 3C).
Our method was based on representation learning, a topic of active exploration in bioinformatics, with a wealth of methods developed for metagenomics (e.g., Menegaux and Vert 2019; Joulin et al. 2017; Liang et al. 2020; Georgiou et al. 2019; Nielsen et al. 2014; Kang et al. 2019; Lu, Chen, et al. 2017; Alneberg et al. 2014; Zhang et al. 2022; Liu et al. 2022; Woloszynek et al. 2019; Kutuzova, Nielsen, et al. 2024; Nissen et al. 2021; Kutuzova, Piera, et al. 2024; Pan et al. 2022; Wang et al. 2024; Zheng et al. 2019; Ma et al. 2022). Most of these methods, unlike ours, focus on tasks that neatly translate to standard classification or clustering tasks—taxonomic assignment, phenotype classification and binning of contigs. Both classification and clustering have standard loss functions adopted in these methods. What distinguishes our method from these earlier works is not the use of representation learning, but rather tackling the phylogenetic placement task. Thus, instead of using auto‐encoders or contrastive learning, we focus on metric learning based on phylogenetic distances (see Equation 2).
The existing machine learning methods have used k‐mers in two ways. Some, inspired by vector representations of words used in Natural Language Processing (NLP) (Mikolov et al. 2013; Qi et al. 2018), use vector representations of DNA k‐mers to create the input to downstream steps. These methods differ from our approach in that we do not embed k‐mers; rather, we treat each short k‐mer as a feature. FastDNA (Menegaux and Vert 2019) builds on the fastText algorithm (Joulin et al. 2017) to create an embedding for each observed k‐mer, sums all k‐mer embeddings in a read, and uses this sum vector as input to a linear classifier trained to minimise the cross‐entropy loss; the authors report good performance. With a similar objective, Liang et al. (2020) propose DeepMicrobes, a neural network with an architecture composed of an embedding of k‐mers, a bidirectional long short‐term memory (LSTM), and a self‐attention layer to classify reads at the species or genus level. The authors show that performance increases up to the largest k that allows fitting the model in the memory of the hardware used for training. Georgiou et al. (2019) managed to reduce the memory footprint of the model by using locality‐sensitive hashing (LSH) of k‐mers, allowing them to test models with larger values of k. In the work of Woloszynek et al. (2019), the objective is to retrieve the source environment of a metagenome (phenotype prediction) in addition to taxonomic profiling. A Skip‐gram word2vec model (Mikolov et al. 2013) is trained for k‐mer embeddings, and the SIF algorithm (Arora et al. 2017) is used to create read and sample embeddings; the results are used for clustering and classification. In contrast, Ren et al. (2022) use the k‐mer embeddings averaged over a sequence to compute distances and perform de novo phylogenetic reconstruction.
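To make the contrast with our feature‐based approach concrete, the sum‐of‐k‐mer‐embeddings representation used by FastDNA‐style methods can be sketched as follows (a minimal illustration, not FastDNA's code; `kmer_table` stands in for a hypothetical learned embedding lookup):

```python
def read_embedding(read, kmer_table, k=3):
    """FastDNA-style read representation (sketch): sum the learned embedding
    vector of every k-mer occurring in the read. `kmer_table` is a
    hypothetical dict mapping k-mer -> list of floats."""
    dim = len(next(iter(kmer_table.values())))
    total = [0.0] * dim
    for i in range(len(read) - k + 1):
        vec = kmer_table.get(read[i:i + k])  # unseen k-mers contribute nothing
        if vec is not None:
            for j, x in enumerate(vec):
                total[j] += x
    return total

# Toy 2-D embeddings for two 3-mers; the read "ACGT" contains each once.
table = {"ACG": [1.0, 0.0], "CGT": [0.0, 1.0]}
print(read_embedding("ACGT", table))  # [1.0, 1.0]
```

The resulting vector would then feed a linear classifier; kf2vec instead skips the per‐k‐mer embedding entirely and uses the frequency of each short k‐mer as an input feature.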
Other methods leverage the k‐mer frequency (a.k.a. composition) of the input sequences to create input features, as kf2vec does. These existing methods differ from kf2vec in applications and implementation details. Approaches such as Taxometer (Kutuzova, Nielsen, et al. 2024) use k‐mer composition or frequency vectors to conduct taxonomic assignment of the queries. Numerous modern machine learning–based tools including VAMB/TaxVAMB (Nissen et al. 2021; Kutuzova, Piera, et al. 2024), SemiBin (Pan et al. 2022), COMEBin (Wang et al. 2024), MaxBin (Yu‐Wei et al. 2016), SolidBin (Wang et al. 2019) and many others (Nielsen et al. 2014; Kang et al. 2019; Lu, Chen, et al. 2017; Alneberg et al. 2014; Zhang et al. 2022; Liu et al. 2022) use k‐mer frequency vectors as input to carry out metagenomic binning, posed essentially as a clustering task. Note, however, that nearly all these methods supplement k‐mer data with the additional information coming from contig co‐abundances (Kang et al. 2019; Alneberg et al. 2014; Nissen et al. 2021; Yu‐Wei et al. 2016; Albertsen et al. 2013; Imelfort et al. 2014), assembly graphs (Vijini Mallawaarachchi et al. 2020; Zhang and Zhang 2021; Lamurias et al. 2022), codon usage (Yu et al. 2018), GC content (Albertsen et al. 2013), single‐copy genes (Pan et al. 2022; Wang et al. 2024; Lin and Liao 2016; Sieber et al. 2018) and taxonomic information (Kutuzova, Piera, et al. 2024; Pan et al. 2022; Wang et al. 2019; Krause et al. 2008; Huson et al. 2011).
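For illustration, a k‐mer frequency (composition) vector of the kind these methods and kf2vec consume can be computed as follows. This is a minimal pure‐Python sketch; kf2vec itself counts k‐mers with JellyFish, and the exact k and normalisation may differ:

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Lexicographic minimum of a k-mer and its reverse complement,
    so that a sequence and its reverse complement give the same counts."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)

def kmer_frequency_vector(seq, k=4):
    """Normalised canonical k-mer frequency vector for one sequence."""
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    index = {km: i for i, km in enumerate(keys)}
    counts = [0] * len(keys)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k].upper()
        if set(km) <= set("ACGT"):  # skip windows with ambiguous bases
            counts[index[canonical(km)]] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]
```

For k = 4 this yields a 136‐dimensional vector (256 k‐mers collapsed into canonical pairs plus 16 reverse‐complement palindromes); such fixed‐length vectors are what the neural network embeds.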
The most relevant methods to ours are those that directly address metric learning. DEPP, as noted earlier, uses the same loss function but works on alignments. SENSE (Zheng et al. 2019) minimises the mean squared error between alignment distances and pairwise distances in the embedding space; in this sense, their loss function has connections to ours. However, SENSE employs a convolutional architecture and is applied to short amplicon sequences; it is not designed for long sequences or phylogenetic placement. Instead of using exact reference distances, MELT (Ma et al. 2022) adopts the triplet training framework to learn from relative distances. Their method requires a set of known comparisons between inputs (e.g., full genomic sequences, gene expression profiles, or abundance profiles), making statements of the form ‘A is closer to B than C is’. Given such data, it creates a shared embedding space, where similar pairs are pulled closer and dissimilar pairs are pushed apart. The triplet information can come from either taxonomic information or other biological sources. The trained network can be used to project new samples into the embedding space to enable clustering and classification.
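The relative‐distance objective behind such triplet methods can be sketched as a standard triplet margin loss (a generic, simplified version for illustration, not MELT's implementation):

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss on point embeddings (lists of coordinates):
    the anchor should be at least `margin` closer to the positive than
    to the negative. A generic sketch of this family of objectives."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# 'A is closer to B than C is': the loss is zero when the constraint is
# satisfied by more than the margin, and positive otherwise.
print(triplet_loss([0.0], [0.1], [5.0]))  # 0.0
print(triplet_loss([0.0], [2.0], [1.0]))  # 2.0
```

Unlike our loss, which regresses onto exact tree path lengths, this objective only uses the ordering of distances within each triplet.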
The kf2vec approach also has several limitations, which can be addressed in the future. Our model requires substantial resources for training; however, at inference time, it achieves high speed and very low memory usage. In particular, the classifier still requires substantial memory to train for very large reference sets; this issue can be ameliorated in the future by training the classifier in several batches, each made of a subset of the data. Furthermore, our chunking strategy was relatively simple and based on a single distribution for the number of chunks used; future work can explore whether a base model could be fine‐tuned based on the length distribution of a particular input. In addition, we can explore training different models (experts) for queries with different length ranges, freeing the user from choosing between chunked and unchunked models at inference time. In this work, we explored a fixed k, which worked best for microbial genomes. Many papers have attempted to use machine learning for virus classification (Ali et al. 2023; Raju et al. 2022; Remita et al. 2017), including using k‐mer frequencies (Solis‐Reyes et al. 2018; Ren et al. 2017). Our default approach will likely not work well for the much smaller virus genomes. However, the same methodology may still apply, perhaps by combining frequencies from multiple k values into a single model; future work should explore this direction. Finally, kf2vec currently generates only one embedding and one placement per input, without quantifying uncertainty. Future work should explore doing so using resampling or subsampling of the data with appropriate corrections (Rachtman et al. 2022).
The machine learning aspects could also improve. As Jiang, Tabaghi, and Mirarab (2022) have shown, using hyperbolic geometry can enable lower‐dimensional embeddings with high fidelity; however, hyperbolic geometry poses its own technical challenges. Furthermore, although we employed simple architectures, future work should investigate whether more complex models can further improve accuracy. Additionally, integrating more advanced methods for incorporating taxonomic information could enhance taxonomic identification performance, particularly at lower taxonomic ranks.
Author Contributions
All authors conceived the idea. S.M. developed a theoretical model. E.R. implemented the software and performed experiments. Y.J. completed the analysis with DEPP. All authors contributed to the analyses of data and the writing. All authors read and approved the final manuscript.
Conflicts of Interest
The authors declare no conflicts of interest.
Supporting information
Appendices S1–S3: men70055‐sup‐0001‐AppendicesS1‐S3.pdf.
Rachtman, E. , Jiang Y., and Mirarab S.. 2025. “Machine Learning Enables Alignment‐Free Distance Calculation and Phylogenetic Placement Using k‐Mer Frequencies.” Molecular Ecology Resources 25, no. 8: e70055. 10.1111/1755-0998.70055.
Handling Editor: Alana Alexander
Funding: This study was supported by the National Institutes of Health grant 1R35GM142725 and a Minderoo Foundation research grant to S.M. This work used Expanse (Strande et al. 2021) at the San Diego Supercomputer Center through allocation ASC150046 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services and Support (ACCESS) program, which is supported by U.S. National Science Foundation grants #2138259, #2138286, #2138307, #2137603 and #2138296. This work used NRP/Nautilus and was supported in part by National Science Foundation (NSF) awards CNS‐1730158, ACI‐1540112, ACI‐1541349, OAC‐1826967, OAC‐2112167, CNS‐2100237 and CNS‐2120019. This research received support through Schmidt Sciences LLC.
Data Availability Statement
kf2vec was implemented in PyTorch (v.1.10.2), and CUDA was used when running on an NVIDIA GeForce RTX 3090 GPU. The software is publicly available at https://github.com/noraracht/kf2vec, and a permanent, citable version is archived at Zenodo (https://doi.org/10.5281/zenodo.17201944). Raw data, scripts and summary data tables are publicly available at https://github.com/noraracht/kf2vec_data. Additionally, data repositories are stored in Zenodo (https://doi.org/10.5281/zenodo.17203745). A detailed description of the genomic datasets used in our experiments, accession numbers of the assemblies, and the exact commands are provided in the Supporting Information.
References
- Albertsen, M. , Hugenholtz P., Skarshewski A., Nielsen K. L., Tyson G. W., and Nielsen P. H.. 2013. “Genome Sequences of Rare, Uncultured Bacteria Obtained by Differential Coverage Binning of Multiple Metagenomes.” Nature Biotechnology 31, no. 6: 533–538. 10.1038/nbt.2579. [DOI] [PubMed] [Google Scholar]
- Ali, S. , Bello B., Chourasia P., et al. 2023. “Virus2Vec: Viral Sequence Classification Using Machine Learning.” http://arxiv.org/abs/2304.12328. arXiv:2304.12328 [q‐bio].
- Alneberg, J. , Bjarnason B. S., De Bruijn I., et al. 2014. “Binning Metagenomic Contigs by Coverage and Composition.” Nature Methods 11, no. 11: 1144–1146. 10.1038/nmeth.3103. [DOI] [PubMed] [Google Scholar]
- Arora, S. , Liang Y., and Ma T.. 2017. “A Simple but Tough‐to‐Beat Baseline for Sentence Embeddings.” In ICLR.
- Asnicar, F. , Thomas A. M., Beghini F., et al. 2020. “Precise Phylogenetic Analysis of Microbial Isolates and Genomes From Metagenomes Using PhyloPhlAn 3.0.” Nature Communications 11, no. 1: 2500. 10.1038/s41467-020-16366-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Balaban, M. , Jiang Y., Roush D., Zhu Q., and Mirarab S.. 2022. “Fast and Accurate Distance‐Based Phylogenetic Placement Using Divide and Conquer.” Molecular Ecology Resources 22, no. 3: 1213–1227. 10.1111/1755-0998.13527. [DOI] [PubMed] [Google Scholar]
- Balaban, M. , Jiang Y., Zhu Q., McDonald D., Knight R., and Mirarab S.. 2023. “Generation of Accurate, Expandable Phylogenomic Trees With uDance.” Nature Biotechnology 42, no. 5: 768–777. 10.1038/s41587-023-01868-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Balaban, M. , Moshiri N., Mai U., Jia X., and Mirarab S.. 2019. “TreeCluster: Clustering Biological Sequences Using Phylogenetic Trees.” PLoS One 14, no. 8: e0221068. 10.1371/journal.pone.0221068. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Balaban, M. , Sarmashghi S., and Mirarab S.. 2020. “APPLES: Scalable Distance‐Based Phylogenetic Placement With or Without Alignments.” Systematic Biology 69, no. 3: 566–578. 10.1093/sysbio/syz063. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Barbera, P. , Kozlov A. M., Czech L., et al. 2019. “EPA‐Ng: Massively Parallel Evolutionary Placement of Genetic Sequences.” Systematic Biology 68, no. 2: 365–369. 10.1093/sysbio/syy054. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Berger, S. A. , Krompass D., and Stamatakis A.. 2011. “Performance, Accuracy, and Web Server for Evolutionary Placement of Short Sequence Reads Under Maximum Likelihood.” Systematic Biology 60, no. 3: 291–302. 10.1093/sysbio/syr010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Beyer, W. A. , Stein M. L., Smith T. F., and Ulam S. M.. 1974. “A Molecular Sequence Metric and Evolutionary Trees.” Mathematical Biosciences 19, no. 1–2: 9–25. 10.1016/0025-5564(74)90028-5. [DOI] [Google Scholar]
- Blanke, M. , and Morgenstern B.. 2020. “Phylogenetic Placement of Short Reads Without Sequence Alignment.” bioRxiv 2020‐10. [DOI] [PMC free article] [PubMed]
- Bogusz, M. , and Whelan S.. 2016. “Phylogenetic Tree Estimation With and Without Alignment: New Distance Methods and Benchmarking.” Systematic Biology 66, no. 2: 218–231. 10.1093/sysbio/syw074. [DOI] [PubMed] [Google Scholar]
- Bohmann, K. , Mirarab S., Bafna V., and Gilbert M.. 2020. “Beyond DNA Barcoding: The Unrealised Potential of Genome Skim Data in Sample Identification.” Molecular Ecology 29, no. 14: 2521–2534. 10.1111/mec.15507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bushnell, B. , Rood J., and Singer E.. 2017. “BBMerge—Accurate Paired Shotgun Read Merging via Overlap.” PLoS One 12, no. 10: 1–15. 10.1371/journal.pone.0185056. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chesters, D. 2016. “Construction of a Species‐Level Tree of Life for the Insects and Utility in Taxonomic Profiling.” Systematic Biology 66, no. 3: 426–439. 10.1093/sysbio/syw099. [DOI] [PMC free article] [PubMed] [Google Scholar]
- de Vienne, D. M. , Ollier S., and Aguileta G.. 2012. “Phylo‐MCOA: A Fast and Efficient Method to Detect Outlier Genes and Species in Phylogenomics Using Multiple Co‐Inertia Analysis.” Molecular Biology and Evolution 29, no. 6: 1587–1598. 10.1093/molbev/msr317. [DOI] [PubMed] [Google Scholar]
- Fitch, W. M. , and Margoliash E.. 1967. “Construction of Phylogenetic Trees.” Science 155, no. 3760: 279–284. 10.1126/science.155.3760.279. [DOI] [PubMed] [Google Scholar]
- Georgiou, A. , Fortuin V., Mustafa H., and Rätsch G.. 2019. “META{2}: Memory‐Efficient Taxonomic Classification and Abundance Estimation for Metagenomics With Deep Learning.” http://arxiv.org/abs/1909.13146.
- Hasan, N. B. , Balaban M., Biswas A., Bayzid M. S., and Mirarab S.. 2022. “Distance‐Based Phylogenetic Placement With Statistical Support.” Biology 11, no. 8: 1212. 10.3390/biology11081212. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Höhl, M. , and Ragan M. A.. 2007. “Is Multiple‐Sequence Alignment Required for Accurate Inference of Phylogeny?” Systematic Biology 56, no. 2: 206–221. 10.1080/10635150701294741. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huson, D. H. , Mitra S., Ruscheweyh H.‐J., Weber N., and Schuster S. C.. 2011. “Integrative Analysis of Environmental Sequences Using MEGAN4.” Genome Research 21, no. 9: 1552–1560. 10.1101/gr.120618.111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Imelfort, M. , Parks D., Woodcroft B. J., Dennis P., Hugenholtz P., and Tyson G. W.. 2014. “GroopM: An Automated Tool for the Recovery of Population Genomes From Related Metagenomes.” PeerJ 2: e603. 10.7717/peerj.603. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Janssen, S. , McDonald D., Gonzalez A., et al. 2018. “Phylogenetic Placement of Exact Amplicon Sequences Improves Associations With Clinical Information.” MSystems 3, no. 3: e00021‐18. 10.1128/mSystems.00021-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jiang, Y. , Balaban M., Zhu Q., and Mirarab S.. 2022. “DEPP: Deep Learning Enables Extending Species Trees Using Single Genes.” bioRxiv: The Preprint Server for Biology 2021.01.22.427808. 10.1101/2021.01.22.427808. [DOI] [PMC free article] [PubMed]
- Jiang, Y. , McDonald D., Knight R., and Mirarab S.. 2023. “Scaling Deep Phylogenetic Embedding to Ultra‐Large Reference Trees: A Tree‐Aware Ensemble Approach.” bioRxiv: The Preprint Server for Biology. 10.1101/2023.03.27.534201. [DOI] [PMC free article] [PubMed]
- Jiang, Y. , Tabaghi P., and Mirarab S.. 2022. “Learning Hyperbolic Embedding for Phylogenetic Tree Placement and Updates.” Biology 11, no. 9: 1256. 10.3390/biology11091256. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Joulin, A. , Grave E., Bojanowski P., and Mikolov T.. 2017. “Bag of Tricks for Efficient Text Classification.” In EACL.
- Kang, D. D. , Li F., Kirton E., et al. 2019. “MetaBAT 2: An Adaptive Binning Algorithm for Robust and Efficient Genome Reconstruction From Metagenome Assemblies.” PeerJ 7: e7359. 10.7717/peerj.7359. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim, J. , and Steinegger M.. 2024. “Metabuli: Sensitive and Specific Metagenomic Classification via Joint Analysis of Amino Acid and DNA.” Nature Methods 21, no. 6: 971–973. 10.1038/s41592-024-02273-y. [DOI] [PubMed] [Google Scholar]
- Kingma, D. P. , and Ba J.. 2014. “Adam: A Method for Stochastic Optimization.” http://arxiv.org/abs/1412.6980.
- Krause, L. , Diaz N. N., Goesmann A., et al. 2008. “Phylogenetic Classification of Short Environmental DNA Fragments.” Nucleic Acids Research 36, no. 7: 2230–2239. 10.1093/nar/gkn038. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kutuzova, S. , Nielsen M., Piera P., Nissen J. N., and Rasmussen S.. 2024. “Taxometer: Improving Taxonomic Classification of Metagenomics Contigs.” Nature Communications 15, no. 1: 8357. 10.1038/s41467-024-52771-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kutuzova, S. , Piera P., Nielsen K. N., et al. 2024. “Binning Meets Taxonomy: TaxVAMB Improves Metagenome Binning Using Bi‐Modal Variational Autoencoder.”
- Lamurias, A. , Sereika M., Albertsen M., Hose K., and Nielsen T. D.. 2022. “Metagenomic Binning With Assembly Graph Embeddings.” Bioinformatics 38, no. 19: 4481–4487. 10.1093/bioinformatics/btac557. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lau, A. K. , Dörrer S., Leimeister C. A., Bleidorn C., and Morgenstern B.. 2019. “Read‐SpaM: Assembly‐Free and Alignment‐Free Comparison of Bacterial Genomes With Low Sequencing Coverage.” BMC Bioinformatics 20, no. S20: 638. 10.1186/s12859-019-3205-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Layer, M. , and Rhodes J. A.. 2017. “Phylogenetic Trees and Euclidean Embeddings.” Journal of Mathematical Biology 74, no. 1–2: 99–111. 10.1007/s00285-016-1018-0. [DOI] [PubMed] [Google Scholar]
- Lefort, V. , Desper R., and Gascuel O.. 2015. “FastME 2.0: A Comprehensive, Accurate, and Fast Distance‐Based Phylogeny Inference Program.” Molecular Biology and Evolution 32, no. 10: 2798–2800. 10.1093/molbev/msv150. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li, Y. , Steenwyk J. L., Chang Y., et al. 2021. “A Genome‐Scale Phylogeny of the Kingdom Fungi.” Current Biology 31, no. 8: 1653–1665.e5. 10.1016/j.cub.2021.01.074. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liang, Q. , Bible P. W., Liu Y., Zou B., and Wei L.. 2020. “DeepMicrobes: Taxonomic Classification for Metagenomics With Deep Learning.” NAR Genomics and Bioinformatics 2, no. 1: lqaa009. 10.1093/nargab/lqaa009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin, H.‐H. , and Liao Y.‐C.. 2016. “Accurate Binning of Metagenomic Contigs via Automated Clustering Sequences Using Information of Genomic Signatures and Marker Genes.” Scientific Reports 6, no. 1: 24175. 10.1038/srep24175. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Linard, B. , Swenson K., and Pardi F.. 2019. “Rapid Alignment‐Free Phylogenetic Identification of Metagenomic Sequences.” Bioinformatics (Oxford, England) 35, no. 18: 3303–3312. 10.1093/bioinformatics/btz068. [DOI] [PubMed] [Google Scholar]
- Liu, C.‐C. , Dong S.‐S., Chen J.‐B., et al. 2022. “MetaDecoder: A Novel Method for Clustering Metagenomic Contigs.” Microbiome 10, no. 1: 46. 10.1186/s40168-022-01237-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lu, Y. Y. , Bai J., Wang Y., Wang Y., and Sun F.. 2021. “CRAFT: Compact Genome Representation Toward Large‐Scale Alignment‐Free daTabase.” Bioinformatics (Oxford, England) 37, no. 2: 155–161. 10.1093/bioinformatics/btaa699. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lu, Y. Y. , Chen T., Fuhrman J. A., and Sun F.. 2017. “COCACOLA: Binning Metagenomic Contigs Using Sequence COmposition, Read CoverAge, CO‐Alignment and Paired‐End Read LinkAge.” Bioinformatics 33, no. 6: 791–798. 10.1093/bioinformatics/btw290. [DOI] [PubMed] [Google Scholar]
- Lu, Y. Y. , Tang K., Ren J., Fuhrman J. A., Waterman M. S., and Sun F.. 2017. “CAFE: aCcelerated Alignment‐FrEe Sequence Analysis.” Nucleic Acids Research 45, no. W1: W554–W559. 10.1093/nar/gkx351. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ma, Z. , Lu Y. Y., Wang Y., et al. 2022. “Metric Learning for Comparing Genomic Data With Triplet Network.” Briefings in Bioinformatics 23, no. 5. 10.1093/bib/bbac345. [DOI] [PubMed] [Google Scholar]
- Mai, U. , and Mirarab S.. 2022. “Completing Gene Trees Without Species Trees in Sub‐Quadratic Time.” Bioinformatics 38, no. 6: 1532–1541. 10.1093/bioinformatics/btab875. [DOI] [PubMed] [Google Scholar]
- Marçais, G. , and Kingsford C.. 2011. “A Fast, Lock‐Free Approach for Efficient Parallel Counting of Occurrences of k‐Mers.” Bioinformatics (Oxford, England) 27, no. 6: 764–770. 10.1093/bioinformatics/btr011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Matsen, F. A. 2015. “Phylogenetics and the Human Microbiome.” Systematic Biology 64, no. 1: e26–e41. 10.1093/sysbio/syu053. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Matsen, F. A. , Kodner R. B., and Armbrust E. V.. 2010. “Pplacer: Linear Time Maximum‐Likelihood and Bayesian Phylogenetic Placement of Sequences Onto a Fixed Reference Tree.” BMC Bioinformatics 11, no. 1: 538. 10.1186/1471-2105-11-538. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McDonald, D. , Jiang Y., Balaban M., et al. 2023. “Greengenes2 Unifies Microbial Data in a Single Reference Tree.” Nature Biotechnology. 10.1038/s41587-023-01845-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Menegaux, R. , and Vert J.‐P.. 2019. “Continuous Embeddings of DNA Sequencing Reads and Application to Metagenomics.” Journal of Computational Biology 26, no. 6: 509–518. 10.1089/cmb.2018.0174. [DOI] [PubMed] [Google Scholar]
- Meyer, F. , Fritz A., Deng Z.‐L., et al. 2022. “Critical Assessment of Metagenome Interpretation: The Second Round of Challenges.” Nature Methods 19, no. 4: 429–440. 10.1038/s41592-022-01431-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Meyer, F. , Hofmann P., Belmann P., et al. 2018. “AMBER: Assessment of Metagenome BinnERs.” GigaScience 7, no. 6: giy069. 10.1093/gigascience/giy069. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mikolov, T. , Sutskever I., Chen K., Corrado G., and Dean J.. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” http://arxiv.org/abs/1310.4546.
- Nielsen, H. B. , Almeida M., Juncker A. S., et al. 2014. “Identification and Assembly of Genomes and Genetic Elements in Complex Metagenomic Samples Without Using Reference Genomes.” Nature Biotechnology 32, no. 8: 822–828. 10.1038/nbt.2939. [DOI] [PubMed] [Google Scholar]
- Nissen, J. N. , Johansen J., Allesøe R. L., et al. 2021. “Improved Metagenome Binning and Assembly Using Deep Variational Autoencoders.” Nature Biotechnology 39, no. 5: 555–560. 10.1038/s41587-020-00777-4. [DOI] [PubMed] [Google Scholar]
- Ondov, B. D. , Treangen T. J., Melsted P., et al. 2016. “Mash: Fast Genome and Metagenome Distance Estimation Using MinHash.” Genome Biology 17, no. 1: 132. 10.1186/s13059-016-0997-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ono, Y. , Hamada M., and Asai K.. 2022. “PBSIM3: A Simulator for All Types of PacBio and ONT Long Reads.” NAR Genomics and Bioinformatics 4, no. 4: lqac092. 10.1093/nargab/lqac092. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pan, S. , Zhu C., Zhao X.‐M., and Coelho L. P.. 2022. “A Deep Siamese Neural Network Improves Metagenome‐Assembled Genomes in Microbiome Datasets Across Different Environments.” Nature Communications 13, no. 1: 2326. 10.1038/s41467-022-29843-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Paszke, A. , Gross S., Massa F., et al. 2019. “PyTorch: An Imperative Style, High‐Performance Deep Learning Library.” In Advances in Neural Information Processing Systems, edited by Wallach H., Larochelle H., Beygelzimer A., dAlché Buc F., Fox E., and Garnett R., vol. 32. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740‐Paper.pdf. [Google Scholar]
- Qi, Y. , Sachan D. S., Felix M., Padmanabhan S. J., and Neubig G.. 2018. “When and Why Are Pre‐Trained Word Embeddings Useful for Neural Machine Translation?” http://arxiv.org/abs/1804.06323.
- Rachtman, E. , Sarmashghi S., Bafna V., and Mirarab S.. 2022. “Quantifying the Uncertainty of Assembly‐Free Genome‐Wide Distance Estimates and Phylogenetic Relationships Using Subsampling.” Cell Systems 13, no. 10: 817–829.e3. 10.1016/j.cels.2022.06.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raju, R. S. , Al Nahid A., Dev P. C., and Islam R.. 2022. “VirusTaxo: Taxonomic Classification of Viruses From the Genome Sequence Using k‐Mer Enrichment.” Genomics 114, no. 4: 110414. 10.1016/j.ygeno.2022.110414. [DOI] [PubMed] [Google Scholar]
- Reinert, G. , Chew D., Sun F., and Waterman M. S.. 2009. “Alignment‐Free Sequence Comparison (I): Statistics and Power.” Journal of Computational Biology 16, no. 12: 1615–1634. 10.1089/cmb.2009.0198. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Remita, M. A. , Halioui A., Diouara A. A. M., Daigle B., Kiani G., and Diallo A. B.. 2017. “A Machine Learning Approach for Viral Genome Classification.” BMC Bioinformatics 18, no. 1: 208. 10.1186/s12859-017-1602-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ren, J. , Ahlgren N. A., Lu Y. Y., Fuhrman J. A., and Sun F.. 2017. “VirFinder: A Novel k‐Mer Based Tool for Identifying Viral Sequences From Assembled Metagenomic Data.” Microbiome 5, no. 1: 69. 10.1186/s40168-017-0283-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ren, J. , Bai X., Lu Y. Y., et al. 2018. “Alignment‐Free Sequence Analysis and Applications.” Annual Review of Biomedical Data Science 1, no. 1: 93–114. 10.1146/annurev-biodatasci-080917-013431. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ren, R. , Yin C., and Yau S. S.‐T.. 2022. “kmer2vec: A Novel Method for Comparing DNA Sequences by word2vec Embedding.” Journal of Computational Biology 29, no. 9: 1001–1021. 10.1089/cmb.2021.0536. [DOI] [PubMed] [Google Scholar]
- Robinson, D. F. , and Foulds L. R.. 1981. “Comparison of Phylogenetic Trees.” Mathematical Biosciences 53, no. 1–2: 131–147. [Google Scholar]
- Romashchenko, N. , Linard B., Pardi F., and Rivals E.. 2023. “EPIK: Precise and Scalable Evolutionary Placement With Informative Ik/i‐Mers.” Bioinformatics (Oxford, England) 39, no. 12. 10.1093/bioinformatics/btad692. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sapci, A. O. B., and Mirarab S. 2025. “A k‐Mer‐Based Maximum Likelihood Method for Estimating Distances of Reads to Genomes Enables Genome‐Wide Phylogenetic Placement.” bioRxiv 2025.01.20.633730. 10.1101/2025.01.20.633730.
- Sarmashghi, S., Bohmann K., Gilbert M. T. P., Bafna V., and Mirarab S. 2019. “Skmer: Assembly‐Free and Alignment‐Free Sample Identification Using Genome Skims.” Genome Biology 20, no. 1: 34. 10.1186/s13059-019-1632-4.
- Sayers, E. W., Cavanaugh M., Clark K., Ostell J., Pruitt K. D., and Karsch‐Mizrachi I. 2018. “GenBank.” Nucleic Acids Research 47, no. D1: D94–D99. 10.1093/nar/gky989.
- Shah, N., Molloy E. K., Pop M., and Warnow T. 2021. “TIPP2: Metagenomic Taxonomic Profiling Using Phylogenetic Markers.” Bioinformatics 37, no. 13: 1839–1845. 10.1093/bioinformatics/btab023.
- Shaw, J., and Yu Y. W. 2023. “Fast and Robust Metagenomic Sequence Comparison Through Sparse Chaining With Skani.” Nature Methods 20, no. 11: 1661–1665. 10.1038/s41592-023-02018-3.
- Sieber, C. M. K., Probst A. J., Sharrar A., et al. 2018. “Recovery of Genomes From Metagenomes via a Dereplication, Aggregation and Scoring Strategy.” Nature Microbiology 3, no. 7: 836–843. 10.1038/s41564-018-0171-1.
- Solis‐Reyes, S., Avino M., Poon A., and Kari L. 2018. “An Open‐Source k‐Mer Based Machine Learning Tool for Fast and Accurate Subtyping of HIV‐1 Genomes.” PLoS One 13, no. 11: e0206409. 10.1371/journal.pone.0206409.
- Stamatakis, A. 2014. “RAxML Version 8: A Tool for Phylogenetic Analysis and Post‐Analysis of Large Phylogenies.” Bioinformatics (Oxford, England) 30, no. 9: 1312–1313. 10.1093/bioinformatics/btu033.
- Stark, M., Berger S. A., Stamatakis A., and von Mering C. 2010. “MLTreeMap–Accurate Maximum Likelihood Placement of Environmental DNA Sequences Into Taxonomic and Functional Reference Phylogenies.” BMC Genomics 11, no. 1: 461. 10.1186/1471-2164-11-461.
- Tang, K., Ren J., and Sun F. 2019. “Afann: Bias Adjustment for Alignment‐Free Sequence Comparison Based on Sequencing Data Using Neural Network Regression.” Genome Biology 20, no. 1: 266. 10.1186/s13059-019-1872-3.
- Turakhia, Y., Thornlow B., Hinrichs A. S., et al. 2021. “Ultrafast Sample Placement on Existing tRees (UShER) Enables Real‐Time Phylogenetics for the SARS‐CoV‐2 Pandemic.” Nature Genetics 53, no. 6: 809–816. 10.1038/s41588-021-00862-7.
- Mallawaarachchi, V., Wickramarachchi A., and Lin Y. 2020. “GraphBin: Refined Binning of Metagenomic Contigs Using Assembly Graphs.” Bioinformatics 36, no. 11: 3307–3313. 10.1093/bioinformatics/btaa180.
- Vinga, S., and Almeida J. 2003. “Alignment‐Free Sequence Comparison: A Review.” Bioinformatics 19, no. 4: 513–523. 10.1093/bioinformatics/btg005.
- Wang, Z., Wang Z., Lu Y. Y., Sun F., and Zhu S. 2019. “SolidBin: Improving Metagenome Binning With Semi‐Supervised Normalized Cut.” Bioinformatics 35, no. 21: 4229–4238. 10.1093/bioinformatics/btz253.
- Wang, Z., You R., Han H., Liu W., Sun F., and Zhu S. 2024. “Effective Binning of Metagenomic Contigs Using Contrastive Multi‐View Representation Learning.” Nature Communications 15, no. 1: 585. 10.1038/s41467-023-44290-z.
- Warnow, T. 2017. Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation. Cambridge University Press.
- Washburne, A. D., Morton J. T., Sanders J., et al. 2018. “Methods for Phylogenetic Analysis of Microbiome Data.” Nature Microbiology 3, no. 6: 652–661. 10.1038/s41564-018-0156-0.
- Wedell, E., Cai Y., and Warnow T. 2021. “Scalable and Accurate Phylogenetic Placement Using pplacer‐XR.” In Algorithms for Computational Biology. AlCoB 2021. Lecture Notes in Computer Science, 94–105. Springer, Cham. 10.1007/978-3-030-74432-8_7.
- Wedell, E., Cai Y., and Warnow T. 2023. “SCAMPP: Scaling Alignment‐Based Phylogenetic Placement to Large Trees.” IEEE/ACM Transactions on Computational Biology and Bioinformatics 20, no. 2: 1417–1430. 10.1109/TCBB.2022.3170386.
- Shen, W., Le S., Li Y., and Hu F. 2016. “SeqKit: A Cross‐Platform and Ultrafast Toolkit for FASTA/Q File Manipulation.” PLoS One 11, no. 10: 1–10. 10.1371/journal.pone.0163962.
- Woloszynek, S., Zhao Z., Chen J., and Rosen G. L. 2019. “16S rRNA Sequence Embeddings: Meaningful Numeric Feature Representations of Nucleotide Sequences That Are Convenient for Downstream Analyses.” PLoS Computational Biology 15, no. 2: e1006721. 10.1371/journal.pcbi.1006721.
- Wood, D. E., Lu J., and Langmead B. 2019. “Improved Metagenomic Analysis With Kraken 2.” Genome Biology 20, no. 1: 257. 10.1186/s13059-019-1891-0.
- Yue, Y., Huang H., Qi Z., et al. 2020. “Evaluating Metagenomics Tools for Genome Binning With Real Metagenomic Datasets and CAMI Datasets.” BMC Bioinformatics 21, no. 1: 334. 10.1186/s12859-020-03667-3.
- Yu, G., Jiang Y., Wang J., Zhang H., and Luo H. 2018. “BMC3C: Binning Metagenomic Contigs Using Codon Usage, Sequence Composition and Read Coverage.” Bioinformatics 34, no. 24: 4172–4179. 10.1093/bioinformatics/bty519.
- Bussi, Y., Kapon R., and Reich Z. 2021. “Large‐Scale k‐Mer‐Based Analysis of the Informational Properties of Genomes, Comparative Genomics and Taxonomy.” PLoS One 16, no. 10: e0258693. 10.1371/journal.pone.0258693.
- Wu, Y.‐W., Simmons B. A., and Singer S. W. 2016. “MaxBin 2.0: An Automated Binning Algorithm to Recover Genomes From Multiple Metagenomic Datasets.” Bioinformatics 32, no. 4: 605–607. 10.1093/bioinformatics/btv638.
- Zhang, P., Jiang Z., Wang Y., and Li Y. 2022. “CLMB: Deep Contrastive Learning for Robust Metagenomic Binning.” In Research in Computational Molecular Biology. Lecture Notes in Computer Science, edited by Pe'er I., vol. 13278, 326–348. Springer International Publishing. 10.1007/978-3-031-04749-7_23.
- Zhang, Z., and Zhang L. 2021. “METAMVGL: A Multi‐View Graph‐Based Metagenomic Contig Binning Algorithm by Integrating Assembly and Paired‐End Graphs.” BMC Bioinformatics 22, no. S10: 378. 10.1186/s12859-021-04284-4.
- Zheng, W., Yang L., Genco R. J., Wactawski‐Wende J., Buck M., and Sun Y. 2019. “SENSE: Siamese Neural Network for Sequence Embedding and Alignment‐Free Comparison.” Bioinformatics (Oxford, England) 35, no. 11: 1820–1828. 10.1093/bioinformatics/bty887.
- Zhu, Q., Mai U., Pfeiffer W., et al. 2019. “Phylogenomics of 10,575 Genomes Reveals Evolutionary Proximity Between Domains Bacteria and Archaea.” Nature Communications 10, no. 1: 5477. 10.1038/s41467-019-13443-4.
- Zielezinski, A., Girgis H. Z., Bernard G., et al. 2019. “Benchmarking of Alignment‐Free Sequence Comparison Methods.” Genome Biology 20, no. 1: 1–44. 10.1186/s13059-019-1755-7.
- Zielezinski, A., Vinga S., Almeida J., and Karlowski W. M. 2017. “Alignment‐Free Sequence Comparison: Benefits, Applications, and Tools.” Genome Biology 18, no. 1: 186. 10.1186/s13059-017-1319-7.
Associated Data
Supplementary Materials
Appendices S1–S3: men70055‐sup‐0001‐AppendicesS1‐S3.pdf.
Data Availability Statement
kf2vec is implemented in PyTorch (v.1.10.2), with CUDA used when running on an NVIDIA GeForce RTX 3090 GPU. The software is publicly available at https://github.com/noraracht/kf2vec, and a permanent, citable version is archived at Zenodo (https://doi.org/10.5281/zenodo.17201944). Raw data, scripts, and summary data tables are publicly available at https://github.com/noraracht/kf2vec_data, and the data are also archived at Zenodo (https://doi.org/10.5281/zenodo.17203745). A detailed description of the genomic datasets used in our experiments, the accession numbers of the assemblies, and the exact commands are provided in the Supporting Information.
