ViTax: adaptive hierarchical viral taxonomy classification with a taxonomy belief tree on a foundation model

YuShuang He; Feng Zhou; JiaXing Bai; YiChun Gao; Xiaobing Huang; Ying Wang

doi:10.1093/bib/bbaf041

2025 Feb 8;26(1):bbaf041. doi: 10.1093/bib/bbaf041


PMCID: PMC11805961 PMID: 39921398

Abstract

Viruses exert a profound influence on both human health and the global ecosystem, yet they remain largely unexplored. Precise taxonomic classification of viral sequences is essential for discovering novel viruses, elucidating their functions, and assessing their implications for public health and environmental monitoring. Traditional taxonomy methods based on genome references are limited by the vast number of unexplored viruses, rapid mutation rates, and high genetic diversity. Additionally, highly imbalanced species distribution and significant variances in inter-species genomic distances across taxonomic units pose challenges to classifier training. Conceptualizing genomic sequences as sentences in a natural language, large language models provide novel approaches for extracting intrinsic viral genome characteristics. In this study, we introduce ViTax, a virus taxonomy classification tool powered by HyenaDNA, a large language foundation model for long-range genomic sequences at single nucleotide resolution. ViTax integrates supervised prototypical contrastive learning to address the highly imbalanced distributions across various taxonomic clades and demonstrates superior performance to current leading methods in virus taxonomy, particularly significant for long sequences. Moreover, ViTax designs a belief mapping tree using the Lowest Common Ancestor algorithm to adaptively assign a sequence to the lowest taxonomy clade with confidence. For the open-set problem, where sequences belong to novel and unexplored genera, ViTax can adaptively assign them to a higher level of known taxonomy with outstanding performance. These capabilities make ViTax a robust tool for advancing the accuracy and reliability of viral taxonomy classification. The code is available at https://github.com/Ying-Lab/ViTax.

Keywords: virus taxonomy, foundation model, HyenaDNA, contrastive learning, hierarchical classification, taxonomy belief mapping, open-set problem

Introduction

Viruses play a crucial role in human health and the global environment, shaping the intricate balance of life on Earth. The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has profoundly impacted human societies. Indeed, viral infections account for around 60 percent of global medical care [1]. Viruses also play a vital role in Earth’s ecosystems, influencing nutrient cycles, population dynamics, and the evolution of various species [2]. Compared with the estimated number of viruses on Earth [3], the known viruses cataloged in databases such as NCBI are exceedingly limited, and most viruses remain unexplored. Determining virus taxonomy, that is, placing a virus within the hierarchy of biological classification (Domain, Kingdom, Phylum, Class, Order, Family, Genus, and Species), is critical for understanding the virus’s evolutionary origins and ecological roles and for informing public health strategies by predicting its potential pathogenicity and host range.

Several challenges limit the performance of taxonomy prediction on virus sequences. First, the constant discovery of new viruses leads to ongoing revisions of the existing taxonomy system. According to the International Committee on Taxonomy of Viruses (ICTV), fewer than 200 virus species had been identified in 1971, whereas the number had soared to 11,273 by 2023 [4]. From Master Species List (MSL) 37 to MSL38, 214 genera were newly created, and 17 genera were promoted or moved [5]. This highlights the inherently open-set [6] nature of virus taxonomy, where a new virus sequence may not belong to any known genus but can still be assigned to a known family or higher taxon. Virus taxonomy tools must therefore accurately and adaptively identify the lowest appropriate taxonomic rank. For example, if a sequence comes from an unexplored species of an unknown genus under a known family, it should be assigned at the family level, not at the genus level or below.

Furthermore, the hierarchical structure of virus taxonomy is often uneven and unbalanced. Uneven refers to the significant variability in genomic similarities within each taxonomic clade, where some viral strains are highly similar, while others show considerable genetic divergence. This variability, primarily driven by viral mutations, creates an out-of-distribution (OOD) problem, as models encounter strains that differ significantly from those in the training data. Unbalanced, on the other hand, describes the disproportionate representation of genomes within these clades, with certain clades containing a large number of genomes, while others are underrepresented, leading to challenges in classification.

To address the above issues, we develop ViTax, a novel framework for predicting the taxonomy of viruses with adaptive hierarchical classification. ViTax utilizes the HyenaDNA [7] foundation model, a large language model for long-range genomic sequences at single nucleotide resolution, to capture long-sequence genomic features. ViTax also incorporates supervised prototypical contrastive learning (PCL) to tackle the highly imbalanced distributions among different taxonomic clades and to construct the embedding space. For the prediction process, we design a novel tree-based classification approach, named taxonomy belief mapping, that makes use of the constructed embedding space. Taxonomy belief mapping incorporates prior knowledge of viral taxonomy into the classification process through the Lowest Common Ancestor algorithm [8], adaptively selecting the taxonomy level with the most confidence and tackling the open-set problem for sequences of unknown taxonomy. This enables a hierarchical and fine-grained taxonomic classification approach.

We comprehensively evaluate ViTax on multiple datasets, ranging from simulated to real data, including virus genomes in the RefSeq dataset, a contig simulation dataset, an OOD dataset, an open-set dataset, and real metagenomic datasets, to ascertain its efficacy in classifying viral sequences. The experimental results show that ViTax outperforms existing methods and, in particular, handles viral sequences of varying lengths well. The robust performance of ViTax on the open-set and OOD datasets further validates its efficacy in classifying novel viruses whose distributions differ from those of known viruses and that cannot be mapped to the existing genus-level taxonomy. Overall, the superior performance of ViTax on both simulated and real metagenomic datasets demonstrates its potential to broaden the understanding of viral diversity within metagenomic contexts.

The main contributions of this study are as follows:

(1) To extract comprehensive feature information from long genomic sequences and reduce computational complexity, ViTax introduces HyenaDNA [7], a foundation model for long-range genomic sequences at single nucleotide resolution. Subsequently, the model is fine-tuned to more precisely capture the characteristics of viruses. Through Hyena’s implicit long convolutions and element-wise gating [9], fine-grained feature extraction of long sequences is achieved, and embeddings for these sequences are effectively generated.
(2) To tackle the highly imbalanced distributions across various taxonomic clades and the impact of mutations, a supervised PCL framework is introduced. By learning distinctive features of classes, instead of individual samples, the PCL can alleviate the sample imbalance and mitigate distribution inconsistencies, as well as the impact of outliers.
(3) To address the inherently open-set nature of virus taxonomy, where new virus sequences may not belong to any known genus or even higher taxonomy, we construct a taxonomy belief tree to implement an adaptive hierarchical classification. It harnesses the Lowest Common Ancestor algorithm along with the prior knowledge of virus taxonomy to provide confidence levels for biological hierarchical classification in the embedding space. Taxonomy belief mapping not only enhances the granularity of the embedding space but also incorporates the evolutionary relationships within the virus taxonomy classification. ViTax can therefore identify viruses whose genus-level taxonomy has not been explored and find the lowest appropriate taxonomic rank for them.

Related works

Virus taxonomy

Virus taxonomy classification methods can generally be categorized into alignment-based and learning-based strategies.

Alignment-based virus taxonomy

Alignment-based methods determine taxonomy by aligning sequences to a reference database. The Basic Local Alignment Search Tool (BLAST) [10] employs a heuristic approach to rapidly identify similar regions between sequences, enabling the efficient comparison of a query sequence with a reference database and assigning it to the most similar taxonomy. In addition, k-mer features are also regarded as valuable characteristics for virus classification. For example, Kraken [11], VirusTaxo [12], and CLARK [13] utilize k-mer features for classification by decomposing sequences into k-mers and matching them against pre-built reference databases. They employ efficient search algorithms and hierarchical taxonomic structures to achieve rapid and accurate classification across multiple taxonomic levels. Furthermore, the Contig Annotation Tool (CAT) [14] is a sequence classification tool designed for metagenomics, which predicts open reading frames in sequences and performs homology searches against reference databases to implement classification. The precision of alignment-based methods relies on the completeness and accuracy of the reference database. However, because of the extensive number of unexplored viruses and their rapid mutation rates, reference databases remain incomplete and partly inaccurate.

Learning-based virus taxonomy

Learning-based methods for virus taxonomy train machine-learning models on large reference collections of viral genomes and classify viruses based on sequence features. PhaGCN [15, 16] integrates DNA sequence features learned through a CNN with protein sequence similarities to construct a knowledge graph and utilizes a Graph Convolutional Network to enhance virus taxonomy. However, it is limited to classification at the family level and does not extend to the genus level. PhaGenus [17] is a method for phage genus-level classification that utilizes the Transformer model, where protein clusters are used as tokens to train the classification model. PhaGenus incorporates uncertainty assessment to handle sequences from new or unknown genera. However, PhaGenus relies on converting genomic sequences into protein clusters using reference databases before prediction, which means it still retains the limitations associated with alignment-based methods. vConTACT [18] employs a network-based analysis approach to construct viral taxonomy by evaluating gene sharing among viral genomes. Nevertheless, when dealing with short contigs, it usually shows low recall and accuracy.

These learning-based methods are advantageous for their ability to learn complex patterns and make predictions on unseen data. However, they do not fully account for the hierarchical structure, open-set nature, and data imbalance inherent in virus classification.

Genomic sequence foundation model

Since 2022, the ICTV has abolished morphology-based families and uses nucleotide-based similarity between genomes to classify bacteriophages [19]. Therefore, the genome sequence can serve as an authoritative source of information for virus taxonomy classification, and deep analysis of genome sequences can extract genomic signatures for improving taxonomy classification. Since genome sequences can be regarded as a natural language composed of nucleotide bases, large language models can be used to mine features from these sequences.

In this context, genomic sequence foundation models, which leverage advanced large language models, have emerged as a powerful tool for capturing and analyzing complex genomic information. Current genomic sequence foundation models are primarily based on the Transformer architecture [20]. For example, DNABERT [21] employs a Transformer architecture, tokenizes DNA sequences with a k-mer representation, and is pretrained on large-scale unlabeled human genome data. However, the Transformer architecture becomes inadequate when handling genomic sequences that are tens of thousands or even millions of nucleotides long. Although many attempts have been made to increase the context length of foundation models, such as using Byte Pair Encoding in DNABERT2 [22] to compress genomic sequences, these approaches have not entirely resolved the issue. In contrast, the Hyena architecture [9] uses implicit long convolutions to significantly reduce the time and space complexity faced by Transformer-based architectures. HyenaDNA [7], a genomic sequence foundation model based on the Hyena architecture, extends the input DNA context to 1 million tokens. Additionally, by using single-nucleotide tokens, it achieves single-nucleotide resolution, unlike previous works. However, genomic sequence foundation models are primarily developed for human or other eukaryotic genomes, with no models specifically designed for viruses. Therefore, we fine-tune a genomic sequence foundation model to provide new insights into virus taxonomy.

Methods

As shown in Fig. 1, ViTax uses HyenaDNA, a large language foundation model for long-range genomic sequences based on implicit long convolutions, as its base model to obtain the feature representation of each virus. To address viral mutation and the significant imbalance in viral data distribution, we further fine-tune HyenaDNA with PCL [23]. Based on the embeddings of the viruses, a novel adaptive and hierarchical tree-based classification approach called taxonomy belief mapping, which capitalizes on the hierarchical nature of virus taxonomy, is designed for the prediction phase of ViTax. This approach constructs a tree-based classifier by combining the Lowest Common Ancestor algorithm with prior knowledge of virus classification. For a new query sequence, the classification result is obtained by mapping the fine-tuned embeddings into the tree and calculating a confidence score based on the tree structure, which adaptively identifies the lowest appropriate taxonomic rank.

HyenaDNA

HyenaDNA is a large language foundation model for genomic sequences that uses the Hyena operator as its core computing module. The Hyena operator is a data-controlled operator composed of implicit long convolutions and element-wise gating [9], as shown in Fig. 1(b). The gates receive input projections through dense layers and short convolutions, and the long convolutions are implicitly parameterized by an MLP that generates the convolutional filters.

Specifically, given an input $x \in \mathbb{R}^{L}$ ($L$ represents the length of the tokenized input sequence), the operator computes three linear projections of the input, and each projection is then processed by a short convolution. This step is similar to the Query, Key, and Value (Q, K, V) transformations in a Transformer model [20]. The calculation is defined as follows:

$$x_1 = \mathrm{DepthwiseConv1d}(s_1,\ x W_1) \tag{1}$$

$$x_2 = \mathrm{DepthwiseConv1d}(s_2,\ x W_2) \tag{2}$$

$$v = \mathrm{DepthwiseConv1d}(s_v,\ x W_v) \tag{3}$$

where $W_1$, $W_2$, and $W_v$ are linear projection matrices that project $x$ into different spaces, $s_1$, $s_2$, and $s_v$ are the short convolution filters used in the depthwise convolution, and $\mathrm{DepthwiseConv1d}$ represents the 1-dimensional depthwise convolution operation.

After the Hyena operator obtains $x_1$, $x_2$, and $v$, these projections are processed through implicit long convolutions coupled with element-wise gating. The procedure is as follows:

$$H(x_1, x_2) = D_{x_2}\, T_h\, D_{x_1} \tag{4}$$

$$y = H(x_1, x_2)\, v \tag{5}$$

The Toeplitz matrix $T_h$ is formed from a learnable implicit long convolution filter $h$, which is generated as the output of a neural network. This allows the operator to handle very long sequences without the number of parameters growing linearly with sequence length. Additionally, the matrices $D_{x_1}$ and $D_{x_2}$ are constructed with $x_1$ and $x_2$ on the diagonals; their precise definitions are as follows:

$$D_{x_1} = \mathrm{diag}(x_1) \tag{6}$$

$$D_{x_2} = \mathrm{diag}(x_2) \tag{7}$$

where the function $\mathrm{diag}(\cdot)$ transforms a given vector into a diagonal matrix, which implements the element-wise gating within the model.

The Hyena operator can serve as an alternative to the attention mechanism in Transformer models, reducing the computational complexity from $O(L^2)$ to $O(L \log L)$ [9].
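
To make the gating and long-convolution structure concrete, the following is a minimal NumPy sketch of an order-2 Hyena-style operator on a single channel. It is an illustration under stated assumptions, not the HyenaDNA implementation: the filter `h` is drawn at random rather than generated by an MLP, the projections and short convolutions are skipped, and the function names are hypothetical. It shows that gating with `x1`, applying a causal long convolution via FFT, and gating with `x2` matches the dense product of diagonal and Toeplitz matrices, while the FFT path costs $O(L \log L)$ instead of $O(L^2)$.

```python
import numpy as np

def causal_fft_conv(h, u):
    """Causal convolution y[t] = sum_{s<=t} h[t-s] * u[s], computed with the FFT."""
    L = u.shape[0]
    n = 2 * L  # zero-padding makes the circular convolution equal to the linear one
    y = np.fft.irfft(np.fft.rfft(h, n) * np.fft.rfft(u, n), n)
    return y[:L]

def hyena_order2(x1, x2, v, h):
    """Gate v with x1, apply the long convolution with filter h, then gate with x2."""
    return x2 * causal_fft_conv(h, x1 * v)

def toeplitz_lower(h):
    """Explicit lower-triangular Toeplitz matrix of h (O(L^2); for checking only)."""
    L = h.shape[0]
    T = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1):
            T[i, j] = h[i - j]
    return T

# Toy inputs (assumption: random vectors stand in for the learned projections and filter).
rng = np.random.default_rng(0)
L = 16
x1, x2, v, h = (rng.standard_normal(L) for _ in range(4))

y_fast = hyena_order2(x1, x2, v, h)
y_dense = np.diag(x2) @ toeplitz_lower(h) @ np.diag(x1) @ v
```

The dense product is included only to check the fast path on a short toy sequence; for long genomic inputs only the FFT route is practical.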

Supervised PCL with adjusted ProtoNCE

In virus taxonomy, the classification task is complicated by severe imbalances in distribution among different taxonomic clades and by the high mutation rates of viruses, which introduce frequent and often anomalous sequence variations. To address these problems, we introduce a robust framework that integrates supervised PCL with a purpose-designed loss function, adjusted ProtoNCE (APNCE). PCL and APNCE are described in detail below.

Supervised PCL

To tackle the highly imbalanced distributions across various taxonomic clades and the susceptibility of contrastive learning methods to outlier samples, ViTax introduces the concept of a ‘prototype’. This prototype is defined as the centroid of the cluster formed by sequence fragments within each genus, serving as the representative feature of the taxonomic group. The training strategy focuses on minimizing the APNCE loss, a modification of the ProtoNCE loss function that reduces the impact of outlier sequences, such as mutated gene fragments. Prototypical contrastive learning is effective when class samples are limited. It allows the model to extract representative features from each class rather than from individual samples.

As shown in Fig. 1(a), the training process of ViTax uses an Expectation-Maximization strategy (E-step and M-step) [24] for optimization. In the E-step, the training data are input into ViTax to generate embeddings for the genomic sequences. Next, the centroid of each genus is calculated. Specifically, for genus $k$, the prototype $c_k$ is defined as the mean of all embeddings belonging to that genus:

$$c_k = \frac{1}{|S_k|} \sum_{i \in S_k} z_i \tag{8}$$

where $z_i$ is the embedding of sample $i$, $S_k$ is the set containing the indices of all samples labeled as $k$, and $|S_k|$ represents the size of this set. This process can be understood as aggregating the feature representations of all genomic sequences within a genus to obtain representative features of the genus.
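
The prototype computation in Equation (8) is a per-genus mean, which can be sketched in a few lines of NumPy. The function name, the toy embeddings, and the genus labels below are hypothetical and serve only to illustrate the averaging step.

```python
import numpy as np

def genus_prototypes(embeddings, labels):
    """Return {genus: mean embedding}, i.e. one centroid per genus as in Equation (8)."""
    labels = np.asarray(labels)
    return {k: embeddings[labels == k].mean(axis=0) for k in np.unique(labels)}

# Toy data: four 2-D embeddings from two hypothetical genera.
z = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
y = ["genusA", "genusA", "genusB", "genusB"]
protos = genus_prototypes(z, y)
```

Because each genus contributes exactly one prototype regardless of how many fragments it has, large genera do not dominate the set of contrast targets.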

In the M-step, the training data are input into HyenaDNA again to update the DNA sequence embeddings. Each embedding is then compared with all prototypes to calculate the APNCE loss, which is backpropagated to optimize the model. The calculation is defined as follows:

(9)

In this calculation, the prototype of the sample’s own genus is treated as the positive sample, whereas prototypes of other genera are treated as negative samples. The convergence of the APNCE loss is evaluated after each iteration of the E-step and M-step. Iteration stops when the loss reaches the preset convergence condition or no longer changes significantly.

Adjusted ProtoNCE

In PCL, ProtoNCE is generally used as the loss function; it optimizes the model so that the embedding of each sample is close to its corresponding prototype [23]. ProtoNCE is defined as follows:

$$\mathcal{L}_{\mathrm{ProtoNCE}} = -\sum_{i} \log \frac{\exp\left(s(z_i, c_i^{+})/\tau\right)}{\exp\left(s(z_i, c_i^{+})/\tau\right) + \sum_{c_j \in N_i} \exp\left(s(z_i, c_j)/\tau\right)} \tag{10}$$

where the function $s(z_i, c)$ denotes the similarity between embedding $z_i$ and prototype embedding $c$. $N_i$ is the set of negative samples, referring to all prototypes that do not belong to the embedding $z_i$. $c_i^{+}$ represents the prototype corresponding to embedding $z_i$. $\tau$ is the temperature parameter that regulates the smoothness of the similarity scores. When $\tau$ is large, the model’s discrimination between different samples decreases, making it easier to capture global information. When $\tau$ is small, the model discriminates more sharply among difficult negative samples (i.e. negative samples with higher similarity to positive samples).

However, in ProtoNCE, the weight of each negative sample is related to $\exp\left(s(z_i, c_j)/\tau\right)$, so more difficult negative samples obtain larger weights [25]. In practice, excessive attention to negative samples can lead to overfitting, as the model fails to capture task-critical features or information effectively.

To address these problems, we use Gaussian weights to flexibly adjust sample weights and make the model focus on information-rich regions. The Gaussian weight is defined as follows:

$$w_{ij} = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\left(s(z_i, c_j) - \mu\right)^2}{2\sigma^2}\right) \tag{11}$$

where $w_{ij}$ is the weight assigned to the embedding $z_i$ with respect to the negative prototype $c_j$, and $\mu$ and $\sigma$ are two hyperparameters. $\mu$ is the mean of the Gaussian function, which controls the central region of the weight distribution. $\sigma$ is the standard deviation, which controls the height of the weight distribution in the central region.

Incorporating the weights into ProtoNCE, we obtain the APNCE:

$$\mathcal{L}_{\mathrm{APNCE}} = -\sum_{i} \log \frac{\exp\left(s(z_i, c_i^{+})/\tau\right)}{\exp\left(s(z_i, c_i^{+})/\tau\right) + \sum_{c_j \in N_i} w_{ij} \exp\left(s(z_i, c_j)/\tau\right)} \tag{12}$$

By assigning weights to negative prototypes according to their similarities through the Gaussian function, the APNCE loss optimizes the model more effectively, pulling samples closer to their corresponding prototypes while avoiding excessive attention to noisy negative samples.
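
The effect of the Gaussian weighting can be illustrated with a small NumPy sketch for a single embedding. Several choices here are assumptions rather than details confirmed by the text: cosine similarity is used as the similarity function, the Gaussian weight includes its usual normalising factor, the weights multiply only the negative terms in the denominator, and the values of `tau`, `mu`, and `sigma` are arbitrary. All function names are hypothetical.

```python
import numpy as np

def cosine_sim(z, protos):
    """Cosine similarity between one embedding z and every prototype (rows of protos)."""
    z = z / np.linalg.norm(z)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return p @ z

def gaussian_weight(s, mu, sigma):
    """Gaussian weight of a similarity value s (assumed form, with normalising factor)."""
    return np.exp(-((s - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def proto_nce(z, protos, pos, tau=0.1, weights=None):
    """Loss for one embedding; pos is the index of its own prototype.

    weights=None gives the unweighted ProtoNCE form; otherwise the weights
    rescale the negative terms in the denominator.
    """
    e = np.exp(cosine_sim(z, protos) / tau)
    neg = np.ones_like(e) if weights is None else np.array(weights, dtype=float)
    neg[pos] = 0.0  # the positive prototype is not a negative
    return -np.log(e[pos] / (e[pos] + np.sum(neg * e)))

def apnce(z, protos, pos, tau=0.1, mu=0.5, sigma=0.5):
    """Gaussian-weighted variant: negatives are weighted by their similarity to z."""
    w = gaussian_weight(cosine_sim(z, protos), mu, sigma)
    return proto_nce(z, protos, pos, tau, weights=w)

# Toy example: three prototypes, one embedding close to prototype 0.
protos = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
z = np.array([0.9, 0.1])
loss_plain = proto_nce(z, protos, pos=0)
loss_adjusted = apnce(z, protos, pos=0)
```

Setting all weights to one recovers the unweighted loss, which makes the role of the weighting term easy to isolate when experimenting with `mu` and `sigma`.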

Taxonomy belief mapping approach

During ViTax’s prediction process, we propose a novel taxonomy belief mapping approach to obtain the classification results. This tree-based classification method utilizes the Lowest Common Ancestor algorithm to build a taxonomy belief tree. It establishes a correspondence mapping between the clustering relationships in the embedding space and the taxonomy tree structure, thereby adaptively obtaining the taxonomy level with the highest confidence in the next step to solve the open set problem in virus taxonomy classification. The detailed steps are as follows:

Constructing taxonomy belief tree

As shown in Fig. 1(c), the tree-based classifier (taxonomy belief tree) is first constructed from the training set. Specifically, each sequence $S$ in the training set is segmented into fragments of 2000 base pairs (bp) with an overlap of 400 bp to capture more continuous information. The same strategy is also applied to the reverse complementary sequence, yielding the fragment set

$$F = \{f_1, f_2, \ldots, f_n\} \tag{13}$$
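
The fragmentation step can be sketched in plain Python as follows. This is a minimal illustration, assuming that a 400 bp overlap corresponds to a stride of 1600 bp, that a sequence shorter than one fragment is kept whole, and that a final window is anchored at the sequence end so the tail is covered; none of these edge-case rules are specified in the text, and the function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string; unknown symbols map to N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(b, "N") for b in reversed(seq.upper()))

def segment(seq, length=2000, overlap=400):
    """Cut seq into windows of `length` bp whose neighbours share `overlap` bp."""
    if len(seq) <= length:
        return [seq]
    stride = length - overlap
    starts = list(range(0, len(seq) - length + 1, stride))
    if starts[-1] + length < len(seq):
        starts.append(len(seq) - length)  # assumption: extra window covering the tail
    return [seq[s:s + length] for s in starts]

def fragments_both_strands(seq, length=2000, overlap=400):
    """Fragments of the sequence followed by fragments of its reverse complement."""
    return segment(seq, length, overlap) + segment(reverse_complement(seq), length, overlap)

# Toy usage on a 5200 bp synthetic sequence.
toy = "ACGT" * 1300
frags = fragments_both_strands(toy)
```

With these defaults the 5200 bp toy sequence yields three forward and three reverse-complement fragments.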

Subsequently, the trained HyenaDNA is applied to generate an embedding $e_i$ for each fragment $f_i$.

$$e_i = \mathrm{HyenaDNA}(f_i) \tag{14}$$

Following this, all generated embeddings undergo K-means clustering [26]. The K-means algorithm partitions the embedding space into $K$ clusters, aiming to group similar embeddings.

$$\min_{C_1, \ldots, C_K} \sum_{j=1}^{K} \sum_{e \in C_j} \lVert e - \mu_j \rVert^2 \tag{15}$$

where $\mu_j$ is the centroid of the $j$th cluster, $K$ is the number of clusters, and $C_j$ denotes the set of data points in the $j$th cluster.

Through K-means, we obtain $K$ clusters, which is equivalent to partitioning the embedding space into distinct regions.

$$C = \{C_1, C_2, \ldots, C_K\} \tag{16}$$

Taxonomy confidence is assigned to each cluster through the Lowest Common Ancestor algorithm combined with the viral evolutionary tree, thereby obtaining the taxonomy belief tree. This approach calculates the lowest common ancestor node at the biological level for all members within a cluster. The node represents the most recent common ancestor in the evolutionary tree for that cluster, giving a robust measure of taxonomy confidence. It is crucial for understanding the phylogenetic relationships and serves as a foundational element in the taxonomy belief mapping process. The Lowest Common Ancestor of a cluster $C_j$, denoted as $\mathrm{LCA}(C_j)$, is defined as follows:

$$\mathrm{LCA}(C_j) = \mathrm{LCA}\left(e_1, e_2, \ldots, e_{|C_j|}\right), \quad e_i \in C_j \tag{17}$$

where $C_j$ is the $j$th cluster from K-means clustering, and $e_i$ are the embeddings within that cluster. Applying the above steps to each cluster $C_j$ constructs the classification tree and assigns the corresponding taxonomy-level confidence to each partition of the mapping space.
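
Labelling a cluster with the lowest common ancestor of its members can be sketched as a longest-common-prefix computation over lineages. The sketch assumes each training fragment carries its root-to-genus lineage as a list of taxon names; the taxon names and function names below are hypothetical placeholders, and the cluster ids stand in for the output of K-means.

```python
def lowest_common_ancestor(lineages):
    """Deepest taxon path shared by all lineages (each a root-to-genus list)."""
    common = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return common

def label_clusters(cluster_ids, lineages):
    """Map each cluster id to the LCA of the lineages of its member fragments."""
    members = {}
    for cid, lineage in zip(cluster_ids, lineages):
        members.setdefault(cid, []).append(lineage)
    return {cid: lowest_common_ancestor(lins) for cid, lins in members.items()}

# Hypothetical lineages for fragments of three genera.
a = ["root", "OrderA", "FamilyA1", "GenusX"]
b = ["root", "OrderA", "FamilyA1", "GenusY"]
c = ["root", "OrderA", "FamilyA2", "GenusZ"]

cluster_ids = [0, 0, 1, 1, 2, 2]
cluster_labels = label_clusters(cluster_ids, [a, a, a, b, b, c])
```

A cluster containing fragments of a single genus keeps a genus-level label, whereas a mixed cluster is labelled at the family or order where its members' lineages diverge.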

Prediction step

As shown in Fig. 1(d), for a new query sequence $S_q$, ViTax maps it to the already constructed tree and classifies it based on the mapping results. The precise process is as follows:

(1) Sequence cutting: cut the query sequence and its reverse complementary chain into multiple fragments, each of length 2000 bp, with a sliding window of 400 bp. This process generates a set of $N$ overlapping fragments from the query sequence and its reverse complement.

$$F_q = \{f_1, f_2, \ldots, f_N\} \tag{18}$$

(2) Generate embeddings: input each fragment $f_i$ into the trained HyenaDNA to generate the corresponding embedding representation $e_i$.

$$e_i = \mathrm{HyenaDNA}(f_i) \tag{19}$$

(3) Assign to clusters: each generated embedding is assigned to its corresponding cluster center within the pre-trained K-means clustering. More precisely, calculate the distance between each embedding and all cluster centers, and select the cluster center with the smallest distance.

$$e_i \in C_{j^{*}}, \quad j^{*} = \arg\min_{j} \lVert e_i - \mu_j \rVert \tag{20}$$

where $\mu_j$ is the center of the $j$th cluster and $C_j$ is the $j$th cluster.

(4) Mapping to the taxonomy belief tree: based on the clustering results, map each query fragment to the corresponding node on the tree and increase the weight of each corresponding node by 1. In detail, the weight of node $n$ is calculated as follows:

$$w(n) = \sum_{i=1}^{N} \mathbb{1}(f_i \in n) \tag{21}$$

where the function $\mathbb{1}(\cdot)$ is an indicator function that equals 1 if the fragment $f_i$ belongs to node $n$ in the taxonomy belief tree, and 0 otherwise. This step enables the calculation of confidence for the various segments of the query sequence within the tree, thereby establishing a foundation for the subsequent classification step.

(5) Calculate the maximum weight path: after assigning all fragments to the tree, find the path with the highest weight in the tree, where a path $P$ refers to the complete evolutionary path from the root node to a leaf node. The formula is as follows:

$$P_{\max} = \arg\max_{P} \sum_{n \in P} w(n) \tag{22}$$

The leaf node of the maximum weight path is the genus-level taxonomy classification result.

(6) Calculate the confidence score: after obtaining the classification result, the confidence score is calculated to facilitate the subsequent adaptive hierarchical classification strategy. The confidence score is defined as the ratio of the number of fragments that fall on the maximum path to the total number of fragments. The formula is as follows:

$$\mathrm{confidence} = \frac{\sum_{n \in \text{max path}} w(n)}{N} \tag{23}$$

where the max path is the set of nodes on the maximum path, $w(n)$ is the number of fragments mapped to node $n$, and $N$ is the total number of fragments.
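
The prediction steps above (cluster assignment, node weighting, maximum weight path, and confidence score) can be sketched in plain Python. This is an illustration under stated assumptions: tree nodes are represented as tuples of taxon names from the root, each cluster has already been labelled with a node, Euclidean distance is used for assignment, and the embeddings, cluster centers, and taxon names are made-up toy values. Function names are hypothetical.

```python
import math
from collections import Counter

def nearest_cluster(e, centers):
    """Index of the cluster center closest to embedding e (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(e, centers[j]))

def node_weights(embeddings, centers, cluster_node):
    """Count, per tree node, how many query fragments map to it."""
    return Counter(cluster_node[nearest_cluster(e, centers)] for e in embeddings)

def max_weight_path(leaf_paths, weights):
    """Root-to-leaf path whose nodes carry the largest total weight."""
    def path_weight(path):
        return sum(weights.get(path[:d], 0) for d in range(1, len(path) + 1))
    best = max(leaf_paths, key=path_weight)
    return best, path_weight(best)

def confidence(path_weight, n_fragments):
    """Fraction of fragments that fall on the maximum weight path."""
    return path_weight / n_fragments

# Toy setting: three clusters labelled with tree nodes (tuples of taxon names).
genus_x = ("root", "OrderA", "FamilyA1", "GenusX")
genus_y = ("root", "OrderA", "FamilyA1", "GenusY")
genus_w = ("root", "OrderB", "FamilyB1", "GenusW")
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
cluster_node = {0: genus_x, 1: genus_x[:3], 2: genus_w}

query_embeddings = [(1.0, 0.0), (0.0, 1.0), (9.0, 1.0), (1.0, 9.0), (0.5, 0.5)]
weights = node_weights(query_embeddings, centers, cluster_node)
best_path, best_weight = max_weight_path([genus_x, genus_y, genus_w], weights)
score = confidence(best_weight, len(query_embeddings))
```

In this toy case three fragments land on the genus-level node and one on its family, so four of the five fragments lie on the winning path.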

Adaptive hierarchical classification strategy

In real-world virus taxonomy, taxonomists often face a multitude of newly discovered viruses whose genus-level taxonomy has not been explored. Therefore, virus taxonomy classification at the genus level, and at many lower taxonomic levels, is an open-set problem, as it involves identifying viruses from classes not represented in the training set. To address this challenge, ViTax uses the hierarchical structure of the taxonomy belief tree to classify these viruses based on their confidence levels at each taxonomic level. When encountering a virus whose genus-level taxonomy has not been explored, ViTax can assign it a known higher-level taxon, which is the lowest appropriate taxonomic rank for that virus. For example, if a virus is assigned as ‘Ligamenvirales_order’ (Ligamenvirales being an order of viruses), this suggests that the virus may belong to Ligamenvirales but does not match any known taxon under that order well. This approach effectively addresses the open-set problem in virus classification. The specific steps are as follows:

When the confidence score calculated using Equation (23) is less than the confidence threshold (0.6 by default), the current prediction for the query sequence is not the lowest appropriate rank classification. Consequently, ViTax searches for a higher appropriate taxonomy classification. This process begins by considering all leaf nodes and redistributing their weights to their parent nodes. To be precise, the weight of each parent node is updated to the sum of its original weight plus the weights of its child nodes (the merged leaf nodes). Once the weights are reallocated, the leaf nodes are effectively removed from the taxonomy belief tree. This reassignment process is formalized as follows:

$$w(p) \leftarrow w(p) + \sum_{l \in \mathrm{leaves}(p)} w(l) \tag{24}$$

where $w(p)$ is the weight of the node $p$, and $\mathrm{leaves}(p)$ is the collection of leaf nodes belonging to the parent node $p$.

After reallocating the weights, the confidence score is recalculated as defined in Equation (23). This iterative process continues, redistributing the weights and recalculating the confidence score, until the confidence score reaches the predetermined threshold. At this point, the leaf node of the path with the highest accumulated weight is designated as the lowest appropriate rank classification, indicating the most probable taxonomic affiliation. Using this method, although ViTax is trained only at the genus level, it can achieve adaptive prediction across various levels of taxonomy.
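
The roll-up procedure can be sketched as a loop that folds leaf weights into their parents until the confidence reaches the threshold. The sketch makes several assumptions beyond the text: nodes are tuples of taxon names from the root, all leaves start at the same depth, and the loop stops at the root if the threshold is never reached. The weights and taxon names are illustrative toy values, not taken from Fig. 2, and the function names are hypothetical.

```python
def path_weight(path, weights):
    """Total weight of all nodes on the path from the root to `path`."""
    return sum(weights.get(path[:d], 0) for d in range(1, len(path) + 1))

def adaptive_classify(leaf_paths, weights, n_fragments, threshold=0.6):
    """Merge leaves into their parents until the best path is confident enough."""
    leaves = set(leaf_paths)
    weights = dict(weights)  # work on a copy
    while True:
        best = max(sorted(leaves), key=lambda p: path_weight(p, weights))
        score = path_weight(best, weights) / n_fragments
        if score >= threshold or all(len(p) == 1 for p in leaves):
            return best, score
        merged = set()
        for leaf in leaves:
            if len(leaf) == 1:  # the root cannot be merged further
                merged.add(leaf)
                continue
            parent = leaf[:-1]
            # Fold the leaf's weight into its parent, then drop the leaf.
            weights[parent] = weights.get(parent, 0) + weights.pop(leaf, 0)
            merged.add(parent)
        leaves = merged

# Toy example with 10 fragments spread over two sibling genera and one other genus.
genus_x = ("root", "OrderA", "FamilyA1", "GenusX")
genus_y = ("root", "OrderA", "FamilyA1", "GenusY")
genus_w = ("root", "OrderB", "FamilyB1", "GenusW")
toy_weights = {genus_x: 3, genus_y: 3, genus_w: 2, genus_x[:3]: 1, ("root",): 1}

rank, rank_score = adaptive_classify([genus_x, genus_y, genus_w], toy_weights, 10)
```

Here neither sibling genus reaches the 0.6 threshold on its own (each path scores 0.5), but after one merge their shared family accumulates enough weight to be reported at a score of 0.8.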

Taking Fig. 2 as an example, at the beginning, as shown in the left tree, the leaf node of the maximum path is pink. Dividing the sum of weights on this path by the total number of segments, 17, gives a confidence score below the default threshold of 0.6. Therefore, the pink node is not the lowest appropriate rank classification, and the adaptive hierarchical classification strategy begins to seek higher-level taxonomy classifications. We merge all leaf nodes and their weights into their parents and obtain the tree on the right. At this point, the leaf node of the maximum path is the purple node, and the sum of weights on the path gives a confidence score above 0.6. Therefore, we can infer that the query sequence belongs to the taxon represented by the purple node.

Fig. 2. Example of the adaptive hierarchical classification strategy.

Dataset

ViTax is tested on multiple datasets that vary in difficulty levels and usage scenarios. Details of the datasets are provided below.

RefSeq dataset: according to the MSL38 virus classification standards [5], we collected all double-stranded DNA viruses from the RefSeq database, where the genera containing multiple genomes were selected. A dataset comprising 3,979 genomes belonging to 631 genera is constructed with the number of genomes per genus ranging from 2 to 181, and the dataset is randomly split into training and testing sets with a ratio of 6:4. Meanwhile the genera with only one genome in RefSeq database are excluded from training, which constructs an open-set testing data. Detailed information on the dataset can be found in the Supplementary Table S1 and ’Open-set dataset’.

Contig simulation dataset: in practical applications, the test sequences often appear in contig form, meaning they are shorter compared to whole genomes. Therefore, we simulate contig data by cutting sequences from the RefSeq dataset into lengths ranging from 4k to 16k and generate a total of 84,399 simulated contigs, aiming to evaluate method performance across sequences of varying lengths.

Out-of-distribution dataset: the dataset used in this study is a subset of the RefSeq database, reallocated for OOD tasks. To prepare this dataset, we first compute the pairwise distances between strains within each genus, generating a similarity matrix. Strains with an average similarity greater than 0.3 are then removed. Following this, spectral clustering is applied to partition the remaining strains within each genus into two clusters. The result of clustering is used to divide the dataset into training and testing sets, ensuring that the virus distributions between the training and testing sets are inconsistent. The detailed steps are in the Supplementary.

Open-set dataset: double-stranded DNA viruses from MSL38 are selected, with a focus on genera that contain only a single sequence, which has not been encountered during training. The collection comprises 789 sequences from 789 distinct genera, designated as open-set data at the genus level.

Real metagenomic virus dataset: to test the generalizability of ViTax and expand its utility in different virome studies, we download three real metagenomic datasets: the Global Ocean Viromes (GOV2.0) [27], the Global Soil Virome dataset (GSV) [28], and the Gut Phage Database [29]. All these datasets are derived from real environmental sampling and assembly, ensuring their relevance to diverse ecological settings.

Data preparation

Data from the RefSeq dataset is utilized to train the model. Given the nature of the learning-based approach, it is essential to have training samples for each genus. Therefore, the data for each genus are randomly split in a 6:4 ratio (ensuring a minimum of one sequence in the test set if the split resulted in fewer than one).
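The per-genus split can be sketched as below. This is an illustration under assumptions: the rounding rule, the fixed seed, and the helper name `split_by_genus` are not taken from the paper. Every genus is assumed to hold at least two sequences, as in the RefSeq dataset described above.

```python
import random
from collections import defaultdict

def split_by_genus(records, test_fraction=0.4, seed=0):
    """records: iterable of (sequence_id, genus) pairs.
    Returns (train_ids, test_ids), split separately within each genus."""
    rng = random.Random(seed)
    by_genus = defaultdict(list)
    for seq_id, genus in records:
        by_genus[genus].append(seq_id)

    train, test = [], []
    for ids in by_genus.values():
        rng.shuffle(ids)
        # Keep at least one test sequence per genus, as described in the text.
        n_test = max(1, round(len(ids) * test_fraction))
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test

records = [(f"A{i}", "GenusA") for i in range(10)] + [("B0", "GenusB"), ("B1", "GenusB")]
train_ids, test_ids = split_by_genus(records)
print(len(train_ids), len(test_ids))  # 7 5
```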

Baseline

Three state-of-the-art methods are compared: two alignment-based methods, Kraken [11] and CAT [14], and one learning-based method, PhaGenus [17].

Performance on virus taxonomy

The performance of ViTax is comprehensively evaluated on the RefSeq testing set against CAT and Kraken, which are uniformly trained on the same RefSeq training set. PhaGenus can only predict 532 genera owing to its limited capability; its database is constructed using the same dataset, and two comparisons are provided for fairness: (1) a comparison on the entire test set, which includes 631 genera and therefore exceeds the prediction range of PhaGenus; (2) a comparison on the subset of the test set containing the 532 genera that PhaGenus can predict, reported in Supplementary Table S2.

Table 1 shows the genus-level accuracy of the four tools on the RefSeq dataset and the out-of-distribution (OOD) dataset at varying sequence lengths. ViTax outperforms the other methods across all sequence lengths on both datasets, with a particularly large improvement on the OOD dataset. For complete genomes in the OOD dataset, ViTax exceeds Kraken by 13.6 percentage points (0.864 versus 0.728). This highlights ViTax’s generalization and robustness in handling diverse sequence lengths, out-of-distribution samples, and viral variants. ViTax adapts to the genomic diversity found in mutated or novel viral strains, maintaining high accuracy even when faced with previously unseen or poorly represented viral sequences. In addition to performance, the computational resource consumption of each method has also been compared, and details are provided in Supplementary Table S3.
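As a quick arithmetic check, the margin of ViTax over each baseline for complete genomes in the OOD dataset can be recomputed from the Table 1 values. The variable names are illustrative only.

```python
# Genus-level accuracy on complete genomes (OOD dataset), copied from Table 1.
ood_complete = {"ViTax": 0.864, "Kraken": 0.728, "CAT": 0.674, "PhaGenus": 0.611}

# Absolute accuracy margin of ViTax over each baseline.
margins = {tool: round(ood_complete["ViTax"] - acc, 3)
           for tool, acc in ood_complete.items() if tool != "ViTax"}
print(margins)  # {'Kraken': 0.136, 'CAT': 0.19, 'PhaGenus': 0.253}
```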

Table 1.

Comparison of accuracy at the genus level for different tools on RefSeq and out-of-distribution datasets across varying sequence lengths

	RefSeq dataset				Out-of-distribution dataset
Length	ViTax	Kraken	CAT	PhaGenus	ViTax	Kraken	CAT	PhaGenus
4k	0.923	0.911	0.851	0.749	0.648	0.596	0.573	0.471
6k	0.937	0.922	0.862	0.751	0.675	0.595	0.578	0.473
8k	0.948	0.927	0.866	0.750	0.688	0.600	0.580	0.477
10k	0.951	0.932	0.875	0.745	0.698	0.603	0.590	0.470
12k	0.954	0.936	0.878	0.717	0.699	0.606	0.586	0.465
14k	0.957	0.936	0.880	0.737	0.700	0.608	0.586	0.460
16k	0.957	0.936	0.880	0.737	0.700	0.612	0.594	0.453
Complete genome	0.950	0.936	0.878	0.713	0.864	0.728	0.674	0.611


The high performance of ViTax is attributed to the effective combination of the trained embedding space with the hierarchical a priori information of virus classification. Additionally, ViTax’s robustness to sequence length variations is a consequence of HyenaDNA’s proficiency in handling long sequences and of the detailed view of each sequence provided by taxonomy belief mapping. Furthermore, the embedding space constructed through PCL enables ViTax to effectively recognize viruses from different distributions. In contrast to alignment-based methods, which rely solely on direct sequence comparisons, ViTax leverages learned representations, offering superior performance in identifying novel or out-of-distribution viruses.

Performance on open-set data

Traditional methods commonly assume that the training and testing datasets are drawn from a similar distribution. However, owing to rapid advances in sequencing technology, new viruses are constantly being discovered whose genus-level taxonomy has not yet been established and has not been encountered in the training set. This highlights the open-set nature of viral classification, which demands that viral classification tools be capable of addressing open-set problems. Below are the results of ViTax on open-set problems.

Confidence threshold setting

To evaluate the effectiveness of setting a confidence threshold, we analyze, at different confidence thresholds, the hierarchy adjustment rate (the frequency at which sequences are reclassified to taxonomic levels higher than genus, such as family or order, when they cannot be confidently classified at the genus level) and various genus-level evaluation metrics computed after excluding the viruses that are adjusted to higher levels. As shown in Fig. 3(a), the hierarchy adjustment rate increases as the threshold is raised. Meanwhile, the accuracy of the retained genus-level predictions improves correspondingly. Figure 3(b) shows the ROC curve at the class level for ViTax. With an appropriate threshold, the model achieves a better balance between the true positive rate and the false positive rate, effectively rejecting incorrect samples and improving overall prediction accuracy. We aim to set a threshold that improves the overall genus-level prediction accuracy without excessively adjusting predictions to higher taxonomic levels. In subsequent experiments, we select 0.6 as the default threshold to balance accuracy against the hierarchy adjustment rate.
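The trade-off described above can be made concrete with a short sketch. It assumes each test sequence comes with a genus-level confidence and a flag saying whether the genus prediction is correct; the function name `threshold_report` and the toy data are hypothetical and are not results from the paper.

```python
def threshold_report(predictions, threshold):
    """predictions: list of (genus_confidence, genus_is_correct) pairs.
    Returns (hierarchy_adjustment_rate, accuracy_of_retained_predictions)."""
    retained = [correct for conf, correct in predictions if conf >= threshold]
    adjusted = len(predictions) - len(retained)  # pushed above genus level
    adjustment_rate = adjusted / len(predictions)
    retained_accuracy = sum(retained) / len(retained) if retained else float("nan")
    return adjustment_rate, retained_accuracy

# Made-up confidences and outcomes for six sequences.
toy = [(0.95, True), (0.80, True), (0.65, True),
       (0.55, False), (0.50, True), (0.30, False)]
for t in (0.4, 0.6, 0.8):
    print(t, threshold_report(toy, t))
```

On this toy input, raising the threshold from 0.4 to 0.6 increases the adjustment rate and the accuracy of the retained genus-level predictions together, which is the pattern the text attributes to Fig. 3(a).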

Figure 3. a) Changes in the hierarchy adjustment rate (how often sequences are reclassified to higher taxonomic levels when confidence at the genus level is low) and in ViTax metrics at the genus level with varying confidence thresholds. b) ViTax’s ROC curve at the class level. c) Changes in the ViTax confidence threshold and hierarchy adjustment rate in the open-set dataset. d) Pie chart of the distribution of taxonomic classifications by ViTax after adjustment. The chart includes ’classified’ and ’unclassified’ categories. Classified sequences are further broken down into those assigned two ranks higher than, one rank higher than, and the same rank as the lowest appropriate rank.

Evaluation of adaptive hierarchical classification strategy in Open-Set Dataset

To emulate the open-set challenge, an open-set dataset is created to simulate viruses that we never encountered before. There are 782 genera in the dataset, and none of them should be assigned to any of the genus-level labels present in the training set. Instead, they should be assigned to higher-level classifications.

Figure 3(c) illustrates the changes in the ViTax confidence threshold and hierarchy adjustment rate in the open-set dataset. With a confidence threshold of 0.6, the hierarchy adjustment rate is 80%. This means that, for viruses whose lowest appropriate rank is higher than the genus level, the adaptive hierarchical classification strategy adjusts the prediction of 80% of viruses to higher taxonomic levels. At the same time, as shown in Fig. 3(a), with this threshold there is less than a 10% probability of incorrectly adjusting viruses that should be classified at the genus level to higher taxonomic levels. Therefore, this strategy not only improves the accuracy of ViTax’s predictions but also reduces misjudgments caused by uncertainty, allowing ViTax to adaptively assign viruses with undiscovered lower-level taxonomy to higher taxonomic levels, thus addressing the open-set problem and enhancing its reliability and robustness in practical applications.

We further analyze the taxonomy assigned by ViTax after adjusting the classification. ViTax allocates 98% of the sequences to a higher taxonomy, and the remaining 2% are considered entirely unrelated to the taxonomy present in the training set. Among the higher-level classifications, 91.7% are verified as accurate. Details are presented in Fig. 3(d), where 1% of the sequences are classified two taxonomic ranks higher than the lowest appropriate rank, 67% are allocated one rank higher, and 32% are assigned to the same rank as the true classifiable level.

The results vary with the confidence threshold. When the threshold is set high, ViTax adopts a more conservative strategy, resulting in a higher hierarchy adjustment rate and typically assigning sequences to higher taxonomic levels. Conversely, when the threshold is set low, ViTax becomes more aggressive and retains more genus-level predictions.

Not only ViTax but also other tools have corresponding strategies to handle viruses outside their recognition scope. For example, PhaGenus [17] rejects predictions for these viruses, while CAT [14] and Kraken [11] seek higher-level taxonomy classifications. We test four tools to see whether they can identify these viruses with genus-level taxonomy that is unseen in the training set. As shown in Table 2, ViTax demonstrates a higher accuracy in identifying viruses with genus-level taxonomy that has not been discovered compared with other tools. This is credited to the taxonomy belief mapping approach’s effective use of the Lowest Common Ancestor algorithm, which leverages the a priori knowledge of viral hierarchical taxonomy, thereby efficiently integrating all embeddings derived from sequences. Additionally, the taxonomy belief mapping approach’s tree structure facilitates the assignment of unknown viruses to higher levels of similar taxonomy within ViTax to solve open-set problems in virus taxonomy classification.

Table 2.

The accuracy of the four tools for identifying viruses in the open-set dataset

	ViTax	Kraken	CAT	PhaGenus
Accuracy	0.8	0.3	0.69	0.52


Ablation study

Ablation study of PCL

In the ViTax pipeline, PCL plays a crucial role in constructing the embedding space, which is then used by the subsequent taxonomy belief mapping (TBM) module for adaptive hierarchical classification. By learning representative features rather than individual instance features, PCL effectively addresses the few-shot learning problem and virus mutation issues in viral taxonomy classification.
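To illustrate why class-level representatives help when a genus has very few training samples, the sketch below classifies a query embedding by its nearest prototype. It is an illustration only: taking the prototype as the mean embedding of a genus and comparing by cosine similarity are assumptions made here, not a description of how ViTax trains or uses its prototypes, and all names and vectors are made up.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_prototypes(embeddings_by_genus):
    """One prototype per genus, however many samples the genus has."""
    return {genus: mean_vector(vs) for genus, vs in embeddings_by_genus.items()}

def nearest_prototype(query, prototypes):
    return max(prototypes, key=lambda genus: cosine(query, prototypes[genus]))

# Toy 2-D embeddings: GenusB has a single training sample.
prototypes = build_prototypes({
    "GenusA": [[1.0, 0.0], [0.9, 0.1]],
    "GenusB": [[0.0, 1.0]],
})
print(nearest_prototype([0.1, 0.9], prototypes))  # GenusB
```

Because every genus is represented by exactly one prototype, a genus with a single training sequence competes on equal footing with a well-sampled one at prediction time.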

To quantify the impact of PCL, we compare it with a baseline that uses the Hyena model with a classification head and cross-entropy loss. Both models are trained under the same settings on 2000 bp viral sequences. As shown in Fig. 4(b), PCL demonstrates a significant improvement over this traditional method. We further analyze the prediction performance of both models on small samples. Figure 4(c) and (d) show the confusion matrices for the 10 classes with the fewest training samples. PCL outperforms the traditional method in this few-shot setting, demonstrating better performance with small sample sizes.

Figure 4. a) Accuracy (ACC) at the genus level of the proposed ViTax and its three variants on the contig simulation dataset. b) Performance comparison between prototypical contrastive learning and the traditional method. c) Confusion matrix for PCL on the 10 classes with the fewest training samples. d) Confusion matrix for the traditional method on the 10 classes with the fewest training samples.

Ablation study of TBM module

To assess the factors contributing to the improvement of ViTax, ViTax is compared with three variants on the contig simulation dataset: ViTax without TBM, ViTax without the reverse complement (RC), and ViTax without both TBM and RC. As shown in Fig. 4(a), the prediction performance at all lengths is consistently better when the reverse complement of the input is included. In addition, TBM, as an adaptive hierarchical classification module, helps ViTax address OOD and open-set problems. As the virus sequence length increases, the advantages of TBM become more apparent. Notably, TBM not only adaptively adjusts the prediction hierarchy but also outperforms the version without TBM in genus-level classification accuracy. This enables ViTax to make accurate classifications even when facing unknown viruses or insufficient training samples.
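For reference, the reverse complement of a DNA sequence can be computed as in the sketch below. How ViTax combines the forward and reverse-complement strands is not described in this section, so the snippet only shows the transformation itself; `reverse_complement` is an illustrative helper name.

```python
# Translation table for the four bases (upper and lower case);
# other symbols such as N are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

print(reverse_complement("ATGCCN"))  # NGGCAT
```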

Using a tree-based classifier provides more detailed and adaptable results than direct classification. This approach not only better captures subtle differences in the data but also enhances ViTax’s generalization ability and its capability to find the lowest appropriate rank taxonomy for virus classification.

Application of sequences from real metagenomic sequencing data

To test the performance of ViTax in different ecosystems, we apply ViTax to real viral datasets from marine, gut, and soil environments. Below are the results for the marine and soil environments, with the gut results provided in the Supplementary materials.

Application of marine data

The GOV2.0 dataset is a DNA virome dataset comprising 195,728 viral populations, derived from the metagenomic assembly of 145 marine viromes [27]. We compare Kraken, CAT, and ViTax, which use k-mers, protein sequences, and nucleotide sequences as input, respectively. Figure 5(a) illustrates the consistency of predictions at the class level among the three tools. Kraken only predicts 23,363 contigs (11.0%), while CAT and ViTax predict significantly more viruses, with 60,000 (30.6%) and 71,827 (36.4%) predictions, respectively. Meanwhile, ViTax exhibits the highest prediction consistency with the other tools, showing consistency rates of 94.9% and 83.0% with CAT and Kraken, respectively, while the prediction consistency between Kraken and CAT is only 80.8%. These results indicate that, on real datasets, ViTax classifies more viruses while agreeing closely with the existing tools.
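A pairwise consistency rate of this kind could be computed as sketched below. The exact definition used in the paper is not given in this section; the sketch assumes it is the fraction of contigs classified by both tools that receive the same class label. The function name and the toy predictions are hypothetical.

```python
def consistency(pred_a, pred_b):
    """pred_a, pred_b: dicts mapping contig id -> predicted class; contigs a
    tool could not classify are simply absent. Returns the fraction of
    contigs predicted by both tools that receive the same label."""
    shared = pred_a.keys() & pred_b.keys()
    if not shared:
        return float("nan")
    return sum(pred_a[c] == pred_b[c] for c in shared) / len(shared)

# Toy class-level predictions for four contigs.
tool_a = {"c1": "Caudoviricetes", "c2": "Caudoviricetes",
          "c3": "Megaviricetes", "c4": "Caudoviricetes"}
tool_b = {"c1": "Caudoviricetes", "c2": "Megaviricetes",
          "c3": "Megaviricetes"}
print(round(consistency(tool_a, tool_b), 3))  # 0.667
```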

Figure 5. a) The class-level consistency of ViTax, Kraken, and CAT on the GOV2.0 dataset. b) The composition of taxonomy predicted by ViTax on the GOV2.0 dataset.

The composition of the genera predicted by ViTax is shown in Fig. 5(b). Many predictions made by ViTax are confirmed in real marine environments. For instance, Tethysvirus is an aquatic virus known to infect haptophytes [30], a group of marine algae. Another example is Elemovirus, which was identified by Hoetzinger et al. in the Baltic Sea [31]. Additionally, Eurybiavirus and Tangaroavirus are two types of viruses whose host is Prochlorococcus, a widespread marine cyanobacterium recognized as one of the most abundant photosynthetic organisms in the ocean [32]. The results demonstrate that ViTax can effectively predict the distribution of viruses in real-world datasets, proving to be a valuable tool for expanding our understanding of viruses in metagenomic studies.

Application of soil data

The GSV dataset [28] is a comprehensive global repository of soil viral metagenomic data, assembled by integrating data from 1,824 soil samples across various biomes, climates, and soil types worldwide. In the GSV study, the authors use an alignment-based method to classify viral sequences, but 1,415 sequences remain unclassified. To categorize these, the authors search the viral RefSeq database for similar proteins and apply a majority-rules method to assign classifications to these viruses.

We utilize ViTax to predict the taxonomic affiliations of the 1,415 viruses, validating its predictive accuracy on soil viral data. Notably, at the family level, these viruses present an open-set challenge for ViTax, as their family-level annotations are not included in the training dataset, requiring accurate classification without prior exposure to similar sequences.

Table 3 shows the prediction results for these 1,415 viruses using various tools. The Open-Set Error Rate refers to the proportion of viruses misclassified at the family level or below, and ACC refers to the accuracy of viruses correctly classified at the correct taxonomic level. Since Kraken mainly refuses to make predictions when facing unknown viruses, its ACC cannot be calculated. The results demonstrate that ViTax can effectively identify novel viruses and classify them into the lowest appropriate rank, with high accuracy.

Table 3.

The performance of the three tools in soil and open-set scenarios

	ViTax	Kraken	CAT
Open-set error rate	0.224	0.921	0.278
Accuracy	0.972	0.965	NA


Conclusion

Accurate viral taxonomy is of great significance for human pathology and microbial research. However, the field confronts substantial challenges, primarily the rapid mutation rates of viruses that complicate classification and necessitate frequent updates to taxonomic frameworks. The imbalance in viral classification data further exacerbates these challenges, skewing the accuracy of learning-based methods towards data-rich viruses and undermining performance for those with limited data.

To overcome these challenges, we propose ViTax for adaptive hierarchical virus taxonomy classification. It is powered by HyenaDNA, a foundation model for long-range genomic sequences at single nucleotide resolution, and by PCL; together they extract valuable information from long virus genomic sequences and mitigate the impact of highly imbalanced and mutant virus data on classification. Moreover, ViTax integrates long genomic sequence information and the hierarchical information of virus taxonomy by using the taxonomy belief mapping approach to construct a tree-based classifier. This enables ViTax to achieve more granular classification while adaptively assigning viruses to the lowest appropriate rank, addressing the open-set problem by reclassifying to higher, known taxonomic levels when the genus-level taxonomy is unknown. Experimental results show that this method performs well in virus taxonomy across sequences of different lengths, out-of-distribution scenarios, open-set scenarios, and real-world situations.

Meanwhile, in real-world environments, many factors beyond genomic data influence viral behavior, such as location, temperature, and salinity. Incorporating these environmental variables into classification models is a promising direction that could provide a deeper understanding of virus and ecological dynamics, offering a more comprehensive approach to viral classification. Overall, ViTax stands out as a superior method for the taxonomy and investigation of viral entities.

Key Points

Accurate classification of viral sequences is crucial for evaluating their impact on public health and ecosystems. We develop a viral taxonomy method named ViTax that utilizes the Hyena model and prototypical contrastive learning to learn associations between virus sequences.
In the prediction phase, we propose an adaptive hierarchical classification approach for virus taxonomy using a taxonomy belief tree, which applies the Lowest Common Ancestor algorithm and prior knowledge to handle unknown virus sequences. This approach enhances the embedding space’s granularity and incorporates evolutionary relationships, providing confidence levels for taxonomy classification.
The experimental results show that ViTax outperforms state-of-the-art methods. Furthermore, ViTax can identify novel viruses and assign viruses to the lowest appropriate taxonomy rank.

Supplementary Material

sup_bbae041.docx (5.2 MB, docx)

Acknowledgement

Shaorong Fang and Tianfu Wu from Information and Network Center of Xiamen University are acknowledged for the help with high performance computing (HPC).

Contributor Information

YuShuang He, Department of Automation, Xiamen University, Xiamen, Fujian 361005, China.

Feng Zhou, Department of Automation, Xiamen University, Xiamen, Fujian 361005, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361005, China.

JiaXing Bai, Department of Automation, Xiamen University, Xiamen, Fujian 361005, China.

YiChun Gao, Department of Automation, Xiamen University, Xiamen, Fujian 361005, China.

Xiaobing Huang, Department of Medical Oncology, Fuzhou First Hospital Affiliated with Fujian Medical University, Fuzhou, Fujian 350108, China.

Ying Wang, Department of Automation, Xiamen University, Xiamen, Fujian 361005, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361005, China; State Key Laboratory of Mariculture Breeding, Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen University, Xiamen, Fujian 350108, China.

Funding

This work was supported by National Natural Science Foundation of China (62173282, 62472363) and Fuzhou Inter-institutional Science and Technology Cooperation Project (2024-Y-018).

Data availability

An open-source implementation of the ViTax can be downloaded from https://github.com/Ying-Lab/ViTax.

References

1. Dronina J, Samukaite-Bubniene U, Ramanavicius A. Advances and insights in the diagnosis of viral infections. J Nanobiotechnol 2021;19:1–23. 10.1186/s12951-021-01081-2.
2. Suttle CA. Marine viruses—major players in the global ecosystem. Nat Rev Microbiol 2007;5:801–12. 10.1038/nrmicro1750.
3. Breitbart M, Rohwer F. Here a virus, there a virus, everywhere the same virus? Trends Microbiol 2005;13:278–84. 10.1016/j.tim.2005.04.003.
4. Siddell SG, Smith DB, Adriaenssens E. et al. Virus taxonomy and the role of the international committee on taxonomy of viruses (ictv). J Gen Virol 2023;104:001840. 10.1099/jgv.0.001840.
5. Zerbini FM, Siddell SG, Lefkowitz EJ. et al. Changes to virus taxonomy and the ICTV statutes ratified by the International Committee on Taxonomy of Viruses (2023). Arch Virol 2023;168:175. 10.1007/s00705-023-05797-4.
6. Geng C, Huang S-j, Chen S. Recent advances in open set recognition: a survey. IEEE Trans Pattern Anal Mach Intell 2020;43:3614–31.
7. Nguyen E, Poli M, Faizi M. et al. HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution. Adv Neural Inf Process Syst 2024;36:43177–201.
8. Aho AV, Hopcroft JE, Ullman JD. On finding lowest common ancestors in trees. In: Proceedings of the fifth annual ACM symposium on Theory of computing. pp. 253–65. New York: Association for Computing Machinery, 1973.
9. Poli M, Massaroli S, Nguyen E. et al. Hyena hierarchy: towards larger convolutional language models. In: International Conference on Machine Learning, pp. 28043–78. Hawaii: PMLR, 2023.
10. Ye J, McGinnis S, Madden TL. BLAST: improvements for better sequence analysis. Nucleic Acids Res 2006;34:W6–9. 10.1093/nar/gkl164.
11. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol 2019;20:1–13. 10.1186/s13059-019-1891-0.
12. Raju RS, Al Nahid, Dev PC. et al. VirusTaxo: taxonomic classification of viruses from the genome sequence using k-mer enrichment. Genomics 2022;114:110414. 10.1016/j.ygeno.2022.110414.
13. Ounit R, Wanamaker S, Close TJ. et al. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 2015;16:1–13.
14. Meijenfeldt FABV, Arkhipova K, Cambuy DD. et al. Robust taxonomic classification of uncharted microbial sequences and bins with cat and bat. Genome Biol 2019;20:1–14. 10.1186/s13059-019-1817-x.
15. Jiang J-Z, Yuan W-G, Shang J. et al. Virus classification for viral genomic fragments using phagcn2. Brief Bioinform 2023;24:bbac505. 10.1093/bib/bbac505.
16. Shang J, Jiang J, Sun Y. Bacteriophage classification for assembled contigs using graph convolutional network. Bioinformatics 2021;37:i25–33. 10.1093/bioinformatics/btab293.
17. Guan J, Peng C, Shang J. et al. PhaGenus: genus-level classification of bacteriophages using a transformer model. Brief Bioinform 2023;24:bbad408. 10.1093/bib/bbad408.
18. Bolduc B, Jang HB, Doulcier G. et al. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect archaea and bacteria. PeerJ 2017;5:e3243. 10.7717/peerj.3243.
19. Turner D, Shkoporov AN, Lood C. et al. Abolishment of morphology-based taxa and change to binomial species names: 2022 taxonomy update of the ICTV Bacterial Viruses Subcommittee. Arch Virol 2023;168:74. 10.1007/s00705-022-05694-2.
20. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. Adv Neural Inf Process Syst 2017;30:6000–10.
21. Ji Y, Zhou Z, Liu H. et al. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 2021;37:2112–20. 10.1093/bioinformatics/btab083.
22. Zhou Z, Ji Y, Li W. et al. DNABERT-2: efficient foundation model and benchmark for multi-species genome. In: International Conference on Learning Representations. Vienna, Austria: ICLR publisher, 2024.
23. Li J, Pan Z, Xiong C. et al. Prototypical contrastive learning of unsupervised representations. Proceedings of the Ninth International Conference on Learning Representations. ICLR, 2021, 4–8.
24. McLachlan GJ, Krishnan T. The EM Algorithm and Extensions. International Conference on Database Systems for Advanced Applications, pp. 230–35. Berlin, Heidelberg: Springer, 2009.
25. Wu J, Chen J, Wu J. et al. Understanding contrastive learning via distributionally robust optimization. Adv Neural Inf Process Syst 2024;36:23297–320.
26. Steinley D. K-means clustering: a half-century synthesis. Br J Math Stat Psychol 2006;59:1–34. 10.1348/000711005X48266.
27. Gregory AC, Zayed AA, Conceição-Neto N. et al. Marine DNA viral macro-and microdiversity from pole to pole. Cell 2019;177:1109–1123.e14. 10.1016/j.cell.2019.03.040.
28. Ma B, Wang Y, Zhao K. et al. Biogeographic patterns and drivers of soil viromes. Nat Ecol Evol 2024;8:717–28. 10.1038/s41559-024-02347-2.
29. Camarillo-Guerrero LF, Almeida A, Rangel-Pineros G. et al. Massive expansion of human gut bacteriophage diversity. Cell 2021;184:1098–1109.e9. 10.1016/j.cell.2021.01.029.
30. Aylward FO, Abrahão JS, Brussaard CPD. et al. Taxonomic update for giant viruses in the order imitervirales (phylum nucleocytoviricota). Arch Virol 2023;168:283. 10.1007/s00705-023-05906-3.
31. Hoetzinger M, Nilsson E, Arabi R. et al. Dynamics of Baltic Sea phages driven by environmental changes. Environ Microbiol 2021;23:4576–94. 10.1111/1462-2920.15651.
32. Partensky F, Hess WR, Vaulot D. Prochlorococcus, a marine photosynthetic prokaryote of global significance. Microbiol Mol Biol Rev 1999;63:106–27. 10.1128/MMBR.63.1.106-127.1999.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

sup_bbae041

sup_bbae041.docx^{(5.2MB, docx)}

Data Availability Statement

An open-source implementation of the ViTax can be downloaded from https://github.com/Ying-Lab/ViTax.

[ref1] 1. Dronina J, Samukaite-Bubniene U, Ramanavicius A. Advances and insights in the diagnosis of viral infections. J Nanobiotechnol 2021;19:1–23. 10.1186/s12951-021-01081-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref2] 2. Suttle CA. Marine viruses—major players in the global ecosystem. Nat Rev Microbiol 2007;5:801–12. 10.1038/nrmicro1750. [DOI] [PubMed] [Google Scholar]

[ref3] 3. Breitbart M, Rohwer F. Here a virus, there a virus, everywhere the same virus? Trends Microbiol 2005;13:278–84. 10.1016/j.tim.2005.04.003. [DOI] [PubMed] [Google Scholar]

[ref4] 4. Siddell SG, Smith DB, Adriaenssens E. et al. Virus taxonomy and the role of the international committee on taxonomy of viruses (ictv). J Gen Virol 2023;104:001840. 10.1099/jgv.0.001840. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref5] 5. Zerbini FM, Siddell SG, Lefkowitz EJ. et al. Changes to virus taxonomy and the ICTV statutes ratified by the International Committee on Taxonomy of Viruses (2023). Arch Virol 2023;168:175. 10.1007/s00705-023-05797-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref6] 6. Geng C, Huang S-j, Chen S. Recent advances in open set recognition: a survey. IEEE Trans Pattern Anal Mach Intell 2020;43:3614–31. [DOI] [PubMed] [Google Scholar]

[ref7] 7. Nguyen E, Poli M, Faizi M. et al. HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution. Adv Neural Inf Process Syst 2024;36:43177–201. [Google Scholar]

[ref8] 8. Aho AV, Hopcroft JE, Ullman JD. On finding lowest common ancestors in trees. In: Proceedings of the fifth annual ACM symposium on Theory of computing. pp. 253–65. New York: Association for Computing Machinery, 1973.

[ref9] 9. Poli M, Massaroli S, Nguyen E. et al. Hyena hierarchy: towards larger convolutional language models. In: International Conference on Machine Learning, pp. 28043–78. Hawaii: PMLR, 2023. [Google Scholar]

[ref10] 10. Ye J, McGinnis S, Madden TL. BLAST: improvements for better sequence analysis. Nucleic Acids Res 2006;34:W6–9. 10.1093/nar/gkl164. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref11] 11. Wood DE, Jennifer L, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol 2019;20:1–13. 10.1186/s13059-019-1891-0. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref12] 12. Raju RS, Al Nahid, Dev PC. et al. VirusTaxo: taxonomic classification of viruses from the genome sequence using k-mer enrichment. Genomics 2022;114:110414. 10.1016/j.ygeno.2022.110414. [DOI] [PubMed] [Google Scholar]

[ref13] 13. Ounit R, Wanamaker S, Close TJ. et al. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 2015;16:1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref14] 14. Meijenfeldt FABV, Arkhipova K, Cambuy DD. et al. Robust taxonomic classification of uncharted microbial sequences and bins with cat and bat. Genome Biol 2019;20:1–14. 10.1186/s13059-019-1817-x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref15] 15. Jiang J-Z, Yuan W-G, Shang J. et al. Virus classification for viral genomic fragments using phagcn2. Brief Bioinform 2023;24:bbac505. 10.1093/bib/bbac505. [DOI] [PubMed] [Google Scholar]

[ref16] 16. Shang J, Jiang J, Sun Y. Bacteriophage classification for assembled contigs using graph convolutional network. Bioinformatics 2021;37:i25–33. 10.1093/bioinformatics/btab293. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref17] 17. Guan J, Peng C, Shang J. et al. PhaGenus: genus-level classification of bacteriophages using a transformer model. Brief Bioinform 2023;24:bbad408. 10.1093/bib/bbad408. [DOI] [PubMed] [Google Scholar]

[ref18] 18. Bolduc B, Jang HB, Doulcier G. et al. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect archaea and bacteria. PeerJ 2017;5:e3243. 10.7717/peerj.3243. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref19] 19. Turner D, Shkoporov AN, Lood C. et al. Abolishment of morphology-based taxa and change to binomial species names: 2022 taxonomy update of the ICTV Bacterial Viruses Subcommittee. Arch Virol 2023;168:74. 10.1007/s00705-022-05694-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref20] 20. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. Adv Neural Inf Process Syst 2017;30:6000–10. [Google Scholar]

[ref21] 21. Ji Y, Zhou Z, Liu H. et al. DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics 2021;37:2112–20. 10.1093/bioinformatics/btab083. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref22] 22. Zhou Z, Ji Y, Li W. et al. DNABERT-2: efficient foundation model and benchmark for multi-species genome. In: International Conference on Learning Representations. Vienna, Austria: ICLR publisher, 2024.

[ref23] 23. Li J, Pan Z, Xiong C. et al. Prototypical contrastive learning of unsupervised representations. Proceedings of the Ninth International Conference on Learning Representations. ICLR, 2021, 4–8.

[ref24] 24. McLachlan GJ, Krishnan T. The EM Algorithm and Extensions. International Conference on Database Systems for Advanced Applications, pp. 230–35. Berlin, Heidelberg: Springer, 2009. [Google Scholar]

[ref25] 25. Wu J, Chen J, Wu J. et al. Understanding contrastive learning via distributionally robust optimization. Adv Neural Inf Process Syst 2024;36:23297–320. [Google Scholar]

[ref26] 26. Steinley D. K-means clustering: a half-century synthesis. Br J Math Stat Psychol 2006;59:1–34. 10.1348/000711005X48266. [DOI] [PubMed] [Google Scholar]

[ref27] 27. Gregory AC, Zayed AA, Conceição-Neto N. et al. Marine DNA viral macro- and microdiversity from pole to pole. Cell 2019;177:1109–1123.e14. 10.1016/j.cell.2019.03.040. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref28] 28. Ma B, Wang Y, Zhao K. et al. Biogeographic patterns and drivers of soil viromes. Nat Ecol Evol 2024;8:717–28. 10.1038/s41559-024-02347-2. [DOI] [PubMed] [Google Scholar]

[ref29] 29. Camarillo-Guerrero LF, Almeida A, Rangel-Pineros G. et al. Massive expansion of human gut bacteriophage diversity. Cell 2021;184:1098–1109.e9. 10.1016/j.cell.2021.01.029. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref30] 30. Aylward FO, Abrahão JS, Brussaard CPD. et al. Taxonomic update for giant viruses in the order Imitervirales (phylum Nucleocytoviricota). Arch Virol 2023;168:283. 10.1007/s00705-023-05906-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref31] 31. Hoetzinger M, Nilsson E, Arabi R. et al. Dynamics of Baltic Sea phages driven by environmental changes. Environ Microbiol 2021;23:4576–94. 10.1111/1462-2920.15651. [DOI] [PubMed] [Google Scholar]

[ref32] 32. Partensky F, Hess WR, Vaulot D. Prochlorococcus, a marine photosynthetic prokaryote of global significance. Microbiol Mol Biol Rev 1999;63:106–27. 10.1128/MMBR.63.1.106-127.1999. [DOI] [PMC free article] [PubMed] [Google Scholar]
