The optimal metric for viral genome space

Hongyu Yu; Stephen S-T Yau

2024 May 10;23:2083–2096. doi: 10.1016/j.csbj.2024.05.005

PMCID: PMC11128839 PMID: 38803517

Abstract

Understanding the structural similarity between genomes is pivotal in classification and phylogenetic analysis. As the number of known genomes rockets, alignment-free methods have gained considerable attention. Among these methods, the natural vector method stands out as it represents sequences as vectors using statistical moments, enabling effective clustering based on families in biological taxonomy. However, determining an optimal metric that combines different elements in natural vectors remains challenging due to the absence of a rigorous theoretical framework for weighting different k-mers and orders. In this study, we address this challenge by transforming the determination of optimal weights into an optimization problem and resolving it through gradient-based techniques. Our experimental results underscore the substantial improvement in classification accuracy achieved by employing these optimal weights, reaching an impressive 92.73% on the testing set, surpassing other alignment-free methods. On one hand, our method offers an outstanding metric for virus classification, and on the other hand, it provides valuable insights into feature integration within alignment-free methods.

Keywords: Alignment-free methods, Feature integration, Natural vector, Optimal metric, Viral genomes, Classification

Highlights

• We propose an alignment-free method for the automatic integration of diverse features.
• We improve the performance of the existing methods based on optimization theory.
• We determine the optimal metric for viral genomes within the framework of moments.
• Our method surpasses other alignment-free methods in virus classification.

1. Introduction

The study of genome relationships has garnered significant attention in recent years as it provides a fundamental approach to understanding the connections among organisms. Traditional methods for sequence comparison rely on alignment, which, while effective, can be time-consuming [1], [2], [3], [4]. With the advancement of sequencing techniques, the number of known genomes has increased rapidly. Consequently, more and more researchers are turning to alignment-free methods due to their high efficiency. A major idea in alignment-free methods is to embed each genome as a point in a vector space. This transformation allows the sequence comparison problem to be recast as a classification or clustering problem for vectors, which can be readily solved using machine learning algorithms such as the K-NN method [5] or the K-means method [6]. In addition to its applications in classification and clustering, the concept of sequence embedding offers a novel approach to understanding genomes, representing each genome as a point in genome space and each family (or other level of classification) as a cluster, providing a geometric perspective on genome analysis. In 2008, the Defense Advanced Research Projects Agency (DARPA) proposed two problems, namely “The Geometry of Genome Space” and “What are the Fundamental Laws of Biology?”, along with 21 other challenges in pure and applied mathematics [7]. These challenges have spurred researchers from diverse academic backgrounds to investigate the genome space and its metrics.

There are various alignment-free methods for embedding sequences into vector spaces and defining metrics, mostly rooted in the analysis of k-mers from probabilistic, statistical, or information theory perspectives [8], [9], [10], [11], [12], [13]. The natural vector (NV) method is an effective method that incorporates the concept of statistical moments, transforming sequences into feature information based on different k-mers and different moment orders [14], [15]. Features here refer to various types of numerical elements extracted from the sequences. By assigning weights to various types of feature information, a comprehensive metric is established. Prior research has validated the effectiveness and efficiency of the NV-based method. Notably, the convex hulls formed by NVs from distinct families do not overlap, demonstrating that NVs belonging to the same family cluster together [16], [17], [18]. Furthermore, the comprehensive metric introduced by NV facilitates the efficient classification and phylogenetic analysis of biological sequences [15], [17].

Despite the success of NV-based methods, determining an optimal metric remains a challenging task due to the lack of a rigorous theoretical framework for weighting different k-mers and orders. In previous studies, these weights have been manually assigned [17]. While experimental results have shown the efficacy of such manual weight selection for real biological data, the search for an optimal weight for classification remains ongoing.

In this paper, we approach the weight selection as an optimization problem. We take a smooth approximation of the classification accuracy as the objective function and employ a modified version of the gradient descent method to calculate an optimal weight. The utilization of the optimal weight for classification yields an accuracy of 92.73% for the testing set, which is 4.88% higher than the best performance achieved by six other alignment-free approaches including both NV-based methods with manually determined weights and methods derived from other perspectives [17], [11], [12], [13]. Moreover, we extend our analysis by applying the optimal weight to each Baltimore class and fine-tuning the weights within the classes. Subsequently, we construct phylogenetic trees based on the fine-tuned optimal weight. This research makes three significant contributions. Firstly, we present a rapid and accurate algorithm for classifying new genomes, which becomes increasingly vital as more genomes are discovered. Secondly, the distance metric derived from the optimal weight can be applied in the construction of phylogenetic trees for organisms, especially for viruses, where the absence of common genes found in cellular organisms poses a challenge [19]. Finally, our method offers an opportunity for the integration of the numerous alignment-free features currently available, facilitating further advancements in alignment-free methods.

2. Materials and methods

2.1. Dataset

The data utilized in this study comprise the complete virus reference sequences sourced from the National Center for Biotechnology Information (NCBI) up to June 30, 2022. The sequences can be accessed via the following URL: https://ftp.ncbi.nlm.nih.gov/refseq/release/viral. To ensure data quality, a data cleaning procedure was performed, which involved the removal of three types of sequences: (1) sequences containing nucleotides other than A, T (U), C, G; (2) sequences lacking a family label; and (3) sequences belonging to families consisting of only a single sequence. Before the data cleaning process, there were 14,813 sequences. Following this process, the dataset retained a total of 11,559 sequences from 123 families. It is worth noting that our data contain multi-segment viruses; in that case, each sequence represents one segment. To establish the training and testing sets, 80% of the sequences were randomly selected as the training set, while the remaining sequences constituted the testing set. The GenBank IDs are listed in the Supplementary material and can also be found at https://github.com/BobYHY/OptimalMetric.

2.2. Natural vectors and k-mer natural vectors

The natural vector method is an alignment-free method that transforms DNA sequences into vectors of moments [14]. Consider the sequence $S = s_{1} s_{2} \ldots s_{n}$ and define

$$w_{k}(s_{i}) = \begin{cases} 1, & s_{i} = k \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

where $k, s_{i} \in \{A, T, C, G\}$. Then the natural vector of order m can be defined as

$$(n_{A}, n_{C}, n_{G}, n_{T}, \mu_{A}, \mu_{C}, \mu_{G}, \mu_{T}, D_{2}^{A}, D_{2}^{C}, D_{2}^{G}, D_{2}^{T}, \ldots, D_{m}^{A}, D_{m}^{C}, D_{m}^{G}, D_{m}^{T}) \tag{2}$$

where

$$\begin{aligned} n_{k} &= \sum_{i=1}^{n} w_{k}(s_{i}) \\ \mu_{k} &= \sum_{i=1}^{n} \frac{i}{n_{k}} w_{k}(s_{i}) \\ D_{j}^{k} &= \sum_{i=1}^{n} \frac{(i - \mu_{k})^{j}}{n_{k}^{j-1} n^{j-1}} w_{k}(s_{i}) \\ n &= n_{A} + n_{T} + n_{C} + n_{G} \end{aligned} \tag{3}$$

$n_{k}$ and $\mu_{k}$ are referred to as the order 0 element and order 1 element, respectively. $D_{j}^{k}$ denotes the order j element.
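As an illustration of Eqs. (1)-(3), the following is a minimal sketch in plain Python, not the authors' implementation. The function name `natural_vector` and its parameters are hypothetical, and it assumes the sequence contains only the symbols A, C, G, T (as guaranteed by the data cleaning step). Elements of absent nucleotides are set to 0.

```python
def natural_vector(seq, max_order=2, alphabet="ACGT"):
    """Return (n_A..n_T, mu_A..mu_T, D_2^A..D_2^T, ..., D_m^A..D_m^T) as a flat list."""
    n = len(seq)
    # 1-based positions of each nucleotide, matching the index i in Eq. (3).
    positions = {k: [] for k in alphabet}
    for i, s in enumerate(seq, start=1):
        positions[s].append(i)  # raises KeyError on symbols outside the alphabet

    counts = [len(positions[k]) for k in alphabet]            # order 0 elements
    means = [sum(positions[k]) / len(positions[k]) if positions[k] else 0.0
             for k in alphabet]                               # order 1 elements

    moments = []                                              # order 2..m elements
    for j in range(2, max_order + 1):
        for k, n_k, mu_k in zip(alphabet, counts, means):
            if n_k == 0:
                moments.append(0.0)
                continue
            central = sum((i - mu_k) ** j for i in positions[k])
            moments.append(central / (n_k ** (j - 1) * n ** (j - 1)))
    return counts + means + moments


vec = natural_vector("ACGTA")
print(vec)
```

For the toy sequence `ACGTA`, nucleotide A occurs at positions 1 and 5, so its count is 2, its mean position is 3, and its order 2 element is (4 + 4) / (2 · 5) = 0.8.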

The k-mer natural vector method is an extension of the natural vector method [15]. A k-mer is a string composed of k nucleotides, and there are $4^{k}$ possible k-mers (denoted by $l_{1}, \ldots, l_{4^{k}}$). The sequence $S = s_{1} s_{2} \ldots s_{n}$ can be regarded as a sequence of $n - k + 1$ overlapping k-mers $(s_{1} \ldots s_{k}) \ldots (s_{n-k+1} \ldots s_{n})$. Similar to traditional natural vectors, we can define the k-mer natural vector

$$(n_{l_{1}}, \ldots, n_{l_{4^{k}}}, \mu_{l_{1}}, \ldots, \mu_{l_{4^{k}}}, D_{2}^{l_{1}}, \ldots, D_{2}^{l_{4^{k}}}, \ldots, D_{m}^{l_{1}}, \ldots, D_{m}^{l_{4^{k}}}).$$

(If $n_{l_{i}} = 0$, we let $\mu_{l_{i}} = D_{2}^{l_{i}} = \cdots = 0$.)
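The k-mer extension can be sketched in the same way. The function below is a hypothetical illustration rather than the published code. It assumes that positions are the 1-based start indices of the overlapping k-mers, that the normalizing length in the order j element is the number of k-mers $n - k + 1$, and that k-mers are enumerated in lexicographic order. It returns one block per order so that per-order distances can be computed separately.

```python
from itertools import product


def kmer_natural_vector(seq, k, max_order=2, alphabet="ACGT"):
    """Return {order: list of 4**k values}, with k-mers in lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    n = len(seq) - k + 1  # number of overlapping k-mers in the sequence
    positions = {kmer: [] for kmer in kmers}
    for i in range(n):
        positions[seq[i:i + k]].append(i + 1)  # 1-based start position

    vector = {
        0: [len(positions[kmer]) for kmer in kmers],
        1: [sum(positions[kmer]) / len(positions[kmer]) if positions[kmer] else 0.0
            for kmer in kmers],
    }
    for j in range(2, max_order + 1):
        row = []
        for kmer, n_l, mu_l in zip(kmers, vector[0], vector[1]):
            if n_l == 0:
                row.append(0.0)  # convention stated in the text for absent k-mers
                continue
            central = sum((i - mu_l) ** j for i in positions[kmer])
            row.append(central / (n_l ** (j - 1) * n ** (j - 1)))
        vector[j] = row
    return vector


v = kmer_natural_vector("ACAC", k=2)
print(v[0][1], v[1][1], v[2][1])  # elements of the 2-mer "AC"
```

Keeping the orders in separate blocks mirrors the way the weighted metric treats each (k-mer, order) pair as its own feature.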

2.3. The optimal weight and the algorithm for training

There are elements of different k-mers (k = 1, ..., K) and different orders (m = 0, ..., M), as mentioned before. Let $\mathrm{dis}_{km}(i, j)$ represent the Euclidean distance between the k-mer order m elements of sequence i and sequence j. By assigning a weight $w_{km}$ $(k = 1, \ldots, K;\ m = 0, \ldots, M)$ to each distance, we can formulate a weighted metric:

$$\mathrm{Dis}^{w}(i, j) = \sum_{k=1}^{K} \sum_{m=0}^{M} w_{km}\, \mathrm{dis}_{km}(i, j). \tag{4}$$

To determine the optimal weight for classification purposes, we need a scoring criterion to evaluate different weights. In other words, we should develop a smooth function that quantifies the effectiveness of a given weight w. In the case of sequence classification using natural vectors, the 1-nearest neighbor (1-NN) method with the leave-one-out strategy is commonly employed [5]. Therefore, a natural approach is to utilize the accuracy of predictions obtained from the 1-NN method with the leave-one-out strategy as the score for a particular weight, i.e.,

$$S(w) := \frac{1}{N} \sum_{i=1}^{N} 1_{\{F(i) = F(\arg\min_{j \neq i} \mathrm{Dis}^{w}(i, j))\}} \tag{5}$$

where N is the number of the sequences, $1_{A}$ is the indicator function of the set A and $F (i)$ is the family that the i-th sequence belongs to.
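To make Eqs. (4) and (5) concrete, here is a small sketch under stated assumptions: the per-feature distance matrices are given as nested lists keyed by a hypothetical `(k, m)` tuple, and `abs_diff_matrix` is a helper invented only to build toy data. Ties in the nearest neighbor are broken by taking the first index, whereas Eq. (5) implicitly assumes a unique minimizer.

```python
def weighted_distance(dis, w):
    """Combine per-feature distance matrices into Dis^w as in Eq. (4)."""
    n = len(next(iter(dis.values())))
    return [[sum(w[key] * dis[key][i][j] for key in dis) for j in range(n)]
            for i in range(n)]


def loo_1nn_accuracy(D, labels):
    """Leave-one-out 1-NN accuracy S(w) of Eq. (5) for a distance matrix D."""
    n = len(labels)
    correct = 0
    for i in range(n):
        nearest = min((j for j in range(n) if j != i), key=lambda j: D[i][j])
        correct += labels[nearest] == labels[i]
    return correct / n


def abs_diff_matrix(values):
    """Toy helper: pairwise |a - b| for a list of scalar features."""
    return [[abs(a - b) for b in values] for a in values]


labels = ["A", "A", "B", "B"]
dis = {
    (1, 0): abs_diff_matrix([0.0, 1.0, 10.0, 11.0]),  # separates the two families
    (1, 1): abs_diff_matrix([0.0, 5.0, 0.5, 5.5]),    # pairs up the wrong sequences
}
good = loo_1nn_accuracy(weighted_distance(dis, {(1, 0): 1.0, (1, 1): 0.0}), labels)
bad = loo_1nn_accuracy(weighted_distance(dis, {(1, 0): 0.0, (1, 1): 1.0}), labels)
print(good, bad)
```

In this toy case, putting all weight on the first feature classifies every sequence correctly, while putting all weight on the second misclassifies every sequence, which is the kind of dependence on w that the optimization exploits.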

However, $S(w)$ defined above is not continuous in w, so it is difficult to optimize directly. Therefore, we consider a smooth approximation of $S(w)$. Let $f_{n}(x) = \frac{1}{x^{n}}$ and define

$$S_{n}(w) := \frac{1}{N} \sum_{i=1}^{N} C_{i}(w) \tag{6}$$

where

$$C_{i}(w) := \frac{\sum_{F(j) = F(i),\, j \neq i} f_{n}(\mathrm{Dis}^{w}(i, j))}{\sum_{j \neq i} f_{n}(\mathrm{Dis}^{w}(i, j))}. \tag{7}$$

Given a fixed weight $w^{(0)}$ and a fixed integer $i_{0}$, suppose $j_{0} = \arg\min_{j \neq i_{0}} \mathrm{Dis}^{w^{(0)}}(i_{0}, j)$ is well-defined (the nearest neighbor is unique). Then we can prove that $\lim_{n \to +\infty} C_{i_{0}}(w^{(0)}) = 1_{\{F(i_{0}) = F(j_{0})\}}$ and therefore $\lim_{n \to +\infty} S_{n}(w^{(0)}) = S(w^{(0)})$. The proof is simple:

$$\mathrm{Dis}^{w^{(0)}}(i_{0}, j_{0}) < \min_{j \neq j_{0},\, j \neq i_{0}} \mathrm{Dis}^{w^{(0)}}(i_{0}, j)$$

$$f_{n}(\mathrm{Dis}^{w^{(0)}}(i_{0}, j_{0})) > \max_{j \neq j_{0},\, j \neq i_{0}} f_{n}(\mathrm{Dis}^{w^{(0)}}(i_{0}, j))$$

$$\lim_{n \to +\infty} \frac{\max_{j \neq j_{0},\, j \neq i_{0}} f_{n}(\mathrm{Dis}^{w^{(0)}}(i_{0}, j))}{f_{n}(\mathrm{Dis}^{w^{(0)}}(i_{0}, j_{0}))} = 0$$

Therefore, when n is sufficiently large, the term $f_{n}(\mathrm{Dis}^{w^{(0)}}(i_{0}, j_{0}))$ becomes much larger than any other term in the fraction defining $C_{i_{0}}$, rendering the other terms negligible. That is, if $F(i_{0}) = F(j_{0})$, then $\lim_{n \to +\infty} C_{i_{0}}(w^{(0)}) = 1$; otherwise, $\lim_{n \to +\infty} C_{i_{0}}(w^{(0)}) = 0$.
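The convergence of $S_{n}$ to $S$ can be checked numerically on a toy example. The sketch below is illustrative only: the function name `smooth_accuracy` and the four-point data set are assumptions, and all off-diagonal distances are taken to be strictly positive so that $f_{n}$ is defined. For these points the exact leave-one-out 1-NN accuracy is 0.5 (two of the four nearest neighbors share the query's label).

```python
def smooth_accuracy(D, labels, n):
    """Smooth approximation S_n of Eqs. (6)-(7) with f_n(x) = x ** (-n)."""
    N = len(labels)
    total = 0.0
    for i in range(N):
        same_family = 0.0
        everyone = 0.0
        for j in range(N):
            if j == i:
                continue
            f = D[i][j] ** (-n)  # assumes D[i][j] > 0 for i != j
            everyone += f
            if labels[j] == labels[i]:
                same_family += f
        total += same_family / everyone  # C_i(w) of Eq. (7)
    return total / N


points = [0.0, 1.0, 1.5, 10.0]
labels = ["A", "A", "B", "B"]
D = [[abs(a - b) for b in points] for a in points]

for n in (1, 45, 200):
    print(n, smooth_accuracy(D, labels, n))
```

The printed values approach 0.5 as n grows. Very large exponents eventually overflow or underflow floating-point arithmetic, which is consistent with the text's remark that the exponent should be large but not excessively so.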

Therefore, we can approximate S by $S_{n}$, and our goal is to solve the following optimization problem given $n_{0}$, K, and M:

$$\max_{w} S_{n_{0}}(w) \quad \text{s.t.} \quad w_{km} \geq 0, \; k = 1, \ldots, K; \; m = 0, \ldots, M. \tag{8}$$

The gradient descent method is a traditional optimization algorithm for unconstrained problems. It can also be applied to constrained problems by projecting the result of each step onto the feasible region. Let $\mathrm{ReLU}(x) = \max(x, 0)$, applied elementwise to vectors. Since problem (8) is a maximization, each step moves along the gradient rather than against it, and the problem can be solved by the iteration below:

$$w^{(n+1)} = \mathrm{ReLU}\left(w^{(n)} + l \nabla S_{n_{0}}(w^{(n)})\right) \tag{9}$$

where l is the learning rate.
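Iteration (9) can be sketched on a toy problem. The example below is not Algorithm 1 (it has no backtracking line search) and does not use $S_{n_{0}}$; it applies the same projected update to a hypothetical concave objective $g(w) = -(w_{0} - 1)^{2} - (w_{1} + 2)^{2}$, whose maximizer under $w \geq 0$ is $(1, 0)$. The names `projected_gradient_ascent` and `toy_grad` are assumptions made for illustration.

```python
def relu(w):
    """Elementwise projection onto the feasible region w >= 0."""
    return [max(x, 0.0) for x in w]


def projected_gradient_ascent(grad, w0, lr=0.1, steps=200):
    """Repeat w <- ReLU(w + lr * grad(w)), the update of Eq. (9)."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = relu([wi + lr * gi for wi, gi in zip(w, g)])
    return w


def toy_grad(w):
    """Gradient of the toy objective g(w) = -(w0 - 1)^2 - (w1 + 2)^2."""
    return [-2.0 * (w[0] - 1.0), -2.0 * (w[1] + 2.0)]


w_star = projected_gradient_ascent(toy_grad, [0.5, 0.5])
print(w_star)
```

The second coordinate is pushed toward its unconstrained optimum of -2 and is clipped to 0 at every step, which shows how the projection keeps the weights nonnegative.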

Many modified versions of the gradient descent method, including stochastic variants such as Adam [20], have been proposed. However, we found through numerical experiments that using these stochastic modifications in Algorithm 1 did not yield favorable results, so they were not implemented. Instead, we adapt the idea of backtracking line search to this problem and propose the following algorithm.

Algorithm 1 — The algorithm that solves the optimization problem (8).

In this paper, we choose $n_{0} = 45$ , $K = 9$ , $M = 2$ , $l = 0.1$ , and $P = 50$ . We choose $n_{0}$ to be sufficiently large, but not excessively so, to prevent exceeding the numerical bounds when computing $S_{n_{0}}$ . The choice of K and M is based on previous research [17]. The learning rate l represents the proportion of updates to the original weights. We select it to be a substantial yet not excessive quantity. For P, we opt for a value that is sufficiently large.

In addition, we choose $A_{km}$ such that each element of the initial weight is inversely proportional to the mean of the corresponding distance matrix, i.e., $A_{km} E[\mathrm{dis}_{km}] = \text{constant}$, which prevents any feature from being ignored merely because of its magnitude.
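A minimal sketch of this initialization follows. It assumes that the mean $E[\mathrm{dis}_{km}]$ is taken over the off-diagonal entries of each distance matrix and that the constant is 1; the text does not specify either choice, and the function name `initial_weights` is hypothetical.

```python
def initial_weights(dis, constant=1.0):
    """Return A with A[key] * mean(dis[key]) == constant for every feature key."""
    A = {}
    for key, mat in dis.items():
        n = len(mat)
        # Assumption: average over off-diagonal entries only.
        mean = sum(mat[i][j] for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
        A[key] = constant / mean
    return A


toy_dis = {
    (1, 0): [[0.0, 2.0], [2.0, 0.0]],      # small-magnitude feature
    (1, 1): [[0.0, 200.0], [200.0, 0.0]],  # large-magnitude feature
}
A = initial_weights(toy_dis)
print(A)
```

With these weights, both toy features contribute the same average amount to the weighted distance at the start of training, even though their raw magnitudes differ by a factor of 100.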

We implement the algorithm in the PyTorch framework [21]. The code can be found in both the Supplementary material and the GitHub repository https://github.com/BobYHY/OptimalMetric.

2.4. The phylogenetic analysis

To construct a phylogenetic tree, we utilize the Hausdorff distance to measure the distance between families. The Hausdorff distance quantifies the extent of separation between two subsets. Its definition is as follows:

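$$d_{H}(X, Y) = \max\left\{\sup_{x \in X} d(x, Y), \; \sup_{y \in Y} d(X, y)\right\}. \tag{10}$$

Here $d(x, Y) = \inf_{y \in Y} d(x, y)$ is the distance from a point to a set, and $d(X, y)$ is defined analogously.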

The Hausdorff distance has been found to perform well between sets of natural vectors in previous studies, and it satisfies the triangle inequality, which makes it a well-defined distance in mathematics [22]. Using the distance matrix obtained from the Hausdorff distance, we employ the BioNJ algorithm [23], an enhanced version of the neighbor-joining algorithm [24], to construct the tree. We run the algorithm online through the following website: http://www.atgc-montpellier.fr/fastme/ [25]. The trees are visualized using iTOL [26].
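For finite sets, the suprema and infima in Eq. (10) become maxima and minima. The sketch below illustrates this with plain Euclidean distance between toy 2-D points; in the paper the point-to-point distance would be the weighted metric between genomes, and the function name `hausdorff` is an assumption.

```python
import math


def hausdorff(X, Y, d=math.dist):
    """Hausdorff distance of Eq. (10) between two finite, non-empty point sets."""
    x_to_Y = max(min(d(x, y) for y in Y) for x in X)  # sup over X of d(x, Y)
    y_to_X = max(min(d(x, y) for x in X) for y in Y)  # sup over Y of d(X, y)
    return max(x_to_Y, y_to_X)


family_a = [(0.0, 0.0), (1.0, 0.0)]
family_b = [(0.0, 0.0), (4.0, 0.0)]
print(hausdorff(family_a, family_b))
```

Here every point of `family_a` lies within distance 1 of `family_b`, but the point (4, 0) is at distance 3 from `family_a`, so the Hausdorff distance is 3. Applying the function to every pair of families gives the distance matrix that a tree-building method takes as input.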

3. Results

3.1. The classification performance

Our objective is to effectively classify the viruses into their respective families by determining an optimal metric. To achieve this, we calculate the optimal weight for each k-mer and each moment order based on the training set. Subsequently, we apply the metric, which is induced by the optimal weight, to the testing set.

The optimization of the weight is achieved through the smooth approximation of the accuracy of the classification. During the training process, consisting of 193 iterations, we observe the effectiveness of the approximation. The maximum difference between the training accuracy and its approximation during the training process is found to be only 0.005%. This result eliminates the necessity to differentiate between these two concepts in subsequent discussions.

The progression of training and testing accuracy during the training process is depicted in Fig. 1. Notably, the training accuracy exhibits a remarkable increase, starting from 55.10% and reaching 91.52%. Similarly, the testing accuracy also shows a significant improvement, rising from 60.12% to 92.73%. The rapid and substantial growth in accuracy is evident. The simultaneous increase of both indicates that the knowledge gained about weight selection from the training set is generalizable rather than a result of overfitting. It is noteworthy that the testing accuracy outperforms the training accuracy due to the implementation of a leave-one-out strategy in this paper. This strategy entails predicting the outcomes for the testing set using all other sequences in both the training and testing sets, mimicking real-world scenarios. Conversely, predictions for the training set solely rely on other sequences within the training set to prevent data leakage.

We compare this classification accuracy with other methods. To begin with, we compare it with methods based on NV, but with manually determined weights. These methods do not differentiate weights based on different orders but only distinguish between weights for different k-mers. In the notation used in this paper, we can view the corresponding metric as

$$\mathrm{Dis}^{a} = \sum_{k=1}^{K} a_{k} \sqrt{\sum_{m=0}^{2} \mathrm{dis}_{km}^{2}}. \tag{11}$$

In previous studies, three types of manually determined weights are commonly used: $a_{k} = \frac{1}{2^{k}}$, $a_{k} = \frac{1}{k^{2}}$, and $a_{k} = 1_{\{k = K\}}$. We refer to methods using these three weights as NV1, NV2, and NV3, respectively. We evaluated the accuracy of these three methods on the testing set, varying K from 1 to 9. The best result was achieved by the 9-mer NV1 method, with an accuracy of 87.54% (see Fig. 2). Notably, the optimal metric significantly outperforms manually determined weights.
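The three manual weighting schemes and the metric of Eq. (11) can be written out as follows. This is a sketch with hypothetical function names and toy distance values; `dis_pair[k-1]` is assumed to hold the three per-order distances $(\mathrm{dis}_{k0}, \mathrm{dis}_{k1}, \mathrm{dis}_{k2})$ for a single pair of sequences.

```python
import math


def manual_weights(K, scheme):
    """Return [a_1, ..., a_K] for the NV1, NV2, or NV3 scheme."""
    if scheme == "NV1":
        return [1.0 / 2 ** k for k in range(1, K + 1)]
    if scheme == "NV2":
        return [1.0 / k ** 2 for k in range(1, K + 1)]
    if scheme == "NV3":
        return [1.0 if k == K else 0.0 for k in range(1, K + 1)]
    raise ValueError(f"unknown scheme: {scheme}")


def manual_distance(dis_pair, a):
    """Eq. (11): sum over k of a_k * sqrt(sum over m of dis_km ** 2)."""
    return sum(a_k * math.sqrt(sum(x * x for x in per_order))
               for a_k, per_order in zip(a, dis_pair))


# Toy per-order distances for K = 3 (rows are k = 1, 2, 3; columns are m = 0, 1, 2).
dis_pair = [(3.0, 4.0, 0.0), (0.0, 0.0, 2.0), (1.0, 2.0, 2.0)]
for scheme in ("NV1", "NV2", "NV3"):
    print(scheme, manual_distance(dis_pair, manual_weights(3, scheme)))
```

In Eq. (11) the orders are merged inside the square root, so a single weight per k-mer length remains. The optimal metric of Eq. (4) differs in that it assigns a separate weight to every (k, m) pair.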

Fig. 2 — Comparison of classification accuracy between our method and six other alignment-free methods.

Then, we compare our method with three other alignment-free algorithms based on different theories. The first method, denoted as the Markov method, originates from [11]. It is based on a Markov model and corrects random background information present in K-mers using information from $(K - 1)$-mers and $(K - 2)$-mers ($K > 2$) from the perspective of probability. The second method, denoted as the Jensen method, is from [12] and uses Jensen-Shannon divergence, a concept in information theory, to measure the distance between feature frequency profiles. The third method, denoted as the Jaccard method [13], calculates distances based solely on the presence of features using the Jaccard distance. As before, we varied K from 1 to 9 and examined the accuracy of these three methods on the testing set. The best accuracy was achieved by the 3-mer Jensen method, with an accuracy of 87.85% (see Fig. 2). Our method outperforms the best of these methods by 4.88 percentage points.

3.2. Fine-tuning within the Baltimore classes

Previously, we have verified the reliability of the optimal metric for viral genomes. For specific task scenarios, we can further fine-tune the weights of this metric according to requirements to enhance its performance. For example, if we already know the Baltimore class to which a virus belongs and need to perform classification within that class, we would need to fine-tune the metric using the dataset specific to that class. The fine-tuning process utilizes the same training method employed to obtain the optimal weight for the entire dataset. However, in this case, the training set is restricted to a subset of the complete training set, and the process begins from the previously obtained optimal weight.

In this paper, we focus on fine-tuning the optimal weight for 7 Baltimore classes. The Baltimore classification system categorizes viruses into 7 classes based on the type of genome molecule and replication strategy [27], [28], [29]. These classes include double-stranded DNA viruses, single-stranded DNA viruses, double-stranded RNA viruses, positive-sense single-stranded RNA viruses, negative-sense single-stranded RNA viruses, single-stranded RNA reverse transcriptase viruses, and double-stranded DNA reverse transcriptase viruses. To construct new training and testing sets, we extract the sequences belonging to each specific Baltimore class from the original training and testing sets. Table 1 provides the corresponding sequence numbers for these subsets.

Table 1. The number of sequences from each Baltimore class.

Baltimore class	The training set	The testing set
I	4075	1040
II	1161	308
III	1114	282
IV	1452	349
V	1288	290
VI	68	17
VII	89	26

Table 2 presents the testing accuracy for each Baltimore class using both the initial optimal weight and the weight after fine-tuning for each class. For Baltimore classes VI and VII, the testing accuracy before fine-tuning is already 100%, indicating that retraining is unnecessary. For Baltimore classes I, II, III, and V, the testing accuracy before fine-tuning is satisfactory, but retraining still leads to improvements. For Baltimore class IV, however, fine-tuning results in over-fitting: the testing accuracy decreases as the training accuracy increases. In summary, our findings indicate that the classification performance in subsets is generally good even before fine-tuning, and that fine-tuning can further enhance the performance in many cases.

Table 2. The testing accuracy for each Baltimore class before and after fine-tuning.

Baltimore class	Before fine-tuning	After fine-tuning
I	94.03%	95.19%
II	96.90%	97.72%
III	97.12%	97.51%
IV	92.07%	90.83%
V	88.98%	90.00%
VI	100%	-
VII	100%	-

3.3. The optimal weight and the corresponding importance

Now we turn our attention to the optimal weights themselves. The weight $w_{km}$ considered in this study is a 27-dimensional vector assigned to the statistical moments for 1- to 9-mers and orders 0 to 2. Fig. 3 presents a visualization of the optimal weight before fine-tuning. Normalization is performed to ensure that its maximum value is 1. Each row represents the weight for a specific order, and each column represents the weight for a particular k-mer. The visualization reveals that the weights for order 1 and order 2 elements tend to 0 as k increases, while the weight for order 0 elements initially increases and then decreases as k grows. This optimal weight provides valuable insights into integrating various statistical information within a biological sequence.

Fig. 3 — The optimal weight for each order and k-mer.

However, it is important to note that a high weight assigned to an element does not necessarily indicate its significant role in the classification. For instance, if two elements have similar weights but their corresponding moments have significantly different magnitudes, their importance in the classification task will not be equal. To address this, we introduce the concept of element importance. Let $w_{kj}$ denote the weight for k-mer and order j, and $E[\mathrm{dis}_{kj}]$ represent the mean distance for k-mer and order j. The importance $I_{kj}$ of each element is defined as the product of its weight and mean distance, i.e., $I_{kj} = w_{kj} \times E[\mathrm{dis}_{kj}]$. We utilize the concept of importance to further investigate the significance of each element. Fig. 4 illustrates that the order 1 elements with large values of k, particularly 6- to 9-mers, hold great importance. This observation provides an explanation for the notable improvement in classification results with the inclusion of higher k-mers in previous studies [17].
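The distinction between weight and importance can be seen in a two-element example. The numbers below are hypothetical and chosen only for illustration; they are not values from the paper.

```python
def importance(w, mean_dis):
    """I_kj = w_kj * E[dis_kj] for every (k, j) key."""
    return {key: w[key] * mean_dis[key] for key in w}


# Hypothetical values: equal weights, very different mean distances.
w = {(1, 0): 0.5, (9, 1): 0.5}
mean_dis = {(1, 0): 2.0, (9, 1): 200.0}

I = importance(w, mean_dis)
print(I)
```

Both elements carry the same weight, yet the second contributes one hundred times more to a typical weighted distance, so it dominates nearest-neighbor decisions.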

Fig. 4 — The importance corresponding to the optimal weight.

We also provide visualizations of the weight and the corresponding importance after fine-tuning in Fig. A.7, Fig. A.8, Fig. A.9, Fig. A.10, Fig. A.11, Fig. A.12, Fig. A.13, Fig. A.14. We can observe that the weight undergoes only slight changes, whereas the corresponding importance exhibits more significant variations. Furthermore, we find that the previous observations hold true after fine-tuning. The weight pattern remains consistent with what is shown in Fig. 3, with elements having high k values in order 1 continuing to exhibit significant importance. The stability of these features indicates that the optimal metric is not merely a result specific to a particular dataset but possesses a degree of generality.

Fig. A.9 — The optimal weight after fine-tuning (Baltimore class II).

Fig. A.10 — The importance after fine-tuning (Baltimore class II).

Fig. A.11 — The optimal weight after fine-tuning (Baltimore class III).

Fig. A.12 — The importance after fine-tuning (Baltimore class III).

Fig. A.13 — The optimal weight after fine-tuning (Baltimore class V).

Fig. A.14 — The importance after fine-tuning (Baltimore class V).

3.4. The phylogenetic analysis for each Baltimore class

The optimal metric we have obtained provides a novel approach to constructing phylogenetic trees. Instead of relying on the identification and alignment of a common gene, which is challenging to find for viruses, we can directly utilize the entire genomes and use statistical moments to define the distance. Once a suitable metric for the genome is determined, we can extend it to define a metric for virus families using the Hausdorff distance. This metric enables us to perform phylogenetic analysis. In our study, we employ the BioNJ method [23] to construct phylogenetic trees for each Baltimore class.

Fig. 5 illustrates the phylogenetic tree for Baltimore class II, generated using the optimal weight after fine-tuning. Additional phylogenetic trees for other Baltimore classes are shown in Fig. A.15, Fig. A.16, Fig. A.17, Fig. A.18. It is worth noting that for class VI and class VII, there are insufficient families to construct trees. As for class IV, the weight before fine-tuning is applied due to the over-fitting issue.

Fig. A.15 — The phylogenetic tree for Baltimore class I based on the optimal weight after fine-tuning.

Fig. A.16 — The phylogenetic tree for Baltimore class III based on the optimal weight after fine-tuning.

Fig. A.17 — The phylogenetic tree for Baltimore class IV based on the optimal weight before fine-tuning.

Fig. A.18 — The phylogenetic tree for Baltimore class V based on the optimal weight after fine-tuning.

We compared our phylogenetic results with those of the previous study [30]. At the phylum level, the previous study emphasized the similarities between Cossaviricota and Cressdnaviricota, which aligns with our findings. At the family level, most families within the same phylum exhibit close relationships in this tree. However, there are a few exceptions; for instance, two families within Hofneiviricota did not cluster together in a single branch. Similar phenomena are observed in other Baltimore classes. Overall, while our method can provide phylogenetic results of reference value, it does not ensure a comprehensive depiction of relationships between families. This limitation may stem from our optimal weight being trained on predicting family labels, which might not fully capture the relationship between families.

3.5. The impact of different taxonomic standards

Our classification of sequences is based on taxonomic standards determined manually. Different standards yield different family classifications for sequences, thus affecting the classification accuracy. In our previous analysis, we utilized taxonomic standards as of June 30, 2022, to maintain time consistency between the data and its annotations. To further illustrate the effectiveness of our method, we validate its performance under alternative taxonomic standards. We update the annotations to adhere to standards as of March 7, 2024 and employ the same data cleaning process. This results in 11,422 sequences from 148 families, which can also be found in https://github.com/BobYHY/OptimalMetric. (The variation in the number of families results from the splitting or restructuring of certain previous families into smaller units, while the alteration in the number of sequences arises from the cleaning process.)

We repeat the testing accuracy calculations of Section 3.1. The testing accuracy achieved using the optimal metric was 77.7%. In comparison, the highest accuracies achieved by the NV1, NV2, NV3, Markov, Jensen, and Jaccard methods over the different K-values were 74.3%, 74.0%, 69.6%, 58.6%, 72.6%, and 60.6%, respectively (see Fig. 6).

Fig. 6 — Comparison of classification accuracy in the latest taxonomic standard.

We can derive two insights from this comparison. First, our method remains superior to others under the new taxonomic standards, regardless of whether based on natural vectors or other approaches. This underscores both the effectiveness of our method and its adaptability to different standards. Second, the comparison between different taxonomic standards reveals a lower accuracy under the new standards compared to the previous ones. This could be attributed to several factors: Firstly, the increased number of families under the new taxonomic standards makes classification more challenging. Secondly, the old taxonomic standards align with the time of data acquisition, suggesting the need to incorporate all the latest data for improved classification under the new standards. Lastly, there may be issues with the setup of the new taxonomic standards that require improvement.

4. Discussion

In this paper, we started with the idea of maximizing classification capability and utilized the weight training approach to obtain the optimal alignment-free algorithm based on statistical moments. Then, we analyzed all viral reference sequences and calculated the optimal metric for viral genome space as Formula (12). (Please refer to Fig. 3 for the specific weights.) We validated its excellent classification performance.

$$\mathrm{Dis}^{w} = \sum_{k=1}^{9} \sum_{m=0}^{2} w_{km}\, \mathrm{dis}_{km} = 3.5 \times 10^{-2}\, \mathrm{dis}_{1,0} + \cdots + 5.1 \times 10^{-6}\, \mathrm{dis}_{9,2}. \tag{12}$$

In future applications of this method, there is no longer a need to retrain the optimal metric. Instead, the weights calculated above can be directly applied. For instance, when given an unknown viral sequence, we can utilize these weights to calculate its distance from other sequences in the database directly, thus determining its most likely family membership.

Our method has two major advantages compared to mainstream alignment algorithms. Firstly, being alignment-free, this method provides a substantial increase in speed. For a set of N sequences, each of length $O(L)$, the time complexity of performing multiple sequence alignment (MSA) is $O(L^{N})$, while the time complexity of performing pairwise alignments for all sequences is $O(N^{2} L^{2})$. With our method, by contrast, computing k-mer natural vectors and generating the distance matrix has a time complexity of $O(N L + N^{2} 4^{k})$. Even when $k = 9$, this complexity is significantly lower than that of using alignment methods on viral datasets. (In our dataset, the average sequence length is 34,872.) Secondly, in sequence comparison, we do not depend on conserved segments. Instead, we compare statistical patterns from a higher perspective. This allows us to offer high-quality analysis for sequences, such as viral sequences, that do not contain conserved regions.
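A back-of-the-envelope comparison of the two complexity expressions, using the dataset size and average length quoted in this paper, gives a sense of scale. The calculation below ignores all constant factors hidden by the O-notation, so the ratio is only an order-of-magnitude indication and not a measured speed-up.

```python
N = 11_559   # sequences retained after data cleaning (Section 2.1)
L = 34_872   # average sequence length quoted in the text
k = 9        # largest k-mer length used

pairwise_alignment = N ** 2 * L ** 2          # O(N^2 L^2) term
natural_vector = N * L + N ** 2 * 4 ** k      # O(N L + N^2 4^k) term

ratio = pairwise_alignment / natural_vector
print(f"4^k = {4 ** k}, L^2 = {L ** 2}, ratio = {ratio:.0f}")
```

Because the $N^{2} 4^{k}$ term dominates $N L$ here, the ratio is essentially $L^{2} / 4^{k}$, which is in the thousands for these values.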

In comparison to other alignment-free methods, our algorithm also exhibits three major advantages. Firstly, previous alignment-free methods extracted valuable features, but the systematic integration of these features has not been thoroughly studied. Our algorithm provides a method for integrating information from an optimization perspective, offering new insights for subsequent research on alignment-free methods. Secondly, our method can self-adjust weights based on different datasets, reducing the impact of sequence type differences to some extent. Thirdly, experimental results demonstrate that our algorithm's classification performance is significantly superior to previous algorithms.

Finally, from a geometric perspective, the optimal metric itself holds intrinsic significance. Past research on the geometry of genome space based on NV methods has extracted a series of geometric principles including the convex hull principle. However, the study of the metric itself has remained limited to empirical approaches. The metric serves as the foundation for the geometric structure. The optimal metric we have extracted sheds light on the manifold structure of the genome space to some extent.

Our method still has limitations that require further investigation. First, the objective function used for training is non-convex, so neither the uniqueness of the optimal solution nor convergence to the global optimum can be guaranteed. While the practical value of these locally optimal weights has been demonstrated, determining the optimal solution uniquely in subsequent studies would further strengthen the geometric interpretation of the optimal metric. Second, our method is only suitable for relatively complete sequences. If the collected data consist only of small fragments, as in metagenomics, further research is needed to determine how our method can be used to identify the types of these fragments.
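The first limitation can be illustrated with a toy example. The sketch below runs plain gradient descent on a one-dimensional non-convex function from two starting points and reaches two different local minima. The function, learning rate, and step count are arbitrary choices for illustration and are unrelated to the objective function and optimizer used in this paper.

```python
def f(w):
    """Toy non-convex objective with two local minima (illustrative only)."""
    return (w ** 2 - 1) ** 2 + 0.3 * w

def grad_f(w):
    """Analytic derivative of f."""
    return 4 * w * (w ** 2 - 1) + 0.3

def gradient_descent(w0, lr=0.01, steps=2000):
    """Plain gradient descent from the starting point w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad_f(w)
    return w

# Two different initializations converge to two different minima.
w_left = gradient_descent(-2.0)
w_right = gradient_descent(2.0)

print(f"start -2.0 -> w = {w_left:.3f}, f(w) = {f(w_left):.3f}")
print(f"start +2.0 -> w = {w_right:.3f}, f(w) = {f(w_right):.3f}")
```

The two runs end at different points with different objective values, so the result of a gradient-based search on a non-convex objective can depend on its initialization.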

Funding

This research was funded by the National Natural Science Foundation of China (NSFC) grant (12171275) and the Tsinghua University Education Foundation fund (042202008).

Declaration of generative AI and AI-assisted technologies in the writing process

During the preparation of this work, the authors used ChatGPT in order to improve language. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

CRediT authorship contribution statement

Hongyu Yu: Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing – original draft. Stephen S.-T. Yau: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – review & editing.

Declaration of Competing Interest

None.

Footnotes

Appendix B. Supplementary material related to this article can be found online at https://doi.org/10.1016/j.csbj.2024.05.005.

Appendix A.

Appendix B. Supplementary material

The following is the Supplementary material related to this article.

MMC 1

The GenBank IDs corresponding to the sequences utilized in this paper and their respective classifications.

mmc1.csv (306.2 KB, csv)

MMC 2

The code implementation of Algorithm 1. Other code, such as the plotting scripts, can be found at https://github.com/BobYHY/OptimalMetric.

mmc2.zip (1.5 KB, zip)

Data availability

The data and code that support the findings of this study are available at https://github.com/BobYHY/OptimalMetric.

References

1. Needleman S.B., Wunsch C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–453. doi: 10.1016/0022-2836(70)90057-4.
2. Smith T.F., Waterman M.S. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–197. doi: 10.1016/0022-2836(81)90087-5.
3. Edgar R.C. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.
4. Higgins D.G., Sharp P.M. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene. 1988;73:237–244. doi: 10.1016/0378-1119(88)90330-7.
5. Cover T.M., Hart P.E. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13:21–27.
6. Hartigan J.A., Wong M.A. 1979. A k-means clustering algorithm.
7. DARPA Broad agency announcement (BAA 07-68) for Defense Sciences Office (DSO) 2008. http://www.math.utk.edu/~vasili/refs/darpa07.MathChallenges.html
8. Zielezinski A., Vinga S., Almeida J.S., Karłowski W.M. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 2017;18 doi: 10.1186/s13059-017-1319-7.
9. Bonham-Carter O., Steele J., Bastola D.R. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform. 2014;15:890–905. doi: 10.1093/bib/bbt052.
10. Lu Y.Y., Tang K., Ren J., Fuhrman J.A., Waterman M.S., Sun F. CAFE: aCcelerated Alignment-FrEe sequence analysis. Nucleic Acids Res. 2017;45:W554–W559. doi: 10.1093/nar/gkx351.
11. Qi J., Wang B., Hao B. Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J Mol Evol. 2003;58:1–11. doi: 10.1007/s00239-003-2493-7.
12. Jun S.R., Sims G.E., Wu G.A., Kim S.H. Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution. Proc Natl Acad Sci USA. 2010;107:133–138. doi: 10.1073/pnas.0913033107.
13. Levandowsky M., Winter D.K. Distance between sets. Nature. 1971;234:34–35.
14. Deng M., Yu C., Liang Q., He R.L., Yau S.S.T. A novel method of characterizing genetic sequences: genome space with biological distance and applications. PLoS ONE. 2011;6 doi: 10.1371/journal.pone.0017293.
15. Wen J., Chan R.H., Yau S.C., He R.L., Yau S.S.T. K-mer natural vector and its application to the phylogenetic analysis of genetic sequences. Gene. 2014;546:25–34. doi: 10.1016/j.gene.2014.05.043.
16. Zhao X., Tian K., He R.L., Yau S.S.T. Convex hull principle for classification and phylogeny of eukaryotic proteins. Genomics. 2018;111:1777–1784. doi: 10.1016/j.ygeno.2018.11.033.
17. Sun N., Pei S., He L., Yin C., He R.L., Yau S.S.T. Geometric construction of viral genome space and its applications. Comput Struct Biotechnol J. 2021;19:4226–4234. doi: 10.1016/j.csbj.2021.07.028.
18. Tian K., Zhao X., Yau S.S.T. Convex hull analysis of evolutionary and phylogenetic relationships between biological groups. J Theor Biol. 2018;456:34–40. doi: 10.1016/j.jtbi.2018.07.035.
19. Harris H.M.B., Hill C. A place for viruses on the tree of life. Front Microbiol. 2021;11 doi: 10.3389/fmicb.2020.604048.
20. Kingma D., Ba J. International conference on learning representations. 2014. Adam: a method for stochastic optimization.
21. Paszke A., Gross S., Massa F., Lerer A., Bradbury J., et al. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst. 2019;32
22. Huang H.H., Yu C., Zheng H., Hernandez T., Yau S.S.T., He R., et al. Global comparison of multiple-segmented viruses in 12-dimensional genome space. Mol Phylogenet Evol. 2014;81 doi: 10.1016/j.ympev.2014.08.003.
23. Gascuel O. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol. 1997;14:685–695. doi: 10.1093/oxfordjournals.molbev.a025808.
24. Saitou N., Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987;4:406–425. doi: 10.1093/oxfordjournals.molbev.a040454.
25. Lefort V., Desper R., Gascuel O. FastME 2.0: a comprehensive, accurate, and fast distance-based phylogeny inference program. Mol Biol Evol. 2015;32:2798–2800. doi: 10.1093/molbev/msv150.
26. Letunić I., Bork P. Interactive tree of life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res. 2021;49:W293–W296. doi: 10.1093/nar/gkab301.
27. Baltimore D. Expression of animal virus genomes. Bacteriol Rev. 1971;35:235–241. doi: 10.1128/br.35.3.235-241.1971.
28. Baltimore D. Viral genetic systems. Trans N Y Acad Sci. 1971;33:327–332. doi: 10.1111/j.2164-0947.1971.tb02600.x.
29. Baltimore D. The strategy of RNA viruses. Harvey Lect. 1974;70:57–74.
30. Koonin E., Kuhn J., Dolja V., Krupovic M. Megataxonomy and global ecology of the virosphere. ISME J. 2024;18 doi: 10.1093/ismejo/wrad042.
