Skip to main content
Briefings in Bioinformatics logoLink to Briefings in Bioinformatics
. 2025 Nov 24;26(6):bbaf626. doi: 10.1093/bib/bbaf626

From genomic signals to prediction tools: a critical feature analysis and rigorous benchmark for phage–host prediction

Jiayu Shang 1, Cheng Peng 2, Jiaojiao Guan 3, Dehan Cai 4, Donglin Wang 5, Yanni Sun 6,
PMCID: PMC12641615  PMID: 41283812

Abstract

Accurate prediction of virus–host interactions is critical for understanding viral ecology and developing applications like phage therapy. However, the growing number of computational tools has created a complex landscape, making direct performance comparison challenging due to inconsistent benchmarks and varying usability. Here, we provide a systematic review and a rigorous benchmark of 27 virus–host prediction tools. We formulate the host prediction task into two primary frameworks—link prediction and multi-class classification—and construct two benchmark datasets to evaluate tool performance in distinct scenarios: a database-centric dataset (RefSeq-VHDB) and a metagenomic discovery dataset (MetaHiC-VHDB). Our results reveal that no single tool is universally optimal. Performance is highly context-dependent, with tools like CHERRY and iPHoP demonstrating robust, broad applicability, while others, such as RaFAH and PHIST, excel in specific contexts. We further identify a critical trade-off between predictive accuracy, prediction rate, and computational cost. This work serves as a practical guide for researchers and establishes a standardized benchmark to drive future innovation in deciphering complex virus–host interactions.

Keywords: bacteriophages, metagenomics, host prediction, prophages

Graphical Abstract

Graphical Abstract.

Graphical Abstract

Introduction

Viruses are obligate intracellular parasites that require a host cell to replicate. A significant subset of viruses, known as bacteriophages or phages, exclusively infect and replicate within prokaryotic hosts, including both bacteria and archaea [1]. Phages represent the vast majority of all known viruses and are the most abundant biological entities on Earth [2], with an estimated population exceeding Inline graphic particles—a number greater than all other organisms combined [3]. This extraordinary abundance is matched by a remarkable diversity in their virion morphologies and genomic structure [4]. While tailed phages with double-stranded DNA (dsDNA) genomes are the most frequently studied, phages can also possess double-stranded RNA (dsRNA), single-stranded DNA (ssDNA), or single-stranded RNA (ssRNA) genomes [5]. Their genome sizes are also remarkably varied, ranging from a few kilobase pairs (kbp) to those of megaphages, which can exceed 500 kbp [6]. Phages are found in all ecosystems colonized by prokaryotes, including aquatic environments [7], terrestrial soils [8], and the human microbiome [9]. Within these microbial communities, phages function as key ecological and evolutionary drivers by shaping community structure, promoting host co-evolution, and mediating horizontal gene transfer [1, 10].

The relationship between a phage and a bacterium is not a simple predator-prey dynamic but rather a spectrum of possible interactions with distinct ecological consequences. A schematic representation of the viral infection is shown in Fig. 1. The process initiates with adsorption, where a phage recognizes and binds to specific receptors on a bacterial cell [11, 12]. This step is a critical checkpoint, but interactions are not limited to viable hosts; recent work shows that phages can leverage the flagellum of non-host bacteria as a means of transport [13, 14], enhancing their dispersal. Following successful injection of its genome, the phage must overcome the host’s defense systems. Bacteria employ numerous anti-phage systems, including the well-studied CRISPR-Cas, restriction-modification, and methylation-associated defense systems, which provide an adaptive immune memory to fight off repeat infections [15]. If the phage survives, the outcome is dictated by its lifestyle and reproductive strategy. Virulent phages engage in a lytic cycle, which involves hijacking host resources for viral replication and concludes with the destruction of the host cell to release progeny [16]. In contrast, temperate phages may adopt a lysogenic lifestyle, integrating their DNA into the host genome as a prophage. In this state, the phage remains latent and is passively propagated with the host cell, a relationship that can persist until stress or other signals trigger its excision and entry into the lytic cycle [17].

Figure 1.

Schematic diagram illustrating the life cycle of a virus infecting a prokaryotic host, shown in five numbered stages. Stage 1 shows a virus particle either free-floating or attached to a non-host cell. Stage 2 depicts the virus attached to a host cell and injecting its DNA. Stage 3 illustrates the host’s defense, degrading the viral DNA. The process then branches: Stage 4, the lysogenic cycle, shows the viral DNA integrated into the host’s chromosome. Stage 5, the lytic cycle, shows the host cell filled with new virus particles, which then burst out as the cell lyses.

A schematic representation of the viral infection. (1) transport, where the virus attaches to a non-host prokaryote or diffuses freely. (2) infection, where the viral tail fiber and tail spike attach to the host and viral DNA is injected into the host cell. (3) The host can respond through its prokaryotic defense system, including CRISPR-Cas mechanisms that create novel spacers for immunity and degrade viral DNA. If the virus infect successfully, the virus may enter the (4) lysogenic cycle, integrating as prophages into the host chromosome, or (5) the lytic cycle, producing viral particles and lysing the host cell.

The destructive potential of the lytic cycle, a key outcome of phage-bacteria interactions, provides the foundation for several critical biotechnological applications. The most prominent of these is phage therapy [18], which is re-emerging as a compelling alternative to conventional antibiotics, especially in the face of rising antimicrobial resistance. Because phages are highly specific to their bacterial targets, some lytic phages can eliminate pathogens with minimal disruption to the patient’s beneficial microbiota, a significant advantage over broad-spectrum antibiotics. This precision has shown promise in treating challenging infections, including those caused by multidrug-resistant bacteria like Staphylococcus aureus [19] and Pseudomonas aeruginosa [20]. Beyond clinical medicine, phages are increasingly used as biocontrol agents in the food industry. Applied to food products or processing surfaces, they can selectively remove contaminants such as Listeria and Salmonella, thereby enhancing food safety and extending shelf life without affecting the food’s quality [21].

The success of these applications critically depends on accurately identifying which phages can infect and eliminate specific bacteria. However, experimental identification of these virus–bacteria interactions is challenging. Standard isolation methods are labor-intensive and, more critically, are limited because phage isolation and cultivation necessitate prior knowledge and culturing of the bacterial host [2]. The rapid advancement of high-throughput sequencing has led to an exponential growth in viral sequence data, with metagenomic studies uncovering millions of previously unknown viral genomes. Consequently, the field has increasingly turned to computational approaches to predict phage–host interactions directly from sequence data. Early methods usually relied on identifying virus–host interaction signals like integrated prophages and CRISPR spacers or signatures such as similarities in Inline graphic-mer frequencies or codon usage. While these methods can obtain some reliable results, no single approach is universally effective, as each feature has inherent limitations. For example, alignment-based methods require sufficiently close relatives in databases to make a match, and CRISPR-based predictions are only possible for hosts that have previously recorded a specific phage infection. These limitations have motivated the development of more sophisticated learning models that integrate genomic features to improve predictive accuracy. However, the performance of these models is constrained by three significant and interconnected challenges.

First, a substantial annotation gap exists within the vast amount of sequence data. The number of viral sequences in public databases has grown exponentially, with recent estimates showing a nearly five-fold increase in phage-like assemblies in the GenBank database over the last decade (16,232 in 2015 versus 78,002 in 2025). However, a large fraction of these sequences represent viral “dark matter” with no known host. Even within the more stringently curated RefSeq database, only around 87% (4,698/5,371 in 2025) viruses have precise host annotation. Furthermore, such annotations lack the required granularity, providing neither the specific host genome nor details about the nature of the virus–host relationship. As illustrated in Fig. 1, a phage–host interaction is a multi-stage process that includes initial binding (both host or non-host), DNA injection, and either a productive infection (lytic or lysogenic cycle) or failure due to prokaryotic defense systems. A simple taxonomy name of the “host” does not distinguish between a phage that can only attach to a cell’s surface and one that can successfully replicate within it. Consequently, a model trained on such data may learn to predict successful binding rather than a productive infection, leading to functionally misleading results. Second, the available host annotation data is heavily skewed by historical research biases. A disproportionate number of phages in our databases are linked to a small set of well-studied model organisms, such as Escherichia coli, Salmonella enterica, and Mycolicibacterium smegmatis. This creates a long-tail distribution, where a few bacterial species are associated with thousands of phages, while the vast majority of bacteria have few or no known viral predators in the databases. Such imbalance can introduce significant bias into models, leading to prediction tools that perform well for common hosts but fail to generalize to the broader, more sparsely populated bacterial domain. Finally, the inherent complexity of host range itself presents a major hurdle. Database annotations often imply a simple one-to-one relationship between a phage and a single host species. This administrative simplification cannot reflect the biological reality. Recent studies show that many phages have a relatively broad host range, capable of infecting multiple species or even crossing genus boundaries [22, 23]. For instance, some phages are known to infect multiple distinct species within the Enterobacteriaceae family. This broad-host-range behavior is difficult to capture and predict with models trained on simplified, single-host labels, complicating the design of truly comprehensive prediction tools. These combined challenges highlight the difficulty of host prediction and underscore the need for robust computational approaches to navigate this complex data.

The urgent need for virus–host prediction methods has spurred the development of a multitude of computational tools. This rapid growth has created a complex and often confusing landscape where most tools are evaluated on disparate datasets, making direct performance comparisons difficult. While previous review [24] provided a valuable initial assessment, its evaluation was limited to a small subset of accessible tools and relied on a benchmark using only three groups of bacteriophages. Here, we address these gaps by establishing a clear problem framework, providing a systematic analysis of the biological features used by 27 existing tools, and conducting a rigorous benchmark designed to serve as a new standard for the field.

By providing a rigorous, comparative analysis, this review serves two critical functions. First, it acts as a practical guide for researchers selecting the optimal tool for their work. Second, it establishes a foundational resource and a new performance benchmark intended to drive the next wave of innovation in deciphering the complex web of virus–host interactions.

Our key contributions are:

  • A standardized problem formulation: We establish a structured framework for the virus–host prediction problem, providing a consistent foundation for understanding and comparing diverse computational strategies.

  • A comprehensive survey of tools and biological features: We critically evaluate 27 existing tools, examining their methodologies, strengths, and weaknesses. We then survey the full spectrum of biological features they employed, from CRISPR spacer matching and prophage detection to alignment-free Inline graphic-mer frequency analysis.

  • Rigorous benchmarking using carefully designed datasets: Another key novelty is the development and application of two distinct evaluation benchmarks. RefSeq-VHDB provides a curated set of phage–host pairs for standardized assessment, while MetaHiC-VHDB consists of three independent metagenomic Hi-C test sets designed to assess tool performance in realistic ecological contexts. These benchmarks provide a practical guide for researchers and expose performance gaps that future methods must address.

Results

Problem formulation of the virus–host prediction

The structure of publicly available databases, which predominantly document one-to-one virus–host interactions, has directly affected the development of computational tools. Consequently, most methods are designed to predict a single, highest-ranking host for a given virus. This has led to two primary problem formulations for host prediction (Fig. 2). Consider a microbial community containing Inline graphic viruses (Inline graphic) and Inline graphic prokaryotes (Inline graphic). The first and more intuitive formulation frames the task as a link prediction problem on all potential Inline graphic interactions. The objective is to learn a scoring function Inline graphic (Equation 1).

Figure 2.

A comparative diagram of two problem formulations, F1 and F2. On the left, “F1: Link prediction” shows a pool of viruses and a pool of hosts. Many-to-many faint lines indicate all possible pairings, with one bold line connecting a specific virus to a specific host, representing a de novo prediction. On the right, “F2: multi-class classification” shows a pool of viruses being sorted by an algorithm into separate, labeled boxes for “Host A,” “Host B,” and “Host C,” representing assignment to a known class.

Two primary problem formulations for virus–host prediction task. Inline graphic: Link prediction. Inline graphic: multi-class classification. Based on the input. If a tool uses both viral and microbial sequences as inputs, it can formulate as Inline graphic. To make a prediction for a new virus, it evaluates all potential host candidates to identify the most likely interacting pair. The primary strength of this approach is its generality, making it suitable for discovering novel interactions (de novo discovery) as it is not limited to a known set of hosts. However, it is computationally intensive and must overcome the challenge of extreme class imbalance, where non-interacting pairs vastly outnumber true interactions. If a tool only relies on viruses for host prediction, it usually formulates Inline graphic, aims to separate viruses based on their known hosts, effectively generating a boundary for each class. This method is computationally efficient and precise for hosts represented in the training data. Its main weakness is the closed-set assumption, which prevents it from predicting hosts from taxa that were not included in the training set.

graphic file with name DmEquation1.gif (1)

where Inline graphic is the predicted probability that the pair Inline graphic represents a true biological infection.

An alternative formulation simplifies the problem by reframing it as multi-class classification. Instead of predicting individual links, this approach groups the Inline graphic prokaryotes into Inline graphic distinct taxonomic ranks (e.g. genus or family). The goal is then to learn a classifier Inline graphic that maps each virus to a host taxon (Equation 2).

graphic file with name DmEquation2.gif (2)

This formulation operates on the hypothesis that viruses with similar genomic features infect hosts of the same or similar taxa. Consequently, Inline graphic propagate taxonomic labels from a reference set of viruses with known hosts to novel viral sequences. While both formulations aim to solve the same fundamental problem, they differ significantly in their data requirements, underlying features, and the nature of their predictions, each presenting distinct advantages and disadvantages. A basic comparison is listed in Table 1 and detailed explanation can be found in Supplementary Note 1.

Table 1.

Comparison of link prediction and multi-class classification formulations for virus host prediction

Characteristic Link Prediction Formulation Multi-class Classification Formulation
Primary Goal Predicts whether a specific virus interacts with a specific host (binary decision per pair). Assigns a host taxonomic label to a given virus from a predefined set of taxa.
Data Challenge Extreme class imbalance; the number of negative (non-interacting) pairs vastly overwhelms positive pairs. Long-tail distribution of class labels; a few host taxa are heavily over-represented, while most are rare.
Feature Representation Primarily pairwise features representing direct evidence (e.g. sequence homology, CRISPR-spacer matches, prophages). Primarily virus-centric features creating a genomic fingerprint (e.g. marker genes, codon usages, protein organization).
Prediction Output An explicit, high-resolution pairing between a viral genome and a host genome. A single, lower-resolution taxonomic label (e.g. family or genus) for the host.
Key Strength Generality. Well-suited for de novo discovery of novel interactions, as it is not constrained by a known set of hosts. Precision & Efficiency. Computationally fast (one prediction per virus) and often highly accurate for hosts within databases.
Key Weakness High risk of false positives due to the massive number of potential interactions (number of viruses Inline graphic number of prokaryotes). Computationally intensive. Closed-set assumption; cannot predict hosts from taxa absent in the training data. Struggles with polyvalent viruses.

Current approaches for host prediction

We thoroughly reviewed 27 open-source computational methods for host prediction, which we group by their problem formulation: link prediction or multi-class classification. Our chronological analysis reveals a distinct technological evolution in the features and algorithms used. A brief summary of these tools is listed in Table 2. Early methods, such as WIsH and VirHostMatcher (VHM), primarily relied on alignment-free genomic signatures like Inline graphic-mer and oligonucleotide frequencies. The field then progressed toward integrating multiple biological signals, with tools like VirHostMatcher-Net (VHM-Net) and iPHoP combining CRISPR matches, sequence homology, and the outputs of other predictors into more robust frameworks. This trend was followed by a rapid adoption of deep learning, beginning with Convolutional Neural Networks (CNNs) that learned from sequence-based features (e.g. DeepHost, PHIAF). More recently, the research has advanced to sophisticated graph-based deep learning models (e.g. CHERRY, PHPGAT) that capture complex relationships within heterogeneous biological networks. The latest innovations leverage large protein language models to generate rich embeddings from viral proteins, particularly receptor-binding proteins, for highly targeted predictions (e.g. EvoMIL, PHIStruct). Regarding the problem formulation, tools based on link prediction require both viral and host sequences as input, whereas those based on multi-class classification rely only on viral sequences. Notably, although PHP, iPHoP, and CHERRY are designed using a link prediction framework, they incorporate internal, pre-compiled databases. This design enables them to generate predictions using only viral sequences as input. Despite this algorithmic progress, our practical assessment highlights a critical challenge: over half of the reviewed tools are not readily usable due to issues such as failed installations, hard-coded dependencies, or a lack of documentation, severely limiting their reproducibility and broader utility (see Supplementary Note 2).

Table 2.

A comparison of models/methods employed for host prediction of prokaryotic viruses. Inline graphic: link-prediction formulation. Inline graphic: multi-class classification formulation. The “Inputs” of a tool represent the kind of data required: “Virus” means the tools only required viral sequences as inputs; “Virus and MAGs” means both the viral sequences and a set of candidate host genomes or Metagenome-Assembled Genomes (MAGs) from the same sample are required. “Either” means the tool accepts inputs in either way. The “Prediction level” of a tool represents the lowest evaluated host taxonomy level reported in the origin paper. The “Accessibility” of a tool from an end-user standpoint was evaluated based on three key criteria: ease of installation, quality of documentation, and workflow automation. A designation of “Yes” indicates that the software met all three usability requirements. For tools marked as “No”, which failed to meet one or more of these criteria, detailed information can be found in the Supplementary Note 1

Name Formulation Method Feature Inputs Prediction level Accessibility Ref.
HostPhinder (2016) Inline graphic Homology analysis Inline graphic -mers Virus Species-level No [25]
WIsH (2017) Inline graphic Hidden markov model Inline graphic -mers Virus and MAGs Genus-level Yes [26]
VHM (2017) Inline graphic Homology analysis Inline graphic -mers Virus and MAGs Species-level Yes [27]
PHP (2021) Inline graphic Gaussian Mixture Models Inline graphic -mers Either Genus-level Yes [28]
HostG (2021) Inline graphic Graph convolutional network Integrated multiple features Virus Genus-level No [29]
RaFAH (2021) Inline graphic Random forest model Viral protein similarity Virus Genus-level Yes [30]
VPF-Class (2021) Inline graphic Homology analysis Viral protein similarity Virus Genus-level No [31]
PredPHI (2022) Inline graphic Convolutional neural network Viral protein similarity Virus and MAGs Genus-level No [32]
VHM-Net (2022) Inline graphic Hybrid model Integrated multiple features Either Species-level Yes [33]
PHIAF (2022) Inline graphic Convolutional neural network Inline graphic -mers Virus and MAGs Species-level No [34]
PHIST (2022) Inline graphic Homology analysis Inline graphic -mers Virus and MAGs Species-level Yes [35]
CHERRY (2022) Inline graphic Hybrid model Integrated multiple features Either Species-level Yes [36]
DeepHost (2022) Inline graphic Convolutional Neural Network Inline graphic -mers Virus Species-level Yes [37]
HoPhage (2022) Inline graphic Hybrid model Viral protein similarity Virus Genus-level No [38]
vHULK (2022) Inline graphic Deep neural network Viral protein similarity Virus Species-level Yes [39]
iPHoP (2023) Inline graphic Hybrid model Integrated multiple features Either Species-level Yes [40]
PhageTB (2023) Inline graphic Hybrid model Integrated multiple features Virus and MAGs Genus-level No [41]
PHPGCA (2023) Inline graphic Light Graph Convolutional Network Integrated multiple features Virus and MAGs Species-level No [42]
PHERI (2023) Inline graphic Decision tree Viral protein similarity Virus Genus-level Yes [43]
PHIEmbed (2023) Inline graphic Random forest model Viral marker protein Virus Genus-level No [44]
DeepPBI-KG (2024) Inline graphic Deep neural network Sequence similarity Virus and MAGs Species-level No [45]
VHIP (2024) Inline graphic Gradient boosting classifier Inline graphic -mers Virus and MAGs Species-level No [46]
PB-LKS (2024) Inline graphic Homology analysis Inline graphic -mers Virus and MAGs Genus-level Yes [47]
EvoMIL (2024) Inline graphic multiple instance learning model Viral protein similarity Virus Species-level No [48]
PHPGAT (2025) Inline graphic Graph Attention Network v2 Integrated multiple features Virus and MAGs Species-level No [49]
MI-RGC (2025) Inline graphic Regional graph convolution Inline graphic -mers Virus and MAGs Species-level No [50]
PHIStruct (2025) Inline graphic Protein language model Viral marker protein Virus Genus-level No [51]

Genomic features vary in their ability to predict hosts

The predictive power of computational models for host prediction hinges on the biological features extracted from viral genomic sequences. These features broadly fall into two categories: signals of potential virus–host interaction and signals of virus–virus similarity. Here, we systematically reviewed the utility of the most common genomic features, including CRISPR-Cas spacer matches, prophage detection, similarities in Inline graphic-mer frequencies, and viral DNA/protein sequence homology. To establish a labeled data for our analysis, we utilized a dataset of 4,698 viruses with species-level host annotations from the Viral RefSeq database. For the host genomic data, we downloaded all 110,988 available prokaryotic genomes GenBank. We then evaluated the effectiveness of each genomic feature in predicting viral hosts, thereby delineating their individual strengths and limitations.

CRISPR spacer match

CRISPR-Cas systems function as an adaptive immune system in prokaryotes. When a virus infects a prokaryote, the CRISPR system can capture a short fragment of the viral DNA (a “spacer”) and integrate it into a specific locus in the host’s genome. This integrated spacer then serves as a genetic memory, guiding CRISPR-associated (Cas) proteins to recognize and cleave the same viral DNA during subsequent infections. A match between a query viral sequence and a CRISPR spacer stored within a prokaryotic genome is therefore high-confidence evidence of a direct, historical interaction, making it a cornerstone of virus–host prediction [52, 53]. To evaluate the predictive utility of this feature, we aligned the genomes of 4,698 viruses against 2,005,489 CRISPR spacers identified from 110,988 prokaryotic genomes in GenBank (see Methods). Our analysis revealed that while nearly half of prokaryotic genomes contain at least one CRISPR array (48%, Supplementary Fig. S1), only a fraction yield viral matches. Overall, spacer alignments successfully linked 76% of the viruses to 12% of the host genomes in our dataset. We observed significant heterogeneity across prokaryotic phyla (Supplementary Fig. S1). For instance, while the phylum Thermodesulfobacteriota had the highest absolute count of CRISPR spacers, phyla such as Spirochaetota and Fusobacteriota showed a much higher proportion of genomes with a viral spacer match. This disconnect between spacer abundance and linkage rate likely reflects complex ecological dynamics, including varying levels of viral pressure or the differential reliance on CRISPR versus other defense mechanisms across lineages.

Then, we performed a grid search on percentage of identical matches (pid) and alignment coverage (defined as the alignment length divided by the spacer length), with both parameters starting at 90%. These thresholds are widely used in many research studies to find CRISPR spacer matches [52, 53]. The performance was quantified using four metrics (see Methods): Top-1 accuracy (accuracy on the best hit), Top-N accuracy (accuracy on all alignment), alignment proportion (proportion of aligned viruses), and correct match ratio (estimate the errors introduced by using Top-N accuracy). As shown in Fig. 3A, prediction accuracy is highly sensitive to the percentage of identical matches (pid) but less so to alignment coverage. This highlights a critical trade-off: while stringent thresholds (e.g. pid Inline graphic 98%) are required to achieve reliable predictions, this stringency significantly reduces the alignment proportion—the fraction of viruses for which any host prediction can be made. At an optimal threshold balancing accuracy and coverage, Top-1 accuracy reached a modest 57% at the species level but a much stronger 82% at the genus level.

Figure 3.

Two 3D surface plots, A and B, showing the performance of CRISPR and prophage search methods. Each plot shows Top-1 accuracy as a surface that varies with alignment parameters. A red dot on each surface indicates the parameters for the highest accuracy, and a blue dot indicates the lowest.

Surface plots to illustrate the utility of (A) CRISPR spacer match and (B) prophage homologous search. Top-1 accuracy: accuracy summarized on the best hit (the hit with the best alignment score). Performance is evaluated at the Species, Genus, and Family taxonomic levels, based on key alignment parameters. For CRISPRs, these parameters are percentage of identical matches (pid) and coverage, while for prophages, they are pid and alignment length. The lowest and highest values on each surface are highlighted with blue and red dots, respectively. The four metrics evaluated are: Top-N accuracy: accuracy summarized on the all the alignment hits. Alignment proportion: the fraction of viruses with at least one alignments. Correct match ratio: a proportion of the correct hits in the alignment results, estimating the errors introduced by using Top-N accuracy.

The higher genus-level accuracy is particularly insightful. A virus that has been successfully isolated and sequenced represents an instance where it evaded or overcame its immediate host’s CRISPR defenses. However, a different species within the same genus may have successfully defended against the same virus and integrated a spacer. This scenario would result in a correct genus-level prediction but an incorrect species-level one, explaining the observed discrepancy. Furthermore, our analysis shows that considering all possible alignments (Top-N accuracy) instead of just the best hit introduces significant ambiguity (a low correct match ratio) for only a marginal gain in overall accuracy. This confirms that CRISPR-based predictions, while highly reliable when found, are inherently sparse. Their strength lies in specificity, not broad coverage, necessitating the use of complementary features for comprehensive host prediction.

Prophage matches

The integration of a viral genome into a host chromosome as a prophage is clear evidence of a lysogenic relationship. This signal is typically identified by detecting prophage regions in microbial genomes or by direct DNA alignment between viral and microbial genomes. To illustrate the utility of this feature, we conducted the most commonly used BLASTN-based alignment search between viral and prokaryotic genomes (see Methods).

According to the alignment results, 59% of prokaryotic genomes and 51% of viruses are linked with at least one alignment (Supplementary Fig. S2). Meanwhile the order Top-10 phyla with the most BLASTN alignments is significantly different compared to the one in CRISPR spacer matches (Supplementary Fig. S2), demonstrating the natural bias introduced by these features. A plausible explanation is that the existence of prophages indicates infection, which the CRISPR system fails to prevent. Following a similar methodology to the CRISPR alignment experiment, we evaluated performance across a grid search of two widely used threshold combination: pid, starting from 90% and alignment length (starting from 100 bp). The BLASTN-based alignment reveals that prophage searches yield higher species-level accuracy than CRISPR analysis but suffer from a lower alignment proportion and more ambiguous hits at relaxed thresholds (Fig. 3B). A pid of 95% and an alignment length Inline graphic 500 bp provides a favorable balance for Top-1 accuracy and alignment proportion. However, when using all alignments (Top-N) to include more candidate predictions, a higher pid (e.g. Inline graphic 98%) should be considered to minimize false positives.

Consistency of CRISPR spacer match and prophage detection

As aforementioned, CRISPR spacer match and prophage detection introduce a significant bias on the phyla of prokaryotes (Supplementary Fig. S1 and Fig. S2). Here, we estimate the potential of using them as complementary for host prediction. First, we conduct a comparative analysis of the host predictions for a cohort of viruses that have annotations from both methodologies. Figure 4A illustrates that the consistency for viruses annotated by both methods was low at the species level (30.3%) but increased at higher taxonomic ranks (55.4% at genus, 70.2% at family). We further examined the overlap of correctly predicted virus–host pairs between the two methods. The UpSet plot (Fig. 4B) analysis confirms that the two methods identify largely distinct sets of virus–host interactions. At the species level, only 35 interactions were identified by both methods, whereas CRISPR and prophage searches uniquely identified 548 and 82, respectively. This minimal overlap persists across taxonomic ranks, suggesting that these features can be used in conjunction to maximize the discovery of virus–host linkages.

Figure 4.

A two-part figure comparing CRISPR and prophage predictions. Part A uses three bar charts to show that predictions are highly consistent between the two methods. Part B uses three UpSet plots to show the number of correct predictions that are unique to each method versus shared by both, at the species, genus, and family levels.

Complementarity of CRISPR-based and prophage-based virus–host predictions. (A) Consistency of host predictions for viruses identified by both methods. The charts show the percentage of predictions that are consistent (blue) versus inconsistent (yellow) at the species, genus, and family levels. (B) Overlap of correct virus–host predictions from CRISPR spacer analysis and prophage homology searching. The UpSet plots show the number of correct predictions unique to each method (single dots for prophage unique and CRISPRs unique) and the number shared by both (connected dots) across the three taxonomic ranks. The horizontal bars indicate the total number of correct predictions for each method.

Inline graphic -mer frequency similarity

Viruses frequently adapt their genomic composition, including aspects like codon usage, to resemble that of their hosts. This phenomenon, often termed a “genomic signature,” can be quantified using Inline graphic-mer frequency profiles. Consequently, host prediction can be performed by measuring the distance (e.g. cosine similarity, Euclidean distance) between the Inline graphic-mer frequency vectors of a virus and those of potential hosts.

To assess its discriminative power, we compared the Inline graphic-mer frequency distributions between known virus–host pairs and non-host pairs (see Methods). We chose Inline graphic because it is the most commonly used setting and offers a balance between informational content and computational cost. In the analysis, we defined two types of negative pairs to simulate different challenge levels. An example is illustrated in Supplementary Fig. S3. The “easy” negative case includes negative pairs where the prokaryotes are in different taxa from the given host, and the “hard” negative case has a constraint on the taxonomy rank of the chosen prokaryotes (e.g. prokaryotes do not belong to the same species but belong to the same genus as the host).

The results reveal that Inline graphic-mer profiles effectively distinguish viruses from distantly related non-hosts (the “easy” negative case), showing a clear bimodal distribution of distances (Fig. 5A). However, the distributions for hosts and other prokaryotes within the same host genus (the “hard” negative case) overlap considerably, limiting the feature’s resolving power at the species level (Fig. 5B). This highlights a critical challenge for building models: training datasets built on “easy” negatives may distinguish trivial cases but fail to generalize to high-resolution, intra-genus predictions.

Figure 5.

Two plots, A and B, showing distributions of 4-mer frequency distance for known virus–host pairs (positive) versus non-host pairs (negative). In plot A, representing an “easy” case, the positive and negative distributions are well-separated. In plot B, representing a “hard” case, the distributions are heavily overlapping.

4-mer frequency distance distribution of known virus–host pairs (positive pairs) from Genbank against those of non-host pairs (negative pairs). A: The “easy” negative case includes negative pairs where the prokaryotes are in different taxa from the given host (e.g. species-level taxonomy). B: The “hard” negative case has a constraint on the highest taxonomy rank of the prokaryotes (e.g. prokaryotes do not belong to the same species but belong to the same genus as the host).

Host-specific marker genes

Host-specific marker genes, such as Receptor-Binding Proteins (RBPs) in tailed phages, offer direct mechanistic links to host specificity. Homology between RBPs can strongly suggest a shared host range. However, the utility of this feature is limited. First, it is primarily applicable to specific viral clades like Caudoviricetes, as RBPs are not universally conserved. Second, high sequence diversity makes RBP identification computationally challenging, with even state-of-the-art tools achieving an F1-score of only around 0.8 [54], thereby hindering their broad application in host prediction.

Viral genomic similarity

A more general approach relies on viral genomic similarity, measured by shared protein content, synteny, or Inline graphic-mer frequency, between a query virus and reference viruses in the database. This method is founded on the observation that viruses belonging to the same taxonomic group (e.g. genus) often infect a similar host. To validate this assumption, we analyzed the Viral RefSeq database. Our results confirmed that this principle holds for a significant subset of viruses: 73% of genera (927 of 1,263) with species-level host annotations exhibit perfect host consistency, defined as all viruses within the genus infecting the same host species (Fig. 6). However, only 45% of the genera (1,263 of 2,760) have clear host annotations. Consequently, the practical utility of this method is constrained by the current state of database annotation. The prevalence of viruses with unknown hosts or incomplete taxonomic information reduces the size and resolution of the reference set, thereby limiting the predictive power of this approach.

Figure 6.

Two pie charts about host annotation in viral genera. The first chart shows that 54% of genera have no annotation. The second chart focuses on the annotated portion, showing that 73% of those have a consistent annotation.

Proportion of viral genera with consistent host annotations in the Viral RefSeq database. The pie chart on the left shows that a majority of genera (54%) lack any host annotation, while the remaining 46% (1,263 genera) are annotated. The chart on the right further examines this annotated portion, revealing that 73% (927 genera) have a consistent host annotation.

Benchmarking reveals a trade-off between performance and efficiency in host prediction

A comprehensive evaluation of existing virus–host prediction methods is essential for providing practical guidance to researchers. In practice, two primary application scenarios for these methods can be defined. The first scenario involves the large-scale analysis of viral sequences, often sourced from public databases, where corresponding host genomes are unavailable. In this context, predictions must be made using only the viral sequences as input. The second scenario involves the analysis of specific metagenomic datasets that contain both viral sequences and a set of candidate host genomes or Metagenome-Assembled Genomes (MAGs) from the same sample. Here, prediction tools can leverage information from both the virus and potential hosts. To evaluate tool performance in both contexts, we constructed two distinct benchmark datasets: RefSeq-VHDB (only with viral sequences) and MetaHiC-VHDB (with both viral sequences and MAGs), each designed to simulate one of these scenarios (see Methods). Then, we present a performance comparison of the accessible tools previously identified in Table 2, evaluating them on both benchmark datasets.

Benchmarking tools using viral sequence-only data

The RefSeq-VHDB dataset contains only viral genomic sequences and lacks corresponding host genomes for each entry. The data property is illustrated in Supplementary Fig. S5. This structural limitation precludes the use of tools that require a user-supplied set of candidate host genomes (i.e. those based on a link prediction framework). Consequently, our comparative analysis on this dataset was restricted to tools capable of predicting a host from the viral sequence alone, including PHP, VHM-Net, RaFAH, DeepHost, vHULK, CHERRY, iPHoP, and PHERI. Notably, while PHP, iPHoP, and CHERRY are fundamentally designed for link prediction, they circumvent this limitation by incorporating their own internal, pre-compiled host databases. This feature enables them to generate predictions using only a viral sequence as input.

The performance of the eight selected tools on the RefSeq-VHDB benchmark is summarized in Fig. 7. We evaluated each tool based on its prediction accuracy and prediction rate at the species, genus, and family taxonomic ranks (see Methods). For tools capable of returning multiple host predictions for a single virus (e.g. PHP, CHERRY, and iPHoP), we only considered the top-ranked prediction to ensure a fair and standardized comparison. It should be noted that RaFAH and PHERI do not provide species-level predictions; therefore, their performance is reported only at the genus and family levels. Conversely, while PHP was not originally evaluated at the species rank in its publication, the software does produce species-level outputs, which we have included in our analysis.

Figure 7.

Two panels of bar charts compare the performance of eight viral host prediction tools. The left panel shows prediction accuracy, and the right shows prediction rate. In both panels, tools are compared within the species, genus, and family taxonomic ranks. Bars for two tools, RaFAH and PHERI, are absent from the species-level comparison.

Performance of eight viral host prediction tools on the RefSeq-VHDB benchmark. The tools are evaluated based on prediction accuracy (left panel) and prediction rate (right panel) at the species, genus, and family taxonomic ranks. For tools capable of returning multiple predictions, only the top-ranked result was considered. RaFAH and PHERI do not provide species-level predictions and are therefore omitted from that category.

In terms of prediction accuracy (Fig. 7), a general trend was observed across all methods: performance improved as the taxonomic classification became higher, from species to family. At the species level, CHERRY demonstrated markedly superior performance, correctly identifying the host for approximately 77% of the predicted viruses. At the genus level, RaFAH achieved the highest accuracy at approximately 84%, with CHERRY (83%) and PHERI (80%) also showing strong performance. This trend continued at the family rank, where RaFAH (92%) and CHERRY (90%) were the top performers. These findings lead to a clear set of recommendations.

The prediction rate analysis reveals two distinct behaviors among the tools (Fig. 7). A subset of the tools—PHP, VHM-Net, and DeepHost—achieved a 100% prediction rate across all taxonomy ranks, providing a host prediction for every viral sequence in the dataset. In contrast, vHULK, CHERRY, iPHoP, and PHERI appear to employ internal confidence thresholds, as they did not return a prediction for every query. This indicates a potential trade-off, where these tools may sacrifice comprehensive prediction rate to maintain a higher certainty for the predictions they do make. For general-purpose applications requiring high accuracy, our analysis identifies CHERRY as the most effective and robust tool, achieving the highest species-level accuracy and top-tier performance at higher taxonomic ranks with a 100% prediction rate. While lacking species-level resolution, RaFAH is also an excellent choice for genus- and family-level predictions.

Interestingly, despite arranging the tools chronologically by their publication date on the x-axis, our findings reveal no strong correlation between a tool’s release date and its performance. The sustained high performance of certain tools can likely be attributed to their underlying algorithm and choice of biological features. In support of this, we observed that the top-performing tools, as illustrated in Table 2, all incorporate features based on viral genomic or protein similarity, demonstrating the critical importance of this approach in this application scenario.

Benchmarking tools using viruses and MAGs driven from metagenomics

To evaluate tool performance in a more practical metagenomic context, we next employed a benchmark derived from MetaHiC sequencing data. This dataset, termed MetaHiC-VHDB, comprises 251 virus–host interactions sourced from three distinct environments: 84 from human gut samples [55], 66 from bovine fecal samples [56], and 101 from wastewater samples [57] (see Methods). In contrast to the RefSeq-VHDB benchmark, this evaluation framework provided tools with both viral contigs and a corresponding set of candidate host MAGs from the same sample. This experimental design is intended to more faithfully replicate the real-world challenge of identifying the specific host of a newly discovered virus within a complex microbial community.

We evaluated the tools compatible with this input format: WIsH, PHP, PHIST, CHERRY, iPHoP, and PB-LKS. Performance was assessed at four taxonomic levels: exact host genome (exact match), species, genus, and family (Fig. 8). As expected, the tools generally exhibited improved accuracy at higher taxonomic ranks. Among them, PHP, PHIST, and CHERRY demonstrated the most robust and competitive accuracy across the three datasets. Specifically, PHIST was a top performer at the exact match level in both the cow fecal and human gut datasets. CHERRY exhibited consistent performance across all three environments, but at the cost of a lower prediction rate compared to other tools. While PHP generally achieved slightly lower accuracy than PHIST and CHERRY, it offered the advantage of a 100% prediction rate. In contrast, iPHoP’s performance was highly variable; it performed well in the human gut dataset but was less accurate in the wastewater and cow fecal environments, highlighting that environment-specific factors can significantly influence prediction accuracy.

Figure 8.

A 2×3 grid of bar charts compares eight viral host prediction tools. The top row shows accuracy, and the bottom shows prediction rate. The three columns represent Wastewater, Human gut, and Cow fecal datasets. Each chart compares the tools across four levels, from exact match to family.

Performance comparison of eight viral host prediction tools on three metagenomic datasets. The evaluation was conducted on virus–host pairs from Wastewater, Human gut, and Cow fecal matter samples. The top row (A) displays prediction accuracy, while the bottom row (B) shows the prediction rate. Performance was assessed at four levels: exact match (correct host’s genome), species, genus, and family.

For comparison, we also benchmarked two direct-evidence methods: prophage detection and standard CRISPR spacer matching. As expected, these methods achieved near-perfect accuracy (100% at the species level) when a prediction could be made, confirming that a direct match is reliable evidence of a virus–host link. This exceptional accuracy, however, was coupled with extremely low prediction rate. The prediction rate for both methods was less than 5% across the datasets, with CRISPR spacer matching failing to yield any predictions in the wastewater and cow fecal matter samples (Fig. 8). This further confirmed our previous finding that significant matches between the viral contigs and host MAGs were rare.

Ensemble of top-performing tools enhances host prediction accuracy

To investigate strategies for maximizing host prediction performance, we ensembled the outputs from the top-performing tools using three distinct strategies (see Methods). The first, “Union + Majority Vote”, aggregated all predictions and used a majority rule to assign the host. The second, “Joint + Majority Vote”, also used a majority rule but was restricted to the subset of viruses predicted by all tools. The final and most stringent method, “Joint + Consensus”, required a unanimous prediction from all three tools to make a final assignment.

As shown in Supplementary Fig. S6, the “joint Inline graphic consensus” approach leads to a notable improvement, reaching an accuracy of 99% on the RefSeq-VHDB dataset. The other combination strategies offered no significant benefit over single-tool performance. Consistent with our findings, The “joint + consensus” approach again yielded the best performance, improving average accuracy by up to 15% over the best individual tool across all three datasets in MetaHiC-VHDB (Supplementary Fig. S7). However, this came at the cost of a significantly lower prediction rate on all datasets. Thus, for specialized applications where accuracy is paramount and lower prediction rate is acceptable, employing a “joint Inline graphic consensus” of top-performers is the optimal strategy to boost the host prediction performance.

Computational efficiency and scalability of prediction tools

Beyond prediction performance, the computational efficiency of a tool is a critical factor for its practical application, particularly when analyzing large-scale metagenomic datasets. We, therefore, evaluated the computational performance of each tool by measuring the average elapsed time required to process 1,000 viral contigs on a consistent hardware platform: Intel Xeon Gold 6258R CPU with 40 cores and Tesla A100 if GPU is required.

The results, presented in Table 3, reveal a stark divergence in computational requirements among the tools. A group of methods, including PHIST, PHP, DeepHost, and CHERRY, demonstrated high efficiency, requiring only a few minutes to complete the task. In contrast, several tools were exceptionally resource-intensive. iPHoP and PB-LKS were the most computationally demanding tools. This significant variation in performance can largely be attributed to the underlying feature extraction process, which is often the primary computational bottleneck. For example, tools like iPHoP rely on searching against an extremely large internal database (over 300 GB in size) to generate features for each prediction. This exhaustive search strategy, while potentially powerful for finding homologous signals, results in a prohibitively long runtime for large-scale analyses. Conversely, the highly efficient tools likely employ more lightweight feature extraction methods, such as Inline graphic-mer frequency calculations or the use of pre-trained models that do not require extensive database searches for every prediction. This trade-off between computational cost and methodological approach may be a key consideration for researchers when selecting a tool appropriate for the scale of their data and accessible resources.

Table 3.

The average elapsed time to predict labels of 1,000 viral sequences for each method. All the methods are run on IntelInline graphic XeonInline graphic Gold 6258R CPU with 40 cores and Tesla A100 (if GPU is required)

Program WIsH CHERRY iPHoP PHP PHERI vHULK VHM-Net DeepHost PB-LKS PHIST
Min/1,000 virus 3 5 11,981 4 2,116 29 240 < 1 10,280 < 1

Discussion

The rapid expansion of viral sequence data has created a critical need for robust computational methods to link viruses to their microbial hosts. This review provides a comprehensive guide for navigating the complex field of computational virus–host prediction. We began by systematically defining the two primary problem formulations—link prediction and multi-class classification—that frame the task. We then analyzed the genomic features used for prediction, detailing their respective advantages and limitations. Our survey of published tools revealed that while numerous methods exist, their practical accessibility varies significantly. This observation motivated the construction of two distinct benchmark datasets tailored to different research scenarios. The RefSeq-VHDB benchmark was designed to test the classification scenario, where a tool must predict a host for a viral sequence without a specific set of candidate host genomes. In contrast, the MetaHiC-VHDB benchmark simulates a metagenomic analysis, providing both viral contigs and candidate host MAGs from the same environment to directly test a tool’s ability to perform de novo link prediction.

A key finding from our benchmarking is that the optimal strategy for host prediction is highly dependent on the research context and the user’s specific goals. When selecting a single tool, our results offer a clear guide. On the RefSeq-VHDB dataset, RaFAH demonstrates superior accuracy at higher taxonomic ranks. For discovering novel interactions within metagenomes (MetaHiC-VHDB), link-prediction tools like PHIST and PHP are highly effective. CHERRY stands out as a robust generalist, delivering consistent performance across both benchmark types. Beyond the performance of individual tools, our analysis reveals a powerful strategy for applications where precision is paramount. By employing a “joint Inline graphic consensus” ensemble of the top-performing tools, users can increase predictive accuracy to nearly 99% on reference data. However, this high confidence comes with a significant and predictable trade-off: a sharp reduction in the overall prediction rate. This highlights that users must choose not only the right tool, but also the right approach: For exploratory studies requiring high prediction rate, a high-performing single tool is optimal (Supplementary Table. 1). For validation or high-confidence discovery where false positives are costly, the joint consensus approach is superior.

Beyond predictive accuracy and prediction rate, our practical evaluation revealed significant barriers that impact a tool’s real-world utility. In particular, computational cost is a critical and often-overlooked factor. The substantial resource requirements of tools like iPHoP and PB-LKS can render them impractical for analyzing large-scale metagenomic datasets, whereas the efficiency of methods such as PHIST and PHP makes them highly scalable. Ultimately, such practical considerations are as important as predictive accuracy in determining a tool’s widespread adoption by the research community.

Due to the complexity and ongoing evolution of the host prediction problem, this study has limitations inherent to the current state of the field. Primarily, our evaluation framework is constrained by public databases that document one-to-one virus–host associations, and thus assesses the prediction of only a single host per virus. This approach does not fully consider the host range of polyvalent phages and may underestimate the performance of features (e.g. CRISPR spacer match) that correctly identify alternative hosts. However, we found this limitation may have a minimal impact on the validity of our comparative benchmark. The Top-N analysis revealed that considering multiple predictions introduced significant ambiguity for only marginal gains in accuracy. This finding, combined with the understanding that most phages have narrow species-level host ranges, suggests that false negatives from polyvalence are rare and do not skew the overall performance trends. Second, although our analysis covers the most common features in current methods, other signals could inform future approaches, including tRNA gene matching, methylation patterns, and co-abundance profiles. Our preliminary results indicate these features hold potential as complementary signals for host prediction (Supplementary Fig. S8 and S9). Third, while our benchmark provides reliable species-level evaluation, predicting host range at the strain level remains critical for applications like phage therapy. In this context, tools built on a link prediction framework—such as PHP, iPHoP, and CHERRY—offer greater utility due to their ability to output multiple candidates. This need has spurred the development of specialized models for bacteria like Escherichia and Klebsiella, which often incorporate domain-specific knowledge such as O-antigen serotypes to achieve high resolution [58, 59]. This highlights a fundamental trade-off between the broad applicability of general-purpose predictors and the precision of tools tailored to specific microbial systems.

Looking forward, a key challenge is the dynamic nature of virus–host interactions, such as host switching, where a virus’s genomic signature may not yet reflect its new tropism. This phenomenon can confound prediction models that depend on established signals of co-living. Fortunately, emerging technologies offer promising avenues to address these dynamics. In particular, single-cell sequencing has demonstrated significant potential to identify rare virus–host interactions and profile phage infections at high resolution [60]. In contrast to bulk methods like MetaHiC, single-cell approaches can provide more precise data on infection timing, prophage induction, and host gene expression, enabling the study of virus–host dynamics at the level of individual infection events rather than community-level averages.

Finally, while our review focused on viruses that infect prokaryotes, predicting hosts for RNA viruses represents another critical domain with a distinct set of challenges. The compact genomes of RNA viruses offer limited space for host-tropism signals, and the rarity of direct virus–host sequence matches largely precludes the use of link prediction methods. Consequently, existing approaches are typically formulated as multi-class classification tasks that rely on virus-virus similarity [61, 62]. The efficacy of these models, however, is constrained by the high mutation rates of RNA viruses, which reduce the genomic conservation needed for accurate generalization to novel viruses. This confluence of biological hurdles—compounded by the broad host ranges of many RNA viruses and sparse host databases—marks this area as a significant frontier for future research.

Methods

CRISPR spacer, prophage, and tRNA matches

To establish a labeled data for our analysis, we utilized a dataset of 4,698 viruses with species-level host annotations from the Viral RefSeq database (February 2025). For the host genomic data, we downloaded all 110,988 accessible prokaryotic genomes from GenBank (February 2025).

  • CRISPR spacer match: We obtained 2,005,489 CRISPR spacers derived from the prokaryotic genomes using CRT v1.2 [63] with default parameters. NCBI BLAST+ v2.16 was employed to align viral genomes to the CRISPR spacers database.

  • Prophage match: we employed NCBI BLAST+ v2.16 to align 4,698 viral genomes to prokaryotic genomes.

  • tRNA match: We obtained 7,709,234 tRNA genes derived from the prokaryotic genomes using ARAGORN [64]. Then, NCBI BLAST+ v2.16 was employed to align viral genomes to the tRNA genes database.

Evaluation metrics of CRISPR spacer and prophage matches

To estimate the usability and reliability of the features derived from CRISPR spacer match and prophage detection, we used four metrics to evaluate the results. We first defined the correct alignment/hit as the viral sequences is aligned to their host or prokaryote has the same taxonomy as its host. Then, the metrics can be calculated as below. To present the statistical process more intuitively, a simple example can be found via Supplementary Fig. S4.

Top-1 accuracy

Accuracy summarized on the best hit (the hit with the best alignment score). For each virus we evaluate Top-1 correct: the number of viruses which best hit is the correct hit. Then, the Top-1 correct will be divided by the total number viruses that have alignment results (Equation 3).

graphic file with name DmEquation3.gif (3)

Top-N accuracy

Accuracy summarized on the all the alignment hits. For each virus we evaluate Top-N correct: the number of viruses which correct hit is included in the alignment results. Then, the Top-N correct will be divided by the total number viruses that have alignment results (Equation 4).

graphic file with name DmEquation4.gif (4)

Alignment proportion

It is a proportion of aligned viruses, which reflect how many viruses can have at least one alignment (Equation 5).

graphic file with name DmEquation5.gif (5)

Correct match ratio

It is a proportion of the correct hits in the alignment results, estimating the errors introduced by using Top-N accuracy. To minimize the bias introduced by the data distribution (some taxonomy has much more prokaryotic genomes), we apply the number of unique taxonomy in the alignment to replace the number of hits. The larger the correct match ratio, the less error introduced by Top-N accuracy (Equation 6).

graphic file with name DmEquation6.gif (6)

The RefSeq-VHDB benchmark

The NCBI RefSeq dataset, initially utilized in VHM and later refined in CHERRY, is a widely adopted benchmark for evaluating virus–host prediction tools. For clarity in this review, we refer to it by the more formal name, the RefSeq Virus–Host Database (RefSeq-VHDB). This dataset is derived from the NCBI Viral RefSeq database and is filtered to retain only viruses with prokaryotic hosts (bacteria and archaea) that have a confident species-level annotation. A key advantage of RefSeq-VHDB is the high confidence of its virus–host linkages and its low data redundancy.

Our version of the dataset, downloaded in February 2025, contains 4,698 viruses linked to 498 distinct host species. The host distribution exhibits a long-tail pattern (Supplementary Fig. S5), a characteristic that poses a significant challenge for model robustness. The three most frequent host species—Escherichia coli, Salmonella enterica, and Mycolicibacterium smegmatis—account for 27.8% of the entries, while 245 species are represented by only a single virus/interaction.

Previous studies have established two primary methods for evaluating this dataset: a temporal split, which mimics the real-world challenge of classifying newly discovered viruses by training on older entries to predict hosts for newer ones, and a sequence similarity split, which assesses a model’s generalization capabilities [27, 28, 36]. In our benchmark, we chose to use the complete, unsplit dataset. Our rationale was to realistically reflect the performance users can expect from the latest release versions of all tools.

The MetaHiC-VHDB Benchmark

Evaluating prediction performance in a metagenomic sequencing data is challenging due to the difficulty of establishing ground-truth virus–host pairs. While databases like Prophage-DB exist, they are skewed towards temperate phages. Furthermore, their use for benchmarking can be unfair, as tools incorporating prophage detection steps (e.g. iPHoP, CHERRY, PhageTB) may achieve near-perfect accuracy.

To create a more representative benchmark, we constructed datasets based on proximity ligation sequencing (Hi-C). This method captures physical DNA-DNA interactions, including those between phages replicating within their host cells, providing a robust source of evidence for virus–host linkages. We processed three independent Hi-C sequencing datasets from diverse environments, including human gut [55], cow fecal matter [56], and wastewater [57]. First, we follow a standard pipeline to process the Hi–C reads as introduced in [65]: First, a standard cleaning procedure was applied to all raw WGS and Hi-C read libraries using bbduk from the BBTools suite (v37.62) We discarded short reads below 50 bp at each cleaning step. Adapter sequences were removed by bbduk with parameter “ktrim=r k=23 mink=11 hdist=1 minlen=50 tpe tbo” and reads were quality-trimmed using bbduk with parameters “trimq=10 qtrim=r ftm=5 minlen=50.” Then, the first 10 nucleotides of each read were trimmed by bbduk with parameter “ftl=10.” Second, all identical PCR optical and tile-edge duplicates for Hi-C paired-end reads were removed by the script “clumpify.sh” from BBTools suite. Third, the hicstuff (v3.2.4) is employed to generate the contacts matrix on the provided contigs and MAGs from the original research. Then, the resulting virus–host links were subjected to a stringent five-step filtering process to minimize false positives:

  • Viral contigs were identified using the intersection of predictions from PhaBOX/PhaMer, geNomad, and VirSorter2.

  • A minimum of two Hi-C read pairs were required to link a viral contig to a host MAG.

  • Host MAGs were required to have an intra-MAG connectivity of at least 10 links to ensure they were sufficiently well-assembled.

  • The virus–host connectivity ratio (Inline graphic) had to exceed 0.1.

  • For each virus, only the host with the highest number of supporting Hi-C links was retained as the definitive interaction.

The calculation of connectivity ratio (Inline graphic) is listed in Equation 7. It is calculated using the connectivity density (Inline graphic) of the virus-MAG pair Inline graphic and MAG Inline graphic to itself (Inline graphic), with connectivity density defined as the Hi-C link count per Inline graphic of sequence. This value is further normalized using a term derived from the viral copy number per MAG Inline graphic. The Inline graphic is estimated as shown in Equation 8, where Inline graphic is the abundance of virus Inline graphic, Inline graphic is the abundance of MAG Inline graphic, Inline graphic is the number of Hi-C links for the specific virus-MAG pair Inline graphic, and Inline graphic is the sum of links for the virus Inline graphic with all possible MAGs.

graphic file with name DmEquation7.gif (7)
graphic file with name DmEquation8.gif (8)

The taxonomy of all host MAGs was assigned using the GTDB-Toolkit (v2.4.0; Release:R220; classify-wf). The final, curated virus-host pairs from these three environments constitute our MetaHiC benchmark.

Evaluation metrics used in benchmarking

To ensure a fair and comprehensive comparison between existing tools, we selected two widely applicable metrics that can evaluate performance for tools based on both link-prediction and multi-class classification frameworks: Accuracy and Prediction Rate. Also, these two metrics can reflect the performance in real usage scenarios.

  • Accuracy: Assessing whether the predicted host taxon matches the ground-truth host taxon at a specific taxonomic rank (e.g. species, genus, and family). For the MetaHiC-VHDB benchmark, where specific host MAGs are provided, accuracy can be evaluated at the strain level by determining if the correct MAG was identified.

  • Prediction Rate: This metric is defined as the percentage of input viruses for which a tool successfully returns a host prediction. Some tools incorporate internal confidence thresholds and may not return a prediction for every query virus if the evidence is deemed insufficient to maintain a high accuracy. This metric, therefore, quantifies the practical applicability and coverage of a tool.

To assess whether combining outputs from top-performing tools could improve predictive performance, we evaluated three commonly used ensemble strategies:

  • Union + Majority Vote: This method aggregates all predictions from the three selected tools. For any virus with multiple predictions, the final host is determined by a majority vote. Ties are resolved by selecting the prediction from the tool with the highest individual performance on our benchmark.

  • Joint + Majority Vote: This method considers only the subset of viruses for which all three tools provided a prediction. Within this subset, a majority vote is applied to determine the final host, with ties resolved as described above.

  • Joint + Consensus: The most stringent approach, which yields a prediction only when all three tools are in unanimous agreement on the host assignment.

Key Points

  • A standardized problem formulation: We establish a structured framework for the virus–host prediction problem, providing a consistent foundation for understanding and comparing diverse computational strategies.

  • A comprehensive survey of tools and biological features: We critically evaluate 27 existing tools, examining their methodologies, strengths, and weaknesses. We then survey the full spectrum of biological features they employed, from CRISPR spacer matching and prophage detection to alignment-free Inline graphic-mer frequency analysis.

  • Rigorous benchmarking using carefully designed datasets: Another key novelty is the development and application of two distinct evaluation benchmarks. RefSeq-VHDB provides a curated set of phage–host pairs for standardized assessment, while MetaHiC-VHDB consists of three independent metagenomic Hi-C test sets designed to assess tool performance in realistic ecological contexts. These benchmarks provide a practical guide for researchers and expose performance gaps that future methods must address.

Supplementary Material

Supplementary_bbaf626

Acknowledgements

The computation was conducted at the HPCC of City University of Hong Kong.

Contributor Information

Jiayu Shang, Department of Information Engineering, Chinese University of Hong Kong, Hong Kong (SAR), HKG, China.

Cheng Peng, Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), HKG, China.

Jiaojiao Guan, Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), HKG, China.

Dehan Cai, Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), HKG, China.

Donglin Wang, School of Environmental Science and Engineering, Shandong University, Qingdao, Shandong, 266237, China.

Yanni Sun, Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), HKG, China.

Conflict of interest: The authors have no conflicts of interest to declare.

Funding

This work was supported by the Hong Kong Research Grants Council (RGC) General Research Fund (GRF) [11209823], City University of Hong Kong (the Institute of Digital Medicine, 9667256, 9678241), and Open Research Fund of Guangdong Provincial Key Laboratory of Wastewater Information Analysis and Early Warning [217310019].

Data availability

All the datasets used in this study are publicly available. The scripts and dataset information can be accessed via GitHub (https://github.com/KennthShang/HostPredictionReview) and Zenodo (https://doi.org/10.5281/zenodo.16975470). The MetaHiC data can be found via the NCBI Sequence Read Archive database (http://www.ncbi.nlm.nih.gov/sra). The human gut dataset is available under accession codes: shotgun library SRR6131123, Hi-C libraries SRR6131122, and SRR6131124. The cow fecal dataset used in this study is under accession codes: shotgun library ERX2333418 and Hi-C libraries ERX2548555 and ERX2548556. The wastewater dataset is available under accession codes: shotgun library SRR8239393 and Hi-C library SRR8239392.

References

  • 1. Naureen  Z, Dautaj  A, Anpilogov  K. et al.  Bacteriophages presence in nature and their role in the natural selection of bacterial populations. Acta Biomed  2020;91:e2020024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Batinovic  S, Wassef  F, Knowler  SA. et al.  Bacteriophages in natural and artificial environments. Pathogens  2019;8:100. 10.3390/pathogens8030100 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Cobián Güemes  AG, Youle  M, Cantú  VA. et al.  Viruses as winners in the game of life. Annu Rev Virol  2016;3:197–214. 10.1146/annurev-virology-100114-054952 [DOI] [PubMed] [Google Scholar]
  • 4. Dion  MB, Oechslin  F, Moineau  S. Phage diversity, genomics and phylogeny. Nat Rev Microbiol  2020;18:125–38. 10.1038/s41579-019-0311-5 [DOI] [PubMed] [Google Scholar]
  • 5. Zrelovs  N, Dislers  A, Kazaks  A. Motley crew: overview of the currently available phage diversity. Front Microbiol  2020;11:579452. 10.3389/fmicb.2020.579452 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Al-Shayeb  B, Sachdeva  R, Chen  L-X. et al.  Clades of huge phages from across Earth’s ecosystems. Nature  2020;578:425–31. 10.1038/s41586-020-2007-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Chen  L-X, Méheust  R, Crits-Christoph  A. et al.  Large freshwater phages with the potential to augment aerobic methane oxidation. Nat Microbiol  2020;5:1504–15. 10.1038/s41564-020-0779-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Jansson  JK, Wu  R. Soil viral diversity, ecology and climate change. Nat Rev Microbiol  2023;21:296–311. 10.1038/s41579-022-00811-z [DOI] [PubMed] [Google Scholar]
  • 9. Wei  Y, Zhou  C. Bacteriophages: a double-edged sword in the gastrointestinal tract. Front Microbiomes  2024;3:1450523. 10.3389/frmbi.2024.1450523 [DOI] [Google Scholar]
  • 10. Huang  D, Xia  R, Chen  C. et al.  Adaptive strategies and ecological roles of phages in habitats under physicochemical stress. Trends Microbiol  2024;32:902–16. 10.1016/j.tim.2024.02.002 [DOI] [PubMed] [Google Scholar]
  • 11. Letarov  A, Kulikov  E. Adsorption of bacteriophages on bacterial cells. Biochemistry  2017;82:1632–58. 10.1134/S0006297917130053 [DOI] [PubMed] [Google Scholar]
  • 12. Leprince  A, Mahillon  J. Phage adsorption to gram-positive bacteria. Viruses  2023;15:196. 10.3390/v15010196 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Yu  Z, Schwarz  C, Zhu  L. et al.  Hitchhiking behavior in bacteriophages facilitates phage infection and enhances carrier bacteria colonization. Environ Sci Technol  2020;55:2462–72. [DOI] [PubMed] [Google Scholar]
  • 14. You  X, Klose  N, Kallies  R. et al.  Mycelia-assisted isolation of non-host bacteria able to co-transport phages. Viruses  2022;14:195. 10.3390/v14020195 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Hobbs  SJ, Kranzusch  PJ. Nucleotide immune signaling in CBASS, Pycsar, Thoeris, and CRISPR antiphage defense. Annu Rev Microbiol  2024;78:255–76. 10.1146/annurev-micro-041222-024843 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Brady  A, Felipe-Ruiz  A, Gallego del Sol  F. et al.  Molecular basis of lysis–lysogeny decisions in gram-positive phages. Annu Rev Microbiol  2021;75:563–81. 10.1146/annurev-micro-033121-020757 [DOI] [PubMed] [Google Scholar]
  • 17. Howard-Varona  C, Hargreaves  KR, Abedon  ST. et al.  Lysogeny in nature: mechanisms, impact and ecology of temperate phages. ISME J  2017;11:1511–20. 10.1038/ismej.2017.16 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Lin  DM, Koskella  B, Lin  HC. Phage therapy: an alternative to antibiotics in the age of multi-drug resistance. World J Gastrointest Pharmacol Ther  2017;8:162. 10.4292/wjgpt.v8.i3.162 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Plumet  L, Magnan  C, Costechareyre  D. et al.  Phage therapy: a promising approach for Staphylococcus aureus diabetic foot infections. J Virol  2025;99:e00458–25. 10.1128/jvi.00458-25 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Kovacs  CJ, Rapp  EM, Rankin  WR. et al.  Combinations of bacteriophage are efficacious against multidrug-resistant Pseudomonas aeruginosa and enhance sensitivity to carbapenem antibiotics. Viruses  2024;16:1000. 10.3390/v16071000 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Bumunang  EW, Zaheer  R, Niu  D. et al.  Bacteriophages for the targeted control of foodborne pathogens. Foods  2023;12:2734. 10.3390/foods12142734 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Ross  A, Ward  S, Hyman  P. More is better: selecting for broad host range bacteriophages. Front Microbiol  2016;7:1352. 10.3389/fmicb.2016.01352 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Chung  KM, Liau  XL, Tang  SS. Bacteriophages and their host range in multidrug-resistant bacterial disease treatment. Pharmaceuticals  2023;16:1467. 10.3390/ph16101467 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Howell  AA, Versoza  CJ, Pfeifer  SP. Computational host range prediction—the good, the bad, and the ugly. Virus Evol  2024;10:1–11. 10.1093/ve/vead083 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Villarroel  J, Kleinheinz  KA, Jurtz  VI. et al.  HostPhinder: a phage host prediction tool. Viruses  2016;8:116. 10.3390/v8050116 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Galiez  C, Siebert  M, Enault  F. et al.  WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics  2017;33:3113–4. 10.1093/bioinformatics/btx383 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Ahlgren  NA, Ren  J, Lu  YY. et al.  Alignment-free oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res  2017;45:39–53. 10.1093/nar/gkw1002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Lu  C, Zhang  Z, Cai  Z. et al.  Prokaryotic virus host predictor: a gaussian model for host prediction of prokaryotic viruses in metagenomics. BMC Biol  2021;19:5. 10.1186/s12915-020-00938-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Shang  J, Sun  Y. Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning. BMC Biol  2021;19:250. 10.1186/s12915-021-01180-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Coutinho  FH, Zaragoza-Solas  A, López-Pérez  M. et al.  RaFAH: host prediction for viruses of bacteria and archaea based on protein content. Patterns  2021;2:100274. 10.1016/j.patter.2021.100274 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Pons  JC, Paez-Espino  D, Riera  G. et al.  VPF-class: taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinformatics  2021;37:1805–13. 10.1093/bioinformatics/btab026 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Li  M, Wang  Y, Li  F. et al.  A deep learning-based method for identification of bacteriophage–host interaction. IEEE/ACM Trans Comput Biol Bioinform  2020;18:1801–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Wang  W, Ren  J, Tang  K. et al.  A network-based integrated framework for predicting virus–prokaryote interactions. NAR Genom Bioinform  2020;2:lqaa044. 10.1093/nargab/lqaa044 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Li  M, Zhang  W. PHIAF: prediction of phage–host interactions with GAN-based data augmentation and sequence-based feature fusion. Brief Bioinform  2022;23:bbab348. 10.1093/bib/bbab348 [DOI] [PubMed] [Google Scholar]
  • 35. Zielezinski  A, Deorowicz  S, Gudyś  A. PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences. Bioinformatics  2022;38:1447–9. 10.1093/bioinformatics/btab837 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Shang  J, Sun  Y. CHERRY: a computational metHod for accuratE pRediction of virus–pRokarYotic interactions using a graph encoder–decoder model. Brief Bioinform  2022;23:1–16. 10.1093/bib/bbac182 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Ruohan  W, Xianglilan  Z, Jianping  W. et al.  DeepHost: phage host prediction with convolutional neural network. Brief Bioinform  2022;23:1–10. 10.1093/bib/bbab385 [DOI] [PubMed] [Google Scholar]
  • 38. Tan  J, Fang  Z, Wu  S. et al.  HoPhage: an ab initio tool for identifying hosts of phage fragments from metaviromes. Bioinformatics  2022;38:543–5. 10.1093/bioinformatics/btab585 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Amgarten  D, Iha  BKV, Piroupo  CM. et al.  vHULK, a new tool for bacteriophage host prediction based on annotated genomic features and neural networks. Phage  2022;3:204–12. 10.1089/phage.2021.0016 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Roux  S, Camargo  AP, Coutinho  FH. et al.  iPHoP: an integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLoS Biol  2023;21:e3002083. 10.1371/journal.pbio.3002083 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. Aggarwal  S, Dhall  A, Patiyal  S. et al.  An ensemble method for prediction of phage-based therapy against bacterial infections. Front Microbiol  2023;14:1148579. 10.3389/fmicb.2023.1148579 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42. Du  Z-H, Zhong  J-P, Liu  Y. et al.  Prokaryotic virus host prediction with graph contrastive augmentation. PLoS Comput Biol  2023;19:e1011671. 10.1371/journal.pcbi.1011671 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43. Baláž  A, Kajsik  M, Budiš  J. et al.  PHERI—phage host exploration pipeline. Microorganisms  2023;11:1398. 10.3390/microorganisms11061398 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44. Gonzales  MEM, Ureta  JC, Shrestha  AM. Protein embeddings improve phage–host interaction prediction. PloS One  2023;18: e0289030. 10.1371/journal.pone.0289030 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Wei  T, Lu  C, Du  H. et al.  DeepPBI-KG: a deep learning method for the prediction of phage–bacteria interactions based on key genes. Brief Bioinform  2024;25:bbae484. 10.1093/bib/bbae484 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Bastien  GE, Cable  RN, Batterbee  C. et al.  Virus–host interactions predictor (VHIP): machine learning approach to resolve microbial virus–host interaction networks. PLoS Comput Biol  2024;20:e1011649. 10.1371/journal.pcbi.1011649 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Qiu  J, Nie  W, Ding  H. et al.  PB-LKS: a python package for predicting phage–bacteria interaction through local K-mer strategy. Brief Bioinform  2024;25:1–14. 10.1093/bib/bbae010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48. Liu  D, Young  F, Lamb  KD. et al.  Prediction of virus–host associations using protein language models and multiple instance learning. PLoS Comput Biol  2024;20:e1012597. 10.1371/journal.pcbi.1012597 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Liu  F, Zhao  Z, Liu  Y. PHPGAT: predicting phage hosts based on multimodal heterogeneous knowledge graph with graph attention network. Brief Bioinform  2025;26:bbaf017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Wei  A, Xiao  Z, Fu  L. et al.  Predicting phage–host interactions via feature augmentation and regional graph convolution. Brief Bioinform  2024;26:bbae672. 10.1093/bib/bbae672 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Gonzales  MEM, Ureta  JC, Shrestha  AM. PHIStruct: improving phage–host interaction prediction at low sequence similarity settings using structure-aware protein embeddings. Bioinformatics  2025;41:btaf016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52. Nayfach  S, Páez-Espino  D, Call  L. et al.  Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat Microbiol  2021;6:960–70. 10.1038/s41564-021-00928-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53. Johansen  J, Atarashi  K, Arai  Y. et al.  Centenarians have a diverse gut virome with the potential to modulate metabolism and promote healthy lifespan. Nat Microbiol  2023;8:1064–78. 10.1038/s41564-023-01370-6 [DOI] [PubMed] [Google Scholar]
  • 54. Boeckaerts  D, Stock  M, De Baets  B. et al.  Identification of phage receptor-binding protein sequences with hidden Markov models and an extreme gradient boosting classifier. Viruses  2022;14:1329. 10.3390/v14061329 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Press  MO, Wiser  AH, Kronenberg  ZN. et al.  Hi-C deconvolution of a human gut microbiome yields high-quality draft genomes and reveals plasmid-genome interactions. Biorxiv  2017;198713. [Google Scholar]
  • 56. Stewart  RD, Auffret  MD, Warr  A. et al.  Assembly of 913 microbial genomes from metagenomic sequencing of the cow rumen. Nat Commun  2018;9:870. 10.1038/s41467-018-03317-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57. Stalder  T, Press  MO, Sullivan  S. et al.  Linking the resistome and plasmidome to the microbiome. ISME J  2019;13:2437–46. 10.1038/s41396-019-0446-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58. Boeckaerts  D, Stock  M, Ferriol-González  C. et al.  Prediction of klebsiella phage–host specificity at the strain level. Nat Commun  2024;15:4355. 10.1038/s41467-024-48675-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59. Gaborieau  B, Vaysset  H, Tesson  F. et al.  Prediction of strain level phage–host interactions across the Escherichia genus using only genomic information. Nat Microbiol  2024;9:2847–61. 10.1038/s41564-024-01832-5 [DOI] [PubMed] [Google Scholar]
  • 60. Wang  B, Lin  AE, Yuan  J. et al.  Single-cell massively-parallel multiplexed microbial sequencing (M3-seq) identifies rare bacterial populations and profiles phage infection. Nat Microbiol  2023;8:1846–62. 10.1038/s41564-023-01462-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Pandit  PS, Anthony  SJ, Goldstein  T. et al.  Predicting the potential for zoonotic transmission and host associations for novel viruses. Commun Biol  2022;5:844. 10.1038/s42003-022-03797-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Chen  G, Jiang  J, Sun  Y. RNAVirHost: a machine learning–based method for predicting hosts of RNA viruses through viral genomes. GigaScience  2024;13:giae059. 10.1093/gigascience/giae059 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63. Bland  C, Ramsey  TL, Sabree  F. et al.  CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinform  2007;8:209. 10.1186/1471-2105-8-209 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Laslett  D, Canback  B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res  2004;32:11–6. 10.1093/nar/gkh152 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65. Wu  R, Davison  MR, Nelson  WC. et al.  Hi-C metagenome sequencing reveals soil phage–host interactions. Nat Commun  2023;14:7666. 10.1038/s41467-023-42967-z [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary_bbaf626

Data Availability Statement

All the datasets used in this study are publicly available. The scripts and dataset information can be accessed via GitHub (https://github.com/KennthShang/HostPredictionReview) and Zenodo (https://doi.org/10.5281/zenodo.16975470). The MetaHiC data can be found via the NCBI Sequence Read Archive database (http://www.ncbi.nlm.nih.gov/sra). The human gut dataset is available under accession codes: shotgun library SRR6131123, Hi-C libraries SRR6131122, and SRR6131124. The cow fecal dataset used in this study is under accession codes: shotgun library ERX2333418 and Hi-C libraries ERX2548555 and ERX2548556. The wastewater dataset is available under accession codes: shotgun library SRR8239393 and Hi-C library SRR8239392.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES