PLOS Computational Biology. 2024 Sep 18;20(9):e1011649. doi: 10.1371/journal.pcbi.1011649

Virus-host interactions predictor (VHIP): Machine learning approach to resolve microbial virus-host interaction networks

G Eric Bastien 1, Rachel N Cable 1, Cecelia Batterbee 1, A J Wing 1, Luis Zaman 1,*, Melissa B Duhaime 1,*
Editor: Samuel V Scarpino
PMCID: PMC11441702  PMID: 39292721

Abstract

Viruses of microbes are ubiquitous biological entities that reprogram their hosts’ metabolisms during infection in order to produce viral progeny, impacting the ecology and evolution of microbiomes with broad implications for human and environmental health. Advances in genome sequencing have led to the discovery of millions of novel viruses and an appreciation for the great diversity of viruses on Earth. Yet, with knowledge of only “who is there?” we fall short in our ability to infer the impacts of viruses on microbes at population, community, and ecosystem scales. To do this, we need a more explicit understanding of “who do they infect?” Here, we developed a novel machine learning (ML) model, Virus-Host Interaction Predictor (VHIP), to predict virus-host interactions (infection/non-infection) from input virus and host genomes. This ML model was trained and tested on a high-value manually curated set of 8849 virus-host pairs and their corresponding sequence data. The resulting dataset, ‘Virus Host Range network’ (VHRnet), is core to VHIP functionality. Each data point that underlies the VHIP training and testing represents a lab-tested virus-host pair in VHRnet, from which meaningful signals of viral adaptation to host were computed from genomic sequences. VHIP departs from existing virus-host prediction models in its ability to predict multiple interactions rather than predicting a single most likely host or host clade. As a result, VHIP is able to infer the complexity of virus-host networks in natural systems. VHIP has an 87.8% accuracy rate at predicting interactions between virus-host pairs at the species level and can be applied to novel viral and host population genomes reconstructed from metagenomic datasets.

Author summary

The ecology and evolution of microbial communities are deeply influenced by viruses. Metagenomic analysis, the non-targeted sequencing of community genomes, has led to the discovery of millions of novel viruses. Yet, through the sequencing process, only DNA sequences are recovered, raising the question: which microbial hosts do those novel viruses infect? To address this question, we developed a computational tool to allow researchers to predict virus-host interactions from such sequence data. The power of this tool is its use of a high-value, manually curated set of 8849 lab-verified virus-host pairs and their corresponding sequence data. For each pair, we computed signals of coevolution to use as the predictive features in a machine learning model designed to predict interactions between viruses and hosts. The resulting model, Virus-Host Interaction Predictor (VHIP), has an accuracy of 87.8% and can be applied to novel viral and host genomes reconstructed from metagenomic datasets. Because the model considers all possible virus-host pairs, it can resolve complete virus-host interaction networks and supports a new avenue to apply network thinking to viral ecology.

Introduction

The development of metagenomic sequence analyses has led to unprecedented discoveries in microbial science [1,2], owing to the ability to study viruses and cellular microbes in their quintessential contexts. In particular, metagenomics has shed light on the genomic diversity and ubiquity of viruses [3–7]. Those viral populations are reconstructed from metagenomic data and identified as novel based on their sequence similarity to known viruses, expanding the total number of distinct uncultured viruses to millions in the last decade alone [3,8–11]. With these discoveries, there is mounting interest in characterizing how viruses impact microbial communities [12]. During infection, viruses influence ecological processes at multiple scales by shaping host population dynamics [13,14], modulating horizontal gene transfer [15,16], and reprogramming host metabolic pathways that can modulate the flux of environmental nutrients [17–20]. Given the central role viruses play in the ecology and evolution of their microbial hosts, there is great interest in including them in biogeochemical modeling by leveraging global-scale metagenomic data. While the population genomes of viruses and their microbial hosts can be reconstructed from metagenomic data with increasing accuracy, the inclusion of viruses in community metabolic models that can one day be scaled to ecosystems is impeded by the absence of an answer to arguably the most important question about any novel virus: who does it infect [11,21,22]?

To bridge this knowledge gap, various approaches have been explored. Phylogeography-based approaches have been successful in predicting virus-host associations in eukaryotic systems [23–25]. However, this approach is not applicable for viruses infecting microbes, as microbes show weak patterns of biogeography [26]. Another method leverages patterns of coevolution that can be extracted from the genomic sequences of the viruses and their hosts to predict the most likely taxa that can be infected by a given virus [27–31]. This approach of leveraging genomic signals to predict virus-host association is possible because viruses rely on their host machinery to complete their life cycle and evolve to better utilize those resources by matching the codon biases of their host [20,32,33]. This results in a meaningful and capturable signal that is embedded in the sequences of the viruses and their known hosts [34,35]. Those host prediction tools (HPTs) are, however, limited in scope. They typically only allow viral sequences as input, which restricts testing to host sequences that already exist in pre-defined reference databases, or they require sufficient expertise and resources from users to re-train the models to include new host sequences. These limitations restrict the applicability of such tools, as they are difficult to use to study viral host range in natural community contexts with newly discovered host populations. In addition, they typically focus on predicting the most likely taxa a virus infects, which does not reflect the breadth of natural virus-host range profiles [36,37] that can span different taxonomic levels [37–40]. Further, as they do not predict non-infection interactions, that is, a virus’s inability to infect a host, existing HPTs cannot be used to resolve virus-host interaction networks.

To address these limitations, we collected lab-verified viral-host interactions (infection and non-infection data) from public databases and literature and compiled them into a single dataset named ‘Virus Host Range network’ (VHRnet). This data is essential for exploring the strengths of the virus-host coevolution signals, assessing existing HPTs, and for developing a novel model that can predict both infection and non-infection relationships for all pairwise sets of viruses and putative hosts the prediction model may someday encounter. In this study, the VHRnet data were used to (1) quantify and evaluate genome-derived signals of coevolution captured in lab-validated virus-host pairs, (2) develop a machine learning model that leverages these virus-microbe coevolutionary signals, and (3) assess the accuracy of the model in predicting interactions (i.e., infection or non-infection) at the species level. The resulting model developed and described here is named VHIP for Virus-Host Interaction Predictor.

Results

VHRnet, a manually curated host range dataset unparalleled in size and scope

To train and test machine learning model approaches to predict virus-host relationships, we aimed to compile the most comprehensive host range dataset available, wherein all viral and host genome sequences are also publicly available (Fig 1A). For this, we relied on the fact that every virus in the NCBI RefSeq database has a host associated with it via the `/lab_host` tag in the GenBank file controlled syntax. In addition, for each virus in the RefSeq database, we searched published studies beyond the genome reports for documented host range trials (S1 Table). A total of 8849 lab-tested interactions were collected and compiled (Fig 2A). The majority of interactions in this dataset were non-infection (n = 6079), owing to a small number of large-scale host range studies that diligently tested and reported all virus-host pairs in their study (Fig 2B), rather than only reporting cases of infection, as is done in the vast majority of virus-host studies (S1 Table). The resulting dataset was named VHRnet (S2 Table), for Virus-Host Range network. It comprises 375 unique host species and 2292 unique viruses. The majority (94.7%) of the viruses belong to the Caudovirales order (Figs 2C and S1), which may be driven by culturing technique biases [36,41]. There are biases in the host taxa represented as well, with the majority of the tested hosts being human pathogens (Figs 2D and S2), partially driven by the recent surge in phage therapy research [42].

Fig 1. Design of data collection, feature computation, and evaluation of the machine learning approach developed in this study.


A. Metadata describing virus-host pairs was retrieved and compiled from NCBI and literature. Blue and yellow color indicates provenance of infection versus non-infection information, i.e., non-infection data came only from literature studies. B. Sequences of each virus and host were retrieved. Signals of coevolution were calculated for each virus-host pair, including C. sequence composition and D. sequence homology. E. The virus-host pairs and their computed signals of coevolution were used to train VHIP. To determine the best parameters for the machine learning model, a grid search was performed and bootstrapped 100 times on a training/testing set and evaluated on a hold-out set. The model was then retrained a final time using the best hyperparameters.

Fig 2. VHRnet network visualization and content characteristics.


A. Network visualization of VHRnet, where an edge connects a viral node to a host node if that pair has been experimentally tested. The edge is colored by the interaction class (infection/non-infection) and the nodes by virus taxonomy or whether it is a host node (black squares are host species). B. Origin of known lab-tested infection and non-infection data across dataset sources. C. Distribution of family classifications for the subset of viruses in VHRnet collected from NCBI. Lighter transparency represents the proportion of non-infection reports by viral family, relative to the solid portion, which represents known infection reports by viral family. D. Distribution of host genera for the subset of hosts in VHRnet collected from NCBI. Lighter color transparency represents the proportion of non-infection relative to infection (solid color).

The majority of the viruses (n = 1962, 84.2%) were reportedly tested against a single host; these pairings came from the `/lab_host` tag of their NCBI GenBank files. Of the remaining viruses (n = 369, 18.8%), most (81.8%) were tested against different host species with no cross-genus tests. The percentage of viruses tested across clades gets increasingly smaller at higher taxonomic levels: 78.5% were tested at most across families, 77.5% across orders, 9.2% across classes, and 0% across phyla or domains (S3 Fig). However, it is important to note that the numbers of cross-clade host tests were heavily influenced by two large-scale host range studies (herein the ‘Staphylococcus study’ [40] and ‘Nahant study’ [36]), in which every potential virus-host pair was experimentally tested, resulting in a “complete” virus-host network (Fig 2A and 2B) [36,40]. 70.8% of the VHRnet pairs came from those two studies alone. When excluding those two large host range studies, only 82 viruses in VHRnet (4.01%) were tested against multiple species. Of those viruses, 65.6% were tested against hosts belonging to different genera, 52.4% against different families, 46.3% against different orders, 41.2% against different classes, and none were tested against hosts belonging to different phyla.

VHRnet provides opportunity for holistic comparison of Host Prediction Tools (HPTs)

The diligent lab-testing of all virus-host pairs from Staphylococcus and Nahant studies (Fig 2A and 2B) presented an opportunity to directly compare existing HPTs (S3 Table) on the same lab-validated infection/non-infection data. While HPT benchmarking is common, such a direct comparison of HPTs has been challenging given that each tool uses similar but not identical datasets for model testing and training. For the most objective assessment, testing datasets must exclude training data, which is not possible when part of the data used to compare outputs of one model was included in the training sets of other benchmarked models. In assessing the accuracy of HPTs against the Nahant Collection and Staphylococcus datasets in this study, there were three possible outcomes: (1) correct host was predicted, (2) incorrect host was predicted, (3) a host was predicted that was not experimentally tested (“untested”, Fig 3). Note that if there are multiple known hosts for a virus, these models only need to predict one correct host to obtain 100% accuracy in this evaluation.

Fig 3. Evaluation of the accuracy of existing virus-host prediction models using the two complete host range studies, the Staphylococcus [40] and Nahant [36] studies.


Accuracy is evaluated at different host taxa thresholds, from species to phylum. The relative proportion of correct, incorrect, and untested predictions is represented by bar color. Untested predictions are hosts predicted by an HPT but that are not experimentally validated as correct or incorrect. The prediction that received the highest score by each HPT was evaluated. Because iPHoP outputs multiple predictions, both the highest scoring prediction (“iPHoP bp”) and the full set of predictions (“iPHoP”) were reported. *RaFAH does not return predictions at the species level.

We found that the accuracy of each HPT decreased as we considered more resolved host taxonomic levels (i.e., from domain to species level). Furthermore, the number of predicted virus-host pairs that were not experimentally tested also increased with increasing host taxonomic resolution (Fig 3). Existing HPTs performed better in predicting hosts for viruses in the Staphylococcus study than for viruses in the Nahant study, with overall fewer wrong predictions. Regardless, existing HPTs do not have the resolution to reliably predict species-level virus-host interactions. Moreover, due to the limitation that stems from the reference data used by HPTs (that is, only one known host per virus), these tools typically deliver only their highest scoring host prediction as output, which can lead to uncertainty in output data interpretation and downstream use. For instance, if there are five total predictions, all with low-quality scores, the best of the poor predictions will be reported as the “most likely host”, rather than returning a prediction of “no infection”. Similarly, if five hosts are predicted with similarly high scores, only a single prediction is chosen, rather than five “infection” predictions. This forced one-to-one prediction output model design limits the use of current HPTs and does not reflect the dynamics of virus-host relationships. Motivated to move beyond the one-to-one output design, we next leveraged the VHRnet data to identify and quantify genome-derived signals of coevolution between known infection pairs (relative to non-infection pairs) at the virus and host species level. We hypothesized that these data could be used to develop a many-to-many HPT design.

Genomic signals of virus adaptation to their host(s) are discernible at the species level

Signals of coevolution were computed for each virus-host pair in VHRnet (Fig 1A and 1B). These signals can be broadly divided into two categories: sequence composition (Fig 1C) and sequence homology (Fig 1D) [41]. For sequence homology, virus and host genomes were scanned for stretches of DNA sequence homology, which may indicate past virus-host interactions, such as CRISPR activity or horizontal gene transfer (HGT). We first identified bacteria and archaea in the VHRnet dataset with predicted CRISPR-Cas systems, then retrieved their associated CRISPR spacer motifs, which are short DNA sequences acquired from previous exposure to a foreign genetic element as part of the CRISPR-Cas adaptive immune response. Viral genomes were then scanned for the presence of these identical motifs. Stretches of non-CRISPR sequence homology between VHRnet viruses and hosts, which we attributed to HGT events, were identified similarly, except that the search was not restricted to predicted CRISPR spacer motifs (minimum identity 80%, minimum hit length 500 bp). Across the VHRnet data, instances of shared sequence homology between viral and host sequences were rare: sequence matches to CRISPR spacers were identified in 184 viruses out of 2292 (8.03%) and sequence matches attributed to HGT were identified in 340 viruses (14.83%). That the frequency of CRISPR matches was low was not surprising given that an estimated 50% of bacteria and 10% of archaea lack CRISPR-Cas viral defense systems [43].

In addition to sequence homology, we quantitatively evaluated signals of virus-host coevolution based on sequence composition. Because viruses rely on their host machinery to complete their life cycle [44], their genomes have a tendency towards matching the nucleotide usage biases of their hosts over coevolutionary time (the process of ‘genome amelioration’) [4547]. We computed k-mer profiles at k = 3, 6, and 9 for all viral and host sequences in VHRnet. We used both the Euclidean and the d2* distance metrics [35] to compute k-mer profile similarities between viruses and hosts and then to evaluate which measure encoded the strongest signals for virus-host predictions given our study design. As previously reported [35], the d2* metric outperformed the Euclidean distance metric in its ability to resolve sequence composition-based signals of virus-host coevolution (S4 Fig). In other applications, d2* has been shown to predict viral hosts at the genus level with twice the accuracy of the Euclidean metric [35]. The d2* metric differs from other distance metrics as it takes into consideration the background oligonucleotide patterns of the sequences being compared [48]. While the Euclidean metric remains the most popular for binning metagenomic contigs in the reconstruction of microbial populations from metagenomes, our results suggest that the d2* may be a better metric for binning as well, especially with continued optimization to reduce its computational burden. The d2* algorithm is rewritten here as a Python package to ease accessibility.
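For reference, a minimal sketch of the d2* dissimilarity under a zeroth-order (mononucleotide) background model is shown below; the packaged implementation may differ in details (e.g., background model order, handling of ambiguous bases), and `kmer_counts` and `d2star` are illustrative helper names, not the package's API.

```python
# Minimal sketch of the d2* k-mer dissimilarity (Ahlgren et al.-style),
# assuming a zeroth-order background and that all four bases occur in
# each sequence. Illustrative only, not the published implementation.
from itertools import product
from collections import Counter
import numpy as np

def kmer_counts(seq, k):
    """Counts of all overlapping k-mers, in a fixed ACGT word order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    return np.array([counts[w] for w in words], dtype=float), words

def d2star(seq_x, seq_y, k=6):
    x, words = kmer_counts(seq_x, k)
    y, _ = kmer_counts(seq_y, k)

    def word_probs(seq):
        # Expected k-mer probabilities from the sequence's own
        # mononucleotide frequencies (the "background" model).
        base = Counter(seq)
        total = sum(base[b] for b in "ACGT")
        p = {b: base[b] / total for b in "ACGT"}
        return np.array([np.prod([p[c] for c in w]) for w in words])

    px, py = word_probs(seq_x), word_probs(seq_y)
    nx, ny = x.sum(), y.sum()            # number of k-mers per sequence
    xt, yt = x - nx * px, y - ny * py    # background-centered counts
    d2s = np.sum(xt * yt / np.sqrt(nx * px * ny * py))
    norm_x = np.sqrt(np.sum(xt ** 2 / (nx * px)))
    norm_y = np.sqrt(np.sum(yt ** 2 / (ny * py)))
    # 0 = compositionally similar, 1 = dissimilar.
    return 0.5 * (1 - d2s / (norm_x * norm_y))
```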

As longer k-mers (“words”) are considered, the number of possible words increases exponentially: with 4 possible letters raised to the power of k, a word length of 7 (4^7) yields 16,384 possible words. The length of the sequence must be sufficiently long, such that the frequency of each word is likely to be detected within the virus and host genomes being considered. If the k-mer is too long relative to the genome length, zeroes accrue in the k-mer frequency table, which leads to spurious distance values, a behavior previously posited by Edwards et al., 2016 [41]. Here we assessed the impact of this behavior on prediction performance, given that we could now explicitly compare k-mer profile distances of known infection and known non-infection cases. Although the signal for certain virus-host pairs may get stronger at higher k-lengths, this is not a universal trend [35] (Fig 4A). At k = 9, some known non-infection pairs have smaller distances than known infection pairs (Fig 4A, orange arrow), regardless of the distance metric used. This does not happen when using k-lengths of 3 or 6. Thus we do not recommend using k-mers larger than 6 for purposes of virus-host predictions.
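A quick back-of-envelope check illustrates this sparsity argument; the 50 kb genome length below is a hypothetical stand-in for a typical phage genome:

```python
# At k = 9 a ~50 kb genome cannot populate the 4^9 = 262,144 possible
# words, so most k-mer table entries are zero.
genome_length = 50_000  # hypothetical phage genome size
for k in (3, 6, 9):
    n_words = 4 ** k
    kmers_observed = genome_length - k + 1
    print(f"k={k}: {n_words:>7} possible words, "
          f"mean count per word ~ {kmers_observed / n_words:.3f}")
```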

Fig 4. Comparison of coevolution signals captured at the species taxonomic level.


A. Density plots of k-mer distances using k-lengths of 3, 6, and 9 between viruses and their hosts, colored by known interaction. B. Dot plot of %G+C difference bins against k-mer distance using d2*, colored by known interaction. Box and whisker plots represent the distributions of values for the x-axis and y-axis, colored by known interaction (middle bar is the median, boxes show the 25th and 75th percentiles, and whiskers show 1.5 times the interquartile range). C. Stacked bar plot showing the ratio of lab-verified infection to non-infection for %G+C difference bins. Numbers inside the bars note the number of events observed for each %G+C difference bin. Note that this number decreases sharply towards the extremes. D. Density plots of %G+C difference and k-mer distance between viruses and lab-tested hosts, colored by known interaction. The left plot uses k = 3 and the right plot uses k = 6.

While it is commonly recognized that the %G+C contents of viruses are typically very similar to those of their hosts [49], the %G+C differences between viruses and hosts (%G+C of virus minus %G+C of host) have also been previously recognized as a feature that could provide valuable information for virus-host predictions [41]. As %G+C difference is not currently considered by existing HPTs, we evaluated the potential of this signal to capture coevolutionary relationships. A decades-old study of 59 virus-host associations reported that viruses were on average 4% richer in AT [49]. This trend was attributed to the fact that G and C are energetically more expensive nucleotides to synthesize than A and T, that ATP is an abundant cellular molecule and more readily available for genome incorporation, and that there are more diverse pathways (and fewer metabolic bottlenecks) to synthesize A and T, as compared to, e.g., C [49]. In our set of 8849 virus-host interactions in VHRnet, we found a remarkably consistent result: viruses are on average 3.5% AT-richer relative to their hosts (Fig 4B, horizontal boxplot along top). We found that while the %G+C difference of a majority (68%) of the virus-host pairs falls within a narrow range of -4% to 4% (Fig 4C), the overall range of %G+C differences was quite broad, ranging from -40% to 32% (Fig 4C). These trends support that both the magnitude and direction of %G+C difference may be distinguishing features of virus-host infection pairs.

VHIP, a machine learning-based model, leverages the VHRnet dataset to predict virus-host interactions

Machine learning model approaches are well-suited for capturing relationships between data features and have been applied to leverage genome-derived signals of virus-host coevolution using limited host range data in previous studies [50,51]. We sought to leverage VHRnet virus-host sequence pairs and the features we identified to assess, develop, and evaluate the performance of different machine learning models for virus-host predictions. Inclusion of infection and non-infection data points in VHRnet allows us to train a model predicting either infection or non-infection without any assumptions about virus host range. Before training machine learning-based models, we evaluated the Pearson pairwise correlations between the virus-host genomic signals (the features used for the machine learning approach) (S5 Fig and S4 Table) to determine whether any of the features were strongly correlated. Strongly correlated features typically do not bring additional information for prediction and would increase the complexity of the ML model, which is generally avoided when designing sound ML models [52,53]. Of all our evaluated features, only the k-3 and k-6 features were strongly correlated (Pearson = 0.94, S4 Table). However, when comparing the k-3 distance against %G+C difference and the k-6 distance against %G+C difference, different patterns emerged (Fig 4D), suggesting that the k-3 and k-6 distances encode distinct signals that can be leveraged by machine learning model approaches.
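Such a correlation check is a one-liner with pandas; the file and column names below are stand-ins for the per-pair feature table (S2 Table), not the study's actual file:

```python
# Sketch of the pairwise Pearson correlation check between features;
# "features.csv" and its column names are illustrative stand-ins.
import pandas as pd

df = pd.read_csv("features.csv")  # columns: gc_diff, k3_dist, k6_dist, homology
print(df[["gc_diff", "k3_dist", "k6_dist", "homology"]].corr(method="pearson"))
```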

For a single feature to be strong enough to serve as a singular predictor of infection, there would need to be no overlap between infection and non-infection data points. This was not observed. For every feature, there was an overlap in the infection and non-infection density plots (Figs 4A and S3, diagonal plots). This strategy of identifying regions of feature overlap and regions of distinction (i.e., valuable non-redundant information encoded in the features; green arrows, Fig 4A) was also used to evaluate the power of combining features, such as the decision above to keep both k-3 and k-6 distances in the model (Fig 4D). Further, while virus-host genome amelioration was observed in the k-mer frequencies (resulting in low virus-host k-mer distances), we observed that viruses also have a strong tendency to remain AT-rich relative to their hosts (Fig 4B), consistent with prior reports [49].

We next evaluated the performance of different machine learning algorithms given our selected features (%G+C difference, k-3 distance, k-6 distance, homology hits). The Gradient Boosting Classifier and Random Forest performed best out of the classifiers we tested, with average accuracies of 87.5% and 88.3%, respectively (S6 Fig). However, the Gradient Boosting Classifier was used rather than the Random Forest, as it achieved comparable results with shallower trees, meaning fewer decision nodes were needed to reach comparable prediction performance. This is considered best practice for yielding the highest accuracy while not overfitting the model [52]. The best performing model was then assessed on the hold-out dataset. We bootstrapped this approach 100 times and tracked which set of hyperparameters yielded the highest accuracy on the hold-out set for each iteration (Fig 1E). Across the 100 best models from each bootstrapping iteration, the average area under the receiver operating characteristic curve (AUROC) value was 0.93 ± 0.004 (S7 Fig). AUROC is a metric used to evaluate the accuracy of classification models, whereby a value of 1 represents a model with perfect predictions, 0 represents only incorrect predictions, and 0.5 represents a model that makes random predictions.

We trained the machine learning model a final time using the entire dataset and the best hyperparameters determined from the performance analysis. The model can predict virus-host species interactions with 87.8% accuracy, defined as the number of correct predictions divided by the total number of predictions on the test set. To assess VHIP performance, we considered all possible prediction outcomes: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). False positives are commonly known as Type I errors and represent cases where the model predicted infection but the virus-host pair is a case of non-infection. False negatives are commonly known as Type II errors, where the model predicted non-infection when it should have predicted infection. The AUROC score for VHIP is 0.94 (Fig 5A); this curve plots the true positive rate (the model accurately predicting infection) against the false positive rate (Type I error). The F1 score for VHIP is 0.93 (Fig 5B), which is the harmonic mean of the precision (the number of true positive results divided by the number of all positive results, TP / (TP + FP)) and the recall (the number of true positive results divided by the number of all samples that should have been identified as positive, TP / (TP + FN)). The F1 score ignores the true negatives and may be misleading for unbalanced classes [54]. Finally, we computed the Matthews correlation coefficient (MCC), which considers all four possible outcomes (TP, TN, FP, and FN) (Fig 5C); VHIP scored 0.75 (where 1 represents a model that is perfectly accurate and 0 a model that is making random predictions).
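For reference, these scores follow the standard definitions:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},
```
```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
```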

Fig 5. Model performance and feature evaluation.


A. ROC curve of the final trained Gradient Boosting Classifier. The red dashed line represents a model that makes random predictions, whereas the gray line represents a model with 100% accuracy. The blue line shows the behavior of VHIP. B. F1 curve of the final trained Gradient Boosting Classifier. C. Confusion matrix showing all possible outcomes, counts, and percentages. D. Relative importance of each feature that VHIP used to predict virus-host interactions.

Interestingly, all the features used by the model contain information that can be leveraged for predictions, but based on feature importance determined by the model during the training phase, sequence composition features are the most useful for virus-host predictions (Fig 5D), with %G+C difference encoding the strongest signal the model leverages. Note that because sequence homology matches between viruses and hosts are rare, we combined instances of homology between viral genomes and CRISPR spacers and between viral genomes and putative host genomes into a single feature termed `homology`. The model is available through conda-forge and PyPI. The source code is available on GitHub.

To assess the effect of data provenance on model performance, we trained a machine learning model on a subsampled dataset containing 3159 data points, with an equal number from each source (i.e., NCBI, Nahant Collection, and the Staphylococcus study). The resulting model had worse performance metrics than VHIP (accuracy: 0.829, F1 score: 0.77, ROC: 0.89, and Matthews correlation: 0.639), which is expected, as less data was used during the training phase. This inferior model was then applied to the data unused during training, and it predicted interactions with a 91% accuracy rate. This may suggest that data provenance is not a significant driver of the accuracy of VHIP when trained on the entire dataset.

Comparing VHIP to existing host prediction tools is not straightforward. Existing HPTs aim to identify the host for a given virus (i.e., answering the question “what taxa does this virus infect?”), whereas VHIP was designed to answer a fundamentally distinct question: given a virus and a list of hosts, is infection predicted to occur or not for each virus-host pair? To enable comparison, we challenged HPTs against known virus-host pairs and recorded their top predictions. Since HPTs were trained on virus-host pairs extracted from the NCBI virus database, we used virus-host pair associations from the Nahant Collection and Staphylococcus study that were not used during training of VHIP, ensuring novel data points across all tools. For each virus-host pair evaluation, we queried VHIP’s infection prediction and checked whether the correct host was included in each HPT’s output. Out of 214 data points, VHIP correctly predicted infections 63.5% of the time. The accuracy of iPHoP, VHMnet, vHULK, and CHERRY at predicting correct hosts was 0.93%, 4.2%, 4.2%, and 2.3%, respectively.

Discussion

In this study, we make available the most comprehensive dataset of experimentally verified virus-microbe interactions from publicly available databases and literature. This holistic dataset allowed a reevaluation of assumptions about the prevalence of narrow versus broad viral host ranges. We assessed taxonomic biases in virus-microbe culturing and found that the lack of consistency in testing and reporting viral host range against a taxonomically diverse panel of hosts may perpetuate the notion that viruses are specialists. Further, recent technological advances and lab experiments are also challenging this conventional view of viral specialists, as well as revealing the diversity and complexity of virus-host interactions [17,18,55]. For instance, the Hi-C metagenomic pipeline, which links DNA based on physical proximity before sequencing, frequently results in highly nested networks where viruses’ host ranges can span broad taxa [56,57]. These metagenome-based findings are now routinely supported by host range experiments when a broader range of hosts is challenged [36–39,58]. Additional lab experiments testing viruses against diverse host panels and consistent reporting are necessary to ascertain the relative specificity versus breadth of viral host ranges.

Importantly, the dataset of experimentally verified virus-host microbe interactions compiled in this study allowed for the development of a new resource, VHIP, that deviates from existing virus-microbe prediction tools and opens a new avenue to the study of virus-host interaction networks. VHIP is distinct by design in that it predicts infection/non-infection for any given virus-host pair. This approach has multiple benefits. First, VHIP takes both viral and putative host sequences as input, allowing a user to consider all viral and cellular populations recovered from a sampled community (Fig 6A). Second, VHIP may predict a virus to infect multiple different hosts, more accurately reflecting the nature of viral host ranges. Finally, VHIP can resolve complete virus-host interaction networks, which is only possible if a model can explicitly predict both positive and negative relationships between viruses and their potential hosts. Owing to these central design differences, it is impossible to fully compare the accuracy of VHIP with that of existing HPTs. This is because HPTs are designed to return a single host or a list of predicted hosts, and their accuracy calculation considers only the highest score. This is not a problem when the tools are trained on a dataset where there is only one known host, but this limitation becomes an issue when predicting hosts for a novel virus, since the predictions are limited to the pool of taxa on which those models were trained. For VHIP, every virus-microbe pair combination is considered, such that the accuracy is defined by the ability of VHIP to accurately infer both infection and non-infection interactions.

Fig 6. Model output and future directions.


A. VHIP predicts interactions between all possible virus-host combinations from the input sequences of viruses and potential hosts. Because all possible pairs are considered, the VHIP predictions can be visualized as a bipartite network of virus-host interactions. B. Additional information describing the nodes (biological entities) and edges (interactions) can be mapped onto the network. C. Virus-host population dynamics can be coded and visualized in the nodes and edges. D. Community structure can be studied by considering the entire set of predicted interactions. E. Edges in the network can be thought of as different possible routes for horizontal gene transfer to occur. F. Comparing networks across space, time, and/or environmental gradients can help reveal how virus-host interactions are structured and how their underlying processes manifest at ecosystem scales.

The underlying assumption of predicting virus-microbe associations by leveraging genomic signals is that those signals are similar regardless of virus or host taxonomic assignment and/or environmental conditions (e.g., viral adaptation at the genome scale is similar for viruses in the human gut microbiome and viruses in the oceans). If this assumption is violated, then one must be careful when interpreting predictions from sequence-based tools. To assess this issue, we used a subset of VHRnet containing a smaller number of virus-microbe pairs, with an equal number of data points from each data source (NCBI, Nahant Collection, and Staphylococcus). This model was then applied to the data not used for the training/testing phase of the model and obtained an 89% accuracy rate at predicting infection and non-infection events. Furthermore, when using the full dataset, VHRnet was divided into two sets: a set for the training/testing phase of the machine learning model, and a hold-out set to assess model performance. Because variation can arise from how the data is divided between the two sets, this pipeline was bootstrapped 100 times. For each iteration, there was very little difference in the performance of VHIP (S7 and S8 Figs).

The perspective shift from predicting the most likely taxa a virus can infect to considering all possible virus-host pairs is necessary to resolve virus-host interaction networks (Fig 6A). Interaction networks are mathematical objects that capture and quantify the multitude of potential interactions between species, which provide a common framework for investigations across scales (Fig 6). In such networks, the nodes represent viral and host populations, and an edge connects a virus to a host if it is predicted to infect it. Additional data, sequence or otherwise, can be depicted with networks. Edges can be colored by properties of the interaction that depend on the unique combination of a given virus and host (e.g., viral fitness on a given host, whether infection is lysogenic or lytic) or by the phenotypic properties of the infected cell (virocell) during infection (Fig 6B). Population sizes could be represented by scaling the sizes of nodes, and frequency of infection could be encoded in the width of the edges (Fig 6C). By considering all possible virus-host pairs, such networks can be used to better understand microbial predator-prey interaction patterns and tease apart the underlying processes occurring across scales [51,59,60] (Fig 6D), inspiring hypotheses linking such structures and ecosystem properties. Further, virus-host networks explicitly model possible routes of infection-mediated horizontal gene transfer, facilitating the study of gene flow at both population and community scales (Fig 6E). At ecosystem scales, the predicted virus-host infection networks allow us to move beyond ecogenomics as the study of diversity across gradients and instead study how ecological interactions are structured across physicochemical, temporal, and spatial gradients. This shift allows for the integration of multilayer network theory into microbial ecology and opens new opportunities to study ecological complexity [59] (Fig 6F).
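As a sketch of how such a bipartite network could be assembled from VHIP-style per-pair calls, consider the following; the prediction triples are hypothetical and the networkx usage is illustrative rather than VHIP's actual output format:

```python
# Illustrative sketch: assembling per-pair infection predictions into a
# bipartite virus-host network with networkx. The triples below are
# hypothetical stand-ins for VHIP output.
import networkx as nx

predictions = [
    ("virus_A", "host_1", "infection"),
    ("virus_A", "host_2", "non-infection"),
    ("virus_B", "host_1", "infection"),
]

G = nx.Graph()
for virus, host, call in predictions:
    G.add_node(virus, bipartite="virus")
    G.add_node(host, bipartite="host")
    if call == "infection":
        G.add_edge(virus, host)  # an edge means predicted infection

# Node degree then reads as predicted host range (for viruses)
# or susceptibility breadth (for hosts).
for node, data in G.nodes(data=True):
    if data["bipartite"] == "virus":
        print(node, "predicted host range:", G.degree(node))
```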

The application of network analyses permeates studies in ecology and evolution [61,62] and contributes to the understanding of community assembly [60,63,64], robustness and resilience [64,65], and species coexistence [66] across biological disciplines. Yet, the application of network thinking to the study of virus-microbe interactions is still in its infancy. The tool developed and presented here represents an important step, with the power to leverage metagenomic data to answer the question “who infects whom?” from uncultured sequence data, and supports new avenues to apply network thinking to viral ecology.

Methods

Collection of host range data and associated sequences

GenBank formatted viral genome files (n = 2621) were downloaded from the NCBI RefSeq database (Aug. 2018). Virus-host pairs were retrieved from two sources: (1) metadata fields in the viral genome GenBank file under the `host` or `lab_host` description, and (2) a literature search for reported host range data (S1 Table). For the first source of host data, the `lab_host` tag of each of the 2621 viral GenBank files was used to associate the virus with the host used in the sequencing project. If available, the host strain genome was downloaded from RefSeq (S5 Table). If a genome sequence for the host strain was not available, but a genome of the host species was sequenced, a representative genome of the host species was randomly chosen and downloaded from RefSeq (S5 Table). For the second source of host range data, the ‘Title’, ‘Journal’, and ‘Author’ tags of each viral GenBank file were used to identify primary journal articles (S1 Table). Lab-verified infection and non-infection data for the sequenced viruses were recorded from the identified reference articles. Additional studies that reported host range data for the sequenced viruses were identified via manual literature searches of the virus name. The data was compiled into a single file, named VHRnet for Virus-Host Range network (S2 Table).
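A minimal sketch of this retrieval step, assuming Biopython and a local folder of GenBank records; the directory path and the resulting pair list are illustrative, not the study's actual script:

```python
# Sketch of extracting virus-host pairings from GenBank source-feature
# qualifiers (/lab_host or /host) with Biopython. "genomes/*.gb" is an
# illustrative path for locally downloaded RefSeq records.
from pathlib import Path
from Bio import SeqIO

pairs = []
for gb_file in Path("genomes").glob("*.gb"):
    for record in SeqIO.parse(str(gb_file), "genbank"):
        for feature in record.features:
            if feature.type != "source":
                continue
            hosts = (feature.qualifiers.get("lab_host", [])
                     + feature.qualifiers.get("host", []))
            for host in hosts:
                pairs.append((record.id, host))

for virus_id, host in pairs[:5]:
    print(virus_id, "->", host)
```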

Comparison of existing virus-host prediction tools on complete host range experiments

The viral sequences belonging to the Staphylococcus study and the Nahant Collection study were given as input to the following prediction tools using their default settings: VirHostMatcher-Net (July 2021 version) [30], vHULK (v1.0.0) [31], CHERRY (v1.0.0) [27], iPHoP (v1.2.0) [29], and RaFAH (v0.3) [28]. To evaluate the accuracy of these tools, the highest-scoring prediction from each tool was considered. There are three possible outcomes: the HPT correctly predicted a species that the virus can infect, the HPT predicted a species that the virus cannot infect, or the HPT predicted a host that was not tested experimentally. In cases where a virus could infect multiple different hosts, a tool only had to predict at least one of the known hosts to be considered 100% accurate. iPHoP differs from the other HPTs in that it can return 0, 1, or multiple predicted hosts for a given virus; we therefore considered separately the best prediction by iPHoP and the full set of hosts predicted by iPHoP when evaluating its performance on the complete host range studies. RaFAH can only return predictions at the genus level, so its accuracy at the species level could not be assessed. To calculate the accuracy of these tools at higher taxonomic levels, we considered the taxonomy of the highest-scoring predicted species. For example, if a tool predicted E. coli as the most likely host, to determine the accuracy at the phylum level, the phyla of the known hosts were compared to the phylum of E. coli.
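A sketch of this roll-up logic is shown below; the two-entry lineage table is hypothetical and would in practice come from NCBI Taxonomy:

```python
# Sketch of scoring a species-level prediction at a higher taxonomic
# rank; LINEAGE is a hypothetical stand-in for an NCBI Taxonomy lookup.
LINEAGE = {
    "Escherichia coli": {
        "genus": "Escherichia", "family": "Enterobacteriaceae",
        "order": "Enterobacterales", "class": "Gammaproteobacteria",
        "phylum": "Pseudomonadota"},
    "Salmonella enterica": {
        "genus": "Salmonella", "family": "Enterobacteriaceae",
        "order": "Enterobacterales", "class": "Gammaproteobacteria",
        "phylum": "Pseudomonadota"},
}

def correct_at_rank(predicted_species, known_hosts, rank):
    """True if the prediction shares the given rank with any known host."""
    pred = LINEAGE[predicted_species][rank]
    return any(LINEAGE[h][rank] == pred for h in known_hosts)

# A species-level miss can still count as correct at the family level:
print(correct_at_rank("Escherichia coli", ["Salmonella enterica"], "family"))
```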

Evaluation of commonly used features for virus-host predictions

The most commonly used features are sequence composition (i.e., how similar the patterns of k-mer frequencies are between the viruses and their hosts) and sequence homology (i.e., the presence of a DNA match between a virus and a host) (Fig 1C and 1D). The %G+C content was calculated for all the viral and host genomes using a custom Python script. The difference in %G+C content between viral and host genomes was defined as: viral %G+C minus host %G+C. A custom Python script was used to generate k-mer profiles for the viruses and hosts using k-lengths of 3, 6, and 9, and to calculate similarities for each virus-host pair using the d2* distance metric and the Euclidean distance metric.
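A minimal sketch of the %G+C difference feature as defined above; the helper names are ours, not the study's script, and ambiguous bases are simply excluded from the denominator:

```python
# Sketch of the %G+C difference feature (viral %G+C minus host %G+C).
def percent_gc(seq):
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")  # ignore ambiguous bases
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt

def gc_difference(virus_seq, host_seq):
    # Negative values mean the virus is AT-richer than its host, the
    # tendency reported for most infection pairs in VHRnet.
    return percent_gc(virus_seq) - percent_gc(host_seq)
```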

Sequence homology, a stretch of DNA that matches between a virus and a host, was used to identify evidence of prior infection (e.g., in the form of remnant integrated prophages, horizontal gene transfer events, or CRISPR spacers). A BLASTn was run between all viral genomes belonging to VHRnet and the NCBI RefSeq (Aug. 2018) sequence database for bacteria and archaea (minimum identity percentage 80 and minimum length 500 bp). CRISPR spacers were identified by running the CRISPRCasFinder tool (v4.2.20) on all sequences of the NCBI RefSeq sequence database with the following settings: -keepAll -log -cas -ccvRep -rcfowce -getSummaryCasfinder -prokka -meta. Spacer sequences were extracted from the CRISPRCasFinder output. Since spacers are typically 30 to 35 nucleotides long, a BLASTn with the short-sequence task setting was performed between viruses and spacers. Only virus-spacer hits with no mismatches were kept for the CRISPR feature. A Pearson pairwise correlation was performed to assess correlations between features (S4 Table).
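A sketch of the downstream filtering, assuming the BLASTn runs produced tabular output (-outfmt 6, whose default columns include percent identity, alignment length, and mismatches); the file names are illustrative:

```python
# Sketch of filtering BLASTn tabular output for the two homology
# features: HGT-like hits (>=80% identity over >=500 bp) and exact
# CRISPR spacer matches (zero mismatches). File names are stand-ins.
def parse_blast_outfmt6(path):
    """Yield one dict per hit from default -outfmt 6 columns."""
    with open(path) as fh:
        for line in fh:
            f = line.rstrip("\n").split("\t")
            yield {"query": f[0], "subject": f[1], "pident": float(f[2]),
                   "length": int(f[3]), "mismatch": int(f[4])}

hgt_hits = [h for h in parse_blast_outfmt6("virus_vs_host.tsv")
            if h["pident"] >= 80 and h["length"] >= 500]

spacer_hits = [h for h in parse_blast_outfmt6("virus_vs_spacers.tsv")
               if h["mismatch"] == 0]
```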

Comparison of machine learning classifiers using the VHRnet dataset

The signals of coevolution, in combination with the knowledge of infection/non-infection for each pair, constitute the input needed to explore machine learning model approaches. The input data was first randomly shuffled to ensure that any intrinsic non-random ordering of the data was removed and thus would not influence machine learning behavior. In addition, because the ratio of non-infection to infection in VHRnet is imbalanced (68.7% to 31.3%), the host range data was first downsampled to reach a 60/40 ratio of non-infection to infection. Different machine learning classifiers were tested using the scikit-learn package (v1.3) in Python (v3.10), namely AdaBoost, Gradient Boosting Classifier, K-Neighbors Classifier, Random Forest, Stochastic Gradient Descent, and Support Vector Classifier (SVC). Each machine learning model has different settings, herein referred to as hyperparameters, that control the learning process during the training phase. The combination of hyperparameters that results in a robust model depends on both the training dataset and the features being considered. To determine the best performing machine learning model given the study design for virus-host predictions, we performed a broad grid search (using the GridSearchCV module from scikit-learn) to explore different combinations of hyperparameters.

During the grid search, 70% of the input data was used as the training/testing set and the remaining 30% was kept as a hold-out set. A shuffle split (n = 10) was used to divide the training/testing set into 10 splits, where 9 splits are used to train the model and the remaining one is used to assess the performance of the model. This is repeated until each split has been used as the test set. Once the best performing set of hyperparameters was determined using the training/testing set, it was evaluated on the hold-out set (Fig 1E). This entire process, including the downsampling of non-infection data to obtain a 60/40 ratio of infection to non-infection events, was bootstrapped 50 times for each type of machine learning classifier, except for SVC, for which a single grid search was performed due to the runtime required. Code and analysis of the grid search are available at: https://github.com/DuhaimeLab/VHIP_analyses_Bastien_et_al_2023
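A condensed sketch of this scheme in scikit-learn is shown below; the stand-in data and parameter grid are illustrative, not the study's actual feature table or grids:

```python
# Condensed sketch of the grid search scheme: 70/30 work/hold-out split,
# a 10-way shuffle split over the work set, and a hyperparameter grid.
# X and y are random stand-ins for the four features and labels.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))      # stand-in: %G+C diff, k-3, k-6, homology
y = rng.integers(0, 2, size=500)   # stand-in: 1 = infection, 0 = non-infection

# 70% for the training/testing grid search, 30% held out.
X_work, X_hold, y_work, y_hold = train_test_split(X, y, test_size=0.3,
                                                  shuffle=True)

cv = ShuffleSplit(n_splits=10, test_size=0.1)
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"max_depth": [5, 10, 15], "learning_rate": [0.1, 0.5, 0.75]},
    cv=cv,
)
search.fit(X_work, y_work)

# The best hyperparameters are then evaluated on the untouched hold-out set.
print(search.best_params_, search.score(X_hold, y_hold))
```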

Training, testing, and evaluation of VHIP

From the previous analysis, we determined that the Gradient Boosting Classifier performed best and is therefore the most appropriate model for virus-host predictions given our study design. We ran a more exhaustive grid search. During the grid search, the data was again downsampled to reach a 60/40 ratio of non-infection to infection. For each iteration (n = 100), 70% of the host range data was used for training/testing of the model, and the remaining 30% was kept as a hold-out set to evaluate the best set of hyperparameters for that iteration (Fig 1E). In addition, when assessing the best combination of hyperparameters, the AUROC (S7 Fig) and F1 score (S8 Fig) were also computed to assess model performance and consistency across iterations. From this pipeline, we determined that the best combination of hyperparameters is: max_depth = 15, learning_rate = 0.75, and loss = exponential. Finally, the model was trained one more time using a shuffle split (n = 10), where 70% of the data was used for training and the remaining 30% for testing the model. The ROC, F1, and MCC scores were calculated using functions provided by the scikit-learn metrics module. Finally, the accuracy of the model was calculated as (TP + TN) / (TP + TN + FP + FN). VHIP is available through conda-forge and PyPI. The source code is available at: https://github.com/DuhaimeLab/VirusHostInteractionPredictor.
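A sketch of this final training and scoring step with the reported hyperparameters; the random stand-in data is for illustration only:

```python
# Sketch of final training/evaluation with the reported hyperparameters
# (max_depth=15, learning_rate=0.75, loss="exponential"); X and y are
# random stand-ins, as in the previous sketch.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = GradientBoostingClassifier(max_depth=15, learning_rate=0.75,
                                   loss="exponential")
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))  # (TP+TN)/(TP+TN+FP+FN)
print("F1:      ", f1_score(y_test, y_pred))
print("MCC:     ", matthews_corrcoef(y_test, y_pred))
print("AUROC:   ", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```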

To assess the effect of data provenance on our machine learning model, the data was subsampled such that there were equal numbers of data points from each source (n = 3159). Because the subsampling can introduce randomness, this pipeline was bootstrapped 100 times. The accuracy, AUROC, F1 score, and Matthews correlation were computed for each iteration of the model.

To compare VHIP’s prediction ability to that of existing tools, we evaluated each tool’s ability to recover known infection virus-host pairs. For this assessment, only pairs from the Nahant Collection and Staphylococcus study in the test set were considered, to ensure novel data points across all tools. For each virus-host pair evaluation, we queried VHIP’s prediction and checked whether the correct host was included in each HPT’s output.

Supporting information

S1 Fig. Distribution of all family classifications of viruses in VHRnet.

Lighter transparency represents the proportion of non-infection reports by viral family, relative to the solid portion, which represents known infection reports by viral family.

(TIF)

S2 Fig. Distribution of all host genera represented in VHRnet.

Lighter color transparency represents the proportion of non-infection relative to infection (solid color).

(TIF)

S3 Fig. Number of viruses tested against different host taxa.

The x-axis represents the number of viruses that have been tested.

(TIF)

S4 Fig. Kernel density of distance measurements for each virus-host pair, colored by interaction (yellow for infection and blue for non-infection).

The top row uses the Euclidean distance to compute the similarity between the k-mer profiles of the virus and its host, while the second row uses the d2* distance metric. Each column represents a different length of k-mer used to create the k-mer profiles (k-length of 3 versus 6 versus 9). The d2* distance metric is a more appropriate metric than the Euclidean distance metric for the purpose of virus-host prediction since it encodes some evolutionary signals (the peaks for the non-infection and infection classes are separated).

(TIF)

S5 Fig. Feature distributions (diagonal plots) and pairwise correlations (all other plots).

(TIF)

S6 Fig. Comparison of different machine learning models on VHRnet.

For each type of machine learning model, a grid search was performed to determine the best combinations of parameters. This plot shows the accuracy of the best performing model. This was bootstrapped 50 times (except for SVM since the fit algorithm is O(n^2)).

(TIF)

S7 Fig. ROC curves from 100 bootstrapping iterations of the best model trained during the grid search using best hyperparameters.

(TIF)

S8 Fig. F1 curve of 100 best hyperparameter combinations during the grid search.

(TIF)

S1 Table. Compilation of NCBI accession numbers of lab-tested viral host range and their respective DOI.

Submitted as an excel spreadsheet.

(XLSX)

S2 Table. Machine learning model input.

Each row contains an experimentally tested virus-host pair, their known interaction, and the signals of coevolution computed from their genomic sequences. Submitted as an excel spreadsheet.

(CSV)

S3 Table. Comparison between input and output of existing host-prediction tools.

(XLSX)

S4 Table. Pearson pairwise correlations of features that went into VHIP.

Higher value means the features are more strongly correlated.

(XLSX)

S5 Table. List of NCBI accession numbers for viral and host sequences used in this study.

Submitted as an excel spreadsheet.

(CSV)


Acknowledgments

We thank K. Shedden, B. Hegarty, and M. Moreno for helpful discussions and suggestions and Geoffrey Hannigan for early discussions and collaboration that inspired the hunt for more data.

Data Availability

All relevant data are within the manuscript and its Supporting Information files. Code written for analyses and figures generated as part of this manuscript is made available on GitHub (https://github.com/DuhaimeLab/VHIP_analyses_Bastien_et_al_2023). The tool VHIP, described in the manuscript, is made available as a Python package through conda-forge and PyPI. The source code is made public on GitHub (https://github.com/DuhaimeLab/VirusHostInteractionPredictor).

Funding Statement

This study is based upon work supported by the National Science Foundation under Grant No. 2055455 awarded to MBD and LZ and 1813069 awarded to LZ and by funding to MBD through the National Oceanic and Atmospheric Administration Great Lakes Omics program distributed through the University of Michigan Cooperative Institute for Great Lakes Research NA17OAR4320152. This is CIGLR contribution number 1250. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Gilbert JA, Dupont CL. Microbial Metagenomics: Beyond the Genome. Annual Review of Marine Science. 2011;3(1):347–71. doi: 10.1146/annurev-marine-120709-142811
2. Garza DR, Dutilh BE. From cultured to uncultured genome sequences: metagenomics and modeling microbial ecosystems. Cell Mol Life Sci. 2015 Nov 1;72(22):4287–308. doi: 10.1007/s00018-015-2004-1
3. Gregory AC, Zayed AA, Conceição-Neto N, Temperton B, Bolduc B, Alberti A, et al. Marine DNA Viral Macro- and Microdiversity from Pole to Pole. Cell. 2019 May 16;177(5):1109–1123.e14. doi: 10.1016/j.cell.2019.03.040
4. Coutinho FH, Silveira CB, Gregoracci GB, Thompson CC, Edwards RA, Brussaard CPD, et al. Marine viruses discovered via metagenomics shed light on viral strategies throughout the oceans. Nat Commun. 2017 Jul 5;8(1):1–12.
5. Benler S, Yutin N, Antipov D, Rayko M, Shmakov S, Gussow AB, et al. Thousands of previously unknown phages discovered in whole-community human gut metagenomes. Microbiome. 2021 Mar 29;9(1):78. doi: 10.1186/s40168-021-01017-w
6. Mokili JL, Rohwer F, Dutilh BE. Metagenomics and future perspectives in virus discovery. Current Opinion in Virology. 2012 Feb 1;2(1):63–77. doi: 10.1016/j.coviro.2011.12.004
7. Rosario K, Breitbart M. Exploring the viral world through metagenomics. Current Opinion in Virology. 2011 Oct 1;1(4):289–97. doi: 10.1016/j.coviro.2011.06.004
8. Brum JR, Ignacio-Espinoza JC, Roux S, Doulcier G, Acinas SG, Alberti A, et al. Patterns and ecological drivers of ocean viral communities. Science. 2015 May 22;348(6237). Available from: https://science.sciencemag.org/content/348/6237/1261498
9. Roux S, Emerson JB, Eloe-Fadrosh EA, Sullivan MB. Benchmarking viromics: an in silico evaluation of metagenome-enabled estimates of viral community composition and diversity. PeerJ. 2017 Sep 21;5:e3817. doi: 10.7717/peerj.3817
10. Camargo AP, Nayfach S, Chen IMA, Palaniappan K, Ratner A, Chu K, et al. IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata. Nucleic Acids Res. 2023 Jan 6;51(D1):D733–43. doi: 10.1093/nar/gkac1037
11. Roux S, Adriaenssens EM, Dutilh BE, Koonin EV, Kropinski AM, Krupovic M, et al. Minimum Information about an Uncultivated Virus Genome (MIUViG). Nat Biotechnol. 2019 Jan;37(1):29–37. doi: 10.1038/nbt.4306
12. Breitbart M. Marine Viruses: Truth or Dare. Annu Rev Mar Sci. 2011 Dec 12;4(1):425–48.
13. Gayder S, Parcey M, Nesbitt D, Castle AJ, Svircev AM. Population Dynamics between Erwinia amylovora, Pantoea agglomerans and Bacteriophages: Exploiting Synergy and Competition to Improve Phage Cocktail Efficacy. Microorganisms. 2020 Sep;8(9):1449. doi: 10.3390/microorganisms8091449
14. Maslov S, Sneppen K. Population cycles and species diversity in dynamic Kill-the-Winner model of microbial ecosystems. Scientific Reports. 2017 Jan 4;7(1):39642. doi: 10.1038/srep39642
15. McDaniel LD, Young E, Delaney J, Ruhnau F, Ritchie KB, Paul JH. High Frequency of Horizontal Gene Transfer in the Oceans. Science. 2010 Oct 1;330(6000):50. doi: 10.1126/science.1192243
16. Soucy SM, Huang J, Gogarten JP. Horizontal gene transfer: building the web of life. Nat Rev Genet. 2015 Aug;16(8):472–82. doi: 10.1038/nrg3962
17. Forterre P. The virocell concept and environmental microbiology. ISME J. 2013 Feb;7(2):233–6. doi: 10.1038/ismej.2012.110
18. Howard-Varona C, Lindback MM, Bastien GE, Solonenko N, Zayed AA, Jang H, et al. Phage-specific metabolic reprogramming of virocells. ISME J. 2020 Apr;14(4):881–95. doi: 10.1038/s41396-019-0580-z
19. Zimmerman AE, Howard-Varona C, Needham DM, John SG, Worden AZ, Sullivan MB, et al. Metabolic and biogeochemical consequences of viral infection in aquatic ecosystems. Nat Rev Microbiol. 2020 Jan;18(1):21–34. doi: 10.1038/s41579-019-0270-x
20. Enav H, Mandel-Gutfreund Y, Béjà O. Comparative metagenomic analyses reveal viral-induced shifts of host metabolism towards nucleotide biosynthesis. Microbiome. 2014 Mar 26;2(1):9. doi: 10.1186/2049-2618-2-9
21. Coclet C, Roux S. Global overview and major challenges of host prediction methods for uncultivated phages. Current Opinion in Virology. 2021 Aug 1;49:117–26. doi: 10.1016/j.coviro.2021.05.003
22. Duhaime MB, Kottmann R, Field D, Glöckner FO. Enriching public descriptions of marine phages using the Genomic Standards Consortium MIGS standard. Stand in Genomic Sci. 2011 Mar 1;4(2):271–85. doi: 10.4056/sigs.621069
23. Albery GF, Eskew EA, Ross N, Olival KJ. Predicting the global mammalian viral sharing network using phylogeography. Nat Commun. 2020 May 8;11(1):2260. doi: 10.1038/s41467-020-16153-4
24. Albery GF, Becker DJ, Brierley L, Brook CE, Christofferson RC, Cohen LE, et al. The science of the host–virus network. Nat Microbiol. 2021 Dec;6(12):1483–92. doi: 10.1038/s41564-021-00999-5
25. Poisot T, Ouellet MA, Mollentze N, Farrell MJ, Becker DJ, Brierley L, et al. Network embedding unveils the hidden interactions in the mammalian virome. Patterns. 2023 Jun;4(6):100738. doi: 10.1016/j.patter.2023.100738
26. Meyer KM, Memiaghe H, Korte L, Kenfack D, Alonso A, Bohannan BJM. Why do microbes exhibit weak biogeographic patterns? ISME J. 2018 Jun;12(6):1404–13. doi: 10.1038/s41396-018-0103-3
27. Shang J, Sun Y. CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model. Briefings in Bioinformatics. 2022 May 21;bbac182.
28. Coutinho FH, Zaragoza-Solas A, López-Pérez M, Barylski J, Zielezinski A, Dutilh BE, et al. RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content. Patterns. 2021 Jul 9;2(7):100274. doi: 10.1016/j.patter.2021.100274
29. Roux S, Camargo AP, Coutinho FH, Dabdoub SM, Dutilh BE, Nayfach S, et al. iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLOS Biology. 2023 Apr 21;21(4):e3002083. doi: 10.1371/journal.pbio.3002083
  • 30.Wang W, Ren J, Tang K, Dart E, Ignacio-Espinoza JC, Fuhrman JA, et al. A network-based integrated framework for predicting virus–prokaryote interactions. NAR Genom Bioinform [Internet]. 2020. Jun 1 [cited 2020 Jun 25];2(2). Available from: https://academic.oup.com/nargab/article/2/2/lqaa044/5861484 doi: 10.1093/nargab/lqaa044 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Amgarten D, Iha BKV, Piroupo CM, da Silva AM, Setubal JC. vHULK, a New Tool for Bacteriophage Host Prediction Based on Annotated Genomic Features and Neural Networks. PHAGE [Internet]. 2022. Aug 25 [cited 2022 Oct 4]; Available from: https://www.liebertpub.com/doi/full/10.1089/phage.2021.0016 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Esposito LA, Gupta S, Streiter F, Prasad A, Dennehy JJ. Evolutionary interpretations of mycobacteriophage biodiversity and host-range through the analysis of codon usage bias. Microbial Genomics. 2016;2(10):e000079. doi: 10.1099/mgen.0.000079 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Lucks JB, Nelson DR, Kudla GR, Plotkin JB. Genome Landscapes and Bacteriophage Codon Usage. PLOS Computational Biology. 2008. Feb 29;4(2):e1000001. doi: 10.1371/journal.pcbi.1000001 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Edwards RA, McNair K, Faust K, Raes J, Dutilh BE. Computational approaches to predict bacteriophage–host relationships. FEMS Microbiol Rev. 2016. Mar 1;40(2):258–72. doi: 10.1093/femsre/fuv048 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Ahlgren NA, Ren J, Lu YY, Fuhrman JA, Sun F. Alignment-free $d_2^*$ oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res. 2017. Jan 9;45(1):39–53. doi: 10.1093/nar/gkw1002 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Kauffman KM, Hussain FA, Yang J, Arevalo P, Brown JM, Chang WK, et al. A major lineage of non-tailed dsDNA viruses as unrecognized killers of marine bacteria. Nature. 2018. Feb;554(7690):118–22. doi: 10.1038/nature25474 [DOI] [PubMed] [Google Scholar]
  • 37.Malki K, Kula A, Bruder K, Sible E, Hatzopoulos T, Steidel S, et al. Bacteriophages isolated from Lake Michigan demonstrate broad host-range across several bacterial phyla. Virology Journal. 2015. Oct 9;12(1):164. doi: 10.1186/s12985-015-0395-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Beumer A, Robinson JB. A Broad-Host-Range, Generalized Transducing Phage (SN-T) Acquires 16S rRNA Genes from Different Genera of Bacteria. Applied and Environmental Microbiology [Internet]. 2005. Dec [cited 2022 Jan 11]; Available from: https://journals.asm.org/doi/abs/10.1128/AEM.71.12.8301-8304.2005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Feng X, Yan W, Wang A, Ma R, Chen X, Lin TH, et al. A Novel Broad Host Range Phage Infecting Alteromonas. Viruses. 2021. Jun;13(6):987. doi: 10.3390/v13060987 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Göller PC, Elsener T, Lorgé D, Radulovic N, Bernardi V, Naumann A, et al. Multi-species host range of staphylococcal phages isolated from wastewater. Nat Commun. 2021. Nov 29;12(1):6965. doi: 10.1038/s41467-021-27037-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Krishnamurthy SR, Wang D. Origins and challenges of viral dark matter. Virus Research. 2017. Jul 15;239:136–42. doi: 10.1016/j.virusres.2017.02.002 [DOI] [PubMed] [Google Scholar]
  • 42.Pires DP, Costa AR, Pinto G, Meneses L, Azeredo J. Current challenges and future opportunities of phage therapy. FEMS Microbiology Reviews. 2020. Nov 24;44(6):684–700. doi: 10.1093/femsre/fuaa017 [DOI] [PubMed] [Google Scholar]
  • 43.Burstein D, Sun CL, Brown CT, Sharon I, Anantharaman K, Probst AJ, et al. Major bacterial lineages are essentially devoid of CRISPR-Cas viral defence systems. Nat Commun. 2016. Feb 3;7(1):10613. doi: 10.1038/ncomms10613 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 44.Breitbart M, Bonnain C, Malki K, Sawaya NA. Phage puppet masters of the marine microbial realm. Nature Microbiology. 2018. Jul;3(7):754–66. doi: 10.1038/s41564-018-0166-y [DOI] [PubMed] [Google Scholar]
  • 45.Carbone A. Codon Bias is a Major Factor Explaining Phage Evolution in Translationally Biased Hosts. J Mol Evol. 2008. Mar 1;66(3):210–23. doi: 10.1007/s00239-008-9068-6 [DOI] [PubMed] [Google Scholar]
  • 46.Pride DT, Wassenaar TM, Ghose C, Blaser MJ. Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC Genomics. 2006. Jan 18;7(1):8. doi: 10.1186/1471-2164-7-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Lawrence JG, Ochman H. Amelioration of Bacterial Genomes: Rates of Change and Exchange. J Mol Evol. 1997. Apr 1;44(4):383–97. doi: 10.1007/pl00006158 [DOI] [PubMed] [Google Scholar]
  • 48.Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F. Alignment-Free Sequence Comparison Based on Next-Generation Sequencing Reads. Journal of Computational Biology. 2013. Feb;20(2):64–79. doi: 10.1089/cmb.2012.0228 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Rocha EPC, Danchin A. Base composition bias might result from competition for metabolic resources. Trends in Genetics. 2002. Jun 1;18(6):291–4. doi: 10.1016/S0168-9525(02)02690-2 [DOI] [PubMed] [Google Scholar]
  • 50.Hannigan GD, Duhaime MB, Ruffin MT, Koumpouras CC, Schloss PD. Diagnostic Potential and Interactive Dynamics of the Colorectal Cancer Virome. mBio [Internet]. 2018. Dec 21 [cited 2020 Nov 4];9(6). Available from: https://mbio.asm.org/content/9/6/e02248-18 doi: 10.1128/mBio.02248-18 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Hannigan GD, Duhaime MB, Koutra D, Schloss PD. Biogeography and environmental conditions shape bacteriophage-bacteria networks across the human microbiome. PLOS Computational Biology. 2018. Apr 18;14(4):e1006099. doi: 10.1371/journal.pcbi.1006099 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Topçuoğlu BD, Lesniak NA, Ruffin MT, Wiens J, Schloss PD. A Framework for Effective Application of Machine Learning to Microbiome-Based Classification Problems. mBio [Internet]. 2020. Jun 30 [cited 2020 Jun 10];11(3). Available from: https://mbio.asm.org/content/11/3/e00434-20 doi: 10.1128/mBio.00434-20 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Cai J, Luo J, Wang S, Yang S. Feature selection in machine learning: A new perspective. Neurocomputing. 2018. Jul 26;300:70–9. [Google Scholar]
  • 54.Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020. Jan 2;21(1):6. doi: 10.1186/s12864-019-6413-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Hobbs Z, Abedon ST. Diversity of phage infection types and associated terminology: the problem with ‘Lytic or lysogenic.’ FEMS Microbiology Letters. 2016. Apr 1;363(7):fnw047. [DOI] [PubMed] [Google Scholar]
  • 56.Hwang Y, Roux S, Coclet C, Krause SJE, Girguis PR. Viruses interact with hosts that span distantly related microbial domains in dense hydrothermal mats. Nat Microbiol. 2023. May;8(5):946–57. doi: 10.1038/s41564-023-01347-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Cheng Z, Li X, Palomo A, Yang Q, Han L, Wu Z, et al. Virus impacted community adaptation in oligotrophic groundwater environment revealed by Hi-C coupled metagenomic and viromic study. Journal of Hazardous Materials. 2023. Sep 15;458:131944. doi: 10.1016/j.jhazmat.2023.131944 [DOI] [PubMed] [Google Scholar]
  • 58.Sakowski EG, Arora-Williams K, Tian F, Zayed AA, Zablocki O, Sullivan MB, et al. Interaction dynamics and virus–host range for estuarine actinophages captured by epicPCR. Nat Microbiol. 2021. May;6(5):630–42. doi: 10.1038/s41564-021-00873-4 [DOI] [PubMed] [Google Scholar]
  • 59.Pilosof S, Porter MA, Pascual M, Kéfi S. The multilayer nature of ecological networks. Nature Ecology & Evolution. 2017. Mar 23;1(4):1–9. doi: 10.1038/s41559-017-0101 [DOI] [PubMed] [Google Scholar]
  • 60.Barberán A, Bates ST, Casamayor EO, Fierer N. Using network analysis to explore co-occurrence patterns in soil microbial communities. ISME J. 2012. Feb;6(2):343–51. doi: 10.1038/ismej.2011.119 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Proulx SR, Promislow DEL, Phillips PC. Network thinking in ecology and evolution. Trends in Ecology & Evolution. 2005. Jun 1;20(6):345–53. doi: 10.1016/j.tree.2005.04.004 [DOI] [PubMed] [Google Scholar]
  • 62.Segar ST, Fayle TM, Srivastava DS, Lewinsohn TM, Lewis OT, Novotny V, et al. The Role of Evolution in Shaping Ecological Networks. Trends in Ecology & Evolution. 2020. May 1;35(5):454–66. doi: 10.1016/j.tree.2020.01.004 [DOI] [PubMed] [Google Scholar]
  • 63.Montoya JM, Solé RV. Topological properties of food webs: from real data to community assembly models. Oikos. 2003;102(3):614–22. [Google Scholar]
  • 64.Weitz JS, Poisot T, Meyer JR, Flores CO, Valverde S, Sullivan MB, et al. Phage–bacteria infection networks. Trends in Microbiology. 2013;21(2):82–91. doi: 10.1016/j.tim.2012.11.003 [DOI] [PubMed] [Google Scholar]
  • 65.Kéfi S, Miele V, Wieters EA, Navarrete SA, Berlow EL. How Structured Is the Entangled Bank? The Surprisingly Simple Organization of Multiplex Ecological Networks Leads to Increased Persistence and Resilience. PLOS Biology. 2016. Aug 3;14(8):e1002527. doi: 10.1371/journal.pbio.1002527 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Allesina S, Levine JM. A competitive network theory of species diversity. PNAS. 2011. Apr 5;108(14):5638–42. doi: 10.1073/pnas.1014428108 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011649.r001

Decision Letter 0

James O'Dwyer, Samuel V Scarpino

19 Feb 2024

Dear Dr. Duhaime,

Thank you very much for submitting your manuscript "Virus-Host Interactions Predictor (VHIP): machine learning approach to resolve microbial virus-host interaction networks" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

In this work, the authors construct a dataset of viral host ranges by combining information from several publicly available sources. Using these data, they evaluate the performance of several existing tools for predicting viral/host interactions based on sequence data and develop a novel method that leverages both sequence data and other metadata. The reviewers all highlighted the potential utility of the dataset, which I agree with, but also identified several areas where substantive additional work is needed.

First, the reviewers all agreed that the bias of the host/viral data with respect to host limited the impact of both the new dataset and our interpretation of the novel machine learning method. My suggestion is that the authors demonstrate more rigorously how model performance varies when considering the smaller number of non-human-pathogenic viruses included in the study. For example, the authors could split training/testing data between human-pathogenic and all other viruses, or could do a kind of repeated bootstrap or downsampling such that human-pathogenic vs. non-human-pathogenic sample sizes are equal. Additionally, the authors should further discuss what might differ, with respect to model performance and related results, if more diverse datasets were available.
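
As one hedged illustration of this downsampling suggestion: the minimal sketch below assumes a pandas DataFrame of virus-host pairs with a boolean `is_human_pathogen` column, numeric feature columns, and a binary `interaction` label. All of these names are hypothetical conveniences, not part of the authors' pipeline.

```python
# Minimal sketch: equalize human-pathogenic vs. non-human-pathogenic sample
# sizes by downsampling the larger group, then repeat to get a score spread.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def balanced_downsampling_scores(pairs: pd.DataFrame, feature_cols, n_repeats=50, seed=0):
    scores = []
    for i in range(n_repeats):
        human = pairs[pairs["is_human_pathogen"]]
        other = pairs[~pairs["is_human_pathogen"]]
        n = min(len(human), len(other))
        # Downsample so both groups contribute exactly n virus-host pairs.
        balanced = pd.concat([
            human.sample(n=n, random_state=seed + i),
            other.sample(n=n, random_state=seed + i),
        ])
        X, y = balanced[feature_cols], balanced["interaction"]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed + i
        )
        model = GradientBoostingClassifier().fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return scores  # distribution of test accuracies across balanced resamples
```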

Second, the reviewers point out that while a comparison is done between existing classification methods, no comparison seems to have been done between the newly developed method and established approaches. Given how much emphasis in the paper is placed on both the specific new method and, more interestingly, the general approach of including more metadata in the prediction models, the authors must provide a meaningful demonstration that their approach is in fact better. Importantly, the authors must also engage with dataset bias when comparing across models. I can imagine a number of pernicious kinds of bias that could show up here related to metadata coverage in human vs. non-human pathogenic viruses, etc.

Third, a number of claims in the paper, e.g., the network results in F6, seem to go well beyond what is demonstrated in the paper. My advice here is to carefully weigh what is and isn't a well-supported conclusion in the current study. For example, given the heavy reliance on two datasets, it does seem unlikely that much can be said about generalizability across environments.

I hope that the authors pay careful attention to the detailed reviewer comments and understand that addressing them will almost certainly require running new analysis and a meaningful re-write of aspects of the manuscript.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Samuel V. Scarpino

Academic Editor

PLOS Computational Biology

James O'Dwyer

Section Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I do not have any major issues with the manuscript.

As a suggestion, please consider testing the tool on genomes derived from metagenomes, in particular ones from organisms that are taxonomically distant from those dominating your existing host range datasets. There are some metagenomic datasets that include both long-read data and Hi-C to confirm interactions. I believe https://doi.org/10.1016/j.jhazmat.2023.131944 - which you cite - has some. My main concern is the relatively narrow host range your tool was developed on. Will the model generalize?

Also, for future work, consider using graph neural networks once you get to ecosystems.

Reviewer #2: In this study, Bastien et al. manually curated a database of virus-host relationships based on publicly available experimental data. Both infection and non-infection information, as well as many-to-many infection relationships, were included, which distinguishes it from many known virus-host databases. Using this database, the accuracies of commonly used host prediction tools were validated. The authors then evaluated genomic features of viruses and hosts resulting from coevolution and, based on selected features, developed a machine-learning classifier that predicts virus-host interactions from genome sequences.

The concept of including non-infection information in host prediction, and the attempt to resolve all infection/non-infection relationships in a community of viruses and hosts, as emphasized by the authors, are indeed very important. However, they have not clearly demonstrated that the prediction tool they developed has benefitted from these additional data in terms of performance. Moreover, the authors need to provide a user-friendly implementation of this tool online.

Major concerns.

(1) As the authors have admitted, the VHRnet database is heavily biased toward human pathogens, phage therapy studies, and the data from the two experimental studies they utilized. Many potential relationships between the hosts and viruses in the database have not yet been tested experimentally. Therefore, its current version can provide little valuable information on the host specificity of viruses. Please move lines 164 to 175 to the Discussion, and instead of indicating that your database shows that viruses are specialists, you should point out that the VHRnet database currently provides little valuable information on the host specificity of viruses.

(2) The authors explained why they did not compare the performance of their tool to other published host prediction tools, but with a very vague description. They seem to say that VHIP is designed to predict virus-host pair combinations while all others predict hosts based on highest scores? I am not convinced. I think you can certainly compare the accuracy of VHIP with others for any single host-virus pair, as you have done for the existing tools in Fig. 3. The authors report an overall accuracy value of 87.8%, but it is not explained how this value is obtained; I cannot find relevant figures or tables for the results of this calculation.

(3) I checked the GitHub website of VHIP, but it looks like VHIP is not yet fully implemented in a user-friendly way. The data pretreatment needs to be done by the users with additional tools and procedures, and the necessary details are lacking in the instructions. This excludes most users without rich experience in bioinformatics. Please fully implement VHIP as a conda package, or at least make it easy to use. Also, please provide detailed instructions for the installation and use of this software.

(4) The discussion of virus-host network modeling goes well beyond what the current results can support. Please delete Fig. 6 and most of the relevant discussion, unless you can provide results showing that you have applied VHIP to a specific environmental dataset to build and explore a network. Instead, please discuss more about the accuracy and reliability of your tool.

Minor comments.

Line 132. Legends of C and D should be swapped.

Line 235. The method used to identify HGT is not described; please specify the sequence identity threshold here.

Line 364. A stronger signal compared to what?

Fig 5D. Please order the features from highest importance at the top to lowest importance at the bottom.

Reviewer #3: Overall thoughts

This paper trains a Gradient Boosted Machine to predict the presence of a successful interaction between viruses and bacterial hosts using genome features as predictors. While the creation of these genome features is outside my immediate expertise, I found the use of CRISPR motifs in host and virus genomes as a signal of past co-evolution to be a nice touch, even though the authors found these motifs to be rare in their data. The authors conclude that features including GC differences and k-mer sequence motifs were the most useful.

While I think this paper is an excellent foray into the use of both virus and host genome features to predict virus-host interactions, I found that the paper largely oversold its novelty in terms of methodology and the production of a new and valuable dataset of experimentally verified infections. In particular, the authors claim novelty for this paradigm of binary network prediction for species interactions, which indicates they did not conduct a thorough literature review. Further, their dataset is largely an amalgamation of two existing datasets, which are narrow either taxonomically or in sampling a single environment. They do add new virus-host interactions to these published records, but these are mined from GenBank metadata without any discussion of how these records were validated, or why they are assumed to be controlled experimental infections as opposed to naturally infected hosts. From a methods perspective, despite acknowledging their limited and unbalanced training data, the authors do not explore any methods for balanced sampling / re-sampling / data augmentation to help with these imbalance issues.

Finally, while the authors provide a nice general discussion, they do not provide any ecological or evolutionary context for the specific predictions made by their model. This would greatly improve the general interest of the work, and also help to open the “black box” of their model.

I’m sorry I cannot be more positive. I think this has the potential to be a very good paper if more attention is paid to the data and modelling, and the claims of novelty and applicability are reduced, and the model and results are placed in a better context with respect to the authors study system (bacteriophages), and the broader landscape of species interaction models.

Major comments

Insufficient review of literature and overall context for the current work:

Lines 40-43 & 82-93: There are multiple virus-host prediction models that predict multiple interactions at once (e.g., the entire network or parts of the sub-network). Further, many existing models successfully reconstruct multiple features of these networks, and in many cases the accuracy of these models exceeds that reported in your abstract (87.8%). From these statements and the cited references, it is clear that the authors have not sufficiently reviewed the existing literature / methods in this field. For example, relevant literature includes:

- Albery et al. (2021) The Science of the host-virus network. Nature Microbiology

- Wardeh et al. (2021) Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations. Nature Communications

- Farrell et al. (2022) Predicting missing links in global host–parasite networks. Journal of Animal Ecology

- Elmasri et al. (2020) A hierarchical Bayesian model for predicting ecological interactions using scaled evolutionary relationships. Annals of Applied Statistics

- Poisot et al. (2023) Network embedding unveils the hidden interactions in the mammalian virome. Patterns

- Strydom et al. (2023) Graph embedding and transfer learning can help predict potential species interaction networks despite data limitations. Methods in Ecology & Evolution

I strongly suggest the authors conduct a more thorough review of alternative approaches for host-pathogen prediction and adjust their claims accordingly. In particular, in Figure 6 you say your predictions can be visualized as a bipartite network of all virus-host interactions. There is a wealth of information and modeling work on bipartite networks that could help inform your approach.

Concerns about the newly assembled dataset:

Collecting lab-verified interactions is an excellent way to train and validate such a model, especially if you have “true negatives”, which are often absent from many databases of host-virus interactions. When presenting your dataset, it would be great to indicate what evidence you considered as proof of a successful infection. For example, if you found the same virus-host interaction tested in two studies, one successful, another not, how would you code this? Would you discard the unsuccessful experiment, or does your model allow for both of these observations as input?

In terms of the provenance of the input data for “VHRnet”, for the NCBI data it is unclear how “clean” or reliable the host tag in GenBank is. Having worked with these data, I have found multiple entries where this information is incorrect. What steps did you take to verify these data? Even manual curation of a small random subset would be a good start. Further, if you are using RefSeq, how do you determine whether the sequence comes from a lab-controlled experimental infection rather than an observational study (e.g., the virus may be present as a contaminant), and in this context what does a “non-infection” study look like in terms of the GenBank metadata? For example, in lines 151-152 you say the majority of viruses were reportedly tested against a single host, and these pairings come from the “host” tag; however, the host tag could be the source from which the virus was sampled and a genome was assembled from metagenomic data, which does not tell you whether this was an experimental infection or not.

It seems like the majority of your data come from the Nahant study. This database appears to be sampled from one ecosystem (a littoral marine zone). Further, there is no discussion of whether the three sources you used are comparable in terms of the definition of a successful infection, or how this was assessed. Considering that over 70% of your virus-host pairs come from this study and another on Staphylococcus, and that you focus extensively on these two datasets in your results, I wonder if it is appropriate to say you have created a new database of lab-tested host-virus interactions. Further, this raises the question of how confident you can be in your predictions as they extend to hosts in other systems (e.g. human microbiome, phyllosphere, soil microbiome, animal microbiomes, etc.).

Also, a small semantic point as mentioned above: this database is actually more of an edge list (a list of host-virus interactions, along with the outcome of each experimental infection).

Issues with the modelling approach

Line 196-197: You say that existing HPTs performed better at predicting hosts for viruses in the Staph study than for viruses in Nahant. Could this be due to an imbalance in your training data? E.g., you have more data for Staph viruses, hence the models are less able to generalize to other host groups? Considering the potential data quality and imbalance issues, I would have expected some exploration of data re-sampling and/or attempting to fit your model on one dataset and then predict the other.

Lines 329-336: You should be able to assess whether models are overfit by testing on a hold-out / validation set. In my experience, models are picked based on performance (assessed via multiple metrics) on a validation set, rather than on the number of trees. This also limits your ability to compare models which do not have this underlying architecture.

Line 336: What is the “untouched dataset”? This should be defined earlier.

Lines 329-361: Much of this reads like Methods rather than Results. It would be good to see more results in the context of predictive accuracy across different taxonomic groups, as a function of the input data (which of the data subsets the models were trained on), and discussing some examples of when the model succeeds and when it fails. By examining which interactions your model predicts as false positives and false negatives, you may be able to get some useful insights into why it fails, and how to build a better next iteration of your model.
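
A purely illustrative way to act on this error-analysis suggestion is sketched below: it collects false positives and false negatives from a fitted binary classifier for manual inspection. Here `model`, `X_test`, `y_test`, and `pair_ids` are assumed inputs, not objects from the authors' code.

```python
# Sketch: tabulate false positives and false negatives from a fitted binary
# classifier (0 = non-infection, 1 = infection) for manual inspection.
import pandas as pd

def error_table(model, X_test, y_test, pair_ids) -> pd.DataFrame:
    pred = model.predict(X_test)
    df = pd.DataFrame({"pair": pair_ids, "truth": y_test, "pred": pred})
    false_pos = df[(df["truth"] == 0) & (df["pred"] == 1)].assign(error="false_positive")
    false_neg = df[(df["truth"] == 1) & (df["pred"] == 0)].assign(error="false_negative")
    # Grouping these misclassified pairs (e.g., by host taxon) can reveal
    # where and why the model fails, guiding the next iteration.
    return pd.concat([false_pos, false_neg])
```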

Figure 6: It is unclear how your model extrapolates predictions from the species through to the community and ecosystem levels.

Minor comments

Lines 30-31 & 69-70: Please provide more context for “Advances in genome sequencing have led to the discovery of millions of novel viruses”; it is unclear which species concepts you are applying here, as there are fewer than 7,000 recognized virus species, and while I am aware of studies using different forms of extrapolation and power laws to estimate millions of viruses, these estimates are quite variable and somewhat contested.

Lines 32 & 34 & 79: Stylistic comment, but I think “what” is more appropriate than “who” when talking about bacteria and viruses.

Line 38: “host range” is often used when discussing the diversity of hosts a pathogen can infect (e.g. how many hosts / species richness, or perhaps phylogenetic diversity, etc.), as a reference to specialism and generalism in terms of host tropism, rather than the particular host-pathogen interaction. Unless you are explicitly predicting host range, using this phrase is incorrect.

Line 40: Unclear what “features of co-evolution” are.

Lines 45 & 76: Unclear what you mean by “population genomes”. Please define this and differentiate it from genomes typically associated with individual species.

Line 82: Is “host prediction tools (HPT)” a commonly used term in this field? I am more familiar with host-virus link prediction models / network models, etc...

Line 91-93: I don’t understand the idea that these models “do not predict non-infection”. Aren’t most of these models assuming a binary outcome, such that predicting an interaction for some host-virus pairs also, by necessity, predicts the absence of a similar interaction in a different host-virus pair? If HPTs are very different from existing network prediction / species interaction / link prediction models, it is important to offer a more detailed summary of these approaches and how they differ.

Lines 101-103: Given the Results-then-Methods format, it would greatly help the reader to have a brief overview of the types of co-evolutionary signals you measured (are these identified sites of selection, or auto-generated genome composition features?), and of the general type of ML model employed (GBM), somewhere here in the Introduction.

Lines 110-112: Please clarify that these are bacteriophages and bacteria, correct? If so, these are not all viruses and hosts, as you exclude large taxonomic groups of viral hosts here.

Line 308-312: What cutoff for correlation did you use? Depending on the ML model used, some relatively “highly correlated” predictors could provide very useful information. For example, in GBMs, which allow for highly non-linear interactions between predictors, predictors that appear fairly correlated could actually provide important discriminatory information for particular classes.
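
To make the correlation-cutoff question concrete, here is a minimal sketch of a Pearson correlation screen; the `features` DataFrame and the 0.8 cutoff are illustrative assumptions, and, per the reviewer's point, flagged pairs need not be dropped for tree ensembles.

```python
# Sketch of a simple Pearson correlation screen over model input features.
# `features` is assumed to be a pandas DataFrame of numeric feature columns.
import pandas as pd

def correlated_pairs(features: pd.DataFrame, cutoff: float = 0.8):
    corr = features.corr(method="pearson").abs()
    cols = corr.columns
    flagged = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > cutoff:
                flagged.append((cols[i], cols[j], float(corr.iloc[i, j])))
    # For tree ensembles such as GBMs, flagged pairs need not be removed;
    # reporting them lets readers judge redundancy case by case.
    return flagged
```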

Line 352: You first use AUROC, then here switch to ROC when referring to AUROC.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Lu Fan

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms etc.. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011649.r003

Decision Letter 1

James O'Dwyer, Samuel V Scarpino

13 Jul 2024

Dear Dr. Duhaime,

Thank you very much for submitting your manuscript "Virus-Host Interactions Predictor (VHIP): machine learning approach to resolve microbial virus-host interaction networks" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

I agree with both reviewers that the manuscript has improved substantially during revision. I would ask the authors to provide the comparisons requested by reviewer 2 and to ensure that the points raised by reviewer 3 in their original assessment are discussed in the manuscript (in addition to the response letter). I leave the decision around whether to keep figure 6, move it to the supplement, or save it for a future paper to the authors. However, I agree with the sentiment of reviewer 2 that the figure doesn't add much and is pretty high-level.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Samuel V. Scarpino

Academic Editor

PLOS Computational Biology

James O'Dwyer

Section Editor

PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: Thank you for your response to my comments and the changes you have made in the revision. I do have two further comments regarding to your response.

To your response to my 2nd major comment:

I think the authors can still compare the accuracy of VHIP with existing tools 1) based on the conventional metric (e.g., either prediction of species A or species B is considered 100% accurate), and 2) based on a new metric (e.g., only prediction of both species A and B is considered 100% accurate). In the latter case, some existing tools may still be applicable, since they can make multi-host predictions based on a threshold cutoff of the confidence score. I think both metrics have applicable scenarios in virus-host studies.
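
The two metrics Reviewer #2 proposes can be stated precisely in code. The sketch below assumes each tool returns a set of predicted hosts per virus; the dictionaries are hypothetical inputs, not VHIP's actual output format.

```python
# Sketch of the two evaluation metrics suggested by Reviewer #2.
# `predicted` maps each virus ID to the set of hosts a tool predicts;
# `truth` maps each virus ID to the set of experimentally confirmed hosts.

def any_host_accuracy(predicted: dict, truth: dict) -> float:
    """Conventional metric: a virus counts as correct if at least one
    predicted host is a confirmed host (predicting A or B is enough)."""
    hits = sum(1 for v in truth if predicted.get(v, set()) & truth[v])
    return hits / len(truth)

def all_hosts_accuracy(predicted: dict, truth: dict) -> float:
    """Stricter metric: a virus counts as correct only if the predicted
    host set exactly matches the confirmed host set (both A and B)."""
    hits = sum(1 for v in truth if predicted.get(v, set()) == truth[v])
    return hits / len(truth)

# Example: a virus confirmed to infect hosts A and B.
truth = {"virus1": {"A", "B"}}
print(any_host_accuracy({"virus1": {"A"}}, truth))   # 1.0 under metric 1
print(all_hosts_accuracy({"virus1": {"A"}}, truth))  # 0.0 under metric 2
```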

To your response to my 4th major comment:

I still think Fig. 6 is immature and it breaks the integrity of this manuscript. Please save it for your next paper.

Reviewer #3: I reviewed a previous version of this manuscript. The authors have addressed some of my concerns, including exploring potentially biased data through sub-sampling and stating why some classes of link prediction models developed for vertebrate host-viruses (e.g. phylogeographic models) may not be applicable to predicting microbe-virus interactions.

For my other concerns, I find that the authors address these in the response letter, but it is unclear that they have made the corresponding changes in the manuscript (e.g. viral species concepts / vOTUs are not mentioned, “features of co-evolution” are still mentioned and genomic features are not explicitly linked to co-evolutionary mechanisms, and it is unclear why HPTs are different from host-virus link prediction models).

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Reviewer #3: No

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011649.r005

Decision Letter 2

James O'Dwyer, Samuel V Scarpino

2 Sep 2024

Dear Dr. Duhaime,

We are pleased to inform you that your manuscript 'Virus-Host Interactions Predictor (VHIP): machine learning approach to resolve microbial virus-host interaction networks' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Samuel V. Scarpino

Academic Editor

PLOS Computational Biology

James O'Dwyer

Section Editor

PLOS Computational Biology

***********************************************************

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1011649.r006

Acceptance letter

James O'Dwyer, Samuel V Scarpino

12 Sep 2024

PCOMPBIOL-D-23-01780R2

Virus-Host Interactions Predictor (VHIP): machine learning approach to resolve microbial virus-host interaction networks

Dear Dr Duhaime,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Jazmin Toth

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Distribution of all family classifications of viruses in VHRnet.

    Lighter transparency represents the proportion of non-infection reports by viral family, relative to the solid portion, which represents known infection reports by viral family.

    (TIF)

    pcbi.1011649.s001.tif (263KB, tif)
    S2 Fig. Distribution of all host genera represented in VHRnet.

    Lighter color transparency represents the proportion of non-infection relative to infection (solid color).

    (TIF)

    pcbi.1011649.s002.tif (921.2KB, tif)
    S3 Fig. Number of viruses tested against different host taxa.

X-axis represents the number of viruses that have been tested.

    (TIF)

    pcbi.1011649.s003.tif (120.3KB, tif)
    S4 Fig. Kernel density of distance measurements for each virus-host pair, colored by interaction (yellow for infection and blue for non-infection).

The top row uses the Euclidean distance to compute the similarity between the k-mer profiles of the virus and its host, while the second row uses the d2* distance metric. Each column represents a different k-mer length used to create the k-mer profiles (k = 3, 6, or 9). The d2* metric is more appropriate than the Euclidean distance for virus-host prediction, since it encodes some evolutionary signal (the non-infection and infection peaks are separated).

    (TIF)

    pcbi.1011649.s004.tif (845.3KB, tif)
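
To make the S4 Fig comparison concrete, here is a minimal sketch of the Euclidean baseline on normalized k-mer frequency profiles. It is an illustrative reimplementation, not the authors' code; the d2* statistic differs by additionally correcting each k-mer count for its expected value under a background nucleotide model, which is what lets it separate the infection and non-infection peaks.

```python
# Sketch: Euclidean distance between normalized k-mer frequency profiles of
# a virus and a candidate host genome. Illustrative only; VHIP's actual
# feature computation (including the d2* variant) lives in the published package.
from itertools import product
import numpy as np

def kmer_profile(seq: str, k: int = 3) -> np.ndarray:
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        kmer = seq[i : i + k]
        if kmer in index:  # skip k-mers containing ambiguous bases
            counts[index[kmer]] += 1
    return counts / max(counts.sum(), 1)  # normalize counts to frequencies

def euclidean_kmer_distance(virus_seq: str, host_seq: str, k: int = 3) -> float:
    return float(np.linalg.norm(kmer_profile(virus_seq, k) - kmer_profile(host_seq, k)))
```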
    S5 Fig. Feature distribution (diagonal plots) and co-correlations (all the other plots).

    (TIF)

    pcbi.1011649.s005.tif (1.8MB, tif)
    S6 Fig. Comparison of different machine learning models on VHRnet.

For each type of machine learning model, a grid search was performed to determine the best combination of parameters. This plot shows the accuracy of the best-performing model of each type, bootstrapped 50 times (except for SVM, whose fitting algorithm is O(n^2)).

    (TIF)

    pcbi.1011649.s006.tif (115.3KB, tif)
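
A hedged sketch of the procedure S6 Fig describes (per-model-family grid search, then bootstrapped accuracy of the best configuration) follows. The model families and parameter grids shown are placeholders, not the models or values used in the study.

```python
# Sketch of the S6 Fig procedure: per-model-family grid search, then
# bootstrapped accuracy of the best configuration found for each family.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

CANDIDATES = {
    "gbm": (GradientBoostingClassifier(), {"n_estimators": [100, 300], "max_depth": [3, 5]}),
    "rf": (RandomForestClassifier(), {"n_estimators": [100, 300], "max_depth": [None, 10]}),
}

def compare_models(X, y, n_boot=50):
    results = {}
    for name, (model, grid) in CANDIDATES.items():
        # Grid search selects the best hyperparameters for this family.
        best_params = GridSearchCV(model, grid, cv=5).fit(X, y).best_params_
        accs = []
        for i in range(n_boot):
            # Each iteration re-splits the data and refits the best
            # configuration, yielding a distribution of test accuracies.
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=i)
            accs.append(model.set_params(**best_params).fit(X_tr, y_tr).score(X_te, y_te))
        results[name] = accs
    return results
```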
    S7 Fig. ROC curves from 100 bootstrapping iterations of the best model trained during the grid search using best hyperparameters.

    (TIF)

    pcbi.1011649.s007.tif (80.7KB, tif)
    S8 Fig. F1 curve of 100 best hyperparameter combinations during the grid search.

    (TIF)

    pcbi.1011649.s008.tif (681.2KB, tif)
    S1 Table. Compilation of NCBI accession numbers of lab-tested viral host range and their respective DOI.

    Submitted as an excel spreadsheet.

    (XLSX)

    pcbi.1011649.s009.xlsx (14.2KB, xlsx)
    S2 Table. Machine learning model input.

Each row contains an experimentally tested virus-host pair, their known interaction, and the signals of coevolution computed from their genomic sequences. Submitted as an excel spreadsheet.

    (CSV)

    pcbi.1011649.s010.csv (1.3MB, csv)
    S3 Table. Comparison between input and output of existing host-prediction tools.

    (XLSX)

    pcbi.1011649.s011.xlsx (10.4KB, xlsx)
    S4 Table. Pearson pairwise correlations of features that went into VHIP.

Higher values indicate more strongly correlated features.

    (XLSX)

    pcbi.1011649.s012.xlsx (9.1KB, xlsx)
    S5 Table. List of NCBI accession numbers for viral and host sequences used in this study.

    Submitted as an excel spreadsheet.

    (CSV)

    pcbi.1011649.s013.csv (48.2KB, csv)
    Attachment

    Submitted filename: response_to_reviewers.pdf

    pcbi.1011649.s014.pdf (181KB, pdf)
    Attachment

    Submitted filename: response_to_reviewers.pdf

    pcbi.1011649.s015.pdf (75.6KB, pdf)

    Data Availability Statement

All relevant data are within the manuscript and its Supporting Information files. Code written for analyses and figures generated as part of this manuscript is available on GitHub (https://github.com/DuhaimeLab/VHIP_analyses_Bastien_et_al_2023). The tool VHIP, described in the manuscript, is available as a Python package through conda-forge and PyPI. The source code is public on GitHub (https://github.com/DuhaimeLab/VirusHostInteractionPredictor).

