Briefings in Bioinformatics. 2023 Jul 18;24(5):bbad239. doi: 10.1093/bib/bbad239

Zero-shot-capable identification of phage–host relationships with whole-genome sequence representation by contrastive learning

Yao-zhong Zhang 1, Yunjie Liu 2, Zeheng Bai 3, Kosuke Fujimoto 4,5, Satoshi Uematsu 6,7, Seiya Imoto 8
PMCID: PMC10516345  PMID: 37466138

Abstract

Accurately identifying phage–host relationships from their genome sequences is still challenging, especially for those phages and hosts with less homologous sequences. In this work, focusing on identifying the phage–host relationships at the species and genus level, we propose a contrastive learning based approach to learn whole-genome sequence embeddings that can take account of phage–host interactions (PHIs). Contrastive learning is used to make phages infecting the same hosts close to each other in the new representation space. Specifically, we rephrase whole-genome sequences with frequency chaos game representation (FCGR) and learn latent embeddings that ‘encapsulate’ phages and host relationships through contrastive learning. The contrastive learning method works well on the imbalanced dataset. Based on the learned embeddings, a proposed pipeline named CL4PHI can predict known hosts and unseen hosts in training. We compare our method with two recently proposed state-of-the-art learning-based methods on their benchmark datasets. The experiment results demonstrate that the proposed method using contrastive learning improves the prediction accuracy on known hosts and demonstrates a zero-shot prediction capability on unseen hosts.

In terms of potential applications, the rapid pace of genome sequencing across different species has resulted in a vast amount of whole-genome sequencing data that require efficient computational methods for identifying phage–host interactions. The proposed approach is expected to address this need by efficiently processing whole-genome sequences of phages and prokaryotic hosts and capturing features related to phage–host relationships for genome sequence representation. This approach can be used to accelerate the discovery of phage–host interactions and aid in the development of phage-based therapies for infectious diseases.

Keywords: phage–host identification, contrastive learning, whole-genome sequence representation

INTRODUCTION

With the development of next-generation sequencing technologies, bacterial virus (phage) genome sequences are being discovered far faster than with traditional culture-based methods. The rapid and accurate identification of infectable hosts for these newly identified bacterial viruses has become a fundamental task that accompanies the need to analyze genome sequences.

For identifying a phage’s infectable hosts, classical methods compare the phage and candidate hosts based on homology matches [1], CRISPR arrays [2], specific regions of integration [3] and oligonucleotide abundance [4]. However, the recall rates of the classical methods are usually low due to the limited coverage of the identifiable information and the heterogeneity of viral genomes. To solve this problem, statistical and machine learning based methods [5–9] have been proposed. Features such as k-mer profiles [9] and phage protein content profiles [6, 8] are used for the host prediction. Usually, these methods formalize phage–host identification as a multi-class prediction task, which predicts or infers host labels based on input features. Although significant performance improvements have been achieved, one major limitation of the multi-class prediction approach is that the host labels to be predicted have to be pre-defined, which limits its generalization ability to unseen hosts. Meanwhile, imbalanced data are a critical challenge for the learning-based approach, as the number of negative phage–host pairs is much larger than the number of positive pairs.

As new phages and bacteria continue to be discovered, there is a growing demand for identifying phage–host relationships for newly discovered (previously unseen) hosts. For this purpose, comparing genome or protein sequences of both phages and hosts using alignment [10] or alignment-free methods [11, 12] can be used without the need to retrain a model, as is required in the multi-class prediction formalism. The performance of these pair-matching methods is highly dependent on the representation of phages and hosts. Currently, the prediction performance at the species and genus levels still needs to be further improved. Shang et al. [13] proposed a method named CHERRY that constructs a multimodal graph incorporating phage–phage and phage–host relationships. They trained an encoder for node representation and a decoder for edge prediction. The convolutional graph encoder integrates the topological structure of the multimodal graph, in which features from both training and testing samples are used. The identification of PHIs is formalized as an edge prediction task with the decoder.

In this work, we propose a new method to incorporate PHIs into the genome sequence representation of phages and hosts, and use the learned representation for identifying PHIs. To process whole genome sequences effectively and efficiently, we rephrase k-mer information with frequency chaos game representation [14] for both phages and hosts. Based on the known PHIs and the FCGR representation, we then train a convolutional neural network (CNN) to learn latent embeddings of both phages and hosts. The PHIs are incorporated into the new representation via contrastive learning that makes phages infecting the same host as close as possible in the learned embedding space. Contrastive learning can effectively deal with imbalanced datasets, where positive phage–host pairs are significantly outnumbered by negative ones. The latent embeddings of phages and hosts are trained to take account of known PHIs in the representation. Using these learned representations of phages and hosts, PHIs can be identified by calculating the distance of their corresponding embeddings. In addition to achieving high prediction accuracy similar to the multi-class prediction approach, the proposed method can be readily extended to predict unseen hosts and multiple hosts.

METHODS

Rephrasing whole-genome sequences with frequency chaos game representation

As the genomes of phages and bacteria are at different scales (average size of phages: around 52 kbp; bacteria: around 4 Mbp), choosing an appropriate genome representation is the first and fundamental step in analyzing PHIs within a learning-based framework. Meanwhile, genomic variants present an additional challenge for species-level representation without loss of generality. In this work, we used FCGR as the initial genome sequence representation. FCGR has been widely used in many bioinformatics applications [15], such as sequence alignment. It compresses k-mer frequency information into a two-dimensional (2D) matrix with the element coordinates arranged according to chaos game representation (CGR) [16]. FCGR preserves the basic character of CGR as a generalized Markov chain. The reason for using FCGR is that we can process k-mer information with 2D convolution more efficiently, especially when k is large. For example, when using 6-mers for genome sequence representation, the 6-mer features are traditionally processed as a 4096-dimensional vector (4^6 = 4096), while FCGR preserves the same amount of information in the shape of a 64 x 64 matrix. A 64 x 64 matrix can be processed very efficiently with a convolutional module. As illustrated in Figure 1, phage and host whole-genome sequences/contigs are first transformed into an FCGR matrix and then processed with a convolutional encoder.
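The FCGR construction described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the corner layout in `CORNERS`, the helper names `kmer_to_cell`, `fcgr` and `normalize`, and the choice to skip k-mers containing ambiguous bases are all assumptions made for the example.

```python
# Assumed corner layout: each base contributes one (row, col) bit per position.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Map a k-mer to its (row, col) cell in a 2^k x 2^k grid."""
    row = col = 0
    for i, base in enumerate(kmer):
        r, c = CORNERS[base]
        # Later bases set higher bits, i.e. they select coarser quadrants.
        row |= r << i
        col |= c << i
    return row, col


def fcgr(sequence, k=6):
    """Count the k-mers of `sequence` into a 2^k x 2^k matrix (list of lists)."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
        if all(base in CORNERS for base in kmer):
            row, col = kmer_to_cell(kmer)
            matrix[row][col] += 1
    return matrix


def normalize(matrix):
    """Turn counts into frequencies so genomes of different sizes are comparable."""
    total = sum(map(sum, matrix))
    if total == 0:
        return matrix
    return [[value / total for value in row] for row in matrix]
```

With `k=6` the result is a 64 x 64 matrix holding the same 4096 counts as a flat 6-mer profile; only the arrangement differs, which is what lets a 2D convolution exploit neighbouring cells.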

Figure 1.

CL4PHI pipeline for phage–host identification. CL4PHI consists of two major components (A) phage and host genome sequence representation taking account of PHIs and (B) PHI identification based on the distance of the learned representation. (A) Learning phage and host embeddings taking account of PHIs. Phage and host whole-genome sequences are first represented with frequency chaos game representation as 2D k-mer matrices. An encoder is then trained to optimize the contrastive loss on all positive and negative sample pairs from a training set. The margin in the contrastive loss is used to separate infectable and non-infectable hosts. (B) Prediction of infectable hosts for a given phage. After acquiring the phage and candidate host embeddings, Euclidean distances between the phage and candidate hosts are calculated here. The infectable hosts can be predicted based on criteria of minimum distance or within given margin threshold (e.g. <1).

Learning latent space embeddings through contrastive learning

FCGR provides a general representation of genome sequences for phages and candidate hosts. To further incorporate phage–host relationships into the representation, we propose to use contrastive learning [17] based on known PHIs. We prepared contrastive training samples from the given phage–host training dataset by pairing each phage with all known candidate hosts at the species level. Interacting phage–host pairs are labeled 1, while all other pairs are labeled 0. The latent embeddings are learned by minimizing the contrastive loss as follows:

[Equation 1: contrastive loss over the labeled phage–host pairs]

where the margin controls or defines how close the infectable hosts of a phage must be in the embedding space, and the distance is measured between two embeddings. Here, we used a simple Euclidean distance as the distance metric. In the contrastive loss, the margin is used to bound a phage’s infectable hosts while separating all the non-infectable hosts. A large number of non-infectable hosts will not disturb the training much. Therefore, no specific strategy for dealing with the data imbalance issue is applied here.
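As an illustration of the pair labeling and the margin-based loss described above, the following sketch uses the standard pairwise contrastive loss of reference [17]: squared distance for positive pairs, and squared hinge on the margin for negative pairs. The exact form and scaling of Equation 1 in the paper may differ, so this is an assumption; the function names and the default margin of 1 (the value used for the margin-based predictions in Figure 4) are likewise chosen for the example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings given as sequences of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_pairs(phage_hosts, all_hosts):
    """Pair every phage with every candidate host; label 1 if they interact, else 0."""
    return [
        (phage, host, 1 if host in hosts else 0)
        for phage, hosts in phage_hosts.items()
        for host in all_hosts
    ]


def contrastive_loss(phage_emb, host_emb, label, margin=1.0):
    """Assumed standard form: d^2 for positives, max(0, margin - d)^2 for negatives."""
    d = euclidean(phage_emb, host_emb)
    if label == 1:
        return d ** 2  # pull infectable hosts towards the phage
    return max(0.0, margin - d) ** 2  # push non-infectable hosts beyond the margin
```

Under this assumed form, a negative pair that already lies farther apart than the margin contributes zero loss, which is one way to see why a large number of non-infectable hosts need not dominate training.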

We used a basic two-layer CNN with batch normalization as the backbone model to learn the mapping from FCGR input to the latent embedding. Details regarding the model architecture and hyper-parameters can be found in Supplementary document S2. In summary, the total number of model parameters of CNN with the 6-mer FCGR representation amounts to 2.7M, and the dimension of the output (learned embedding) is 512. The model was trained and saved according to the best validation accuracy.

Prediction of infectable host or infecting virus

After learning the embeddings of genome sequences that take account of PHIs, we can predict a phage’s infectable hosts, or conversely, the infecting phages of a host, by measuring the distance between embedding pairs. In this work, we mainly focused on predicting hosts for target phages. On the one hand, as in many multi-class prediction methods, the candidate host with the minimum distance in the list can be selected as the prediction. On the other hand, we have the option to select all the hosts whose distance to the phage is within the margin as the prediction, providing a straightforward extension to accommodate situations where a phage may infect multiple hosts. We refer to this entire pipeline as CL4PHI in the following sections.
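The two prediction criteria can be sketched as follows, assuming the phage and host embeddings have already been produced by a trained encoder. The function names and the dictionary layout of `host_embs` are hypothetical; the default margin of 1 follows the threshold quoted in the Figure 1 caption.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_hosts(phage_emb, host_embs):
    """Return (host, distance) tuples sorted from nearest to farthest."""
    scored = [(host, euclidean(phage_emb, emb)) for host, emb in host_embs.items()]
    return sorted(scored, key=lambda item: item[1])


def predict_nearest(phage_emb, host_embs):
    """One-best prediction: the candidate host with the minimum distance."""
    return rank_hosts(phage_emb, host_embs)[0][0]


def predict_within_margin(phage_emb, host_embs, margin=1.0):
    """Multi-host prediction: every candidate host closer than the margin."""
    return [host for host, d in rank_hosts(phage_emb, host_embs) if d < margin]
```

Because the candidate list is just a dictionary of embeddings, a host that was never seen in training can be added at prediction time by encoding its genome, without retraining a classifier head.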

Evaluation benchmark datasets and metrics

We evaluated the proposed method in comparison with two recently proposed learning-based methods: DeepHost [9] and CHERRY [13]. In conjunction with these methods, we used their benchmark datasets for training and testing. DeepHost prepared a dataset based on phage resources from NCBI, EMBL and PhageDB, which contains 8756 phages and 118 bacterial hosts at the species level. (Among the 118 species, 2696 phages are annotated with ‘Mycobacterium smegmatis’ and 7 phages with ‘Mycobacteruim smegmatis’. By querying the NCBI database and checking the genus-level annotation, we hypothesized that ‘Mycobacteruim smegmatis’ is a misspelling of ‘Mycobacterium smegmatis’. Therefore, we changed the species label ‘Mycobacteruim smegmatis’ to ‘Mycobacterium smegmatis’ and obtained 117 host species instead. The detailed information is provided in Supplementary material S1.1.) CHERRY used the dataset [12] that contains 1940 phages and 223 bacterial hosts. For conciseness, we refer to these two benchmarks as the DeepHost dataset and the CHERRY dataset in the following sections. We followed their methods of splitting the datasets. The detailed data-split information is reported in the Supplementary document S1. DeepHost formalizes phage–host prediction as a multi-class prediction task for known host candidates, so prediction for unseen hosts is not applicable. Therefore, we evaluated the model performance of host prediction under the known host setting on the DeepHost dataset. CHERRY explores phage and host sequence features and builds a graph network based on phage–phage relationships (phage protein organizations) and phage–host relationships (CRISPR database, BLASTN and training data annotation). PHIs are identified by predicting the likelihood of an edge between the phage and the host node. CHERRY can be used to predict unseen hosts, provided their genome sequences are available. 
Querying databases is required in the stage of building graphs for both training and testing, and the original CHERRY used features from both training and testing samples for constructing the multimodal graph [13]. For the purpose of investigating how learning-based models perform on unseen phages and hosts, we retrained the CHERRY model with all hosts that appear in the training data (187 hosts) and predicted for the 223 hosts which include 36 hosts only encountered in the testing. Regarding the hosts in the benchmark datasets, we downloaded their genome sequences based on accession IDs in the ‘prokaryote.csv’ file provided by CHERRY on GitHub. For hosts lacking an accession ID, we manually searched and downloaded the corresponding sequences from NCBI databases, using the host’s species name as a guide. The Supplementary document S1 provides further information on these manually downloaded hosts.

For known hosts, we calculated host prediction accuracy as the percentage of phages for which hosts are correctly predicted according to gold standards. In cases where a model gives no prediction for a sample, we counted these instances as wrong predictions. For hosts unseen in training, we evaluated top-k accuracy (k = 1, 2, 3, 5, 10, 20, 50, 100). All models were trained at the species level, with prediction accuracy assessed at both the species and the genus levels. The accuracy at the genus level was calculated based on the information derived from the species-level prediction.
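The metrics above can be sketched as a single top-k function. Two points are assumptions made for this example rather than details stated in the text: the genus is taken as the first word of the binomial species name, and a genus-level top-k hit means the gold genus appears among the genera of the top-k species predictions. A phage with no prediction counts as wrong, as described above.

```python
def genus_of(species_name):
    """Assumption: the genus is the first word of the binomial species name."""
    return species_name.split()[0]


def top_k_accuracy(ranked, gold, k, key=lambda name: name):
    """Fraction of phages whose gold host is among the top-k ranked predictions.

    ranked: phage id -> list of predicted species, best first
    gold:   phage id -> gold-standard species
    key:    identity for species-level scoring, genus_of for genus-level scoring
    """
    hits = 0
    for phage, gold_host in gold.items():
        candidates = ranked.get(phage, [])[:k]  # no prediction counts as wrong
        if key(gold_host) in {key(host) for host in candidates}:
            hits += 1
    return hits / len(gold)
```

For example, `top_k_accuracy(ranked, gold, k=1)` gives the one-best species accuracy, and passing `key=genus_of` scores the same species-level predictions at the genus level.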

RESULTS

Whole-genome-sequence embeddings learned by contrastive learning can take account of phage–host relationships

Figure 2 presents t-distributed stochastic neighbor embedding (tSNE) plots of embeddings before and after applying contrastive learning to the two benchmark datasets. In the tSNE plots of the flattened FCGR representation, most phages are scattered, and bacterial hosts are located within an extended rectangular region centered in the middle. After applying contrastive learning, the new embeddings start to cluster according to their infectable hosts. For instance, phages infecting Mycobacterium smegmatis in Figure 2A and Mycolicibacterium smegmatis in Figure 2B demonstrate noticeable clustering after the application of contrastive learning. (Note that Mycolicibacterium smegmatis previously belonged to the genus Mycobacterium; the two datasets use different names for this host species.) Besides the phage clustering according to the host at the species level, phages infecting hosts of the same genus are also observed to cluster in the new embeddings. For instance, at the top of the right panel in Figure 2B, phages (NC_031012.1 and NC_030920.1) infecting Bacillus thuringiensis and Bacillus megaterium are located close to each other after contrastive learning. A similar clustering pattern can be clearly observed for the phages infecting Streptomyces (Streptomyces viridochromogenes, Streptomyces lividans, Streptomyces griseus, Streptomyces griseofuscus, Streptomyces xanthochromogenes, Streptomyces scabiei, Streptomyces venezuelae and Streptomyces sanglieri) and Mycobacterium (Mycobacterium avium, Mycobacterium phlei and Mycobacterium smegmatis) at the bottom of the right panel in Figure 2A. As the taxonomic hierarchy is not used in the contrastive learning, this indicates that the learned embeddings preserve the genome sequence information after contrastive learning. Note that the phages infecting some hosts are already clustered in the original embedding, while others are scattered. 
For example, phages infecting Propionibacterium acnes in Figure 2A and phages infecting Cutibacterium acnes in Figure 2B are clustered both before and after contrastive learning. This indicates that the FCGR representation of whole-genome sequences can encode phage–host relationships for some specific species. Through contrastive learning, more phage–host infection properties are integrated into the new embeddings.

Figure 2.

tSNE plots of phage and host representations before and after contrastive learning was applied to the two benchmark test sets. (A) The phage and host embeddings on the DeepHost’s benchmark dataset. (B) The tSNE plots of phage and host embeddings on the CHERRY’s benchmark dataset. In both (A) and (B), the left figure depicts phage and host embeddings based on the FCGR representation, and the right figure shows the learned phage and host embeddings through contrastive learning. Phages are represented by point shapes, while the hosts are by cross shapes. The tSNE visualization shows that PHIs are accounted for in the learned embeddings through contrastive learning, which is evident from the right figures in both (A) and (B) that phages infecting the same host tend to be clustered together.

Contrastive learning improves host prediction accuracy

We compared CL4PHI with two state-of-the-art learning-based (hybrid included) methods, DeepHost and CHERRY, on their benchmark datasets. In the contrastive learning of CL4PHI, we only used the hosts appearing in the training and validation sets to learn embeddings. The DeepHost dataset contains 117 hosts in the training set, which cover all 91 hosts in the test set. The CHERRY dataset contains 187 hosts in the training set and 95 hosts (including 36 unseen hosts) in the test set. CL4PHI uses the minimum distance criterion in the prediction. To evaluate the performance on unseen hosts and ensure a fair comparison, CHERRY was applied in a modified setting that differs from the one described in the original paper in the following two aspects: (1) Phages and hosts in the testing set were not used to construct the multimodal graph during training. (2) For computational efficiency, we only used the bacterial hosts relevant to the benchmark datasets instead of the whole 60 105 prokaryotes. More specifically, on the CHERRY dataset, we used 187 hosts and all training phage samples to train CHERRY. In the testing, we used a total of 223 hosts and the test phage samples for the evaluation. In what follows, the term ‘CHERRY’ refers to the modified version of the CHERRY model used in this setting. The model parameters of DeepHost and CHERRY were set following their papers. For training CHERRY on the DeepHost dataset, the training did not converge with the default hyper-parameter setting (lr = 0.01 and epoch = 250). We tested several additional sets of hyper-parameters and used the one (lr = 0.001 and epoch = 2000) that gave a higher training accuracy and observed training convergence. All model parameters can be found in Supplementary file S2.

The prediction accuracies at the species and genus levels are shown in Table 1. At both taxonomy levels, CL4PHI achieves a higher accuracy than DeepHost and CHERRY on the two datasets. Specifically, on the DeepHost dataset, CL4PHI achieves a species-level accuracy of 0.9279, which is 1.74 percentage points higher than DeepHost (0.9105). On the CHERRY dataset, where the amount of training data is smaller (the DeepHost dataset is roughly 5x larger), all three models exhibit reduced accuracy compared with their performance on the DeepHost dataset. Among the 634 test phages in the CHERRY dataset, two test phages, ‘NC_029548.1’ and ‘NC_029098.1’, receive no prediction from CHERRY. CL4PHI achieves a higher prediction accuracy at both the species and the genus level.

Table 1.

Prediction accuracy on the two datasets. The genus-level accuracy is calculated based on the species-level prediction results

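| Dataset | Species: DeepHost | Species: CHERRY | Species: CL4PHI | Genus: DeepHost | Genus: CHERRY | Genus: CL4PHI |
| --- | --- | --- | --- | --- | --- | --- |
| DeepHost dataset | 0.9105 | 0.7597 | **0.9279** | 0.9479 | 0.8264 | **0.9613** |
| CHERRY dataset | 0.4921 | 0.4811 | **0.5726** | 0.5757 | 0.5394 | **0.7145** |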

The numbers highlighted in bold font indicate the highest accuracy among the compared methods for each specific taxonomy level within the dataset.

We conducted a more detailed analysis of the model prediction results on the CHERRY dataset, leveraging the extensive annotations available. We compared the overlap of correct predictions made by all three methods, as shown in Figure 3A and B. Overall, out of the total 634 phages, 219 and 265 phages are correctly predicted by all three methods at the species and genus level, respectively. CHERRY, DeepHost and CL4PHI have an additional 60, 13 and 48 correctly predicted phages, respectively, that are not correctly predicted by the other two methods at the species level. These numbers change to 36, 15 and 70 at the genus level. Similar changes can also be observed in the phage types of Siphoviridae, Herelleviridae, Drexlerviridae and Demerecviridae from the species level to the genus level. For the four most frequent phage types, CL4PHI achieves a higher accuracy at both the species and genus levels. All methods failed to correctly predict the only Tristromaviridae sample in the test set. This sample, identified as ‘NC_029548.1’, is one of the phages for which CHERRY gave no prediction.

Figure 3.

Comparisons of the one-best prediction results of the three models on the CHERRY dataset. (A) and (B) show comparisons at the species level and genus level, respectively. At each level, a Venn diagram of the correct predictions made by the three models, as well as the prediction accuracy according to different phage types and genome length intervals, is shown. CL4PHI shows competitive results for most phage types and genome length intervals at both the species and genus levels.

We further assessed the models’ prediction performance in relation to phage genome size. The evaluation was carried out over five length intervals: ‘<30 kbp’, ‘30–65 kbp’, ‘65–100 kbp’, ‘100–150 kbp’ and ‘>150 kbp’, in accordance with the phage genome size distribution [18]. As shown in Figure 3, CL4PHI delivers equal or higher performance across all length intervals at the genus level. At the species level, CL4PHI achieves a higher accuracy than the other two methods in most of the length intervals, except the ‘100–150 kbp’ range.

CL4PHI can be generalized to unseen host species

We evaluated the models on the 80 phage test samples that infect the 36 hosts unseen in training on the CHERRY benchmark dataset. Given that DeepHost can only predict pre-defined labels, we used all 223 labels, including the 36 unseen labels, as its prediction labels. No phage sample related to these 36 unseen labels was used in training the DeepHost model. Unlike DeepHost, both CHERRY and CL4PHI use host genome sequence information for predicting PHIs. In this evaluation, these 36 unseen host sequences were likewise not used in training for any of the methods.

The top-k accuracy of the phages infecting unseen hosts in the CHERRY benchmark dataset is shown in Table 2. For the top-1 accuracy at the species level, CHERRY achieves the highest accuracy of 0.3625, while at the genus level, CL4PHI achieves the highest accuracy of 0.45. The top-2 accuracy of CL4PHI at both species and genus levels increases to 0.375 and 0.5875, respectively. These are the highest among the three models. As k increases to 3 and 5, the accuracy of CL4PHI predictions increases to 0.475 and 0.6, while CHERRY accuracy remains 0.375. Note that host predictions made by CHERRY are based on both empirical rules and model prediction scores. For those nodes marked with confident labels, the predictions are assigned a prioritized score of 1.0 in CHERRY. The confident label is generated based on the phage–host CRISPR query, high confident surrounding label, phage–phage BLASTN identity percentage (larger than 0.9), and appearance in the training set. Since the phages infecting unseen hosts are evaluated here, the rules of confident label assignment relate to the first three criteria. In the top-1 prediction made by CHERRY for phages infecting unseen hosts, all 29 correctly predicted labels are related to the confident nodes. As for CL4PHI, all predictions are made according to the minimum distance between the learned embeddings of phage and hosts. Given that the embeddings are learned from supervised data, the top-1 result suggests that CL4PHI could benefit from incorporating additional information sources, similar to the approach used by CHERRY. Interestingly, while the DeepHost model made no correct prediction at the species level, it did make correct predictions at the genus level. For the top-10 genus-level accuracy, it achieves 0.5. 
As only the phage sequence information is used in the DeepHost model, this indicates that their encoding of whole-phage-genome sequences can also be generalized for predicting unseen host interactions at a higher taxonomy level.

Table 2.

Top-k accuracy of the three learning-based models on the 80 phage test samples infecting the 36 unseen hosts

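| Top-k | Species: DeepHost | Species: CHERRY | Species: CL4PHI | Genus: DeepHost | Genus: CHERRY | Genus: CL4PHI |
| --- | --- | --- | --- | --- | --- | --- |
| top-1 | 0.0 | **0.3625** | 0.1625 | 0.225 | 0.4 | **0.45** |
| top-2 | 0.0 | 0.3625 | **0.375** | 0.35 | 0.4125 | **0.5875** |
| top-3 | 0.0 | 0.375 | **0.475** | 0.3625 | 0.45 | **0.6625** |
| top-5 | 0.0 | 0.375 | **0.6** | 0.4375 | 0.5375 | **0.7375** |
| top-10 | 0.0 | 0.6125 | **0.65** | 0.5 | **0.7625** | **0.7625** |
| top-20 | 0.0 | 0.7375 | **0.7625** | 0.5625 | 0.8125 | **0.8375** |
| top-50 | 0.0 | 0.8875 | **0.9125** | 0.625 | **0.95** | 0.925 |
| top-100 | 0.0 | **0.95** | **0.95** | 0.675 | **0.975** | 0.95 |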

The numbers highlighted in bold font represent the highest top-k accuracy among the compared models at a specific taxonomy level.

We also utilized the Phage & Host Daily (PHDaily) [19] for evaluating models with additional external databases. PHDaily is a curated catalog of known phage–host associations collected from several databases of NCBI Virus, Virus-Host DB, MVP, RefSeq, GenBank, UniProt and IntAct. We observed similar performance for both seen and unseen host evaluation on the external dataset, indicating the robustness and generalization capability of our proposed approach. Detailed results are reported in Supplementary file S4.

Visualizing PHIs based on distance can provide an intuitive perspective. CL4PHI identifies PHIs by measuring the distance between a phage and a host. During the training of CL4PHI, a margin distance is utilized to separate infectable and non-infectable hosts. Thus, in addition to the one-best prediction, the criterion of selecting candidate hosts within the margin can be used to predict infectable hosts for a phage. Figure 4C shows the margin-based CL4PHI predictions on the CHERRY dataset. In this figure, the length of an edge between a phage and a host corresponds to the distance between their learned embeddings. Edges are plotted only if their length is less than a margin of 1. Hosts that are not seen in training are colored gray to distinguish them from the hosts seen in training. For example, Clostridium tetani is an unseen host colored in gray, as shown in Figure 4A. Red lines highlight predicted PHIs that overlap the gold standard. In Figure 4A, ‘NC_029010.1’ has two host predictions with a phage–host distance less than the margin. Although the closest host to ‘NC_029010.1’ is Staphylococcus haemolyticus instead of the gold standard Staphylococcus aureus at the species level, CL4PHI makes the correct prediction at the genus level. Moreover, such a margin-based prediction approach can be utilized to handle multi-host predictions where a phage infects several different hosts.

Figure 4.

Visualization of the margin-based prediction results of CL4PHI on the CHERRY dataset. (A) and (B) are zoomed-in views of the overall phage–host network depicted in (C). The edges in the network represent the distance between the phage and host nodes based on the embeddings learned by CL4PHI. In the plot, edges are shown if their distance is less than the margin of 1. The gray-colored nodes represent the hosts unseen in training, while the red lines highlight the phage–host predictions that overlap the gold standard. This visualization provides an intuitive overview of the prediction results, offering an alternative perspective to the one-best accuracy measure. Furthermore, the margin-based prediction approach enables handling multi-host predictions, where a phage can infect several different hosts.

A case study on metagenomic data

We evaluated the proposed method using metagenomic data from a real cohort study [20]. This study involves sequencing of intestinal phage and bacterial metagenomic data from fecal samples of 101 healthy Japanese adults. Circular viral contigs longer than 1.5 kb and bacterial contigs longer than 5 kb were selected after assembling phage and bacteria contigs. Bacteria-like viral contigs were removed based on the predictions of VirSorter [21] and VirFinder [22]. This results in 1453 distinct circular viral contigs and 265 234 bacterial contigs. The bacterial contigs were assigned to 240 bacterial taxonomy using PhyloPythiaS+ [23]. In the original study, phage–host associations were identified based on prophages and CRISPR spacer-based method. A total of 85 prophage-identified and 663 CRISPR-spacer-identified PHIs at the species level (with 60 PHIs identified by both methods) were used to investigate the overlaps between the reported PHIs and the predictions made by different deep learning based methods. It should be noted that these reported PHIs only cover a portion of the PHIs present in the intestinal microbial environment, as acquiring comprehensive species-level PHIs from metagenomic data is difficult.

We trained models on the CHERRY benchmark dataset, which includes all 223 host species and 1940 PHIs. Since the reference genome sequence of ‘uncultured bacterium A11’ could not be found in the NCBI database, we used the remaining 239 reference genome sequences among the 240 hosts at the species level for the prediction. As the host composition of the intestinal microbial environment differs from the training data, only 15 host species overlap between these 239 hosts and the 223 hosts used in training (host names are reported in S5.1 of the Supplementary file).

For the 1453 phage contigs, we predicted their potential hosts from the list of 239 candidate hosts using DeepHost, CHERRY and CL4PHI models. However, DeepHost could not predict hosts that are not encountered during training, and only 15 of the 239 hosts are present in the training set. Therefore, we mainly evaluated the prediction results of CHERRY and CL4PHI based on their overlap with the prophage-identified and CRISPR-spacers-identified PHIs. For evaluating phages with multi-host interactions, a prediction is considered overlapping if it matches any of the hosts in the multi-host list. Among the 85 prophage-identified PHIs, CHERRY predicted 13 overlapping hosts, whereas CL4PHI predicted two overlapping hosts for the one-best prediction at the species level. Notably, the two overlapping predictions made by CL4PHI are not included in the 13 hosts predicted by CHERRY. For the top-k prediction, CL4PHI predicted more overlapping PHIs than CHERRY when k equals or exceeds 10 (S5.2 Figure A of Supplementary file). For the top-50 predictions, CHERRY and CL4PHI have 38 and 55 predictions overlapping with the prophage-identified PHIs, respectively. Regarding the 663 CRISPR-spacer-identified PHIs, CHERRY predicted 223 overlapping PHIs, while CL4PHI predicted 27 overlapping hosts at the species level for the one-best prediction. There are 16 of the 27 overlapping predictions made by CL4PHI that are not covered in the 223 predictions made by CHERRY. As the value of k increases in top-k predictions, the disparity in the number of overlapping predictions between CHERRY and CL4PHI decreases (S5.3 Figure A of Supplementary file). For the top-50 prediction, CHERRY and CL4PHI have 446 and 380 predictions overlapping with the CRISPR-spacer-identified PHIs at the species level, respectively. 
The CHERRY pipeline uses the CRISPR database to prioritize confident predictions, which likely explains why CHERRY produced more overlapping predictions than CL4PHI for the CRISPR-identified PHIs. To test this assumption, besides comparing the pipeline-level results, we also evaluated the deep learning models’ direct predictions without any external rules or knowledge. In this setting, CL4PHI gives more predictions overlapping with the prophage-identified and CRISPR-identified PHIs, as shown in Figures S5.2 (B) and S5.3 (B) of the Supplementary file. DeepHost yields fewer overlapping predictions: 0 for the prophage-identified PHIs and 2 for the CRISPR-spacer-identified PHIs. For the remaining phage contigs (765 out of 1453) without PHIs identified by matching known relevant DNA fragments, CL4PHI can provide candidate infectable hosts for further investigation.
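
The overlap criterion used above (a top-k prediction counts as overlapping if it matches any host in a phage's reference list) can be sketched as follows. This is a minimal illustration, not the authors' evaluation script; the function name `count_topk_overlaps` and all contig and host identifiers are hypothetical.

```python
from typing import Dict, List, Set


def count_topk_overlaps(
    ranked_predictions: Dict[str, List[str]],
    reference_hosts: Dict[str, Set[str]],
    k: int,
) -> int:
    """Count phages whose top-k predicted hosts include any reference host."""
    hits = 0
    for phage, hosts in reference_hosts.items():
        top_k = ranked_predictions.get(phage, [])[:k]
        # Multi-host rule: a single match against any listed host is enough.
        if any(h in hosts for h in top_k):
            hits += 1
    return hits


# Toy data (hypothetical identifiers, not taken from the study).
ranked = {
    "contig_1": ["host_A", "host_B", "host_C"],
    "contig_2": ["host_D", "host_E", "host_F"],
    "contig_3": ["host_G", "host_H", "host_I"],
}
reference = {
    "contig_1": {"host_B"},
    "contig_2": {"host_X", "host_D"},  # multi-host reference list
    "contig_3": {"host_Z"},
}

top1 = count_topk_overlaps(ranked, reference, k=1)
top2 = count_topk_overlaps(ranked, reference, k=2)
print(top1, top2)
```

Because the count is taken per phage, raising k can only keep the number of overlaps the same or increase it, which is consistent with the top-k trends reported above.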

Training time of contrastive learning

In contrastive learning, binary paired samples are generated from the known phage–host interaction data. The number of training samples increases in proportion to the number of hosts. We report the total training time of running 150 epochs after model convergence on a workstation equipped with one AMD 3995WX CPU and one NVIDIA A6000 GPU. As shown in Table 3, contrastive learning finishes within 2 hours on both datasets (5206 s and 1884 s).

Table 3.

Time usage of training the contrastive learning model on the two benchmark datasets

Dataset | Phage number | Batch size (by phage) | Host number | Epoch | Total time
Inline graphic | 6734 | 32 | 117 | 150 | 5206 s
Inline graphic | 1306 | 32 | 187 | 150 | 1884 s
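
To illustrate why the number of training samples grows with the number of hosts, the sketch below builds binary phage–host pairs under the assumption that every phage is paired with every candidate host (label 1 for a known interaction, 0 otherwise). The text does not spell out the exact pairing or sampling scheme, so `make_pairs` and the toy identifiers are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def make_pairs(
    interactions: Dict[str, Set[str]],
    hosts: List[str],
) -> List[Tuple[str, str, int]]:
    """Pair each phage with each host; label 1 if the pair is a known PHI."""
    pairs = []
    for phage in sorted(interactions):
        for host in hosts:
            label = 1 if host in interactions[phage] else 0
            pairs.append((phage, host, label))
    return pairs


# Toy data (hypothetical identifiers).
interactions = {"phage_1": {"host_A"}, "phage_2": {"host_B", "host_C"}}
hosts = ["host_A", "host_B", "host_C"]

pairs = make_pairs(interactions, hosts)
n_pos = sum(label for _, _, label in pairs)
print(len(pairs), n_pos)
```

Under this all-pairs assumption, the pair count is the number of phages times the number of hosts, so the first dataset in Table 3 would give 6734 × 117 = 787,878 pairs and the second 1306 × 187 = 244,222.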

The effect of using FCGR for whole-genome sequence representation

To investigate the role of FCGR in the proposed method, we randomly shuffled the topological order used to generate the FCGR while retaining the k-mer frequency information. We retrained the CNN with contrastive learning on the Inline graphic using the two FCGRs of Inline graphic and Inline graphic. All model settings and parameters were kept the same, and we examined the prediction accuracy on the Inline graphic test set. The accuracy of the model using the shuffled FCGR decreases slightly from 0.5726 to 0.5567, but remains higher than that of the other two learning-based methods. These results suggest that the specific topology of FCGR helps improve the performance of CL4PHI, while a certain proportion of the improvement comes from the 2D representation of the k-mer vectors used in the CNN model architecture.
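
The shuffling experiment can be pictured with a small sketch: build an FCGR grid of k-mer counts, then apply one fixed random permutation to the cell positions so that the k-mer frequencies are kept but the spatial layout is destroyed. The corner assignment for A, C, G and T and the bit ordering in `kmer_to_cell` are assumed conventions (several variants exist), and the helper names are hypothetical rather than taken from the CL4PHI code.

```python
import random
from typing import Dict, List, Tuple

# Assumed corner convention; other CGR layouts are equally valid.
CORNERS: Dict[str, Tuple[int, int]] = {
    "A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0),
}


def kmer_to_cell(kmer: str) -> Tuple[int, int]:
    """Map a k-mer to a (row, col) cell of a 2^k x 2^k grid."""
    row = col = 0
    for base in kmer:
        r, c = CORNERS[base]
        row, col = 2 * row + r, 2 * col + c
    return row, col


def fcgr(seq: str, k: int) -> List[List[int]]:
    """Count k-mers of seq into their FCGR cells."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(b in CORNERS for b in kmer):  # skip k-mers with N etc.
            row, col = kmer_to_cell(kmer)
            grid[row][col] += 1
    return grid


def shuffle_layout(grid: List[List[int]], seed: int = 0) -> List[List[int]]:
    """Permute cell positions with a fixed seed; the counts are unchanged."""
    size = len(grid)
    flat = [v for row in grid for v in row]
    order = list(range(len(flat)))
    random.Random(seed).shuffle(order)
    moved = [flat[j] for j in order]
    return [moved[r * size:(r + 1) * size] for r in range(size)]


original = fcgr("ACGTACGTAC", k=2)
shuffled = shuffle_layout(original, seed=0)
```

Using the same seed for every genome applies the same permutation to all inputs, so each k-mer still maps to a consistent position; only the neighbourhood structure that a CNN could exploit is removed.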

DISCUSSION

In this work, we proposed a simple yet effective contrastive learning method for predicting PHIs. We represented whole-genome sequences using frequency chaos game representation and learned embeddings to take account of PHIs with contrastive learning. Our experiment results show that the learned embeddings can take account of phage–host relationships. The proposed method can achieve a higher prediction accuracy on known hosts and be generalized to unseen hosts with a zero-shot prediction capability.

The contrastive learning applied in this study utilized samples derived from supervised labels. Hosts are used as the pivots in contrastive learning so that the learned embeddings encode features that distinguish infectable from non-infectable hosts based on the phages’ genome sequence representation. After learning embeddings that take account of PHIs, the PHI prediction problem can be transformed into a minimum distance problem. In the current work, only supervised data is used for training the contrastive learning model. Although the proposed method shows promising results, further improvement can be made by considering more training data, especially more unlabeled data. Additionally, while we have explored a simple Euclidean distance in the contrastive learning framework, other distance metrics such as d2*, which has proven successful in the alignment-free method [11], could be applied and further explored in the contrastive learning framework.
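
The minimum distance formulation mentioned above can be sketched as a nearest-neighbour search over host embeddings with Euclidean distance. The embeddings, their dimensionality and the function names here are made up for illustration; in practice the vectors would come from the trained encoder.

```python
import math
from typing import Dict, List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_hosts(
    phage_emb: Sequence[float],
    host_embs: Dict[str, Sequence[float]],
) -> List[str]:
    """Return host names sorted from nearest to farthest from the phage."""
    return sorted(host_embs, key=lambda h: euclidean(phage_emb, host_embs[h]))


# Toy 2D embeddings (hypothetical values).
phage = [0.0, 0.0]
hosts = {
    "host_A": [3.0, 4.0],
    "host_B": [1.0, 0.0],
    "host_C": [0.0, 2.0],
}

ranking = rank_hosts(phage, hosts)
best_host = ranking[0]
print(ranking)
```

Since the ranking only needs an embedding for each candidate host, a host that was never seen during training can still be scored, which is what makes zero-shot prediction possible in this formulation.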

For learning-based methods, the imbalance of training data presents a challenge for training a less biased model. This is especially true in PHI data, which are typically highly imbalanced. Previous methods employ various down-sampling techniques to balance the number of positive and negative samples used in training. Our method, CL4PHI, addresses the issue of data imbalance through the use of a contrastive loss with a margin. The margin serves to bound a phage’s infectable hosts while separating all the non-infectable hosts. For those non-infectable hosts with a phage distance larger than the margin, the information on the phylogenetic tree of prokaryotes is not considered. The contrastive loss function can be further refined by considering phylogenetic information.
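
The role of the margin can be seen in a commonly used form of the pairwise contrastive loss, in the style of reference [17]: positive pairs are penalized by their squared distance, and negative pairs are penalized only while they lie inside the margin. This is a sketch of that general form; the exact scaling and margin value used in CL4PHI are not stated here, so the function and its default margin are assumptions.

```python
def contrastive_loss(label: int, distance: float, margin: float = 1.0) -> float:
    """Pairwise contrastive loss with a margin (a common formulation).

    label = 1: infectable pair, pulled together (loss = d^2).
    label = 0: non-infectable pair, pushed apart until d >= margin.
    """
    if label == 1:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


# A positive pair at distance 0.5 is still penalized.
pos_loss = contrastive_loss(1, 0.5)
# A negative pair inside the margin is penalized.
near_neg_loss = contrastive_loss(0, 0.5)
# A negative pair beyond the margin contributes nothing.
far_neg_loss = contrastive_loss(0, 2.0)
print(pos_loss, near_neg_loss, far_neg_loss)
```

Negative pairs beyond the margin contribute zero loss, so the many easy non-infectable hosts stop influencing training once they are far enough away. This also shows why hosts outside the margin are treated identically regardless of their phylogenetic relationship.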

The taxonomic name of a host may change with taxonomy updates, which can result in synonymy issues across different datasets. For instance, Mycolicibacterium smegmatis previously belonged to the genus Mycobacterium. In Inline graphic, Mycolicibacterium is used, while in Inline graphic Mycobacterium is used. Such inconsistencies can pose problems and require extra post-processing when applying phage–host prediction models trained on different naming standards, if they rely solely on host labels. In contrast, methods like CHERRY and CL4PHI, which directly utilize host genome sequences, are less affected by these synonym issues.
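
For label-based methods, the extra post-processing mentioned above could be as simple as mapping outdated names to current ones before comparing predictions across datasets. The sketch below uses a hand-written synonym table with the single example from the text; the table and the function name are hypothetical, and a real pipeline would more likely draw synonyms from a taxonomy database.

```python
from typing import Dict

# Hypothetical synonym table: outdated name -> current name.
SPECIES_SYNONYMS: Dict[str, str] = {
    "Mycobacterium smegmatis": "Mycolicibacterium smegmatis",
}


def normalize_host_name(name: str, synonyms: Dict[str, str] = SPECIES_SYNONYMS) -> str:
    """Collapse whitespace and replace a known outdated name."""
    cleaned = " ".join(name.split())
    return synonyms.get(cleaned, cleaned)


old_label = normalize_host_name("Mycobacterium  smegmatis")
new_label = normalize_host_name("Mycolicibacterium smegmatis")
print(old_label == new_label)
```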

LIMITATION OF THIS STUDY

There are two limitations that need to be addressed to enhance the proposed method further. First, the current method solely learns PHIs based on whole-genome sequences of phages and their infectable hosts in the training data. Other information, such as the CRISPR database and protein-level PHI features (e.g. receptor-binding proteins), has yet to be explored in the proposed framework. Although CL4PHI has already demonstrated competitive results for both known hosts and unseen hosts, incorporating other information could lead to more accurate and comprehensive predictions. Additionally, ensembling the results with other complementary methods could also help address this limitation. Second, in this work, we trained and tested the proposed methods mainly based on two benchmark datasets updated to 2021. The number of phages and hosts included in this study may be limited. As new phages and PHIs are continually being identified, it would be beneficial to train the proposed model using additional high-quality phage and host whole-genome sequencing data.

CONCLUSION

In this work, we proposed a new modeling formalism to solve the challenging task of identifying PHIs. Different from previous work, we reformulated phage–host identification as a whole-genome sequence representation learning task, in which phage and host interactions are incorporated into the learned embeddings. For the whole-genome sequences of phages and hosts, we rephrased the k-mer information using FCGR and used contrastive learning to encode PHIs into the learned representation. Contrastive learning is used to bring together phages infecting the same hosts and push apart non-infectable hosts. Based on the learned embeddings, PHIs can be identified through distance calculation. We compared our method with two recently proposed state-of-the-art learning-based methods on their benchmark datasets. The experiment results demonstrate that the proposed method using contrastive learning improves the prediction accuracy on known hosts and demonstrates a zero-shot prediction capability on unseen hosts.

Key Points

  • We propose a novel modeling formalism for identifying PHIs as a task of whole-genome sequence representation. In the approach, we incorporate known phage–host relationships into the representation of genome sequences of phages and hosts. The PHIs are then identified through the distance calculation of the learned embeddings.

  • We utilize contrastive learning to incorporate phage–host relationships into the embeddings of whole-genome sequences of phages and hosts. Through this approach, phages that infect the same hosts become closer, while phages and non-infectable hosts are separated in the new representation space.

  • The proposed pipeline, named CL4PHI, demonstrates higher prediction accuracy on known hosts (seen in the training) and showcases a zero-shot prediction capability on unseen hosts, as evaluated on different datasets.

CODE AND DATA AVAILABILITY

Related code, dataset splits and trained models can be found at: https://github.com/yaozhong/CL4PHI.

FUNDING

This research was supported by Grant-in-Aid for Scientific Research (C) (JSPS KAKENHI Grant Number JP21K12104 to Y.Z.) from Japan Society for the Promotion of Science, the Japan Agency for Medical Research and Development (AMED) (to S.U.; 22fk0108619h0002 and 22ae0121040h0002, to K.F.; 22ae0121048h0002), the Japan Society for the Promotion of Science (JSPS) (to S.U.; 22H00477, to S.I.; 21H03538) and the Uehara Memorial Foundation (to S.I.).

ACKNOWLEDGEMENTS

The computing resources were provided by Human Genome Center, the Institute of Medical Science, the University of Tokyo. The authors would like to thank Dr. Yasumasa Kimura for his help on the metagenomic dataset and the anonymous reviewers for their valuable comments.

Supplementary Material

PS_BIB202305_Supplement_materials_bbad239

Yao-zhong Zhang is currently a project associate professor in the Division of Health Medical Intelligence, Human Genome Center, the Institute of Medical Science, the University of Tokyo.

Yunjie Liu is a PhD student in the Division of Health Medical Intelligence, Human Genome Center, the Institute of Medical Science, the University of Tokyo.

Zeheng Bai is a PhD student in the Division of Health Medical Intelligence, Human Genome Center, the Institute of Medical Science, the University of Tokyo.

Kosuke Fujimoto is an associate professor in the Department of Immunology and Genomics, Graduate School of Medicine, Osaka Metropolitan University, and a project associate professor in the Division of Metagenome Medicine, Human Genome Center, The Institute of Medical Science, The University of Tokyo.

Satoshi Uematsu is a professor in the Department of Immunology and Genomics, Graduate School of Medicine, Osaka Metropolitan University, and a project professor in the Division of Metagenome Medicine, Human Genome Center, The Institute of Medical Science, The University of Tokyo.

Seiya Imoto is a professor in the Division of Health Medical Intelligence and the director of Human Genome Center, the Institute of Medical Science, the University of Tokyo.

Contributor Information

Yao-zhong Zhang, Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan.

Yunjie Liu, Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan.

Zeheng Bai, Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan.

Kosuke Fujimoto, Department of Immunology and Genomics, Graduate School of Medicine, Osaka Metropolitan University, Asahi-machi 1-4-3, Abeno-ku, 545-8585 Osaka, Japan; Division of Metagenome Medicine, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan.

Satoshi Uematsu, Department of Immunology and Genomics, Graduate School of Medicine, Osaka Metropolitan University, Asahi-machi 1-4-3, Abeno-ku, 545-8585 Osaka, Japan; Division of Metagenome Medicine, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan.

Seiya Imoto, Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Shirokanedai 4-6-1, Minato-ku, 108-8639 Tokyo, Japan.

REFERENCES

  • 1. Dutilh BE, Cassman N, McNair K, et al. A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat Commun 2014;5(1):4498.
  • 2. Stern A, Mick E, Tirosh I, et al. CRISPR targeting reveals a reservoir of common phages associated with the human gut microbiome. Genome Res 2012;22(10):1985–94.
  • 3. Fouts DE. Phage_finder: automated identification and classification of prophage regions in complete bacterial genome sequences. Nucleic Acids Res 2006;34(20):5839–51.
  • 4. Pride DT, Wassenaar TM, Ghose C, Blaser MJ. Evidence of host-virus co-evolution in tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC Genomics 2006;7:1–13.
  • 5. Galiez C, Siebert M, Enault F, et al. Wish: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics 2017;33(19):3113–4.
  • 6. Amgarten D, Iha BKV, Piroupo CM, et al. vhulk, a new tool for bacteriophage host prediction based on annotated genomic features and deep neural networks. PHAGE 2022;3(4):204–12.
  • 7. Tan J, Fang Z, Shufang W, et al. Hophage: an ab initio tool for identifying hosts of phage fragments from metaviromes. Bioinformatics 2022;38(2):543–5.
  • 8. Coutinho FH, Zaragoza-Solas A, López-Pérez M, et al. Rafah: host prediction for viruses of bacteria and archaea based on protein content. Patterns 2021;2(7):100274.
  • 9. Ruohan W, Xianglilan Z, Jianping W, Cheng LIS. Deephost: phage host prediction with convolutional neural network. Brief Bioinform 2022;23(1):bbab385.
  • 10. Camacho C, Coulouris G, Avagyan V, et al. Blast+: architecture and applications. BMC Bioinformatics 2009;10:1–9.
  • 11. Ahlgren NA, Ren J, Lu YY, et al. Alignment-free d2* oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res 2017;45(1):39–53.
  • 12. Congyu L, Zhang Z, Cai Z, et al. Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics. BMC Biol 2021;19:1–11.
  • 13. Shang J, Sun Y. Cherry: a computational method for accurate prediction of virus-prokaryotic interactions using a graph encoder-decoder model. Brief Bioinform 2022;23(5):bbac182.
  • 14. Deschavanne PJ, Giron A, Vilain J, et al. Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol 1999;16(10):1391–9.
  • 15. Löchel HF, Heider D. Chaos game representation and its applications in bioinformatics. Comput Struct Biotechnol J 2021;19:6263–71.
  • 16. Jeffrey HJ. Chaos game representation of gene structure. Nucleic Acids Res 1990;18(8):2163–70.
  • 17. Chopra S, Hadsell R, LeCun Y. Learning a similarity metric discriminatively, with application to face verification. In: Hebert M, Kriegman D (eds). 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). San Diego, CA, USA: IEEE, 2005; Vol. 1, 539–46.
  • 18. Zrelovs N, Dislers A, Kazaks A. Motley crew: overview of the currently available phage diversity. Front Microbiol 2020;11:579452.
  • 19. Albrycht K, Rynkiewicz AA, Harasymczuk M, et al. Daily reports on phage-host interactions. Front Microbiol 2022;13:946070.
  • 20. Fujimoto K, Kimura Y, Shimohigoshi M, et al. Metagenome data on intestinal phage-bacteria associations aids the development of phage therapy against pathobionts. Cell Host Microbe 2020;28(3):380–9.
  • 21. Roux S, Enault F, Hurwitz BL, Sullivan MB. Virsorter: mining viral signal from microbial genomic data. PeerJ 2015;3:e985.
  • 22. Ren J, Ahlgren NA, Yang Young L, et al. Virfinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 2017;5:1–20.
  • 23. Gregor I, Dröge J, Schirmer M, et al. Phylopythias+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. PeerJ 2016;4:e1603.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
