Briefings in Bioinformatics. 2025 Dec 10;26(6):bbaf656. doi: 10.1093/bib/bbaf656

PBIP: a deep learning framework for predicting phage–bacterium interactions at the strain level

Lijia Ma 1, Peng Gao 2, Gufeng Liu 3, Yuan Bai 4, Qiuzhen Lin 5, Jianqiang Li 6, Minfeng Xiao 7
PMCID: PMC12694490  PMID: 41370631

Abstract

Phage therapy has received great attention as a promising antimicrobial treatment, and its core technique, namely predicting phage–bacterium interactions (PBIs), is crucial for understanding infection mechanisms and optimizing therapeutic strategies. However, existing computational methods mainly focus on the species or higher taxonomic levels, and usually neglect the potential of deep embedding representations, limiting their ability to capture complex biological patterns inherent in sequences. This hinders the discovery of rich sequence features, and restricts the clinical application of phage therapy. To address these limitations, we propose a novel deep learning framework (called PBIP) for strain-level PBI prediction. In PBIP, we first identify strain-level interactions through biological infection experiments and sequencing of Klebsiella pneumoniae isolated from the clinical environment of Xiangya Hospital. Then, we utilize a pretrained unified representation model to convert protein sequences of phages and bacteria into deep embeddings. Next, we apply the synthetic minority oversampling technique to generate positive interactions in the embedding space to address the data imbalance issue. Subsequently, we design a deep neural network that uses a convolutional neural network to extract local features, a bi-directional gated recurrent unit to capture global features, and an attention module to highlight significant features. Finally, a fully connected layer integrates this information for PBI prediction. Experimental results show the superiority of PBIP over the state-of-the-art methods in predicting PBIs. The code and datasets are available at https://github.com/a1678019300/PBIP.

Keywords: deep learning, protein representation learning, phage–bacterium interactions, attention mechanism

Introduction

Phages are viruses that infect bacteria and replicate within them [1, 2], playing vital roles in regulating bacterial populations and maintaining ecological balance in various environments [3]. Moreover, the potential of phages as an alternative to antibiotics has gained increasing attention, forming the basis of phage therapy [4, 5]. The key step in implementing phage therapy is to identify effective phages that can interact with specific bacterial hosts and lyse them. Therefore, predicting phage–bacterium interactions (PBIs), especially at the strain level, is crucial for targeted therapeutic applications [6]. However, traditional experimental approaches for verifying PBIs are time-consuming and costly, which has prompted the development of computational methods to improve efficiency.

The existing computational methods for PBI prediction are mainly categorized into alignment-based and learning-based methods. The alignment-based methods predict PBIs by assessing the similarity between phages and bacteria. For example, HostPhinder [7] predicts PBIs based on $k$-mer similarity between the query phages and known host phages, while VPF-Class [8] estimates similarity by using viral protein families. Moreover, VirHostMatcher [9] predicts interaction likelihood by analyzing oligonucleotide frequency similarity between phages and bacteria, while PHIST [10] employs $k$-mer comparison to evaluate genomic similarity between phages and potential bacterial hosts. Additionally, certain methods [11–13] leverage the CRISPR system for interaction prediction, but these methods are only suitable for a subset of bacteria.

The learning-based methods utilize machine learning and deep learning to model complex PBIs. For instance, WIsH [14] trains Markov models for the candidate hosts to estimate the likelihood of their interactions with phages, while Leite et al. [15, 16] extract hand-crafted protein features and apply machine learning models such as K-Nearest Neighbors (KNN) [17], Random Forest (RF) [18], Logistic Regression (LR) [19], Support Vector Machine (SVM) [20], Naive Bayes (NB) [21], and Artificial Neural Networks (ANN) [22]. Recently, deep learning has shown promising performance in related tasks such as drug–target interaction prediction [23, 24], miRNA-disease association prediction [25, 26], and drug–drug interaction prediction [27, 28]. Among various deep learning architectures, convolutional neural networks (CNNs) have been used for biological interaction prediction due to their ability to capture rich local features through convolution operations. For example, PredPHI [29] and PHIAF [30] obtain hand-crafted genome and protein features from phages and bacteria, and use CNNs to build prediction models, while CL4PHI [31] employs frequency chaotic game representations as features. Moreover, graph neural networks have been applied to PBIs by constructing knowledge graphs from phage and bacterium sequences, enabling efficient propagation of biological features between nodes. For example, HostG [32], CHERRY [33], and PGCN [34] focus on constructing knowledge graphs, while PHPGCA [35], GERMAN-PHI [36], and PHISGAE [37] aim to address the sparsity of knowledge graphs.

However, the existing computational methods still have limitations. First, methods that rely on hand-crafted features extracted from phage and bacterial sequences often fail to reveal the complex biological patterns underlying those sequences. Second, CNN-based methods cannot effectively capture global information about PBIs. Third, since most phages infect only specific strains within a bacterial species [38], species-level prediction methods tend to assign multiple candidate phages to a given bacterial strain without identifying which phage–strain pairs actually result in infection.

In deep learning, the unified representation (UniRep) model [39] has been shown to effectively learn deep representations of protein sequences while preserving their physico-chemical and structural properties. Moreover, CNNs excel at extracting local feature maps from sequences through convolution operations [40], while the bi-directional gated recurrent unit (Bi-GRU) effectively captures long-term dependencies through a gating mechanism [41]. Furthermore, to address data imbalance, the synthetic minority oversampling technique (SMOTE) [42] is widely employed to generate minority class samples. Additionally, the attention mechanism enables deep neural networks to focus on task-relevant features, thereby improving performance.

In this article, inspired by the capabilities of UniRep, SMOTE, CNN, Bi-GRU, and the attention mechanism, we propose phage–bacterium interactions predictor (PBIP). Specifically, PBIP employs a protein representation model to transform phage and bacterial protein sequences into deep embeddings. The CNN and Bi-GRU modules are applied to capture local and global sequence features, while the attention mechanism emphasizes informative features. In addition, SMOTE is adopted to address data imbalance. The main contributions are summarized as follows:

  • We propose PBIP, a strain-level PBIs predictor that integrates deep protein sequence embeddings, enriched local and global sequence features, and the attention mechanism to enhance the prediction accuracy.

  • In PBIP, we first identify the strain-level interactions through biological infection experiments and sequencing on Klebsiella pneumoniae isolated from the clinical environment of Xiangya Hospital (see our work [43]). Then, we use the pretrained UniRep model to extract protein sequence features with biological significance. Subsequently, we apply SMOTE to augment positive interactions and mitigate data imbalance in the embedding space. After that, we design a deep learning model that integrates a CNN module for local feature extraction, a Bi-GRU module for capturing long-term dependencies in both forward and backward directions, and an attention mechanism to emphasize important features. Finally, we employ a fully connected layer with a sigmoid activation function for PBI prediction.

  • Experiments on the strain-level and species-level datasets show that PBIP outperforms the state-of-the-art methods. Moreover, the case studies on test set imbalance ratios and sequence similarity validate the robustness and generalizability of PBIP, while the ablation studies show the effectiveness of its various components for PBI prediction.

The remainder of this article is organized as follows: Section “Materials and methods” presents the problem formulation, datasets, technical details of PBIP, and performance metrics. Section “Results and discussion” describes the baseline settings, experimental results, and discussions. Finally, Section “Conclusion” concludes the study and outlines potential future research directions.

Materials and methods

In this section, we first introduce the representation of protein sequences and the PBI prediction problem, and then describe the datasets used in this study. Then, we present a novel deep learning framework, namely PBIP, for strain-level PBI prediction. Finally, we provide the performance evaluation metrics.

Protein sequence representation and problem formulation

Protein sequence representation

In phage–bacterium systems, interactions are mainly mediated by receptor-binding proteins (RBPs) located on the phage surface, which specifically recognize and bind to designated receptors on the bacterial cell envelope [29, 44, 45]. Consequently, the protein sequences encoded by both phages and their bacterial hosts contain rich interaction-related signals, making them a valuable foundation for computational modeling.

In this study, a protein sequence is defined as $S = (s_1, s_2, \ldots, s_L)$, where each element $s_i$ ($1 \le i \le L$) corresponds to 1 of the 20 standard amino acids {A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V}, and $L$ denotes the length of the sequence.

Phage–bacterium interaction prediction

PBIs refer to the biological processes of phages infecting bacterial hosts, including stages such as adsorption, injection, replication, and release [46].

We formulate the PBI prediction as a binary classification task. More specifically, given a phage–bacterium pair $(P, B)$, where $P = \{S_1^{P}, S_2^{P}, \ldots, S_{n_p}^{P}\}$ and $B = \{S_1^{B}, S_2^{B}, \ldots, S_{n_b}^{B}\}$ represent the sets of protein sequences for the phage and bacterium, respectively, and $n_p$ and $n_b$ denote the number of proteins in the corresponding organisms, the PBI task is to predict the interaction label $y \in \{0, 1\}$. Here, $y = 1$ indicates a positive interaction (i.e. the phage infects the bacterium), while $y = 0$ indicates a negative interaction (i.e. no infection occurs).

Dataset

Strain-level interaction dataset

We construct a strain-level interaction dataset based on experimental results obtained from high-throughput and automated phage–bacterium characterization systems, with details of the datasets and experiments available in our work [43]. Specifically, we isolate 125 Klebsiella pneumoniae strains from clinical samples collected at Xiangya Hospital, and enrich 104 phages from hospital sewage through centrifugation and membrane filtration. We confirm the infectivity of each phage by measuring plaque-forming units, i.e. the number of clear zones formed on agar plates due to bacterial lysis. Each interaction is tested using a double-layer agar assay, and images of the resulting plaques are analyzed with an automated plaque identification algorithm, which calculates plaque characteristics such as radius, area, and transmissivity. The algorithm generates a composite score, and interactions with a score >1.5 are classified as positive. Each phage–bacterium pair is tested in three independent experiments, and interactions with at least two positive replicates are defined as positive, whereas all others are considered negative.
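The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the threshold (1.5) and the two-of-three rule come from the text, while the function and constant names are hypothetical.

```python
from typing import Sequence

SCORE_THRESHOLD = 1.5          # composite plaque score cut-off stated in the text
MIN_POSITIVE_REPLICATES = 2    # at least two of the three replicates must be positive


def label_interaction(replicate_scores: Sequence[float]) -> int:
    """Return 1 (positive) or 0 (negative) for one phage-bacterium pair."""
    # A replicate counts as positive only if its score is strictly above the threshold.
    positives = sum(score > SCORE_THRESHOLD for score in replicate_scores)
    return 1 if positives >= MIN_POSITIVE_REPLICATES else 0


print(label_interaction([2.1, 1.7, 0.3]))  # two replicates exceed 1.5
print(label_interaction([2.1, 1.5, 0.3]))  # 1.5 itself is not above the threshold
```

Whether a score of exactly 1.5 counts as positive is not stated beyond ">1.5", so the strict comparison here follows the wording literally.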

After integrating the results of independent replicates, the final interaction matrix contains 125 bacterial strains and 104 phages, yielding 13 000 phage–bacterium pairs. Among them, 938 (accounting for 7.22% of all entries) are positive interactions, reflecting a high degree of resistance within the bacterial population. On average, each bacterial strain is susceptible to about 7.50 phages (938/125), and each phage can infect about 9.02 bacterial strains (938/104).

We perform whole-genome sequencing and assembly for all bacterial strains and phages, followed by gene annotation using GeneMarkS [47] to obtain coding protein sequences. To ensure reliable interaction analysis, we remove five bacterial strains that exhibit complete resistance to all tested phages. The resulting dataset contains 104 phages and 120 bacterial strains, corresponding to 12 480 valid phage–bacterium records. Moreover, to minimize the risk of data leakage, we discard negative interactions involving phages with substantial overlap with the phages in the test set, and use the remaining interactions for model training. An overview of the dataset is provided in Table 1, and Fig. 1A illustrates the numbers of phages and bacteria in the training and test sets, distinguished by different colors.

Table 1. Detailed information of the strain-level training and test sets

| Dataset | Number of positive interactions | Number of negative interactions | Total number of interactions |
|---|---|---|---|
| Training set | 747 | 9213 | 9960 |
| Test set | 191 | 191 | 382 |

Figure 1. The strain-level interaction dataset, showing (A) the numbers of phages and bacterial strains in the training and test sets, and (B) the distribution of similarity scores between phages in the training and test sets (average 0.60, median 0.44).

We evaluate the overall similarity between the training and test sets following prior studies [33, 35]. Specifically, we use Dashing [48] to compute sequence similarities, and for each phage in the test set, we record its highest similarity score with any phage in the training set. The distribution of these scores is shown in Fig. 1B, with an average similarity of 0.60 and a median of 0.44; only a small proportion of test phages display substantial similarity to training phages. We therefore consider this split of the strain-level interaction dataset appropriate for training and evaluation.
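The per-phage summary used here (highest similarity to any training phage, then the mean and median over test phages) can be sketched as follows. The sketch assumes pairwise scores have already been computed by an external tool and stored in a dictionary; the function name, the dictionary layout, and the toy scores are all hypothetical and are not values from the study.

```python
from statistics import mean, median


def max_similarity_to_training(test_phages, train_phages, similarity):
    """Map each test phage to its highest similarity with any training phage.

    `similarity` maps (test_id, train_id) to a score; missing pairs count as 0.
    """
    train_phages = list(train_phages)
    return {
        t: max((similarity.get((t, r), 0.0) for r in train_phages), default=0.0)
        for t in test_phages
    }


# Hypothetical toy scores for illustration only
similarity = {
    ("t1", "a"): 0.75, ("t1", "b"): 0.25,
    ("t2", "a"): 0.125, ("t2", "b"): 0.5,
}
best = max_similarity_to_training(["t1", "t2"], ["a", "b"], similarity)
scores = list(best.values())
print(best)
print(mean(scores), median(scores))
```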

Species-level interaction dataset

To evaluate the performance of PBIP and baseline methods in predicting PBIs at the species level, we adopt the benchmark dataset used in PredPHI [29] with 3449 phages and 301 bacterial species collected from the NCBI RefSeq database, and split the data into training and test sets according to their submission date. More specifically, following PredPHI, we use the interactions between phages and bacteria submitted before 2016 as the independent test set, and adopt those submitted in or after 2016 as the training set. This split yields 2851 positive interactions for training and 618 for testing. In both sets, negative interactions are randomly selected from all unverified phage–bacterium pairs, and the number of negative samples is matched to that of positive samples to address data imbalance. An overview of the dataset is presented in Table 2, while Fig. 2A shows the distribution of phages and bacteria in the training and test sets.
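The negative sampling step can be illustrated with a short sketch. It assumes that an "unverified" pair is any phage–host combination that does not appear among the known positives; the function name and the fixed seed are hypothetical choices made for reproducibility of the example.

```python
import random
from itertools import product


def sample_negatives(phages, hosts, positives, seed=0):
    """Draw as many unverified (phage, host) pairs as there are positive pairs."""
    positive_set = set(positives)
    # Every combination that is not a known positive is a candidate negative.
    candidates = [pair for pair in product(phages, hosts) if pair not in positive_set]
    if len(candidates) < len(positive_set):
        raise ValueError("not enough unverified pairs to match the positives")
    rng = random.Random(seed)
    return rng.sample(candidates, len(positive_set))


phages = ["p1", "p2", "p3"]
hosts = ["h1", "h2"]
positives = [("p1", "h1"), ("p2", "h2")]
negatives = sample_negatives(phages, hosts, positives)
print(negatives)
```

Because such negatives are not experimentally confirmed, some of them may be true but unobserved interactions, which is one reason the strain-level dataset with validated negatives is treated separately.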

Table 2. Detailed information of the species-level training and test sets

| Dataset | Number of positive interactions | Number of negative interactions | Total number of samples |
|---|---|---|---|
| Training set | 2851 | 2851 | 5702 |
| Test set | 618 | 618 | 1236 |

Figure 2. The species-level interaction dataset, showing (A) the numbers of phages and bacteria in the training and test sets, and (B) the distribution of similarity scores between phages in the training and test sets (average 0.12, median 0.06).

Similar to the strain-level dataset, sequence similarity between phages in the training and test sets is assessed using Dashing [48]. The similarity distribution is shown in Fig. 2B. Only a small fraction of test phages show substantial similarity to those in the training set, with an average similarity of 0.12 and a median of 0.06, confirming that the dataset split is suitable for model training and evaluation.

Phage–bacterium interactions framework

The model architecture of PBIP, as shown in Fig. 3, consists of four key components: (i) data preparation (see Fig. 3A), (ii) protein sequence embedding (see Fig. 3B), (iii) data augmentation (see Fig. 3C), and (iv) deep learning model (see Fig. 3D). More specifically, PBIP first obtains strain-level interactions as described in Section “Dataset”. Then, PBIP utilizes the pretrained UniRep model to generate deep embeddings of proteins. Next, to address the data imbalance issue, PBIP applies SMOTE to augment the positive interactions in the embedding space. Subsequently, the protein sequence embeddings are fed into a CNN module to capture local feature maps and a Bi-GRU module to extract long-term dependencies in both forward and backward directions. The resulting feature vectors are then processed by an attention module to emphasize informative features. Finally, the output vectors are fed into a fully connected layer with a sigmoid activation function to integrate phage and bacterial information for interaction prediction.

Figure 3. Overview of the PBIP framework for predicting strain-level PBIs. (A) Collecting strain-level data. (B) Generating protein sequence embeddings using deep representation learning. (C) Utilizing SMOTE to alleviate data imbalance in the embedding space. (D) Developing a deep learning model for predicting PBIs.

Protein sequence embedding

Many methods for predicting PBIs rely on hand-crafted features, which often fail to capture complex biological patterns in protein sequences. To overcome this limitation, we employ UniRep [39], a pretrained protein representation model that encodes sequences into 1900-dimensional embeddings while preserving physico-chemical and structural properties. UniRep leverages a multiplicative LSTM (mLSTM) [49] trained to predict the next amino acid, thereby learning contextualized representations from raw sequences. The embedding process is illustrated in Fig. 4.

Figure 4. Overview of generating a protein sequence embedding using UniRep: an mLSTM produces a contextualized hidden state for each amino acid residue, and the final embedding is the average of these hidden states.

Formally, given a protein sequence of a phage or bacterium $S = (s_1, s_2, \ldots, s_L)$ with length $L$, where each element $s_i$ ($1 \le i \le L$) represents 1 of the 20 amino acids, we compute the embedding representation as follows. First, the protein sequence $S$ is one-hot encoded into a matrix $X \in \{0, 1\}^{L \times 20}$ as follows:

$$X = [x_1, x_2, \ldots, x_L]^{\top} \tag{1}$$

where $x_i$ is a binary vector of length 20 representing the amino acid $s_i$ in the sequence $S$. The element $x_{i,j}$, $1 \le j \le 20$, in the $j$th position of $x_i$ is set to 1 if it corresponds to $s_i$, and 0 otherwise.
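As a concrete illustration of this one-hot step, the following sketch encodes a short sequence into an L x 20 matrix. It is a simplified stand-in, not UniRep's own preprocessing: the column order follows the amino acid list given earlier in the text, rejecting non-standard residues is an assumption of this example, and the function name is hypothetical.

```python
AMINO_ACIDS = "ARNDCEQGHILKMFPSTWYV"  # the 20 standard residues, in the order listed in the text
INDEX = {aa: j for j, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an L x 20 one-hot matrix (list of rows) for a protein sequence."""
    matrix = []
    for residue in sequence:
        if residue not in INDEX:
            # Assumption of this sketch: anything outside the 20 standard residues is an error.
            raise ValueError(f"non-standard residue: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[residue]] = 1   # exactly one position per row is set
        matrix.append(row)
    return matrix


X = one_hot_encode("MKV")
print(len(X), len(X[0]))  # number of residues, alphabet size
```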

Next, UniRep employs an embedding layer to convert the one-hot encoding matrix $X$ into a continuous representation matrix $E \in \mathbb{R}^{L \times d}$, where $d$ is the embedding dimension. This embedding layer is parameterized by a weight matrix $W_e \in \mathbb{R}^{20 \times d}$, where each row corresponds to the continuous representation of an amino acid. In UniRep, $W_e$ is initialized with randomly generated values and is updated during training. Specifically, each one-hot vector $x_i$ in $X$ is replaced with its corresponding continuous vector from $W_e$, as shown below:

$$E = X W_e = [e_1, e_2, \ldots, e_L]^{\top} \tag{2}$$

where $e_i$ represents the continuous embedding of the amino acid $s_i$, i.e. the row of $W_e$ selected by $x_i$.

Subsequently, UniRep employs an mLSTM with 1900 hidden units to process the sequence embeddings and generate contextualized states $h_1, h_2, \ldots, h_L$ with $h_i \in \mathbb{R}^{1900}$, and the final protein embedding $u \in \mathbb{R}^{1900}$ is obtained by averaging them:

$$u = \frac{1}{L} \sum_{i=1}^{L} h_i \tag{3}$$

Details of the processes of mLSTM in UniRep can be found in the Supplementary data.

We employ the officially released pretrained UniRep weights (https://github.com/churchlab/UniRep), which include both the embedding-layer weight matrix $W_e$ and the mLSTM parameters. Using these pretrained weights ensures the preservation of biologically meaningful features.

Since each phage or bacterium contains multiple proteins, we compute organism-level embeddings by averaging the corresponding protein embeddings. Specifically, we let $u_k^{P}$ and $u_k^{B}$ denote the UniRep embeddings of the $k$th protein in a phage and bacterium, respectively. The organism-level protein embeddings are calculated as follows:

$$v_P = \frac{1}{n_p} \sum_{k=1}^{n_p} u_k^{P}, \qquad v_B = \frac{1}{n_b} \sum_{k=1}^{n_b} u_k^{B} \tag{4}$$

where $n_p$ and $n_b$ denote the number of protein sequences in the phage and bacterium, respectively. The combined embedding for a phage–bacterium pair $(P, B)$ is represented as $(v_P, v_B)$.
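The organism-level averaging can be sketched with plain Python lists. The example uses 3-dimensional toy vectors in place of 1900-dimensional UniRep embeddings, and the function names are hypothetical; the same element-wise mean also describes how per-residue hidden states are pooled into one protein embedding.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def pair_embedding(phage_protein_embeddings, bacterium_protein_embeddings):
    """Average protein embeddings per organism and return the (phage, bacterium) pair."""
    return (
        mean_vector(phage_protein_embeddings),
        mean_vector(bacterium_protein_embeddings),
    )


# Toy embeddings (dimension 3 instead of 1900), for illustration only
phage = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
bacterium = [[0.0, 0.0, 6.0], [0.0, 3.0, 0.0], [3.0, 0.0, 0.0]]
v_p, v_b = pair_embedding(phage, bacterium)
print(v_p, v_b)
```

One consequence of this design is that organisms with very different numbers of proteins map to vectors of the same size, at the cost of discarding which protein contributed which signal.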

Data augmentation

The strain-level dataset exhibits a significant imbalance between positive and negative phage–bacterium pairs. Several techniques exist for handling such imbalance; here, SMOTE [42] is adopted due to its simplicity and its effectiveness in related bioinformatics prediction tasks, such as phage virion protein prediction [50] and cell wall lytic enzyme prediction [51]. More specifically, we apply SMOTE to generate synthetic positive interactions and balance the training set in the embedding space, while retaining biologically validated negative interactions. Importantly, the test set remains unchanged: no synthetic samples are added to it, so all of its positive interactions are experimentally validated. The SMOTE workflow is illustrated in Fig. 5.

Figure 5. Overview of generating a synthetic sample using SMOTE in the embedding space, by linear interpolation between an original positive sample and one of its nearest neighbors.

Given an original positive sample $q$ and one of its nearest neighbors $q_{nn}$, SMOTE creates a synthetic sample $q_{new}$ in the embedding space by linear interpolation:

$$q_{new} = q + \lambda \, (q_{nn} - q) \tag{5}$$

where $\lambda \in [0, 1]$ is a random coefficient. The number of synthetic samples is selected to ensure that the number of positive and negative samples is equal in the training set.
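The interpolation step can be illustrated with a small pure-Python sketch of SMOTE. It is not the implementation used in PBIP: the neighbor count `k=5` (the default in the original SMOTE formulation), the Euclidean distance, the seed, and all function names are assumptions of this example. In practice a library implementation such as `imblearn.over_sampling.SMOTE` would normally be used.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(positives, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between positive samples
    and one of their k nearest positive neighbors."""
    if len(positives) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(positives) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(positives))
        base = positives[i]
        # Indices of the k nearest neighbors of `base`, excluding `base` itself.
        neighbors = sorted(
            (j for j in range(len(positives)) if j != i),
            key=lambda j: euclidean(base, positives[j]),
        )[:k]
        neighbor = positives[rng.choice(neighbors)]
        lam = rng.random()  # interpolation coefficient drawn from [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


toy_positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(toy_positives, n_synthetic=10, k=2, seed=1)
print(len(new_points))
```

For the strain-level training set, exact balancing would require 9213 - 747 = 8466 synthetic positives, each lying on a line segment between two real positive embeddings.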

Deep learning model

In this section, we present the proposed deep learning model for predicting PBIs. The model takes as input phage–bacterium protein embedding pairs $(v_P, v_B)$, i.e. the organism-level embeddings of the phage and the bacterium derived from the pretrained UniRep. These embeddings are first processed through a CNN module to extract local features and then fed into a Bi-GRU module to capture dependencies in both forward and backward directions. Next, the resulting feature vectors are passed through an attention module to emphasize critical features. Finally, the output is fed into a fully connected layer with a sigmoid activation function to predict interaction probabilities. The overall architecture is illustrated in Fig. 6.

Figure 6. Overview of the proposed deep learning model architecture: protein embedding pairs pass through a CNN for local feature extraction, a Bi-GRU for bidirectional dependencies, an attention mechanism that weights critical features, and a fully connected layer with sigmoid activation that outputs an interaction probability.

Convolutional neural network module

The CNN module consists of four 1D convolutional layers with max-pooling to extract local feature patterns. Each convolution layer computes feature maps by applying convolution filters to the input embeddings. The operation is defined as:

$$c_j = W_j \ast v + b_j, \qquad C = [c_1, c_2, \ldots, c_m] \tag{6}$$

where $v$ denotes either the phage protein embedding $v_P$ or the bacterium protein embedding $v_B$, $W_j$ and $b_j$ are the weights and bias of the $j$th convolutional kernel, respectively, and $\ast$ denotes the convolution operation. The output $c_j$ represents the $j$th local feature, and $C$ denotes the resulting feature map composed of $m$ feature patterns.

A 1D max-pooling layer follows each convolutional layer to reduce the dimensionality of the feature map while preserving important features by selecting the maximum value over a sliding window. Moreover, a dropout layer is introduced after the CNN module to prevent overfitting by randomly setting a fraction of the unit activations to zero during training [52].
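To make the convolution and pooling operations concrete, the sketch below applies one kernel and one pooling step to a short toy signal. It assumes stride 1, no padding, and non-overlapping pooling windows, none of which are stated in the text; the function names are hypothetical. As in most deep learning libraries, the "convolution" is implemented as a sliding dot product (cross-correlation).

```python
def conv1d_valid(signal, kernel, bias=0.0):
    """Sliding dot product of `kernel` over `signal` (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(signal, pool_size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(signal[i:i + pool_size])
        for i in range(0, len(signal) - pool_size + 1, pool_size)
    ]


x = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0]
features = conv1d_valid(x, kernel=[1.0, 0.0, -1.0])
pooled = max_pool1d(features, pool_size=2)
print(features)  # [-1.0, -2.0, -2.0, 5.0]
print(pooled)    # [-1.0, 5.0]
```

A kernel of size 3 shortens the signal by 2, and pooling with size 2 halves it, which is why stacking several such layers quickly compresses a long input.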

These local features learned by the CNN may capture biologically relevant short motifs within protein sequences [53], such as receptor-binding motifs in phage tail proteins. These patterns play key roles in phage adsorption and host recognition, and are informative for predicting infectivity.

Bi-directional gated recurrent unit module

Following the CNN module, a Bi-GRU is employed to capture long-term dependencies from both forward and backward directions. The internal structure of the GRU cell is presented in Fig. 7.

Figure 7. The architecture of the GRU cell, showing how the input and the previous hidden state pass through the reset gate, update gate, and memory content to produce the new hidden state.

Formally, given the local feature map extracted by the CNN, denoted as $C$, the Bi-GRU processes the sequence bidirectionally. At each time step $t$, with input $c_t$ taken from $C$, the GRU computes:

$$
\begin{aligned}
r_t &= \sigma(W_r c_t + U_r h_{t-1} + b_r), \\
z_t &= \sigma(W_z c_t + U_z h_{t-1} + b_z), \\
\tilde{h}_t &= \tanh\big(W_h c_t + U_h (r_t \odot h_{t-1}) + b_h\big), \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
\tag{7}
$$

where $r_t$, $z_t$, $\tilde{h}_t$, and $h_t$ represent the reset gate, update gate, memory content, and hidden state at time step $t$, respectively. $W_r$, $W_z$, and $W_h$ are the weights of the input $c_t$; $U_r$ and $U_z$ are the weights of the previous hidden state $h_{t-1}$, while $U_h$ is the weight of the reset-gated state $r_t \odot h_{t-1}$. $b_r$, $b_z$, and $b_h$ are the corresponding biases for the reset gate, update gate, and memory content, respectively. $\odot$ denotes element-wise multiplication, and $\sigma$ and $\tanh$ represent the sigmoid and hyperbolic tangent functions, respectively. Details of the GRU computation process can be found in the Supplementary data.

Bi-GRU aggregates information from both directions by concatenating the forward and backward hidden states at each step:

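$$h_t = \overrightarrow{h}_t \oplus \overleftarrow{h}_t \tag{8}$$

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the forward and backward hidden states at step $t$, and $\oplus$ denotes the concatenation operation.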
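A standard GRU step and the bidirectional concatenation can be sketched in pure Python as below. This follows one common GRU formulation, in which the new state is `(1 - z) * h_prev + z * candidate`; some libraries swap the roles of `z` and `1 - z`, and the text does not say which convention PBIP's framework uses. The parameter dictionary layout and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x_t, h_prev, params):
    """One GRU update. `params` holds W_g, U_g (matrices) and b_g (vectors) for g in r, z, h."""
    def affine(gate, hidden_input):
        wx = matvec(params["W_" + gate], x_t)
        uh = matvec(params["U_" + gate], hidden_input)
        return [a + b + c for a, b, c in zip(wx, uh, params["b_" + gate])]

    r = [sigmoid(v) for v in affine("r", h_prev)]         # reset gate
    z = [sigmoid(v) for v in affine("z", h_prev)]         # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]        # r_t * h_{t-1}, element-wise
    h_tilde = [math.tanh(v) for v in affine("h", gated)]  # memory content
    # Blend the previous state with the candidate state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_tilde)]


def run_gru(sequence, params, hidden_size):
    h = [0.0] * hidden_size
    states = []
    for x_t in sequence:
        h = gru_step(x_t, h, params)
        states.append(h)
    return states


def bi_gru(sequence, forward_params, backward_params, hidden_size):
    """Concatenate forward and backward hidden states at every time step."""
    forward = run_gru(sequence, forward_params, hidden_size)
    backward = run_gru(sequence[::-1], backward_params, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]  # list concatenation


def zero_params(input_size, hidden_size):
    params = {}
    for gate in "rzh":
        params["W_" + gate] = [[0.0] * input_size for _ in range(hidden_size)]
        params["U_" + gate] = [[0.0] * hidden_size for _ in range(hidden_size)]
        params["b_" + gate] = [0.0] * hidden_size
    return params


# With all-zero parameters both gates equal 0.5 and the candidate is 0,
# so the state is simply halved.
print(gru_step([1.0, 2.0, 3.0], [0.4, -0.2], zero_params(3, 2)))
```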

The Bi-GRU captures long-term dependencies across the protein sequence [54], enabling the extraction of global features that complement local motifs learned by CNN. Such features help identify PBIs formed by distant but functionally coordinated residues within RBPs.

Attention module

To emphasize informative features essential for PBI prediction, an attention mechanism is applied after the Bi-GRU module. It computes a weighted sum of all hidden states, where higher weights are assigned to more relevant features:

$$a = \sum_{t=1}^{T} \alpha_t h_t \tag{9}$$

where $T$ is the number of time steps and $\alpha_t$ denotes the attention weight with $\sum_{t=1}^{T} \alpha_t = 1$. The output vector $a$ reflects the relative importance of each feature in the sequence.
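The weighted sum can be illustrated with a minimal attention-pooling sketch. How PBIP scores each hidden state is not specified here, so the example assumes the simplest option: a dot product with a learned scoring vector followed by a softmax. The scoring vector, function names, and toy values are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, score_vector):
    """Return the attention-weighted sum of hidden states and the weights used."""
    scores = [sum(w * h for w, h in zip(score_vector, h_t)) for h_t in hidden_states]
    weights = softmax(scores)             # non-negative and summing to 1
    dim = len(hidden_states[0])
    context = [
        sum(a * h_t[d] for a, h_t in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return context, weights


states = [[1.0, 0.0], [0.0, 1.0]]
context, weights = attention_pool(states, score_vector=[0.0, 0.0])
print(weights)   # equal scores give equal weights
print(context)
```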

Phage–bacterium interaction prediction

The representations of phage and bacterium, $a_P$ and $a_B$, obtained through CNN, Bi-GRU, and attention modules, capture local features, long-term dependencies, and key interaction-relevant features. These are concatenated and fed into a fully connected layer to compute the interaction probability:

$$\hat{y} = \sigma\big(W_f \, (a_P \oplus a_B) + b_f\big) \tag{10}$$

where $\sigma$ is the sigmoid function, $\oplus$ denotes concatenation, and $W_f$ and $b_f$ are the weight and bias of the fully connected layer, respectively.

Model training

The binary cross-entropy loss function is employed to compute the error between the interaction scores predicted by PBIP and the true labels, defined as:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \big] \tag{11}$$

where $y_i$ represents the true label of the $i$th phage–bacterium pair (0 for negative and 1 for positive interaction), $\hat{y}_i$ is the predicted probability of a positive interaction, and $N$ is the total number of samples.
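For reference, the binary cross-entropy loss can be computed directly as in the sketch below. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero; deep learning frameworks apply a similar safeguard internally. The function name is hypothetical.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to [eps, 1 - eps]."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# An uninformative prediction of 0.5 costs ln(2) per sample.
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
# Confident, correct predictions cost much less.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
```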

The AMSGrad optimizer [55] is utilized to minimize the loss function. Model hyperparameters are summarized in Table 3. PBIP is implemented in Python 3.8.10 with Keras 2.8.0 and trained on an NVIDIA RTX 3090 GPU.

Table 3. Network module parameters and training hyperparameters

| Network module | Parameter | Value |
|---|---|---|
| 1D-CNN | Layer number | 4 |
| | Kernel size | 3 |
| | Pooling size | 2 |
| | Filter number | [32, 64, 128, 256] |
| | Dropout rate | 0.5 |
| Bi-GRU | Hidden size | 64 |
| | Dropout rate | 0.5 |
| Attention | Neurons | 32 |
| Training hyperparameters | Learning rate | 3e-4 |
| | Batch size | 16 |
| | Epoch | 200 |
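To see what the CNN settings in Table 3 imply for tensor shapes, the sketch below traces the sequence length through four convolution and pooling stages. It rests on assumptions that the text does not confirm: the 1900-dimensional embedding is treated as a length-1900 sequence, convolutions use stride 1 without padding, and pooling windows do not overlap. The function name is hypothetical.

```python
def cnn_output_lengths(input_length, n_layers=4, kernel_size=3, pool_size=2):
    """Sequence length after each conv + max-pool stage (stride 1, no padding)."""
    lengths = []
    length = input_length
    for _ in range(n_layers):
        length = length - kernel_size + 1  # unpadded convolution
        length = length // pool_size       # non-overlapping pooling, remainder dropped
        lengths.append(length)
    return lengths


print(cnn_output_lengths(1900))  # [949, 473, 235, 116]
```

Under these assumptions the Bi-GRU would receive 116 time steps, each with 256 channels from the last convolutional layer; with padded convolutions the lengths would instead be 950, 475, 237, and 118.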

Performance evaluation metrics

To evaluate the performance of PBIP and baseline methods, eight evaluation metrics are used, namely Accuracy, Precision, Sensitivity (Recall), Specificity, F1-score, Matthews correlation coefficient (MCC), Area under the receiver operating characteristic curve (AUC), and Area under the precision-recall curve (AUPR). These metrics are calculated as follows:

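$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{13}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{14}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{15}$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \tag{16}$$

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{17}$$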

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. In addition to the above metrics, the ROC curve plots true positive rate against false positive rate, with AUC representing the area under this curve. The precision-recall curve plots precision versus recall, and AUPR is its corresponding area.
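The threshold-based metrics can be computed from the four confusion-matrix counts as in the following sketch. Returning 0.0 when a denominator is zero is a convention chosen for this example, and the counts used are hypothetical toy values rather than results from the study. AUC and AUPR need the ranked prediction scores and are therefore not covered here.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, sensitivity, specificity, F1-score, and MCC from counts."""
    def safe_div(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    sensitivity = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * sensitivity, precision + sensitivity),
        "mcc": safe_div(
            tp * tn - fp * fn,
            math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
        ),
    }


# Hypothetical toy counts for illustration only
metrics = classification_metrics(tp=8, tn=6, fp=2, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```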

Results and discussion

In this section, we validate the effectiveness of the proposed PBIP by making a comparison with the state-of-the-art PBI prediction methods. Moreover, certain comparative experiments are conducted to show the impact of test set imbalance as well as training and test similarity on prediction performance. Finally, an ablation study is performed to assess the necessity of each component of PBIP in improving the prediction performance. In the following, the baseline methods are first introduced, and then the results and discussion of all methods are given.

Baseline settings

To validate the effectiveness of PBIP, we compare it with four state-of-the-art PBI prediction methods: LeiteANN [15], LeiteBagging [16], PredPHI [29], and PHIAF [30]. Given that all negative interactions in the strain-level dataset are experimentally validated, we further construct SMOTE-based variants of PredPHI and PHIAF, named PredPHIS and PHIAFS, which generate synthetic positive interactions in the embedding space to achieve data balance. Then, to further validate the performance of PBIP, we compare it with five widely used machine learning classifiers, namely LR [19], SVM [56], KNN [17], RF [18], and eXtreme Gradient Boosting (XGB) [57].

LeiteANN [15]: This method extracts three hand-crafted features including amino acid composition (AAC), chemical composition (AC), and molecular weight (MW). The training set is balanced by randomly selecting negative interactions, and an ANN model is trained for PBI prediction.

LeiteBagging [16]: It adopts the same feature set and negative sampling strategy as LeiteANN. Instead of a single classifier, it employs a Bagging ensemble with hard voting that combines RF, ANN, and KNN.

PredPHI [29]: This method uses AAC, AC, and MW features, balances the training set through negative sampling, and trains a CNN for prediction.

PHIAF [30]: In addition to the features used in PredPHI, it also extracts seven DNA-based features including k-mer, reverse complement k-mer (RCKmer), nucleic acid composition, di-nucleotide composition, tri-nucleotide composition, composition of k-spaced nucleic acid pairs, and electron–ion interaction pseudopotentials of trinucleotides (PseEIIP). The training set is balanced using negative sampling, and prediction is performed with a CNN incorporating an attention mechanism.

PredPHIS: This variant of PredPHI generates positive interactions in the embedding space using SMOTE to balance the training set. The feature extraction and deep learning model are identical to those employed in PredPHI.

PHIAFS: This variant of PHIAF applies SMOTE to generate positive interactions, with feature extraction and model components identical to those used in PHIAF.

LR [19], SVM [56], KNN [17], RF [18], and XGB [57]: These methods use UniRep-based protein embeddings, balance the training set through negative sampling, and are trained as individual models for PBI prediction.

To ensure fair comparison, all baselines are retrained on the same strain-level dataset, with consistent train/test partitioning as in our method. For the species-level dataset, since the numbers of positive and negative interactions are balanced, SMOTE-based data augmentation is removed.
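The SMOTE-based balancing used by PredPHIS, PHIAFS, and PBIP generates synthetic positive samples by interpolating between a real positive and one of its nearest positive neighbours. The sketch below shows that core idea in plain Python. The function name `smote_oversample`, the default `k=5`, and the use of Euclidean distance are assumptions for illustration; the paper's own implementation may differ.

```python
import random


def smote_oversample(positives, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    positives: list of feature vectors (lists of floats), at least two of them.
    """
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(positives))
        base = positives[i]
        # Indices of the k nearest other positives (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(positives)) if j != i),
            key=lambda j: sq_dist(base, positives[j]),
        )[:k]
        neighbour = positives[rng.choice(neighbours)]
        # The new point lies on the segment between base and neighbour.
        lam = rng.random()
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D "embeddings" standing in for positive interaction vectors.
toy_positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_oversample(toy_positives, n_new=10, k=2)
```

Because every synthetic point is a convex combination of two real positives, the generated samples stay inside the region already covered by the minority class.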

Performance comparison on cross-validation

To ensure stable training and assess the performance of PBIP, we perform 10-fold cross-validation on both the strain-level and species-level datasets. The data are randomly partitioned into 10 subsets. In each round, the model uses nine subsets for training and the remaining one for validation. The mean results over all folds are reported for each dataset. As shown in Table 4, PBIP achieves the best results on most evaluation metrics across both datasets. Specifically, on the strain-level dataset, PBIP attains the highest Accuracy (0.96), Sensitivity (0.90), F1-score (0.86), MCC (0.72), and AUC (0.96). On the species-level dataset, it obtains the best Accuracy (0.92), Precision (0.92), Sensitivity (0.92), F1-score (0.92), MCC (0.85), AUC (0.98), and AUPR (0.98).
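The 10-fold protocol described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names `k_fold_indices` and `mean_and_std`, the fixed seed, and the use of the population standard deviation are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for k random, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled list gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


def mean_and_std(values):
    """Summarise per-fold scores as mean and (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5


splits = list(k_fold_indices(103, k=10))
```

Each sample is used for validation exactly once, and the per-fold scores of any metric can be passed to `mean_and_std` to obtain a summary of the kind reported for each method.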

Table 4.

Performance comparison between PBIP and baseline methods on strain-level and species-level training sets using 10-fold cross-validation

Level | Model | Accuracy | Precision | Sensitivity | Specificity | F1-score | MCC | AUC | AUPR
strain | LeiteANN | 0.70±0.05 | 0.70±0.05 | 0.70±0.05 | 0.67±0.07 | 0.69±0.05 | 0.39±0.10 | 0.75±0.05 | 0.76±0.06
strain | LeiteBagging | 0.83±0.02 | 0.83±0.02 | 0.83±0.02 | 0.82±0.04 | 0.83±0.02 | 0.66±0.05 | – | –
strain | PredPHI | 0.83±0.02 | 0.83±0.02 | 0.83±0.02 | 0.83±0.03 | 0.83±0.02 | 0.66±0.04 | 0.90±0.03 | 0.91±0.03
strain | PHIAF | 0.80±0.03 | 0.80±0.06 | 0.79±0.06 | 0.80±0.07 | 0.80±0.03 | 0.60±0.07 | 0.88±0.04 | 0.88±0.03
strain | PredPHIS | 0.92±0.01 | 0.73±0.03 | 0.87±0.02 | 0.93±0.01 | 0.78±0.03 | 0.59±0.05 | 0.94±0.01 | 0.69±0.07
strain | PHIAFS | 0.95±0.01 | 0.63±0.07 | 0.69±0.07 | 0.97±0.01 | 0.66±0.03 | 0.63±0.04 | 0.92±0.02 | 0.72±0.07
strain | LR | 0.69±0.02 | 0.69±0.03 | 0.69±0.03 | 0.65±0.03 | 0.68±0.03 | 0.37±0.05 | 0.74±0.04 | 0.72±0.05
strain | SVM | 0.77±0.04 | 0.78±0.04 | 0.77±0.04 | 0.82±0.04 | 0.77±0.04 | 0.55±0.09 | 0.84±0.03 | 0.85±0.03
strain | KNN | 0.76±0.02 | 0.77±0.02 | 0.76±0.02 | 0.67±0.05 | 0.76±0.02 | 0.53±0.04 | 0.87±0.01 | 0.86±0.03
strain | RF | 0.79±0.03 | 0.79±0.03 | 0.79±0.03 | 0.78±0.05 | 0.79±0.03 | 0.59±0.06 | 0.86±0.02 | 0.83±0.03
strain | XGB | 0.79±0.03 | 0.79±0.03 | 0.79±0.04 | 0.79±0.04 | 0.79±0.03 | 0.59±0.07 | 0.87±0.03 | 0.87±0.02
strain | PBIP | 0.96±0.01 | 0.83±0.03 | 0.90±0.02 | 0.96±0.01 | 0.86±0.02 | 0.72±0.04 | 0.96±0.01 | 0.83±0.04
species | LeiteANN | 0.84±0.01 | 0.85±0.01 | 0.84±0.02 | 0.81±0.02 | 0.84±0.01 | 0.69±0.03 | 0.93±0.01 | 0.92±0.02
species | LeiteBagging | 0.89±0.01 | 0.90±0.01 | 0.89±0.02 | 0.82±0.03 | 0.89±0.02 | 0.79±0.03 | – | –
species | PredPHI | 0.91±0.02 | 0.91±0.02 | 0.91±0.02 | 0.90±0.04 | 0.91±0.02 | 0.82±0.04 | 0.97±0.01 | 0.97±0.01
species | PHIAF | 0.91±0.01 | 0.92±0.02 | 0.90±0.04 | 0.92±0.02 | 0.91±0.02 | 0.82±0.02 | 0.98±0.01 | 0.98±0.01
species | LR | 0.85±0.02 | 0.85±0.02 | 0.85±0.02 | 0.85±0.02 | 0.85±0.02 | 0.70±0.03 | 0.92±0.01 | 0.89±0.03
species | SVM | 0.86±0.01 | 0.86±0.01 | 0.86±0.01 | 0.88±0.02 | 0.86±0.01 | 0.73±0.03 | 0.95±0.01 | 0.95±0.01
species | KNN | 0.86±0.01 | 0.86±0.01 | 0.86±0.01 | 0.79±0.01 | 0.86±0.01 | 0.72±0.02 | 0.93±0.01 | 0.90±0.02
species | RF | 0.88±0.01 | 0.89±0.01 | 0.88±0.01 | 0.95±0.01 | 0.88±0.01 | 0.77±0.03 | 0.93±0.01 | 0.90±0.02
species | XGB | 0.89±0.01 | 0.89±0.01 | 0.89±0.01 | 0.92±0.02 | 0.89±0.01 | 0.79±0.03 | 0.97±0.01 | 0.96±0.01
species | PBIP | 0.92±0.01 | 0.92±0.01 | 0.92±0.01 | 0.92±0.01 | 0.92±0.01 | 0.85±0.03 | 0.98±0.01 | 0.98±0.01

The best performance is highlighted in boldface. “–” represents that the metric is not available.

The results in Table 4 further demonstrate that deep learning-based approaches (PBIP, PredPHI, PHIAF, PredPHIS, and PHIAFS) generally outperform machine learning-based methods (LeiteANN, LeiteBagging, LR, SVM, KNN, RF, and XGB). Among the machine learning baselines, LeiteBagging and XGB perform particularly well. This is likely due to the advantages of ensemble learning in enhancing generalization (LeiteBagging) and the integration of UniRep protein embeddings with L1/L2 regularization in XGB. Moreover, among the methods that employ SMOTE to balance the datasets, PBIP outperforms both PredPHIS and PHIAFS. These findings highlight that the combination of UniRep protein embeddings, CNN, Bi-GRU, and the attention mechanism is more effective for predicting PBIs.

Performance comparison on the test set

To further evaluate the performance of each method, we test them on two independent test sets, which are carefully partitioned to ensure a robust assessment. As shown in Table 5, PBIP achieves the best performance across most evaluation metrics on both the strain-level and species-level datasets. Specifically, it achieves the highest Accuracy (0.80), Sensitivity (0.80), F1-score (0.80), MCC (0.61), and AUPR (0.89) on the strain-level dataset, as well as the highest Accuracy (0.69), Precision (0.70), F1-score (0.69), MCC (0.39), AUC (0.78), and AUPR (0.75) on the species-level dataset. Notably, on the strain-level dataset, the SMOTE-based variants PredPHIS and PHIAFS outperform their original counterparts PredPHI and PHIAF across most metrics, underscoring the positive role of data balancing strategies in improving prediction performance.

Table 5.

Performance comparison between PBIP and baseline methods on strain-level and species-level test sets

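Level | Model | Accuracy | Precision | Sensitivity | Specificity | F1-score | MCC | AUC | AUPR
strain | LeiteANN | 0.64 | 0.64 | 0.64 | 0.63 | 0.64 | 0.29 | 0.68 | 0.73
strain | LeiteBagging | 0.72 | 0.73 | 0.72 | 0.85 | 0.71 | 0.45 | – | –
strain | PredPHI | 0.76 | 0.77 | 0.76 | 0.85 | 0.76 | 0.53 | 0.81 | 0.84
strain | PHIAF | 0.73 | 0.82 | 0.59 | 0.87 | 0.69 | 0.48 | 0.80 | 0.81
strain | PredPHIS | 0.77 | 0.80 | 0.77 | 0.93 | 0.76 | 0.57 | 0.89 | 0.89
strain | PHIAFS | 0.75 | 0.91 | 0.55 | 0.95 | 0.69 | 0.54 | 0.85 | 0.86
strain | LR | 0.63 | 0.63 | 0.63 | 0.60 | 0.63 | 0.26 | 0.66 | 0.60
strain | SVM | 0.72 | 0.73 | 0.72 | 0.80 | 0.72 | 0.44 | 0.80 | 0.83
strain | KNN | 0.72 | 0.72 | 0.72 | 0.74 | 0.72 | 0.44 | 0.78 | 0.76
strain | RF | 0.73 | 0.75 | 0.73 | 0.88 | 0.73 | 0.49 | 0.82 | 0.83
strain | XGB | 0.75 | 0.76 | 0.75 | 0.83 | 0.75 | 0.50 | 0.83 | 0.84
strain | PBIP | 0.80 | 0.81 | 0.80 | 0.89 | 0.80 | 0.61 | 0.86 | 0.89
species | LeiteANN | 0.65 | 0.67 | 0.65 | 0.79 | 0.65 | 0.32 | 0.74 | 0.73
species | LeiteBagging | 0.66 | 0.67 | 0.66 | 0.78 | 0.66 | 0.34 | – | –
species | PredPHI | 0.66 | 0.67 | 0.66 | 0.77 | 0.66 | 0.33 | 0.76 | 0.71
species | PHIAF | 0.68 | 0.67 | 0.71 | 0.64 | 0.69 | 0.36 | 0.75 | 0.72
species | LR | 0.52 | 0.52 | 0.52 | 0.53 | 0.52 | 0.03 | 0.52 | 0.48
species | SVM | 0.60 | 0.65 | 0.60 | 0.88 | 0.57 | 0.25 | 0.70 | 0.67
species | KNN | 0.59 | 0.66 | 0.59 | 0.92 | 0.54 | 0.24 | 0.71 | 0.66
species | RF | 0.58 | 0.70 | 0.58 | 0.97 | 0.51 | 0.25 | 0.74 | 0.72
species | XGB | 0.62 | 0.66 | 0.62 | 0.88 | 0.60 | 0.28 | 0.69 | 0.69
species | PBIP | 0.69 | 0.70 | 0.69 | 0.77 | 0.69 | 0.39 | 0.78 | 0.75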

The best performance is highlighted in boldface. “–” represents that the metric is not available.

To visually compare various methods, we present the ROC and PR curves on the strain-level and species-level test sets in Fig. 8. As shown in Fig. 8, PBIP achieves the highest AUPR (0.89) on the strain-level set and the highest AUC (0.78) and AUPR (0.75) on the species-level set. These results further validate the effectiveness of the proposed deep learning framework in predicting PBIs.

Figure 8.

A four-panel figure comparing the performance of PBIP against baseline methods using ROC and PR curves. Panels (A) and (B) show results on the strain-level test set, where PBIP achieves the highest AUPR. Panels (C) and (D) show results on the species-level test set, where PBIP achieves the highest AUC and AUPR.

Performance comparison between PBIP and baseline methods based on ROC and PR curves on strain-level and species-level test sets: (A) strain-level ROC, (B) strain-level PR, (C) species-level ROC, and (D) species-level PR.

Impact of test set imbalance on prediction performance

In real-world scenarios, the number of positive interactions is substantially lower than that of negative interactions. To evaluate the performance of PBIP and baseline methods under imbalanced test conditions, we conduct a case study using the strain-level interaction dataset. Specifically, we incorporate negative interactions discarded from the training set to create nine imbalanced test sets, with positive-to-negative interaction ratios ranging from 1:2 to 1:10. Moreover, we use MCC to evaluate model performance on imbalanced datasets.

As shown in Fig. 9, the MCC decreases consistently as the degree of imbalance increases. Notably, PBIP surpasses all state-of-the-art methods across all imbalanced test sets, demonstrating robust performance under varying imbalance ratios and highlighting its potential applicability in real-world scenarios where positive interactions are inherently scarce.
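One way to see why MCC falls as negatives are added is to hold a classifier's sensitivity and specificity fixed and vary only the positive-to-negative ratio. The sketch below does this for ratios 1:2 to 1:10. The sensitivity of 0.8, specificity of 0.9, and 100 positives are hypothetical values chosen for illustration, not figures from the paper, and the assumption that both rates stay constant across test sets is a simplification.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from (possibly fractional) counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def mcc_under_imbalance(sensitivity, specificity, n_pos=100, ratios=range(2, 11)):
    """Expected MCC when negatives outnumber positives by each ratio r (1:r)."""
    results = {}
    for r in ratios:
        n_neg = r * n_pos
        tp = sensitivity * n_pos
        fn = n_pos - tp
        tn = specificity * n_neg
        fp = n_neg - tn
        results[r] = mcc(tp, tn, fp, fn)
    return results


# Hypothetical operating point: sensitivity 0.8, specificity 0.9.
curve = mcc_under_imbalance(0.8, 0.9)
```

Even with unchanged per-class error rates, the number of false positives grows with the number of negatives, which lowers precision and therefore MCC. A downward trend with increasing imbalance is thus expected for every method, and the comparison of interest is the gap between methods at each ratio.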

Figure 9.

A line graph comparing the MCC of PBIP and baseline methods across test sets with increasing imbalance ratios. The results show that while all methods decline in performance, PBIP consistently surpasses others, demonstrating its robustness.

Performance comparison between PBIP and baseline methods at different test set imbalance ratios on strain-level dataset.

Impact of training and test similarity on prediction performance

We utilize Dashing [48] to quantify the genomic similarity between training and test sets. For each phage in the test set, we calculate its similarity score with every phage in the training set and retain the highest value as its representative similarity. Based on these scores, the test phages are partitioned into five similarity intervals, and the predictive performance is evaluated separately at both strain and species levels.
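The binning procedure described above can be sketched as follows, taking a precomputed similarity matrix as input. The function names, the equal-width intervals on [0, 1], and the toy numbers are assumptions for illustration; the paper does not state the interval boundaries here, and running Dashing itself is outside the scope of this sketch.

```python
def representative_similarity(sim_matrix):
    """For each test phage (row), keep its highest similarity to any training phage."""
    return [max(row) for row in sim_matrix]


def assign_interval(score, n_bins=5):
    """Map a similarity in [0, 1] to one of n_bins equal-width intervals."""
    return min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes to the last bin


def accuracy_by_interval(rep_sims, y_true, y_pred, n_bins=5):
    """Return per-interval accuracy (None if empty) and per-interval sample counts."""
    correct = [0] * n_bins
    total = [0] * n_bins
    for sim, truth, pred in zip(rep_sims, y_true, y_pred):
        b = assign_interval(sim, n_bins)
        total[b] += 1
        correct[b] += int(truth == pred)
    accuracies = [c / t if t else None for c, t in zip(correct, total)]
    return accuracies, total


# Toy example: 3 test phages (rows) against 2 training phages (columns).
toy_sims = [[0.10, 0.95], [0.30, 0.20], [0.55, 0.50]]
rep = representative_similarity(toy_sims)
acc, counts = accuracy_by_interval(rep, y_true=[1, 0, 1], y_pred=[1, 1, 1])
```

Returning the per-interval counts alongside the accuracies makes it easy to spot intervals with very few test interactions, where a single misclassification shifts the accuracy substantially.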

As shown in Fig. 10, higher similarity between training and testing data generally corresponds to improved predictive accuracy across all methods. PBIP consistently achieves the highest accuracy within most similarity ranges, further confirming its advantage over competing approaches. However, an anomaly is observed in that performance does not strictly follow the expected upward trend with increasing similarity. This irregularity may arise from the relatively small number of test interactions within certain similarity intervals (as indicated by the gray bars), which can magnify the influence of sample-specific characteristics.

Figure 10.

A two-panel figure comparing the prediction accuracy of PBIP and baseline methods across different similarity intervals between training and test data for (A) strain-level and (B) species-level datasets. While higher similarity generally improves accuracy for all methods, and PBIP leads in most intervals, the performance shows a non-monotonic trend, potentially due to limited test samples in certain ranges.

Performance comparison between PBIP and baseline methods at different similarity intervals on (A) strain-level and (B) species-level datasets.

Ablation study

The aforementioned experimental results demonstrate the effectiveness of PBIP in predicting PBIs. To further evaluate the contribution of each component, we conduct an ablation study. Notably, the CNN module is excluded from this study due to the considerable slowdown in training when feeding high-dimensional embedding representations directly into the Bi-GRU. The PBIP variants are as follows:

  • PBIP1: This variant utilizes the same hand-crafted features as PredPHI.

  • PBIP2: This variant employs the same hand-crafted features as PHIAF.

  • PBIP3: This variant is without the Bi-GRU module.

  • PBIP4: This variant is without the attention module.

  • PBIP5: This variant is without the data augmentation module.

Table 6 presents the performance comparison between PBIP and its variant models on the test sets. Overall, PBIP surpasses its variants across the majority of evaluation metrics. Specifically, the results of PBIP1 and PBIP2 highlight the superiority of UniRep-derived embeddings, which provide substantial performance gains over traditional hand-crafted features. Furthermore, the results of PBIP3, PBIP4, and PBIP5 demonstrate the individual contributions of the Bi-GRU module, the attention mechanism, and the data augmentation strategy. Notably, on the strain-level dataset, PBIP5 is trained without SMOTE, so its training set is scaled down to the original validated positive interactions. Compared with PBIP, the performance of PBIP5 drops by approximately 7.5% (for example, Accuracy falls from 0.80 to 0.74), demonstrating that SMOTE-based data augmentation substantially enhances predictive performance.

Table 6.

Performance comparison between PBIP and the variant models on strain-level and species-level test sets

Level | Model | Accuracy | Precision | Sensitivity | Specificity | F1-score | MCC | AUC | AUPR
strain | PBIP1 | 0.77 | 0.78 | 0.77 | 0.83 | 0.77 | 0.55 | 0.82 | 0.85
strain | PBIP2 | 0.71 | 0.77 | 0.71 | 0.94 | 0.70 | 0.48 | 0.75 | 0.79
strain | PBIP3 | 0.71 | 0.77 | 0.71 | 0.95 | 0.69 | 0.48 | 0.83 | 0.85
strain | PBIP4 | 0.73 | 0.76 | 0.73 | 0.90 | 0.73 | 0.49 | 0.84 | 0.86
strain | PBIP5 | 0.74 | 0.76 | 0.74 | 0.87 | 0.74 | 0.50 | 0.81 | 0.80
strain | PBIP | 0.80 | 0.81 | 0.80 | 0.89 | 0.80 | 0.61 | 0.86 | 0.89
species | PBIP1 | 0.60 | 0.60 | 0.60 | 0.68 | 0.60 | 0.21 | 0.65 | 0.62
species | PBIP2 | 0.63 | 0.64 | 0.63 | 0.72 | 0.63 | 0.27 | 0.70 | 0.67
species | PBIP3 | 0.64 | 0.64 | 0.64 | 0.66 | 0.63 | 0.27 | 0.71 | 0.66
species | PBIP4 | 0.66 | 0.66 | 0.66 | 0.75 | 0.65 | 0.32 | 0.71 | 0.66
species | PBIP | 0.69 | 0.70 | 0.69 | 0.77 | 0.69 | 0.39 | 0.78 | 0.75

The best performance is highlighted in boldface

Conclusion

The PBI prediction task is of critical significance for advancing phage therapy. However, the existing computational methods mainly focus on the species or higher taxonomic levels, and often neglect deep protein embedding representations. In this article, we have proposed PBIP, a novel deep learning framework designed for strain-level PBI prediction. PBIP first constructs a strain-level interaction dataset through biological infection experiments and sequencing of Klebsiella pneumoniae isolated from the clinical environment of Xiangya Hospital. Then, PBIP leverages the pretrained UniRep to generate deep embeddings of protein sequences, enabling the efficient capture of rich sequence-level biological patterns. To address data imbalance, SMOTE is employed to generate additional positive interactions in the embedding space. Subsequently, these embedding feature vectors are fed into a CNN module to extract local features, a Bi-GRU module to capture long-term dependencies in both forward and backward directions, and an attention module to emphasize the contribution of key features. Finally, a fully connected layer with a sigmoid activation function integrates these vectors for PBI prediction. Extensive experimental results have demonstrated the superiority of PBIP over the state-of-the-art methods for PBI prediction. The case studies on test set imbalance ratios and sequence similarity have validated the robustness and generalizability of PBIP, while the ablation studies have demonstrated the effectiveness of its various components for PBI prediction.

Although PBIP achieves competitive performance for PBI prediction, its interpretability remains limited: it does not yet reveal which specific protein signatures are associated with resistance to certain phage types. In the future, we will systematically evaluate the predictive value of the genomic and domain-level features of proteins, aiming to enhance both prediction accuracy and biological interpretability.

Key Points

  • We propose phage–bacterium interactions predictor (PBIP), a deep learning framework that leverages deep protein embeddings combined with the convolutional neural network, bi-directional gated recurrent unit, and attention modules to capture complex local and global features for accurate strain-level PBI prediction.

  • To address data imbalance, PBIP applies the synthetic minority oversampling technique to augment positive interactions in the embedding space, enhancing model robustness and predictive performance.

  • Extensive experiments on strain-level and species-level datasets show that PBIP outperforms the existing methods, with ablation studies validating the effectiveness of its deep embedding, deep neural network architecture, and data augmentation strategies.

Supplementary Material

Supplementary_Document_bbaf656

Contributor Information

Lijia Ma, College of Computer Science and Software Engineering, Shenzhen University, No. 3688 Nanhai Avenue, Nanshan District, Shenzhen 518060, Guangdong, China.

Peng Gao, College of Computer Science and Software Engineering, Shenzhen University, No. 3688 Nanhai Avenue, Nanshan District, Shenzhen 518060, Guangdong, China.

Gufeng Liu, College of Computer Science and Software Engineering, Shenzhen University, No. 3688 Nanhai Avenue, Nanshan District, Shenzhen 518060, Guangdong, China.

Yuan Bai, School of Public Health, The University of Hong Kong, No. 7 Sassoon Road, Pokfulam, Hong Kong SAR, China.

Qiuzhen Lin, College of Computer Science and Software Engineering, Shenzhen University, No. 3688 Nanhai Avenue, Nanshan District, Shenzhen 518060, Guangdong, China.

Jianqiang Li, National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, No. 3688 Nanhai Avenue, Nanshan District, Shenzhen 518060, Guangdong, China.

Minfeng Xiao, BGI-Shenzhen, Beishan Industrial Zone, Yantian District, Shenzhen 518083, Guangdong, China.

Conflict of interest: None declared.

Funding

This work was supported by the National Key R&D Program of China under Grant 2020YFA0908700, by the National Natural Science Foundation of China under Grant 62173236, by the Technology Research Project of Shenzhen City under Grants JCYJ20240813141416022 and JCYJ20190808174801673, and by the Tencent “Rhinoceros Bird” Scientific Research Foundation for Young Teachers of Shenzhen University under Grant.

Data availability

The code, validated positive and negative interactions at the strain level, experimental phenotypes, and species-level interaction data are available at https://github.com/a1678019300/PBIP. Moreover, all strain-level sequence data are publicly available from the China National GeneBank DataBase Sequence Archive (CNSA) [58] under accession number CNP0006217: https://db.cngb.org/search/project/CNP0006217/.

References

  • 1. Kortright  KE, Chan  BK, Koff  JL. et al.  Phage therapy: a renewed approach to combat antibiotic-resistant bacteria. Cell Host Microbe  2019;25:219–32. 10.1016/j.chom.2019.01.014 [DOI] [PubMed] [Google Scholar]
  • 2. Pan  J, You  Z, You  W. et al.  PTBGRP: predicting phage–bacteria interactions with graph representation learning on microbial heterogeneous information network. Brief Bioinform  2023;24:bbad328. [DOI] [PubMed] [Google Scholar]
  • 3. Mallawaarachchi  V, Roach  MJ, Decewicz  P. et al.  Phables: from fragmented assemblies to high-quality bacteriophage genomes. Bioinformatics  2023;39:btad586. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Ma  L, Deng  W, Bai  Y. et al.  Identifying phage sequences from metagenomic data using deep neural network with word embedding and attention mechanism. IEEE/ACM Trans Comput Biol Bioinform  2023;20:3772–85. 10.1109/TCBB.2023.3322870 [DOI] [PubMed] [Google Scholar]
  • 5. Wang  C, Zhang  J, Cheng  L. et al.  DPProm: a two-layer predictor for identifying promoters and their types on phage genome using deep learning. IEEE J Biomed Health Inform  2022;26:5258–66. 10.1109/JBHI.2022.3193224 [DOI] [PubMed] [Google Scholar]
  • 6. Gaborieau  B, Vaysset  H, Tesson  F. et al.  Prediction of strain level phage–host interactions across the Escherichia genus using only genomic information. Nat Microbiol  2024;9:2847–61. 10.1038/s41564-024-01832-5 [DOI] [PubMed] [Google Scholar]
  • 7. Villarroel  J, Kleinheinz  KA, Jurtz  VI. et al.  Hostphinder: a phage host prediction tool. Viruses  2016;8:116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Pons  JC, Paez-Espino  D, Riera  G. et al.  VPF-class: taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinformatics  2021;37:1805–13. 10.1093/bioinformatics/btab026 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Ahlgren  NA, Ren  J, Lu  YY. et al.  Alignment-free Inline graphic oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res  2017;45:39–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Zielezinski  A, Deorowicz  S, Gudyś  A. Phist: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences. Bioinformatics  2022;38:1447–9. 10.1093/bioinformatics/btab837 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Paez-Espino  D, Eloe-Fadrosh  EA, Pavlopoulos  GA. et al.  Uncovering earth’s virome. Nature  2016;536:425–30. 10.1038/nature19094 [DOI] [PubMed] [Google Scholar]
  • 12. Shmakov  SA, Sitnik  V, Makarova  KS. et al.  The CRISPR spacer space is dominated by sequences from species-specific mobilomes. MBio  2017;8:10–1128. 10.1128/mBio.01397-17 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Zhang  R, Mirdita  M, Karin  EL. et al.  SpacePHARER: sensitive identification of phages from CRISPR spacers in prokaryotic hosts. Bioinformatics  2021;37:3364–6. 10.1093/bioinformatics/btab222 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Galiez  C, Siebert  M, Enault  F. et al.  WIsH: Who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics  2017;33:3113–4. 10.1093/bioinformatics/btx383 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Leite  DMC, Brochet  X, Resch  G. et al.  Computational prediction of inter-species relationships through omics data analysis and machine learning. BMC Bioinform  2018;19:151–9. 10.1186/s12859-018-2388-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Leite  DMC, Lopez  JF, Brochet  X. et al.  Exploration of multiclass and one-class learning methods for prediction of phage-bacteria interaction at strain level. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1818–25. Madrid, Spain: IEEE, 2018. [Google Scholar]
  • 17. Cover  T, Hart  P. Nearest neighbor pattern classification. IEEE Trans Inform Theory  1967;13:21–7. 10.1109/TIT.1967.1053964 [DOI] [Google Scholar]
  • 18. Breiman  L. Random forests. Mach Learn  2001;45:5–32. 10.1023/A:1010933404324 [DOI] [Google Scholar]
  • 19. Hsiang-Fu  Y, Huang  F-L, Lin  C-J. Dual coordinate descent methods for logistic regression and maximum entropy models. Mach Learn  2011;85:41–75. 10.1007/s10994-010-5221-8 [DOI] [Google Scholar]
  • 20. Hearst  MA, Dumais  ST, Osuna  E. et al.  Support vector machines. IEEE Intell Syst Appl  1998;13:18–28. 10.1109/5254.708428 [DOI] [Google Scholar]
  • 21. Rish  I. et al.  An empirical study of the naive Bayes classifier. In: IJCAI 2001 workshop on empirical methods in artificial intelligence, Vol. 3, pp. 41–6. Seattle, Washington, USA: Morgan Kaufmann, 2001. [Google Scholar]
  • 22. Witten  IH, Frank  E. Data mining: practical machine learning tools and techniques with java implementations. ACM Sigmod Record  2002;31:76–7. [Google Scholar]
  • 23. Yansen  S, Zhiyang  H, Wang  F. et al.  AMGDTI: drug–target interaction prediction based on adaptive meta-graph learning in heterogeneous network. Brief Bioinform  2024;25:bbad474. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Zhang  L, Wang  C-C, Zhang  Y. et al.  GPCNDTA: prediction of drug-target binding affinity through cross-attention networks augmented with graph features and pharmacophores. Comput Biol Med  2023;166:107512. 10.1016/j.compbiomed.2023.107512 [DOI] [PubMed] [Google Scholar]
  • 25. Zhang  G, Li  M, Deng  H. et al.  SGNNMD: signed graph neural network for predicting deregulation types of miRNA-disease associations. Brief Bioinform  2022;23:bbab464. [DOI] [PubMed] [Google Scholar]
  • 26. Huang  F, Yue  X, Xiong  Z. et al.  Tensor decomposition with relational constraints for predicting multiple types of microRNA-disease associations. Brief Bioinform  2021;22:bbaa140. 10.1093/bib/bbaa140 [DOI] [PubMed] [Google Scholar]
  • 27. Zhao  Y, Yin  J, Zhang  L. et al.  Drug–drug interaction prediction: databases, web servers and computational models. Brief Bioinform  2024;25:bbad445. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Han  C-D, Wang  C-C, Huang  L. et al.  MCFF-MTDDI: multi-channel feature fusion for multi-typed drug–drug interaction prediction. Brief Bioinform  2023;24:bbad215. [DOI] [PubMed] [Google Scholar]
  • 29. Li  M, Wang  Y, Li  F. et al.  A deep learning-based method for identification of bacteriophage-host interaction. IEEE/ACM Trans Comput Biol Bioinform  2021;18:1801–10. 10.1109/TCBB.2020.3017386 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Li  M, Zhang  W. PHIAF: Prediction of phage-host interactions with GAN-based data augmentation and sequence-based feature fusion. Brief Bioinform  2022;23:bbab348. [DOI] [PubMed] [Google Scholar]
  • 31. Zhang  Y-z, Liu  Y, Bai  Z. et al.  Zero-shot-capable identification of phage–host relationships with whole-genome sequence representation by contrastive learning. Brief Bioinform  2023;24:bbad239. 10.1093/bib/bbad239 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Shang  J, Sun  Y. Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning. BMC Biol  2021;19:250. 10.1186/s12915-021-01180-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Shang  J, Sun  Y. CHERRY: a computational method for accurate prediction of virus–prokaryotic interactions using a graph encoder–decoder model. Brief Bioinform  2022;23:bbac182. 10.1093/bib/bbac182 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Ma  L, Gao  P, Zhou  W. et al.  Multi-view attention graph convolutional networks for the host prediction of phages. Knowledge-Based Syst  2025;308:112755. 10.1016/j.knosys.2024.112755 [DOI] [Google Scholar]
  • 35. Zhi-Hua  D, Zhong  J-P, Liu  Y. et al.  Prokaryotic virus host prediction with graph contrastive augmentaion. PLoS Comput Biol  2023;19:e1011671. 10.1371/journal.pcbi.1011671 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Wang Y, Sun H, Wang H. et al. An effective model for predicting phage-host interactions via graph embedding representation learning with multi-head attention mechanism. IEEE J Biomed Health Inform 2023;27:3061–71. 10.1109/JBHI.2023.3261319
  • 37. Xiao Z, Sun H, Wei A. et al. A novel framework for predicting phage-host interactions via host specificity-aware graph autoencoder. IEEE J Biomed Health Inform 2025;29:3069–78. 10.1109/JBHI.2024.3500137
  • 38. Kauffman KM, Chang WK, Brown JM. et al. Resolving the structure of phage–bacteria interactions in the context of natural diversity. Nat Commun 2022;13:372. 10.1038/s41467-021-27583-z
  • 39. Alley EC, Khimulya G, Biswas S. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 2019;16:1315–22. 10.1038/s41592-019-0598-1
  • 40. LeCun Y, Bottou L, Bengio Y. et al. Gradient-based learning applied to document recognition. Proc IEEE 1998;86:2278–324. 10.1109/5.726791
  • 41. Chung J, Gulcehre C, Cho KH. et al. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555. 2014. https://arxiv.org/abs/1412.3555
  • 42. Chawla NV, Bowyer KW, Hall LO. et al. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002;16:321–57.
  • 43. Li M, Liu G, Song W. et al. Enhancing strain-level phage-host prediction through experimentally validated negatives and feature optimization strategies. bioRxiv:2025.05.31.656987. 2025. 10.1101/2025.05.31.656987
  • 44. Coelho ED, Arrais JP, Matos S. et al. Computational prediction of the human-microbial oral interactome. BMC Syst Biol 2014;8:24. 10.1186/1752-0509-8-24
  • 45. Boeckaerts D, Stock M, Criel B. et al. Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins. Sci Rep 2021;11:1467. 10.1038/s41598-021-81063-4
  • 46. Nami Y, Imeni N, Panahi B. Application of machine learning in bacteriophage research. BMC Microbiol 2021;21:1–8. 10.1186/s12866-021-02256-5
  • 47. Besemer J, Lomsadze A, Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 2001;29:2607–18. 10.1093/nar/29.12.2607
  • 48. Baker DN, Langmead B. Dashing: Fast and accurate genomic distances with hyperloglog. Genome Biol 2019;20:265.
  • 49. Krause B, Murray I, Renals S. et al. Multiplicative LSTM for sequence modelling. In: 5th International Conference on Learning Representations Workshop. Toulon, France: OpenReview, 2016.
  • 50. Arif M, Ali F, Ahmad S. et al. Pred-BVP-Unb: fast prediction of bacteriophage virion proteins using un-biased multi-perspective properties with recursive feature elimination. Genomics 2020;112:1565–74. 10.1016/j.ygeno.2019.09.006
  • 51. Jing X-Y, Li F-M. Predicting cell wall lytic enzymes using combined features. Front Bioeng Biotechnol 2021;8:627335. 10.3389/fbioe.2020.627335
  • 52. Srivastava N, Hinton G, Krizhevsky A. et al. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15:1929–58.
  • 53. Alipanahi B, Delong A, Weirauch MT. et al. Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nat Biotechnol 2015;33:831–8. 10.1038/nbt.3300
  • 54. Xie J, Jin X, Wei H. et al. IDP-EDL: enhancing intrinsically disordered protein prediction by combining protein language model and ensemble deep learning. Brief Bioinform 2025;26:bbaf182. 10.1093/bib/bbaf182
  • 55. Reddi SJ, Kale S, Kumar S. On the convergence of Adam and beyond. In: 6th International Conference on Learning Representations. Vancouver, BC, Canada: OpenReview, 2018.
  • 56. Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2011;2:1–27. 10.1145/1961189.1961199
  • 57. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–94. San Francisco, CA, USA: ACM, 2016.
  • 58. Guo X, Chen F, Gao F. et al. CNSA: a data repository for archiving omics data. Database 2020;2020:baaa055.

Associated Data


Supplementary Materials

Supplementary_Document_bbaf656

Data Availability Statement

The code, validated positive and negative interactions at the strain level, experimental phenotypes, and species-level interaction data are available at https://github.com/a1678019300/PBIP. Moreover, all strain-level sequence data are publicly available from the China National GeneBank DataBase Sequence Archive (CNSA) [58] under accession number CNP0006217: https://db.cngb.org/search/project/CNP0006217/.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
