Abstract
The coronavirus disease-2019 (COVID-19) pandemic has elucidated major limitations in the capacity of medical and research institutions to appropriately manage emerging infectious diseases. We can improve our understanding of infectious diseases by unveiling virus–host interactions through host range prediction and protein–protein interaction prediction. Although many algorithms have been developed to predict virus–host interactions, numerous issues remain to be solved, and the entire network remains veiled. In this review, we comprehensively surveyed algorithms used to predict virus–host interactions. We also discuss the current challenges, such as dataset biases toward highly pathogenic viruses, and the potential solutions. The complete prediction of virus–host interactions remains difficult; however, bioinformatics can contribute to progress in research on infectious diseases and human health.
Keywords: Virus–host interaction, Host range prediction, Protein–protein interaction prediction
1. Introduction
Humanity has long been battling viral infections [1], [2]. The coronavirus disease-2019 (COVID-19) pandemic has resulted in 6,578,245 reported deaths as of November 8, 2022 (https://covid19.who.int), and the pandemic has once again reminded us of our lack of knowledge and understanding about viruses. However, it is possible to improve our understanding of virus–host interactions and prepare for the next pandemic. We use virus–host interaction prediction as a generic term for host-range and protein–protein interaction (PPI) prediction.
Because many emerging infectious diseases in humans are caused by viruses originating from other animals, clarifying macroscopic virus–host interactions or host ranges is a high priority (Fig. 1A). For example, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is believed to originate from bats [3] and has a relatively extensive host range, including rhesus macaques [4], cats [5], ferrets [5], and Malayan pangolins [6]. These varied pathways can facilitate viral transmission to humans, and if the host range can be predicted, measures such as quarantine can be undertaken. Thus, host-range prediction is essential for determining whether a novel virus can cause zoonosis and thus pose a risk to human health [7].
Fig. 1.
Conceptual diagram of (A) virus-host prediction and (B) PPI prediction. Solid lines indicate known interaction pairs (typically by experiment), and dashed lines indicate predicted interaction pairs.
Viruses attempt to hijack the transcription and translation systems of the host for replication; conversely, the host attempts to eliminate the virus through innate and acquired immune responses (Fig. 1B). These virus–host microscopic interactions are organized by complex PPI networks. Therefore, knowledge of the PPI network between a virus and its host is essential for understanding the viral replication system and the cellular response to infection. One of the molecular mechanisms responsible for the formation of virus–host PPI networks is the mimicry of motif sequences in host proteins by viruses. For example, short linear motifs (SLiMs) are typically peptide sequences with three to ten amino acids that mediate various PPIs in host cells. Previous studies have shown that diverse viral proteins contain SLiMs, which are used for virus–human PPI formation [8], [9], [10]. Another mechanism is viral mimicry of the host protein structure, which determines the physical interactions among proteins [9], [10], [11]. Such similarities in motif sequences or protein structures between viruses and hosts can provide clues for predicting virus–human PPIs. Physical protein interactions between virus and host proteins have been identified using modern techniques such as next-generation yeast-two-hybrid analysis [12], cross-linking-mass spectrometry [13], and a combination of proximity labeling and affinity purification-mass spectrometry [14]. For example, the comprehensive SARS-CoV-2–human PPI networks were revealed through these experiments, and the networks have been used to elucidate the disease mechanisms or to identify candidate therapeutic targets [15], [16], [17]. In addition, for various viral species, the interaction data identified in these experiments were collected in public databases, such as IntAct [18], VirusMentha [19], VirusMINT [20], and HPIDB [21]. Virus–host PPIs are discussed in more detail in [22], [23].
Although the SARS-CoV-2 PPI network was determined very quickly because of the critical nature of the pandemic and the importance of the disease [16], the scalability of these techniques is generally limited by the time and financial costs required for virus culture and antibody production. Accordingly, a complete picture of the PPI network is still unclear, and bioinformatics has attracted attention as a scalable complement to experimental approaches.
Here, we can regard host-range prediction and PPI prediction as similar problems in the informatics context despite their distinctiveness as biological topics. This is because host-range prediction and PPI prediction are both binary classification problems, in which we determine whether an edge (infectivity or PPI) exists between two nodes (e.g., DNA/RNA/amino acid sequence). One problem with virus–host interaction prediction is that the sheer number of combinations prevents rapid and accurate prediction. This is because to solve the host prediction problem, interactions with various host organisms, including non-model organisms (which generally cannot be handled in the laboratory), must be considered, and to solve the PPI prediction problem, interactions with >10,000 host proteins must be considered. Because it is difficult to experimentally verify such a large number of virus–host interactions, the informatics approach is expected to be reasonable. In addition, computational analysis is valuable because of its safety and because it does not pose any ethical issues in infectious disease research. Indeed, a series of studies have focused on the host range and virus–host PPI in infectious diseases, and there is a growing demand for algorithms to unravel virus–host interactions.
In this paper, we provide a comprehensive review of existing host range/PPI prediction algorithms to understand how informatics approaches contribute to virus–host interaction prediction and discuss the current issues and future prospects (Tables 1 and 2). Considering the breadth of related literature, only peer-reviewed articles are included in this review. Algorithms targeting phages were within the scope of this review because of their potential application to animal viruses. Papers that have made a particular contribution to the field are also discussed in the text. These papers were published relatively recently, have many citations, and are original studies.
Table 1.
Host range prediction algorithms.
| Algorithm name | Year | Virus | Host | Classifier | Features or input | Software availability | Ref. |
|---|---|---|---|---|---|---|---|
| Ahmed et al. | 2009 | Lambda phage, Bacillus phage, Vibrio phage, others | E. coli, B. subtilis, V. cholerae, others | – | oligostickness similarity | no | [72] |
| Eng et al. | 2014 | influenza virus | human, avian | RF, K-NN, NB, SVM, NN | AA k-mer, physicochemical properties of AA | no | [73] |
| HostPhinder | 2016 | phages in NCBI[74], EMBL[75], others | bacteria in NCBI[74] | – | nucleotide k-mer | yes | [76] |
| VirHostMatcher | 2017 | viruses in RefSeq[74] | bacteria and archaea in RefSeq[74] | – | nucleotide k-mer | yes | [29] |
| WIsH | 2017 | phages in RefSeq[74] | bacteria in RefSeq[74] | Markov model | few kbp contigs | yes | [31] |
| Xu et al. | 2017 | influenza virus | human, avian, swine | SVM | word2vec-based features[33] | yes | [32] |
| Eng et al. | 2017 | influenza virus | human, avian | RF | AA k-mer, physicochemical properties of AA | no | [77] |
| Leite et al. | 2018 | phages in PhagesDB[78], GenBank[79] | phages in PhagesDB[78], GenBank[79] | K-NN, SVM, RF, NN | domain-domain interaction | no | [37] |
| Li et al. | 2018 | rabies virus, coronavirus, influenza virus | bat, human, animals | SVM, K-NN | AA k-mer, sequence similarity | no | [80] |
| Babayan et al. | 2018 | ssRNA viruses in ICTVdB[81] | human, mammals, arthropods, others in ICTVdB[81] | GBM | nucleotide/AA k-mer, viral phylogenetic relationships | yes | [28] |
| Host Taxon predictor | 2019 | viruses and phages in NCBI[82] | eukaryotes and prokaryotes in NCBI[82] | LR, K-NN, QDA, SVM | nucleotide k-mer | yes | [83] |
| ILMF-VH | 2019 | viruses in Virus-Host DB[84] | hosts in Virus-Host DB[84] | matrix factorization | Doc2Vec-based features[85], nucleotide k-mer | no | [86] |
| Zhang et al. | 2019 | viruses in Virus-Host DB[84] | hosts in Virus-Host DB[84] | K-NN, RF, GNBC, SVM, LR | nucleotide k-mer | yes | [87] |
| VHost-Classifier | 2019 | viruses in the Virus-Host DB[84] | hosts in the Virus-Host DB[84] | BLAST | BLAST results | yes | [88] |
| Qiang et al. | 2019 | influenza virus | human, avian | SVM, K-NN, NB | nucleotide k-mer, physicochemical properties of AA | no | [89] |
| VirHostMatcher-Net | 2020 | viruses and phages in NCBI[82] | bacteria, archaea in NCBI[82] | Markov random field model | CRISPR, nucleotide k-mer | yes | [90] |
| Kuzmin et al. | 2020 | coronavirus | human, swine, avian, others | SVM, LR, DT, RF | one-hot encoding of S protein seqs | yes | [91] |
| Young et al. | 2020 | viruses in Virus-Host DB[84] | hosts in Virus-Host DB[84] | SVM | nucleotide/AA k-mer, physicochemical properties of AA, protein domain | yes | [92] |
| PredPHI | 2021 | phages in PhagesDB[78], GenBank[79] | bacteria in GenBank[79] | K-Means clustering, NN | AA k-mer, MW, physicochemical properties of AA | yes | [93] |
| Boeckaerts et al. | 2021 | phages in UniProtKB[94] | bacteria in UniProtKB[94] | Linear Discriminant Analysis, LR, RF, GBM | nucleotide/AA k-mer, protein secondary structure | yes | [38] |
| RaFAH | 2021 | viruses in NCBI[82], GLUVAB[95] | bacteria and archaea in NCBI[82], GLUVAB[95] | RF, MMseqs2[60], hmmsearch[96] | sequence similarity | yes | [95] |
| SpacePHARER | 2021 | phages in GenBank[79] | bacteria in GenBank[79] | MMseqs2[60] | CRISPR | yes | [97] |
| CrisprOpenDB | 2021 | phages in NCBI[82] | bacteria in NCBI[82] | CRISPRDetect[42], BLAST | CRISPR | yes | [41] |
| Phirbo | 2021 | phages reported in[31], [90], [98] | bacteria reported in[31], [90], [98] | BLAST | BLAST results | yes | [99] |
| Prokaryotic virus Host Predictor | 2021 | viruses in VirHostMatcher[29], ICTV[100], NCBI[82] | bacteria in VirHostMatcher[29], ICTV[100], NCBI[82] | GMM | nucleotide k-mer | yes | [101] |
| Divide-and-conquer | 2021 | viruses in EID2[102] | terrestrial mammals in EID2[102] | avNNet, GBM, RF, XGBoost, SVM, NB | genome traits (e.g. GC content), ecological traits (e.g. body mass), phylogenetic relationships, others | yes | [103] |
| DeePaC | 2021 | viruses in Virus-Host DB[84] | hosts in Virus-Host DB[84] | NN | sequencer reads | yes | [104] |
| Brierley et al. | 2021 | coronavirus | hosts in GenBank[79] | RF | nucleotide k-mer | yes | [105] |
| Yang et al. | 2021 | viruses in NCBI[74] | hosts in NCBI[74] | NN | nucleotide/AA k-mer | no | [106] |
| VIDHOP | 2021 | influenza virus, rabies virus, Rotavirus A | hosts in ViPR database[107], influenza Research Database[108] | NN | one-hot encoding of AA | yes | [35] |
| | 2021 | viruses reported in[81], [109], [110] | hosts reported in[81], [109], [110] | GBM | nucleotide/AA k-mer | yes | [111] |
| VPF-Class | 2021 | uncultivated virus | – | hmmsearch[96] | sequence similarity | yes | [112] |
| ML-AdVInfect | 2021 | adenovirus | human, plant, bacteria, others | SVM, RF, NN | phylogenetic relationships, host receptor, adenovirus fiber proteins | no | [39] |
Abbreviations, AA: Amino Acids, NN: Neural Network, DT: Decision Tree, GBM: Gradient Boosting Machines, GMM: Gaussian Mixture Model, GNBC: Generalized Naive Bayes Classifiers, HMM: Hidden Markov Model, K-NN: K-Nearest Neighbors, LR: Logistic Regression, MW: Molecular Weight, NB: Naive Bayes, QDA: Quadratic Discriminant Analysis, RF: Random Forest, SVM: Support Vector Machine
Table 2.
PPI prediction algorithms.
| Algorithm name | Year | Virus | Host | Classifier | Features or input | Software availability | Ref. |
|---|---|---|---|---|---|---|---|
| Dyer et al. | 2011 | HIV | human | SVM | protein domains, AA k-mer, intrahost PPI | no | [113] |
| Cui et al. | 2012 | HCV, HPV | human | SVM | AA k-mer | no | [114] |
| Emamjomeh et al. | 2014 | HCV | human | RF, NB, SVM, MLP | PSSM, intrahost PPI, PAC, tissue information, post-translational modification | no | [115] |
| Barman et al. | 2014 | HIV-1, SV40, HBV, HEV | human | SVM, NB, RF | AA k-mer, disorder region, domain-domain association | no | [116] |
| Mei et al. | 2015 | HTLV | human | SVM | GO | no | [117] |
| DeNovo | 2016 | retrovirus, HHV, adenovirus, others | human | SVM | physicochemical properties of AA, AA k-mers | yes | [118] |
| Ray et al. | 2016 | HIV | human | non negative matrix factorization based clustering | gene expression, GO | no | [119] |
| Kim et al. | 2016 | HPV, HCV | human | SVM | AA k-mers | no | [120] |
| Nourani et al. | 2016 | HIV, SV40, HBV, others | human | Kernel embedding | AA k-mer, intrahost PPI, domain-domain association, GO | no | [121] |
| HOPITOR | 2018 | retrovirus, HHV, adenovirus, others | human | SVM, RF, XGBoost | AA k-mer | yes | [122] |
| Zhou et al. | 2018 | influenza virus, ebola, others | human, plant, bacteria, others | SVM | physicochemical properties of AA, AA k-mer | yes | [123] |
| Alguwaizani et al. | 2018 | RNA/DNA virus, retrovirus | human, mouse, E. coli, others | SVM | repeat patterns of AA, AA k-mer | yes | [124] |
| P-HIPSTer | 2019 | coronavirus, retrovirus, HHV, others | human | Bayesian network | structure | yes | [11] |
| Dey et al. | 2020 | SARS-CoV-2 | human | SVM, K-NN, NB, RF, XGBoost, AdaBoost, DMLP | physicochemical properties of AA, PAC | no | [125] |
| Zhang et al. | 2020 | adenovirus, retrovirus, coronavirus, others | human | RF | N-glycosylation, gene expression, AA k-mer | no | [126] |
| Yang et al. | 2020 | viruses in HPIDB[21], SwissProt[94] | human | RF | Doc2vec-based features | yes | [47] |
| HMI-PRED | 2020 | EBV | human | – | structural alignment, protein docking | yes | [127] |
| DeepViral | 2021 | virus in HPIDB[21], SARS-CoV-2 | human | NN | DL2Vec-based features[128], GO, phenotypes, phylogenetic relationships | yes | [129] |
| MTT | 2021 | influenza virus, Ebola, SARS-CoV-2 | human | NN | UniRep-based features[50] | yes | [49] |
| Yang et al. | 2021 | HIV, Herpes, Papilloma, others | human | NN | PSSM | yes | [130] |
| Koca et al. | 2022 | HBV, influenza virus, zika virus, others | human | NN | Doc2Vec-based features[85] | yes | [131] |
Abbreviations, AA: Amino acid, GO: Gene Ontology, HBV: Hepatitis B Virus, HCV: Hepatitis C Virus, HHV: Human Herpes Virus, HIV: Human Immunodeficiency Virus, HPV: Human Papillomavirus, HTLV: Human T-cell Leukemia Virus, K-NN: K-Nearest Neighbors, MLP: Multilayer Perceptron, NB: Naive Bayes, NN: Neural Network, PAC: Pseudo Amino acid Composition, PSSM: Position-Specific Scoring Matrix, RF: Random Forest, SV40: Simian Virus 40, SVM: Support Vector Machine
2. Algorithms for host-range prediction
The host-range prediction studies are presented in Table 1. Here, we introduce each feature used for host-range predictions. It should be noted that these features are not exclusive, and some studies have attempted to improve the accuracy of prediction by combining multiple features.
2.1. Sequence composition features based on viral adaptation mechanisms to their hosts
Most studies have performed host-range prediction using features extracted from the nucleotide or amino acid sequences of viruses or hosts. In this section, we introduce sequence features extracted according to the knowledge of viral adaptation mechanisms to their hosts. A representative feature is the k-mer frequency of nucleotide or amino acid sequences. This is because the sequence compositions of the virus have coevolved with those of the host to hijack the host’s cellular mechanisms for viral replication [24] or to escape from the host immune system [25]. Thus, the compositions of the viral nucleotide and amino acid sequences are similar to those of the host, and the compositions of viruses infecting the same host are also similar. Moreover, some studies have used the physicochemical properties of viral amino acid sequences as predictive features, which have been reported to be important for binding viral surface proteins to host receptors [26], [27]. In particular, we present two studies that used the k-mer frequency of nucleotide or amino acid sequences as features for host-range prediction.
Babayan et al. demonstrated that machine learning using viral genome sequences can directly predict reservoir animals and arthropod vectors [28]. This model could predict the hosts of most human-infective single-stranded RNA virus families, including 69 viruses with previously unknown reservoirs or vectors. The authors thought that combining phylogenetic neighborhood and genomic traits with machine learning could improve the accuracy of host prediction because closely related viruses often have closely related hosts, and genomic traits of viruses have been reported to mimic those of their hosts. First, the authors compared the accuracies of eight machine learning algorithms trained using 4229 viral traits, such as codon pair bias, dinucleotide bias, and amino acid bias. The resulting gradient boosting machine (GBM) with the best performance was selected, and the most informative genomic traits for virus–host prediction were identified. GBM combined with the selected genomic traits and phylogenetic neighborhood analysis predicted reservoir hosts with an accuracy of up to 83.5%. Second, by learning and bagging two sets of models focusing on arthropod-borne infections, GBM could identify which arthropod vector transmitted the virus with near-perfect accuracy (bagged accuracy = 97.0%). Furthermore, the authors attempted to identify the reservoirs or vectors of viruses previously uninvestigated. For example, the Bas-Congo virus, which caused an outbreak of hemorrhagic fever, was detected only in humans. However, the trained GBM predicted even-toed ungulates as reservoir animals and midges as arthropod vectors. These predictions can help in the identification of potential organisms to be prioritized in epidemiological studies.
VirHostMatcher is a host-prediction tool based on the sequence composition called oligonucleotide frequency (ONF) [29]. The key idea of the sequence composition approach is that codon usage or k-mer frequency of the virus is correlated with that of the host. This correlation is caused by viruses mimicking the sequence composition of the host to hijack its transcription/translation systems. Because ONF is a vector of k-mer frequencies, the ONF-based method computes the distances of ONF vectors between viruses and hosts. The virus–host pair with the minimum distance is predicted to be the most likely parasite–host relationship. VirHostMatcher can compute 11 distinct distance measures, such as Euclidean, Manhattan, and d*2 distances [30]. The authors evaluated these measures using genomes of 1427 viruses whose hosts are known and revealed that the d*2 distance was the most accurate for host prediction. The d*2 distance is a measure of the ONF corrected by subtracting the background ONF from the observed ONF. The background ONF was predicted using a Markov model representing convergent evolution. The authors interpreted that the d*2 distance could accurately predict hosts because it excludes the effect of convergent evolution among distantly related genomes. Next, the authors compared VirHostMatcher with previous methods using 820 bacteriophages and 2699 candidate bacteria, and VirHostMatcher with d*2 distance outperformed the previous methods. However, the ONF-based method can only predict hosts whose genomes are already known because the ONFs of the candidate hosts must be calculated. On the other hand, ONF can be calculated from fragmented or partial sequences that are often contained in metagenomic data. Therefore, the ONF-based method has an advantage over similarity-search-based methods for metagenomic analyses.
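The distance-based idea described above can be illustrated with a minimal sketch. It is not the VirHostMatcher implementation: it uses only the plain Euclidean and Manhattan distances (not the background-corrected d*2 measure), and the function names, the toy sequences, and the choice of k = 2 are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
import math

def kmer_frequencies(seq, k=2):
    """Return the k-mer frequency vector (ONF) of a DNA sequence, ordered by k-mer."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # k-mers with other letters are ignored
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def closest_host(virus_seq, host_seqs, k=2, distance=euclidean):
    """Pick the candidate host whose ONF vector is nearest to the viral ONF vector."""
    virus_onf = kmer_frequencies(virus_seq, k)
    distances = {
        name: distance(virus_onf, kmer_frequencies(seq, k))
        for name, seq in host_seqs.items()
    }
    return min(distances, key=distances.get), distances

# Toy data (hypothetical): one AT-rich and one GC-rich candidate host.
hosts = {
    "host_AT_rich": "ATATTAATATTTAAATATAT",
    "host_GC_rich": "GCGGCCGCGCCGGGCGCGCC",
}
best, dists = closest_host("ATTATAATTTATATAAT", hosts)
```

In this toy case the AT-rich virus is assigned to the AT-rich host. Real genomes differ far less in composition, which is one reason the choice of distance measure matters.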
2.2. Embedded features by deep learning from viral sequences
In the previous section, we showed that viral host ranges can be predicted using nucleotide and amino acid sequence compositions, which is one mechanism of viral adaptation to the host. However, it has also been reported that prediction using sequence compositions as features is not sufficiently accurate when only short viral sequences are available [31]. Furthermore, viral infectivity cannot be determined by sequence composition alone, and unknown mechanisms may be involved in determining the viral host range. Although such features related to unknown mechanisms are difficult to extract, they can lead to better predictive accuracy for host range prediction. Therefore, some studies have attempted to improve the prediction accuracy by extracting features independent of virological knowledge. Here, we introduce studies that extract more complex features from viral sequences using deep neural networks.
Xu et al. predicted host ranges of the influenza virus using features vectorized from viral nucleotide or protein sequences by word2vec [32]. word2vec [33] is a deep learning model developed in the field of natural language processing that transforms words into continuous vector representations by associating them with each other. In this study, every two to four letters were split from nucleotide and amino acid sequences, and the split sequences were defined as words. Each word was then transformed into a vector using word2vec, and the average of these vectors was used as a feature. The SVM classifier based on vectorized viral sequences can classify influenza viral hosts with high accuracy (area under the curve of approximately 0.9). In this study, 12 viral proteins were used as inputs, and high prediction accuracy was achieved when using surface proteins (HA and NA). These proteins are expected to be better discriminators because they directly interact with the host cells or are targeted by host immune responses. Interestingly, some internal proteins (PB2, PB1, PA, and NP) also showed high predictive accuracy. These proteins constitute RNA polymerase complexes and have been reported to influence viral replication efficiency in host cells [34]. Therefore, the authors suggested that the characteristics of viral proteins influence prediction accuracy. In summary, this study indicates that feature extraction from viral sequences using word2vec can be useful for host prediction.
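The feature construction described above (split the sequence into short words, embed each word, and average) can be sketched as follows. The embedding table here is a hypothetical stand-in for vectors learned by word2vec, and splitting into non-overlapping words is an assumption; the sketch shows only the averaging step, not the training of the embeddings or the SVM.

```python
def split_into_words(seq, k=3):
    """Split a sequence into consecutive non-overlapping k-letter words.

    A trailing remainder shorter than k is dropped.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def average_embedding(words, embeddings, dim):
    """Average the vectors of all words found in the embedding table."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# Hypothetical 2-dimensional embeddings standing in for trained word2vec vectors.
toy_embeddings = {"ATG": [1.0, 0.0], "GCC": [0.0, 1.0], "TAA": [1.0, 1.0]}

words = split_into_words("ATGGCCTAAGG", k=3)
feature = average_embedding(words, toy_embeddings, dim=2)
```

The resulting fixed-length `feature` vector is what a downstream classifier would consume, regardless of the length of the input sequence.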
VIDHOP is a deep learning method for predicting viral hosts based on viral nucleotide sequences only [35]. This study proposed a long short-term memory (LSTM) architecture and a convolutional neural network (CNN)+LSTM architecture. The LSTM architecture extracts context-specific patterns, and the CNN+LSTM architecture uses a CNN to identify meaningful features that can be used by the LSTM layers. Both architectures provided accurate predictions of viral hosts on datasets of the influenza A virus, rabies lyssavirus, and rotavirus A. The LSTM model appeared to be suitable for more complex datasets, and the model with the CNN+LSTM architecture could be trained faster than that with the LSTM architecture. Furthermore, the authors addressed two issues that arise when using viral genome sequences as inputs for deep learning: (i) unbalanced datasets and (ii) inefficient training of recurrent neural networks when using long sequences. The traditional approach to overcome the first issue is to match the amount of data in each class with the class with the least amount of data. This approach is not appropriate for heavily biased datasets, such as datasets of viruses, as it would result in most sequences not being used. To avoid bias in the training set while ensuring all data are used, the authors first fixed the validation and test data and then extracted the training data at each epoch from the rest for each class. The second issue was addressed by using short subsequences divided from viral genomes as inputs. It was shown that subsequences shorter than 400 bp could be used to predict potential viral hosts. In conclusion, this study built a pipeline for host prediction by addressing multiple problems in deep learning using viral genomic sequences as inputs.
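The two data-handling ideas described above can be sketched in a few lines. This is an illustrative reading of the text, not the VIDHOP code: the function names are hypothetical, drawing as many sequences per class as the smallest class contains is an assumption, and the default piece length of 400 is only a placeholder taken from the length scale mentioned in the text.

```python
import random

def sample_balanced_epoch(train_by_class, rng):
    """Draw a fresh, class-balanced training set for one epoch.

    Every class contributes as many sequences as the smallest class,
    but the draw is repeated each epoch, so over many epochs the
    sequences of large classes are all eventually used.
    """
    n = min(len(seqs) for seqs in train_by_class.values())
    epoch = []
    for label, seqs in train_by_class.items():
        epoch.extend((s, label) for s in rng.sample(seqs, n))
    rng.shuffle(epoch)
    return epoch

def split_subsequences(seq, length=400):
    """Cut a genome into consecutive fixed-length pieces, dropping a short tail."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, length)]

# Toy, heavily unbalanced training pool (hypothetical identifiers).
train = {
    "human": ["h%d" % i for i in range(100)],
    "avian": ["a0", "a1", "a2"],
}
rng = random.Random(0)
epoch_1 = sample_balanced_epoch(train, rng)
epoch_2 = sample_balanced_epoch(train, rng)
pieces = split_subsequences("A" * 1000, length=400)
```

Each epoch contains three human and three avian examples, while the human examples generally differ between epochs.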
2.3. Other features
In addition to the features introduced in previous sections, metagenomic studies have attempted to use other types of features for host range prediction. Dutilh et al. showed that the sequence read amounts of viruses and hosts can be used for host range prediction, based on the assumption that these sequence amounts would correlate with each other in metagenomic data [36]. WIsH established a host-prediction algorithm that allows short viral sequences to be used as inputs via a probabilistic modeling approach [31]. In addition, some studies have reported that specific sequence regions can also be valuable features, for example, CRISPR spacer sequences, which are remnants of past viral infections (see the description of Dion et al. [41] below), or interactions between host membrane proteins and viral surface proteins [37], [38], [39]. In this section, we introduce three algorithms developed in the metagenomic field.
Dutilh et al. identified a novel phage genome and predicted its host from human gut metagenomes using an abundance profiling approach [36]. This approach assumes that the abundance of each contig from the same virus is correlated, and the contig abundances of the viruses are also correlated with those of the host. First, the authors created abundance profiles by combining metagenomic reads derived from all individual fecal samples and conducting sequence assembly. Then, a novel bacteriophage called “crAssphage” was identified by collecting and assembling the correlated contigs in the profiles. Next, to predict the hosts of crAssphage, the authors compared the profiles of crAssphage and 404 candidate bacteria. As a result, Bacteroides species were predicted to be hosts. Finally, the authors revealed that some CRISPR spacers in the Bacteroides genomes were similar to those in the crAssphage. As similar CRISPR spacers could indicate recent phage-host interactions [40], this biological feature supports the validity of virus-host prediction using abundance profiles. However, the abundance profiling approach has the following limitation: the abundance correlations do not always represent the parasite-host relationships because the virus increases after the host. Despite this limitation, this approach demonstrates a significant advantage for identifying novel viruses and predicting hosts using only metagenomic data, without the requirement for any database.
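The co-abundance assumption behind this approach can be illustrated with a small sketch that ranks candidate hosts by how well their abundance across samples tracks that of a phage. The use of Pearson correlation, the function names, and the toy abundance values are assumptions for illustration; the sketch does not reproduce the assembly or contig-binning steps of the original study.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-abundance signal
    return cov / (sx * sy)

def rank_hosts_by_coabundance(phage_profile, host_profiles):
    """Sort candidate hosts by correlation with the phage profile (best first)."""
    scores = {name: pearson(phage_profile, p) for name, p in host_profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy read abundances over four samples (hypothetical numbers).
phage = [1.0, 5.0, 2.0, 8.0]
candidates = {
    "host_A": [2.0, 10.0, 4.0, 16.0],  # rises and falls with the phage
    "host_B": [9.0, 1.0, 7.0, 2.0],    # moves in the opposite direction
}
ranked = rank_hosts_by_coabundance(phage, candidates)
```

As noted in the text, a high correlation is only indirect evidence, because viral abundance can lag behind host abundance.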
WIsH is an alignment-free method proposed to predict viral hosts from metagenomic data [31]. The authors identified low accuracy on short contigs as a weakness of previously proposed alignment-free methods, such as VirHostMatcher. Therefore, the authors attempted to predict hosts even with short contig sequences. First, WIsH trains a Markov model for each given host genome. Subsequently, the likelihood of a phage sequence is calculated under each host’s model, and the host is selected using a maximum-likelihood approach. For experiments on real data, WIsH used 3780 full host prokaryotic genomes from the KEGG database and 1420 host-annotated phage genomes from the RefSeq Virus database. WIsH outperformed VirHostMatcher at every taxonomic level. At the genus level, both methods showed an accuracy of approximately 42% for the entire viral genome. However, for very short contigs (length, 1 kb), VirHostMatcher’s accuracy dropped to 7.5%, whereas WIsH achieved an accuracy of 28.0%. By taking a probabilistic approach, WIsH maintains relatively high accuracy even with insufficient data. WIsH also uses OpenMP for parallel programming to improve computing speed. The computing time was reduced from approximately 16 h using VirHostMatcher to only a few minutes using WIsH. In conclusion, WIsH was able to predict the viral hosts with higher accuracy and faster computing speed than the existing methods, especially for contigs shorter than 3 kb.
Dion et al. developed an accurate virus–host prediction tool using CRISPR spacer datasets [41]. The CRISPR-Cas system is an adaptive immune system in prokaryotes that protects against viral infections. Prokaryotes with this system can acquire resistance to specific viruses by integrating DNA sequences, called spacers, from viruses that have previously invaded into their genomes. Accordingly, when a prokaryotic genome contains the DNA of a particular virus as the spacer, the spacer provides direct evidence that the virus can infect the prokaryote. However, CRISPR-based host prediction methods based on simple sequence similarity searches have failed to perform adequately because they ignore spacer mutations and the acquisition of spacers through horizontal gene transfer. Another problem is the lack of large spacer data available for host range prediction. In line with this background, Dion et al. first constructed a large-scale CRISPR spacer database that included 11,767,782 spacers predicted by CRISPRDetect from 367,446 bacterial genomes [42]. Next, they proposed four filtering rules after a sequence similarity search to improve the prediction performance: the number of alignment mismatches, the number of detections in the genome, the position of the spacer in the CRISPR array, and the last common ancestor. Benchmark dataset analysis demonstrated that the developed CRISPR-based method shows performance comparable to that of WIsH.
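The two steps described above (train one Markov model per host genome, then choose the host under whose model the contig is most likely) can be sketched as follows. This is a toy illustration, not the WIsH implementation: the model order of 2, the add-one pseudocount over a four-letter alphabet, and all names are assumptions.

```python
import math
from collections import defaultdict

def train_markov_model(genome, order=2, pseudocount=1.0):
    """Fit a fixed-order Markov model and return a log-probability function."""
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(genome)):
        counts[genome[i - order:i]][genome[i]] += 1

    def log_prob(context, base):
        ctx = counts.get(context, {})
        # Pseudocounts assume a 4-letter DNA alphabet (A, C, G, T).
        total = sum(ctx.values()) + 4 * pseudocount
        return math.log((ctx.get(base, 0.0) + pseudocount) / total)

    return log_prob

def log_likelihood(contig, log_prob, order=2):
    """Log-likelihood of a contig under one host model."""
    return sum(
        log_prob(contig[i - order:i], contig[i])
        for i in range(order, len(contig))
    )

def most_likely_host(contig, host_genomes, order=2):
    """Score the contig under every host model and return the best host."""
    scores = {
        name: log_likelihood(contig, train_markov_model(g, order), order)
        for name, g in host_genomes.items()
    }
    return max(scores, key=scores.get), scores

# Toy genomes (hypothetical) with clearly different local composition.
host_genomes = {"host_A": "ACGT" * 50, "host_B": "AATTGGCC" * 25}
best_host, scores = most_likely_host("ACGTACGTACGT", host_genomes)
```

Because the likelihood is a sum over positions, the score remains defined for short contigs, which is the property the text attributes to the probabilistic approach.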
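A heavily simplified sketch of spacer-based host prediction is shown below. It covers only two of the ideas above, tolerance for a limited number of mismatches and the number of supporting spacers per host, and it replaces the similarity search of the original tool with a naive forward-strand sliding-window comparison. The function names, the mismatch threshold, and the toy sequences are assumptions.

```python
def hamming_hits(spacer, genome, max_mismatches=2):
    """Return start positions where the spacer matches with few mismatches."""
    hits = []
    for i in range(len(genome) - len(spacer) + 1):
        window = genome[i:i + len(spacer)]
        mismatches = sum(1 for a, b in zip(spacer, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits

def rank_hosts_by_spacers(phage_genome, spacers_by_host, max_mismatches=2):
    """Rank hosts by how many of their spacers match the phage genome."""
    support = {}
    for host, spacers in spacers_by_host.items():
        n = sum(
            1 for s in spacers
            if hamming_hits(s, phage_genome, max_mismatches)
        )
        if n:
            support[host] = n
    return sorted(support.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (hypothetical): host_X carries one exact and one near-exact spacer.
phage_genome = "TTGACCGTAGGCTAACGTTAGCCATGGA"
spacers_by_host = {
    "host_X": ["CCGTAGGCTA", "ACGTTAGCGA"],
    "host_Y": ["GGGGGGGGGG"],
}
ranked_hosts = rank_hosts_by_spacers(phage_genome, spacers_by_host)
```

A production pipeline would also search the reverse complement and apply the remaining filters (spacer position in the array and last common ancestor), which are omitted here.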
3. PPI prediction approaches
Similar to host range prediction, many methods have been proposed for PPI prediction depending on the combination of features and classifiers (Table 2). The composition of nucleotides and amino acids is frequently used for PPI prediction. In addition, gene ontology and the PPI network in the host are also essential features because viruses target functionally important proteins and hub proteins in the host PPI. Furthermore, structural information is also important for predicting PPIs because viruses are known to form PPIs by mimicking the host protein structures, as mentioned in the Introduction [9], [10], [11]. Currently, the number of viral proteins whose structures are known is low; this prevents genome-wide PPI prediction using structural information. However, databases have been enriched, algorithms for predicting structural information from only protein sequences have been developed (see Summary and Outlook), and PPI prediction using structural information is being developed. This section introduces PPI prediction approaches considering a wide variety of features, such as sequence compositions, gene ontology, and structures.
P-HIPSTer performed a large-scale prediction of virus–host PPIs using protein structural information obtained from the Protein Data Bank and homology modeling [11]. The advantages of using structural information to predict PPIs are that: i) the information of direct physical interactions between proteins can be incorporated and ii) because protein structures are more conserved than their sequences [43], functional relationships between proteins that are undetectable on the basis of sequence information alone can be detected. P-HIPSTer combined the following three scores by means of a Bayesian network to calculate the probability that a candidate pair of proteins forms a complex: i) domain–domain interactions, ii) domain–peptide interactions, and iii) numbers of “structural neighbors” that are known to interact with the same target. P-HIPSTer predicted 282,000 virus–host PPIs, and the accuracy was reported to be 76.9% using a co-immunoprecipitation assay targeting 65 PPIs. Furthermore, P-HIPSTer successfully identified PPI modules involved in Zika virus replication and human papillomavirus pathogenicity. The virus–host PPIs predicted by P-HIPSTer are publicly available in a database (http://phipster.org) and will be updated according to the accumulation of information regarding protein structures. Such a platform can provide a deeper understanding regarding virus–host relationships and the biological mechanisms underlying these relationships. In the future, it is expected that the breakthrough in protein structure prediction from the amino acid sequence alone [44], [45], [46] will enable the prediction of PPIs with various viruses, including uncultivated ones (see Summary and Outlook).
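One simple way to combine several evidence scores into a single interaction probability is a naive Bayes update, sketched below. This is a swapped-in simplification, not the P-HIPSTer Bayesian network: it assumes that each evidence source has already been converted into a likelihood ratio, that the sources are conditionally independent, and that the prior and the example values are arbitrary.

```python
def interaction_probability(likelihood_ratios, prior=0.001):
    """Combine independent likelihood ratios into a posterior probability.

    posterior odds = prior odds * product of likelihood ratios
    """
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

# Hypothetical likelihood ratios for three evidence sources:
# domain-domain, domain-peptide, and structural-neighbor evidence.
p_strong = interaction_probability([10.0, 10.0, 10.0], prior=0.001)
p_weak = interaction_probability([10.0, 1.0, 0.5], prior=0.001)
```

With a low prior, several concordant lines of evidence are needed before the posterior probability becomes appreciable, whereas a single supporting score changes it little.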
Yang et al. proposed a virus–human PPI prediction tool using representation learning, a method initially proposed in natural language processing [47]. Representation learning learns how to convert words and sentences into real-valued feature vectors from large-scale document data while preserving the semantic similarity among words and sentences. Training is performed in an unsupervised manner using neural network models such as word2vec, Doc2Vec, or BERT. The feature vectors obtained by representation learning are then used as input data for other machine-learning models, including support vector machines and random forests. Recently, representation learning has also attracted attention in the field of biological sequence analysis and has been used for various prediction tasks, such as the prediction of protein functions, localization, and modification sites [48]. In these tasks, learning is performed by treating sequences as sentences and overlapping or non-overlapping k-mers in the sequences as words. Many previous studies have demonstrated that feature vectors based on representation learning perform better than those based on the physical properties of amino acids. Based on this trend, Yang et al. first applied the Doc2Vec model to 291,726 proteins for representation learning and then trained a random forest classifier to predict virus–human PPIs using the obtained feature vectors; they showed that the proposed feature vectors outperformed conventional feature vectors.
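The treatment of sequences as sentences and k-mers as words can be made concrete with a short sketch. The function name, the choice of k = 3, and the toy sequence are assumptions for illustration and are not taken from Yang et al.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 3, overlapping: bool = True) -> List[str]:
    """Split a protein sequence into k-mer "words".

    With overlapping=True the window advances by one residue; otherwise it
    advances by k residues, and a trailing fragment shorter than k is dropped.
    """
    step = 1 if overlapping else k
    # Stop at len(sequence) - k so that every token has exactly k residues.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


# Toy sequence for illustration only (not from any database).
toy_protein = "MKTAYIAK"
overlapping_words = kmer_tokens(toy_protein, k=3, overlapping=True)
non_overlapping_words = kmer_tokens(toy_protein, k=3, overlapping=False)
print(overlapping_words)
print(non_overlapping_words)
```

Each token list plays the role of one "sentence"; a Doc2Vec-style model would take a collection of such lists as its training corpus and return one feature vector per protein.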
Dong et al. constructed a multi-task transfer learning framework that attempts to predict new virus–human PPIs from a small amount of training data by combining transfer learning and multi-task learning [49]. When only a small amount of training data is available, overfitting makes it difficult to build a model that generalizes. Therefore, the authors used the pre-trained UniRep model, which learned latent representations from 24 million protein sequences [50], and fine-tuned this model via multi-task learning for both virus–human and human–human PPIs. The task for human PPIs is based on the assumption that viral proteins with biological properties similar to those of human proteins would show similar interaction patterns. Furthermore, to avoid overfitting due to a complex model, both virus–human and human–human PPI tasks were learned using a simple neural network, a multilayer perceptron. The multi-task transfer learning model showed an accuracy comparable to that of previously reported machine-learning models on benchmark datasets of virus–human PPIs. Although this model only uses protein sequences, it scored better than a deep-learning model that incorporates knowledge of viral taxonomy and phenotypes. These results suggest that an approach incorporating transfer and multi-task learning may mitigate the problem of having only limited training data on virus–host interactions.
4. Summary and outlook
This section presents problems that hinder the unveiling of virus–host interactions. Here, we discuss how existing methods have addressed these issues and the possible approaches that may solve them.
Public datasets are biased toward viruses that have human hosts and cause severe diseases, such as SARS-CoV-2. For example, human viruses occupy approximately 84% of over 10 million nucleotide sequences registered in the NCBI Virus Database (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/). Furthermore, previous studies on public databases of virus–host PPIs have shown that the available data are biased toward viruses that cause infectious diseases in humans, such as coronaviruses or influenza viruses [22], [23]. It is unclear whether models trained on only limited virus species found in limited host species can be used for non-human hosts or novel viruses [51], [52]. Training bias is also a critical issue in machine learning; machine-learning experts tackle this problem using multi-task or transfer learning approaches, which complement domains having little data with data-rich domains [53], [54]. In particular, the strategy of supplementing scarce virus–human PPI data with human–human PPI data has achieved results competitive with existing methods (see PPI prediction approaches). Thus, learning efficiently from limited data and refining machine-learning models remain ongoing challenges.
Another bias in public datasets is that most data are positive and few are negative. Hence, many studies have considered non-positive data in the datasets as negative data, but this assumption does not always hold true. A limited number of studies have released negative datasets; Negatome Database 2.0 is a negative interaction-focused database [55]. The International Molecular Exchange consortium also collects negative interactions [56]. However, the PPIs in the two datasets number on the order of thousands, and fewer than 100 negative interactions involve viral proteins. Saha et al. pointed out that researchers must systematically provide negative PPI data and expand the database [23].
One machine learning-based solution to this problem is to use positive and unlabeled (PU) learning. PU learning is a method that learns only from positive data and unlabeled data, where each unlabeled example may be either positive or negative. Theoretical studies on PU learning have recently been conducted; for example, loss functions for PU learning that prevent learning bias and overfitting have been proposed [57]. These studies can be combined with advanced machine-learning methods, such as deep learning. Because large amounts of unlabeled data constitute a common problem for many bioinformatics prediction problems, PU learning has recently been used for various bioinformatics prediction problems, such as disease-related gene identification and gene function prediction [58]. Thus far, no research has applied PU learning to virus–host interaction prediction, but it should be a promising approach.
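As an illustration of such a loss function, the following is a minimal sketch of the non-negative PU risk estimator described in [57], written in plain Python. The sigmoid surrogate loss, the function names, and the toy scores are choices made here for illustration; the class prior (the assumed fraction of true positives among unlabeled pairs) must be supplied or estimated separately and is a hypothetical value in this example.

```python
import math
from typing import Sequence


def sigmoid_loss(score: float, label: int) -> float:
    """Sigmoid surrogate loss; small when the sign of score agrees with label (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def nn_pu_risk(pos_scores: Sequence[float],
               unl_scores: Sequence[float],
               class_prior: float) -> float:
    """Non-negative PU risk computed from classifier scores.

    pos_scores: scores of known positive pairs.
    unl_scores: scores of unlabeled pairs.
    class_prior: assumed fraction of positives among unlabeled pairs.
    """
    risk_pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    risk_pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    risk_unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # Estimated risk on true negatives; clamping at zero prevents the
    # estimate from becoming negative, which is a symptom of overfitting.
    negative_part = risk_unl_as_neg - class_prior * risk_pos_as_neg
    return class_prior * risk_pos_as_pos + max(0.0, negative_part)


# Toy scores for illustration only.
toy_risk = nn_pu_risk(pos_scores=[2.0, 1.5, 0.3],
                      unl_scores=[-1.0, 0.2, -2.5, 1.8],
                      class_prior=0.3)
print(toy_risk)
```

In practice this quantity would be minimized with respect to the parameters of the scoring model, for example a neural network that scores candidate virus–host protein pairs.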
Even for the human genome, the most well-studied genome, the complete sequence, including repeat regions, has only recently become available [59]. Similar to the host genome, determining complete viral genomes is also a high-priority issue, reflecting the increasing number of novel viruses being discovered by metagenomic analysis [60]. The development of sequencers will gradually solve the lack of complete genome sequences, and alignment-free approaches will also contribute towards solving this problem from the algorithm side. An alignment-free approach does not always require complete genome sequences and can predict the host using short-read sequences. The development of sequencers and the refinement of alignment-free approaches will circumvent the insufficiency of complete genome sequences.
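The following minimal sketch illustrates why an alignment-free approach can work with short reads: k-mer frequencies are pooled over reads, so no assembled genome or alignment is needed. It deliberately uses a plain Euclidean distance between frequency profiles as a simplification and does not implement the specialized dissimilarity measures of published tools. All function names, sequences, and host labels are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt
from typing import Dict, Iterable, List

ALPHABET = "ACGT"


def kmer_profile(reads: Iterable[str], k: int = 2) -> List[float]:
    """Relative k-mer frequencies pooled over reads (k-mers with non-ACGT bases are skipped)."""
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if all(base in ALPHABET for base in kmer):
                counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    return [counts[m] / total if total else 0.0 for m in all_kmers]


def euclidean(p: List[float], q: List[float]) -> float:
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_host(viral_reads: Iterable[str],
                 host_sequences: Dict[str, str],
                 k: int = 2) -> str:
    """Return the host whose k-mer profile is closest to that of the viral reads."""
    viral = kmer_profile(viral_reads, k)
    return min(host_sequences,
               key=lambda name: euclidean(viral, kmer_profile([host_sequences[name]], k)))


# Toy data for illustration only.
toy_reads = ["ATATAT", "TATATA"]
toy_hosts = {"host_AT": "ATATATATAT", "host_GC": "GCGCGCGCGC"}
predicted = nearest_host(toy_reads, toy_hosts)
print(predicted)
```

Real tools use longer k-mers, background-corrected statistics, and large reference collections; the sketch only shows the overall flow from reads to a host prediction.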
Despite the importance of structural information in molecular interactions, the structures of only a few proteins are known; this limitation arises from the cost and throughput of structure determination. Only approximately 100,000 of the 1 billion known proteins [60], [61] have a registered structure in the public database [62], and the lack of structural information prevents structure-based PPI prediction between viruses and hosts. Recently, AlphaFold [44], RoseTTAFold [45], and an alignment-free, single-sequence approach [46] achieved highly accurate protein structure prediction. Because these neural network-based approaches can predict protein structures from amino acid sequences alone, the estimated structures can be used as inputs for machine learning without experimental structure determination. Furthermore, AlphaFold has already achieved acceptable quality in PPI prediction [63]. Thus, informatics-based protein structure prediction from amino acid sequences alone, without any experiments, has the potential to accelerate the development of structure-based PPI-prediction methods.
Although they are beyond the scope of this review, RNA–protein interactions are a critical consideration in uncovering virus–host interactions. The SARS-CoV-2 genome is a single-stranded RNA and is known to express subgenomic RNAs to translate proteins on the 3’ side [64]. Recent reports have confirmed the expression of RNA forms other than subgenomic RNAs, and these RNAs may interact with host proteins [65]. Furthermore, it has been suggested that an ORF7a-derived small RNA may bind to the Argonaute protein, which is a principal component of the RNA interference pathway, and selectively repress host genes as a non-coding RNA [66]. The importance of RNA–protein interactions between viruses and hosts has become apparent, and some studies have attempted to create a comprehensive catalog of SARS-CoV-2 RNA–human protein interactions [67], [68], [69]. These MS-based approaches could be powerful tools to reveal the complete virus–host network, although they are limited by protein abundance and RNA cross-linking [68], cell lines, and physiological conditions [54]. Additionally, predicting RNA–protein interactions is associated with high costs and other issues, and bioinformatics is a powerful tool to support the elucidation of the network [70], [71].
In this review, we examined many algorithms used to predict virus–host interactions with limited data. Although these approaches have not yet revealed the full extent of the virus–host network, they are constantly being updated and refined, especially with the continued development of machine learning. In addition, accumulating data from high-performance sequencers and mass spectrometers will strongly support improved prediction accuracy. We believe that bioinformatics approaches are critical for unveiling virus–host networks.
5. Conclusions
This review discussed various tools to reveal virus–host interaction networks. Although there is still room for improvement in terms of prediction accuracy, rapid advances in machine learning and other cutting-edge techniques will enable more accurate predictions. Bioinformatics approaches are expected to allow researchers to elucidate the design architecture of virus–host interactions and thus contribute toward protecting human health.
CRediT authorship contribution statement
Hitoshi Iuchi: Conceptualization, Writing – original draft, Writing – review & editing. Junna Kawasaki: Writing – original draft, Writing – review & editing. Kento Kubo: Writing – original draft. Tsukasa Fukunaga: Writing – original draft. Koki Hokao: Writing – original draft. Gentaro Yokoyama: Writing – original draft. Akiko Ichinose: Writing – original draft. Kanta Suga: Writing – original draft. Michiaki Hamada: Writing – review & editing.
Acknowledgement
We would like to express our appreciation to Atsushi Takeda, Waseda University, for his valuable and constructive suggestions. This study was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI [Grant Nos.: JP21K15078 to HI, JP22J00010 to JK], the Waseda Research Institute for Science and Engineering, Grant-in-Aid for Young Scientists (Early Bird) [to JK], and AMED [Grant Nos.: JP21fk0108104 and JP22ama121055 to MH].
Contributor Information
Hitoshi Iuchi, Email: hitosh.iuchi@gmail.com.
Michiaki Hamada, Email: mhamada@waseda.jp.
References
- 1.Meganck R.M., Baric R.S. Developing therapeutic approaches for twenty-first-century emerging infectious viral diseases. Nat Med. 2021;27(3):401–410. doi: 10.1038/s41591-021-01282-0. [DOI] [PubMed] [Google Scholar]
- 2.Nakamura T., Isoda N., Sakoda Y., Harashima H. Strategies for fighting pandemic virus infections: Integration of virology and drug delivery. J Control Release. 2022;343:361–378. doi: 10.1016/j.jconrel.2022.01.046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Zhou P., Yang X.-L., Wang X.-G., Hu B., Zhang L., Zhang W., Si H.-R., Zhu Y., Li B., Huang C.-L., Chen H.-D., Chen J., Luo Y., Guo H., Jiang R.-D., Liu M.-Q., Chen Y., Shen X.-R., Wang X., Zheng X.-S., Zhao K., Chen Q.-J., Deng F., Liu L.-L., Yan B., Zhan F.-X., Wang Y.-Y., Xiao G.-F., Shi Z.-L. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579(7798):270–273. doi: 10.1038/s41586-020-2012-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Shan C., Yao Y.-F., Yang X.-L., Zhou Y.-W., Gao G., Peng Y., Yang L., Hu X., Xiong J., Jiang R.-D., Zhang H.-J., Gao X.-X., Peng C., Min J., Chen Y., Si H.-R., Wu J., Zhou P., Wang Y.-Y., Wei H.-P., Pang W., Hu Z.-F., Lv L.-B., Zheng Y.-T., Shi Z.-L., Yuan Z.-M. Infection with novel coronavirus (SARS-CoV-2) causes pneumonia in Rhesus macaques. Cell Res. 2020;30(8):670–677. doi: 10.1038/s41422-020-0364-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Shi J., Wen Z., Zhong G., Yang H., Wang C., Huang B., Liu R., He X., Shuai L., Sun Z., Zhao Y., Liu P., Liang L., Cui P., Wang J., Zhang X., Guan Y., Tan W., Wu G., Chen H., Bu Z. Susceptibility of ferrets, cats, dogs, and other domesticated animals to SARS-coronavirus 2. Science. 2020;368(6494):1016–1020. doi: 10.1126/science.abb7015.
- 6.Lam T.T.-Y., Jia N., Zhang Y.-W., Shum M.H.-H., Jiang J.-F., Zhu H.-C., Tong Y.-G., Shi Y.-X., Ni X.-B., Liao Y.-S., Li W.-J., Jiang B.-G., Wei W., Yuan T.-T., Zheng K., Cui X.-M., Li J., Pei G.-Q., Qiang X., Cheung W.Y.-M., Li L.-F., Sun F.-F., Qin S., Huang J.-C., Leung G.M., Holmes E.C., Hu Y.-L., Guan Y., Cao W.-C. Identifying SARS-CoV-2-related coronaviruses in Malayan pangolins. Nature. 2020;583(7815):282–285. doi: 10.1038/s41586-020-2169-0. [DOI] [PubMed] [Google Scholar]
- 7.Albery G.F., Becker D.J., Brierley L., Brook C.E., Christofferson R.C., Cohen L.E., Dallas T.A., Eskew E.A., Fagre A., Farrell M.J., Glennon E., Guth S., Joseph M.B., Mollentze N., Neely B.A., Poisot T., Rasmussen A.L., Ryan S.J., Seifert S., Sjodin A.R., Sorrell E.M., Carlson C.J. The science of the host-virus network. Nat Microbiol. 2021;6(12):1483–1492. doi: 10.1038/s41564-021-00999-5. [DOI] [PubMed] [Google Scholar]
- 8.Davey N.E., Travé G., Gibson T.J. How viruses hijack cell regulation. Trends Biochem Sci. 2011;36(3):159–169. doi: 10.1016/j.tibs.2010.10.002. [DOI] [PubMed] [Google Scholar]
- 9.Franzosa E.A., Xia Y. Structural principles within the human-virus protein-protein interaction network. Proc Natl Acad Sci USA. 2011;108(26):10538–10543. doi: 10.1073/pnas.1101440108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Garamszegi S., Franzosa E.A., Xia Y. Signatures of pleiotropy, economy and convergent evolution in a domain-resolved map of human-virus protein-protein interaction networks. PLoS Pathog. 2013;9(12) doi: 10.1371/journal.ppat.1003778. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Lasso G., Mayer S.V., Winkelmann E.R., Chu T., Elliot O., Patino-Galindo J.A., Park K., Rabadan R., Honig B., Shapira S.D. A structure-informed atlas of human-virus interactions. Cell. 2019;178(6):1526–1541.e16. doi: 10.1016/j.cell.2019.08.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Velásquez-Zapata V., Elmore J.M., Banerjee S., Dorman K.S., Wise R.P. Next-generation yeast-two-hybrid analysis with Y2H-SCORES identifies novel interactors of the MLA immune receptor. PLOS Comput Biol. 2021;17(4) doi: 10.1371/journal.pcbi.1008890. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Wheat A., Yu C., Wang X., Burke A.M., Chemmama I.E., Kaake R.M., Baker P., Rychnovsky S.D., Yang J., Huang L. Protein interaction landscapes revealed by advanced in vivo cross-linking-mass spectrometry. Proc Natl Acad Sci USA. 2021;118(32) doi: 10.1073/pnas.2023360118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Liu X., Salokas K., Weldatsadik R.G., Gawriyski L., Varjosalo M. Combined proximity labeling and affinity purification-mass spectrometry workflow for mapping and visualizing protein interaction networks. Nat Protoc. 2020;15(10):3182–3211. doi: 10.1038/s41596-020-0365-x. [DOI] [PubMed] [Google Scholar]
- 15.Zhou Y., Liu Y., Gupta S., Paramo M.I., Hou Y., Mao C., Luo Y., Judd J., Wierbowski S., Bertolotti M., Nerkar M., Jehi L., Drayman N., Nicolaescu V., Gula H., Tay S., Randall G., Wang P., Lis J.T., Feschotte C., Erzurum S.C., Cheng F., Yu H. A comprehensive SARS-CoV-2-human protein-protein interactome reveals COVID-19 pathobiology and potential host therapeutic targets. Nat Biotechnol. 2022 doi: 10.1038/s41587-022-01474-0. (Oct.) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Gordon D.E., Hiatt J., Bouhaddou M., Rezelj V.V., Ulferts S., Braberg H., Jureka A.S., Obernier K., Guo J.Z., Batra J., Kaake R.M., Weckstein A.R., Owens T.W., Gupta M., Pourmal S., Titus E.W., Cakir M., Soucheray M., McGregor M., Cakir Z., Jang G., O’Meara M.J., Tummino T.A., Zhang Z., Foussard H., Rojc A., Zhou Y., Kuchenov D., Hüttenhain R., Xu J., Eckhardt M., Swaney D.L., Fabius J.M., Ummadi M., Tutuncuoglu B., Rathore U., Modak M., Haas P., Haas K.M., Naing Z.Z.C., Pulido E.H., Shi Y., Barrio-Hernandez I., Memon D., Petsalaki E., Dunham A., Marrero M.C., Burke D., Koh C., Vallet T., Silvas J.A., Azumaya C.M., Billesbølle C., Brilot A.F., Campbell M.G., Diallo A., Dickinson M.S., Diwanji D., Herrera N., Hoppe N., Kratochvil H.T., Liu Y., Merz G.E., Moritz M., Nguyen H.C., Nowotny C., Puchades C., Rizo A.N., Schulze-Gahmen U., Smith A.M., Sun M., Young I.D., Zhao J., Asarnow D., Biel J., Bowen A., Braxton J.R., Chen J., Chio C.M., Chio U.S., Deshpande I., Doan L., Faust B., Flores S., Jin M., Kim K., Lam V.L., Li F., Li J., Li Y.-L., Li Y., Liu X., Lo M., Lopez K.E., Melo A.A., Moss F.R., Nguyen P., Paulino J., Pawar K.I., Peters J.K., Pospiech T.H., Safari M., Sangwan S., Schaefer K., Thomas P.V., Thwin A.C., Trenker R., Tse E., Tsui T.K.M., Wang F., Whitis N., Yu Z., Zhang K., Zhang Y., Zhou F., Saltzberg D., QCRG Structural Biology Consortium. Hodder A.J., Shun-Shion A.S., Williams D.M., White K.M., Rosales R., Kehrer T., Miorin L., Moreno E., Patel A.H., Rihn S., Khalid M.M., Vallejo-Gracia A., Fozouni P., Simoneau C.R., Roth T.L., Wu D., Karim M.A., Ghoussaini M., Dunham I., Berardi F., Weigang S., Chazal M., Park J., Logue J., McGrath M., Weston S., Haupt R., Hastie C.J., Elliott M., Brown F., Burness K.A., Reid E., Dorward M., Johnson C., Wilkinson S.G., Geyer A., Giesel D.M., Baillie C., Raggett S., Leech H., Toth R., Goodman N., Keough K.C., Lind A.L., Zoonomia Consortium. 
Klesh R.J., Hemphill K.R., Carlson-Stevermer J., Oki J., Holden K., Maures T., Pollard K.S., Sali A., Agard D.A., Cheng Y., Fraser J.S., Frost A., Jura N., Kortemme T., Manglik A., Southworth D.R., Stroud R.M., Alessi D.R., Davies P., Frieman M.B., Ideker T., Abate C., Jouvenet N., Kochs G., Shoichet B., Ott M., Palmarini M., Shokat K.M., García-Sastre A., Rassen J.A., Grosse R., Rosenberg O.S., Verba K.A., Basler C.F., Vignuzzi M., Peden A.A., Beltrao P., Krogan N.J. Comparative host-coronavirus protein interaction networks reveal pan-viral disease mechanisms. Science. 2020;370.
- 17.Kim D.-K., Weller B., Lin C.-W., Sheykhkarimli D., Knapp J.J., Dugied G., Zanzoni A., Pons C., Tofaute M.J., Maseko S.B., Spirohn K., Laval F., Lambourne L., Kishore N., Rayhan A., Sauer M., Young V., Halder H., la Rosa N.M.-d., Pogoutse O., Strobel A., Schwehn P., Li R., Rothballer S.T., Altmann M., Cassonnet P., Coté A.G., Vergara L.E., Hazelwood I., Liu B.B., Nguyen M., Pandiarajan R., Dohai B., Coloma P.A.R., Poirson J., Giuliana P., Willems L., Taipale M., Jacob Y., Hao T., Hill D.E., Brun C., Twizere J.-C., Krappmann D., Heinig M., Falter C., Aloy P., Demeret C., Vidal M., Calderwood M.A., Roth F.P., Falter-Braun P. A proteome-scale map of the SARS-CoV-2-human contactome. Nat Biotechnol. 2022 doi: 10.1038/s41587-022-01475-z. (Oct.) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.DelToro N., Shrivastava A., Ragueneau E., Meldal B., Combe C., Barrera E., Perfetto L., How K., Ratan P., Shirodkar G., Lu O., Mészáros B., Watkins X., Pundir S., Licata L., Iannuccelli M., Pellegrini M., Martin M.J., Panni S., Duesbury M., Vallet S.D., Rappsilber J., Ricard-Blum S., Cesareni G., Salwinski L., Orchard S., Porras P., Panneerselvam K., Hermjakob H. The IntAct database: efficient access to fine-grained molecular interaction data. Nucleic Acids Res. 2022;50(D1):D648–D653. doi: 10.1093/nar/gkab1006. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Calderone A., Licata L., Cesareni G. VirusMentha: a new resource for virus-host protein interactions. Nucleic Acids Res. 2015;43(Database issue):D588–D592. doi: 10.1093/nar/gku830. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Chatr-aryamontri A., Ceol A., Peluso D., Nardozza A., Panni S., Sacco F., Tinti M., Smolyar A., Castagnoli L., Vidal M., Cusick M.E., Cesareni G. VirusMINT: a viral protein interaction database. Nucleic Acids Res. 2009;37(Database issue):D669–673. doi: 10.1093/nar/gkn739. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Ammari M.G., Gresham C.R., McCarthy F.M., Nanduri B. HPIDB 2.0: a curated database for host-pathogen interactions. Database: J Biol Databases Curation. 2016:baw103. doi: 10.1093/database/baw103.
- 22.Goodacre N., Devkota P., Bae E., Wuchty S., Uetz P. Protein-protein interactions of human viruses. Semin Cell Dev Biol. 2020;99:31–39. doi: 10.1016/j.semcdb.2018.07.018.
- 23.Saha D., Iannuccelli M., Brun C., Zanzoni A., Licata L. The intricacy of the viral-human protein interaction networks: Resources, data, and analyses. Front Microbiol. 2022;13 doi: 10.3389/fmicb.2022.849781. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Jenkins G.M., Holmes E.C. The extent of codon usage bias in human RNA viruses and its evolutionary origin. Virus Res. 2003;92(1):1–7. doi: 10.1016/s0168-1702(02)00309-x. [DOI] [PubMed] [Google Scholar]
- 25.Takata M.A., Gonçalves-Carneiro D., Zang T.M., Soll S.J., York A., Blanco-Melo D., Bieniasz P.D. CG dinucleotide suppression enables antiviral defence targeting non-self RNA. Nature. 2017;550(7674):124–127. doi: 10.1038/nature24039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Arinaminpathy N., Grenfell B. Dynamics of glycoprotein charge in the evolutionary history of human influenza. PloS One. 2010;5(12) doi: 10.1371/journal.pone.0015674. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Sorin M.N., Kuhn J., Stasiak A.C., Stehle T. Structural insight into non-enveloped virus binding to glycosaminoglycan receptors: a review. Viruses. 2021;13(5):800. doi: 10.3390/v13050800. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Babayan S.A., Orton R.J., Streicker D.G. Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science. 2018;362(6414):577–580. doi: 10.1126/science.aap9072.
- 29.Ahlgren N.A., Ren J., Lu Y.Y., Fuhrman J.A., Sun F. Alignment-free d*2 oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res. 2017;45(1):39–53. doi: 10.1093/nar/gkw1002. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Reinert G., Chew D., Sun F., Waterman M.S. Alignment-free sequence comparison (i): statistics and power. J Comput Biol: A J Comput Mol Cell Biol. 2009;16(12):1615–1634. doi: 10.1089/cmb.2009.0198. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Galiez C., Siebert M., Enault F., Vincent J., Söding J. WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics. 2017;33(19):3113–3114. doi: 10.1093/bioinformatics/btx383.
- 32.Xu B., Tan Z., Li K., Jiang T., Peng Y. Predicting the host of influenza viruses based on the word vector. PeerJ. 2017;5 doi: 10.7717/peerj.3579. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781 [cs] (Sep. 2013). 10.48550/arXiv.1301.3781 〈http://arxiv.org/abs/1301.3781〉.
- 34.Mehle A., Doudna J.A. Adaptive strategies of the influenza virus polymerase for replication in humans. Proc Natl Acad Sci USA. 2009;106(50):21312–21316. doi: 10.1073/pnas.0911915106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Mock F., Viehweger A., Barth E., Marz M. VIDHOP, viral host prediction with deep learning. Bioinformatics. 2021;37(3):318–325. doi: 10.1093/bioinformatics/btaa705.
- 36.Dutilh B.E., Cassman N., McNair K., Sanchez S.E., Silva G.G.Z., Boling L., Barr J.J., Speth D.R., Seguritan V., Aziz R.K., Felts B., Dinsdale E.A., Mokili J.L., Edwards R.A. A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nat Commun. 2014;5:4498. doi: 10.1038/ncomms5498. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Leite D.M.C., Brochet X., Resch G., Que Y.-A., Neves A., Peña-Reyes C. Computational prediction of inter-species relationships through omics data analysis and machine learning. BMC Bioinforma. 2018;19(Suppl 14):420. doi: 10.1186/s12859-018-2388-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Boeckaerts D., Stock M., Criel B., Gerstmans H., De Baets B., Briers Y. Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins. Sci Rep. 2021;11(1):1467. doi: 10.1038/s41598-021-81063-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Karabulut O.C., Karpuzcu B.A., Türk E., Ibrahim A.H., Süzek B.E. ML-AdVInfect: A Machine-Learning Based Adenoviral Infection Predictor. Front Mol Biosci. 2021;8 doi: 10.3389/fmolb.2021.647424. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Fineran P.C., Gerritzen M.J.H., Suárez-Diez M., Künne T., Boekhorst J., van Hijum S.A.F.T., Staals R.H.J., Brouns S.J.J. Degenerate target sites mediate rapid primed CRISPR adaptation. Proc Natl Acad Sci. 2014;111(16) doi: 10.1073/pnas.1400071111. (Apr.) [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Dion M.B., Plante P.-L., Zufferey E., Shah S.A., Corbeil J., Moineau S. Streamlining CRISPR spacer-based bacterial host predictions to decipher the viral dark matter. Nucleic Acids Res. 2021;49(6):3127–3138. doi: 10.1093/nar/gkab133. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Biswas A., Staals R.H.J., Morales S.E., Fineran P.C., Brown C.M. CRISPRDetect: A flexible algorithm to define CRISPR arrays. BMC Genom. 2016;17:356. doi: 10.1186/s12864-016-2627-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Illergård K., Ardell D.H., Elofsson A. Structure is three to ten times more conserved than sequence–a study of structural response in protein cores. Proteins. 2009;77(3):499–508. doi: 10.1002/prot.22458.
- 44.Jumper J., Evans R., Pritzel A., Green T., Figurnov M., Ronneberger O., Tunyasuvunakool K., Bates R., Žídek A., Potapenko A., Bridgland A., Meyer C., Kohl S.A.A., Ballard A.J., Cowie A., Romera-Paredes B., Nikolov S., Jain R., Adler J., Back T., Petersen S., Reiman D., Clancy E., Zielinski M., Steinegger M., Pacholska M., Berghammer T., Bodenstein S., Silver D., Vinyals O., Senior A.W., Kavukcuoglu K., Kohli P., Hassabis D. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589. doi: 10.1038/s41586-021-03819-2.
- 45.Baek M., DiMaio F., Anishchenko I., Dauparas J., Ovchinnikov S., Lee G.R., Wang J., Cong Q., Kinch L.N., Schaeffer R.D., Millán C., Park H., Adams C., Glassman C.R., DeGiovanni A., Pereira J.H., Rodrigues A.V., van Dijk A.A., Ebrecht A.C., Opperman D.J., Sagmeister T., Buhlheller C., Pavkov-Keller T., Rathinaswamy M.K., Dalwadi U., Yip C.K., Burke J.E., Garcia K.C., Grishin N.V., Adams P.D., Read R.J., Baker D. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373(6557):871–876. doi: 10.1126/science.abj8754.
- 46.Chowdhury R., Bouatta N., Biswas S., Floristean C., Kharkar A., Roy K., Rochereau C., Ahdritz G., Zhang J., Church G.M., Sorger P.K., AlQuraishi M. Single-sequence protein structure prediction using a language model and deep learning. Nat Biotechnol. 2022;40(11):1617–1623. doi: 10.1038/s41587-022-01432-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Yang X., Yang S., Li Q., Wuchty S., Zhang Z. Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method. Comput Struct Biotechnol J. 2020;18:153–161. doi: 10.1016/j.csbj.2019.12.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Iuchi H., Matsutani T., Yamada K., Iwano N., Sumi S., Hosoda S., Zhao S., Fukunaga T., Hamada M. Representation learning applications in biological sequence analysis. Comput Struct Biotechnol J. 2021;19:3198–3208. doi: 10.1016/j.csbj.2021.05.039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Dong T.N., Brogden G., Gerold G., Khosla M. A multitask transfer learning framework for the prediction of virus-human protein-protein interactions. BMC Bioinforma. 2021;22(1):572. doi: 10.1186/s12859-021-04484-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Alley E.C., Khimulya G., Biswas S., AlQuraishi M., Church G.M. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods. 2019;16(12):1315–1322. doi: 10.1038/s41592-019-0598-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Carlson C.J., Farrell M.J., Grange Z., Han B.A., Mollentze N., Phelan A.L., Rasmussen A.L., Albery G.F., Bett B., Brett-Major D.M., Cohen L.E., Dallas T., Eskew E.A., Fagre A.C., Forbes K.M., Gibb R., Halabi S., Hammer C.C., Katz R., Kindrachuk J., Muylaert R.L., Nutter F.B., Ogola J., Olival K.J., Rourke M., Ryan S.J., Ross N., Seifert S.N., Sironen T., Standley C.J., Taylor K., Venter M., Webala P.W. The future of zoonotic risk prediction. Philos Trans R Soc Lond Ser B, Biol Sci. 2021;376(1837) doi: 10.1098/rstb.2020.0358. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52.Wille M., Geoghegan J.L., Holmes E.C. How accurately can we assess zoonotic risk? PLOS Biol. 2021;19(4) doi: 10.1371/journal.pbio.3001135. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Sapoval N., Aghazadeh A., Nute M.G., Antunes D.A., Balaji A., Baraniuk R., Barberan C.J., Dannenfelser R., Dun C., Edrisi M., Elworth R.A.L., Kille B., Kyrillidis A., Nakhleh L., Wolfe C.R., Yan Z., Yao V., Treangen T.J. Current progress and open challenges for applying deep learning across the biosciences. Nat Commun. 2022;13(1):1728. doi: 10.1038/s41467-022-29268-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Hu X., Feng C., Ling T., Chen M. Deep learning frameworks for protein-protein interaction prediction. Comput Struct Biotechnol J. 2022;20:3223–3233. doi: 10.1016/j.csbj.2022.06.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Blohm P., Frishman G., Smialowski P., Goebels F., Wachinger B., Ruepp A., Frishman D. Negatome 2.0: a database of non-interacting proteins derived by literature mining, manual annotation and protein structure analysis. Nucleic Acids Res. 2014;42:D396–400. doi: 10.1093/nar/gkt1079. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Porras P., Barrera E., Bridge A., Del-Toro N., Cesareni G., Duesbury M., Hermjakob H., Iannuccelli M., Jurisica I., Kotlyar M., Licata L., Lovering R.C., Lynn D.J., Meldal B., Nanduri B., Paneerselvam K., Panni S., Pastrello C., Pellegrini M., Perfetto L., Rahimzadeh N., Ratan P., Ricard-Blum S., Salwinski L., Shirodkar G., Shrivastava A., Orchard S. Towards a unified open access dataset of molecular interactions. Nat Commun. 2020;11(1):6144. doi: 10.1038/s41467-020-19942-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Kiryo R., Niu G., du Plessis M.C., Sugiyama M. Positive-unlabeled learning with non-negative risk estimator. Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc.; 2017. 〈https://papers.nips.cc/paper/2017/hash/7cce53cf90577442771720a370c3c723-Abstract.html〉
- 58.Li F., Dong S., Leier A., Han M., Guo X., Xu J., Wang X., Pan S., Jia C., Zhang Y., Webb G.I., Coin L.J.M., Li C., Song J. Positive-unlabeled learning in bioinformatics and computational biology: a brief review. Brief Bioinforma. 2022;23(1) doi: 10.1093/bib/bbab461. [DOI] [PubMed] [Google Scholar]
- 59.Nurk S., Koren S., Rhie A., Rautiainen M., Bzikadze A.V., Mikheenko A., Vollger M.R., Altemose N., Uralsky L., Gershman A., Aganezov S., Hoyt S.J., Diekhans M., Logsdon G.A., Alonge M., Antonarakis S.E., Borchers M., Bouffard G.G., Brooks S.Y., Caldas G.V., Chen N.-C., Cheng H., Chin C.-S., Chow W., de Lima L.G., Dishuck P.C., Durbin R., Dvorkina T., Fiddes I.T., Formenti G., Fulton R.S., Fungtammasan A., Garrison E., Grady P.G.S., Graves-Lindsay T.A., Hall I.M., Hansen N.F., Hartley G.A., Haukness M., Howe K., Hunkapiller M.W., Jain C., Jain M., Jarvis E.D., Kerpedjiev P., Kirsche M., Kolmogorov M., Korlach J., Kremitzki M., Li H., Maduro V.V., Marschall T., McCartney A.M., McDaniel J., Miller D.E., Mullikin J.C., Myers E.W., Olson N.D., Paten B., Peluso P., Pevzner P.A., Porubsky D., Potapova T., Rogaev E.I., Rosenfeld J.A., Salzberg S.L., Schneider V.A., Sedlazeck F.J., Shafin K., Shew C.J., Shumate A., Sims Y., Smit A.F.A., Soto D.C., Sović I., Storer J.M., Streets A., Sullivan B.A., Thibaud-Nissen F., Torrance J., Wagner J., Walenz B.P., Wenger A., Wood J.M.D., Xiao C., Yan S.M., Young A.C., Zarate S., Surti U., McCoy R.C., Dennis M.Y., Alexandrov I.A., Gerton J.L., O’Neill R.J., Timp W., Zook J.M., Schatz M.C., Eichler E.E., Miga K.H., Phillippy A.M. The complete sequence of a human genome. Science. 2022;376(6588):44–53. doi: 10.1126/science.abj6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Steinegger M., Mirdita M., Söding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat Methods. 2019;16(7):603–606. doi: 10.1038/s41592-019-0437-4. [DOI] [PubMed] [Google Scholar]
- 61.Mitchell A.L., Almeida A., Beracochea M., Boland M., Burgin J., Cochrane G., Crusoe M.R., Kale V., Potter S.C., Richardson L.J., Sakharova E., Scheremetjew M., Korobeynikov A., Shlemov A., Kunyavskaya O., Lapidus A., Finn R.D. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res. 2020;48(D1):D570–D578. doi: 10.1093/nar/gkz1035. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.wwPDB consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 2019;47(D1):D520–D528. doi: 10.1093/nar/gky949. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Bryant P., Pozzati G., Elofsson A. Improved prediction of protein-protein interactions using AlphaFold2. Nat Commun. 2022;13(1):1265. doi: 10.1038/s41467-022-28865-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Sola I., Almazán F., Zúñiga S., Enjuanes L. Continuous and Discontinuous RNA Synthesis in Coronaviruses. Annu Rev Virol. 2015;2(1):265–288. doi: 10.1146/annurev-virology-100114-055218. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Kim D., Lee J.-Y., Yang J.-S., Kim J.W., Kim V.N., Chang H. The Architecture of SARS-CoV-2 Transcriptome. Cell. 2020;181(4):914–921.e10. doi: 10.1016/j.cell.2020.04.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 66.Pawlica P., Yario T.A., White S., Wang J., Moss W.N., Hui P., Vinetz J.M., Steitz J.A. SARS-CoV-2 expresses a microRNA-like small RNA able to selectively repress host genes. Proc Natl Acad Sci USA. 2021;118(52) doi: 10.1073/pnas.2116668118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Schmidt N., Lareau C.A., Keshishian H., Ganskih S., Schneider C., Hennig T., Melanson R., Werner S., Wei Y., Zimmer M., Ade J., Kirschner L., Zielinski S., Dölken L., Lander E.S., Caliskan N., Fischer U., Vogel J., Carr S.A., Bodem J., Munschauer M. The SARS-CoV-2 RNA-protein interactome in infected human cells. Nat Microbiol. 2021;6(3):339–353. doi: 10.1038/s41564-020-00846-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Kamel W., Noerenberg M., Cerikan B., Chen H., Järvelin A.I., Kammoun M., Lee J.Y., Shuai N., Garcia-Moreno M., Andrejeva A., Deery M.J., Johnson N., Neufeldt C.J., Cortese M., Knight M.L., Lilley K.S., Martinez J., Davis I., Bartenschlager R., Mohammed S., Castello A. Global analysis of protein-RNA interactions in SARS-CoV-2-infected cells reveals key regulators of infection. Mol Cell. 2021;81(13):2851–2867.e7. doi: 10.1016/j.molcel.2021.05.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69.Flynn R.A., Belk J.A., Qi Y., Yasumoto Y., Wei J., Alfajaro M.M., Shi Q., Mumbach M.R., Limaye A., DeWeirdt P.C., Schmitz C.O., Parker K.R., Woo E., Chang H.Y., Horvath T.L., Carette J.E., Bertozzi C.R., Wilen C.B., Satpathy A.T. Discovery and functional interrogation of SARS-CoV-2 RNA-host protein interactions. Cell. 2021;184(9):2394–2411.e16. doi: 10.1016/j.cell.2021.03.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Vandelli A., Monti M., Milanetti E., Armaos A., Rupert J., Zacco E., Bechara E., DelliPonti R., Tartaglia G.G. Structural analysis of SARS-CoV-2 genome and predictions of the human interactome. Nucleic Acids Res. 2020;48(20):11270–11283. doi: 10.1093/nar/gkaa864. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Yamada K., Hamada M. Prediction of RNA-protein interactions using a nucleotide language model. Bioinforma Adv. 2022;2(1):vbac023. doi: 10.1093/bioadv/vbac023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72.Ahmed S., Saito A., Suzuki M., Nemoto N., Nishigaki K. Host-parasite relations of bacteria and phages can be unveiled by oligostickiness, a measure of relaxed sequence similarity. Bioinforma (Oxf, Engl) 2009;25(5):563–570. doi: 10.1093/bioinformatics/btp003. [DOI] [PubMed] [Google Scholar]
- 73.Eng C.L.P., Tong J.C., Tan T.W. Predicting host tropism of influenza A virus proteins using random forest. BMC Med Genom. 2014;7 Suppl 3:S1. doi: 10.1186/1755-8794-7-S3-S1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Brister J.R., Ako-Adjei D., Bao Y., Blinkova O. NCBI viral genomes resource. Nucleic Acids Res. 2015;43(Database issue):D571–577. doi: 10.1093/nar/gku1207. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Harrison P.W., Ahamed A., Aslam R., Alako B.T.F., Burgin J., Buso N., Courtot M., Fan J., Gupta D., Haseeb M., Holt S., Ibrahim T., Ivanov E., Jayathilaka S., Balavenkataraman Kadhirvelu V., Kumar M., Lopez R., Kay S., Leinonen R., Liu X., O’Cathail C., Pakseresht A., Park Y., Pesant S., Rahman N., Rajan J., Sokolov A., Vijayaraja S., Waheed Z., Zyoud A., Burdett T., Cochrane G. The European Nucleotide Archive in 2020. Nucleic Acids Res. 2021;49(D1):D82–D85. doi: 10.1093/nar/gkaa1028. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 76.Villarroel J., Kleinheinz K.A., Jurtz V.I., Zschach H., Lund O., Nielsen M., Larsen M.V. HostPhinder: A Phage Host Prediction Tool. Viruses. 2016;8(5) doi: 10.3390/v8050116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Eng C.L.P., Tong J.C., Tan T.W. Predicting Zoonotic Risk of Influenza A Viruses from Host Tropism Protein Signature Using Random Forest. Int J Mol Sci. 2017;18(6) doi: 10.3390/ijms18061135. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Russell D.A., Hatfull G.F. PhagesDB: the actinobacteriophage database. Bioinforma (Oxf, Engl) 2017;33(5):784–786. doi: 10.1093/bioinformatics/btw711. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Benson D.A., Cavanaugh M., Clark K., Karsch-Mizrachi I., Ostell J., Pruitt K.D., Sayers E.W. GenBank. Nucleic Acids Res. 2018;46(D1):D41–D47. doi: 10.1093/nar/gkx1094. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Li H., Sun F. Comparative studies of alignment, alignment-free and SVM based approaches for predicting the hosts of viruses based on viral sequences. Sci Rep. 2018;8(1):10032. doi: 10.1038/s41598-018-28308-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Olival K.J., Hosseini P.R., Zambrana-Torrelio C., Ross N., Bogich T.L., Daszak P. Host and viral traits predict zoonotic spillover from mammals. Nature. 2017;546(7660):646–650. doi: 10.1038/nature22975. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.NCBI Resource Coordinators. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2017;45(D1):D12–D17. doi: 10.1093/nar/gkw1071. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Gałan W., Bąk M., Jakubowska M. Host taxon predictor - a tool for predicting taxon of the host of a newly discovered virus. Sci Rep. 2019;9(1):3436. doi: 10.1038/s41598-019-39847-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Mihara T., Nishimura Y., Shimizu Y., Nishiyama H., Yoshikawa G., Uehara H., Hingamp P., Goto S., Ogata H. Linking Virus Genomes with Host Taxonomy. Viruses. 2016;8(3):66. doi: 10.3390/v8030066. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85.Le Q., Mikolov T. Distributed Representations of Sentences and Documents. In: Proceedings of the 31st International Conference on Machine Learning. PMLR; 2014. pp. 1188–1196. 〈https://proceedings.mlr.press/v32/le14.html〉
- 86.Liu D., Ma Y., Jiang X., He T. Predicting virus-host association by Kernelized logistic matrix factorization and similarity network fusion. BMC Bioinforma. 2019;20(Suppl 16):594. doi: 10.1186/s12859-019-3082-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.Zhang Z., Cai Z., Tan Z., Lu C., Jiang T., Zhang G., Peng Y. Rapid identification of human-infecting viruses. Transbound Emerg Dis. 2019;66(6):2517–2522. doi: 10.1111/tbed.13314. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Kitson E., Suttle C.A. VHost-Classifier: virus-host classification using natural language processing. Bioinforma (Oxf, Engl) 2019;35(19):3867–3869. doi: 10.1093/bioinformatics/btz151. [DOI] [PubMed] [Google Scholar]
- 89.Qiang X., Kou Z. Predicting interspecies transmission of avian influenza virus based on wavelet packet decomposition. Comput Biol Chem. 2019;78:455–459. doi: 10.1016/j.compbiolchem.2018.11.029. [DOI] [PubMed] [Google Scholar]
- 90.Wang W., Ren J., Tang K., Dart E., Ignacio-Espinoza J.C., Fuhrman J.A., Braun J., Sun F., Ahlgren N.A. A network-based integrated framework for predicting virus-prokaryote interactions. NAR Genom Bioinforma. 2020;2(2) doi: 10.1093/nargab/lqaa044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91.Kuzmin K., Adeniyi A.E., DaSouza A.K., Lim D., Nguyen H., Molina N.R., Xiong L., Weber I.T., Harrison R.W. Machine learning methods accurately predict host specificity of coronaviruses based on spike sequences alone. Biochem Biophys Res Commun. 2020;533(3):553–558. doi: 10.1016/j.bbrc.2020.09.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Young F., Rogers S., Robertson D.L. Predicting host taxonomic information from viral genomes: A comparison of feature representations. PLoS Comput Biol. 2020;16(5) doi: 10.1371/journal.pcbi.1007894. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.Li M., Wang Y., Li F., Zhao Y., Liu M., Zhang S., Bin Y., Smith A.I., Webb G.I., Li J., Song J., Xia J. A Deep Learning-Based Method for Identification of Bacteriophage-Host Interaction. IEEE/ACM Trans Comput Biol Bioinforma. 2021;18(5):1801–1810. doi: 10.1109/TCBB.2020.3017386. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94.UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49(D1):D480–D489. doi: 10.1093/nar/gkaa1100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 95.Coutinho F.H., Zaragoza-Solas A., López-Pérez M., Barylski J., Zielezinski A., Dutilh B.E., Edwards R., Rodriguez-Valera F. RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content. Patterns (N Y, N Y) 2021;2(7) doi: 10.1016/j.patter.2021.100274. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Potter S.C., Luciani A., Eddy S.R., Park Y., Lopez R., Finn R.D. HMMER web server: 2018 update. Nucleic Acids Res. 2018;46(W1):W200–W204. doi: 10.1093/nar/gky448. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 97.Zhang R., Mirdita M., Levy Karin E., Norroy C., Galiez C., Söding J. Sensitive identification of phages from CRISPR spacers in prokaryotic hosts. Bioinforma (Oxf, Engl) 2021 doi: 10.1093/bioinformatics/btab222. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98.Edwards R.A., McNair K., Faust K., Raes J., Dutilh B.E. Computational approaches to predict bacteriophage-host relationships. FEMS Microbiol Rev. 2016;40(2):258–272. doi: 10.1093/femsre/fuv048. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 99.Zielezinski A., Barylski J., Karlowski W.M. Taxonomy-aware, sequence similarity ranking reliably predicts phage-host relationships. BMC Biol. 2021;19(1):223. doi: 10.1186/s12915-021-01146-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 100.Lefkowitz E.J., Dempsey D.M., Hendrickson R.C., Orton R.J., Siddell S.G., Smith D.B. Virus taxonomy: the database of the International Committee on Taxonomy of Viruses (ICTV). Nucleic Acids Res. 2018;46(D1):D708–D717. doi: 10.1093/nar/gkx932. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 101.Lu C., Zhang Z., Cai Z., Zhu Z., Qiu Y., Wu A., Jiang T., Zheng H., Peng Y. Prokaryotic virus host predictor: a Gaussian model for host prediction of prokaryotic viruses in metagenomics. BMC Biol. 2021;19(1):5. doi: 10.1186/s12915-020-00938-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.Wardeh M., Risley C., McIntyre M.K., Setzkorn C., Baylis M. Database of host-pathogen and related species interactions, and their global distribution. Sci Data. 2015;2 doi: 10.1038/sdata.2015.49. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 103.Wardeh M., Blagrove M.S.C., Sharkey K.J., Baylis M. Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations. Nat Commun. 2021;12(1):3954. doi: 10.1038/s41467-021-24085-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104.Bartoszewicz J.M., Seidel A., Renard B.Y. Interpretable detection of novel human viruses from genome sequencing data. NAR Genom Bioinforma. 2021;3(1) doi: 10.1093/nargab/lqab004. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 105.Brierley L., Fowler A. Predicting the animal hosts of coronaviruses from compositional biases of spike protein and whole genome sequences through machine learning. PLoS Pathog. 2021;17(4) doi: 10.1371/journal.ppat.1009149. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 106.Yang Y., Guo J., Wang P., Wang Y., Yu M., Wang X., Yang P., Sun L. Reservoir hosts prediction for COVID-19 by hybrid transfer learning model. J Biomed Inform. 2021;117 doi: 10.1016/j.jbi.2021.103736. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 107.Pickett B.E., Sadat E.L., Zhang Y., Noronha J.M., Squires R.B., Hunt V., Liu M., Kumar S., Zaremba S., Gu Z., Zhou L., Larson C.N., Dietrich J., Klem E.B., Scheuermann R.H. ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Res. 2012;40(Database issue):D593–598. doi: 10.1093/nar/gkr859. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 108.Squires R.B., Noronha J., Hunt V., García-Sastre A., Macken C., Baumgarth N., Suarez D., Pickett B.E., Zhang Y., Larsen C.N., Ramsey A., Zhou L., Zaremba S., Kumar S., Deitrich J., Klem E., Scheuermann R.H. Influenza research database: an integrated bioinformatics resource for influenza research and surveillance. Influenza Other Respir Virus. 2012;6(6):404–416. doi: 10.1111/j.1750-2659.2011.00331.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 109.Woolhouse M.E.J., Brierley L. Epidemiological characteristics of human-infective RNA viruses. Sci Data. 2018;5 doi: 10.1038/sdata.2018.17. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 110.Mollentze N., Streicker D.G. Viral zoonotic risk is homogenous among taxonomic orders of mammalian and avian reservoir hosts. Proc Natl Acad Sci USA. 2020;117(17):9423–9430. doi: 10.1073/pnas.1919176117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 111.Mollentze N., Babayan S.A., Streicker D.G. Identifying and prioritizing potential human-infecting viruses from their genome sequences. PLoS Biol. 2021;19(9) doi: 10.1371/journal.pbio.3001390. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 112.Pons J.C., Paez-Espino D., Riera G., Ivanova N., Kyrpides N.C., Llabrés M. VPF-Class: Taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinforma (Oxf, Engl) 2021 doi: 10.1093/bioinformatics/btab026. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 113.Dyer M.D., Murali T.M., Sobral B.W. Supervised learning and prediction of physical interactions between human and HIV proteins. Infect, Genet Evol: J Mol Epidemiol Evolut Genet Infect Dis. 2011;11(5):917–923. doi: 10.1016/j.meegid.2011.02.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 114.Cui G., Fang C., Han K. Prediction of protein-protein interactions between viruses and human by an SVM model. BMC Bioinforma. 2012;13(Suppl 7):S5. doi: 10.1186/1471-2105-13-S7-S5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 115.Emamjomeh A., Goliaei B., Zahiri J., Ebrahimpour R. Predicting protein-protein interactions between human and hepatitis C virus via an ensemble learning method. Mol Biosyst. 2014;10(12):3147–3154. doi: 10.1039/c4mb00410h. [DOI] [PubMed] [Google Scholar]
- 116.Barman R.K., Saha S., Das S. Prediction of interactions between viral and host proteins using supervised machine learning methods. PloS One. 2014;9(11) doi: 10.1371/journal.pone.0112034. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 117.Mei S., Zhu H. A novel one-class SVM based negative data sampling method for reconstructing proteome-wide HTLV-human protein interaction networks. Sci Rep. 2015;5:8034. doi: 10.1038/srep08034. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 118.Eid F.-E., ElHefnawi M., Heath L.S. DeNovo: virus-host sequence-based protein-protein interaction prediction. Bioinforma (Oxf, Engl) 2016;32(8):1144–1150. doi: 10.1093/bioinformatics/btv737. [DOI] [PubMed] [Google Scholar]
- 119.Ray S., Bandyopadhyay S. A NMF based approach for integrating multiple data sources to predict HIV-1-human PPIs. BMC Bioinforma. 2016;17:121. doi: 10.1186/s12859-016-0952-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 120.Kim B., Alguwaizani S., Zhou X., Huang D.-S., Park B., Han K. An improved method for predicting interactions between virus and human proteins. J Bioinforma Comput Biol. 2017;15(1) doi: 10.1142/S0219720016500244. [DOI] [PubMed] [Google Scholar]
- 121.Nourani E., Khunjush F., Durmuş S. Computational prediction of virus-human protein-protein interactions using embedding kernelized heterogeneous data. Mol Biosyst. 2016;12(6):1976–1986. doi: 10.1039/c6mb00065g. [DOI] [PubMed] [Google Scholar]
- 122.Basit A.H., Abbasi W.A., Asif A., Gull S., Minhas F.U.A.A. Training host-pathogen protein-protein interaction predictors. J Bioinforma Comput Biol. 2018;16(4) doi: 10.1142/S0219720018500142. [DOI] [PubMed] [Google Scholar]
- 123.Zhou X., Park B., Choi D., Han K. A generalized approach to predicting protein-protein interactions between virus and host. BMC Genom. 2018;19(Suppl 6):568. doi: 10.1186/s12864-018-4924-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 124.Alguwaizani S., Park B., Zhou X., Huang D.-S., Han K. Predicting interactions between virus and host proteins using repeat patterns and composition of amino acids. J Healthc Eng. 2018;2018:1391265. doi: 10.1155/2018/1391265. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 125.Dey L., Chakraborty S., Mukhopadhyay A. Machine learning techniques for sequence-based prediction of viral-host interactions between SARS-CoV-2 and human proteins. Biomed J. 2020;43(5):438–450. doi: 10.1016/j.bj.2020.08.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 126.Zhang Z., Ye S., Wu A., Jiang T., Peng Y. Prediction of the receptorome for the human-infecting virome. Virol Sin. 2021;36(1):133–140. doi: 10.1007/s12250-020-00259-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 127.Guven-Maiorov E., Hakouz A., Valjevac S., Keskin O., Tsai C.-J., Gursoy A., Nussinov R. HMI-PRED: a web server for structural prediction of host-microbe interactions based on interface mimicry. J Mol Biol. 2020;432(11):3395–3403. doi: 10.1016/j.jmb.2020.01.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 128.Chen J., Althagafi A., Hoehndorf R. Predicting candidate genes from phenotypes, functions and anatomical site of expression. Bioinforma (Oxf, Engl) 2021;37(6):853–860. doi: 10.1093/bioinformatics/btaa879. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 129.Liu-Wei W., Kafkas Ş., Chen J., Dimonaco N.J., Tegnér J., Hoehndorf R. DeepViral: prediction of novel virus-host interactions from protein sequences and infectious disease phenotypes. Bioinforma (Oxf, Engl) 2021 doi: 10.1093/bioinformatics/btab147. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 130.Yang X., Yang S., Lian X., Wuchty S., Zhang Z. Transfer learning via multi-scale convolutional neural layers for human-virus protein-protein interaction prediction. Bioinforma (Oxf, Engl) 2021 doi: 10.1093/bioinformatics/btab533. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 131.Koca M.B., Nourani E., Abbasoğlu F., Karadeniz I., Sevilgen F.E. Graph convolutional network based virus-human protein-protein interaction prediction for novel viruses. Comput Biol Chem. 2022;101 doi: 10.1016/j.compbiolchem.2022.107755. [DOI] [PubMed] [Google Scholar]

