PhageScope: a well-annotated bacteriophage database with automatic analyses and visualizations

Ruo Han Wang; Shuo Yang; Zhixuan Liu; Yuanzheng Zhang; Xueying Wang; Zixin Xu; Jianping Wang; Shuai Cheng Li

Nucleic Acids Res. 2023 Oct 30;52(D1):D756–D761. doi: 10.1093/nar/gkad979

PMCID: PMC10767790 PMID: 37904614

Abstract

Bacteriophages are viruses that infect bacteria or archaea. Understanding the diverse and intricate genomic architectures of phages is essential for studying microbial ecosystems and developing phage therapy strategies. However, existing phage databases lack meticulous annotations. To this end, we present PhageScope (https://phagescope.deepomics.org), an online phage database with comprehensive annotations. PhageScope harbors a collection of 873 718 phage sequences from various sources. Applying fifteen state-of-the-art tools to perform systematic annotations and analyses, PhageScope provides annotations on genome completeness, host range, lifestyle information, taxonomy classification, nine types of structural and functional genetic elements, and three types of comparative genomic studies for curated phages. Additionally, PhageScope incorporates automatic analyses and visualizations for curated and customized phages, serving as an efficient platform for phage study.

Introduction

Viruses infecting bacteria or archaea, i.e. bacteriophages or phages, are the most abundant and diverse biological entities on Earth (1). Phages play essential roles in maintaining species diversity and driving bacterial co-evolution. Given the threat of multi-drug resistance, phage therapy, which uses phages to treat bacterial infections, is considered an alternative to traditional antibiotic therapy. In-depth investigation of microbial systems and effective exploration of phages as therapeutic agents rely on meticulous studies on phage genomes.

Exhaustive investigations of phage genomes rely on extensive collections and diligent annotations of phage sequences. In recent years, the accumulation of next-generation sequencing (NGS) data and the development of phage detection methods have facilitated the computational extraction of numerous phage sequences from bacterial and metagenomic NGS data (2–8). However, these phage sequences derived from NGS data often lack accurate and detailed annotations, since phage annotation workflows, which require manually curated references and well-designed pipelines, are tedious and laborious.

A systematic workflow for phage annotation should encompass completeness assessment, phenotype and taxonomy determination, structural and functional annotations, as well as comparative genomic studies (9). First, the completeness assessment reports the quality of the assembled phage genomes. Then, the phenotype and taxonomy determination characterize the observable traits, such as morphology, lifestyles and host ranges. Furthermore, structural and functional annotations identify the genetic features and functional elements of phage genomes. Last, comparative genomic studies provide insight into evolutionary relationships, genetic diversity and functional variations for multiple phage genomes.

Existing bacteriophage databases and webservers do not provide all of the information and functionality described above. PhagesDB (10) stores actinobacteriophage sequences with phenotype annotations, but lacks genome annotations. MVP (11) provides phage-host interaction information, but no other information. PHROG (12) stores annotated phage protein families, but lacks other functional elements and phage information. PhANNs (13) and PhaGAA (14) are phage annotation webservers. However, they implement only partial workflows and lack curated data.

To fill the gap, we present PhageScope (https://phagescope.deepomics.org), a bacteriophage database with comprehensive annotations. PhageScope stores 873 718 phage sequences from multiple public repositories and published datasets. Following the workflow described above, we have applied fifteen state-of-the-art tools to provide annotations and analyses for curated phages, encompassing genome completeness, host range, lifestyle information, taxonomy classification and genetic element annotations, such as open reading frames (ORFs) and proteins, transcriptional terminators, tRNA and tmRNA genes, anti-CRISPR proteins, CRISPR arrays, virulence factors, antimicrobial resistance genes and transmembrane proteins. Comparative genomic studies among multiple sequences, including genome clustering, sequence alignment and comparative tree construction, are also available. In addition, to streamline the workflow for users to analyze their phage genomes, PhageScope also supports automatic analyses and interactive visualizations for curated and customized phages.

Materials and methods

Phage sequence collection

We first searched for phage sequences across multiple public repositories, including RefSeq (15), GenBank (16), EMBL (17) and DDBJ (18), using as keywords ‘phage’, ‘bacteriophage’, and the bacterial names from the NCBI Taxonomy database (19) combined with ‘virus’. Furthermore, we incorporated phages from various published datasets, including PhagesDB (10), GOV2 (2), GVD (3), GPD (4), MGV (5), CHVD (6), STV (20), IGVD (21) and IMG/VR (8), as well as 66 823 phage sequences from TemPhD, mined with our temperate phage detection method (7). As a result, we collected a dataset comprising 873 718 phage sequences (Supplementary Table S1). Subsequently, we applied multiple analysis tools to provide exhaustive genome annotations and sequence comparison information for the phages (Supplementary Table S2).

Genome annotation

Completeness assessment

We applied CheckV v0.8.1 (22) to assess the completeness of the phage genomes. CheckV assigns each genome to one of four quality tiers: complete, high-quality, medium-quality or low-quality. Genomes whose quality cannot be assessed are reported as not-determined.
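As a rough illustration of how such tiers can be derived, the sketch below maps an estimated completeness percentage to a tier label. The cutoffs (above 90% for high-quality, 50 to 90% for medium-quality, below 50% for low-quality) follow the CheckV documentation rather than this paper, and the function name `quality_tier` and its arguments are hypothetical. CheckV itself decides the complete tier from evidence such as terminal repeats, which is reduced here to a boolean flag.

```python
from typing import Optional


def quality_tier(completeness: Optional[float], is_complete: bool = False) -> str:
    """Map an estimated completeness (in percent) to a quality tier label.

    Assumption: cutoffs follow the CheckV documentation, not this paper.
    `is_complete` stands in for CheckV's own evidence of a complete genome.
    """
    if is_complete:
        return "complete"
    if completeness is None:
        # No completeness estimate could be produced for this genome.
        return "not-determined"
    if completeness > 90:
        return "high-quality"
    if completeness >= 50:
        return "medium-quality"
    return "low-quality"


if __name__ == "__main__":
    for value in (97.5, 62.0, 12.3, None):
        print(value, quality_tier(value))
```

Keeping the complete tier as a separate flag reflects that it is not a function of the completeness percentage alone.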

Phenotype annotation

We determined the phenotypic characteristics of the phages with respect to host range and lifestyle. Regarding phage host assignment, host information for 530 085 phages was available from the data sources through a systematic search and serves as the reference. To annotate the remaining phages, we combined homology search and DeepHost (23) to infer host taxonomies (Supplementary Methods S1.1). Regarding lifestyle prediction, the phages from the TemPhD dataset are temperate phages according to the mining method used to detect them. For the remaining phages, we used Graphage (24) to discriminate between virulent and temperate phages.
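The host assignment can be pictured as a fallback chain. The sketch below is a minimal illustration that assumes the order data source, then homology search, then DeepHost; the exact procedure is given in Supplementary Methods S1.1, and the function name `assign_host` and its arguments are hypothetical.

```python
from typing import Optional, Tuple


def assign_host(
    source_host: Optional[str],
    homology_host: Optional[str],
    deephost_host: Optional[str],
) -> Tuple[Optional[str], str]:
    """Return (host, origin) using the first available annotation.

    Assumption: the priority order is data source > homology search > DeepHost.
    """
    candidates = (
        (source_host, "data source"),
        (homology_host, "homology search"),
        (deephost_host, "DeepHost"),
    )
    for host, origin in candidates:
        if host:  # skip None and empty strings
            return host, origin
    return None, "unassigned"


if __name__ == "__main__":
    print(assign_host(None, "Escherichia coli", "Salmonella enterica"))
```

Returning the origin alongside the host mirrors the way the database reports the information source for each host taxonomy.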

Structural annotation

We identified structural components, including coding regions and transcriptional terminators, within phage genomes. First, we used Prodigal v2.6.3 (25) with the meta option to identify ORFs on the phage genomes. Then we employed eggNOG-mapper v2.1.10 (26) to conduct orthology assignments and transfer annotations from the assigned ortholog groups to the resulting coding sequences. To refine the annotation, for proteins that lacked hits, we iteratively applied MMseqs2 (27) in sensitive mode to detect homologs in the PHROG database (12), using an E-value threshold of <1e-5, and then annotated the proteins (Supplementary Methods S1.2 and Supplementary Figure S1). We further categorized the annotated proteins into ten functional classes (Supplementary Methods S1.3 and Supplementary Table S3). Additionally, we used TransTermHP v2.09 (28) to identify terminators on the phage genomes.

Taxonomic annotation

To determine the taxonomy of each phage, we adopted the approach of Nayfach et al., searching the phage proteins against a taxonomically representative HMM database (22). Specifically, 30 553 taxonomy-specific VOGs from eight taxonomical groups were selected as marker genes (Supplementary Table S4). For each phage, we applied hmmsearch to align its encoded proteins with the VOGs and assigned the phage to the taxonomical group with the most HMM hits.
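The final assignment step amounts to a majority vote over marker-gene hits. The sketch below assumes the HMM search results have already been reduced to one taxonomical group label per hit; the function name `assign_taxon` and the tie-breaking rule (first group encountered wins) are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Iterable, Optional


def assign_taxon(hit_groups: Iterable[str]) -> Optional[str]:
    """Return the taxonomical group with the most HMM hits, or None if no hits.

    Assumption: ties are broken by the order in which groups are first seen.
    """
    counts = Counter(hit_groups)
    if not counts:
        return None
    # most_common(1) yields the (group, count) pair with the highest count.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    hits = ["Caudoviricetes", "Microviridae", "Caudoviricetes"]
    print(assign_taxon(hits))
```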

Functional annotation

To assign functional annotations to genomic elements, we employed a series of analysis tools. First, we used tRNAscan-SE v2.0 with the bacterial option (29) and ARAGORN v1.2.41 with the bacterial genetic code (30) to detect tRNA and tmRNA genes within phage genomes. Additionally, we combined AcRanker (31) with a homology-based search, using Anti-CRISPRdb (32) as the reference, to identify anti-CRISPR proteins (Supplementary Methods S1.4). We used CRISPRCasFinder v4.2.20 (33) to detect CRISPR arrays on phage genomes. Furthermore, we employed MMseqs2 (27) to conduct a homology search of phage proteins against VFDB (34) and CARD (35), identifying virulence factors and antimicrobial resistance genes on phage genomes when a match met the required identity and coverage thresholds (36). Finally, we used TMHMM v2.09 (37) to predict the topology of transmembrane proteins.
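The filtering of homology hits against VFDB and CARD can be sketched as follows. This section does not state the numerical cutoffs, so both are left as required parameters; the `Hit` record, the function `filter_hits` and the values used in the usage example are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    query: str       # phage protein identifier
    target: str      # reference entry (e.g. a VFDB or CARD record)
    identity: float  # fraction of identical positions, 0 to 1
    coverage: float  # fraction of the sequence covered by the alignment, 0 to 1


def filter_hits(hits: List[Hit], min_identity: float, min_coverage: float) -> List[Hit]:
    """Keep only hits that satisfy both the identity and the coverage cutoff."""
    return [
        h for h in hits
        if h.identity >= min_identity and h.coverage >= min_coverage
    ]


if __name__ == "__main__":
    hits = [
        Hit("prot_1", "ref_A", identity=0.95, coverage=0.90),
        Hit("prot_2", "ref_B", identity=0.95, coverage=0.40),
    ]
    # Illustrative cutoffs only; they are not the values used by PhageScope.
    for hit in filter_hits(hits, min_identity=0.8, min_coverage=0.8):
        print(hit.query, hit.target)
```

Requiring both conditions guards against short, highly similar fragments being reported as full gene matches.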

Genome comparison

Sequence clustering

Clustering homologous phages enables comparative analyses of genomes. We performed a two-step procedure to assign the phages to subclusters and clusters. First, following the suggested criteria (38), we applied MMseqs2 (27), with thresholds of identity >0.9 and coverage >0.9, to generate subclusters along with their representative sequences. Subsequently, we used the representative sequences as the input to another round of MMseqs2, with thresholds of identity >0.6 and coverage >0.75, thereby generating clusters.
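The two-step logic can be illustrated with a toy greedy clustering. The sketch below is not the MMseqs2 algorithm: it assumes a user-supplied `sim` function returning (identity, coverage) for a pair of sequence identifiers, and the names `greedy_cluster` and `two_step_cluster` are hypothetical. Only the thresholds (0.9/0.9, then 0.6/0.75) come from the text.

```python
from typing import Callable, Dict, List, Tuple

# sim(a, b) -> (identity, coverage), both as fractions between 0 and 1.
Similarity = Callable[[str, str], Tuple[float, float]]


def greedy_cluster(
    ids: List[str], sim: Similarity, min_identity: float, min_coverage: float
) -> Dict[str, List[str]]:
    """Assign each sequence to the first representative it matches.

    A sequence that matches no existing representative becomes a new one.
    """
    clusters: Dict[str, List[str]] = {}
    for seq in ids:
        for rep in clusters:
            identity, coverage = sim(seq, rep)
            if identity > min_identity and coverage > min_coverage:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


def two_step_cluster(ids: List[str], sim: Similarity):
    """Step 1: strict subclusters. Step 2: cluster their representatives."""
    subclusters = greedy_cluster(ids, sim, 0.9, 0.9)
    clusters = greedy_cluster(list(subclusters), sim, 0.6, 0.75)
    return subclusters, clusters


if __name__ == "__main__":
    pairs = {frozenset(("a", "b")): (0.95, 0.95), frozenset(("a", "c")): (0.7, 0.8)}
    sim = lambda x, y: pairs.get(frozenset((x, y)), (0.1, 0.1))
    print(two_step_cluster(["a", "b", "c", "d"], sim))
```

Running the second round only on representatives keeps the number of pairwise comparisons manageable, which is the practical reason for a two-step design.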

Sequence alignment

To compare CDS sequences in multiple phage genomes, we adopted BLASTP (39) to perform a pairwise alignment between encoded proteins derived from the annotation process. The alignment coverage and identity values from the BLAST outputs were showcased in the alignment visualizations. The order in which the phages are presented in the visualizations was automatically determined to ensure an optimal arrangement of the alignments (Supplementary Methods S1.5).

Comparative tree construction

We represent the hierarchical relationships among multiple phages with a tree structure. To construct a comparative tree for multiple phages, we first applied Alfpy, an alignment-free sequence comparison tool, to calculate the genomic distances between the phage sequences. Alfpy has been shown to perform well on various sequence comparison tasks and to scale to large datasets (40). Then we used the neighbor-joining algorithm (41) to construct a comparative tree (Supplementary Methods S1.6).
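To give a sense of what an alignment-free genomic distance looks like, the sketch below computes Euclidean distances between normalized k-mer frequency profiles. This is one common alignment-free measure chosen for illustration; the text does not say which Alfpy measure PhageScope uses, and the function names and the default k = 4 are assumptions. The resulting pairwise distances are the kind of input a neighbor-joining implementation expects.

```python
from collections import Counter
from itertools import combinations
from math import sqrt
from typing import Dict, Tuple


def kmer_profile(seq: str, k: int = 4) -> Dict[str, float]:
    """Return the relative frequency of every k-mer in the sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def euclidean(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Euclidean distance between two k-mer frequency profiles."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


def distance_matrix(seqs: Dict[str, str], k: int = 4) -> Dict[Tuple[str, str], float]:
    """Pairwise distances keyed by (name_a, name_b) with name_a < name_b."""
    profiles = {name: kmer_profile(s, k) for name, s in seqs.items()}
    return {
        (a, b): euclidean(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    genomes = {"phage1": "ACGTACGTAC", "phage2": "ACGTACGTTT", "phage3": "GGGGCCCCGG"}
    for pair, dist in distance_matrix(genomes, k=2).items():
        print(pair, round(dist, 3))
```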

Platform development

PhageScope is hosted on an Ubuntu 20.04.6 LTS server with 1 TB of memory and 90 TB of storage. The platform is built on an in-house framework (42,43) consisting of Apache, Django, PostgreSQL and TypeScript + Vue 3. All online data visualizations are implemented with Oviz (44). We provide detailed tutorials on the platform to facilitate usage.

Results

PhageScope database

The PhageScope database holds a collection of 873 718 phages sourced from diverse public repositories and published databases, comprising 767 797 nonredundant sequences (Figure 1). The sequence length and GC content distributions are depicted in Supplementary Figure S2. The phage sequences, accompanied by their respective source links, are provided on PhageScope. Annotations and metadata of the phages are frequently lacking in the original data sources. To augment the database, we applied multiple state-of-the-art tools to provide the curated phages with systematic and comprehensive annotations.

Figure 1. Overview of the PhageScope database. The PhageScope database stores 873 718 phage sequences from diverse sources, along with their annotated information and comparative results.

The completeness levels for the curated phages are available in PhageScope, allowing users to assess the quality of the sequences. Among the phages, 72 668 sequences are complete, 300 137 are high quality, 212 175 are medium quality, 267 050 are low quality and the remaining 21 688 are not determined (Supplementary Figure S3).

PhageScope provides phenotype information for the phages, including host taxonomy and lifestyle. Host information for 530 085 phages is available from the data sources, and the remaining hosts are inferred through the aforementioned pipeline, with 124 446 from the homology search and 219 187 from DeepHost (23). Consequently, the curated phages are assigned to bacteria of 4723 species, 1649 genera, 435 families, 196 orders, 94 classes and 57 phyla. The complete host taxonomies, accompanied by the information sources, are accessible within PhageScope. Regarding lifestyle, the phages in PhageScope are categorized into 553 688 virulent phages and 320 030 temperate phages, based on information from data submitters or predictions generated by Graphage (24). The host taxonomy and lifestyle distributions are shown in Supplementary Figure S4.
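The completeness, host-source and lifestyle counts reported in this section can be cross-checked against the database total. The short sketch below simply re-adds the figures quoted in the text; the variable and function names are illustrative.

```python
from typing import Dict

TOTAL = 873_718  # total number of phage sequences in PhageScope

# Counts copied from the completeness, host and lifestyle paragraphs.
breakdowns: Dict[str, Dict[str, int]] = {
    "completeness": {
        "complete": 72_668,
        "high-quality": 300_137,
        "medium-quality": 212_175,
        "low-quality": 267_050,
        "not-determined": 21_688,
    },
    "host source": {
        "data source": 530_085,
        "homology search": 124_446,
        "DeepHost": 219_187,
    },
    "lifestyle": {"virulent": 553_688, "temperate": 320_030},
}


def mismatches(total: int, groups: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Return the summed count of every breakdown that does not equal the total."""
    sums = {name: sum(parts.values()) for name, parts in groups.items()}
    return {name: s for name, s in sums.items() if s != total}


if __name__ == "__main__":
    print(mismatches(TOTAL, breakdowns) or "all breakdowns sum to the total")
```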

The phage genomes within PhageScope are annotated with their genomic structures, providing detailed information about the locations of ORFs and transcription terminators. In total, 43 088 582 proteins and 6 462 417 terminators were detected in the stored phages. For proteins, the encoded products, functional classifications, physicochemical properties and annotation sources are provided. The proteins are categorized into 10 functional classes: lysis, integration, replication, tRNA-related, regulation, packaging, assembly, infection, immune and hypothetical protein (Supplementary Table S5). For the terminators, region types and confidence scores are available.

Additionally, PhageScope provides taxonomical annotations for the phages via homology search, assigning each phage to one of the following groups: Caudoviricetes; Microviridae; Inoviridae; Riboviria; Cressdnaviricota or Parvoviridae; Ampullaviridae, Bicaudaviridae or Turriviridae; Ligamenvirales; and Autolykiviridae, Fuselloviridae or Guttaviridae. Taxonomical classification enables users to explore genetic and functional traits within specific taxonomies.

PhageScope also provides exhaustive annotations of the functional elements associated with RNA molecules, CRISPR systems, host interaction and protein topology. By screening the phage genomes with the pipelines mentioned above, we have identified 691 091 tRNA genes, 11 516 tmRNA genes, 307 329 anti-CRISPR proteins, 56 652 CRISPR arrays, 41 609 virulence factors, 2602 antimicrobial resistance genes and 4 020 770 transmembrane proteins (Supplementary Table S6). All of these functional elements, along with their corresponding supplementary information, are curated within PhageScope.

Furthermore, we have clustered the phage sequences in PhageScope according to the sequence similarities to facilitate comparative analysis. For the resultant 555 901 clusters and 669 183 subclusters, PhageScope provides multiple sequence alignment and hierarchical comparative results with visualizations. The sequence alignment results present homologous CDSs among the sequences, and the hierarchical comparative results exhibit sequence similarity with a hierarchical structure, allowing researchers to comprehend the evolutionary connections and diversity among the phages.

All of the phage sequences, completeness reports, phenotype information, taxonomy categories, protein sequences and functional annotations, along with their sources, are ready for download in PhageScope.

Automatic analyses and visualizations

PhageScope supplies automatic analysis workflows for users to study their customized phages efficiently (Figure 2). The workflows encompass the aforementioned pipelines for genome annotation and genome comparison analyses (Supplementary Methods S1.7). Users can select curated phages or upload their customized phages and run partial or complete analyses tailored to their needs. Additionally, users have the option to compare their phage genomes with the PhageScope database in the genome comparison pipelines. Once the submitted tasks are completed, users can download the resulting documents and visualizations (Supplementary Figures S5–S12). The visualizations, which include informative tooltips with detailed descriptions, can be exported in a choice of high-quality formats suitable for inclusion in academic publications.

Comparison to the existing databases and webservers

We compare PhageScope with existing online bacteriophage databases, including PHROG (12), MVP (11) and PhagesDB (10) and webservers, including PhaGAA (14) and PhANNs (13). PhageScope demonstrates a multitude of distinct advantages, attributed to the extensive collection of phage sequences, comprehensive annotations, integrated automatic analyses and informative visualizations (Supplementary Table S7).

Discussion

PhageScope serves as a well-annotated bacteriophage database, enriched with advanced features such as automated analyses and visualizations. The extensive repertoire of phages, coupled with the comprehensive genome annotations and sequence comparison results from systematic analyses with fifteen state-of-the-art tools, provides details about genetic features and comparative genomics for phage study.

Accurate genome annotations for phages offer valuable insights into their survival mechanisms, host interactions and roles in horizontal gene transfer, as well as their potential and risks as therapeutic agents. PhageScope provides comprehensive annotations for curated phages, encompassing genome completeness, phenotype information, taxonomy classifications, genetic features and functional elements. This wealth of information is crucial to comprehending the landscape of phage diversity and discerning unique characteristics among different phages. For instance, in the PhageScope database, temperate phages exhibit a higher abundance of virulence factors, and phages rarely encode antimicrobial resistance genes (Supplementary Table S6), which agrees with previous findings (36,45).

Additionally, PhageScope provides comparative genomics to establish informative and intricate evolutionary dynamics within curated phages. The results of sequence clustering, multiple sequence alignment and comparative tree construction assist researchers in elucidating shared genomic features, the diverse genetic contents and the potential interrelationships between phages.

The incorporation of automatic analyses and interactive visualizations within PhageScope reinforces its purpose of serving the scientific community as a user-friendly and effective tool. Recently, a vast number of phages have been mined from NGS data using computational methods without reliable annotations, which prevents researchers from drawing meaningful conclusions from the data. PhageScope enables users to conveniently explore the details of their custom phages and gain insight into their unique genomic characteristics.

Acknowledgements

The authors thank Dr Yiqi Jiang for her constructive suggestions for the PhageScope platform.

Contributor Information

Ruo Han Wang, Department of Computer Science, City University of Hong Kong, Hong Kong.

Shuo Yang, Department of Computer Science, City University of Hong Kong, Hong Kong.

Zhixuan Liu, Department of Computer Science, City University of Hong Kong, Hong Kong.

Yuanzheng Zhang, Department of Computer Science, City University of Hong Kong, Hong Kong.

Xueying Wang, Department of Computer Science, City University of Hong Kong, Hong Kong; City University of Hong Kong (Dongguan), Dongguan, China.

Zixin Xu, Department of Computer Science, City University of Hong Kong, Hong Kong.

Jianping Wang, Department of Computer Science, City University of Hong Kong, Hong Kong.

Shuai Cheng Li, Department of Computer Science, City University of Hong Kong, Hong Kong.

Data availability

All the data are freely available at https://phagescope.deepomics.org.

Supplementary data

Supplementary Data are available at NAR Online.

Funding

Hong Kong Innovation and Technology Fund [GHX/002/19SZ (CityU: 9440262 to S.C.L.)]. Funding for open access charge: Hong Kong Innovation and Technology Fund [GHX/002/19SZ (CityU: 9440262 to S.C.L.)].

Conflict of interest statement. None declared.

References

1. Rohwer F. Global phage diversity. Cell. 2003; 113:141.
2. Gregory A.C., Zayed A.A., Conceição-Neto N., Temperton B., Bolduc B., Alberti A., Ardyna M., Arkhipova K., Carmichael M., Cruaud C. et al. Marine DNA viral macro- and microdiversity from pole to pole. Cell. 2019; 177:1109–1123.
3. Gregory A.C., Zablocki O., Zayed A.A., Howell A., Bolduc B., Sullivan M.B. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host Microbe. 2020; 28:724–740.
4. Camarillo-Guerrero L.F., Almeida A., Rangel-Pineros G., Finn R.D., Lawley T.D. Massive expansion of human gut bacteriophage diversity. Cell. 2021; 184:1098–1109.
5. Nayfach S., Páez-Espino D., Call L., Low S.J., Sberro H., Ivanova N.N., Proal A.D., Fischbach M.A., Bhatt A.S., Hugenholtz P. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 2021; 6:960–970.
6. Tisza M.J., Buck C.B. A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases. Proc. Natl. Acad. Sci. U.S.A. 2021; 118:e2023202118.
7. Zhang X., Wang R., Xie X., Hu Y., Wang J., Sun Q., Feng X., Lin W., Tong S., Yan W. et al. Mining bacterial NGS data vastly expands the complete genomes of temperate phages. NAR Genom. Bioinform. 2022; 4:lqac057.
8. Camargo A.P., Nayfach S., Chen I.-M.A., Palaniappan K., Ratner A., Chu K., Ritter S.J., Reddy T., Mukherjee S., Schulz F. et al. IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata. Nucleic Acids Res. 2023; 51:D733–D743.
9. Turner D., Adriaenssens E.M., Tolstoy I., Kropinski A.M. Phage annotation guide: Guidelines for assembly and high-quality annotation. Phage. 2021; 2:170–182.
10. Russell D.A., Hatfull G.F. PhagesDB: the actinobacteriophage database. Bioinformatics. 2017; 33:784–786.
11. Gao N.L., Zhang C., Zhang Z., Hu S., Lercher M.J., Zhao X.-M., Bork P., Liu Z., Chen W.-H. MVP: a microbe–phage interaction database. Nucleic Acids Res. 2018; 46:D700–D707.
12. Terzian P., Olo Ndela E., Galiez C., Lossouarn J., Pérez Bucio R.E., Mom R., Toussaint A., Petit M.-A., Enault F. PHROG: families of prokaryotic virus proteins clustered using remote homology. NAR Genom. Bioinform. 2021; 3:lqab067.
13. Cantu V.A., Salamon P., Seguritan V., Redfield J., Salamon D., Edwards R.A., Segall A.M. PhANNs, a fast and accurate tool and web server to classify phage structural proteins. PLoS Comput. Biol. 2020; 16:e1007845.
14. Wu J., Liu Q., Li M., Xu J., Wang C., Zhang J., Xiao M., Bin Y., Xia J. PhaGAA: an integrated web server platform for phage genome annotation and analysis. Bioinformatics. 2023; 39:btad120.
15. O’Leary N.A., Wright M.W., Brister J.R., Ciufo S., Haddad D., McVeigh R., Rajput B., Robbertse B., Smith-White B., Ako-Adjei D. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016; 44:D733–D745.
16. Benson D.A., Cavanaugh M., Clark K., Karsch-Mizrachi I., Ostell J., Pruitt K.D., Sayers E.W. GenBank. Nucleic Acids Res. 2018; 46:D41.
17. Kanz C., Aldebert P., Althorpe N., Baker W., Baldwin A., Bates K., Browne P., van den Broek A., Castro M., Cochrane G. et al. The EMBL nucleotide sequence database. Nucleic Acids Res. 2005; 33:D29–D33.
18. Ogasawara O., Kodama Y., Mashima J., Kosuge T., Fujisawa T. DDBJ Database updates and computational infrastructure enhancement. Nucleic Acids Res. 2020; 48:D45–D50.
19. Schoch C.L., Ciufo S., Domrachev M., Hotton C.L., Kannan S., Khovanskaya R., Leipe D., Mcveigh R., O’Neill K., Robbertse B. et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database. 2020; 2020:baaa062.
20. Santos-Medellin C., Zinke L.A., Ter Horst A.M., Gelardi D.L., Parikh S.J., Emerson J.B. Viromes outperform total metagenomes in revealing the spatiotemporal patterns of agricultural soil viral communities. ISME J. 2021; 15:1956–1970.
21. Shah S.A., Deng L., Thorsen J., Pedersen A.G., Dion M.B., Castro-Mejía J.L., Silins R., Romme F.O., Sausset R., Jessen L.E. et al. Expanding known viral diversity in the healthy infant gut. Nat. Microbiol. 2023; 8:986–998.
22. Nayfach S., Camargo A.P., Schulz F., Eloe-Fadrosh E., Roux S., Kyrpides N.C. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 2021; 39:578–585.
23. Wang R., Zhang X., Wang J., Li S.C. DeepHost: phage host prediction with convolutional neural network. Brief. Bioinform. 2022; 23:bbab385.
24. Wang R., Ng Y.K., Zhang X., Wang J., Li S. Coding nucleic acid sequences with graph convolutional network. 2022; bioRxiv doi: 10.1101/2022.08.22.504727, 28 December 2022, preprint: not peer reviewed.
25. Hyatt D., Chen G.-L., LoCascio P.F., Land M.L., Larimer F.W., Hauser L.J. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010; 11:119.
26. Cantalapiedra C.P., Hernández-Plaza A., Letunic I., Bork P., Huerta-Cepas J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 2021; 38:5825–5829.
27. Steinegger M., Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 2017; 35:1026–1028.
28. Kingsford C.L., Ayanbule K., Salzberg S.L. Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome Biol. 2007; 8:R22.
29. Lowe T.M., Eddy S.R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997; 25:955–964.
30. Laslett D., Canback B. ARAGORN, a program to detect tRNA genes and tmRNA genes in nucleotide sequences. Nucleic Acids Res. 2004; 32:11–16.
31. Eitzinger S., Asif A., Watters K.E., Iavarone A.T., Knott G.J., Doudna J.A., Minhas F.u.A.A. Machine learning predicts new anti-CRISPR proteins. Nucleic Acids Res. 2020; 48:4698–4708.
32. Dong C., Hao G.-F., Hua H.-L., Liu S., Labena A.A., Chai G., Huang J., Rao N., Guo F.-B. Anti-CRISPRdb: a comprehensive online resource for anti-CRISPR proteins. Nucleic Acids Res. 2018; 46:D393–D398.
33. Couvin D., Bernheim A., Toffano-Nioche C., Touchon M., Michalik J., Néron B., Rocha E.P., Vergnaud G., Gautheret D., Pourcel C. CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins. Nucleic Acids Res. 2018; 46:W246–W251.
34. Chen L., Yang J., Yu J., Yao Z., Sun L., Shen Y., Jin Q. VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res. 2005; 33:D325–D328.
35. McArthur A.G., Waglechner N., Nizam F., Yan A., Azad M.A., Baylay A.J., Bhullar K., Canova M.J., De Pascale G., Ejim L. et al. The comprehensive antibiotic resistance database. Antimicrob. Agents Ch. 2013; 57:3348–3357.
36. Enault F., Briet A., Bouteille L., Roux S., Sullivan M.B., Petit M.-A. Phages rarely encode antibiotic resistance genes: a cautionary tale for virome analyses. ISME J. 2017; 11:237–247.
37. Krogh A., Larsson B., Von Heijne G., Sonnhammer E.L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 2001; 305:567–580.
38. Steinegger M., Söding J. Clustering huge protein sequence sets in linear time. Nat. Commun. 2018; 9:2542.
39. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990; 215:403–410.
40. Zielezinski A., Vinga S., Almeida J., Karlowski W.M. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol. 2017; 18:186.
41. Saitou N., Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987; 4:406–425.
42. Jiang Y., Wang Y., Che L., Zhou Q., Li S.C. GutMeta: online microbiome analysis and interactive visualization with build-in curated human gut microbiome database. 2022; bioRxiv doi: 10.1101/2022.09.26.509484, 27 September 2022, preprint: not peer reviewed.
43. Wang X., Chen L., Liu W., Zhang Y., Liu D., Zhou C., Shi S., Dong J., Lai Z., Zhao B. et al. TIMEDB: tumor immune micro-environment cell composition database with automatic analysis and interactive visualization. Nucleic Acids Res. 2023; 51:D1417–D1424.
44. Jia W., Li H., Li S., Chen L., Li S.C. Oviz-Bio: a web-based platform for interactive cancer genomics data visualization. Nucleic Acids Res. 2020; 48:W415–W426.
45. Fortier L.-C., Sekulovic O. Importance of prophages to evolution and virulence of bacterial pathogens. Virulence. 2013; 4:354–365.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

gkad979_supplemental_file

Click here for additional data file.^{(6.1MB, pdf)}

Data Availability Statement

All the data are freely available at https://phagescope.deepomics.org.

