Genomics, Proteomics & Bioinformatics. 2021 Jan 6;18(6):760–772. doi: 10.1016/j.gpb.2020.11.001

MicroPhenoDB Associates Metagenomic Data with Pathogenic Microbes, Microbial Core Genes, and Human Disease Phenotypes

Guocai Yao 1,#, Wenliang Zhang 1,#, Minglei Yang 1, Huan Yang 1, Jianbo Wang 1, Haiyue Zhang 1, Lai Wei 3, Zhi Xie 2,3, Weizhong Li 1,2,4
PMCID: PMC8377004  PMID: 33418085

Abstract

Microbes play important roles in human health and disease. The interaction between microbes and hosts is a reciprocal relationship, which remains largely under-explored. Current computational resources lack manually and consistently curated data to connect metagenomic data to pathogenic microbes, microbial core genes, and disease phenotypes. We developed the MicroPhenoDB database by manually curating and consistently integrating microbe-disease association data. MicroPhenoDB provides 5677 non-redundant associations between 1781 microbes and 542 human disease phenotypes across more than 22 human body sites. MicroPhenoDB also provides 696,934 relationships between 27,277 unique clade-specific core genes and 685 microbes. Disease phenotypes are classified and described using the Experimental Factor Ontology (EFO). A refined score model was developed to prioritize the associations based on evidential metrics. The sequence search option in MicroPhenoDB enables rapid identification of existing pathogenic microbes in samples without running the usual metagenomic data processing and assembly. MicroPhenoDB offers data browsing, searching, and visualization through user-friendly web interfaces and web service application programming interfaces. MicroPhenoDB is the first database platform to detail the relationships between pathogenic microbes, core genes, and disease phenotypes. It will accelerate metagenomic data analysis and assist studies in decoding microbes related to human diseases. MicroPhenoDB is available through http://www.liwzlab.cn/microphenodb and http://lilab2.sysu.edu.cn/microphenodb.

Keywords: Pathogenic microbes, Metagenomic data, Disease phenotypes, Microbe-disease association, COVID-19

Introduction

The human body hosts a large number of microbes, mainly bacteria, followed by archaea, fungi, viruses, and protozoa. Microbes inhabit various organs of the human body, mainly the gastrointestinal tract but also the respiratory tract, oral cavity, stomach, and skin, and play important roles in human health and disease [1], [2], [3]. Microbial gene products have rich biochemical and metabolic activities in the host [4], [5], [6]. Microorganisms usually form a healthy symbiotic relationship with the host. However, when the microbial content becomes abnormal or exogenous microbes infect the host, the balance of host microecology can be broken, which in turn may cause various diseases [7], [8]. Tripartite network analysis in patients with irritable bowel syndrome demonstrated that the gut microbe Clostridia is significantly associated with brain functional connectivity and gastrointestinal sensorimotor function [9]. Strati et al. reported that Rett syndrome is substantially associated with a dysbiosis of both bacterial and fungal components of the gut microbiota [10]. The microbial communities on psoriatic skin differ from those on healthy skin and have a potential role in Th17 polarization to exacerbate cutaneous inflammation [11]. The ongoing pandemic of coronavirus disease 2019 (COVID-19) had affected more than 220 countries, areas, or territories worldwide by November 2020. Lung injury has been reported in most patients with confirmed severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection [12].

The interaction between microbes and hosts is a reciprocal relationship and remains largely under-explored [13]. Accurate relationship information between microbes and diseases can greatly assist studies in human health [14]. With the wide application of next-generation sequencing (NGS) technology, microbiological analysis methods and standards are being rapidly developed, such as metagenomic approaches [15]. As a result, a large amount of experimental data has been published [16]. Thus accurate database platforms are greatly needed to utilize these experimental data, determine the composition of pathogenic microbes in hosts, clarify microbial-disease relationships, and provide standardized high-quality annotation for clinical uses [17].

Due to the functional and clinical significance of microbes, several public databases have been established to collect microbe-disease association data, such as the Human Microbe-Disease Association Database (HMDAD) [18], Disbiome [19], the Virulence Factor Database (VFDB) [20], and the Comprehensive Antibiotic Resistance Database (CARD) [21]. HMDAD and Disbiome collate text-mining-based microbe–disease association data from peer-reviewed publications and describe the strength of the associations based on the credibility of the data sources. VFDB provides up-to-date knowledge of the virulence factors (VFs) of various bacterial pathogens; CARD contains high-quality reference data on the molecular basis of antimicrobial resistance with an emphasis on genes, proteins, and mutations involved. Data in VFDB and CARD help to explain the relationship between pathogenic microbial genes and the health status of hosts. In addition, to assist physicians and healthcare providers to quickly and accurately diagnose infectious diseases in patients, a guideline for utilization of the microbiology laboratory for diagnosis of infectious diseases was developed and is being regularly updated by the Infectious Diseases Society of America (IDSA) and the American Society for Microbiology (ASM) [22]. The curation and analysis of microbe-disease association data are essential for expediting translational research and application. However, these computational resources lack manually and consistently curated data to connect metagenomic data to pathogenic microbes, microbial core genes, and disease phenotypes.

To bridge this gap, we developed the MicroPhenoDB database (http://www.liwzlab.cn/microphenodb) by manually curating and consistently integrating microbe-disease association data. We collected and curated the microbe-disease associations from the IDSA guideline [22], the National Cancer Institute (NCI) Thesaurus OBO Edition (NCIT) [23], and the HMDAD [18] and Disbiome [19] databases, and also connected microbial core genes derived from the MetaPhlAn2 dataset [24] to pathogenic microbes and human diseases. A refined score model was adopted to prioritize the microbe-disease associations based on evidential metrics [18], [25]. In addition, a sequence search web application was also implemented to allow users to query sequencing data to identify pathogenic microbes in metagenomic samples, as well as to retrieve the disease-related information of virulence factors and antibiotic resistances. MicroPhenoDB allows users to browse, search, access, and analyze data through user-friendly web interfaces, visualizations, and web service application programming interfaces (APIs).

Data collection and processing

Data collection and manual annotation

To ensure data quality, we integrated the association data with annotations from HMDAD and Disbiome and manually collated and curated microbe-disease association data from the IDSA guideline and NCIT (Figure 1). The IDSA guideline provides criteria for clinical identification of infectious microbes, while NCIT is a reference terminology that provides comprehensive information for infectious microbes. To enrich the annotation for disease-microbe associations, we manually traced the relevant literature in HMDAD and Disbiome; we also provided the microbes with annotation at the resolution of species levels, such as taxonomies and official names. Association data between infectious microbes and diseases in IDSA were extracted. Relevant information about disease phenotypes and microbes in the microorganism notes from NCIT were extracted as well. The collected and integrated association data include information about microbe symbols, disease symbols, the increased or decreased impacts of the microbes, PubMed identifiers, and validation methods.

Figure 1. Workflow demonstrating the construction and curation of the MicroPhenoDB database

CARD, Comprehensive Antibiotic Resistance Database; EFO, Experimental Factor Ontology; HMDAD, Human Microbe-Disease Association Database; IDSA, Infectious Diseases Society of America; NCIT, National Cancer Institute Thesaurus; VFDB, Virulence Factor Database.

Controlled vocabulary and ontology to describe microbes and diseases

In MicroPhenoDB, several standard terminology and controlled vocabulary resources were adopted to consistently annotate microbes and diseases (Figure 1). Different tools and reference databases might give different taxonomies for microbes. To avoid this discrepancy, the official names of microbes were taken from NCIT [23], and the taxonomy identifiers were adopted from the National Center for Biotechnology Information (NCBI) [26] and UniProt [27]. The relationships between core genes and microbes were annotated using the MetaPhlAn2 tool [28], the microbial gene functions were annotated using the InterProScan tool [29], and the virulence factors and the drug resistance information of microbes were retrieved respectively from the databases of VFDB [20] and CARD [21]. The disease phenotypes were annotated with official names, experimental factor terms, definitions, classifications, and cross-references using the Experimental Factor Ontology (EFO) [30]. EFO provides a systematic description of many experimental variables across the European Bioinformatics Institute (EMBL-EBI) databases and the National Human Genome Research Institute (NHGRI) genome-wide association study (GWAS) catalog [31]; it also combines parts of several popular ontologies, such as Orphanet Rare Disease Ontology [32], Human Phenotype Ontology [33], and Monarch Disease Ontology [34]. The versions or releases of databases and tools used in the MicroPhenoDB construction are detailed in Table S1.

Association score model

One of the main problems in exploiting extensive collections of aggregated microbiome data is how to prioritize the associations. According to the previous studies by Ma et al. [18] and Pinero et al. [25], we refined the association score model to prioritize the microbe-disease associations using additional evidential metrics, including the number of sources that report the association, the type of curation of each source, and the number of supporting publications in the manual curation.

For every disease i and every microbe j, the raw score of their relationship Raw_scoreij was defined as:

$$Raw\_score_{ij} = (W_{IDSA} + W_{NCIT} + W_{Literature}) \times \log\frac{N}{n_j} \tag{1}$$

In Equation (1), WIDSA is the weight of the association source from the IDSA guideline, WNCIT is the weight of the association source from NCIT, and WLiterature is the weight of the association source from literature publications. N is the number of all diseases in MicroPhenoDB, and nj is the number of diseases associated with microbe j. Log(N/nj) is computed to increase Raw_scoreij for the microbes that are associated explicitly with few diseases or decrease Raw_scoreij for the microbes globally associated with several diverse diseases.

In Equations (2), (3), (4), MicroPhenoDB assigns different weights to different evidential sources according to their reliabilities (Table 1) [25]. If the association is curated from literature, WLiterature is initially assigned as 0.25, otherwise assigned as 0. If the association is curated from NCIT [23], WNCIT is initially assigned as 0.5, which is double that of WLiterature, otherwise assigned as 0. If the association is curated from IDSA [22], WIDSA is initially assigned as 1.0, which is double that of WNCIT, otherwise assigned as 0. The three weights also depend on the direction of the abundance change of a microbe in a disease and the number of supporting publications. Dij (Dij ∈{1, −1}) represents the direction of the abundance change of microbe j in disease i. If the microbe j is increased in the case of disease i, Dij equals 1; if the microbe j is decreased in the case of disease i, Dij equals −1. np is the number of publications in which an association between a disease and a microbe has been reported. From the distribution of numbers of evidence, we found that np was less than 16 and mostly ranged from 1 to 2 (Figure S1).

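$$W_{Literature} = \begin{cases} D_{ij} \times 0.25 \times n_p & \text{association from literature} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

$$W_{NCIT} = \begin{cases} D_{ij} \times 0.5 & \text{association from NCIT} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

$$W_{IDSA} = \begin{cases} D_{ij} \times 1.0 & \text{association from IDSA} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

$$Score_{ij} = \frac{2}{1 + \frac{1}{e^{Raw\_score_{ij}}}} - 1 \tag{5}$$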

Table 1. The weight of different evidential sources according to their reliabilities

Note: IDSA, Infectious Diseases Society of America; NCIT, National Cancer Institute Thesaurus OBO Edition.

Finally, the sigmoid function was used to normalize Raw_scoreij so that the final association score Scoreij ranges from −1 to 1. In Equation (5), ‘e’ represents the natural constant e. Scoreij can be used to judge the confidence of the relationship between a microbe and a disease phenotype; the score distribution is shown in Figure 2. A Scoreij greater than 0 indicates that the occurrence of the disease correlates with an increase in microbial abundance, and a Scoreij less than 0 indicates that it correlates with a decrease in microbial abundance. The greater the absolute value of Scoreij, the higher the number of previous reports of the respective microbe-disease association; the closer the score is to zero, the lower the number of previous reports. Inspection of the Scoreij distribution showed that most associations had a Scoreij between −0.3 and 0.3, and that the two peaks with Scoreij greater than 0.3 corresponded to high-confidence associations from NCIT and IDSA (Figure 2). This suggested that −0.3 and 0.3 are suitable thresholds for assessing the confidence level of an association.
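The score model described above can be sketched in a few lines of Python. This is a minimal illustration, not the database's own implementation. It assumes that the log factor in Equation (1) multiplies the summed weights and that the logarithm is base 10 (the text does not state the base). The function and parameter names are hypothetical.

```python
import math


def raw_score(direction, n_publications, from_idsa, from_ncit,
              n_diseases_total, n_diseases_for_microbe):
    """Raw association score following Equations (1)-(4).

    direction: +1 if the microbe is increased in the disease, -1 if decreased.
    n_publications: supporting publications (n_p); 0 means no literature evidence.
    Assumptions: base-10 logarithm; log factor applied to the summed weights.
    """
    if direction not in (1, -1):
        raise ValueError("direction must be 1 or -1")
    w_literature = direction * 0.25 * n_publications
    w_ncit = direction * 0.5 if from_ncit else 0.0
    w_idsa = direction * 1.0 if from_idsa else 0.0
    # Microbes linked to few diseases get a larger factor than ubiquitous ones.
    specificity = math.log10(n_diseases_total / n_diseases_for_microbe)
    return (w_idsa + w_ncit + w_literature) * specificity


def normalized_score(raw):
    """Equation (5): rescaled sigmoid mapping the raw score into (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-raw)) - 1.0


# A literature-only association with decreased abundance and two publications,
# for a microbe linked to 10 of the 542 disease phenotypes.
example = normalized_score(
    raw_score(direction=-1, n_publications=2, from_idsa=False,
              from_ncit=False, n_diseases_total=542,
              n_diseases_for_microbe=10)
)
print(round(example, 3))
```

Because the direction term is carried inside each weight, a decreased abundance yields a negative raw score, and the rescaled sigmoid keeps that sign.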

Figure 2. The distribution of association scores in MicroPhenoDB

Implementation

The web applications in MicroPhenoDB were implemented in Java language by using the model-view-controller model and the SpringBoot framework and were deployed on an Apache Tomcat web server. The association data of microbes and disease phenotypes were stored in a MySQL database. Data access, search, and visualization were implemented by using the Ajax API technology. The frontend interface was visualized by using the Vue.js framework. The sequence search tool was implemented using the EMBL-EBI tool framework [35].

Database content and usage

Database content

MicroPhenoDB collated 7449 redundant associations between 1781 microbes and 542 human disease phenotypes across more than 22 human body sites (Table 2). Of the 7449 associations, 29.7% were manually curated from the IDSA guideline (1196, 16.1%) [22], NCIT (849, 11.4%) [23], and peer-reviewed publications for human respiratory infection virus (164, 2.2%), and the others were consistently integrated with annotation from HMDAD (673, 9.0%) [18] and Disbiome (4567, 61.3%) [19] (Figure 3A). Multiple publications might support the same association between a microbe and a disease phenotype. After removing data redundancy based on the supporting publications, MicroPhenoDB produced 5677 non-redundant microbe-disease phenotype associations (Table 2). The number of non-redundant associations was over 11-fold (5677/483) of that in HMDAD. Each non-redundant association was assigned with a unique accession number (e.g., MBP00000900) and an association score. For the microbe distribution, MicroPhenoDB contained 1497 bacteria in a broad sense (including 1474 bacteria in a narrow sense, 11 Rickettsia, 6 Chlamydia, 4 Ehrlichia, and 2 Mycoplasma), 183 viruses, 58 fungi, and 43 parasites (Table 2). Approximately 88.3% (5014/5677), 8.5% (481/5677), 2.0% (116/5677), and 1.2% (66/5677) of the associations were related to bacteria, viruses, fungi, and parasites respectively (Figure 3B). The top six frequent disease-associated bacteria phyla were Firmicutes, Proteobacteria, Bacteroidetes, Actinobacteria, Spirochaetes, and Fusobacteria. The top disease-associated fungal phylum was Ascomycota. Firmicutes included 271 genus/species in 4 classes (Bacilli, Clostridia, Erysipelotrichia, and Negativicutes) (Figure 3C). The microbes were mainly distributed in the body sites of the gastrointestinal tract (37.3%), oral cavity (9.5%), respiratory tract (6.9%), skin and soft tissue (4.2%), urinary tract (3.5%), vagina (2.5%), and central nervous system (2.0%) (Table 3). 
The disease phenotypes were classified and described by EFO [30]. Many diseases were associated with pathogenic microorganisms, such as bacterial, digestive, nervous, and autoimmune diseases (Figure 3D).
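The redundancy-removal step described above, in which several publications supporting the same microbe-disease pair are collapsed into one association, can be pictured with a short Python sketch. The record layout and all identifiers are hypothetical placeholders, not actual MicroPhenoDB entries. Treating the direction of change as part of the association key is an assumption.

```python
from collections import defaultdict


def collapse_associations(records):
    """Group (microbe, disease, direction, pmid) records by association.

    Returns a dict mapping (microbe, disease, direction) to the sorted list
    of distinct supporting publication identifiers.
    """
    grouped = defaultdict(set)
    for microbe, disease, direction, pmid in records:
        grouped[(microbe, disease, direction)].add(pmid)
    return {key: sorted(pmids) for key, pmids in grouped.items()}


# Placeholder records: three redundant rows describing two associations.
records = [
    ("microbe_A", "disease_X", 1, "pmid_1"),
    ("microbe_A", "disease_X", 1, "pmid_2"),
    ("microbe_B", "disease_X", -1, "pmid_1"),
]
non_redundant = collapse_associations(records)
for (microbe, disease, direction), pmids in non_redundant.items():
    # The publication count per association plays the role of n_p in the score model.
    print(microbe, disease, direction, len(pmids))
```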

Table 2. Data scope and scale in MicroPhenoDB

Figure 3. Data content and distribution in MicroPhenoDB

A. The association data collected from different resources. B. The distribution of different microbe types. C. The number of bacterial species in different phyla. D. The disease distribution in MicroPhenoDB. HMDAD, Human Microbe-Disease Association Database; IDSA, Infectious Diseases Society of America; NCIT, National Cancer Institute Thesaurus.

Table 3. The top ten body sites of disease-associated microbes in MicroPhenoDB

In total, 27,277 unique clade-specific core genes of 685 bacteria and viruses were retrieved from the dataset in MetaPhlAn2 and were annotated with gene functions using InterProScan (Table 2). In addition, 4204 virulence factor genes and 2522 drug resistance genes were included from VFDB [20] and CARD [21], respectively. Only a small percentage of bacteria were annotated with virulence factor information (4.3%, 65/1497) or antimicrobial resistance information (4.4%, 66/1497) (Table 2).

Web interface

The MicroPhenoDB website (http://www.liwzlab.cn/microphenodb) provides user-friendly web interfaces that enable users to search, browse, prioritize, and analyze the microbe-disease association data in the database (Figure 4). The website offers separate search applications for microbes, diseases, and associations, each returning prioritized association data that can be filtered by body site and microbe type. The prioritized microbe-disease associations can be downloaded as a CSV file for further analysis. The hierarchical structures of microbes and diseases are displayed on the ‘Browse’ web page. Information on the increase or decrease of microbial abundance in a disease, the virulence factors and antibiotic resistance of the microbes, and their core genes is also available on the ‘Browse’ web page. In addition, MicroPhenoDB provides web service APIs for programmatic access to the association data, with output in the JSON format. All the association data and the API documentation are available on the website. Users are also encouraged to submit their data of newly published microbe-disease associations. Once checked by our professional curators and approved by the submission review committee, the submitted record will be included in an updated release.

Figure 4. The MicroPhenoDB web interface

Applications of association data

MicroPhenoDB sequence search to explore metagenomics data

In MicroPhenoDB, microbes were connected with diseases through 5677 non-redundant associations and linked to unique clade-specific core genes via 696,934 relationships (Figure 5). Core genes could serve as a hub to connect metagenomic sequencing data to microbes and their associated diseases (Figure 5). A sequence search application was implemented on the MicroPhenoDB website (http://www.liwzlab.cn/microphenodb/#/tool) to allow users to query their metagenomic sequencing data against the MicroPhenoDB sequence datasets through the sequence alignment tools BLAST [36] and Bowtie2 [37] (Figure 5). The application can directly identify the composition of pathogenic microorganisms in metagenomic samples and can suggest potential disease phenotypes that may be caused without running the usual metagenomic sequencing data processing and assembly, which are both time and resource consuming. Functional annotation for microbial core genes by the application includes gene ontology and pathway information. Searching against the sequence datasets of microbial pathogenic factors and drug resistance genes allows identifying homologous genes and proteins related to virulence factors and antibiotic resistance (Figure 5).
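The hub role of core genes described above can be illustrated with a minimal Python sketch. It assumes the alignment step (BLAST or Bowtie2) has already produced read counts per core gene, and it only shows the subsequent lookups from genes to microbes to disease phenotypes. The function name, the `min_hits` parameter, and all identifiers are hypothetical placeholders rather than MicroPhenoDB data or code.

```python
from collections import defaultdict


def diseases_from_gene_hits(gene_hits, gene_to_microbes, microbe_to_diseases,
                            min_hits=1):
    """Follow read hits on core genes to microbes and then to diseases."""
    hits_per_microbe = defaultdict(int)
    for gene, n_reads in gene_hits.items():
        # Genes without a known clade assignment are ignored.
        for microbe in gene_to_microbes.get(gene, ()):
            hits_per_microbe[microbe] += n_reads
    report = {}
    for microbe, n_reads in hits_per_microbe.items():
        if n_reads >= min_hits:
            report[microbe] = {
                "reads": n_reads,
                "diseases": sorted(microbe_to_diseases.get(microbe, ())),
            }
    return report


# Placeholder inputs: read counts per core gene and two lookup tables.
gene_hits = {"gene_1": 12, "gene_2": 3, "gene_9": 5}
gene_to_microbes = {
    "gene_1": ["microbe_A"],
    "gene_2": ["microbe_A"],
    "gene_3": ["microbe_B"],
}
microbe_to_diseases = {"microbe_A": ["disease_Y", "disease_X"]}

report = diseases_from_gene_hits(gene_hits, gene_to_microbes, microbe_to_diseases)
print(report)
```

A hit reported this way is a candidate for follow-up, not proof of pathogenicity.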

Figure 5. The MicroPhenoDB sequence search connects microbes, core genes, and disease phenotypes

BLAST, Basic Local Alignment Search Tool; SQL, Structured Query Language.

To assess the sequence search usability, we used the sequence search application to analyze an existing metagenomic dataset downloaded from the Genome Sequence Archive (accession: PRJCA000880) [38]. The dataset contained metagenomics data of lung biopsy tissues from 20 patients with pulmonary infection [39]. Our results identified pathogenic microbes in 95% (19 of 20) of patients, significantly higher than the 75% identification rate (15 of 20) found through the original metagenomic NGS (mNGS) analysis [39]. In addition, our search identified 37 pathogenic microbes in patients, while the mNGS method only identified 29 (Table S2). Of the 37 microbes, 23 were identical to those by mNGS analysis. It was hard to estimate the false positives of the other 14 microbes, but we found that they may cause infections in patients with underlying diseases such as immunodeficiency. Therefore, this comparison suggested that the MicroPhenoDB sequence search application could screen metagenomic data for effective identification of pathogenic microbes. Due to the large size of metagenomic data and the need for a broadband network, we provide a software package of the search application for users to download and run locally. We also encourage users to upload the microbial abundance information to the online application for further analysis and visualization.

Distinguish clinical phenotypes of SARS-CoV-2 infection from different viral respiratory infections

The single-stranded RNA coronavirus SARS-CoV-2 can infect humans and cause COVID-19 disease [40]. Its structure is similar to those of viruses causing severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) [41]. At present, the diagnosis of SARS-CoV-2 infection is mainly based on clinical phenotypes, chest computed tomography (CT), and nucleic acid testing. Compared with CT and nucleic acid testing, clinical phenotype monitoring has significant advantages, such as a short turnaround time, low cost, and convenience [42]. To distinguish clinical phenotypes of SARS-CoV-2 infection from different viral respiratory infections, we searched MicroPhenoDB and obtained association data that contained 63 disease phenotypes and 14 respiratory tract infection viruses, such as human rhinovirus, parainfluenza virus, respiratory syncytial virus, metapneumovirus, and coronaviruses. The data were then imported into the Cytoscape software [43] for network analysis. The output network (Figure 6) indicated that SARS-CoV-2 shares the clinical phenotype of pneumonia with the majority of other respiratory infection viruses, as well as the clinical phenotypes of dry-cough, headache, fever, myalgia, vomiting, diarrhea, and respiratory disease syndrome (underlined in green) with several influenza viruses and other coronaviruses. Importantly, the network also showed that dyspnea, fatigue, lymphopenia, anorexia, and septic shock (underlined in blue) were common clinical phenotypes of SARS-CoV-2 infection distinguished from other viral respiratory infections [12], [44], [45]. Bear in mind that these phenotypes of SARS-CoV-2 infection might be frequent complications of other diseases and treatments. 
For example, dyspnea is a frequent complication of chronic respiratory diseases [46], lung cancer [47], and hepatopulmonary syndrome [48]; septic shock is a complication of pneumococcal pneumonia, chronic corticosteroid treatment, and current tobacco smoking [49]; fatigue is a complication of multi-type cancers [50], [51] and Parkinson’s disease [52]; lymphopenia is a complication of human immunodeficiency viral infection [53]. However, our results suggest that these common clinical phenotypes could distinguish SARS-CoV-2 infection from infections by SARS-CoV, MERS-CoV, and other respiratory viruses.
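The comparison behind this network analysis amounts to separating the phenotypes a virus shares with others from those linked only to it. A small Python sketch of that set logic follows. The virus and phenotype labels are hypothetical placeholders, not associations taken from MicroPhenoDB, and the function name is illustrative.

```python
def shared_and_distinct(virus_to_phenotypes, target):
    """Split the target virus's phenotypes into shared and distinguishing sets."""
    target_set = set(virus_to_phenotypes[target])
    others = set()
    for virus, phenotypes in virus_to_phenotypes.items():
        if virus != target:
            others.update(phenotypes)
    # Shared: also linked to at least one other virus; distinct: target only.
    return sorted(target_set & others), sorted(target_set - others)


# Placeholder virus-phenotype associations.
virus_to_phenotypes = {
    "virus_A": ["phenotype_1", "phenotype_2", "phenotype_3"],
    "virus_B": ["phenotype_1", "phenotype_4"],
    "virus_C": ["phenotype_1", "phenotype_2"],
}
shared, distinct = shared_and_distinct(virus_to_phenotypes, "virus_A")
print("shared:", shared)
print("distinct:", distinct)
```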

Figure 6. The Cytoscape network illustrates different clinical phenotypes across different viral respiratory infections

The diamonds represent the respiratory infection viruses. The red circles represent the disease phenotypes. A larger circle or diamond indicates more connections to a disease phenotype or a virus. The solid connection lines represent the associations between clinical phenotypes and viruses. Underlines indicate the clinical phenotypes discussed in the main text.

Association network in different body sites

The microbe-disease association data can be downloaded and used for further analysis. To generate a network to explore the reliable connections between the microbial changes and the diseases in multiple body sites, we obtained the association data of body sites such as the vagina, urinary tract, and genitals using the reliable association score thresholds mentioned above (>0.3 and <−0.3). The resulting association data were imported into the Cytoscape software [43] for network analysis. The output network (Figure 7) indicated that the decreasing abundance of Lactobacillus (underlined in red) was related to vaginal inflammation and bacterial vaginosis in the vagina, while the increasing abundance of Chlamydia (underlined in green) resulted in lymphogranuloma venereum in the genitals. Moreover, the network showed that the increasing abundance of Mycoplasma genitalium (underlined in blue) was associated with multiple diseases, which involve genitals, such as pelvic inflammatory disease, nongonococcal urethritis, and nonchlamydial nongonococcal urethritis. Furthermore, the network showed that a microbe abnormality could be associated with diseases involving different body sites. For example, the increasing abundance of Neisseria gonorrhoeae (underlined in purple) was associated with two diseases, each in the genitals and urinary tract. For users to assess the microbial pathogenicity, it is recommended to filter the data by using the association scores and follow the supporting publications for further investigation. Users can follow our step-by-step guidelines on the website (http://www.liwzlab.cn/microphenodb/#/guideline) to perform similar association analyses and generate Cytoscape networks.
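The filtering step described above can be sketched in Python as follows. This assumes the downloaded CSV has columns named `microbe`, `disease`, `body_site`, and `score`. Those column names, the function name, and the rows are hypothetical, so they should be checked against the actual download before use.

```python
import csv
import io


def reliable_edges(csv_text, body_sites, threshold=0.3):
    """Keep associations at the chosen body sites with |score| above threshold."""
    edges = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        score = float(row["score"])
        if row["body_site"] in body_sites and abs(score) > threshold:
            # Positive scores mean increased abundance, negative mean decreased.
            change = "increase" if score > 0 else "decrease"
            edges.append((row["microbe"], row["disease"], change, score))
    return edges


# Placeholder rows standing in for a downloaded association table.
example_csv = """microbe,disease,body_site,score
microbe_A,disease_X,vagina,-0.62
microbe_B,disease_Y,genitals,0.45
microbe_C,disease_Z,vagina,0.10
microbe_D,disease_X,skin,0.80
"""
edges = reliable_edges(example_csv, {"vagina", "genitals", "urinary tract"})
for edge in edges:
    print(edge)
```

The resulting list is a simple edge table (source, target, edge attribute, weight) that could be written out as a delimited file and imported into Cytoscape.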

Figure 7. The Cytoscape network illustrates the associations between clinical phenotypes and microbes at different body sites

The diamonds represent clinical phenotypes resulting from a microbial abnormality at different body sites. The red circles represent the microbes. A larger circle or diamond indicates more connections to a clinical phenotype or a microbe. The solid connection lines represent the associations between diseases and microbes with an increase in microbial abundance, and the dashed connection lines represent the associations between diseases and microbes with a decrease in microbial abundance. Underlines indicate the microbes discussed in the main text.

Concluding remarks

Microbes play important roles in human health and disease. The curation and analysis of microbe-disease association data are essential for expediting translational research and application. In this study, we developed the MicroPhenoDB database by manually curating and consistently integrating microbe-disease association data. As far as we are aware, MicroPhenoDB is the first database platform to detail the relationships between pathogenic microbes, core genes, and disease phenotypes. In terms of data coverage, scoring models, and web applications, MicroPhenoDB outperformed data resources that contain similar association data (Table 4). For example, the numbers of associations, microbes, disease phenotypes, and supporting evidence in MicroPhenoDB were approximately 11.1-, 6.1-, 13.9-, and 18.9-fold those in HMDAD, respectively. Compared with both HMDAD and Disbiome, MicroPhenoDB refined the confidence scoring model using additional evidential metrics with different weights; it standardized the association annotations by manual curation and included pathogenic data of virulence factors, microbial core genes, and antibiotic resistance genes. Moreover, MicroPhenoDB implemented web applications and APIs for pathogenic microbe identification in metagenomic data.

Table 4. Data content and web applications of MicroPhenoDB compared with HMDAD and Disbiome

Note: HMDAD, Human Microbe-Disease Association Database; API, Application Programming Interface.

In MicroPhenoDB, many associations with confident scores came from our manual curation of the up-to-date clinical guidelines supported by IDSA and ASM. MicroPhenoDB assigned higher weight values to the associations derived from the guidelines and lower weight values to the associations from other literature data and databases. The original model for scoring confidence of the disease-microbe associations in HMDAD was based on a single literature evidence. Our MicroPhenoDB score model rated different supporting evidence according to the credibility of related sources and provided a score to evaluate a disease-microbe association.

By integrating unique, clade-specific microbial core genes and using the data from MetaPhlAn2, the MicroPhenoDB sequence search application enables rapid identification of existing pathogenic microorganisms in metagenomic samples without running the usual sequencing data processing and assembly. However, the resulting associations from the sequence search do not guarantee microbial pathogenicity but provide clues for further investigation. The annotated core genes are also limited in size and cannot represent all microbial species. To consistently analyze the important functions of microbes, other data or tools are also recommended, such as UniRef clusters [54], MetaCyc [55], HUMAnN2 [56], and pan-genomic data.

To serve the research community, we will update the database every six months and constantly improve it with more features and functionalities. As a novel and unique resource, MicroPhenoDB connects pathogenic microbes, microbial core genes, and disease phenotypes; therefore, it can be used in metagenomic data analyses and assist studies in decoding microbes associated with human diseases.

Data availability

To access the association data, the online applications, and the software package, please visit http://www.liwzlab.cn/microphenodb/#/download.

CRediT author statement

Guocai Yao: Methodology, Software, Visualization, Writing - original draft. Wenliang Zhang: Methodology, Data curation, Writing - original draft. Minglei Yang: Visualization, Software. Huan Yang: Validation. Jianbo Wang: Formal analysis. Haiyue Zhang: Investigation, Writing - review & editing. Lai Wei: Resources, Writing - review & editing. Zhi Xie: Resources, Writing - review & editing. Weizhong Li: Conceptualization, Resources, Methodology, Supervision, Project administration, Writing - review & editing. All authors read and approved the final manuscript.

Competing interests

The authors have declared no competing interests.

Acknowledgments

This work was supported by the National Key R&D Program of China (Grant Nos. 2016YFC0901604 and 2018YFC0910401) and the National Natural Science Foundation of China (Grant No. 31771478) to WL.

Handled by Andreas Keller

Footnotes

Peer review under responsibility of Beijing Institute of Genomics, Chinese Academy of Sciences and Genetics Society of China.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.gpb.2020.11.001.

Supplementary material

The following are the Supplementary data to this article:

Supplementary Figure S1

The distribution of numbers of supporting publications. The blue histogram represents the frequency of the number of supporting publications.

Supplementary Table S1

The versions or releases of databases and tools used in the construction of MicroPhenoDB

mmc1.docx (17.4KB, docx)

Supplementary Table S2

The analysis result of the MicroPhenoDB sequence search on an existing metagenomic dataset (GSA: PRJCA000880)

mmc2.docx (18.8KB, docx)

References

  • 1. Sender R., Fuchs S., Milo R. Are we really vastly outnumbered? Revisiting the ratio of bacterial to host cells in humans. Cell. 2016;164:337–340. doi: 10.1016/j.cell.2016.01.013.
  • 2. Lloyd-Price J., Abu-Ali G., Huttenhower C. The healthy human microbiome. Genome Med. 2016;8:51. doi: 10.1186/s13073-016-0307-y.
  • 3. Ghaisas S., Maher J., Kanthasamy A. Gut microbiome in health and disease: linking the microbiome–gut–brain axis and environmental factors in the pathogenesis of systemic and neurodegenerative diseases. Pharmacol Therapeut. 2016;158:52–62. doi: 10.1016/j.pharmthera.2015.11.012.
  • 4. Cho I., Blaser M.J. The human microbiome: at the interface of health and disease. Nat Rev Genet. 2012;13:260–270. doi: 10.1038/nrg3182.
  • 5. Kundu P., Blacher E., Elinav E., Pettersson S. Our gut microbiome: the evolving inner self. Cell. 2017;171:1481–1493. doi: 10.1016/j.cell.2017.11.024.
  • 6. Franzosa E.A., Sirota-Madi A., Avila-Pacheco J., Fornelos N., Haiser H.J., Reinker S. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nat Microbiol. 2019;4:293–305. doi: 10.1038/s41564-018-0306-4.
  • 7. Jackson M.A., Verdi S., Maxan M., Shin C.M., Zierer J., Bowyer R.C.E. Gut microbiota associations with common diseases and prescription medications in a population-based cohort. Nat Commun. 2018;9:2655. doi: 10.1038/s41467-018-05184-7.
  • 8. Schmidt T.S.B., Raes J., Bork P. The human gut microbiome: from association to modulation. Cell. 2018;172:1198–1215. doi: 10.1016/j.cell.2018.02.044.
  • 9. Labus J.S., Osadchiy V., Hsiao E.Y., Tap J., Derrien M., Gupta A. Evidence for an association of gut microbial Clostridia with brain functional connectivity and gastrointestinal sensorimotor function in patients with irritable bowel syndrome, based on tripartite network analysis. Microbiome. 2019;7:45. doi: 10.1186/s40168-019-0656-z.
  • 10. Strati F., Cavalieri D., Albanese D., De Felice C., Donati C., Hayek J. Altered gut microbiota in Rett syndrome. Microbiome. 2016;4:41. doi: 10.1186/s40168-016-0185-y.
  • 11. Chang H.W., Yan D., Singh R., Liu J., Lu X., Ucmak D. Alteration of the cutaneous microbiome in psoriasis and potential role in Th17 polarization. Microbiome. 2018;6:154. doi: 10.1186/s40168-018-0533-1.
  • 12. Huang C., Wang Y., Li X., Ren L., Zhao J., Hu Y. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet. 2020;395:497–506. doi: 10.1016/S0140-6736(20)30183-5.
  • 13. Zhernakova A., Kurilshikov A., Bonder M.J., Tigchelaar E.F., Schirmer M., Vatanen T. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science. 2016;352:565–569. doi: 10.1126/science.aad3369.
  • 14. Falony G., Joossens M., Vieira-Silva S., Wang J., Darzi Y., Faust K. Population-level analysis of gut microbiome variation. Science. 2016;352:560–564. doi: 10.1126/science.aad3503.
  • 15. Kinross J.M., Darzi A.W., Nicholson J.K. Gut microbiome-host interactions in health and disease. Genome Med. 2011;3:14. doi: 10.1186/gm228.
  • 16. Almeida A., Mitchell A.L., Boland M., Forster S.C., Gloor G.B., Tarkowska A. A new genomic blueprint of the human gut microbiota. Nature. 2019;568:499–504. doi: 10.1038/s41586-019-0965-1.
  • 17. Maffert P., Reverchon S., Nasser W., Rozand C., Abaibou H. New nucleic acid testing devices to diagnose infectious diseases in resource-limited settings. Eur J Clin Microbiol. 2017;36:1717–1731. doi: 10.1007/s10096-017-3013-9.
  • 18. Ma W., Zhang L., Zeng P., Huang C., Li J., Geng B. An analysis of human microbe–disease associations. Brief Bioinform. 2017;18:85–97. doi: 10.1093/bib/bbw005.
  • 19. Janssens Y., Nielandt J., Bronselaer A., Debunne N., Verbeke F., Wynendaele E. Disbiome database: linking the microbiome to disease. BMC Microbiol. 2018;18:50. doi: 10.1186/s12866-018-1197-5.
  • 20. Chen L., Zheng D., Liu B., Yang J., Jin Q. VFDB 2016: Hierarchical and refined dataset for big data analysis—10 years on. Nucleic Acids Res. 2016;44:D694–D697. doi: 10.1093/nar/gkv1239.
  • 21. Jia B., Raphenya A.R., Alcock B., Waglechner N., Guo P., Tsang K.K. CARD 2017: Expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 2017;45:D566–D573. doi: 10.1093/nar/gkw1004.
  • 22. Miller J.M., Binnicker M.J., Campbell S., Carroll K.C., Chapin K.C., Gilligan P.H. A guide to utilization of the microbiology laboratory for diagnosis of infectious diseases: 2018 update by the Infectious Diseases Society of America and the American Society for Microbiology. Clin Infect Dis. 2018;67:813–816. doi: 10.1093/cid/ciy584.
  • 23. Sioutos N., Coronado S.D., Haber M.W., Hartel F.W., Shaiu W.L., Wright L.W. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform. 2007;40:30–43. doi: 10.1016/j.jbi.2006.02.013.
  • 24. Truong D.T., Franzosa E.A., Tickle T.L., Scholz M., Weingart G., Pasolli E. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods. 2015;12:902–903. doi: 10.1038/nmeth.3589.
  • 25. Pinero J., Queralt-Rosinach N., Bravo A., Deu-Pons J., Bauer-Mehren A., Baron M. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database (Oxford). 2015;2015:bav028. doi: 10.1093/database/bav028.
  • 26. Sayers E.W., Agarwala R., Bolton E.E., Brister J.R., Canese K., Clark K. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2019;47:D23–D28. doi: 10.1093/nar/gky1069.
  • 27. The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47:D506–D515. doi: 10.1093/nar/gky1049.
  • 28. Segata N., Waldron L., Ballarini A., Narasimhan V., Jousson O., Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012;9:811–814. doi: 10.1038/nmeth.2066.
  • 29. Mitchell A.L., Attwood T.K., Babbitt P.C., Blum M., Bork P., Bridge A. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 2019;47:D351–D360. doi: 10.1093/nar/gky1100.
  • 30. Malone J., Holloway E., Adamusiak T., Kapushesky M., Zheng J., Kolesnikov N. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics. 2010;26:1112–1118. doi: 10.1093/bioinformatics/btq099.
  • 31. MacArthur J., Bowler E., Cerezo M., Gil L., Hall P., Hastings E. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 2017;45:D896–D901. doi: 10.1093/nar/gkw1133.
  • 32. Perez-Riverol Y., Ternent T., Koch M., Barsnes H., Vrousgou O., Jupp S. OLS client and OLS dialog: open source tools to annotate public omics datasets. Proteomics. 2017;17:1700244. doi: 10.1002/pmic.201700244.
  • 33. Köhler S., Carmody L., Vasilevsky N., Jacobsen J.O.B., Danis D., Gourdine J. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res. 2019;47:D1018–D1027. doi: 10.1093/nar/gky1105.
  • 34. Mungall C.J., McMurry J.A., Köhler S., Balhoff J.P., Borromeo C., Brush M. The Monarch Initiative: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2017;45:D712–D722. doi: 10.1093/nar/gkw1128.
  • 35. Li W., Cowley A., Uludag M., Gur T., McWilliam H., Squizzato S. The EMBL-EBI bioinformatics web and programmatic tools framework. Nucleic Acids Res. 2015;43:W580–W584. doi: 10.1093/nar/gkv279.
  • 36. Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi: 10.1186/1471-2105-10-421.
  • 37. Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
  • 38. Zhang Z., Zhao W., Xiao J., Bao Y., He S., Zhang G. Database resources of the national genomics data center in 2020. Nucleic Acids Res. 2020;48:D24–D33. doi: 10.1093/nar/gkz913.
  • 39. Li H., Gao H., Meng H., Wang Q., Li S., Chen H. Detection of pulmonary infectious pathogens from lung biopsy tissues by metagenomic Next-Generation sequencing. Front Cell Infect Microbiol. 2018;8:205. doi: 10.3389/fcimb.2018.00205.
  • 40. Lai C.C., Shih T.P., Ko W.C., Tang H.J., Hsueh P.R. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): the epidemic and the challenges. Int J Antimicrob Agents. 2020;55:105924. doi: 10.1016/j.ijantimicag.2020.105924.
  • 41. Lu R., Zhao X., Li J., Niu P., Yang B., Wu H. Genomic characterisation and epidemiology of 2019 novel coronavirus: Implications for virus origins and receptor binding. Lancet. 2020;395:565–574. doi: 10.1016/S0140-6736(20)30251-8.
  • 42. Guan W.J., Zhong N.S. Clinical characteristics of covid-19 in China. Reply. N Engl J Med. 2020;382:1861–1862. doi: 10.1056/NEJMc2005203.
  • 43. Shannon P. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–2504. doi: 10.1101/gr.1239303.
  • 44. Chen N., Zhou M., Dong X., Qu J., Gong F., Han Y. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study. Lancet. 2020;395:507–513. doi: 10.1016/S0140-6736(20)30211-7.
  • 45. Lake M.A. What we know so far: COVID-19 current clinical knowledge and research. Clin Med (Lond). 2020;20:124–127. doi: 10.7861/clinmed.2019-coron.
  • 46. Dubé B.P., Vermeulen F., Laveneziana P. Exertional dyspnoea in chronic respiratory diseases: from physiology to clinical application. Arch Bronconeumol. 2017;53:62–70. doi: 10.1016/j.arbres.2016.09.005.
  • 47. Henshall C.L., Allin L., Aveyard H. A systematic review and narrative synthesis to explore the effectiveness of exercise-based interventions in improving fatigue, dyspnea, and depression in lung cancer survivors. Cancer Nurs. 2019;42:295–306. doi: 10.1097/NCC.0000000000000605.
  • 48. Gorgy A.I., Jonassaint N.L., Stanley S.E., Koteish A., DeZern A.E., Walter J.E. Hepatopulmonary syndrome is a frequent cause of dyspnea in the short telomere disorders. Chest. 2015;148:1019–1026. doi: 10.1378/chest.15-0825.
  • 49. Garcia-Vidal C., Ardanuy C., Tubau F., Viasus D., Dorca J., Linares J. Pneumococcal pneumonia presenting with septic shock: host- and pathogen-related factors and outcomes. Thorax. 2010;65:77–81. doi: 10.1136/thx.2009.123612.
  • 50. Baguley B.J., Skinner T.L., Jenkins D.G., Wright O.R.L. Mediterranean-style dietary pattern improves cancer-related fatigue and quality of life in men with prostate cancer treated with androgen deprivation therapy: a pilot randomised control trial. Clin Nutr. 2020;40:245–254. doi: 10.1016/j.clnu.2020.05.016.
  • 51. Desai J., Deva S., Lee J.S., Lin C., Yen C., Chao Y. Phase IA/IB study of single-agent tislelizumab, an investigational anti-PD-1 antibody, in solid tumors. J Immunother Cancer. 2020;8:e000453. doi: 10.1136/jitc-2019-000453.
  • 52. Schrag A., Hommel A.L.A.J., Lorenzl S., Meissner W.G., Odin P., Coelho M. The late stage of Parkinson's - results of a large multinational study on motor and non-motor complications. Parkinsonism Relat Disord. 2020;75:91–96. doi: 10.1016/j.parkreldis.2020.05.016.
  • 53. Pothlichet J., Rose T., Bugault F., Jeammet L., Meola A., Haouz A. PLA2G1B is involved in CD4 anergy and CD4 lymphopenia in HIV-infected patients. J Clin Invest. 2020;130:2872–2887. doi: 10.1172/JCI131842.
  • 54. Suzek B.E., Wang Y., Huang H., McGarvey P.B., Wu C.H. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31:926–932. doi: 10.1093/bioinformatics/btu739.
  • 55. Caspi R., Billington R., Ferrer L., Foerster H., Fulcher C.A., Keseler I.M. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res. 2016;44:D471–D480. doi: 10.1093/nar/gkv1164.
  • 56. Franzosa E.A., McIver L.J., Rahnavard G., Thompson L.R., Schirmer M., Weingart G. Species-level functional profiling of metagenomes and metatranscriptomes. Nat Methods. 2018;15:962–968. doi: 10.1038/s41592-018-0176-y.


Articles from Genomics, Proteomics & Bioinformatics are provided here courtesy of Oxford University Press
