A Systems Approach to Refine Disease Taxonomy by Integrating Phenotypic and Molecular Networks

Xuezhong Zhou; Lei Lei; Jun Liu; Arda Halu; Yingying Zhang; Bing Li; Zhili Guo; Guangming Liu; Changkai Sun; Joseph Loscalzo; Amitabh Sharma; Zhong Wang

doi:10.1016/j.ebiom.2018.04.002

2018 Apr 6;31:79–91. doi: 10.1016/j.ebiom.2018.04.002

Xuezhong Zhou ^a,¹, Lei Lei ^b,¹, Jun Liu ^c,¹, Arda Halu ^d,¹, Yingying Zhang ^c,^f, Bing Li ^b,^c, Zhili Guo ^c,^g, Guangming Liu ^a, Changkai Sun ^h,^i,^j,^k, Joseph Loscalzo ^e, Amitabh Sharma ^d,^e,^⁎, Zhong Wang ^c,^⁎⁎

^aSchool of Computer and Information Technology and Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, No.3 Shangyuancun, Haidian District, Beijing 100044, China

^bInstitute of Information on Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, No.16, Nanxiaojie, Dongzhimennei, Dongcheng District, Beijing 100700, China

^cInstitute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, No.16, Nanxiaojie, Dongzhimennei, Dongcheng District, Beijing 100700, China

^dChanning Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, 181 Longwood Avenue, Boston, MA 02115, USA

^eDepartment of Medicine, Brigham and Women's Hospital, Harvard Medical School, 75 Francis Street, Boston, MA 02115, USA

^fDongzhimen Hospital, Beijing University of Chinese Medicine, No.5 Haiyuncang, Dongcheng District, Beijing 100700, China

^gJiaxing Traditional Chinese Medicine Affiliated Hospital of Zhejiang Chinese Medical University, No. 1501, Zhongshan East Road, Jiaxing, Zhejiang 314000, China

^hSchool of Biomedical Engineering, Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, 116024, China

^iResearch Center for the Control Engineering of Translational Precision Medicine, Dalian University of Technology, Dalian 116024, China

^jState Key Laboratory of Fine Chemicals, Dalian R&D Center for Stem Cell and Tissue Engineering, Dalian University of Technology, Dalian 116024, China

^kLiaoning Provincial Key Laboratory of Cerebral Diseases, Institute for Brain Disorders, Dalian Medical University, Dalian 116044, China

^⁎Correspondence to: A. Sharma, Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, 181 Longwood Avenue, Boston, MA 02115, USA. amitabh.sharma@channing.harvard.edu

^⁎⁎Correspondence to: Z. Wang, Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, No.16, Nanxiaojie, Dongzhimennei, Dongcheng District, Beijing 100700, China. wangzh@mail.cintcm.ac.cn

¹These authors contributed equally to this work.

PMCID: PMC6013753 PMID: 29669699

Abstract

The International Classification of Diseases (ICD) relies on clinical features and lags behind the current understanding of the molecular specificity of disease pathobiology, necessitating approaches that incorporate growing biomedical data for classifying diseases to meet the needs of precision medicine. Our analysis revealed that the heterogeneous molecular diversity of disease chapters and the blurred boundary between disease categories in ICD should be further investigated. Here, we propose a new classification of diseases (NCD) by developing an algorithm that predicts the additional categories of a disease by integrating multiple networks consisting of disease phenotypes and their molecular profiles. With statistical validations from phenotype-genotype associations and interactome networks, we demonstrate that NCD improves disease specificity owing to its overlapping categories and polyhierarchical structure. Furthermore, NCD captures the molecular diversity of diseases and defines clearer boundaries in terms of both phenotypic similarity and molecular associations, establishing a rational strategy to reform disease taxonomy.

Keywords: Disease taxonomy, Network medicine, Disease phenotypes, Molecular profiles, Precision medicine

Highlights

• The International Classification of Diseases (ICD) lags behind the current molecular characteristics of disease.
• We quantified the limitations (specificity and blurred boundary) of ICD with integrated phenotypic and molecular profiles.
• An integrative disease network integrating phenotypic and genotypic profiles proposes a refined disease category framework.

Disease taxonomy is one of the foundations of medical science and healthcare solutions. The most widely used disease taxonomy in clinical settings is the International Classification of Diseases (ICD), a system established >100 years ago and maintained by the World Health Organization to track disease incidence. It is well recognized that ICD, which is based on clinical observations, largely lags behind the molecular achievements of this medical big data era. We quantified the limitations of ICD using integrated phenotypic and molecular profiles and proposed a refined disease taxonomy with possible applications for precision medicine.

1. Introduction

Disease taxonomy plays an important role in defining the diagnosis, treatment, and mechanisms of human diseases. The principle of the current clinical disease taxonomies, in particular the International Classification of Diseases (ICD), goes back to the work of William Farr in the nineteenth century and is primarily derived from the differentiation of clinical features (e.g. symptoms and micro-examination of diseased tissues and cells) (Council et al., 2011). Despite its extensive clinical use, this classification system lacks the depth required for precision medicine with the limitations of its rigid hierarchical structure and, moreover, it does not exploit the rapidly expanding molecular insights of disease phenotypes. For example, many diseases (e.g. cancer, chronic inflammatory diseases) in the current disease taxonomies have either high genetic heterogeneity (Bianchini et al., 2016; McClellan and King, 2010) or manifestation diversity (Arostegui et al., 2014; Jeste and Geschwind, 2014; Mannino, 2002), which give little basis for tailoring treatment to a patient's pathophysiology. Furthermore, disease comorbidities (Hu et al., 2016; Lee et al., 2008; Hidalgo et al., 2009), temporal disease trajectories (Jensen et al., 2014) in clinical populations, various molecular relationships between disease-associated cellular components and their connections in the interactome (Blair et al., 2013; Goh et al., 2007; Barabasi et al., 2011; Rzhetsky et al., 2007; Zhou et al., 2014), and many successful drug repurposing cases (Li and Jones, 2012; Chong and Sullivan Jr., 2007; Ashburn and Thor, 2004; Wu et al., 2016; Evans et al., 2005) altogether demonstrate the vague boundary between different diseases in current disease taxonomies. Moreover, the deep understanding of diseases based on the advances in disease biology, bioinformatics, and multi-omics data necessitates the reclassification of disease taxonomy (Mirnezami et al., 2012).

In the past decade, efforts to reclassify diseases based on molecular insights have increased with studies related to molecular-based disease subtyping in different disease conditions, such as acute leukemias (Golub et al., 1999; Alizadeh et al., 2000), colorectal cancer (Dienstmann et al., 2017), oesophageal carcinoma (Cancer Genome Atlas Research et al., 2017), pancreatic cancer (Bailey et al., 2016), cancer metastasis (Chuang et al., 2007), neurodegenerative disorders (Mann et al., 2000), autoimmunity disorders (Ahmad et al., 2003), multiple cancer types across tissues of origin (Hoadley et al., 2014), and a network-based stratification method for cancer subtyping (Hofree et al., 2013). Further insights will arise from integrating all types of biomedical data with a single framework to exploit disease-disease relationships. Data integration methods that utilize multiple types of data, including ontological and omics data, have been used to classify and refine disease relationships (Gligorijevic and Przulj, 2015; Menche et al., 2015; Gligorijevic et al., 2016). Despite these efforts, the development of a molecular-based disease taxonomy that links molecular networks and pathophenotypes still remains challenging (Menche et al., 2015; Hofmann-Apitius et al., 2015; Jameson and Longo, 2015).

Here, we aim to refine a widely used clinical disease classification scheme, the ICD. To achieve this, we first quantify the category similarity among the ICD chapters using ontology-based similarity measures and investigate the molecular connections of disease pairs in the same ICD chapters. Furthermore, we seek the correlation between category and molecular similarity, and check for the heterogeneity of molecular specificity and correlated boundary between categories in ICD taxonomy. Finally, we construct a new classification of diseases (NCD) with overlapping structures. The aim is to provide clear boundaries between distinct diseases belonging to different categories using a new disease classification scheme (Fig. 1 & Fig. S3).

Fig. 1. Overview of the new disease taxonomy construction and validation. a. Similarity calculation between the disease pairs in ICD taxonomy, including the calculation of 1) category similarity; 2) phenotype similarity (based on ICD-MeSH term mapping); and 3) molecular profile similarities (based on ICD-UMLS term mapping) of disease pairs in ICD. b. Module or community annotations of the disease association network by chapters in ICD or NCD. We generate a disease association network, in which nodes represent diseases and the link weights represent their corresponding phenotype or molecular profile similarities. The module annotations of the disease network correspond to ICD chapters or NCD categories. c. Construction of the integrated disease network (IDN) and generation of NCD. The links of IDN are fused from the multiple similarities (e.g. phenotype similarity and shared gene similarity). Based on IDN, NCD is generated by community detection algorithms with overlapping disease members. d. Quality evaluation and validation of ICD and NCD. The molecular specificity (or inverse molecular diversity) and network modularity are used for evaluation and comparison of the quality of the two disease taxonomies. Furthermore, we validate the robustness of NCD with two independent phenotype-genotype association datasets, namely GWAS and PheWAS.

2. Materials and Methods

2.1. Basic Dataset Compilation

In this work, we undertook extensive curation to generate the related data sources (for details, see Supplementary Materials (SM) Section 1). We obtained the updated text version of ICD-9-CM (2011) and extracted the list of ICD codes with their hierarchical structures. While we recognize the improvements of the currently used ICD-10 over ICD-9, we chose to use ICD-9-CM because the adoption of ICD-10 has been slow in the United States (Butler, 2014) and because ICD-9-CM was still widely used at the time of the data collection for this paper (Blair et al., 2013; Wang et al., 2017). Furthermore, although ICD-10 has more codes than ICD-9-CM, the structure is kept almost the same. We obtained the high-quality phenotype-genotype (disease-gene) associations from the DiseaseConnect database (2015 version) (Liu et al., 2014), leaving out the less reliable text mining entries and focusing only on the genome-wide association study (GWAS), Online Mendelian Inheritance in Man (OMIM) and differential expression evidence types, and manually mapped those diseases from Unified Medical Language System (UMLS) codes to ICD and MeSH codes (SM Section 1.6).

To calculate the molecular network and phenotype characteristics related to disease phenotypes, we filtered a high-quality subset of human protein-protein interactions (PPIs) from STRING V9.1 (Franceschini et al., 2013) using a score threshold of ≥ 700. We also adopted a well-established disease-phenotype (disease-symptom) association dataset (i.e. the disease network with symptom similarity, HSDN) (Zhou et al., 2014) derived from PubMed bibliographic records, and the Gene Ontology annotations from the NCBI Gene database. To ensure that the results are not biased by computational predictions in the STRING database, we replicated the classification pipeline with manually curated PPI networks (Menche et al., 2015), which rely only on physical protein interactions with experimental support, and found that the results are robust (SM Section 8.3).

In addition, to validate the robustness of our results with independent data sources, we filtered the GWAS and Phenome-Wide Association Studies (PheWAS) data from the University of California Santa Cruz (UCSC) Genome Browser (Tyner et al., 2017) and the PheWAS catalog (Denny et al., 2010), respectively, and performed an additional ICD mapping task to prepare the data for validation analysis. The GWAS evidence in the DiseaseConnect database, which we used to build the disease associations, comes from the National Human Genome Research Institute (NHGRI) GWAS catalog (Welter et al., 2014), whereas for validation we used the GWAS data from the UCSC Genome Browser. We ensured that the GWAS data used to build the networks and the data used to validate them have a very small overlap (SM Section 8).

2.2. Evaluating the Quality of ICD Disease Taxonomy

Here, we systematically evaluated the consistency of disease categories in ICD taxonomy from both clinical phenotype and molecular profiles (details are in SM Section 2). We investigated the quality of ICD disease taxonomy by evaluating the correlation between the closeness of disease pairs in the disease taxonomy and the underlying molecular connections (and symptom phenotype similarities) between disease pairs. For example, if the two diseases of a pair have close positions (e.g. share a low-level common parent disease) in the disease taxonomy, then we would expect them to have common genes, shared protein-protein interactions, or similar phenotypes. We calculated the category similarity between disease pairs using a widely used semantic similarity measure (i.e. the Lin measure using information content) (Lin, 1998; Pesquita et al., 2009) to represent the closeness of disease pairs located in the ICD taxonomy. Information theoretic measures such as information content have been used in the context of ICD-9-CM previously (Dahlem et al., 2015). The category similarity measure takes as input two concepts c1 and c2 and outputs a numeric measure of similarity. If two ICD codes have a very specific common parent code in the taxonomic tree structure, then the category similarity would be ~ 1.
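As an illustration of the Lin measure described above, the following minimal Python sketch computes the similarity 2·IC(lcs) / (IC(c1) + IC(c2)) on a small fragment of a code tree. The tree fragment, the function names, and the intrinsic definition of information content (based on subtree size) are assumptions made for this example; the exact information content definition used in the study is given in SM Section 2.1 and may differ.

```python
import math

# Toy rooted tree, child -> parent. Illustrative fragment only, not the full ICD-9-CM.
PARENT = {
    "240-279": "ROOT",
    "240-246": "240-279",
    "249-259": "240-279",
    "240": "240-246",
    "250": "249-259",
    "240.0": "240",
    "250.00": "250",
    "250.01": "250",
}
NODES = set(PARENT) | set(PARENT.values())


def ancestors(code):
    """Return the path from `code` up to the root, starting with `code` itself."""
    path = [code]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def information_content(code):
    """Intrinsic IC (assumption): -log of the fraction of nodes in the subtree of `code`."""
    subtree = sum(1 for n in NODES if code in ancestors(n))
    return -math.log(subtree / len(NODES))


def lin_similarity(c1, c2):
    """Lin similarity based on the lowest common ancestor of c1 and c2."""
    anc2 = set(ancestors(c2))
    lcs = next(a for a in ancestors(c1) if a in anc2)
    denom = information_content(c1) + information_content(c2)
    if denom == 0:  # both concepts are the root
        return 1.0
    return 2 * information_content(lcs) / denom
```

In this toy tree the two sibling codes under 250 score higher than a pair whose only common parent is the chapter-level node. The absolute values depend on the tree fragment and on the information content definition, so they are not expected to match the values reported in the Results.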

The molecular and phenotype similarities between disease pairs are calculated by evaluating the shared genes and their GO annotations, molecular network similarities, and shared phenotypes with established similarity measures (e.g. the cosine measure and the Jaccard measure). In particular, to obtain a more robust representation of the molecular network profiles of diseases, we partitioned the STRING network into 314 topological modules (Data S2) and used them to construct the relevant module vectors of diseases using the odds ratio (OR) as the weighting measure. For example, an ICD disease code would be represented by a 314-dimensional vector whose i-th entry has the value w_ij if any of its related genes is in module i, and 0 otherwise. Suppose we have N genes in total and m_i genes in module i. For a disease d_j with n_j genes, of which k_ij overlap with module i, we calculated the value of w_ij with the following equation:

$$w_{ij} = \frac{k_{ij} / (n_{j} - k_{ij})}{(m_{i} - k_{ij}) / (N - n_{j} - m_{i} + k_{ij})} \tag{1}$$

We used the cosine measure to calculate the molecular module similarity between disease pairs after the molecular module vector (i.e. OR weighting) of each disease was constructed.
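The odds ratio weighting of Eq. (1) and the cosine comparison of module vectors can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names are hypothetical, and the source does not state how contingency tables with empty cells are handled, so the sketch simply lets such cases raise an error.

```python
import math


def odds_ratio_weight(k, n, m, N):
    """Eq. (1): k shared genes, n disease genes, m module genes, N genes in total.

    Returns 0.0 when the disease has no gene in the module. Tables with an empty
    cell (n == k, m == k, or N - n - m + k == 0) raise ZeroDivisionError, because
    their treatment is not specified in the text.
    """
    if k == 0:
        return 0.0
    return (k / (n - k)) / ((m - k) / (N - n - m + k))


def cosine(u, v):
    """Cosine similarity of two equal-length module vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For example, a disease with 20 genes, 5 of which fall into a 50-gene module in a network of 1000 genes, gets the weight (5/15) / (45/935) ≈ 6.93 for that module.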

Furthermore, as ICD taxonomy proposes a framework for organizing diseases, it is expected that there should be more overlapping molecular interactions or phenotype relationships between diseases of the same chapter than between those of different chapters. Thus, we assumed that when we treat the ICD chapters as module annotations, such that all the diseases in one chapter are considered members of the same module, the modularity of the disease association networks, i.e. the disease networks with molecular or phenotype associations as links, would reflect the quality of ICD disease taxonomy. This means that the higher the modularity, the higher the quality of the ICD chapters as a disease category framework.

To evaluate the quality of community structures in complex networks, the modularity measure (Newman, 2006) was proposed to quantify the extent to which the connections within communities exceed the random expectation in the whole network. Let a network have m edges and let A_vw be an element of the adjacency matrix of the network. Suppose the vertices in the network are divided into communities such that vertex v belongs to community c_v. Then the modularity Q is defined as:

$$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_{v} k_{w}}{2m} \right] \delta(c_{v}, c_{w}) \tag{2}$$

where the function δ(i, j) is 1 if i = j and 0 otherwise, and k_v is the degree of vertex v. The value of the modularity lies in the range [−1/2,1]. It is positive if the number of edges within groups exceeds the number expected on the basis of chance. Otherwise, it would be negative. We use it to measure the consistency of disease categories (ICD chapter or NCD) as an annotation of topological module (or community) structures within disease networks. We hypothesize that if a disease category framework captures the molecular or phenotypic profiles of diseases, then there would be more links existing between the disease members in a category than random expectation.
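To make Eq. (2) concrete, the sketch below evaluates Q for an unweighted, undirected network whose nodes carry a single category label each (for example, an ICD chapter). The function name and the toy inputs are illustrative assumptions; the disease networks in the study are weighted, which this simplified version does not cover.

```python
def modularity(edges, community):
    """Eq. (2) for an undirected, unweighted graph without self-loops.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Sum of A_vw * delta over ordered pairs: each intra-community edge counts twice.
    intra = 2 * sum(1 for u, v in edges if community[u] == community[v])
    # Sum of k_v * k_w * delta over ordered pairs equals the sum over communities
    # of the squared total degree of that community.
    totals = {}
    for node, k in degree.items():
        totals[community[node]] = totals.get(community[node], 0) + k
    expected = sum(t * t for t in totals.values()) / (2 * m)
    return (intra - expected) / (2 * m)
```

For two triangles joined by a single edge, labelling each triangle as its own community gives Q = 5/14 ≈ 0.357, whereas putting all six nodes into one community gives Q = 0.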

2.3. Measuring the Disease Specificity

As a quantification of the molecular diversity (or the inverse specificity) of a disease, we calculated the maximum betweenness of disease-related genes in the PPI network (Data S3). Betweenness (Freeman, 1977) is a widely used centrality measure to quantify how many shortest paths run through a given node. In particular, bridging nodes that connect disparate components of the network often have a high betweenness. The betweenness centrality of a node v is given by:

$$bc(v) = \sum_{s \neq v \neq t} \frac{n_{st}(v)}{g_{st}} \tag{3}$$

where n_st(v) denotes the number of shortest paths from s to t that pass through v and g_st is the total number of shortest paths from s to t. We adopt the convention that $\frac{n_{st} (v)}{g_{st}} = 0$ if both n_st(v) and g_st are zero. We assume that the molecular diversity of a disease is largely determined by the related genes with maximum betweenness. For example, to quantify the molecular diversity (in terms of maximum betweenness) of Alzheimer's disease (AD), we calculated the betweenness values for all the AD-related genes, such as APP, APOE, TNF and NOS3. We then took the molecular diversity of AD to be 8.44e-3, since APP has the maximum betweenness of 8.44e-3 among those genes (see Fig. S5a). This kind of measurement has been used successfully in a previous study (Zhou et al., 2014) to evaluate the diversity of diseases, which indicated that the diversity of disease manifestations has a strong positive correlation with the molecular diversity of diseases. For a disease taxonomy of good quality, we would expect its lowest-level diseases (the leaf nodes in the tree-structured disease taxonomy) to have similar molecular diversities.
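The maximum-betweenness measure can be sketched with Brandes' algorithm on an unweighted network. The code below is a simplified illustration with hypothetical function names. It sums over ordered (s, t) pairs as in Eq. (3) and applies no normalization, whereas the value reported for AD suggests a normalized betweenness, so the scale here is an assumption.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted graph given as {node: set of neighbours}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


def molecular_diversity(disease_genes, adj):
    """Maximum betweenness among the disease genes that are present in the network."""
    bc = betweenness(adj)
    return max(bc[g] for g in disease_genes if g in bc)
```

On a three-node path a-b-c, node b lies on the shortest paths of the two ordered pairs (a, c) and (c, a), so a disease associated with a and b would receive a diversity of 2.0 under this unnormalized convention.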

2.4. Detection of the Significant Disease-chapter Associations

We calculated the edge density to quantify the molecular interactions between ICD chapters. To further detect significant interactions between diseases in different chapters, we devised an approach to identify the diseases that have significant interactions with diseases in chapters other than their own. Given a disease d_i for investigation, we evaluate whether the proportion of interactions (i.e. edge density) of d_i to the disease set D_{C_k} of a chapter C_k is significantly larger than the average proportion of interactions between the diseases in C_k (Fig. S6). We use a binomial test to filter the significantly interacting disease-chapter pairs, in which the edge density of the disease to the chapter is significantly higher than the average edge density of the diseases in the corresponding chapter (details are in SM Section 4).
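One way to picture the binomial filter is as a one-sided test of the number of links a disease has into a chapter against the chapter's own average edge density. The sketch below follows that reading. The function names, the one-sided formulation, and the 0.05 threshold are assumptions for illustration, and the exact test configuration is described in SM Section 4.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def significant_chapter_link(links_to_chapter, chapter_size, chapter_density, alpha=0.05):
    """Test whether a disease links to a chapter more densely than the chapter average.

    links_to_chapter: number of chapter diseases linked to the disease under study.
    chapter_size: number of diseases in the chapter (the number of possible links).
    chapter_density: average edge density among the chapter's own diseases.
    alpha: significance threshold (assumed value, not taken from the text).
    """
    p_value = binomial_tail(links_to_chapter, chapter_size, chapter_density)
    return p_value < alpha, p_value
```

With a chapter of 10 diseases and an internal density of 0.2, a disease linked to 8 of them would be flagged (p ≈ 7.8e-5), while a disease linked to only 2 would not (p ≈ 0.62).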

2.5. Multi-category Prediction of Diseases

The positive correlations between category similarity and molecular similarity, together with the high molecular diversity of many diseases, imply that it would be possible to predict a multi-category map for each disease using its underlying molecular connections. As a pilot method for multiple disease category prediction that integrates molecular module and shared gene similarities, we developed a novel algorithm to generate the possible additional disease categories for a given disease, along with the corresponding molecular association scores (details are in SM Section 5, Fig. S7). In this algorithm, we integrated the correlation between category similarity and module similarity with significant disease-chapter associations (which are based on the shared gene similarity) to predict the additional chapters for a given disease. We divided the disease pairs in the same chapter into three subsets, which correspond to those pairs with shared root parents, shared second-level intermediate parents and shared third-level intermediate parents, respectively, to help predict how closely a pair of diseases would be located in the disease taxonomy. The algorithm relies on the positive correlation between category similarity (or the closeness of the positions of the disease pairs in ICD disease taxonomy) and molecular profile similarity of disease pairs, which means that strong molecular profile similarity between two diseases would indicate that they are located close to each other in the disease taxonomy. To retain only the significant disease-chapter associations, we then filtered the predicted disease-chapter associations with positive association scores by the significant disease-chapter interactions based on shared genes.

2.6. Construction of Integrated Disease Network

To integrate disease associations derived from both molecular and phenotype features, we performed several sequential analytical steps to generate a highly reliable disease network with strict filtering criteria for the disease links (details are in SM Section 6). First, we generated three disease association networks: the disease network with module similarity (MSDN) with 598,420 links and 1744 nodes, the disease network with shared genes (SGDN) with 133,469 links and 1868 nodes, and the disease network with symptom similarity (HSDN) with 1,639,791 links and 1814 nodes (Fig. S10 & Table S10), according to molecular module similarity, shared genes and shared phenotypes, respectively. To reduce the possible noise and bias of disease-related data sources, we applied a multi-scale backbone algorithm (Serrano et al., 2009) to obtain highly reliable disease links (with weights significantly higher than the random expectation) from the three disease networks. We finally obtained 53,241, 8554 and 134,370 highly reliable links for MSDN, SGDN and HSDN, respectively, and retained most nodes (1744 for MSDN, 1782 for SGDN and 1814 for HSDN) of these networks. To further remove possible weak associations derived from module similarity (disease pairs with high module similarity but no direct protein interactions), we calculated the minimum shortest path length (MSPL) between each disease pair and used it as a filtering criterion (MSPL ≤ 1) for MSDN, which resulted in a more biologically meaningful subset of MSDN with 33,611 links and 1694 diseases.
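The multi-scale backbone extraction of Serrano et al. (2009) is commonly implemented as a disparity filter: for a node with degree k and total strength s, an edge of weight w receives the p-value (1 − w/s)^(k−1) and is kept if it is significant for at least one of its endpoints. The sketch below illustrates that filter. The function name and the threshold of 0.05 are assumptions, as the significance level used in the study is given in SM Section 6 rather than in the main text.

```python
def disparity_backbone(weighted_edges, alpha=0.05):
    """Keep edges whose weight is significant for at least one endpoint.

    weighted_edges: list of (u, v, weight) tuples, each undirected edge listed once.
    alpha: significance threshold (assumed value, not taken from the text).
    """
    strength, degree = {}, {}
    for u, v, w in weighted_edges:
        for node in (u, v):
            strength[node] = strength.get(node, 0.0) + w
            degree[node] = degree.get(node, 0) + 1

    def p_value(node, w):
        k = degree[node]
        if k < 2:
            return 1.0  # a node with a single link offers no null to test against
        return (1 - w / strength[node]) ** (k - 1)

    return [
        (u, v, w)
        for u, v, w in weighted_edges
        if min(p_value(u, w), p_value(v, w)) < alpha
    ]
```

In a star network where one spoke carries a weight of 100 and four spokes carry a weight of 1, only the heavy spoke survives: its p-value at the hub is (4/104)^4 ≈ 2.2e-6, while each light spoke has a p-value of about 0.96.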

SGDN captures strong associations between disease pairs if they have a high degree of shared genes, even if their related genes do not form functional modules. In contrast, MSDN gives high weights to disease links if the disease pairs are similarly co-located on the topological modules of the molecular network, even if they have no shared genes. Therefore, MSDN and SGDN provide two complementary types of molecular association evidence for disease pairs, and we took the union of the MSDN subset and SGDN as the molecular association disease network (MADN), which contains 35,389 links and 1811 nodes with the weights derived from the two original networks. Next, we adopted a highly strict criterion to obtain an integrated disease network (IDN) from the fusion of MADN and HSDN links, which contains 35,114 disease links and 1857 nodes.

2.7. Overlapping Category Detection from Integrated Disease Network

Finding the overlapping disease categories can be transformed into the task of detecting the overlapping communities (i.e. modules) in the IDN. BigClam (Yang and Leskovec, 2013) is a state-of-the-art overlapping community detection algorithm based on a variant of non-negative matrix factorization, which achieves near-linear running time and community results of comparably high quality. We used the BigClam algorithm, which is packaged in the SNAP complex network software (http://snap.stanford.edu/snap/), to automatically detect overlapping communities in the IDN. Finally, we obtained 223 overlapping disease communities with 1797 distinct ICD disease codes. These 223 disease sub-categories contain different numbers of ICD codes, ranging from 5 to 168 (Fig. S12 & Data S10).

To obtain a top-level category framework of diseases corresponding to the chapters in ICD taxonomy, we calculated the degree of overlap among the 223 disease sub-categories, using Jaccard similarity to measure the proportion of diseases shared by two given disease categories. This generated a disease category network with 2685 links representing shared ICD codes (a link is established if two disease categories share at least one ICD code, and the link weights correspond to the Jaccard similarity) and nodes representing disease categories. We then clustered the 223 disease sub-categories with a widely used non-overlapping community detection algorithm (considering the link weights and setting the resolution parameter to 0.5) into 17 top-level categories using the shared ICD codes (Fig. S11c & Data S10). This number corresponds to the number of original chapter-level categories in ICD, and we named these categories New Chapters (NCs). The modularity of these 17 top-level categories (which makes the partition directly comparable with ICD chapters) in the network of 223 sub-categories is 0.426, which indicates a rather good partition of the network. These 17 NCs contain different numbers of sub-categories, ranging from 4 to 25, and of diseases, ranging from 53 to 369 (Fig. S11c & Data S10), and cover diseases from all of the 17 chapters of ICD taxonomy (Table S11). The 17 NCs still contain overlapping disease codes since the 223 disease sub-categories have overlapping disease codes. Therefore, the 17 NCs with 223 disease sub-categories form a disease taxonomy consisting of two hierarchical levels with polyhierarchical categories, although with a limited number (1797) of disease members.
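The construction of the disease category network from overlapping sub-categories can be sketched as follows. The function names and the toy sub-categories are illustrative assumptions; only the linking rule (at least one shared ICD code, weighted by Jaccard similarity) is taken from the text, and the subsequent clustering step is not reproduced.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of ICD codes."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def category_network(subcategories):
    """Link two sub-categories if they share at least one ICD code.

    subcategories: dict mapping a sub-category name to its set of ICD codes.
    Returns {(name1, name2): Jaccard weight} with names in sorted order.
    """
    links = {}
    for n1, n2 in combinations(sorted(subcategories), 2):
        weight = jaccard(subcategories[n1], subcategories[n2])
        if weight > 0:
            links[(n1, n2)] = weight
    return links
```

For instance, a sub-category {250.00, 250.01, 240.0} and a sub-category {250.00, 401.9} share one of four distinct codes and are linked with weight 0.25, while a sub-category with no shared code stays unlinked.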

2.8. Statistical Validation of NCD From External Data

To validate the robustness of NCD, we obtained two external phenotype-genotype data sources (i.e. UCSC-GWAS and the PheWAS catalog) that were not used to generate NCD. By measuring whether the disease members of the sub-categories in NCD tend to be associated through shared genes in these two data sources, we can validate the quality of NCD. If the diseases in an NCD sub-category tend to share genes, then they are more likely to be associated with one another than with other diseases. To test this hypothesis, we obtained the overlapping disease codes (ODC) present in both NCD and the two external phenotype-genotype association databases and evaluated the degree of these ODC disease links in each NCD sub-category, considering two diseases linked if they share common genes. In detail, we first obtained the common disease codes involved in both NCD and the UCSC-GWAS or PheWAS database. Then we generated a disease network with shared genes derived from the two datasets, in which two diseases are linked if they share at least one common gene. After that, for each NCD sub-category, we generated a complete disease network with the ODC diseases in it and overlaid this network on the disease network with shared genes. Finally, the overlapping percentage of disease links was calculated to evaluate the degree of molecular associations among the diseases in each NCD sub-category (details are in SM Section 8, Figs. S23–S25).
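The overlap percentage described above amounts to the fraction of all disease pairs within a sub-category that share at least one gene in the external dataset. A minimal sketch, with a hypothetical function name and toy inputs, is:

```python
from itertools import combinations


def link_overlap(subcategory, disease_genes):
    """Fraction of ODC disease pairs in a sub-category that share at least one gene.

    subcategory: iterable of disease codes in one NCD sub-category.
    disease_genes: dict mapping disease codes in the external dataset to gene sets.
    """
    # Overlapping disease codes (ODC): members that also occur in the external data.
    members = sorted(d for d in set(subcategory) if d in disease_genes)
    pairs = list(combinations(members, 2))  # links of the complete network
    if not pairs:
        return 0.0
    shared = sum(1 for a, b in pairs if disease_genes[a] & disease_genes[b])
    return shared / len(pairs)
```

Returning 0.0 for sub-categories with fewer than two ODC diseases is a choice made for this sketch; such sub-categories could equally be excluded from the evaluation.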

2.9. Statistical Analysis

We used R 3.1.0 as the main statistical tool in our work. Two percentages were compared with the binomial test or the chi-squared test. The Wilcoxon rank sum test was used to compare two independent lists of values (e.g. two types of molecular diversities and two groups of MSPLs). All the correlations between two variables were calculated with Pearson's product moment correlation coefficient. Due to the incompleteness and bias of disease-related data (i.e. disease-gene associations and disease-symptom associations), we need to distinguish the information from the background noise. Therefore, for comparison with random expectation, we reshuffled (100 random permutations) the symptom features and the related genes of each disease using the Fisher-Yates method (Fisher and Yates, 1948). The calculations from random permutations were used for the correlation between category similarity and molecular similarity, as well as phenotype similarity. In addition, they were used for the detection of the disease categories with high molecular diversity.
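The Fisher-Yates shuffle that underlies the random controls can be written in a few lines. The sketch below is in Python rather than R, and the way the shuffle is applied here (reassigning whole gene sets across diseases) is only one possible reading of the permutation scheme, which is an assumption for this example. The function names and the seed are likewise hypothetical.

```python
import random


def fisher_yates(items, rng):
    """Return a uniformly random permutation of `items` (the input is not modified)."""
    items = list(items)
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive bounds, so position i may stay in place
        items[i], items[j] = items[j], items[i]
    return items


def permuted_profiles(disease_genes, n_permutations=100, seed=0):
    """Yield randomized disease-to-gene-set assignments (assumed scheme)."""
    rng = random.Random(seed)
    diseases = sorted(disease_genes)
    for _ in range(n_permutations):
        shuffled = fisher_yates(diseases, rng)
        yield {d: disease_genes[s] for d, s in zip(diseases, shuffled)}
```

Each randomized assignment keeps the set of gene profiles unchanged while breaking the link between a disease and its own profile, so a statistic recomputed on these assignments gives a background distribution for comparison.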

3. Results

3.1. Category Similarity of ICD Taxonomy

We curated 1883 distinct ICD disease codes (Table S1) from the 5-level tree structure of 14,292 ICD-9-CM codes, as well as high confidence protein-protein interactions consisting of 15,551 nodes and 218,409 edges (Franceschini et al., 2013). We compiled 153,277 distinct disease-gene associations between 4552 distinct diseases in UMLS codes and 14,975 genes reported in the DiseaseConnect database (Liu et al., 2014) (Fig. S3). Next, by manually mapping the DiseaseConnect identifiers to ICD codes, we obtained 160,754 disease-gene records involving 1883 distinct ICD codes and 14,906 genes (Figs. S1–2 and Data S1).

To evaluate the closeness of two diseases in the ICD tree structure, we applied an established semantic similarity algorithm named category similarity (see Methods, SM Section 2.1). This similarity measure is based on the information content, which quantifies the specificity of a term and can be applied to any categorization scheme that has a rooted tree structure, including the ICD-9-CM disease classification. We then created a disease network comprising 1883 nodes (representing ICD codes) and 154,563 edges. The edge weights reflect the category similarity values, with higher values indicating higher similarity between diseases whose code positions are adjacent in the ICD tree. The category similarity distribution showed that most disease pairs (135,271, 87.52%) had similarities between 0.2 and 0.5 (Fig. S4a). Disease pairs within this range mostly belong to different disease subcategories in the same chapter, such as diseases of other endocrine glands and disorders of thyroid gland. For example, type 2 diabetes (ICD: 250.00) and simple goiter (ICD: 240.0), which are in ICD chapter 3, have a category similarity of 0.37. However, there do exist disease pairs with high category similarities, such as type 2 diabetes (ICD: 250.00) and type 1 diabetes (ICD: 250.01) with a category similarity of 0.83. Overall, this measure indicates the capability of ICD in bringing together similar diseases in its tree structure, and the overrepresentation of lower similarity scores is indicative of its limitations in doing so.

While the ICD classification was derived from clinical manifestations (including symptoms and signs) and does not necessarily reflect the connections among the molecular components of diseases, it is informative to quantify to what extent it carries molecular information. We investigated the correlations of category similarity of disease pairs with 1) the degree of shared genes and shared clinical phenotypes, 2) GO term (Cellular Component, Molecular Function, Biological Process) similarity (Mistry and Pavlidis, 2008), and 3) topological similarity (i.e., minimum shortest path length and molecular module similarity) among them (Methods, SM Section 2.2).

We found that close disease codes (disease pairs with a high category similarity) actually have higher clinical phenotype similarity (Methods, SM Section 2.3), which adheres to the construction principle of ICD taxonomy based on symptom phenotypes (Fig. S4b, PCC = 0.960, 95% CI = [0.854, 1.000], p = 2.079e-05). Furthermore, we observed strong correlations between higher category similarity bins for molecular profiles, compared to lower category similarity bins (Fig. S4c–i and Table S2. See Methods, SM Section 2 for detailed information). In particular, we observed that in addition to the strongly positive correlations, the percentage overlap of disease pairs with shared genes was generally larger than the random controls (Fig. S4c and d, see Methods, SM Section 2.4). The top 10 disease pairs with the largest number of shared genes are all from Chapter 2, which consists of cancer types. This might reflect the fact that cancers are the most studied and complex disease phenotypes involving various gene mutations (Table S3, see Methods, SM Section 2 for detailed information).

Overall, these findings indicate that diseases in the same ICD chapter tend to have a higher degree of shared genes, and the closer their positions in the ICD tree, the higher is the degree of shared genes.

3.2. Heterogeneity of Molecular Specificity in ICD Taxonomy

We measured the maximum betweenness of disease-related genes in the protein-protein interaction (PPI) network to quantify the molecular diversity (the inverse of specificity) of each disease, as described previously (Zhou et al., 2014) (see Methods, SM Section 3). A high maximum betweenness indicates a high molecular diversity. For example, the molecular diversity of Alzheimer's disease could be represented by the maximum betweenness of its related genes (i.e., the betweenness of the APP gene) in the PPI network (Fig. S5a).
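The measure described above can be sketched in a few lines. The example below computes normalised betweenness with Brandes' algorithm on an unweighted, undirected toy network and takes the maximum over a disease's genes. The network `PPI`, the gene labels, and the function names are hypothetical, and the normalisation is an assumption; the procedure actually used is described in SM Section 3.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps each node to a set of neighbours. Scores are normalised by
    (n - 1) * (n - 2), i.e. the ordered node pairs that exclude the node.
    """
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes inwards.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    scale = 1 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: b * scale for v, b in bc.items()}

def molecular_diversity(adj, disease_genes):
    """Maximum betweenness among the disease genes present in the network."""
    bc = betweenness(adj)
    return max((bc[g] for g in disease_genes if g in bc), default=0.0)

# Hypothetical toy interactome: g3 is the hub that most shortest paths cross.
PPI = {
    "g1": {"g2"},
    "g2": {"g1", "g3"},
    "g3": {"g2", "g4", "g5"},
    "g4": {"g3"},
    "g5": {"g3"},
}
diverse = molecular_diversity(PPI, {"g3", "g4"})
specific = molecular_diversity(PPI, {"g1", "g2"})
```

A disease whose gene set contains the hub `g3` receives the higher score, which mirrors the reading that a high maximum betweenness signals high molecular diversity.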

We observed that the molecular diversity of diseases in the ICD taxonomy is heterogeneous, with molecular diversity values varying from 10⁻⁸ to 10⁻² with a median value of 8.93e-04 (Fig. 2a and Data S3). The top two disease chapters with the highest median molecular diversity were Chapter 2 (3.87e-03) and Chapter 1 (1.31e-03) (Fig. 2b). Furthermore, we found that neoplasm (Chapter 2) and infectious disease (Chapter 1) categories tended to have higher molecular diversity compared to their complementary categories (Neoplasms vs. Non-Neoplasms p < 2.2e-16, Infectious diseases vs. non-infectious diseases p = 2.0e-02, Fig. S5b–c) and random controls. We also found that disease categories annotated as “other/unspecified” categories (SM Section 3.1) had higher molecular diversity compared to disease categories with specific conditions (p = 9.75e-03, Fig. S5d, Data S4; see SM Section 3.1) and its random control. These results indicate that the diseases in neoplasms, infectious diseases, and “Other/unspecified diseases” categories should be further investigated for molecular subtypes. A detailed discussion of disease cases is offered in SM Section 3, Data S5 & Tables S4–S5.

Fig. 2 — Lack of molecular specificity in ICD taxonomy and the blurred boundary between disease categories in ICD taxonomy. a. The distribution of molecular diversity of 1883 ICD diseases; b. The boxplot of molecular diversity of 17 ICD chapters (ordered by median values); c. The disease network with shared genes in which the diseases belong to Chapter 5 and Chapter 8. The ICD codes 295, 296, in Chapter 5 have dense relationships to the ICD codes in Chapter 8; d. The disease category network with shared genes. The nodes indicate the disease chapters and the weights of edges represent the edge densities between disease chapter pairs; the nodes with same color are considered as a chapter cluster, which is detected by community detection algorithm; e. Modularity of disease networks with chapter as module annotations; f. The correlation between category similarity and phenotype similarity of ICD chapters.

3.3. The Blurred Molecular Boundary Between ICD Categories

In the current ICD taxonomy, we observed many instances where there exists a significant number of links between diseases in different chapters, comparable to the number of links between diseases within the same chapter (Table S6 & Fig. 2d, see Methods & SM Section 4). For example, strong shared-gene relationships were detected between respiratory diseases (Chapter 8) and mental, behavioral, and neurodevelopmental disorders (Chapter 5) (Fig. 2c–d, more examples shown in SM Section 4, Tables S7–9). In addition, by calculating the shared molecular connections between diseases in the context of chapters, we could detect 768 diseases with a significant number of shared genes with diseases other than those in their own chapters (Data S6 & SM Section 4).

To further quantify the molecular boundaries between the disease categories in ICD disease taxonomy, we evaluated the modularity, a structural measure of the tendency of the network to form close-knit communities (see Methods, SM Section 2.5), generated by either shared molecular profiles or shared phenotypes. When we mapped ICD chapters as grouping annotations on the disease networks filtered with appropriate weight thresholds, and calculated their modularity, we observed very low modularity values (Fig. 2e). Since modularity is a widely used measure to validate the quality of partitions/module structures in complex networks, this means that the grouping of ICD chapters does not agree with the natural topological groupings of their corresponding molecular networks (disease modules). This finding gives strong evidence for the blurred disease boundaries of the ICD taxonomy, possibly arising from the complexity of the underlying molecular mechanisms, in particular the possible overlap of their respective subnetworks, or disease modules, in the interactome.
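To make the modularity argument concrete, the sketch below evaluates Newman's modularity Q for a fixed grouping of nodes in an unweighted, undirected network. It is a minimal illustration under the assumption of unweighted edges; the toy edge list, the two groupings, and the function name are hypothetical, and the variant actually used is specified in SM Section 2.5.

```python
def modularity(edges, group_of):
    """Newman modularity Q of an unweighted, undirected graph for a fixed partition.

    Q = sum over groups of (internal edges / m) - (group degree / 2m)^2.
    """
    m = len(edges)
    internal = {}  # number of edges with both ends in the group
    degree = {}    # summed node degree per group
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        degree[gu] = degree.get(gu, 0) + 1
        degree[gv] = degree.get(gv, 0) + 1
        if gu == gv:
            internal[gu] = internal.get(gu, 0) + 1
    return sum(internal.get(g, 0) / m - (d / (2 * m)) ** 2
               for g, d in degree.items())

# Hypothetical disease network: two triangles joined by a single bridge.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

# Grouping that follows the network topology.
topological = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
# Grouping that cuts across it, as an annotation unrelated to the links would.
crosscutting = {"a": 1, "d": 1, "b": 2, "e": 2, "c": 3, "f": 3}

q_topological = modularity(EDGES, topological)
q_crosscutting = modularity(EDGES, crosscutting)
```

The grouping that follows the two triangles yields a clearly positive Q, whereas the cross-cutting grouping yields a negative one. Low or negative values for chapter annotations therefore indicate that the chapters do not coincide with the densely linked regions of the network.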

Furthermore, although the modularity of disease networks with shared phenotypes (similarity ≥ 0.1) is slightly positive, the weak correlation (PCC = 0.08, p-value = 0.7588) between phenotypic similarity and category similarity of disease pairs in each chapter (Fig. 2f) indicates that ICD taxonomy does not adequately incorporate phenotype similarity knowledge into disease category structures. These observations indicate that the strict tree structures in the ICD taxonomy, wherein terms can only have one lineage (Cimino, 2011), may be inefficient for disease classification given the contemporary knowledge of disease pathobiology, and should therefore be refined to be polyhierarchical in structure.

3.4. Polyhierarchical Mapping of Diseases Using Molecular Module Similarity

It has been proposed that if two disease modules overlap in the molecular interaction network, local perturbations in one disease might disrupt the biological pathways in the other disease, resulting in shared pathobiological characteristics (Menche et al., 2015). We observed a strong positive correlation between category similarity and module similarity (see Methods, SM Section 5.1) of diseases, indicating that two diseases with higher module similarity would be more closely localized in the disease category (Fig. 3a, PCC = 0.887, 95% CI = [0.584,0.973], p = 6.12e-04; 3b, PCC = 0.974, 95% CI = [0.889,0.994], p = 2.08e-06).

Fig. 3 — Polyhierarchical map prediction of ICD taxonomy based on molecular module similarity. a. Correlation between mean semantic (category) similarity and mean modular similarity of disease pairs; b. Correlation between overlapping edge ratio with category similarity and modular similarity of disease pairs; c. Correlation between predicted category number and molecular diversity of ICD codes in Chapter 14 (PCC: 0.438; 95% CI: [0.401, 0.474]; p < 2.2e-16); d. Polyhierarchical map of the disease codes in Chapter 8, indicating that the 20 disease codes in Chapter 8 have significant associations with two disease category clusters: 1) Chapter 1 (infectious disease) and Chapter 2 (neoplasms); 2) Chapter 3 (endocrine, nutritional and metabolic diseases and immunity disorders), Chapter 5 (mental disorders), Chapter 6 (nervous diseases), Chapter 9 (digestive diseases), Chapter 12 (skin and subcutaneous disease) and Chapter 13 (musculoskeletal system and connective tissue diseases); e. The boxplots of phenotype similarity of predicted disease pairs and original ICD disease pairs in the same top-level chapters (p < 2.2e-16, Wilcoxon test).

Here, we utilize the module similarity between disease pairs to predict the categories of similar diseases. In particular, we determine the taxonomic closeness (SM Section 2.1) of each given disease pair to predict the additional categories of diseases, by applying heuristic rules incorporating the positive correlation between category similarity and module similarity (see Methods, SM Section 5.1). Using the 598,420 disease pairs with positive module similarity (Data S7), we generated 2057 predicted additional category results for 722 out of 1883 disease codes (38.3%), in which each disease code had ~4 categories on average (Data S7 & S8). We found that the number of predicted categories positively correlated with the molecular diversity of the original disease codes (Fig. 3c, PCC = 0.547, 95% CI = [0.514, 0.578], p < 4.94e-324; for external validations see SM Section 5.2). This indicates that diseases with multiple pathogenic pathways could be captured by polyhierarchical mapping. For example, the 20 diseases in Chapter 8 (i.e., Diseases of the Respiratory System) have been predicted to belong to over five additional chapters, such as neoplasms, infectious diseases, and diseases of the skin and subcutaneous tissue (Fig. 3d), which is consistent with the heterogeneous pathogenesis of COPD and asthma (Grainge et al., 2016; Sharma et al., 2015). A detailed discussion of the polyhierarchical map of mental disorders is offered in SM Section 5.3 (Fig. S8).

Furthermore, we found that the predicted category framework, which is based only on molecular module similarity, also had higher phenotype similarity than diseases with shared root codes in the original ICD chapters (see SM Section 5.2, median: 0.0703 vs. 0.0563; mean: 0.125 vs 0.109; p < 2.2e-16, Fig. 3e). This observation helps to establish that the predicted category results are of higher quality than ICD with respect to their phenotype homogeneity.

3.5. Integrated Disease Network for Overlapping Disease Classification

To extend and redefine disease concepts by discovering additional categories of a disease, we generated a novel disease taxonomy by constructing an integrated disease network (IDN) from (a) shared clinical phenotypes, including shared symptoms, and (b) shared molecular profiles, including (i) shared genes and molecular module similarity and (ii) shortest path lengths in the PPI network. The network was built through a systematic integration process that filters out possible false positive associations (see Methods, SM Section 6, Fig. S9 and Fig. S11a), and it includes 1857 diseases and 35,114 links (Data S9).

Next, we applied high-performance community detection algorithms to identify overlapping community structures in the IDN (see Methods, SM Section 7 and Fig. S11a). In particular, we first used BigClam (see Methods) since this method is able to detect overlapping communities whereby a disease can belong to multiple communities, in line with our main premise of creating a molecular-based flexible disease classification. This resulted in 223 disease sub-categories with overlapping diseases as members (Fig. S11a and Data S10), which included 1797 distinct diseases from the ICD taxonomy. These 223 disease sub-categories contain different numbers of ICD codes, ranging from 5 to 168 (Fig. S12). They therefore represent different levels of disease categories, similar to ICD chapters and their sub-categories.

To further develop a more unified view of the disease category quality, we used the established BGLL method (Blondel et al., 2008), which detects non-overlapping communities, to cluster these 223 sub-categories further into 17 non-overlapping, distinct parts, such that these represent the 17 new chapter-level categories (called new chapters, or NCs) using the shared ICD codes (see Methods, SM Section 7.2, Fig. S11b). Overall, this clustering order effectively ensures distinct top-level categories that have overlapping subcategories.
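The input to this second clustering step can be pictured as a network whose nodes are sub-categories and whose links reflect shared ICD codes. The sketch below only builds such a network; it does not implement BGLL, and weighting a link by the number of shared codes is an assumption (the weighting actually used is described in SM Section 7.2). The sub-category labels, memberships, and function name are hypothetical.

```python
from itertools import combinations

def subcategory_graph(members):
    """Link two sub-categories that share ICD codes; weight = number of shared codes."""
    edges = {}
    for a, b in combinations(sorted(members), 2):
        shared = members[a] & members[b]
        if shared:
            edges[(a, b)] = len(shared)
    return edges

# Hypothetical overlapping sub-categories (sets of ICD codes).
MEMBERS = {
    "M01": {"491", "493", "496"},
    "M02": {"491", "492", "496"},
    "M03": {"151", "532"},
    "M04": {"151", "532", "533"},
}
GRAPH = subcategory_graph(MEMBERS)

# Codes that belong to more than one sub-category (the overlap itself).
OVERLAPPING = {code for a, b in GRAPH for code in MEMBERS[a] & MEMBERS[b]}
```

A non-overlapping community detection method run on `GRAPH` would place `M01` with `M02` and `M03` with `M04`, so the top-level groups are disjoint while codes such as 496 or 151 still appear in several sub-categories.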

The resulting 17 NCs contain different numbers of sub-categories ranging from 4 to 25, or of diseases ranging from 53 to 369 (Fig. S11c). We denote the 17 NCs together with their 223 disease sub-categories as our new overlapping disease classification (NCD). Each of the resulting NCs reflects the shared features of integrative molecular and phenotypic profiles (SM Section 7.4; Fig. S11c & Table S12, Data S11–14). For example, NC08 could be denoted as the “limbic system development-vision disorders-related diseases” since the most enriched PPI module (p = 4.9e-324, Relevance ratio = 0.7778) of its constituent diseases was mainly related to the biological process “limbic system development” (p = 1.13e-04), and 73.84% (127/172) of diseases in NC08 shared the phenotype “vision disorders” (p = 4.9e-324) (Tables S13–S14).

3.6. New Disease Categories Define Diseases with Clearer Boundaries and Balanced Diversity

To confirm the phenotypic and molecular cohesiveness of our overlapping disease categories, we compared the modularity of NCD with that of the ICD taxonomy. We found that the 17 NCs consistently have much higher modularity than the original ICD chapters for all types of disease association networks (SM Section 7.3; Fig. 4a–h, Fig. S13). This finding indicates that the phenotypic and molecular links between the diseases of a category in NCD are much denser compared to ICD taxonomy.

Fig. 4 — Properties of new disease categories (NCD) and comparison to conventional ICD classification. a. Modularity of phenotype (symptom)-based disease network (NCD vs. ICD); b. Modularity of gene-based disease network (NCD vs. ICD); c. Modularity of molecule module-based disease network (NCD vs. ICD); d. Modularity of gene ontology (molecular function)-based disease network (NCD vs. ICD); e. Modularity of gene ontology (biological process)-based disease network (NCD vs. ICD); f. Modularity of gene ontology (cellular component)-based disease network (NCD vs. ICD); g. Modularity of shortest path-based disease network (NCD vs. ICD); h. Modularity of GWAS shared gene-based disease network (NCD vs. ICD); i. Percentage of minimum shortest path lengths within the range [0, 2] of intra-category and inter-category (NCD vs. ICD, chi-squared test); j. The number of overlapping disease pairs from NCD with shared genes from GWAS, compared to random expectation, binomial test; k. The number of overlapping disease pairs from NCD with shared genes from PheWAS, compared to random expectation, binomial test; l. The number of predicted NCD categories for ICD codes (i.e. diseases) as a function of their molecular diversity.

Furthermore, to ensure that the performance of NCD is indeed due to the combined effect of the molecular and phenotypic profiles, we performed a controlled experiment where we determined the new disease categories based on molecular-based networks and phenotype-based networks only by running the entire analytical pipeline and applying the same category prediction algorithm (SM Section 7.3). We found that NCD outperforms both molecular-based categories and phenotype-based categories in capturing the gene similarity, GO term similarity and phenotypic similarity (Figs. S14–S16). This suggests the importance of integrating both clinical phenotypes and molecular profiles to obtain a high-quality disease taxonomy.

Furthermore, we found that the minimum shortest path lengths (MSPLs) in the PPI network between disease pairs that belong to the same NCD categories had a larger percentage of low values (i.e., [0,2]) compared to ICD (Fig. 4i, 62.86% vs 58.95%, p < 4.9e-324; SM Section 7.3). This result indicates that diseases within an NCD category have a significantly higher degree of shared genes (or shorter path lengths) in comparison to diseases within a category in ICD. On the other hand, the MSPLs between disease pairs in different NCD categories had a significantly lower percentage of low values than those in the ICD (47.27% vs 54.88%, p < 4.9e-324, Fig. 4i & Fig. S17; external validations in SM Section 8.3), which indicates a lower degree of shared genes (or longer path lengths) between diseases from different categories in NCD than in ICD. These findings demonstrate that our NCD framework has clearer boundaries between distinct diseases belonging to different categories than those in the original ICD disease taxonomy.
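The quantity compared above can be sketched as follows: for each disease pair, take the smallest hop count between any gene of one disease and any gene of the other, then report the share of pairs at or below a cutoff of 2. This is a minimal sketch on an unweighted toy network; the network `PPI`, the gene sets in `GENES`, and the function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, sources):
    """Multi-source BFS: hop distance from the nearest source to each reachable node."""
    dist = {s: 0 for s in sources if s in adj}
    queue = deque(dist)
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def min_shortest_path_length(adj, genes_a, genes_b):
    """Smallest hop count between the two gene sets (0 if they share a gene)."""
    dist = bfs_distances(adj, genes_a)
    reachable = [dist[g] for g in genes_b if g in dist]
    return min(reachable) if reachable else None

def fraction_close(adj, pairs, disease_genes, cutoff=2):
    """Share of disease pairs whose minimum shortest path length is <= cutoff."""
    lengths = [min_shortest_path_length(adj, disease_genes[a], disease_genes[b])
               for a, b in pairs]
    return sum(1 for d in lengths if d is not None and d <= cutoff) / len(lengths)

# Hypothetical interactome: a simple chain p1 - p2 - ... - p6.
PPI = {f"p{i}": set() for i in range(1, 7)}
for i in range(1, 6):
    PPI[f"p{i}"].add(f"p{i + 1}")
    PPI[f"p{i + 1}"].add(f"p{i}")

GENES = {"d1": {"p1"}, "d2": {"p2", "p3"}, "d3": {"p6"}, "d4": {"p3"}}
PAIRS = [("d1", "d2"), ("d2", "d4"), ("d1", "d3"), ("d2", "d3")]
share = fraction_close(PPI, PAIRS, GENES)
```

Computing this share separately for pairs inside one category and for pairs spanning two categories gives the intra- versus inter-category contrast reported in Fig. 4i.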

Moreover, to validate the robustness of NCD predictions, we calculated the degree of associations in terms of network density among the diseases in each sub-category of NCD. To this end, we investigated the overlaps with the disease pairs connected by shared genes using two independent phenotype-genotype association databases, GWAS and PheWAS (see Methods, SM Sections 1.2, 1.6 & 8). We found that for the 223 sub-categories in NCD, network density was significantly higher compared to random controls (GWAS: p-value = 9.42e-197, Fig. 4j; PheWAS: p-value = 1.31e-14, Fig. 4k). This means that the diseases in the 223 sub-categories in NCD would tend to have shared genes. For example, the new chapter NC12 in NCD, including 11 sub-categories and 136 ICD diseases (belonging to eight ICD chapters), is enriched with respiratory and airway diseases (e.g., COPD and asthma).

We obtained 37 overlapping diseases from the GWAS database, which have a high degree of shared genes with the diseases in each sub-category of NC12 (Fig. 5a). In particular, sub-categories such as NC12.M06 (p-value = 2.53e-30), NC12.M03 (p-value = 1.80e-38) and NC12.M02 (p-value = 6.89e-19) have significantly higher density than that of the whole GWAS disease network (Fig. S22). Furthermore, we found that the overlapping subcategories of the NCD are able to differentiate between different components (i.e., asthma/allergy vs. COPD) of the same broad group of diseases (i.e., respiratory diseases) (see Fig. S22 for a detailed example). Indeed, in the NC12 disease chapter chiefly containing respiratory diseases, the two sub-categories, namely NC12.M06 and NC12.M07, overlap in the underlying molecular interaction network while still containing the respective disease (asthma and COPD, respectively) genes separately (Fig. 5a). A detailed discussion is offered in SM Section 8 (with results in Data S21–22, Tables S18–19 & Figs. S23–S25).
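One way to read the density comparison is as a binomial test: if links were spread over a sub-category's disease pairs at the background rate of the whole network, how likely would it be to see at least the observed number of linked pairs? The sketch below follows that reading. It is an assumption about the test set-up, which is detailed in SM Section 8; the toy network and all names are hypothetical.

```python
from itertools import combinations
from math import comb

def linked_pairs(nodes, edges):
    """Return (number of linked pairs, number of possible pairs) within `nodes`."""
    pairs = [frozenset(p) for p in combinations(sorted(nodes), 2)]
    return sum(1 for p in pairs if p in edges), len(pairs)

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical shared-gene disease network with six diseases and four links.
NODES = {"a", "b", "c", "d", "e", "f"}
EDGES = {frozenset(e) for e in [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")]}

linked_all, pairs_all = linked_pairs(NODES, EDGES)
background = linked_all / pairs_all           # density of the whole network

k, n = linked_pairs({"a", "b", "c"}, EDGES)   # one candidate sub-category
p_value = binomial_upper_tail(k, n, background)
```

Here the candidate sub-category is fully linked while the background density is 4/15, so the upper-tail probability is small. With real data the same calculation would be repeated per sub-category, and a correction for multiple testing would be a sensible addition.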

Fig. 5 — Biological insights of new disease taxonomy. a. The New Chapter containing airway diseases (NC12) consists of 11 sub-categories and 136 ICD diseases belonging to 8 ICD chapters. The subcategories overlap in the underlying molecular interaction network, while still separately including the disease genes (asthma and COPD, respectively) that characterize each subcategory; b. The disease network of neoplasms in NCD. The 32 sub-categories significantly representing neoplasms are divided into 4 NCs (G1). *Helicobacter pylori* [*H. pylori*] (041.86), malignant neoplasm of stomach (151), duodenal ulcer (532), peptic ulcer, site unspecified (533), which have significant relationships, are clearly clustered into a sub-category (NC11. M07) (G2); c. A sub-category (NC06.M10) in NCD, which includes diseases from 8 different ICD chapters with shared molecular mechanism and phenotypes. Fifty percent (13/26 = 50%) of the diseases in NC06.M10 share a PPI module, the biological function of which is enriched with immune system response, while over 90% (25/26 = 96.2%) of the shared common phenotype of this module is “Pain”.

In addition, in NCD, a disease can be classified into multiple categories, and the number of categories of a disease positively correlates with its molecular diversity (Fig. 4l, PCC = 0.352, 95% CI = [0.311, 0.392], p-value < 4.94e-324; external validations in SM Section 8.3). For example, we reclassified neoplastic diseases into multiple categories due to their high molecular diversity. Two hundred and fifty-eight neoplastic diseases in our NCD were divided into 144 sub-categories and 17 NCs (Figs. S20 & S21). Thirty-nine out of 144 sub-categories (27.08%) were enriched with “neoplasm” diseases (Data S19, p-value = 2.78e-5). Most of these sub-categories (32) fell within 4 NCs (i.e., NC01, NC06, NC11, and NC16), which together contained 188 “neoplasm” disease codes (Fig. 5b); 76.06% (143/188) of these neoplastic diseases were classified into >1 sub-category, ranging from 2 to 15 (Data S20 & Table S17). The neoplasm with the highest molecular diversity, “malignant neoplasm of connective and other soft tissue” (ICD: 171; molecular diversity: 0.035), was reclassified into 15 sub-categories, and “malignant neoplasms of thyroid gland” (ICD: 193; molecular diversity: 0.0028) was assigned to 14 sub-categories. Furthermore, related diseases were reclassified together in NCD, such as the well-known disease correlations among H. pylori infection (ICD: 041.86), stomach cancer (ICD: 151), and duodenal ulcer (ICD: 532) or peptic ulcer (ICD: 533) (Fig. 5b) (Sitas, 2016; Graham, 2015).

More interestingly, some diseases, like viral hepatitis C (ICDs: 070.4, 070.5, 070.7), graft-versus-host disease (ICDs: 279.5/279.50), glomerulonephritis (ICDs: 580, 582, 582.9), circumscribed scleroderma (ICD: 701.0), systemic lupus erythematosus (ICD: 710.0), and rheumatoid arthritis (ICDs: 714/714.0), each from different chapters in ICD taxonomy, were classified together into a unique NCD sub-category (NC06.M10) since 50% (13/26) of these diseases share a PPI module related to immune response (SM Section 7.4, Fig. 5c, Data S15–16, Tables S15–16).

In addition, diseases originally in the same ICD chapter, such as viral pneumonia (ICD: 480) and influenza (ICD: 487) from respiratory system-related diseases (Chapter 8), were reclassified into different categories in the NCD (NC12 and NC10, respectively). Influenza shared more phenotype profiles with “episodic mood disorders” (ICD: 296) in NC10.M01 than with viral pneumonia in NC12 (Fig. S18 & Data S17), which is in accordance with recent epidemiological studies of episodic mood disorders and influenza (Okusaga et al., 2011; Canetta et al., 2014). We also found that influenza shared some molecular profiles with “episodic mood disorders” (ICD: 296) in NC10.M01 (Fig. S19, Data S18).

These findings suggest that NCD offers a promising integrative framework incorporating both clinical phenotypes and molecular profiles for disease taxonomy that has very practical implications for the precise investigation of disease subtyping and etiologies.

4. Discussion

Given the molecular network mechanisms (Barabasi et al., 2011; Zanzoni et al., 2009), genetic pleiotropy (Solovieff et al., 2013), as well as complicated genotype-phenotype associations underlying diseases, the establishment of a molecular-based disease taxonomy with clear boundaries is essential but challenging. From the molecular network perspective, we first investigated the utility, shortcomings, and inconsistencies of ICD-9-CM, the established disease taxonomy for clinical settings. We found that a considerable number of diseases (~40% of our investigated diseases), for example, cancer and infectious diseases, have diverse molecular network mechanisms and tend to interact more with diseases from other chapters. It is also these molecularly diverse diseases that mainly contribute to the blurred boundary of ICD disease taxonomy (see Methods, SM Sections 4 & 7). Upon exploring the molecular diversity and cross-chapter interactions between diseases, we propose a novel disease classification system based on the integration of the clinical and molecular profiles of diseases. In particular, we integrate disease networks taking into account molecular and phenotypic connectivity among diseases, predict the multiple disease categories that diseases belong to, and finally validate the biological cohesiveness of our NCD by network topological measures such as modularity and shortest path length.

Our findings indicate that although general correlations exist between disease closeness in ICD taxonomy and underlying molecular profiles, ICD still displays significant limitations with regard to the heterogeneity of molecular diversity and clear category boundaries. In our NCD, a disease with a high molecular diversity tends to be classified into multiple disease categories, which indicates that there exist more disease subtypes for that disease. For example, “malignant neoplasm of the pancreas” was reclassified into 11 sub-categories and 4 NCs, which is consistent with a recent study wherein 4 phenotypic subtypes of pancreatic cancer were enriched for 10 distinct molecular mechanisms (Bailey et al., 2016). Therefore, we believe that the new disease classification system may help facilitate precise clinical diagnosis and correct prognosis (Jameson and Longo, 2015), and does so in alignment with refined molecular network diagnostics.

Furthermore, the molecular network underpinnings and overlapping disease categories of NCD provide a credible relationship map between diseases and disease categories that may radically transform our current understanding of diseases and relevant treatment paradigms. On the one hand, our approach accurately links diseases with all possible underlying mechanisms in the molecular interaction network. On the other hand, it presents a promising approach to the identification of targeted drugs for the treatment of related diseases. For example, breast cancer and influenza (both in NC11.M02) may share potential drug targets (Park, 2012). As another example, metformin, widely prescribed to treat metabolic syndrome (in NC11.M02), could alter the gut microbiome composition and function, improve gut microbial dysbiosis (Forslund et al., 2015; Cabreiro et al., 2013), and also prevent colorectal cancer (also in NC11.M02) through microbiome-influenced immune response modification (Nakatsu et al., 2015). Here, it is important to note that while a considerable number of diseases have a strong environmental component, our main focus has been the many diverse molecular determinants. In the future, additional environmental factors such as epigenetic changes can be added into the data integration scheme to further refine the classification.

There exist several potential limitations of this work. Although we have aimed to address the possible confounders by constructing random controls and using external evaluations, the incompleteness and bias incorporated in the integrated data sources are likely to influence the generalization of our results. For example, DiseaseConnect yields an incomplete disease-gene database: 1883 ICD diseases could be mapped (Table S11), leading to only 1797 diseases included in the NCD. Furthermore, as with other studies that rely on literature-based and ontological knowledge, investigation bias remains an issue, where the molecular mechanisms (e.g. related genes and their interactions) of some diseases (e.g. cancer) being more intensively studied than others may influence the results (Menche et al., 2015). We expect the results of similar works to be more refined in the future as biomedical datasets become more complete. The incorporation of more comprehensive disease-gene data sources, such as DISEASES (Pletscher-Frankild et al., 2015) and MalaCards (Rappaport et al., 2017), could result in an improved study. While we have chosen to keep individual external gene expression datasets outside the scope of this study since gene expression is highly tissue- and cell-type dependent, it presents an interesting future direction and could potentially improve the quality of the resulting disease categories if exhaustive lists of tissue-specific expression datasets are used in dedicated studies. In addition, our NCD merely delivers a two-level taxonomy framework without elaborated hierarchical structures in the same disease categories, which could be further refined or optimized through methods like hierarchical clustering algorithms (Murtagh and Contreras, 2012) and systematic posteriori ontology engineering method (Gessler et al., 2013). Finally, high-quality ontologies, such as the Human Phenotype Ontology (Kohler et al., 2014) and Disease Ontology (Kibbe et al., 2015), can be used for external validation, or for further integration to obtain more robust and extensive NCDs.

In this big-data era, the dramatically increasing multi-omics databases, as well as clinical data from electronic health records (EHR) involving phenotypic, therapeutic and environmental factors information (Jensen et al., 2012), should also be incorporated into the new disease taxonomy refinement for patient stratification and disease treatment. At this point, a realistic assumption is that the translation of this classification to the clinic will need some time. That said, while the ICD was originally made “by clinicians for clinicians”, it is now widely used by biomedical researchers as well to gain a deeper understanding of human diseases. We therefore believe that researchers will be the first and direct beneficiaries of our approach.

In conclusion, our study provides valuable insights into polyhierarchical network-based disease classification beyond the traditional tree structure. Our integrated disease network approach is sufficiently powerful to elucidate the tangled underpinnings of human diseases and uncover distinct disease boundaries. Our work may provide a new framework for disease taxonomy reform based on big-data fusion, and thereby help to build the robust infrastructure needed for precision medicine.

The following are the supplementary data related to this article.

Data S1

The distribution on the number of genes of 1883 ICD diseases.

mmc1.xlsx^{(42.9KB, xlsx)}

Data S2

The gene list of all the 314 PPI modules detected from the whole human PPI network.

mmc2.xlsx^{(63.8KB, xlsx)}

Data S3

The molecular diversity of the ICD according to the maximum betweenness of disease-related genes in the PPI network.

mmc3.xlsx^{(83.4KB, xlsx)}

Data S4

The maximum betweenness of the disease codes related to other/unspecified conditions.

mmc4.xlsx^{(42KB, xlsx)}

Data S5

The maximum betweenness (median) of ICD subcatagories.

mmc5.xlsx^{(12.3KB, xlsx)}

Data S6

The diseases with significant higher edge density to the chapter than the average edge density of the diseases in the corresponding chapter.

mmc6.xlsx^{(290.3KB, xlsx)}

Data S7

The associations between diseases and chapters based on molecular module similarity.

mmc7.xlsx^{(75.9KB, xlsx)}

Data S8 — Polyhierarchical map of diseases on modular similarity.

Data S9

The integrated disease network based on systematic integration process.

mmc8.xlsx^{(869.6KB, xlsx)}

Data S10

The information of the new classification of diseases.

mmc9.xlsx^{(164.2KB, xlsx)}

Data S11

The gene list in every PPI module for the 1797 diseases in the integrated disease network.

mmc10.xlsx^{(47.2KB, xlsx)}

Data S12

The significant modules in every NCDs.

mmc11.xlsx^{(22.9KB, xlsx)}

Data S13

The most significant shared genes with betweenness centrality in NC04 & NC17.

mmc12.xlsx^{(11.7KB, xlsx)}

Data S14

The significant phenotypes in every NCD.

mmc13.xlsx^{(25.2KB, xlsx)}

Data S15

The disease association network of NC06.M10.

mmc14.xlsx^{(29.8KB, xlsx)}

Data S16

The significant shared PPI modules in NC06.M10.

mmc15.xlsx^{(15.1KB, xlsx)}

Data S17

The overlapping and unique significant phenotypes among 296, 487 and 480.

mmc16.xlsx^{(10.9KB, xlsx)}

Data S18

The first-neighbor PPI subnetwork of the genes associated with 296, 487 and 480.

mmc17.xlsx^{(66KB, xlsx)}

Data S19

The 39 subcategories with significant “neoplasms” correlations.

mmc18.xlsx^{(12.8KB, xlsx)}

Data S20

The detailed information for the 32 subcategories with significant “neoplasms” correlations in NC01, NC06, NC11 and NC16.

mmc19.xlsx^{(37.2KB, xlsx)}

Data S21

The subcategories in NCD validated by the GWAS data.

mmc20.xlsx^{(13.2KB, xlsx)}

Data S22

The subcategories in NCD validated by the PheWAS data.

mmc21.xlsx^{(13KB, xlsx)}

Supplementary material

mmc22.docx^{(14.4MB, docx)}

Funding Sources

The work was supported by the National Natural Science Foundation of China (61105055, 81230086 and 81673833), the National Science and Technology Major Project for New Drugs Research and Development of China (2017ZX09301-059, 2017ZX09503-001-003), the National Key R&D Project (2017YFC1703506), and the Fundamental Research Funds for the Central public welfare research institutes (ZZ0908029, 2017JBM020, DUT16ZD227, DUT17ZD222, DUT18ZD301). We also acknowledge the support of National Institutes of Health (NIH) grants P50-HG004233-CEGS, MapGen grant (U01HL108630) and P01 HL083069, U01 HL065899, P01 HL105339, R01HL111759, 1P01HL132825-01 and RC HL10154301. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Financial Interests

The authors declare that they do not have any competing financial interests.

Author Contributions

Z.W., A.S., X.Z. and J.L. (Joseph Loscalzo) conceived and designed the research; X.Z., L.L., J.L. (Jun Liu), A.H., Z.Y., B.L., Z.G., L.G., C.S., J.L. (Joseph Loscalzo) and Z.W. performed the research tasks: data curation and compiling (L.L., J.L. (Jun Liu), Z.Y., B.L. and Z.G.), data analysis (X.Z. and G.L.), and result validation (L.L., J.L. (Jun Liu), C.S., A.H., J.L. (Joseph Loscalzo) and Z.W.); X.Z., J.L. (Jun Liu), A.H., A.S. and L.L. wrote the manuscript. All authors have reviewed and revised the manuscript.

Contributor Information

Amitabh Sharma, Email: amitabh.sharma@channing.harvard.edu.

Zhong Wang, Email: wangzh@mail.cintcm.ac.cn.

References

Ahmad T., Marshall S., Jewell D. Genotype-based phenotyping heralds a new taxonomy for inflammatory bowel disease. Curr. Opin. Gastroenterol. 2003;19:327–335. doi: 10.1097/00001574-200307000-00002. [DOI] [PubMed] [Google Scholar]
Alizadeh A.A., Eisen M.B., Davis R.E., Ma C., Lossos I.S., Rosenwald A., Boldrick J.C., Sabet H., Tran T., Yu X., Powell J.I., Yang L., Marti G.E., Moore T., Hudson J., Jr., Lu L., Lewis D.B., Tibshirani R., Sherlock G., Chan W.C., Greiner T.C., Weisenburger D.D., Armitage J.O., Warnke R., Levy R., Wilson W., Grever M.R., Byrd J.C., Botstein D., Brown P.O., Staudt L.M. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–511. doi: 10.1038/35000501. [DOI] [PubMed] [Google Scholar]
Arostegui I., Esteban C., Garcia-Gutierrez S., Bare M., Fernandez-de-Larrea N., Briones E., Quintana J.M. Subtypes of patients experiencing exacerbations of COPD and associations with outcomes. PLoS One. 2014;9 doi: 10.1371/journal.pone.0098580. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ashburn T.T., Thor K.B. Drug repositioning: identifying and developing new uses for existing drugs. Nat. Rev. Drug Discov. 2004;3:673–683. doi: 10.1038/nrd1468. [DOI] [PubMed] [Google Scholar]
Bailey P., Chang D.K., Nones K., Johns A.L., Patch A.M., Gingras M.C., Miller D.K., Christ A.N., Bruxner T.J., Quinn M.C., Nourse C., Murtaugh L.C., Harliwong I., Idrisoglu S., Manning S., Nourbakhsh E., Wani S., Fink L., Holmes O., Chin V., Anderson M.J., Kazakoff S., Leonard C., Newell F., Waddell N., Wood S., Xu Q., Wilson P.J., Cloonan N., Kassahn K.S., Taylor D., Quek K., Robertson A., Pantano L., Mincarelli L., Sanchez L.N., Evers L., Wu J., Pinese M., Cowley M.J., Jones M.D., Colvin E.K., Nagrial A.M., Humphrey E.S., Chantrill L.A., Mawson A., Humphris J., Chou A., Pajic M., Scarlett C.J., Pinho A.V., Giry-Laterriere M., Rooman I., Samra J.S., Kench J.G., Lovell J.A., Merrett N.D., Toon C.W., Epari K., Nguyen N.Q., Barbour A., Zeps N., Moran-Jones K., Jamieson N.B., Graham J.S., Duthie F., Oien K., Hair J., Grutzmann R., Maitra A., Iacobuzio-Donahue C.A., Wolfgang C.L., Morgan R.A., Lawlor R.T., Corbo V., Bassi C., Rusev B., Capelli P., Salvia R., Tortora G., Mukhopadhyay D., Petersen G.M., Munzy D.M., Fisher W.E., Karim S.A., Eshleman J.R., Hruban R.H., Pilarsky C., Morton J.P., Sansom O.J., Scarpa A., Musgrove E.A., Bailey U.M., Hofmann O., Sutherland R.L., Wheeler D.A., Gill A.J., Gibbs R.A., Pearson J.V., Biankin A.V., Grimmond S.M. Genomic analyses identify molecular subtypes of pancreatic cancer. Nature. 2016;531:47–52. doi: 10.1038/nature16965. [DOI] [PubMed] [Google Scholar]
Barabasi A.L., Gulbahce N., Loscalzo J. Network medicine: a network-based approach to human disease. Nat. Rev. Genet. 2011;12:56–68. doi: 10.1038/nrg2918. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bianchini G., Balko J.M., Mayer I.A., Sanders M.E., Gianni L. Triple-negative breast cancer: challenges and opportunities of a heterogeneous disease. Nat. Rev. Clin. Oncol. 2016;13:674–690. doi: 10.1038/nrclinonc.2016.66. [DOI] [PMC free article] [PubMed] [Google Scholar]
Blair D.R., Lyttle C.S., Mortensen J.M., Bearden C.F., Jensen A.B., Khiabanian H., Melamed R., Rabadan R., Bernstam E.V., Brunak S., Jensen L.J., Nicolae D., Shah N.H., Grossman R.L., Cox N.J., White K.P., Rzhetsky A. A nondegenerate code of deleterious variants in Mendelian loci contributes to complex disease risk. Cell. 2013;155:70–80. doi: 10.1016/j.cell.2013.08.030. [DOI] [PMC free article] [PubMed] [Google Scholar]
Blondel Vincent D., Guillaume Jean-Loup, Lambiotte Renaud, Lefebvre Etienne. Fast unfolding of communities in large networks. J. Stat. Mech. 2008;2008 [Google Scholar]
Butler M. Not so fast! Congress delays ICD-10-CM/PCS. Examining how the delay happen, its industry impact, and how best to proceed. JAHIMA. 2014;85:24–28. [PubMed] [Google Scholar]
Cabreiro F., Au C., Leung K.Y., Vergara-Irigaray N., Cocheme H.M., Noori T., Weinkove D., Schuster E., Greene N.D., Gems D. Metformin retards aging in C. elegans by altering microbial folate and methionine metabolism. Cell. 2013;153:228–239. doi: 10.1016/j.cell.2013.02.035. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cancer Genome Atlas Research Network. Integrated genomic characterization of oesophageal carcinoma. Nature. 2017;541:169–175. doi: 10.1038/nature20805. [DOI] [PMC free article] [PubMed] [Google Scholar]
Canetta S.E., Bao Y., Co M.D., Ennis F.A., Cruz J., Terajima M., Shen L., Kellendonk C., Schaefer C.A., Brown A.S. Serological documentation of maternal influenza exposure and bipolar disorder in adult offspring. Am. J. Psychiatry. 2014;171:557–563. doi: 10.1176/appi.ajp.2013.13070943. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chong C.R., Sullivan D.J., Jr. New uses for old drugs. Nature. 2007;448:645–646. doi: 10.1038/448645a. [DOI] [PubMed] [Google Scholar]
Chuang H.Y., Lee E., Liu Y.T., Lee D., Ideker T. Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 2007;3:140. doi: 10.1038/msb4100180. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cimino J.J. High-quality, standard, controlled healthcare terminologies come of age. Methods Inf. Med. 2011;50:101–104. [PMC free article] [PubMed] [Google Scholar]
National Research Council Committee on a Framework for Developing a New Taxonomy of Disease, editor. Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. The National Academies Press; Washington, DC: 2011. [PubMed] [Google Scholar]
Dahlem D., Maniloff D., Ratti C. Predictability bounds of electronic health records. Sci. Rep. 2015;5 doi: 10.1038/srep11865. [DOI] [PMC free article] [PubMed] [Google Scholar]
Denny J.C., Ritchie M.D., Basford M.A., Pulley J.M., Bastarache L., Brown-Gentry K., Wang D., Masys D.R., Roden D.M., Crawford D.C. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics. 2010;26:1205–1210. doi: 10.1093/bioinformatics/btq126. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dienstmann R., Vermeulen L., Guinney J., Kopetz S., Tejpar S., Tabernero J. Consensus molecular subtypes and the evolution of precision medicine in colorectal cancer. Nat. Rev. Cancer. 2017;17:79–92. doi: 10.1038/nrc.2016.126. [DOI] [PubMed] [Google Scholar]
Evans J.M., Donnelly L.A., Emslie-Smith A.M., Alessi D.R., Morris A.D. Metformin and reduced risk of cancer in diabetic patients. BMJ. 2005;330:1304–1305. doi: 10.1136/bmj.38415.708634.F7. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fisher Ronald A., Yates Frank. Oliver & Boyd; London: 1948. Statistical Tables for Biological, Agricultural and Medical Research. [Google Scholar]
Forslund K., Hildebrand F., Nielsen T., Falony G., Le Chatelier E., Sunagawa S., Prifti E., Vieira-Silva S., Gudmundsdottir V., Krogh Pedersen H., Arumugam M., Kristiansen K., Voigt A.Y., Vestergaard H., Hercog R., Igor Costea P., Kultima J.R., Li J., Jorgensen T., Levenez F., Dore J., Nielsen H.B., Brunak S., Raes J., Hansen T., Wang J., Ehrlich S.D., Bork P., Pedersen O. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature. 2015;528:262–266. doi: 10.1038/nature15766. [DOI] [PMC free article] [PubMed] [Google Scholar]
Franceschini A., Szklarczyk D., Frankild S., Kuhn M., Simonovic M., Roth A., Lin J., Minguez P., Bork P., von Mering C., Jensen L.J. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013;41:D808–15. doi: 10.1093/nar/gks1094. [DOI] [PMC free article] [PubMed] [Google Scholar]
Freeman L.C. A set of measures of centrality based on betweenness. Sociometry. 1977:35–41. [Google Scholar]
Gessler Damian D.G., Cliff Joslyn, Karin Verspoor. Terence Critchlow and Kerstin Kleese van Dam (eds.), Data intensive Science. CRC Press; 2013. A posteriori ontology engineering for data-driven science. [Google Scholar]
Gligorijevic V., Przulj N. Methods for biological data integration: perspectives and challenges. J. R. Soc. Interface. 2015;12 doi: 10.1098/rsif.2015.0571. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gligorijevic V., Malod-Dognin N., Przulj N. Integrative methods for analyzing big data in precision medicine. Proteomics. 2016;16:741–758. doi: 10.1002/pmic.201500396. [DOI] [PubMed] [Google Scholar]
Goh K.I., Cusick M.E., Valle D., Childs B., Vidal M., Barabasi A.L. The human disease network. Proc. Natl. Acad. Sci. U. S. A. 2007;104:8685–8690. doi: 10.1073/pnas.0701361104. [DOI] [PMC free article] [PubMed] [Google Scholar]
Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., Lander E.S. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. doi: 10.1126/science.286.5439.531. [DOI] [PubMed] [Google Scholar]
Graham D.Y. Helicobacter pylori update: gastric cancer, reliable therapy, and possible benefits. Gastroenterology. 2015;148:719–731. doi: 10.1053/j.gastro.2015.01.040. [DOI] [PMC free article] [PubMed] [Google Scholar]
Grainge C., Thomas P.S., Mak J.C., Benton M.J., Lim T.K., Ko F.W. Year in review 2015: asthma and chronic obstructive pulmonary disease. Respirology. 2016;21:765–775. doi: 10.1111/resp.12771. [DOI] [PubMed] [Google Scholar]
Hidalgo C.A., Blumm N., Barabasi A.L., Christakis N.A. A dynamic network approach for the study of human phenotypes. PLoS Comput. Biol. 2009;5 doi: 10.1371/journal.pcbi.1000353. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hoadley K.A., Yau C., Wolf D.M., Cherniack A.D., Tamborero D., Ng S., Leiserson M.D., Niu B., McLellan M.D., Uzunangelov V., Zhang J., Kandoth C., Akbani R., Shen H., Omberg L., Chu A., Margolin A.A., Van't Veer L.J., Lopez-Bigas N., Laird P.W., Raphael B.J., Ding L., Robertson A.G., Byers L.A., Mills G.B., Weinstein J.N., Van Waes C., Chen Z., Collisson E.A., Benz C.C., Perou C.M., Stuart J.M. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014;158:929–944. doi: 10.1016/j.cell.2014.06.049. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hofmann-Apitius M., Alarcon-Riquelme M.E., Chamberlain C., McHale D. Towards the taxonomy of human disease. Nat. Rev. Drug Discov. 2015;14:75–76. doi: 10.1038/nrd4537. [DOI] [PubMed] [Google Scholar]
Hofree M., Shen J.P., Carter H., Gross A., Ideker T. Network-based stratification of tumor mutations. Nat. Methods. 2013;10:1108–1115. doi: 10.1038/nmeth.2651. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hu J.X., Thomas C.E., Brunak S. Network biology concepts in complex disease comorbidities. Nat. Rev. Genet. 2016;17:615–629. doi: 10.1038/nrg.2016.87. [DOI] [PubMed] [Google Scholar]
Jameson J.L., Longo D.L. Precision medicine--personalized, problematic, and promising. N. Engl. J. Med. 2015;372:2229–2234. doi: 10.1056/NEJMsb1503104. [DOI] [PubMed] [Google Scholar]
Jensen P.B., Jensen L.J., Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat. Rev. Genet. 2012;13:395–405. doi: 10.1038/nrg3208. [DOI] [PubMed] [Google Scholar]
Jensen A.B., Moseley P.L., Oprea T.I., Ellesoe S.G., Eriksson R., Schmock H., Jensen P.B., Jensen L.J., Brunak S. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nat. Commun. 2014;5:4022. doi: 10.1038/ncomms5022. [DOI] [PMC free article] [PubMed] [Google Scholar]
Jeste S.S., Geschwind D.H. Disentangling the heterogeneity of autism spectrum disorder through genetic findings. Nat. Rev. Neurol. 2014;10:74–81. doi: 10.1038/nrneurol.2013.278. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kibbe W.A., Arze C., Felix V., Mitraka E., Bolton E., Fu G., Mungall C.J., Binder J.X., Malone J., Vasant D., Parkinson H., Schriml L.M. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res. 2015;43:D1071–8. doi: 10.1093/nar/gku1011. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kohler S., Doelken S.C., Mungall C.J., Bauer S., Firth H.V., Bailleul-Forestier I., Black G.C., Brown D.L., Brudno M., Campbell J., FitzPatrick D.R., Eppig J.T., Jackson A.P., Freson K., Girdea M., Helbig I., Hurst J.A., Jahn J., Jackson L.G., Kelly A.M., Ledbetter D.H., Mansour S., Martin C.L., Moss C., Mumford A., Ouwehand W.H., Park S.M., Riggs E.R., Scott R.H., Sisodiya S., Van Vooren S., Wapner R.J., Wilkie A.O., Wright C.F., Vulto-van Silfhout A.T., de Leeuw N., de Vries B.B., Washington N.L., Smith C.L., Westerfield M., Schofield P., Ruef B.J., Gkoutos G.V., Haendel M., Smedley D., Lewis S.E., Robinson P.N. The human phenotype ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014;42:D966–74. doi: 10.1093/nar/gkt1026. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lee D.S., Park J., Kay K.A., Christakis N.A., Oltvai Z.N., Barabasi A.L. The implications of human metabolic network topology for disease comorbidity. Proc. Natl. Acad. Sci. U. S. A. 2008;105:9880–9885. doi: 10.1073/pnas.0802208105. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li Y.Y., Jones S.J. Drug repositioning for personalized medicine. Genome Med. 2012;4:27. doi: 10.1186/gm326. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lin Dekang. The Fifteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc; San Francisco, CA, USA: 1998. An information-theoretic definition of similarity; pp. 296–304. [Google Scholar]
Liu C.C., Tseng Y.T., Li W., Wu C.Y., Mayzus I., Rzhetsky A., Sun F., Waterman M., Chen J.J., Chaudhary P.M., Loscalzo J., Crandall E., Zhou X.J. DiseaseConnect: a comprehensive web server for mechanism-based disease-disease connections. Nucleic Acids Res. 2014;42:W137–46. doi: 10.1093/nar/gku412. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mann D.M., McDonagh A.M., Snowden J., Neary D., Pickering-Brown S.M. Molecular classification of the dementias. Lancet. 2000;355:626. doi: 10.1016/S0140-6736(99)05207-1. [DOI] [PubMed] [Google Scholar]
Mannino D.M. COPD: epidemiology, prevalence, morbidity and mortality, and disease heterogeneity. Chest. 2002;121:121S–126S. doi: 10.1378/chest.121.5_suppl.121s. [DOI] [PubMed] [Google Scholar]
McClellan J., King M.C. Genetic heterogeneity in human disease. Cell. 2010;141:210–217. doi: 10.1016/j.cell.2010.03.032. [DOI] [PubMed] [Google Scholar]
Menche J., Sharma A., Kitsak M., Ghiassian S.D., Vidal M., Loscalzo J., Barabasi A.L. Disease networks. Uncovering disease-disease relationships through the incomplete interactome. Science. 2015;347:1257601. doi: 10.1126/science.1257601. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mirnezami R., Nicholson J., Darzi A. Preparing for precision medicine. N. Engl. J. Med. 2012;366:489–491. doi: 10.1056/NEJMp1114866. [DOI] [PubMed] [Google Scholar]
Mistry M., Pavlidis P. Gene ontology term overlap as a measure of gene functional similarity. BMC Bioinforma. 2008;9:327. doi: 10.1186/1471-2105-9-327. [DOI] [PMC free article] [PubMed] [Google Scholar]
Murtagh Fionn, Contreras Pedro. Algorithms for hierarchical clustering: an overview. Wiley Interdiscip. Rev. Data Min. Knowledge Dis. 2012;2:86–97. [Google Scholar]
Nakatsu G., Li X., Zhou H., Sheng J., Wong S.H., Wu W.K., Ng S.C., Tsoi H., Dong Y., Zhang N., He Y., Kang Q., Cao L., Wang K., Zhang J., Liang Q., Yu J., Sung J.J. Gut mucosal microbiome across stages of colorectal carcinogenesis. Nat. Commun. 2015;6:8727. doi: 10.1038/ncomms9727. [DOI] [PMC free article] [PubMed] [Google Scholar]
Newman M.E. Modularity and community structure in networks. Proc. Natl. Acad. Sci. U. S. A. 2006;103:8577–8582. doi: 10.1073/pnas.0601602103. [DOI] [PMC free article] [PubMed] [Google Scholar]
Okusaga O., Yolken R.H., Langenberg P., Lapidus M., Arling T.A., Dickerson F.B., Scrandis D.A., Severance E., Cabassa J.A., Balis T., Postolache T.T. Association of seropositivity for influenza and coronaviruses with history of mood disorders and suicide attempts. J. Affect. Disord. 2011;130:220–225. doi: 10.1016/j.jad.2010.09.029. [DOI] [PMC free article] [PubMed] [Google Scholar]
Park A. Drugs zero in. Breast cancer, flu and obesity are in the crosshairs as drug companies produce more-targeted treatments. Time. 2012;179:42. [PubMed] [Google Scholar]
Pesquita C., Faria D., Falcao A.O., Lord P., Couto F.M. Semantic similarity in biomedical ontologies. PLoS Comput. Biol. 2009;5 doi: 10.1371/journal.pcbi.1000443. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pletscher-Frankild S., Palleja A., Tsafou K., Binder J.X., Jensen L.J. DISEASES: text mining and data integration of disease-gene associations. Methods. 2015;74:83–89. doi: 10.1016/j.ymeth.2014.11.020. [DOI] [PubMed] [Google Scholar]
Rappaport N., Twik M., Plaschkes I., Nudel R., Iny Stein T., Levitt J., Gershoni M., Morrey C.P., Safran M., Lancet D. MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res. 2017;45:D877–D887. doi: 10.1093/nar/gkw1012. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rzhetsky A., Wajngurt D., Park N., Zheng T. Probing genetic overlap among complex human phenotypes. Proc. Natl. Acad. Sci. U. S. A. 2007;104:11694–11699. doi: 10.1073/pnas.0704820104. [DOI] [PMC free article] [PubMed] [Google Scholar]
Serrano M.A., Boguna M., Vespignani A. Extracting the multiscale backbone of complex weighted networks. Proc. Natl. Acad. Sci. U. S. A. 2009;106:6483–6488. doi: 10.1073/pnas.0808904106. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sharma A., Menche J., Huang C.C., Ort T., Zhou X., Kitsak M., Sahni N., Thibault D., Voung L., Guo F., Ghiassian S.D., Gulbahce N., Baribaud F., Tocker J., Dobrin R., Barnathan E., Liu H., Panettieri R.A., Jr., Tantisira K.G., Qiu W., Raby B.A., Silverman E.K., Vidal M., Weiss S.T., Barabasi A.L. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum. Mol. Genet. 2015;24:3005–3020. doi: 10.1093/hmg/ddv001. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sitas F. Twenty five years since the first prospective study by Forman et al. (1991) on Helicobacter pylori and stomach cancer risk. Cancer Epidemiol. 2016;41:159–164. doi: 10.1016/j.canep.2016.02.002. [DOI] [PubMed] [Google Scholar]
Solovieff N., Cotsapas C., Lee P.H., Purcell S.M., Smoller J.W. Pleiotropy in complex traits: challenges and strategies. Nat. Rev. Genet. 2013;14:483–495. doi: 10.1038/nrg3461. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tyner C., Barber G.P., Casper J., Clawson H., Diekhans M., Eisenhart C., Fischer C.M., Gibson D., Gonzalez J.N., Guruvadoo L., Haeussler M., Heitner S., Hinrichs A.S., Karolchik D., Lee B.T., Lee C.M., Nejad P., Raney B.J., Rosenbloom K.R., Speir M.L., Villarreal C., Vivian J., Zweig A.S., Haussler D., Kuhn R.M., Kent W.J. The UCSC Genome Browser database: 2017 update. Nucleic Acids Res. 2017;45:D626–D634. doi: 10.1093/nar/gkw1134. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang K., Gaitsch H., Poon H., Cox N.J., Rzhetsky A. Classification of common human diseases derived from shared genetic and environmental determinants. Nat. Genet. 2017;49:1319–1325. doi: 10.1038/ng.3931. [DOI] [PMC free article] [PubMed] [Google Scholar]
Welter D., MacArthur J., Morales J., Burdett T., Hall P., Junkins H., Klemm A., Flicek P., Manolio T., Hindorff L., Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42:D1001–6. doi: 10.1093/nar/gkt1229. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wu L., Zhou B., Oshiro-Rapley N., Li M., Paulo J.A., Webster C.M., Mou F., Kacergis M.C., Talkowski M.E., Carr C.E., Gygi S.P., Zheng B., Soukas A.A. An Ancient, Unified Mechanism for Metformin Growth Inhibition in C. elegans and Cancer. Cell. 2016;167:1705–1718.e13. doi: 10.1016/j.cell.2016.11.055. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yang J., Leskovec J. Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM; 2013. Overlapping community detection at scale: a nonnegative matrix factorization approach; pp. 587–596. [Google Scholar]
Zanzoni A., Soler-Lopez M., Aloy P. A network medicine approach to human disease. FEBS Lett. 2009;583:1759–1765. doi: 10.1016/j.febslet.2009.03.001. [DOI] [PubMed] [Google Scholar]
Zhou X., Menche J., Barabasi A.L., Sharma A. Human symptoms-disease network. Nat. Commun. 2014;5:4212. doi: 10.1038/ncomms5212. [DOI] [PubMed] [Google Scholar]


[bb0005] Ahmad T., Marshall S., Jewell D. Genotype-based phenotyping heralds a new taxonomy for inflammatory bowel disease. Curr. Opin. Gastroenterol. 2003;19:327–335. doi: 10.1097/00001574-200307000-00002. [DOI] [PubMed] [Google Scholar]

[bb0010] Alizadeh A.A., Eisen M.B., Davis R.E., Ma C., Lossos I.S., Rosenwald A., Boldrick J.C., Sabet H., Tran T., Yu X., Powell J.I., Yang L., Marti G.E., Moore T., Hudson J., Jr., Lu L., Lewis D.B., Tibshirani R., Sherlock G., Chan W.C., Greiner T.C., Weisenburger D.D., Armitage J.O., Warnke R., Levy R., Wilson W., Grever M.R., Byrd J.C., Botstein D., Brown P.O., Staudt L.M. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–511. doi: 10.1038/35000501. [DOI] [PubMed] [Google Scholar]

[bb0015] Arostegui I., Esteban C., Garcia-Gutierrez S., Bare M., Fernandez-de-Larrea N., Briones E., Quintana J.M. Subtypes of patients experiencing exacerbations of COPD and associations with outcomes. PLoS One. 2014;9 doi: 10.1371/journal.pone.0098580. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bb0020] Ashburn T.T., Thor K.B. Drug repositioning: identifying and developing new uses for existing drugs. Nat. Rev. Drug Discov. 2004;3:673–683. doi: 10.1038/nrd1468.

[bb0025] Bailey P., Chang D.K., Nones K., Johns A.L., Patch A.M., Gingras M.C., Miller D.K., Christ A.N., Bruxner T.J., Quinn M.C., Nourse C., Murtaugh L.C., Harliwong I., Idrisoglu S., Manning S., Nourbakhsh E., Wani S., Fink L., Holmes O., Chin V., Anderson M.J., Kazakoff S., Leonard C., Newell F., Waddell N., Wood S., Xu Q., Wilson P.J., Cloonan N., Kassahn K.S., Taylor D., Quek K., Robertson A., Pantano L., Mincarelli L., Sanchez L.N., Evers L., Wu J., Pinese M., Cowley M.J., Jones M.D., Colvin E.K., Nagrial A.M., Humphrey E.S., Chantrill L.A., Mawson A., Humphris J., Chou A., Pajic M., Scarlett C.J., Pinho A.V., Giry-Laterriere M., Rooman I., Samra J.S., Kench J.G., Lovell J.A., Merrett N.D., Toon C.W., Epari K., Nguyen N.Q., Barbour A., Zeps N., Moran-Jones K., Jamieson N.B., Graham J.S., Duthie F., Oien K., Hair J., Grutzmann R., Maitra A., Iacobuzio-Donahue C.A., Wolfgang C.L., Morgan R.A., Lawlor R.T., Corbo V., Bassi C., Rusev B., Capelli P., Salvia R., Tortora G., Mukhopadhyay D., Petersen G.M., Munzy D.M., Fisher W.E., Karim S.A., Eshleman J.R., Hruban R.H., Pilarsky C., Morton J.P., Sansom O.J., Scarpa A., Musgrove E.A., Bailey U.M., Hofmann O., Sutherland R.L., Wheeler D.A., Gill A.J., Gibbs R.A., Pearson J.V., Biankin A.V., Grimmond S.M. Genomic analyses identify molecular subtypes of pancreatic cancer. Nature. 2016;531:47–52. doi: 10.1038/nature16965.

[bb0030] Barabasi A.L., Gulbahce N., Loscalzo J. Network medicine: a network-based approach to human disease. Nat. Rev. Genet. 2011;12:56–68. doi: 10.1038/nrg2918.

[bb0035] Bianchini G., Balko J.M., Mayer I.A., Sanders M.E., Gianni L. Triple-negative breast cancer: challenges and opportunities of a heterogeneous disease. Nat. Rev. Clin. Oncol. 2016;13:674–690. doi: 10.1038/nrclinonc.2016.66.

[bb0040] Blair D.R., Lyttle C.S., Mortensen J.M., Bearden C.F., Jensen A.B., Khiabanian H., Melamed R., Rabadan R., Bernstam E.V., Brunak S., Jensen L.J., Nicolae D., Shah N.H., Grossman R.L., Cox N.J., White K.P., Rzhetsky A. A nondegenerate code of deleterious variants in Mendelian loci contributes to complex disease risk. Cell. 2013;155:70–80. doi: 10.1016/j.cell.2013.08.030.

[bb0045] Blondel V.D., Guillaume J.L., Lambiotte R., Lefebvre E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008;2008.

[bb0050] Butler M. Not so fast! Congress delays ICD-10-CM/PCS. Examining how the delay happened, its industry impact, and how best to proceed. J. AHIMA. 2014;85:24–28.

[bb0055] Cabreiro F., Au C., Leung K.Y., Vergara-Irigaray N., Cocheme H.M., Noori T., Weinkove D., Schuster E., Greene N.D., Gems D. Metformin retards aging in C. elegans by altering microbial folate and methionine metabolism. Cell. 2013;153:228–239. doi: 10.1016/j.cell.2013.02.035.

[bb0065] Canetta S.E., Bao Y., Co M.D., Ennis F.A., Cruz J., Terajima M., Shen L., Kellendonk C., Schaefer C.A., Brown A.S. Serological documentation of maternal influenza exposure and bipolar disorder in adult offspring. Am. J. Psychiatry. 2014;171:557–563. doi: 10.1176/appi.ajp.2013.13070943.

[bb0070] Chong C.R., Sullivan D.J., Jr. New uses for old drugs. Nature. 2007;448:645–646. doi: 10.1038/448645a.

[bb0075] Chuang H.Y., Lee E., Liu Y.T., Lee D., Ideker T. Network-based classification of breast cancer metastasis. Mol. Syst. Biol. 2007;3:140. doi: 10.1038/msb4100180.

[bb0080] Cimino J.J. High-quality, standard, controlled healthcare terminologies come of age. Methods Inf. Med. 2011;50:101–104.

[bb0085] National Research Council, Committee on A Framework for Developing a New Taxonomy of Disease. Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease. The National Academies Press; Washington, DC: 2011.

[bb0090] Dahlem D., Maniloff D., Ratti C. Predictability bounds of electronic health records. Sci. Rep. 2015;5. doi: 10.1038/srep11865.

[bb0095] Denny J.C., Ritchie M.D., Basford M.A., Pulley J.M., Bastarache L., Brown-Gentry K., Wang D., Masys D.R., Roden D.M., Crawford D.C. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations. Bioinformatics. 2010;26:1205–1210. doi: 10.1093/bioinformatics/btq126.

[bb0100] Dienstmann R., Vermeulen L., Guinney J., Kopetz S., Tejpar S., Tabernero J. Consensus molecular subtypes and the evolution of precision medicine in colorectal cancer. Nat. Rev. Cancer. 2017;17:79–92. doi: 10.1038/nrc.2016.126.

[bb0105] Evans J.M., Donnelly L.A., Emslie-Smith A.M., Alessi D.R., Morris A.D. Metformin and reduced risk of cancer in diabetic patients. BMJ. 2005;330:1304–1305. doi: 10.1136/bmj.38415.708634.F7.

[bb0110] Fisher R.A., Yates F. Statistical Tables for Biological, Agricultural and Medical Research. Oliver & Boyd; London: 1948.

[bb0115] Forslund K., Hildebrand F., Nielsen T., Falony G., Le Chatelier E., Sunagawa S., Prifti E., Vieira-Silva S., Gudmundsdottir V., Krogh Pedersen H., Arumugam M., Kristiansen K., Voigt A.Y., Vestergaard H., Hercog R., Igor Costea P., Kultima J.R., Li J., Jorgensen T., Levenez F., Dore J., Nielsen H.B., Brunak S., Raes J., Hansen T., Wang J., Ehrlich S.D., Bork P., Pedersen O. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature. 2015;528:262–266. doi: 10.1038/nature15766.

[bb0120] Franceschini A., Szklarczyk D., Frankild S., Kuhn M., Simonovic M., Roth A., Lin J., Minguez P., Bork P., von Mering C., Jensen L.J. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013;41:D808–15. doi: 10.1093/nar/gks1094.

[bb0125] Freeman L.C. A set of measures of centrality based on betweenness. Sociometry. 1977:35–41.

[bb0130] Gessler D.D.G., Joslyn C., Verspoor K. A posteriori ontology engineering for data-driven science. In: Critchlow T., Kleese van Dam K., editors. Data Intensive Science. CRC Press; 2013.

[bb0135] Gligorijevic V., Przulj N. Methods for biological data integration: perspectives and challenges. J. R. Soc. Interface. 2015;12. doi: 10.1098/rsif.2015.0571.

[bb0140] Gligorijevic V., Malod-Dognin N., Przulj N. Integrative methods for analyzing big data in precision medicine. Proteomics. 2016;16:741–758. doi: 10.1002/pmic.201500396.

[bb0145] Goh K.I., Cusick M.E., Valle D., Childs B., Vidal M., Barabasi A.L. The human disease network. Proc. Natl. Acad. Sci. U. S. A. 2007;104:8685–8690. doi: 10.1073/pnas.0701361104.

[bb0150] Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., Lander E.S. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. doi: 10.1126/science.286.5439.531.

[bb0155] Graham D.Y. Helicobacter pylori update: gastric cancer, reliable therapy, and possible benefits. Gastroenterology. 2015;148:719–731. doi: 10.1053/j.gastro.2015.01.040.

[bb0160] Grainge C., Thomas P.S., Mak J.C., Benton M.J., Lim T.K., Ko F.W. Year in review 2015: asthma and chronic obstructive pulmonary disease. Respirology. 2016;21:765–775. doi: 10.1111/resp.12771.

[bb0165] Hidalgo C.A., Blumm N., Barabasi A.L., Christakis N.A. A dynamic network approach for the study of human phenotypes. PLoS Comput. Biol. 2009;5. doi: 10.1371/journal.pcbi.1000353.

[bb0170] Hoadley K.A., Yau C., Wolf D.M., Cherniack A.D., Tamborero D., Ng S., Leiserson M.D., Niu B., McLellan M.D., Uzunangelov V., Zhang J., Kandoth C., Akbani R., Shen H., Omberg L., Chu A., Margolin A.A., Van't Veer L.J., Lopez-Bigas N., Laird P.W., Raphael B.J., Ding L., Robertson A.G., Byers L.A., Mills G.B., Weinstein J.N., Van Waes C., Chen Z., Collisson E.A., Benz C.C., Perou C.M., Stuart J.M. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014;158:929–944. doi: 10.1016/j.cell.2014.06.049.

[bb0175] Hofmann-Apitius M., Alarcon-Riquelme M.E., Chamberlain C., McHale D. Towards the taxonomy of human disease. Nat. Rev. Drug Discov. 2015;14:75–76. doi: 10.1038/nrd4537.

[bb0180] Hofree M., Shen J.P., Carter H., Gross A., Ideker T. Network-based stratification of tumor mutations. Nat. Methods. 2013;10:1108–1115. doi: 10.1038/nmeth.2651.

[bb0185] Hu J.X., Thomas C.E., Brunak S. Network biology concepts in complex disease comorbidities. Nat. Rev. Genet. 2016;17:615–629. doi: 10.1038/nrg.2016.87.

[bb0190] Jameson J.L., Longo D.L. Precision medicine--personalized, problematic, and promising. N. Engl. J. Med. 2015;372:2229–2234. doi: 10.1056/NEJMsb1503104.

[bb0195] Jensen P.B., Jensen L.J., Brunak S. Mining electronic health records: towards better research applications and clinical care. Nat. Rev. Genet. 2012;13:395–405. doi: 10.1038/nrg3208.

[bb0200] Jensen A.B., Moseley P.L., Oprea T.I., Ellesoe S.G., Eriksson R., Schmock H., Jensen P.B., Jensen L.J., Brunak S. Temporal disease trajectories condensed from population-wide registry data covering 6.2 million patients. Nat. Commun. 2014;5:4022. doi: 10.1038/ncomms5022.

[bb0205] Jeste S.S., Geschwind D.H. Disentangling the heterogeneity of autism spectrum disorder through genetic findings. Nat. Rev. Neurol. 2014;10:74–81. doi: 10.1038/nrneurol.2013.278.

[bb0210] Kibbe W.A., Arze C., Felix V., Mitraka E., Bolton E., Fu G., Mungall C.J., Binder J.X., Malone J., Vasant D., Parkinson H., Schriml L.M. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res. 2015;43:D1071–8. doi: 10.1093/nar/gku1011.

[bb0215] Kohler S., Doelken S.C., Mungall C.J., Bauer S., Firth H.V., Bailleul-Forestier I., Black G.C., Brown D.L., Brudno M., Campbell J., FitzPatrick D.R., Eppig J.T., Jackson A.P., Freson K., Girdea M., Helbig I., Hurst J.A., Jahn J., Jackson L.G., Kelly A.M., Ledbetter D.H., Mansour S., Martin C.L., Moss C., Mumford A., Ouwehand W.H., Park S.M., Riggs E.R., Scott R.H., Sisodiya S., Van Vooren S., Wapner R.J., Wilkie A.O., Wright C.F., Vulto-van Silfhout A.T., de Leeuw N., de Vries B.B., Washington N.L., Smith C.L., Westerfield M., Schofield P., Ruef B.J., Gkoutos G.V., Haendel M., Smedley D., Lewis S.E., Robinson P.N. The human phenotype ontology project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014;42:D966–74. doi: 10.1093/nar/gkt1026.

[bb0220] Lee D.S., Park J., Kay K.A., Christakis N.A., Oltvai Z.N., Barabasi A.L. The implications of human metabolic network topology for disease comorbidity. Proc. Natl. Acad. Sci. U. S. A. 2008;105:9880–9885. doi: 10.1073/pnas.0802208105.

[bb0225] Li Y.Y., Jones S.J. Drug repositioning for personalized medicine. Genome Med. 2012;4:27. doi: 10.1186/gm326.

[bb0230] Lin D. An information-theoretic definition of similarity. In: The Fifteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc; San Francisco, CA, USA: 1998. pp. 296–304.

[bb0235] Liu C.C., Tseng Y.T., Li W., Wu C.Y., Mayzus I., Rzhetsky A., Sun F., Waterman M., Chen J.J., Chaudhary P.M., Loscalzo J., Crandall E., Zhou X.J. DiseaseConnect: a comprehensive web server for mechanism-based disease-disease connections. Nucleic Acids Res. 2014;42:W137–46. doi: 10.1093/nar/gku412.

[bb0240] Mann D.M., McDonagh A.M., Snowden J., Neary D., Pickering-Brown S.M. Molecular classification of the dementias. Lancet. 2000;355:626. doi: 10.1016/S0140-6736(99)05207-1.

[bb0245] Mannino D.M. COPD: epidemiology, prevalence, morbidity and mortality, and disease heterogeneity. Chest. 2002;121:121S–126S. doi: 10.1378/chest.121.5_suppl.121s.

[bb0250] McClellan J., King M.C. Genetic heterogeneity in human disease. Cell. 2010;141:210–217. doi: 10.1016/j.cell.2010.03.032.

[bb0255] Menche J., Sharma A., Kitsak M., Ghiassian S.D., Vidal M., Loscalzo J., Barabasi A.L. Disease networks. Uncovering disease-disease relationships through the incomplete interactome. Science. 2015;347:1257601. doi: 10.1126/science.1257601.

[bb0260] Mirnezami R., Nicholson J., Darzi A. Preparing for precision medicine. N. Engl. J. Med. 2012;366:489–491. doi: 10.1056/NEJMp1114866.

[bb0265] Mistry M., Pavlidis P. Gene ontology term overlap as a measure of gene functional similarity. BMC Bioinforma. 2008;9:327. doi: 10.1186/1471-2105-9-327.

[bb0270] Murtagh F., Contreras P. Algorithms for hierarchical clustering: an overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2012;2:86–97.

[bb0275] Nakatsu G., Li X., Zhou H., Sheng J., Wong S.H., Wu W.K., Ng S.C., Tsoi H., Dong Y., Zhang N., He Y., Kang Q., Cao L., Wang K., Zhang J., Liang Q., Yu J., Sung J.J. Gut mucosal microbiome across stages of colorectal carcinogenesis. Nat. Commun. 2015;6:8727. doi: 10.1038/ncomms9727.

[bb0280] Newman M.E. Modularity and community structure in networks. Proc. Natl. Acad. Sci. U. S. A. 2006;103:8577–8582. doi: 10.1073/pnas.0601602103.

[bb0285] Okusaga O., Yolken R.H., Langenberg P., Lapidus M., Arling T.A., Dickerson F.B., Scrandis D.A., Severance E., Cabassa J.A., Balis T., Postolache T.T. Association of seropositivity for influenza and coronaviruses with history of mood disorders and suicide attempts. J. Affect. Disord. 2011;130:220–225. doi: 10.1016/j.jad.2010.09.029.

[bb0290] Park A. Drugs zero in. Breast cancer, flu and obesity are in the crosshairs as drug companies produce more-targeted treatments. Time. 2012;179:42.

[bb0295] Pesquita C., Faria D., Falcao A.O., Lord P., Couto F.M. Semantic similarity in biomedical ontologies. PLoS Comput. Biol. 2009;5. doi: 10.1371/journal.pcbi.1000443.

[bb0300] Pletscher-Frankild S., Palleja A., Tsafou K., Binder J.X., Jensen L.J. DISEASES: text mining and data integration of disease-gene associations. Methods. 2015;74:83–89. doi: 10.1016/j.ymeth.2014.11.020.

[bb0305] Rappaport N., Twik M., Plaschkes I., Nudel R., Iny Stein T., Levitt J., Gershoni M., Morrey C.P., Safran M., Lancet D. MalaCards: an amalgamated human disease compendium with diverse clinical and genetic annotation and structured search. Nucleic Acids Res. 2017;45:D877–D887. doi: 10.1093/nar/gkw1012.

[bb0310] Rzhetsky A., Wajngurt D., Park N., Zheng T. Probing genetic overlap among complex human phenotypes. Proc. Natl. Acad. Sci. U. S. A. 2007;104:11694–11699. doi: 10.1073/pnas.0704820104.

[bb0315] Serrano M.A., Boguna M., Vespignani A. Extracting the multiscale backbone of complex weighted networks. Proc. Natl. Acad. Sci. U. S. A. 2009;106:6483–6488. doi: 10.1073/pnas.0808904106.

[bb0320] Sharma A., Menche J., Huang C.C., Ort T., Zhou X., Kitsak M., Sahni N., Thibault D., Voung L., Guo F., Ghiassian S.D., Gulbahce N., Baribaud F., Tocker J., Dobrin R., Barnathan E., Liu H., Panettieri R.A., Jr., Tantisira K.G., Qiu W., Raby B.A., Silverman E.K., Vidal M., Weiss S.T., Barabasi A.L. A disease module in the interactome explains disease heterogeneity, drug response and captures novel pathways and genes in asthma. Hum. Mol. Genet. 2015;24:3005–3020. doi: 10.1093/hmg/ddv001.

[bb0325] Sitas F. Twenty five years since the first prospective study by Forman et al. (1991) on Helicobacter pylori and stomach cancer risk. Cancer Epidemiol. 2016;41:159–164. doi: 10.1016/j.canep.2016.02.002.

[bb0330] Solovieff N., Cotsapas C., Lee P.H., Purcell S.M., Smoller J.W. Pleiotropy in complex traits: challenges and strategies. Nat. Rev. Genet. 2013;14:483–495. doi: 10.1038/nrg3461.

[bb0335] Tyner C., Barber G.P., Casper J., Clawson H., Diekhans M., Eisenhart C., Fischer C.M., Gibson D., Gonzalez J.N., Guruvadoo L., Haeussler M., Heitner S., Hinrichs A.S., Karolchik D., Lee B.T., Lee C.M., Nejad P., Raney B.J., Rosenbloom K.R., Speir M.L., Villarreal C., Vivian J., Zweig A.S., Haussler D., Kuhn R.M., Kent W.J. The UCSC Genome Browser database: 2017 update. Nucleic Acids Res. 2017;45:D626–D634. doi: 10.1093/nar/gkw1134.

[bb0340] Wang K., Gaitsch H., Poon H., Cox N.J., Rzhetsky A. Classification of common human diseases derived from shared genetic and environmental determinants. Nat. Genet. 2017;49:1319–1325. doi: 10.1038/ng.3931.

[bb0345] Welter D., MacArthur J., Morales J., Burdett T., Hall P., Junkins H., Klemm A., Flicek P., Manolio T., Hindorff L., Parkinson H. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42:D1001–6. doi: 10.1093/nar/gkt1229.

[bb0350] Wu L., Zhou B., Oshiro-Rapley N., Li M., Paulo J.A., Webster C.M., Mou F., Kacergis M.C., Talkowski M.E., Carr C.E., Gygi S.P., Zheng B., Soukas A.A. An Ancient, Unified Mechanism for Metformin Growth Inhibition in C. elegans and Cancer. Cell. 2016;167:1705–1718.e13. doi: 10.1016/j.cell.2016.11.055.

[bb0355] Yang J., Leskovec J. Overlapping community detection at scale: a nonnegative matrix factorization approach. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM; 2013. pp. 587–596.

[bb0360] Zanzoni A., Soler-Lopez M., Aloy P. A network medicine approach to human disease. FEBS Lett. 2009;583:1759–1765. doi: 10.1016/j.febslet.2009.03.001.

[bb0365] Zhou X., Menche J., Barabasi A.L., Sharma A. Human symptoms-disease network. Nat. Commun. 2014;5:4212. doi: 10.1038/ncomms5212.
