Abstract
Building a heterogeneous information graph (HIG) from Pearson correlation and semantic relations has notably improved the accuracy of classification models that predict health risk status. This study integrates medical domain knowledge with that approach to build a classification model for diagnosis. Knowledge was mined from the titles and abstracts of MEDLINE to assess the links between objects relating to medical concepts, and a knowledge-based HIG model was then developed to predict a patient’s health status. The experimental results showed that the knowledge-based model outperformed the baseline model, demonstrating that the knowledge-base can improve the performance of the classification model. The study contributes a framework for applying a knowledge-base in classification models to improve their predictive performance. It also contributes a model for medical practice that can help practitioners become more confident in their final diagnostic decisions. Moreover, the study affirms that biomedical literature can assist in building a classification model, which will benefit future researchers who mine the knowledge-base to develop other kinds of classification models.
Keywords: Knowledge graph, Electronic health data, Classification, Healthcare
Introduction
A growing body of research applies the heterogeneous information graph (HIG) to the development of classification models. Sun et al. [33] demonstrated that mining the HIG could provide an effective way to improve the quality of mining data. Building on this advantage, Ji et al. [16] introduced a ranking-based classification algorithm for the HIG. Their experiments showed that the approach generated classes more accurately and provided a more meaningful ranking of objects within each class. Luo et al. [21] instead proposed the concept of the meta path, which represents the different relation paths of the HIG, to improve classification. Their experiments indicated that the proposal outperformed other classification models not only in accuracy but also in obtaining meta-path weights that are consistent with real-world situations.
In the medical domain, using a HIG built from clinical data to predict health risk status has become a prominent topic. Perotte et al. [25] constructed a HIG from electronic health record data to predict the progression of chronic kidney disease. The results showed that the proposed model was more accurate than other models. Interest has since grown quickly, with further research [19, 38] using the HIG to predict health risks.
In addition, evidence-based medicine is a modern medical practice that aims to ensure the clinician’s opinion relies on all available knowledge from the scientific literature. Cases et al. [7] argue that medical knowledge plays a vital role in decision support for medical practitioners as well as in healthcare knowledge delivery. Wang et al. [36] showed that integrating a medical knowledge-base can substantially improve information retrieval performance by incorporating medical domain knowledge into relevance assessment. Goh et al. [12] likewise argued that the knowledge-base is particularly useful in clinical decision support systems. However, extracting evidence-based knowledge and transforming it into care processes can be very challenging because of differing information contents and structures [4]. Moreover, applying a knowledge-base to the HIG to improve the classification model remains a challenge.
In this study, a HIG was built using Pearson correlation and semantic relations. The edge weights in the graph were normalised by instances informed by the medical knowledge-base to improve the accuracy of the classification model. To generate a knowledge graph, the study used the biomedical literature of MEDLINE as an important source to map to the patient data of NHANES. Because there is no direct link between MEDLINE and NHANES, the study used the Medical Subject Headings and the International Classification of Diseases as a bridge between them. After obtaining the mapping, all titles and abstracts related to diabetes mellitus were extracted. The extracted data were used to weight the terms related to diabetes with a vector space model. These terms were then linked to nodes of the HIG as instances to normalise the edges of the graph. A classification model was proposed that applies this new HIG to diabetes prediction. Finally, the result was compared with the baseline model for health risk prediction. The performance of our model demonstrated that applying the knowledge-base brought a significant improvement in predicting diabetes.
The contributions of our work are as follows:
A framework for building a knowledge-base through the HIG was produced.
A knowledge graph was developed that helps to discover new medical knowledge or achieve a deep understanding of a subject in the medical domain.
A classification model applying the knowledge-base was studied to help improve the quality of health risk status prediction.
The structures of MeSH and MEDLINE were rebuilt, which is useful for researchers and developers building knowledge-based applications for treatment planning or decision support.
The remainder of this paper is organised as follows. In Sect. 2, we review the existing work on mapping between different sources of medical terminologies, Medical Literature with Text Mining and combining data and knowledge. The proposed method is presented in Sect. 3 related to learning medical knowledge and how to apply a knowledge-base in the HIG classification model. Following this, in Sect. 4, we deal with the design of the experiment, including the dataset, baseline model, and performance measurement. In Sect. 5, we provide the discussion related to the proposed methods as well as the results. Finally, Sect. 6 presents the conclusions.
Related work
Mapping between different sources of medical terminologies
Medical terminologies are crucial not only in healthcare but also in medical research. A deep understanding of terminologies, including the distinction between terms and concepts, could help practitioners and researchers reduce human errors. Moreover, combining terms and concepts from different sources could lead to richer semantic networks over these data. By mapping the Medical Subject Headings (MeSH) to the International Classification of Diseases (ICD-10), Pereira et al. [24] obtained 68% information recall with the use of prescription coding. These two sources are extracted from the Unified Medical Language System (UMLS) Metathesaurus. Similarly, Soualmia et al. [31] demonstrated that combining multiple terminologies such as MeSH, MEDLINE, ICD-10, or UMLS improved the performance of information retrieval.
Approximately 34% of Metathesaurus strings [32] were identified in the titles and abstracts of the biomedical literature in MEDLINE. Taking advantage of this discovery, Schriml et al. [28] built an ontology of human disease that organises the concepts and terms related to disease systems. This ontology was generated by extending and integrating the cross-mapping resources of MeSH, ICD, Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), etc. The information presented in an ontology can be useful for discovering the semantic network of diseases. Saitwal et al. [27] also demonstrated cross-terminology mapping of these sources to create a terminological medication system, showing that it could support the growing body of research and increase the amount of clinical and public health data. In addition, Zhang et al. [43] combined MeSH terms and UMLS concepts to improve the retrieval of relevant biomedical documents from MEDLINE, improving retrieval performance by 43.3%.
Knowledge discovery in medical literature
Discovering precise knowledge from the scientific literature is expected to strengthen the accuracy of prediction systems for the healthcare industry. Many researchers have implemented different learning approaches and a variety of system applications to improve clinical systems. For example, a model with two levels of self-supervision has been established for extraction using the knowledge-bases MEDLINE and the Unified Medical Language System (UMLS) [2]. Jiang et al. [17] instead suggested a three-layer knowledge-base model that raises the precision of system prediction and provides more opportunities to recognise the relationship between conceptual diagnoses and real-time symptoms among diseases. By using a higher level of knowledge discovery, this model could support the successful development of applications for clinical areas.
Scientific literature has long played an important role in improving the quality of system development, which has led many researchers to use a knowledge-base in their studies. For example, Wang et al. [37] used a knowledge resource to develop a system that automatically generates disease-pertinent concepts. Huang et al. [15] combined multiple well-populated knowledge sources to build a knowledge graph that helps practitioners explore realistic clinical queries.
Moreover, Xu et al. [39] built a model that gives the system a deeper understanding of disease etiology by automatically discovering other patterns that specify semantically similar relationships among diseases. In total, 34,448 disease pairs were discovered from 21,354,075 MEDLINE records, while Liu et al. [20] discovered 3,159 diseases using MeSH annotation of MEDLINE articles to obtain a comprehensive list of connections between disease and environmental factors. To extract drug or disease symptoms from the knowledge-base of MEDLINE, Zeng et al. [42] and Chen et al. [8] combined text mining and statistical techniques for the automatic acquisition of knowledge from medical domains to identify drug-disease associations. Xu et al. [40] extracted 34,305 unique drug-disease pairs from MEDLINE abstracts by developing a pattern-based biomedical relationship extraction method, which was compared to the 56,602 cancer drug–SE pairs extracted by Wang et al. [41].
Combining data and knowledge
Shah et al. [29] argued that combining clinical concepts with clinical notes could give researchers evidence for treatment as well as help practitioners make decisions. Many researchers have demonstrated that biomedical literature plays an essential role in healthcare. Some have taken advantage of biomedical literature to improve the quality of healthcare and to discover a broader meaning of concepts related to human disease. Hanauer et al. [13] and Kavuluru et al. [18] differentiated between clinical datasets and biomedical literature to discover more fully the relationships and meanings among concepts. These results provided useful data for further research in the medical domain. Similarly, Anupindi et al. [1] used MeSH and ICD, based on disease co-occurrence associations, to connect the biomedical literature of MEDLINE with the patient data of PDN, which was created from Medicare claims for hospitalisations [14]. These diseases were generated by differentiating between MEDLINE and patient data, which benefits the building of innovative models for clinical diagnosis. Escudié [11] combined EHR and biomedical literature to identify a specific disease. Using text from 741 patients, they obtained 79.3% of the mapped concepts for detected celiac disease. Zhao [44] used the linkage of EHR and MEDLINE to build a weighted Bayesian Network Inference model for predicting pancreatic cancer based on 20 selected common risk factors. This approach significantly improved accuracy compared with existing representative methods for predicting pancreatic cancer.
Summary
Clearly, each medical source has a specific role in medical research. A source is more beneficial when it is combined with others to create a cross-mapping that helps to discover more fully the semantic relationships among entities. Mapping between different medical sources has been completed by a large number of researchers [24, 27, 28, 31, 32, 43]. In addition, other researchers [1, 11, 13, 14, 18, 44] have successfully combined data and knowledge-bases to build models for clinical diagnosis. Success in mapping the different sources and in combining data with a knowledge-base has played a significant role in improving the quality of mining data. Moreover, some research [2, 17] has shown that the knowledge-base could help to improve the quality of mining data, and previous research has indicated that applying a HIG in the classification model yields a significant improvement in performance. In this study, we aim to integrate the knowledge-base into the classification model to demonstrate its effect on that model.
Research problem formulation
In this work, we deal with the problem of classification, for which many approaches currently exist. One of them uses a graph. The graph organises information as concepts and relationships: each concept is represented by a node, and each relationship by a link between nodes. Usually, each graph holds a limited amount of information that depends on the dataset, which can be a weak point in mining data. Therefore, this research aims to strengthen the information for each node as well as to discover more fully the semantic meaning within each node. This encourages us to embed in the graph a knowledge-base developed from the verification and evaluation of experts, which helps to improve the reliability of the graph for discovering knowledge. We then apply this knowledge-base to solve the classification problem.
To develop a classification model based on the knowledge graph, we use the following definitions.
Definition 1
(Health Examination Records) A 3-tuple is called the Health Examination Record, where is a set of patients, , . is a set of attributes, , . A matrix is constructed by .
Definition 2
(Heterogeneous Information Graph) A heterogeneous information graph is a 3-tuple, , where , are t types of data objects and , , E and M are the set of links between any two data object V and the set of weight values by links, respectively.
Our goal is to predict the labels of all nodes that have not been labelled in this graph. However, there is limited information for each node in this graph, leading to a lack of knowledge in mining data. Therefore, we divide the graph into many subgraphs so that each subgraph covers specific information, which is defined as a class. In this case, each class represents the set of “Disease” type objects in the graph.
Definition 3
(Disease Subgraph) Given a heterogeneous information graph , and given a disease name , is the set of “Disease” type objects, a disease subgraph S is a graph if .
Then, we enrich information for this subgraph S based on embedding instances. Instances help to improve and expand the semantic relationship between nodes which are learned from a specific knowledge-based domain. Finally, we define our research problem in the following way:
Definition 4
(Research Problem) Given a subgraph , a subset of data object , where each data object in is labeled by a value . Then, the research problem is to predict the label for the unlabeled object of .
Framework
Conceptual model
In this project, the research aim is to embed the knowledge-base in the clinical data to deal with the classification problem. Building on our past work [26], the clinical data of NHANES is represented as a heterogeneous information graph (HIG). The HIG was constructed using Pearson correlation and semantic relations, which proved superior to the traditional approach to classification in our previous work. The most important task of this study was to generate instances learned from MEDLINE. These instances were embedded into the HIG to improve the quality of the classification model. To populate knowledge, instances must belong to the same subject as the subgraph of the HIG. The research used MEDLINE as a vital source to build a knowledge graph that links to the HIG through instances. Fundamentally, this requires linking MEDLINE to NHANES. To complete this task, we used MeSH and ICD-10 as a bridge between the observation data and the knowledge-base: MeSH is a hierarchy of concepts that helps to access MEDLINE effectively, and ICD-10 is an international statistical classification of diseases that links to NHANES. After mapping NHANES and MEDLINE, instances learned from the knowledge-base were integrated into a HIG that we call a knowledge-based HIG (KB-HIG). We then applied a classification model to the KB-HIG to demonstrate the effect of the knowledge-base on predicting health risk status. Figure 1 presents our research design.
Fig. 1.

The Framework
Knowledge-base
Medical Subject Headings
Medical Subject Headings (MeSH) is a controlled vocabulary thesaurus that contains a set of terms naming descriptors, also known as subject headings, arranged in a hierarchical structure. The 2018 version of MeSH includes a total of 28,939 descriptors, 234,842 unique terms and 244,778 supplementary concept records [35]. It was created by the National Library of Medicine in 1960 and covers all aspects of medicine and health care. MeSH is used for indexing journal articles for MEDLINE. Each article in the MEDLINE database is assigned about six to fifteen subject headings from MeSH. MeSH is updated annually to reflect current terminology usage. An advantage of MeSH is that it helps to find the most specific terms available in each article. Therefore, using MeSH to search articles in MEDLINE helps to retrieve information efficiently at various levels of specificity.
Definition 5
(Medical Subject Headings) The Medical Subject Headings are , where n is the number of concepts and is a concept belong .
MEDLINE
MEDLINE is a bibliographic database that has collected journal articles in life sciences and biomedical information since 1966. It is produced by the National Library of Medicine in the United States. The indexed academic journals cover medicine, veterinary medicine, nursing, dentistry, pharmacy, and health care. The database contains more than 27 million references selected from more than 5,200 international publications in about 40 languages [10]. References are added to the database each week. MEDLINE uses MeSH, which gives uniformity and consistency to the indexing of the biomedical literature for information retrieval, and is freely accessible on the Internet through the PubMed interface. Search engines for MEDLINE cover author names, words in the title and abstract of the article, date of publication, and MeSH terms. All journals are selected based on the recommendations of the Literature Selection Technical Review Committee from advisory committees of both external and internal experts. MEDLINE is therefore an essential resource for biomedical researchers around the world.
Definition 6
(MEDLINE) MEDLINE is the set of document , where m is the number of documents in . , where is a set of terms from d, and .
Medical knowledge graph
In this study, MeSH is treated as a set of subgraphs that are populated, through instances, with knowledge obtained from MEDLINE. Figure 2 shows an example of an association between MeSH and MEDLINE. Each subgraph corresponds to a subject or a type of disease. Figure 3 presents an example of three subgraphs. MeSH covers many kinds of diseases, and MEDLINE offers a wide variety of information about them. Moreover, the diseases listed in MeSH correspond to diseases that actually occur in the existing dataset. This correspondence helps to create a map between the clinical data and MEDLINE for populating the knowledge-base. These subgraphs are an innovative approach to improving the accuracy of a classification model. As discussed above, to map the observation data to MEDLINE, the study maps between MeSH and ICD-10. The UMLS Metathesaurus is used as the standard for identifying all relationships among these data. UMLS, created by the U.S. National Library of Medicine (NLM) in 1986, brings together terms from different sources in biomedicine and health care. If a descriptor code in MeSH and a disease code in ICD-10 share the same concept in UMLS, a link is established to connect MeSH and ICD-10.
Fig. 2.

Association between MeSH and MEDLINE
Fig. 3.

An example of subgraphs: three kinds of disease correspond to three subgraphs. Each subgraph has a different number of nodes and types of links. Different subgraphs may share the same node, and each node may belong to a different object
Definition 7
(ICD-10) The ICD-10 is the set of disease , where s is the number of diseases and is a disease belong .
Given a set of diseases and a set of concepts . For each . For each . Given and , a link is generated based on the following:
| 1 |
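The linking rule described above can be sketched in Python as follows. This is a minimal illustration, assuming that each MeSH descriptor and each ICD-10 code has already been mapped to a set of UMLS concept identifiers. The identifiers, the two lookup tables and the `link` and `build_links` functions are hypothetical placeholders, not the data or code used in the study.

```python
# Hypothetical lookup tables: code -> set of UMLS concept identifiers.
# Every identifier below is a made-up placeholder, not a real
# MeSH, ICD-10 or UMLS code.
MESH_TO_UMLS = {
    "MESH_DIABETES": {"CUI_001"},
    "MESH_HYPERTENSION": {"CUI_002"},
}
ICD10_TO_UMLS = {
    "ICD_DIABETES": {"CUI_001"},
    "ICD_ASTHMA": {"CUI_003"},
}


def link(mesh_code, icd_code):
    """Return 1 if the two codes share at least one UMLS concept, else 0."""
    mesh_concepts = MESH_TO_UMLS.get(mesh_code, set())
    icd_concepts = ICD10_TO_UMLS.get(icd_code, set())
    return 1 if mesh_concepts & icd_concepts else 0


def build_links():
    """Collect every (MeSH, ICD-10) pair that the rule connects."""
    return sorted(
        (mesh_code, icd_code)
        for mesh_code in MESH_TO_UMLS
        for icd_code in ICD10_TO_UMLS
        if link(mesh_code, icd_code) == 1
    )


links = build_links()
print(links)
```

Codes that are missing from either table simply produce no link, which keeps the mapping conservative when the UMLS coverage is incomplete.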

Based on the mapping, the triangle of NHANES, MeSH and MEDLINE creates a knowledge graph in the medical domain. Each specific disease code corresponds to a subgraph that is populated with knowledge from a large number of MEDLINE articles. All titles and abstracts were selected from published research that has been assessed by experts. Mining the extracted information to populate knowledge for this subgraph can bring potential value. From the subgraph, a specific domain of knowledge can be obtained. The obtained knowledge was used as medical evidence and embedded into the HIG to evaluate the performance of the classification model.
Definition 8
(Knowledge Graph Base) The Knowledge Graph Base is a 3-tuple
where 
Based on the definition of knowledge-base, this study needs to perform an important task: learning instances from based on and map(d). These instances, which are associated with concepts , play an important role in building a knowledge graph .
Assume that both and have k different subjects. Each k subject has n concepts. MeSH is presented as . Where MEDLINE is indicated as and m was the number of documents. To learn instances from any subject k for populating knowledge of subgraph corresponding subject k,
, where
is learned by mapping between and
| 2 |
These instances are integrated with edges in the HIG to improve the accuracy of the classification model.
Applying knowledge-base in HIG classification model
In our previous work [26], we built a heterogeneous information graph to deal with the problem of health risk prediction. In that study, we aimed to achieve the optimum knowledge discovery by applying the most effective factors in building the HIG. A HIG was a 3-tuple, with an object mapping function and a link type mapping function . A heterogeneous graph, which consisted of different types of nodes, was built based on health examination records. Pearson correlation was used to identify the link between two nodes in the graph. By using l attributes to calculate the coefficient values of the Pearson correlation, the strength of the relationship between two nodes in the heterogeneous graph was effectively demonstrated. Denoting a Pearson correlation coefficient value by represents the relationship between two vertices. This connection is valid if , where is a threshold defining the validity of object connection. The coefficient ranges from −1 to 1; if it approaches 1, two vertices and are strongly positively correlated. If it approaches −1, and are strongly negatively correlated, and a value near 0 indicates a weak relationship. For the connections retained by the threshold, the larger is, the stronger the relationship between and is. Moreover, in that study, semantic relations were applied to improve the accuracy of information retrieval. To perform risk prediction for each patient based on the heterogeneous graph, each type of information node was considered a semantic class. We, therefore, built a model to predict the risk of each disease belonging to the different classes by comparing the effect among these classes.
As discussed above, a HIG was constructed based on Pearson correlation and semantic similarity. The weights of the edges in the HIG were then normalised using instances (Eq. (2)). Edge weight is a critical characteristic of a network and was considered in a complex system by Supriya et al. [34] for the detection of the epilepsy syndrome. After populating the HIG with knowledge through instances, a new knowledge graph was generated, called the knowledge-based heterogeneous information graph (KB-HIG). A classification model was then developed based on the KB-HIG for predicting health risk status. The formula of normalisation is:
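The thresholded Pearson linking described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each node is represented by a vector of l numeric attribute values, an edge is kept when the coefficient is at or above the threshold, and the toy vectors, the node names and the threshold of 0.5 are hypothetical rather than taken from the study.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (dev_x * dev_y)


def build_edges(nodes, threshold):
    """Keep an edge between two nodes when their coefficient >= threshold."""
    edges = {}
    for (name_a, vec_a), (name_b, vec_b) in combinations(nodes.items(), 2):
        r = pearson(vec_a, vec_b)
        if r >= threshold:
            edges[(name_a, name_b)] = r
    return edges


# Hypothetical nodes, each described by l = 4 attribute values.
nodes = {
    "p1": [1.0, 2.0, 3.0, 4.0],
    "p2": [2.0, 4.0, 6.0, 8.0],
    "p3": [4.0, 3.0, 2.0, 1.0],
}
edges = build_edges(nodes, threshold=0.5)
print(edges)
```

With these toy vectors only the pair (p1, p2) is connected; p3 moves in the opposite direction to both, so its coefficients are negative and fall below the threshold.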
| 3 |
Assume that Ë was the set of links between any two data objects of a knowledge graph. Ë , , where k is the type of subjects and q is the number of the learned instances, Ë . Eq.(3) was presented:
| 4 |
All instances were identified through a matrix between NHANES and MEDLINE. First, this study treated all of the attributes (variables) from NHANES as independent terms regarding a type of subject. Secondly, by extracting all titles and abstracts from MEDLINE corresponding to a kind of disease of the same subject type as in NHANES, a list of terms regarding that disease was generated through the information network as a subgraph. A matrix was created by combining all terms from NHANES and MEDLINE. Figure 4 illustrates an example of the connection between NHANES and MEDLINE for diabetes. If a variable of NHANES contains any term from the diabetes information network in MEDLINE, the variable is marked with that term. All mappings between terms of NHANES and terms of MEDLINE are called learned instances. In this study, we applied the word2vec algorithm suggested by [23] to identify the value of each instance. Using the word2vec algorithm, Zheng et al. [45] succeeded in extracting and calculating the weights among terms. This technique, also called embedding, can measure semantic similarity among terms as well as identify similar neighbours for a given term. Obtaining the weights among terms has a significant effect on the task of mining knowledge because they help to improve the semantic similarity among terms, which might in turn improve the accuracy of the classification model. The weight between terms ranges from 0 to 1. For example, if a term does not exist or is not related to the given term (diabetes), the value is nearly 0. In contrast, the value is approximately 1 if the term is closely related to diabetes.
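The matching of NHANES variables to MEDLINE terms can be sketched in Python as follows. This is a minimal illustration, not the study's implementation: the variable names and descriptions, the three-dimensional vectors standing in for trained word2vec embeddings, and the use of clipped cosine similarity as the instance weight are all assumptions made for the example.

```python
import math

# Hypothetical embeddings; in practice these would come from a word2vec
# model trained on the extracted MEDLINE titles and abstracts.
EMBEDDINGS = {
    "diabetes": [0.9, 0.1, 0.0],
    "glucose": [0.8, 0.2, 0.1],
    "cholesterol": [0.5, 0.5, 0.1],
    "alcohol": [0.1, 0.9, 0.2],
}

# Hypothetical NHANES-style variables with short text descriptions.
VARIABLES = {
    "fasting_glucose": "fasting plasma glucose level",
    "total_cholesterol": "total cholesterol in blood",
    "alcohol_use": "had at least 12 alcohol drinks in one year",
    "body_weight": "measured body weight",
}


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def term_weight(term, target="diabetes"):
    """Weight in [0, 1]: similarity to the target, 0 for unknown terms."""
    if term not in EMBEDDINGS or target not in EMBEDDINGS:
        return 0.0
    similarity = cosine(EMBEDDINGS[term], EMBEDDINGS[target])
    return min(1.0, max(0.0, similarity))


def learn_instances(variables, terms, target="diabetes"):
    """Mark each variable with the terms its description contains."""
    instances = {}
    for name, description in variables.items():
        words = set(description.lower().split())
        for term in terms:
            if term in words:
                instances[(name, term)] = term_weight(term, target)
    return instances


instances = learn_instances(VARIABLES, ["glucose", "cholesterol", "alcohol"])
for (variable, term), weight in instances.items():
    print(variable, term, round(weight, 3))
```

In this toy setting a variable whose description contains no term from the disease network, such as the body weight variable, yields no instance, while terms closer to the target receive weights nearer to 1.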
Fig. 4.

Mapping variables from NHANES to terms from MEDLINE
Using the KB-HIG, we built a function that formalises the profile of a healthy (unhealthy) patient regarding a disease object x:
| 5 |
where and . is coefficient adopted to balance the impact of objects that belong to the same semantic classes with x and being coefficients for different semantic classes with x.
Case study
This study used data mining and machine learning techniques to develop a classification model that helps practitioners reduce human errors in assessing health risks. The model provides evidence-based, rather than experience-based, decision-making support. Practitioners who rely only on their experience may find it difficult to avoid human errors, and experience-based decisions may overlook information in some critical cases. In contrast, data mining does not have these limitations and can help to cover the overlooked areas. Through a classification model, it can offer practitioners useful options for categorising patients into broad groups, such as healthy or unhealthy, or for suggesting the kinds of disease that patients may be suffering from.
In practice, practitioners may not use all the attributes of a patient (e.g., age, weight, use of alcohol, smoking) to decide whether the patient has diabetes. Fundamentally, they diagnose diabetes using some chemical tests. This method may skip some elements (e.g., use of alcohol or smoking) that could cause kidney cancer. Biswas and Kabir [3] argued that patient attributes such as being aged 35 or older have a substantial effect on diabetes, while attributes such as living further from town centres have no impact. In contrast, with data mining, a classification model could check whether a patient has diabetes by considering all the attributes. This approach could reduce as far as possible the risk of misdiagnosis, because practitioners may find it difficult to use all of a patient’s attributes in predicting health risk status. For example, practitioners may find it difficult to identify whether a patient is suffering from diabetes if the patient drinks 200 ml of alcohol a day, smokes three cigarettes per day, or has no such activity per week.
In this study, a person with diabetes who had a thyroid problem, high cholesterol levels and high blood pressure was defined as case A, while a person without diabetes who did not experience these symptoms was defined as case B. It is also vital to note that some characteristics were found in both cases. For example, a person with or without diabetes might experience difficulty concentrating or doing errands alone, drink at least twelve alcoholic drinks in one year, or have smoked at least 100 cigarettes in life. The research, therefore, analyses whether those symptoms could be used, through a classification model, to identify whether a person belongs to case A or case B.
The work presented in this paper aims to generate a classification model that supports practitioners in decision-making. The model considers all characteristics of patients (e.g., age, use of alcohol, or smoking), which are called attributes in health examination records, to assess a patient’s health risks. The study expects to discover all semantic relationships between patients’ characteristics and how they affect health risks. Firstly, the study uses all patients’ characteristics to build a HIG based on the coefficients identified among attributes through the Pearson correlation. Secondly, the research improves the semantic relationships among nodes with instances drawn from the biomedical literature database, which helps to increase the reliability of prediction. By calculating the score for a specific disease, a risk status is assessed during classification based on analysis of the patients’ characteristics. After the classification model is applied to determine personal health status, it can diagnose whether or not a patient is suffering from a specific disease. These evidence-based results can help practitioners to check their decisions and their treatment plans.
Evaluation
Experimental design
To increase the consistency of the approach, this research used two data sets to evaluate the proposed model for predicting diabetes mellitus. Diabetes mellitus is a severe health problem which, according to a statistical report of the World Health Organization, caused 79% of deaths among people under the age of 60 [30]. Many researchers have used this disease over the past decade to evaluate the performance of predictive models. Luo’s experiment [22] studied this disease in generating rules for explaining the results of predictive models. Diabetes mellitus was also used to evaluate the experiments of Boytcheva et al. [6], who extracted entities from a big collection of outpatient records for frequent pattern mining. Figure 5 presents the dataflow of the experimental design. One of the five data subsets extracted from the knowledge-base was used to train our model, and the remaining four subsets were used to evaluate it. With evaluation based on ground truth, this project uses the standard metrics of accuracy, recall, precision, and F1 measure to evaluate the performance of the model [5]. The result was compared to the baseline model. Moreover, to increase reliability and evidence, this study integrated the knowledge-base with our previous work [26] and also with Chen’s work [9] to compare the effect of the knowledge-base on the two original classification models, which were used to answer these two questions:
What is the impact of the knowledge-base on the performance of the classification model?
Does the knowledge-base contribute to other models?
Fig. 5.
Experimental dataflow
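The four standard metrics named in the experimental design can be computed from the confusion matrix of binary predictions. The following is a minimal sketch with made-up labels, not the evaluation code used in the study; label 1 is assumed to mark a positive case.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def evaluate(y_true, y_pred):
    """Return accuracy, precision, recall and F1 as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical ground truth and predictions for eight patients.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
scores = evaluate(y_true, y_pred)
```

In this toy case there are three true positives, one false positive, one false negative and three true negatives, so all four metrics equal 0.75.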
One of the most important tasks of this experimental design is mapping the observation data to the medical knowledge. The study converted both MEDLINE and MeSH from their complex XML format into standard structured data before using this data to train the model. The new structure provides efficient access to the concepts and relationships in MeSH as well as to each document in MEDLINE. To this end, the study wrote an XML parser in the Java programming language to create a new database structure in MySQL tables. This step makes it more convenient to extract information from MEDLINE by using MeSH for information retrieval.
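The idea of flattening descriptor XML into a relational table can be sketched as follows. The study used Java and MySQL; this example instead uses Python with an in-memory SQLite database so that it is self-contained. The XML fragment is a simplified, hand-written sample whose labels are taken from Table 1, and the table name `mesh_descriptor` and its columns are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Simplified sample in the spirit of a MeSH descriptor file (not real data).
SAMPLE_XML = """
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D003924</DescriptorUI>
    <DescriptorName><String>Diabetes mellitus</String></DescriptorName>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorUI>D006973</DescriptorUI>
    <DescriptorName><String>Essential hypertension</String></DescriptorName>
  </DescriptorRecord>
</DescriptorRecordSet>
"""


def load_descriptors(xml_text, conn):
    """Parse descriptor records and store (code, name) rows; return row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mesh_descriptor ("
        "ui TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    root = ET.fromstring(xml_text.strip())
    rows = [
        (rec.findtext("DescriptorUI"), rec.findtext("DescriptorName/String"))
        for rec in root.iter("DescriptorRecord")
    ]
    conn.executemany("INSERT OR REPLACE INTO mesh_descriptor VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_descriptors(SAMPLE_XML, conn)
name = conn.execute(
    "SELECT name FROM mesh_descriptor WHERE ui = ?", ("D003924",)
).fetchone()[0]
```

Once the descriptors are in a table, looking up a descriptor code such as D003924 becomes a single indexed query rather than a scan of the XML.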
After mapping MeSH and ICD-10, 153 types of diseases were detected, related to 3257 patients in NHANES. Table 1 shows statistics for the top 10 mappings between ICD-10 and MeSH. The research needs to extract all titles and abstracts of MEDLINE records related to diabetes mellitus to generate instances for normalising the weight of edges in the graph. After mapping ICD-10 and MeSH, the MeSH descriptor code related to diabetes mellitus (D003924) was identified, as shown in Table 1. This descriptor code was used to extract all papers related to diabetes mellitus from the MEDLINE database. The study extracted 99785 papers related to diabetes mellitus. The extracted data was used to generate a list of terms related to diabetes mellitus. The weight of each term was calculated through the word vector space model. In this study, we use the word2vec algorithm [23] to convert the extracted data into a vector space. This algorithm helps to calculate the semantic relationship between all of the terms related to diabetes mellitus. Before using the extracted data in the word vector space model, this study removed all stop words and applied stemming to improve the accuracy of information retrieval and to ensure enough information for calculating the weight of each term.
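The preprocessing described above can be sketched in a few lines. The stop-word list and the suffix-stripping rule below are deliberately tiny stand-ins (a real pipeline would use a full stop-word list and an established stemmer), and the cosine function only shows how two word vectors, such as those produced by a trained word2vec model, are commonly compared. None of these names come from the paper.

```python
import re
from math import sqrt

# Tiny illustrative stop-word list; an assumption for this sketch only.
STOP_WORDS = {"the", "of", "in", "and", "a", "with", "is", "for", "to"}


def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lower-case, tokenise, drop stop words and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


tokens = preprocess("Insulin levels in patients with diabetes")
```

The cleaned token lists from all titles and abstracts would then be passed to the embedding model, and similarities between the resulting vectors would be read off with a function like `cosine`.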
Table 1.
Top 10 disease mappings between ICD-10 and MeSH
| ICD-10 code | Description | Patients | Percent (%) | MeSH code |
|---|---|---|---|---|
| I10 | Essential hypertension | 2421 | 17.75 | D006973 |
| E11 | Diabetes mellitus | 924 | 6.77 | D003924 |
| J45 | Asthma | 544 | 3.99 | D001249 |
| F32.9 | Major depressive disorder | 488 | 3.57 | D003863 |
| K21 | Gastro-esophageal reflux | 445 | 3.26 | D005764 |
| F41.9 | Anxiety disorder | 389 | 2.85 | D001008 |
| E03.9 | Hypothyroidism | 340 | 2.49 | D009230 |
| K30 | Functional dyspepsia | 175 | 1.28 | D004415 |
| T78.40 | Allergy | 159 | 1.16 | D006967 |
| I50.9 | Heart failure | 122 | 0.89 | D006333 |
After running the word vector space model, each term related to diabetes mellitus has a value. Table 2 presents the weights of the top 10 terms. Finally, a matrix is created by linking these terms to variables in NHANES. The cooperation terms generated from this matrix are called instances. Instances are used to help improve the accuracy of the classification model. The value of each instance was identified by an association between a variable in NHANES and the list of terms for diabetes mellitus. For example, if no term in the list for diabetes mellitus is linked to the variable for the attribute of age, the instance value of the age variable is 0. Otherwise, the weight of an instance for a variable is equal to the weight of the term linked to this variable. Table 3 presents the top 10 variables of NHANES that have the highest instance weights.
Table 2.
Top 10 terms for diabetes mellitus after mining
| Term | Weight |
|---|---|
| Type | 0.80688 |
| Mellitus | 0.80014 |
| Non | 0.76048 |
| Dependent | 0.75367 |
| Patient | 0.75254 |
| Study | 0.72770 |
| Niddm | 0.70903 |
| Control | 0.68019 |
| Insulin | 0.65438 |
| Subject | 0.63372 |
Table 3.
Top 10 variables from NHANES with the strongest effect on diabetes mellitus
| Variables | Weight |
|---|---|
| Taking insulin | 0.65438 |
| Age | 0.56882 |
| Glucose refrigerated serum | 0.47249 |
| Blood test | 0.44453 |
| Work activity | 0.35117 |
| Difficulty concentrating | 0.34173 |
| Cotinine Serum | 0.33450 |
| Mean cell volume | 0.30563 |
| Number adults in household | 0.26273 |
| Total protein | 0.24524 |
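The rule for assigning an instance weight to a variable can be sketched as a simple lookup. The matching below is done on whole words of the variable name, and when several terms match, the largest weight is taken; both choices are assumptions for this example, since the paper does not spell out the linking procedure. The variable "Interview language" is hypothetical and only illustrates the zero case.

```python
# Term weights copied from the text: the insulin weight is the one that
# Table 3 also lists for "Taking insulin".
TERM_WEIGHTS = {"type": 0.80688, "control": 0.68019, "insulin": 0.65438}


def instance_weight(variable_name, term_weights):
    """Return the weight of the term linked to a variable, or 0.0 if none.

    Assumption: a term is linked when it appears as a word in the variable
    name; with several matches the maximum weight is used.
    """
    words = set(variable_name.lower().split())
    matched = [w for term, w in term_weights.items() if term in words]
    return max(matched) if matched else 0.0


weights = {
    name: instance_weight(name, TERM_WEIGHTS)
    for name in ("Taking insulin", "Interview language")
}
```

Here "Taking insulin" inherits the weight of the term "insulin", while the unlinked variable receives an instance value of 0.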
Dataset
This study used the observation data of the National Health and Nutrition Examination Survey (NHANES) and the National Ambulatory Medical Care Survey (NAMCS) as training and testing sets to evaluate the proposed model. The NHANES dataset has hundreds of available parameters covering a wide range of health assessments such as lab tests, physical examinations, and personal habits. The NHANES dataset contains 9770 participants with more than 2585 attributes. Because of missing data, and to ensure enough data for the experiment, only 318 attributes are used for training the model and testing the results. The NAMCS dataset has variables regarding the patient’s smoking habits, the physician’s diagnosis, the kinds of diagnostics and prescription status, as well as demographic information on patients, including age, sex, weight, height, race, etc. The NAMCS dataset contains 32281 patients with 440 attributes. However, only 164 attributes are used for the experiment because of missing data.
Also, before using these data to train the model, we performed data preprocessing, namely data cleaning and data normalisation, because the datasets contain different types of data and a large amount of missing data. All variables of the dataset were presented as a binary label: a value of “1” indicates a positive case and a value of “0” a negative case. Based on the model design, all data values were represented between 0 and 1 for validity in the experiment. For example, a nominal data type such as gender was converted from male and female into zero and one, and an ordinal data type such as general health condition was converted from bad, good and excellent into 0, 0.5 and 1. Data with a range of results (e.g., 50 to 150), such as blood tests, were converted by min-max scaling into [0, 1]. This step also provided a standard to identify positive and negative cases, where 0 was set for negative situations (unhealthy) and 1 was set for positive cases (healthy). If any attribute does not fit these criteria, a new format will be needed. Finally, missing data were replaced by the average of all values of the respective attribute.
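The normalisation rules in this paragraph can be illustrated with a short sketch. The mappings follow the examples given in the text; the function names, the sample values and the choice to scale before imputing are assumptions for illustration.

```python
# Nominal and ordinal mappings taken from the examples in the text.
GENDER = {"male": 0.0, "female": 1.0}
GENERAL_HEALTH = {"bad": 0.0, "good": 0.5, "excellent": 1.0}


def min_max(value, low, high):
    """Scale a value from the range [low, high] into [0, 1]."""
    return (value - low) / (high - low)


def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


# Hypothetical blood test results on a 50 to 150 scale, one value missing.
blood_test = [50, 100, None, 150]
scaled = impute_mean(
    [None if v is None else min_max(v, 50, 150) for v in blood_test]
)
health = [GENERAL_HEALTH[label] for label in ("bad", "excellent", "good")]
```

After these steps every attribute lies in [0, 1], so values of different types can be compared and correlated on a common scale.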
Baseline model
In this study, we use both the model from our previous work [26] and Chen’s Model [9] to evaluate the effect of using a knowledge-base in diagnosis. In the previous work, we proposed a heterogeneous graph classification model (HIG model) that uses both semantic similarity and Pearson correlation to build a classification model. In contrast, Chen’s Model (SHG-Health Model) used a semi-supervised learning algorithm that considered only the neighbourhood nodes in the heterogeneous graph to perform the classification. The results demonstrated that the HIG model outperformed the SHG-Health Model. By adopting the knowledge-base in these two models, we hope to achieve further improvement over the original models.
Result and analysis
Experimental results
To apply the knowledge-base in a classification model and improve the performance of prediction, we first mapped the observation data to MEDLINE based on MeSH and ICD-10. The statistics of this mapping are presented in Section 5.1. After the mapping, we performed a language processing task to extract all documents related to diabetes mellitus. The obtained data were integrated into the classification model for predicting diabetes mellitus. The result of the experiment is shown in Figure 6. The precision of the KB-HIG model improved significantly compared to the baseline model on both datasets. Recall and accuracy improved by approximately 20 per cent. Overall, the KB-HIG model performs better than the SHG-Health model, as shown by the F-measure values of nearly 75 per cent on the NHANES dataset and around 55 per cent on the NAMCS dataset.
Fig. 6.
Comparison between KB-HIG Model and SHG-Health model
Comparison and discussions
The results obtained from this study demonstrated that the accuracy of the predicted outcome could be improved by applying the knowledge-base in the classification model. Using the knowledge-base in the classification model helps to generate the semantic relationships among objects. In this study, we converted the extracted data regarding diabetes mellitus into word vectors, which can identify how keywords or terms affect one another based on the vector space model. This technique helps to find terms that are similar to or different from others by calculating the distance between two terms. Objects (i.e., variables) that do not relate to the topic (e.g., liver cancer) were removed based on the threshold calculated through the word2vec algorithm. Removing all of the objects that are not related to the topic helped the model obtain high performance in predicting disease.
To raise reliability and provide more evidence about the advantage of this study, we expanded our experiment by applying the knowledge-base in other classification models. We aim to strengthen our claim that the knowledge-base can have a significant impact on different classification models. There are two questions that we aim to answer.
What is the impact of the knowledge-base on the performance of the classification model? As we can see in Figure 7, applying the knowledge-base in the classification model to predict diabetes mellitus helped the KB-HIG model obtain a significant improvement in both precision and accuracy. Overall, the performance of the KB-HIG model rises by approximately 15% on the NHANES dataset and around 17% on the NAMCS dataset compared to the model that does not apply the knowledge-base (HIG model). This result demonstrated that using the knowledge-base could help the classification model obtain high performance.
In previous work, we suggested a new method of building the HIG and then developed a classification model using this heterogeneous information graph. The result showed that our model was better than the baseline model. This advantage came from applying the Pearson correlation and semantic relations to construct the HIG. This graph helped our previous model to obtain an in-depth understanding of the semantic classes, which removed as much information as possible about objects that are not related to the subject. Therefore, our previous model achieved a large improvement in the prediction of health risk status. In this study, we added one more level of semantics to the classification model based on the knowledge-base. This method helped our model further increase prediction performance because more objects that were not related to the subject were removed. However, this brings the positive predicted samples and the negative predicted samples closer together, which is why the recall of the KB-HIG model decreased compared to the HIG model.
Does the knowledge-base contribute to other models? In this study, we also integrated the knowledge-base into Chen’s Model to assess its influence on that model. Chen’s Model used the neighbourhood node information of the HIG to predict health status. The result in Figure 8 shows that the model performs better after applying the knowledge-base than the original model does. The model with the knowledge-base achieved a significant improvement, with more than 80% on the NHANES dataset and nearly 75% on the NAMCS dataset, compared to the model without the knowledge-base. This result opens opportunities for future researchers, who could consider using the knowledge-base in their models to obtain better classification performance.
Fig. 7.
Comparison between KB-HIG Model and HIG model
Fig. 8.
Comparison between KB-SHG-Health Model and SHG-Health model
The original Chen’s Model only considered a HIG constructed from neighbourhood node information. It would skip nodes related to the subject if these nodes were not neighbourhood nodes, even when they were connected to the topic. In addition, the model covers all of the objects to predict the health risk status, even objects that are not related to the subject. In contrast, in this experiment, we used a large number of articles from MEDLINE as an evidence base to integrate into the baseline model. This method helped to remove the neighbourhood nodes that were not related to the subject. As a result, the knowledge-base version of Chen’s Model obtained better information retrieval performance than the original model. Finally, we compared the four models, as presented in Figure 9 and Figure 10. On both datasets, the two models using the knowledge-base, KB-HIG and KB-SHG-Health, were better than the HIG model and the SHG-Health model without a knowledge-base. This confirms that applying the knowledge-base in classification models can contribute to improving classification performance.
Fig. 9.
A comparison of four models by using NHANES dataset
Fig. 10.
A comparison of four models by using NAMCS dataset
Conclusions
Biomedical literature from MEDLINE has become an essential resource for biomedical researchers. Therefore, this study suggested an approach that uses MEDLINE to populate the knowledge-base, which was expected to improve the accuracy of predicting the health risk status for our model and the baseline model. After populating the knowledge for the HIG by using instances learned from the knowledge-base, a classification model was constructed to mine this KB-HIG for predicting the health risk status. The proposal brought a significant improvement, which contributed to increasing the accuracy of the prediction. This study also demonstrated that applying the knowledge-base in the classification model achieved more benefits than models that do not use the knowledge-base. In addition, rebuilding the structure of MeSH and MEDLINE helped facilitate the combination with the observation data to build a KB-HIG. By providing a framework to generate this KB-HIG, this approach may motivate future researchers to apply the knowledge-base in their models.
In future research, we will take advantage of the knowledge-base in predicting more kinds of disease. In particular, the study aims to develop a multi-label classification model for predicting more than one illness for each patient. This approach will be more beneficial in real treatment planning, where the practitioner focuses on identifying the diseases a patient is suffering from.
Acknowledgements
The work was conducted with approval from the Human Research Ethics Committee of the University of Southern Queensland, Australia (Approval ID: H18REA049). The authors acknowledge the use of the National Health and Nutrition Examination Survey (NHANES) and National Ambulatory Medical Care Survey (NAMCS) in the study and especially thank the Centers for Disease Control and Prevention of the Department of Health and Human Services, the United States, for making the data sets publicly available for research purposes. The authors also appreciate the courtesy of the U.S. National Library of Medicine for allowing the use of MEDLINE.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Thuan Pham, Email: Thuan.Pham@usq.edu.au.
Xiaohui Tao, Email: Xiaohui.Tao@usq.edu.au.
Ji Zhang, Email: Ji.Zhang@usq.edu.au.
Jianming Yong, Email: Jianming.Yong@usq.edu.au.
References
- 1. Anupindi TR, Srinivasan P. Disease comorbidity linkages between medline and patient data. In: 2017 IEEE International Conference on Healthcare Informatics (ICHI) IEEE; 2017. pp. 403–408.
- 2. Banuqitah H, Eassa F, Jambi K, Abulkhair M. Two level self-supervised relation extraction from medline using umls. Int J Data Min Knowl Manag Process IJDKP. 2016;6(3):11–23. doi: 10.5121/ijdkp.2016.6302.
- 3. Biswas RK, Kabir E. Influence of distance between residence and health facilities on non-communicable diseases: an assessment over hypertension and diabetes in bangladesh. PLoS ONE. 2017;12(5):e0177027. doi: 10.1371/journal.pone.0177027.
- 4. Böckmann B, Heiden K. Extracting and transforming clinical guidelines into pathway models for different hospital information systems. Health Inf Sci Syst. 2013;1(1):13. doi: 10.1186/2047-2501-1-13.
- 5. Bowes D, Hall T, Gray D. Comparing the performance of fault prediction models which report multiple performance measures: recomputing the confusion matrix. In: Proceedings of the 8th International Conference on Predictive Models in Software Engineering. ACM; 2012. pp. 109–118.
- 6. Boytcheva S, Angelova G, Angelov Z, Tcharaktchiev D. Mining comorbidity patterns using retrospective analysis of big collection of outpatient records. Health Inf Sci Syst. 2017;5(1):3. doi: 10.1007/s13755-017-0024-y.
- 7. Cases M, Furlong LI, Albanell J, Altman RB, Bellazzi R, Boyer S, Brand A, Brookes AJ, Brunak S, Clark TW, et al. Improving data and knowledge management to better integrate health care and research. J Intern Med. 2013;274(4):321–328. doi: 10.1111/joim.12105.
- 8. Chen ES, Hripcsak G, Xu H, Markatou M, Friedman C. Automated acquisition of disease-drug knowledge from biomedical and clinical documents: an initial study. J Am Med Inf Assoc. 2008;15(1):87–98. doi: 10.1197/jamia.M2401.
- 9. Chen L, Li X, Sheng QZ, Peng WC, Bennett J, Hu HY, Huang N. Mining health examination records—a graph-based approach. IEEE Trans Knowl Data Eng. 2016;28(9):2423–2437. doi: 10.1109/TKDE.2016.2561278.
- 10. Costa JP, Stopar L, Fuart F, Grobelnik M, Santanam R, Sun C, Carlin P, Black M, Wallace J. Mining medline for the visualisation of a global perspective on biomedical knowledge. In: KDD 2018 (24th ACM SIGKDD Conference on Knowledge Discovery and Data Mining); 2018.
- 11. Escudié JB, Rance B, Malamut G, Khater S, Burgun A, Cellier C, Jannot AS. A novel data-driven workflow combining literature and electronic health records to estimate comorbidities burden for a specific disease: a case study on autoimmune comorbidities in patients with celiac disease. BMC Med Inf Decis Mak. 2017;17(1):140. doi: 10.1186/s12911-017-0537-y.
- 12. Goh WP, Tao X, Zhang J, Yong J. Decision support systems for adoption in dental clinics: a survey. Knowl Based Syst. 2016;104:195–206. doi: 10.1016/j.knosys.2016.04.022.
- 13. Hanauer DA, Saeed M, Zheng K, Mei Q, Shedden K, Aronson AR, Ramakrishnan N. Applying metamap to medline for identifying novel associations in a large clinical dataset: a feasibility analysis. J Am Med Inf Assoc. 2014;21(5):925–937. doi: 10.1136/amiajnl-2014-002767.
- 14. Hidalgo CA, Blumm N, Barabási AL, Christakis NA. A dynamic network approach for the study of human phenotypes. PLoS Comput Biol. 2009;5(4):e1000353. doi: 10.1371/journal.pcbi.1000353.
- 15. Huang Z, Yang J, van Harmelen F, Hu Q. Constructing knowledge graphs of depression. In: International Conference on Health Information Science. Springer; 2017. pp. 149–161.
- 16. Ji M, Han J, Danilevsky M. Ranking-based classification of heterogeneous information networks. In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2011. pp. 1298–1306.
- 17. Jiang Y, Qiu B, Xu C, Li C. The research of clinical decision support system based on three-layer knowledge base model. J Healthcare Eng. 2017.
- 18. Kavuluru R, Han S, Harris D. Unsupervised extraction of diagnosis codes from emrs using knowledge-based and extractive text summarization techniques. In: Canadian conference on artificial intelligence. Springer; 2013. pp. 77–88.
- 19. Lei X, Zhang Y. Predicting disease-genes based on network information loss and protein complexes in heterogeneous network. Inf Sci. 2019;479:386–400. doi: 10.1016/j.ins.2018.12.008.
- 20. Liu YI, Wise PH, Butte AJ. The “etiome”: identification and clustering of human disease etiological factors. In: BMC bioinformatics. vol. 10, p. S14. BioMed Central; 2009.
- 21. Luo C, Guan R, Wang Z, Lin C. Hetpathmine: A novel transductive classification algorithm on heterogeneous information networks. In: European Conference on Information Retrieval. Springer; 2014. pp. 210–221.
- 22. Luo G. Automatically explaining machine learning prediction results: a demonstration on type 2 diabetes risk prediction. Health Inf Sci Syst. 2016;4(1):2. doi: 10.1186/s13755-016-0015-4.
- 23. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems; 2013. pp. 3111–3119.
- 24. Pereira S, Névéol A, Massari P, Joubert M, Darmoni S. Construction of a semi-automated icd-10 coding help system to optimize medical and economic coding. In: MIEl; 2006. pp. 845–850.
- 25. Perotte A, Ranganath R, Hirsch JS, Blei D, Elhadad N. Risk prediction for chronic kidney disease progression using heterogeneous electronic health record data and time series analysis. J Am Med Inf Assoc. 2015;22(4):872–880. doi: 10.1093/jamia/ocv024.
- 26. Pham T, Tao X, Zhang J, Yong J, Zhang W, Cai Y. Mining heterogeneous information graph for health status classification. In: 2018 5th International Conference on Behavioral, Economic, and Socio-Cultural Computing (BESC). IEEE; 2018. pp. 73–78.
- 27. Saitwal H, Qing D, Jones S, Bernstam EV, Chute CG, Johnson TR. Cross-terminology mapping challenges: a demonstration using medication terminological systems. J Biomed Inform. 2012;45(4):613–625. doi: 10.1016/j.jbi.2012.06.005.
- 28. Schriml LM, Mitraka E, Munro J, Tauber B, Schor M, Nickle L, Felix V, Jeng L, Bearer C, Lichenstein R, et al. Human disease ontology 2018 update: classification, content and workflow expansion. Nucleic Acids Res. 2018;47(D1):D955–D962. doi: 10.1093/nar/gky1032.
- 29. Shah S, Luo X, Kanakasabai S, Tuason R, Klopper G. Neural networks for mining the associations between diseases and symptoms in clinical notes. Health Inf Sci Syst. 2019;7(1):1. doi: 10.1007/s13755-018-0062-0.
- 30. Shakeel PM, Baskar S, Dhulipala VS, Jaber MM. Cloud based framework for diagnosis of diabetes mellitus using k-means clustering. Health Inf Sci Syst. 2018;6(1):16. doi: 10.1007/s13755-018-0054-0.
- 31. Soualmia LF, Sakji S, Letord C, Rollin L, Massari P, Darmoni SJ. Improving information retrieval with multiple health terminologies in a quality-controlled gateway. Health Inf Sci Syst. 2013;1(1):8. doi: 10.1186/2047-2501-1-8.
- 32. Srinivasan S, Rindflesch TC, Hole WT, Aronson AR, Mork JG. Finding umls metathesaurus concepts in medline. In: Proceedings of the AMIA Symposium. p. 727. American Medical Informatics Association; 2002.
- 33. Sun Y, Han J. Mining heterogeneous information networks: a structural analysis approach. Acm Sigkdd Explorations Newsl. 2013;14(2):20–28. doi: 10.1145/2481244.2481248.
- 34. Supriya S, Siuly S, Wang H, Cao J, Zhang Y. Weighted visibility graph with complex network features in the detection of epilepsy. IEEE Access. 2016;4:6554–6566. doi: 10.1109/ACCESS.2016.2612242.
- 35. Tateisi Y. Resources for assigning mesh IDs to Japanese medical terms. Genomics Inform. 2019;17(2):e16. doi: 10.5808/GI.2019.17.2.e16.
- 36. Wang H, Zhang Q, Yuan J. Semantically enhanced medical information retrieval system: a tensor factorization based approach. IEEE Access. 2017;5:7584–7593. doi: 10.1109/ACCESS.2017.2698142.
- 37. Wang L, Del Fiol G, Bray BE, Haug PJ. Generating disease-pertinent treatment vocabularies from medline citations. J Biomed Inform. 2017;65:46–57. doi: 10.1016/j.jbi.2016.11.004.
- 38. Xiong Y, Ruan L, Guo M, Tang C, Kong X, Zhu Y, Wang W. Predicting disease-related associations by heterogeneous network embedding. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE; 2018. pp. 548–555.
- 39. Xu R, Li L, Wang Q. driskkb: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text. BMC Bioinform. 2014;15(1):105. doi: 10.1186/1471-2105-15-105.
- 40. Xu R, Wang Q. Large-scale extraction of accurate drug-disease treatment pairs from biomedical literature for drug repurposing. BMC Bioinform. 2013;14(1):181. doi: 10.1186/1471-2105-14-181.
- 41. Xu R, Wang Q. Toward creation of a cancer drug toxicity knowledge base: automatically extracting cancer drug-side effect relationships from the literature. J Am Med Inf Assoc. 2013;21(1):90–96. doi: 10.1136/amiajnl-2012-001584.
- 42. Zeng Q, Cimino JJ. Automated knowledge extraction from the umls. In: Proceedings of the AMIA Symposium. p. 568. American Medical Informatics Association; 1998.
- 43. Zhang Y, Srimani PK, Wang JZ. Combining mesh thesaurus with umls in pseudo relevance feedback to improve biomedical information retrieval. In: 2016 IEEE International Conference on Knowledge Engineering and Applications (ICKEA). IEEE; 2016. pp. 67–71.
- 44. Zhao D, Weng C. Combining pubmed knowledge and ehr data to develop a weighted bayesian network for pancreatic cancer prediction. J Biomed Inform. 2011;44(5):859–868. doi: 10.1016/j.jbi.2011.05.004.
- 45. Zheng G, Callan J. Learning to reweight terms with distributed representations. In: Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM; 2015. pp. 575–584.
