Abstract
Purpose:
Kidney stone disease (KSD) is a common urological disorder with an increasing incidence worldwide. The extensive knowledge about KSD is dispersed across multiple databases, challenging the visualization and representation of its hierarchy and connections. This paper aims at constructing a disease-specific knowledge graph for KSD to enhance the effective utilization of knowledge by medical professionals and promote clinical research and discovery.
Methods:
Text parsing and semantic analysis were conducted on literature related to KSD from PubMed, with concept annotation based on biomedical ontology being utilized to generate semantic data in RDF format. Moreover, public databases were integrated to construct a large-scale knowledge graph for KSD. Additionally, case studies were carried out to demonstrate the practical utility of the developed knowledge graph.
Results:
We proposed and implemented a Kidney Stone Disease Knowledge Graph (KSDKG), covering more than 90 million triples. This graph comprised semantic data extracted from 29,174 articles, integrating available data from UMLS, SNOMED CT, MeSH, DrugBank and Microbe-Disease Knowledge Graph. Through the application of three cases, we retrieved and discovered information on microbes, drugs and diseases associated with KSD. The results illustrated that the KSDKG can integrate diverse medical knowledge and provide new clinical insights for identifying the underlying mechanisms of KSD.
Conclusion:
The KSDKG uses knowledge graph technology to reveal hidden knowledge associations and to facilitate semantic search and response. As a blueprint for developing disease-specific knowledge graphs, it offers valuable contributions to medical research.
Keywords: Kidney stone disease, Knowledge graph, Biomedical ontology, Knowledge integration, Semantic reasoning, Knowledge discovery
Introduction
Kidney stone disease (KSD) is a major health concern affecting the urinary system, characterized by a steadily increasing incidence rate that significantly impairs patients’ quality of life and health [1]. Globally, approximately 5–15% of the population is afflicted by KSD, with a lifetime prevalence rate of 14% [2] and a heritability estimate ranging from 46 to 57% [3]. KSD is also known for its high recurrence rate, with up to 50% of patients experiencing a recurrence within 5 years, posing a substantial economic burden. In the United States alone, the treatment costs for KSD patients surpass 10 billion dollars annually [4].
Numerous achievements have significantly contributed to our understanding and research of KSD, offering valuable knowledge and insights [5–7]. Despite these advancements, knowledge remains fragmented and incomplete across disease-specific databases, literature, and medical knowledge bases, and a cohesive, easily accessible knowledge system is lacking [8, 9]. Physicians, researchers, and patients often devote significant time and effort to navigating various sources to gather and organize pertinent information, which limits decision-making efficiency and personalized care. Thus, to meet the needs of these groups for KSD knowledge, it is crucial to integrate and share the latest research findings, clinical practices, and other resources in order to provide comprehensive and reliable medical information.
Knowledge graphs are an integral component of Artificial Intelligence (AI), serving as an essential tool for information processing and knowledge organization [10]. They offer a structured approach to representing knowledge via visual graphs, detailing real-world entities and their interrelations. This method is distinguished by its standardization, reusability, rationality, openness, and ease of maintenance. In contrast to traditional information management methods, knowledge graphs possess intelligent reasoning capabilities. These capabilities allow not only for the visualization of knowledge reasoning processes but also for the rapid and efficient identification of logical relationships within knowledge.
Recently, the construction technology of knowledge graphs has developed rapidly, involving several key stages such as knowledge extraction, knowledge fusion and knowledge reasoning [11, 12]. To build a comprehensive knowledge graph, researchers can extract and correlate information from structured, semi-structured, or unstructured data by using techniques such as natural language processing, machine learning, and ontology integration. For example, Wang et al. utilized BERT for rich word embeddings, BiLSTM for feature extraction, and CRF for optimizing annotation sequences to achieve effective extraction of medical information [13]. In addition, the combination of standardized knowledge representation frameworks (e.g. RDF and OWL) and graph databases (e.g. GraphDB and Neo4j) greatly improves the efficiency of knowledge storage and query. For instance, Aldwairi et al. explored the advantages of using RDF and graph databases for complex semantic modeling [14]. These practices provide important methods and tools for our study, enabling complex data processing as well as efficient knowledge management and application.
Several biomedical ontologies and terminology systems are commonly used in the construction of knowledge graphs to ensure semantic consistency and interoperability of data [15]. The Unified Medical Language System (UMLS) is a knowledge representation and retrieval system that contains a large number of medical terms and concepts. The Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) is an internationally recognized standard for medical terminology, covering clinical information about diseases, surgeries, medication, etc. Medical Subject Headings (MeSH) provides a standardized vocabulary system for indexing and categorizing medical literature, which assists researchers in subject search and knowledge discovery through a hierarchical organization. Furthermore, many ontologies pertinent to kidney research have been developed and made available through public platforms such as BioPortal. The Chronic Kidney Disease Ontology (CKDO) is used to help identify chronic kidney disease in primary healthcare and to support data analysis. The Kidney Tissue Atlas Ontology (KTAO) offers detailed insights into kidney structures, cells, and pathways. These ontologies provide standardized nomenclature and classification of kidney function and disease, and facilitate the integration of different data types. However, they have not been specifically optimized for KSD; their primary emphasis is on renal function and specific disorders. Although they provide a crucial basis for relevant research, limitations remain in the integration and application of KSD knowledge.
Many research institutions and scholars have developed disease knowledge graphs encompassing various diseases by amalgamating diverse biomedical data and publications [16]. A patient-centered lung cancer knowledge graph for health factors was established by integrating Electronic Medical Records (EMR) and UMLS terminology, facilitating search and risk factor analysis [17]. Similarly, a diabetic nephropathy knowledge graph combines guidelines and EMR of traditional Chinese medicine [18]. A knowledge graph for rheumatoid arthritis was created using SNOMED CT to normalize the named entity links [19]. Breast cancer knowledge graphs have been constructed either from multiple data sources, such as EMR, guidelines and medical encyclopedias [20], or from a single source such as PubMed articles [21]. The ADHD-KG was proposed to provide comprehensive information on attention deficit hyperactivity disorder, extracted from five kinds of resources, such as MeSH and DrugBank [22]. These studies demonstrate the possibility of using medical ontologies to integrate different data sources into a knowledge graph that supports clinical decisions and research.
However, many existing disease knowledge graphs are still limited in several respects. In terms of methods, a lack of standardization in knowledge representation is a common problem, causing many disease knowledge graphs to suffer from inconsistencies when merging data from different sources. As a result, expanding and integrating the knowledge graphs is challenging. In terms of resources, many knowledge graphs rely on a single data source, restricting the breadth and depth of the knowledge they represent. They often fail to comprehensively cover all important details about a disease, which may lead to underwhelming performance in complex knowledge reasoning and discovery tasks. In terms of topics, many knowledge graphs are customized for specific research issues and clinical needs, limiting their utility in broader medical applications. While various knowledge graphs, such as GenomicKB [23], PrimeKG [24], CKG [25], and CMeKG [26] (as shown in Table S1), have covered KSD-related knowledge, there is still a need for a unified and standardized KSD knowledge graph that incorporates a variety of medical resources. This paper addresses these issues. Our main contributions are as follows:
A large-scale Kidney Stone Disease Knowledge Graph (KSDKG) was developed, comprising over 90 million triples. It amalgamates extensive knowledge from about 30 thousand biomedical articles and various data sources, establishing a comprehensive knowledge framework for KSD.
The KSDKG utilized biomedical ontologies such as SNOMED CT and UMLS for semantic annotation of concepts, enhancing the semantic consistency and data interoperability. This improvement significantly increases the accuracy and relevance of information retrieval.
Three application cases demonstrating knowledge reasoning highlighted the utility of the KSDKG. It supports knowledge integration and knowledge discovery, aiding in exploring the mechanisms of KSD, such as identifying potentially related diseases.
Methods
Construction process of the KSDKG
This paper focuses on creating a knowledge graph specifically for KSD, aimed at systematically organizing and displaying pertinent knowledge while enabling semantic querying and reasoning. Considering the disease characteristics of KSD and the application requirements, we devised the construction strategy for the KSDKG using literature and public datasets after in-depth discussion with clinicians (as shown in Fig. 1). The process begins with the collection of a large body of articles and covers each phase from initial data processing to complex querying and reasoning. To enhance the practicality and accuracy of the KSDKG, we also utilized additional available resources, which provided opportunities to validate and enrich our knowledge graph.
Fig. 1.
Construction process of the KSDKG
Specifically, the process primarily consists of the following parts: (1) Searching and collecting articles about KSD on PubMed; (2) Structurally representing knowledge extracted from articles by the XMedlan tool, and generating semantic data to construct the KSDKG; (3) Integrating existing biomedical ontologies and public databases to enhance knowledge coverage of the KSDKG; (4) Realizing the storage of KSDKG and reasoning through GraphDB, and demonstrating the applications.
Data collection
The PubMed database [27] was chosen as the primary source of literature for the development of the KSDKG. It contains abundant research and medical information and is renowned for its academic authority, extensive coverage, robust search and continuous updates. To summarize the content of the literature systematically and comprehensively, and to provide essential support for upper-level knowledge graph-based applications, we designed search rules to retrieve all articles related to KSD from PubMed (as shown in Fig. 2).
Fig. 2.

Literature collection process for the KSDKG
We conducted advanced searches in PubMed using MeSH terms for KSD, such as “Nephrolithiasis” and “Kidney Calculi”, along with relevant keywords. While examining the MeSH terms, we observed that, apart from “staghorn calculi”, the entry terms did not include specific types of renal stones such as “renal pelvic stone” and “calyceal stones”. We therefore added these terms as keywords in the search. A joint search of MeSH headings and keywords yields more comprehensive literature on KSD. The search rules are as follows.
| Retrieval Form = #1 OR #2 OR #3 | |
|---|---|
| #1: | ((Nephrolithiasis[MeSH Terms]) OR (Kidney Calculi[MeSH Terms])) OR (Staghorn Calculi[MeSH Terms]) |
| #2: | (((((((((((((((((((((kidney stone) OR (Renal Stones)) OR (calculus of kidney)) OR (kidney calculi)) OR (renal calculi)) OR (nephrolithiasis)) OR (renal calculus)) OR (Nephrolith)) OR (nephritic stone)) OR (Calculi, Kidney)) OR (Calculus, Kidney)) OR (Kidney Calculus)) OR (Kidney Stones)) OR (Stone, Kidney)) OR (Stones, Kidney)) OR (Calculi, Renal)) OR (Calculus, Renal)) OR (Staghorn Calculi)) OR (Calculi, Staghorn)) OR (Calculus, Staghorn)) OR (Staghorn Calculus)) OR (staghorn nephrolithiasis) |
| #3: | (((renal pelvic stone) OR (renal pelvic stones)) OR (calyceal stones)) OR (calyceal stone) |
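The three rule groups above can be assembled programmatically. The following is a minimal sketch, assuming hypothetical helper names (`or_group`, `build_query`); the term lists are copied from the table, and the flat OR chain is logically equivalent to the nested parentheses PubMed generated.

```python
# Sketch only: helper names are illustrative, not part of the authors' pipeline.
MESH_TERMS = ["Nephrolithiasis", "Kidney Calculi", "Staghorn Calculi"]  # rule #1
KEYWORDS = [  # rule #2
    "kidney stone", "Renal Stones", "calculus of kidney", "kidney calculi",
    "renal calculi", "nephrolithiasis", "renal calculus", "Nephrolith",
    "nephritic stone", "Calculi, Kidney", "Calculus, Kidney", "Kidney Calculus",
    "Kidney Stones", "Stone, Kidney", "Stones, Kidney", "Calculi, Renal",
    "Calculus, Renal", "Staghorn Calculi", "Calculi, Staghorn",
    "Calculus, Staghorn", "Staghorn Calculus", "staghorn nephrolithiasis",
]
STONE_TYPES = [  # rule #3
    "renal pelvic stone", "renal pelvic stones", "calyceal stones", "calyceal stone",
]


def or_group(terms, field=None):
    """Join terms with OR, optionally tagging each with a PubMed field name."""
    suffix = f"[{field}]" if field else ""
    return " OR ".join(f"({term}{suffix})" for term in terms)


def build_query():
    """Combine the three rule groups as '#1 OR #2 OR #3'."""
    groups = [
        or_group(MESH_TERMS, "MeSH Terms"),
        or_group(KEYWORDS),
        or_group(STONE_TYPES),
    ]
    return " OR ".join(f"({group})" for group in groups)


query = build_query()
print(query[:80])
```

Keeping the terms in lists makes it easy to add a newly identified stone type without re-editing a long nested expression.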
Following the above search strategy, a total of 39,482 articles on KSD had been retrieved as original data by May 2023. Because the abstract is a crucial part of an article that provides its main content and key information, we excluded articles lacking abstracts. Additionally, we used the unique PubMed identifier (PMID) of each article, or other attributes, to remove duplicates, which avoids repeated processing of the same information and improves the efficiency and accuracy of construction. Ultimately, 29,174 articles were used for the construction of the KSDKG.
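The two filtering steps (dropping records without an abstract, then removing duplicates by PMID) might look like the sketch below. The record layout (`pmid` and `abstract` keys) and the sample values are assumptions made for illustration; they are not the authors' data format.

```python
def filter_articles(records):
    """Drop records with no abstract, then keep the first record per PMID."""
    seen_pmids = set()
    kept = []
    for record in records:
        abstract = (record.get("abstract") or "").strip()
        if not abstract:
            continue  # no abstract: excluded from the corpus
        pmid = record["pmid"]
        if pmid in seen_pmids:
            continue  # duplicate PMID: already kept once
        seen_pmids.add(pmid)
        kept.append(record)
    return kept


# Placeholder records, not real PubMed entries.
sample = [
    {"pmid": "1", "abstract": "Text A"},
    {"pmid": "2", "abstract": ""},
    {"pmid": "1", "abstract": "Text A"},
    {"pmid": "3", "abstract": None},
    {"pmid": "4", "abstract": "Text B"},
]
kept = filter_articles(sample)
print([record["pmid"] for record in kept])
```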
Knowledge structuration
RDF for knowledge representation
The transformation of unstructured and semi-structured data into well-defined structured data with explicit semantics is essential for the construction of knowledge graphs. The procedure entails the extraction of hierarchical structures, relationships, and attributes, which organizes knowledge into a coherent and easily accessible format and enables automated processing by computers.
The Resource Description Framework (RDF) is a knowledge representation method based on the Semantic Web [28]. It provides a common and standardized format to share and exchange knowledge across various applications and systems. Generally, RDF describes relationships among resources in the form of triples, <subject, predicate, object>.
We adopted the N-Triples format for the representation of RDF data because it supports efficient storage, transmission, and processing. N-Triples encodes RDF data as a series of triples, with each triple written on its own line and terminated with a period. In our data, both the subject and the predicate are always represented by a Uniform Resource Identifier (URI), ensuring a standardized reference framework. The object is more flexible: it may be either a URI or a literal value, which accommodates a broader range of data types and information. For example, some information extracted from the literature (PMID: 6220270) can be expressed in N-Triples format as:
<https://www.ncbi.nlm.nih.gov/pubmed/6220270a3> <http://www.ztonebv.nl/KG#hasSource> "Abstract" .
<https://www.ncbi.nlm.nih.gov/pubmed/6220270a3> <http://www.ztonebv.nl/KG#hasAnnotation> <https://www.ncbi.nlm.nih.gov/pubmed/6220270a3b3> .
<https://www.ncbi.nlm.nih.gov/pubmed/6220270a3b3> <http://www.ztonebv.nl/KG#hasLabel> "kidney stone" .
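To make the line-per-triple layout concrete, the sketch below reads the three statements from this example back into (subject, predicate, object) tuples. It is a deliberately simplified reader written for illustration: it handles only URI subjects and predicates with URI or plain-literal objects, ignores blank nodes, language tags and datatypes, and does not decode escape sequences.

```python
import re

# One triple per line: <s> <p> <o> .  or  <s> <p> "literal" .
_TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"((?:[^"\\]|\\.)*)")\s*\.\s*$'
)


def parse_line(line):
    """Return (subject, predicate, object, object_is_uri) for one line."""
    match = _TRIPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a supported N-Triples line: {line!r}")
    subject, predicate, obj_uri, obj_literal = match.groups()
    if obj_uri is not None:
        return (subject, predicate, obj_uri, True)
    return (subject, predicate, obj_literal, False)


DATA = """
<https://www.ncbi.nlm.nih.gov/pubmed/6220270a3> <http://www.ztonebv.nl/KG#hasSource> "Abstract" .
<https://www.ncbi.nlm.nih.gov/pubmed/6220270a3> <http://www.ztonebv.nl/KG#hasAnnotation> <https://www.ncbi.nlm.nih.gov/pubmed/6220270a3b3> .
<https://www.ncbi.nlm.nih.gov/pubmed/6220270a3b3> <http://www.ztonebv.nl/KG#hasLabel> "kidney stone" .
"""

triples = [parse_line(line) for line in DATA.strip().splitlines()]
for subject, predicate, obj, is_uri in triples:
    print(predicate.rsplit("#", 1)[-1], "->", obj, "(URI)" if is_uri else "(literal)")
```

For production use, a full RDF library would be the safer choice; the regular expression here only covers the shapes that appear in the example.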
Semantic annotation method
Biomedical ontologies, such as UMLS, SNOMED CT, and MeSH, constitute a comprehensive repository of terms, concepts, and relationships, providing a unified framework for organizing and expressing medical knowledge. In this paper, we adopted an ontology-based semantic annotation method to align significant concepts found in articles with the corresponding terms in UMLS, SNOMED CT, and MeSH. This ensures greater precision and uniformity in term definitions and facilitates the semantic representation of data. The strategy also eases integration and interaction of the KSDKG with other knowledge graphs, improves the accuracy of retrieval, and supports applications in semantic search and knowledge discovery.
We analyzed and processed the literature obtained from PubMed using the biomedical text processing tool XMedlan [29, 30], and identified key concepts and their context in the text. After extraction, we mapped these concepts to UMLS, SNOMED CT or MeSH, aiming for each concept to have corresponding encodings in the different terminology systems. Furthermore, we annotated each identified concept with a concept unique identifier (CUI) in UMLS or a concept ID (SCTID) in SNOMED CT to ensure concept standardization. Through these encodings and mapping relationships, associations between concepts are constructed so that the same concept is represented consistently across terminology systems.
For instance, the concept of “kidney stone” mentioned in the literature can be represented not only as SCTID 95570007 in SNOMED CT, but also as CUI C0022650 in UMLS. There are several concepts similar to “kidney stone” within SNOMED CT (as shown in Table 1). Therefore, it is essential to annotate “kidney stone” to the respective concepts accurately and enumerate all applicable mappings.
Table 1.
Concept ID and concept name of kidney stone in SNOMED CT
| Concept ID | Concept name |
|---|---|
| 95570007 | Kidney stone |
| 197793002 | (Calculus of kidney) or (nephrolithiasis NOS) |
| 155868000 | Kidney calculus (& [staghorn]) (disorder) |
| 197792007 | Urinary calculus (& [kidney &/or ureter]) |
| 266623004 | Kidney calculus (& [staghorn]) |
| 155867005 | Urinary calculus (& [kidney &/or ureter]) (disorder) |
| 266622009 | Urinary calculus (& [kidney &/or ureter]) (disorder) |
| 56491003 | Nephrolithiasis |
| 197795009 | Renal calculus NOS (disorder) |
| 236707002 | Nephrolithiasis NOS (disorder) |
Additionally, it is crucial to consider the context in order to handle synonyms accurately, since synonyms may correspond to different concepts in a standardized system and might not always be appropriate in a specific context. Therefore, when identifying the corresponding SCTID or CUI, it is essential to select the encodings that are most closely related to the concept in its context. These chosen IDs help to represent the meaning of the concept accurately.
With this approach, we can identify a concept through its specific concept encoding and acquire detailed information on its definition and attributes. Moreover, by mapping synonyms of the concept along with their respective concept encodings, we can establish connections and ensure a uniform expression of terminology. This method enables a more complete representation of the semantic content, facilitating a broader understanding of diseases.
Semantic data generation
The text from the titles and abstracts of articles was segmented into sentences for finer-grained processing. In extracting the key concepts from each sentence (e.g., diseases, drugs, and anatomical structures), we not only identified each concept but also obtained multi-dimensional information about it. This included the starting position of the concept in the sentence, the length of the concept, and the source of the sentence in which the concept is located (title or abstract).
Additionally, we collected and recorded detailed linked concept information (e.g., standardized term ID, source, and link) based on the semantic annotations. The standardized term ID means that each concept is mapped to a unique identifier for a standardized term, such as a CUI for UMLS or an SCTID for SNOMED CT. The source indicates the origin of this standardized term, such as UMLS, SNOMED CT, or MeSH. The external link is a generated URI that points to an external knowledge base, making concepts in the knowledge graph interoperable with other data sources.
Therefore, semantic data generated from literature comprises three main components: text information, concept information, and semantic annotation information (as shown in Fig. 3). These semantic data support a deeper understanding of the text and provide richer, more accurate semantic information for subsequent research and application. To process and utilize the semantic data effectively in subsequent analyses, we converted the semantic data into RDF triples, expressed in N-Triples format.
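The kind of record described here (concept text, start position, length, and title/abstract source) can be illustrated with a small sketch. It does not reproduce XMedlan: the sentence splitter is naive, and concept recognition is replaced by a case-insensitive lookup in a tiny hypothetical vocabulary.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def find_mentions(sentence, vocabulary, source):
    """Locate vocabulary terms in a sentence and record offset information."""
    mentions = []
    for term in vocabulary:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            mentions.append({
                "concept": match.group(0),
                "start": match.start(),                 # character offset in sentence
                "length": match.end() - match.start(),  # length of the mention
                "source": source,                       # "title" or "abstract"
            })
    return sorted(mentions, key=lambda m: m["start"])


# Hypothetical vocabulary and abstract text, for illustration only.
VOCABULARY = ["kidney stone", "urological disorder"]
abstract = ("Kidney stone disease is a common urological disorder. "
            "Its incidence is increasing worldwide.")

sentences = split_sentences(abstract)
mentions = find_mentions(sentences[0], VOCABULARY, source="abstract")
for mention in mentions:
    print(mention)
```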
Fig. 3.
Structure of semantic data from literature. Circles represent nodes, and grey circles indicate link nodes. Arrows depict the relationships between nodes
In the RDF generation process, each recognized concept and its related information is converted into triples whose subject uniquely identifies the entity being described. Each concept, once normalized, is allocated a unique URI that serves as the subject of an RDF triple. For example, when describing the extracted term "kidney stone" in a sentence, the subject can be expressed as <https://www.ncbi.nlm.nih.gov/pubmed/8873378a2b7>. This URI ensures the subject’s global uniqueness, allowing it to be uniquely identified and referenced across knowledge graphs. The type of relationship being expressed determines the predicate of a triple, which is also represented by a URI. For example, to describe the concept’s standardized ID and URL, the predicates can be represented as <http://www.ztonebv.nl/KG#Senser> and <http://www.ztonebv.nl/KG#SenseURL>, respectively. The object of a triple is represented either by a URI (referring to another entity or concept) or by a literal. For example, when describing the SCTID of “kidney stone”, the object could be the string "95570007". Finally, all RDF triples for the semantic data are stored in N-Triples format.
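As a rough illustration of this step, the sketch below turns one annotated concept into N-Triples lines using the subject URI and the predicates quoted above. The function names are hypothetical, the `http://snomed.info/id/` link is an assumed form of the external URL, and treating the `SenseURL` object as a URI rather than a literal is also an assumption.

```python
KG = "http://www.ztonebv.nl/KG#"


def nt_term(value, is_uri):
    """Format one term: <uri> or an escaped plain literal."""
    if is_uri:
        return f"<{value}>"
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def annotation_triples(subject_uri, label, sctid, sense_url):
    """Build (subject, predicate, object, object_is_uri) tuples for one concept."""
    return [
        (subject_uri, KG + "hasLabel", label, False),
        (subject_uri, KG + "Senser", sctid, False),       # standardized ID as a string
        (subject_uri, KG + "SenseURL", sense_url, True),  # assumed to be a URI object
    ]


def to_ntriples(triples):
    """Serialize tuples as N-Triples, one statement per line."""
    return "\n".join(
        f"{nt_term(s, True)} {nt_term(p, True)} {nt_term(o, is_uri)} ."
        for s, p, o, is_uri in triples
    )


subject = "https://www.ncbi.nlm.nih.gov/pubmed/8873378a2b7"
lines = to_ntriples(annotation_triples(
    subject, "kidney stone", "95570007",
    "http://snomed.info/id/95570007",  # assumed link format
)).splitlines()
print("\n".join(lines))
```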
Knowledge integration
To further enrich the content and semantic representation of the KSDKG, this paper integrated information from different data sources. This integration not only strengthens the connections among knowledge in KSDKG, facilitating the in-depth mining and analysis of medical knowledge, but also enables more thorough discovery, inference, and application of knowledge.
Our efforts were particularly focused on incorporating data from DrugBank and the Microbe-Disease Knowledge Graph (MDKG), alongside the inclusion of biomedical ontologies. This approach equips the KSDKG with advanced capabilities for analyzing and retrieving drug-related and microbe-related knowledge, thereby providing valuable resources for researchers and clinicians and fostering advancements in medical research and clinical practice.
DrugBank is a comprehensive drug knowledge repository that integrates bioinformatics and cheminformatics resources. We used the DrugBank API and the downloaded data files to extract drug information, such as drug ID, name, indication, and pharmacological mechanism. To achieve data integration, we transformed the data obtained into RDF triples. Each drug entry was assigned a unique URI to guarantee its individuality and facilitate identification. For instance, “Lepirudin” with the DrugBank ID DB00001 can be represented as an RDF triple: <https://www.drugbank.ca/drugs/DB00001> <http://www.w3.org/2000/01/rdf-schema#label> "Lepirudin" . The resulting RDF data was stored in the same database as the KSDKG.
MDKG delineates associations between microbes and diseases [31], extracted from Wikipedia and other databases. As in our work, the MDKG dataset is organized as RDF triples, which made it easier to integrate with the KSDKG. In addition, the disease and microbe concepts in MDKG had been mapped to MeSH, simplifying interaction and integration with other MeSH-based knowledge graphs or datasets. Thus, by integrating MDKG data, the KSDKG enriches its coverage of microbe-disease relationships and supports more detailed studies of the effects of microbes on health.
Knowledge storage and reasoning
GraphDB for knowledge storage
In the implementation of the KSDKG, choosing an appropriate storage method is very important. Knowledge graphs are commonly stored in either graph databases or relational databases. Compared with relational databases, graph databases offer efficient query performance, support flexible data models, and handle large-scale data and workloads effectively. They are therefore more suitable for storing and querying the semantic data of knowledge graphs [32].
GraphDB was chosen for triple storage and visualization of the KSDKG in this paper. It is a semantic graph database compliant with W3C standards that stores relationships based on graph theory, which makes it well suited to storing, managing, and querying complex and highly interconnected data [33]. GraphDB is capable of large-scale semantic reasoning, allowing users to derive new semantic facts from existing ones. It supports RDF and SPARQL queries, offering powerful query and analysis capabilities.
SPARQL query for knowledge reasoning
SPARQL Protocol and RDF Query Language (SPARQL) is a pattern-based query language that is widely used for querying and analyzing RDF data [28]. It provides rich query functions and syntax, allowing the retrieval and analysis of data using graph patterns, filters, and aggregation functions. For example, graph patterns can be used to describe the associations between KSD and microbes. Filters can be used to select data with specific attributes or relationships, while aggregation functions can be used for data statistics and analysis.
SPARQL can also be used to retrieve implicit information in combination with the reasoning mechanism of GraphDB [34]. This means that new semantic facts can be derived from the existing facts and rules of a knowledge graph.
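One familiar example of such a rule is the transitivity of `rdfs:subClassOf` in RDFS entailment: if A is a subclass of B and B is a subclass of C, then A is a subclass of C. The sketch below applies that single rule to a toy hierarchy until no new facts appear. It illustrates the idea only; it is not GraphDB's rule engine, the ruleset actually used for the KSDKG is not specified here, and the `ex:` names are placeholders.

```python
def infer_subclass_closure(explicit_pairs):
    """Return the subClassOf pairs implied by transitivity but not stated explicitly."""
    known = set(explicit_pairs)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for child, parent in list(known):
            for other_child, grandparent in list(known):
                if parent == other_child and (child, grandparent) not in known:
                    known.add((child, grandparent))
                    changed = True
    return known - set(explicit_pairs)


# Toy hierarchy; "ex:BroaderCategory" is a placeholder, not a SNOMED CT concept.
explicit = {
    ("ex:KidneyStone", "ex:Urolithiasis"),
    ("ex:Urolithiasis", "ex:BroaderCategory"),
}
inferred = infer_subclass_closure(explicit)
print(inferred)
```

A query for all subclasses of a broad concept then also returns concepts that are only indirectly linked to it, which is why inferred triples are counted separately from explicit ones.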
A simple SPARQL query usually consists of the following components:
PREFIX: used to define the namespace to simplify the URIs in the query.
SELECT: used to specify the variables to be retrieved.
WHERE: used to specify the query pattern for matching a particular pattern in the RDF graph.
Besides the above basic components, the structure of a SPARQL query can also include FROM (to specify the dataset or graph), OPTIONAL (to enable optional matching), FILTER (to restrict results), ORDER BY (to sort results), LIMIT (to restrict the number of results) and OFFSET (to skip a number of results), etc. These components enhance the flexibility and effectiveness of queries, equipping them to handle a diverse range of data retrieval challenges, from simple to complex.
For example, consider a query to find the parent concepts (with English labels) of "Kidney stone" in SNOMED CT. The SPARQL query process for retrieving these hypernyms is illustrated in Fig. 4. "snomed:", "rdfs:", and "sct:" are prefixes that abbreviate their corresponding URIs. "?parentKS_en", "?ks" and "?parentKS" are variables defined within the query. "?parentKS_en" is the variable we want to retrieve, representing the English label of the parent concept in SNOMED CT. "rdfs:subClassOf" is a predicate from RDF Schema used to define class hierarchies, indicating that "?ks" is a child concept (hyponym) of "?parentKS". After the SPARQL query is executed on the subgraph, "Urolithiasis" and "Kidney disease" are identified as the parent concepts, sorted by their English labels.
Fig. 4.
An example for process of SPARQL query
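The sketch below gives one possible textual form of such a query and mimics its graph-pattern match in plain Python over a toy triple list. Everything concrete in it is an assumption: the `sct:` namespace is a placeholder, the parent identifiers `PARENT_1` and `PARENT_2` are invented, the toy data carries English labels only, and the query text is not the one shown in Fig. 4.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
SCT = "http://example.org/sct/"  # placeholder namespace, not the real prefix URI

# Hypothetical query text (not executed here).
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sct:  <http://example.org/sct/>
SELECT ?parentKS_en
WHERE {
  VALUES ?ks { sct:95570007 }
  ?ks rdfs:subClassOf ?parentKS .
  ?parentKS rdfs:label ?parentKS_en .
  FILTER (lang(?parentKS_en) = "en")
}
ORDER BY ?parentKS_en
"""

# Toy subgraph as (subject, predicate, object) tuples.
KS = SCT + "95570007"
TRIPLES = [
    (KS, RDFS + "subClassOf", SCT + "PARENT_1"),
    (KS, RDFS + "subClassOf", SCT + "PARENT_2"),
    (SCT + "PARENT_1", RDFS + "label", "Urolithiasis"),
    (SCT + "PARENT_2", RDFS + "label", "Kidney disease"),
    (KS, RDFS + "label", "Kidney stone"),
]


def parent_labels(triples, child):
    """Match '?ks subClassOf ?parent . ?parent label ?name' and sort by name."""
    parents = {o for s, p, o in triples if s == child and p == RDFS + "subClassOf"}
    labels = [o for s, p, o in triples if s in parents and p == RDFS + "label"]
    return sorted(labels)


result = parent_labels(TRIPLES, KS)
print(result)
```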
In this paper, we conducted SPARQL queries and reasoning on GraphDB to explore studies related to KSD based on the KSDKG. Through flexible querying and reasoning, implicit knowledge and associative relationships can be discovered and presented visually. This provides valuable information and insights for KSD-related research, such as microbe analysis, disease interrelations and drug discovery.
Results
Description of the KSDKG
The KSDKG is a knowledge graph encompassing semantic data from KSD-related literature in PubMed and other available data. It comprises 43,821,993 triples from 29,174 articles. Beyond the literature, the KSDKG integrates biomedical ontologies (SNOMED CT, UMLS and MeSH) and public databases (DrugBank and MDKG), enriching the knowledge graph with diversity and depth (as shown in Table 2).
Table 2.
Triples of the KSDKG in GraphDB
| Knowledge resource | Number of data items | Number of triples |
|---|---|---|
| Semantic data from literature | 29,174 articles | 43,821,993 |
| SNOMED CT | | 4,291,226 |
| UMLS | | 3,081,799 |
| MeSH | | 13,756,783 |
| DrugBank | 15,693 drugs | 15,305,066 |
| MDKG | 436,196 bacteria entities and 8,483 diseases | 1,574,829 |
| Inferred triples | | 11,118,633 |
| Total | | 92,950,329 |
In GraphDB, the KSDKG contains a total of 92,950,329 triples, of which 11,118,633 are generated by inference rules. This indicates that the knowledge in the KSDKG extends beyond explicit triples, with new information derived through inference. The expansion ratio, calculated as total triples divided by explicit triples, is 1.14; that is, inference adds about 14% to the explicit triples. This demonstrates the capacity of the KSDKG for knowledge generation and expansion.
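The expansion ratio can be checked directly from the counts in Table 2. The short calculation below sums the explicit triples per resource, adds the inferred triples, and divides total by explicit; the variable names are illustrative.

```python
# Explicit triple counts per knowledge resource, taken from Table 2.
explicit_counts = {
    "literature": 43_821_993,
    "SNOMED CT": 4_291_226,
    "UMLS": 3_081_799,
    "MeSH": 13_756_783,
    "DrugBank": 15_305_066,
    "MDKG": 1_574_829,
}
inferred_triples = 11_118_633

explicit_total = sum(explicit_counts.values())
total_triples = explicit_total + inferred_triples
expansion_ratio = total_triples / explicit_total  # a ratio, not a percentage

print(f"explicit: {explicit_total:,}")
print(f"total:    {total_triples:,}")
print(f"ratio:    {expansion_ratio:.2f}")
```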
Application case analysis
In this section, we demonstrated the knowledge accessibility and semantic reasoning capabilities of the KSDKG through three application cases. These cases further elucidate the utility of the KSDKG in facilitating research and knowledge discovery.
Case 1: Query for KSD-related microbes
Some microbes have been identified as relevant to the formation and progression of KSD. To explore potential targets for the prevention and treatment of KSD from the perspective of microbes, it is crucial to examine the diversity and characteristics of microbes, including their effects on urinary tract health and their roles in stone formation. Using the KSDKG to perform detailed searches and gather extensive data on microbes can offer a well-rounded understanding that could pave the way for novel therapeutic interventions.
We utilized the SCTID 264395009 of microbes for retrieval and analysis. Through SPARQL queries (as shown in Listing 1), we obtained 16,614 concepts related to microbes (including their subclass concepts) in SNOMED CT, and 1,776 articles mentioning 441 microbe concepts in their titles and abstracts.
Listing 1.
The SPARQL query code for case 1
Researchers investigating the microbes associated with KSD have shown interest in different kinds of microbes. To quantify this, we counted the number of articles for each microbial category. The results revealed that 948 papers mentioned subclass concepts of bacteria (SCTID: 409822003), 106 papers mentioned those of fungus (SCTID: 23496000), and 383 papers mentioned those of virus (SCTID: 49872002).
To focus on specific microbes related to KSD, we excluded general categories of microbes in SNOMED CT such as bacteria, fungus, and virus. Through SPARQL queries, we identified the top 20 most widely studied microbes (as shown in Fig. 5). Notably, Escherichia coli has received the most attention, which indicates researchers' particular interest in its role and mechanisms in the formation of kidney stones.
Fig. 5.
Number of articles on different microbes
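The ranking behind a chart of this kind amounts to counting distinct articles per microbe after dropping the general categories. The sketch below shows that aggregation on made-up (article, microbe) pairs; the identifiers `p1` to `p3` and the name "Microbe B" are placeholders, and in the KSDKG itself this step would be expressed with SPARQL grouping and counting rather than Python.

```python
from collections import defaultdict

GENERAL_CATEGORIES = {"Bacteria", "Fungus", "Virus"}  # excluded from the ranking


def top_microbes(mentions, n=20):
    """Rank microbes by the number of distinct articles that mention them."""
    articles_by_microbe = defaultdict(set)
    for article_id, microbe in mentions:
        if microbe not in GENERAL_CATEGORIES:
            articles_by_microbe[microbe].add(article_id)
    ranked = sorted(
        articles_by_microbe.items(),
        key=lambda item: (-len(item[1]), item[0]),  # most articles first, then name
    )
    return [(name, len(ids)) for name, ids in ranked[:n]]


# Placeholder data, not results from the KSDKG.
mentions = [
    ("p1", "Escherichia coli"),
    ("p1", "Escherichia coli"),  # repeated mention in the same article counts once
    ("p2", "Escherichia coli"),
    ("p2", "Bacteria"),
    ("p3", "Microbe B"),
]
ranking = top_microbes(mentions)
print(ranking)
```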
The findings in the publication with PMID "34336124" indicate that Escherichia coli activates the NF-κB/p38 signaling pathway by enhancing oxidative damage and inflammation regulated by polyphosphate kinase 1/flagellin, promoting the formation of calcium oxalate stones. This finding may offer a new entry point for clinical research on the prevention and treatment of KSD.
Case 2: Query for KSD-related drugs
Many studies have explored the use of drugs in the diagnosis and treatment of KSD, including inhibiting stone formation, promoting stone expulsion and preventing stone recurrence. Systematically sorting out and summarizing existing studies can provide comprehensive treatment evidence to inform and guide clinical practice and future research.
In this case, literature was analyzed to obtain more information about drugs related to KSD. The corresponding SCTIDs of drugs were 410942007 (Name: Drug or medicament) and 373873005 (Name: Pharmaceutical/biologic product). We conducted retrieval and analysis using these two concept IDs. The SPARQL query code is shown in Listing 2.
Listing 2.
The SPARQL query code for case 2
The results showed that 3,047 drug concepts in SNOMED CT were mentioned in 15,022 articles. Specifically, for SCTID 410942007, 1,653 drug-related concepts appeared in 13,796 articles, while for SCTID 373873005, 1,394 concepts were mentioned in 13,252 articles.
To further understand the basic information about drugs mentioned in the literature, such as SCTIDs of drugs, drug names, indications, and interactions, we combined the DrugBank knowledge integrated in the KSDKG for retrieval. This demonstrates the integration capability and availability of knowledge within the KSDKG. Figure 6 displays the basic information about KSD-related drugs.
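One reading of these figures is as set arithmetic over the two root concepts. The concept counts add up exactly (1,653 + 1,394 = 3,047), which is consistent with the two branches sharing no concepts, whereas the article total (15,022) is far below 13,796 + 13,252 = 27,048, which suggests that many articles mention concepts from both branches. The minimal Python sketch below illustrates the counting with made-up article IDs rather than the paper's data; the variable names are hypothetical, and the final line assumes that 15,022 is the union of the two article sets.

```python
# Hypothetical article IDs mentioning concepts under each root concept.
articles_drug_or_medicament = {"a1", "a2", "a3", "a4"}
articles_pharma_product = {"a3", "a4", "a5"}

# An article mentioning concepts from both branches is counted once.
all_articles = articles_drug_or_medicament | articles_pharma_product
both = articles_drug_or_medicament & articles_pharma_product

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|.
assert len(all_articles) == (len(articles_drug_or_medicament)
                             + len(articles_pharma_product) - len(both))

# The same identity applied to the reported counts gives the implied overlap,
# assuming 15,022 is the union of the two article sets.
implied_overlap = 13_796 + 13_252 - 15_022
```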
Fig. 6.
Information on drugs associated with KSD. Descriptions of drug indications and interactions are simplified for convenience
Case 3: Query for KSD-related diseases
Microbes can interact with the host organism in various ways, playing an important role in the formation and development of diseases. In this case, the potential diseases associated with KSD were explored through known associations between bacteria and diseases, helping to gain insight into the pathomechanisms linking KSD and other diseases.
We hypothesized that if a bacterium is associated with KSD and also related to another disease, then there may be a correlation between KSD and this disease. This inference is based on logical reasoning derived from known associations. However, it is important to note that the accuracy of this deduction requires further investigation and supporting evidence.
In MeSH, the concept IDs of KSD include D053040 (Nephrolithiasis), D007669 (Kidney Calculi), and D000069856 (Staghorn Calculi). Following the inference rules, we conducted exploration and analysis using SPARQL queries based on MeSH. The SPARQL query code is shown in Listing 3.
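The inference rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' SPARQL query: the association pairs, the placeholder names (`bacterium_a`, `disease_x`, and so on) and the function `candidate_associations` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical microbe-disease associations as (bacterium, disease) pairs.
ASSOCIATIONS = [
    ("bacterium_a", "KSD"),
    ("bacterium_a", "disease_x"),
    ("bacterium_b", "KSD"),
    ("bacterium_b", "disease_x"),
    ("bacterium_b", "disease_y"),
    ("bacterium_c", "disease_z"),  # not linked to KSD, so never used
]


def candidate_associations(associations, target="KSD"):
    """Map each candidate disease to the bacteria it shares with target."""
    diseases_by_bacterium = defaultdict(set)
    for bacterium, disease in associations:
        diseases_by_bacterium[bacterium].add(disease)

    candidates = defaultdict(set)
    for bacterium, diseases in diseases_by_bacterium.items():
        if target in diseases:
            # Every other disease linked to this bacterium is a candidate.
            for disease in diseases - {target}:
                candidates[disease].add(bacterium)
    return dict(candidates)


candidates = candidate_associations(ASSOCIATIONS)
```

Keeping the shared bacteria alongside each candidate disease records the evidence path for every inferred link, which matters because the rule yields hypotheses to be validated rather than established associations.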
Listing 3.
The SPARQL query code for case 3
The findings suggest that, through Eubacterium hallii (E. hallii) and Eubacterium ruminantium (E. ruminantium), KSD may be associated with 10 diseases (as shown in Fig. 7), namely Crohn disease, glycogen storage disease, liver cirrhosis, Parkinson disease, thyroid neoplasms, Graves ophthalmopathy, Hashimoto disease, chronic kidney disease, ulcerative colitis, and infantile autism. Notably, thyroid neoplasms may be correlated with KSD through either E. hallii or E. ruminantium. Furthermore, these bacteria may also contribute to associations among these ten diseases.
Fig. 7.
Potential disease associations influenced by bacteria. Solid arrows indicate the explicit relationships already existing in the KSDKG. Dotted arrows indicate possible associations generated by reasoning
E. hallii and E. ruminantium are the primary bacterial strains among gut microbes that produce short-chain fatty acids (SCFAs). SCFAs play crucial roles in the gastrointestinal tract, including energy provision, intestinal mucosal health maintenance, and immune system regulation. Studies have shown that the relative abundances of E. hallii and E. ruminantium are significantly lower in patients with KSD than in the healthy population. The abundance of E. hallii is reduced in patients with Crohn disease, glycogen storage disease, and liver cirrhosis, while it is increased in patients with Parkinson disease and Hashimoto disease. Therefore, in the progression of some diseases, changes in E. hallii and E. ruminantium lead to alterations in SCFAs, which may affect the development of KSD. Researchers can draw on these associations to explore new directions of research into the mechanisms of KSD formation.
This case is only an exploratory study to understand the possible associations of KSD with other diseases influenced by bacteria. Future studies could further elucidate the mechanisms and pathways underlying these associations by exploring the biological characteristics of bacteria and the causal relationships between bacteria and diseases. These associations can also be further validated through clinical cases or animal studies.
Discussion and conclusion
In this paper, a large-scale knowledge graph for KSD was constructed, covering more than 90 million triples. Knowledge in the KSDKG was extracted from multiple data sources such as the biomedical literature (about 30 thousand articles), biomedical ontologies (SNOMED CT, UMLS, and MeSH), and publicly available databases (DrugBank and MDKG). This integration transformed discrete resources into structured knowledge, laying a robust data foundation.
While constructing the KSDKG, we semantically annotated concepts identified within medical literature using SNOMED CT and UMLS. This approach enabled the normalization and standardized representation of concepts, facilitating knowledge integration and the application of inference. We performed semantic queries and case analysis on microbes, drugs and diseases, successfully retrieving the desired data and uncovering potential diseases associated with KSD. This indicates that the KSDKG can extract specific information from an integrated data source to simplify the knowledge retrieval path, and reveal new knowledge or insights. Consequently, our knowledge graph not only contributes to knowledge organization, but also introduces a novel method for conducting research on KSD.
By combining multiple databases, we have ensured broad coverage of knowledge related to KSD. The data sources we chose, such as UMLS, SNOMED CT, and MeSH, are widely recognized and authoritative. These sources undergo rigorous review and regular updates to reflect the latest medical information and research findings with a high level of credibility and reliability. Through semantic analysis, concept mapping, and triple extraction, KSDKG has shown good completeness in information retrieval and reasoning, meeting the basic application requirements of a knowledge graph. Additionally, an initial manual review reduced noise and mistakes, enhancing the accuracy of KSDKG.
Our work, while extensive, has certain limitations that merit further discussion and improvement. Although the KSDKG already incorporates a variety of data resources, it primarily focuses on biomedical literature. To enhance the breadth and depth of our research, it is imperative to consider incorporating more data sources. Resources such as the Gene Ontology (GO) [35], the Kyoto Encyclopedia of Genes and Genomes (KEGG) [36], and guidelines [37] from the European Association of Urology (EAU) and the China Urological Association (CUA) can provide valuable data support for researching the pathogenesis and intervention of KSD. Furthermore, the construction of a knowledge graph is an ongoing iterative process. Therefore, the quality and construction efficiency of knowledge graphs warrant our continuous attention and research. Although KSDKG has undergone extensive data integration and preliminary estimation, a comprehensive and systematic evaluation is still pending. In future work, we intend to adopt standardized and multi-level methods to evaluate the quality of KSDKG. We will employ high-quality, manually annotated datasets as benchmarks to conduct a detailed assessment, and invite experts proficient in KSD to manually inspect and validate the knowledge accuracy and reasoning logic. Performance metrics (e.g., accuracy and recall) of information retrieval tasks will also be used for quantitative analysis to further reflect the overall quality and practical applicability. In addition, the construction and updating of KSDKG are not yet sufficiently automated, so more efficient tools need to be developed. Large Language Models (LLMs) have catalyzed innovation and breakthroughs in knowledge graph research [38, 39]. Therefore, we can utilize LLMs to refine and update the KSDKG, ensuring effective knowledge management.
In the future, KSDKG may hold broad application prospects in different fields. In clinical decision support, KSDKG can be combined with clinical and epidemiological data to identify patient-specific risk factors, explore patterns and correlations between different groups, and provide physicians with evidence-based recommendations (such as disease diagnosis, test and treatment recommendations). In disease research and drug discovery, by correlating genetic, clinical and biochemical data, we can explore the underlying pathophysiological mechanisms of KSD to discover new therapeutic targets. A broader understanding of disease interactions can also be achieved by combining KSDKG with knowledge graphs for other diseases, providing a more comprehensive view of health management. In medical education, knowledge graphs visualize the relationships between medical concepts, helping students and researchers to better understand complex medical knowledge and enhance learning and research. Therefore, further research on KSDKG has significant potential to promote medical research and practice.
Overall, our research endeavors to construct a disease-specific knowledge graph for KSD, thereby improving the use of extensive medical resources. This effort aims to provide clinicians with more accessible, efficient services for knowledge mining and discovery, supporting the prevention and treatment of KSD. Furthermore, we hope that our work will inspire more researchers to delve into the creation and application of medical knowledge graphs in disease-specific domains, collectively advancing medical research and contributing to the development of precision medicine.
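The quantitative evaluation mentioned above could, for instance, compare the concepts returned by a query against a manually annotated benchmark. The minimal Python sketch below computes precision and recall for one such comparison; it uses precision as one common retrieval-oriented reading of "accuracy", which is an assumption, and the function name and concept IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of retrieved items against a gold set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result and manually annotated gold standard.
precision, recall = precision_recall(
    retrieved=["c1", "c2", "c3", "c4"],
    relevant=["c1", "c2", "c5"],
)
```

Here two of the four retrieved concepts are in the gold set, and two of the three gold concepts are retrieved.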
Additional file
Acknowledgements
We sincerely express our gratitude to Professor Kewei Xu and Dr. Cong Lai from the Department of Urology at Sun Yat-sen Memorial Hospital, Sun Yat-sen University, for their assistance and professional guidance during the design and development of our work.
Funding
This work was supported by the Key Research and Development Program of China (2022YFC3601600), the Guangzhou Science and Technology Plan (202201011545), the National Natural Science Foundation of China (61876194), the Science and Technology Innovation Special Project of Guangdong Province (202011020004), and Fundamental Research Funds for the Central Universities, Sun Yat-Sen University (24xkjc025).
Data availability
The data associated with this study are not publicly available at this time, as they are currently reserved for additional analyses. However, further details can be provided upon reasonable request, if needed.
Declarations
Conflict of interest
The authors declare that there are no conflicts of interest.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The online version contains supplementary material available at 10.1007/s13755-024-00309-3.
References
- 1. Hao X, Shao Z, Zhang N, Jiang M, Cao X, Li S, Guan Y, Wang C. Integrative genome-wide analyses identify novel loci associated with kidney stones and provide insights into its genetic architecture. Nat Commun. 2023;14(1):7498.
- 2. Gillams K, Juliebø-Jones P, Juliebø SØ, Somani BK. Gender differences in kidney stone disease (ksd): findings from a systematic review. Curr Urol Rep. 2021;22:1–8.
- 3. Goldfarb DS, Avery AR, Beara-Lasic L, Duncan GE, Goldberg J. A twin study of genetic influences on nephrolithiasis in women and men. Kidney Int Rep. 2019;4(4):535–40.
- 4. Chmiel JA, Stuivenberg GA, Al KF, Akouris PP, Razvi H, Burton JP, Bjazevic J. Vitamins as regulators of calcium-containing kidney stones-new perspectives on the role of the gut microbiome. Nat Rev Urol. 2023;20(10):615–37.
- 5. Crivelli JJ, Maalouf NM, Paiste HJ, Wood KD, Hughes AE, Oates GR, Assimos DG. Disparities in kidney stone disease: a scoping review. J Urol. 2021;206(3):517–25.
- 6. Li L, Liu M, Lai C, Ji W, Xu K, Zhou Y. Analysis of residual stones in patients and related influencing factors after percutaneous nephrolithotomy: a retrospective study. In: 2023 IEEE 11th international conference on healthcare informatics (ICHI). IEEE; 2023. p. 32–39.
- 7. Peerapen P, Thongboonkerd V. Kidney stone prevention. Adv Nutr. 2023;14(3):555–69.
- 8. Sassanarakkit S, Peerapen P, Thongboonkerd V. Stonemod: a database for kidney stone modulatory proteins with experimental evidence. Sci Rep. 2020;10(1):15109.
- 9. Liu M, Luo J, Li L, Pan X, Tan S, Ji W, Zhang H, Tang S, Liu J, Wu B, et al. Design and development of a disease-specific clinical database system to increase the availability of hospital data in china. Health Inf Sci Syst. 2023;11(1):11.
- 10. Hogan A, Blomqvist E, Cochez M, d’Amato C, Melo GD, Gutierrez C, Kirrane S, Gayo JEL, Navigli R, Neumaier S, et al. Knowledge graphs. ACM Comput Surv (Csur). 2021;54(4):1–37.
- 11. Peng C, Xia F, Naseriparsa M, Osborne F. Knowledge graphs: opportunities and challenges. Artif Intell Rev. 2023;56(11):13071–102.
- 12. Zhong L, Wu J, Li Q, Peng H, Wu X. A comprehensive survey on automatic knowledge graph construction. ACM Comput Surv. 2023;56(4):1–62.
- 13. Wang T, Zhang Y, Zhang Y, Lu H, Yu B, Peng S, Ma Y, Li D. A hybrid model based on deep convolutional network for medical named entity recognition. J Electr Comput Eng. 2023;2023(1):8969144.
- 14. Aldwairi M, Jarrah M, Mahasneh N, Al-khateeb B. Graph-based data management system for efficient information storage, retrieval and processing. Inf Process Manage. 2023;60(2):103165.
- 15. Ji X, Ritter A, Yen P-Y. Using ontology-based semantic similarity to facilitate the article screening process for systematic reviews. J Biomed Inform. 2017;69:33–42.
- 16. Wu X, Duan J, Pan Y, Li M. Medical knowledge graph: data sources, construction, reasoning, and applications. Big Data Min Anal. 2023;6(2):201–17.
- 17. Chen A, Huang R, Wu E, Han R, Wen J, Li Q, Zhang Z, Shen B. The generation of a lung cancer health factor distribution using patient graphs constructed from electronic medical records: retrospective study. J Med Internet Res. 2022;24(11):40361.
- 18. Zhao X, Wang Y, Wen T. The construction of a tcm knowledge graph and application of potential knowledge discovery in diabetic kidney disease by integrating diagnosis and treatment guidelines and real-world clinical data. Front Pharmacol. 2023;14:1147677.
- 19. Liu F, Liu M, Li M, Xin Y, Gao D, Wu J, Zhu J. Automatic knowledge extraction from Chinese electronic medical records and rheumatoid arthritis knowledge graph construction. Quant Imaging Med Surg. 2023;13(6):3873.
- 20. An B. Construction and application of Chinese breast cancer knowledge graph based on multi-source heterogeneous data. Math Biosci Eng. 2023;20(4):6776–99.
- 21. Jin S, Liang H, Zhang W, Li H, et al. Knowledge graph for breast cancer prevention and treatment: literature-based data analysis study. JMIR Med Inform. 2024;12(1):52210.
- 22. Papadakis E, Baryannis G, Batsakis S, Adamou M, Huang Z, Antoniou G. Adhd-kg: a knowledge graph of attention deficit hyperactivity disorder. Health Inf Sci Syst. 2023;11(1):52.
- 23. Feng F, Tang F, Gao Y, Zhu D, Li T, Yang S, Yao Y, Huang Y, Liu J. Genomickb: a knowledge graph for the human genome. Nucl Acids Res. 2023;51(D1):950–6.
- 24. Chandak P, Huang K, Zitnik M. Building a knowledge graph to enable precision medicine. Sci Data. 2023;10(1):67.
- 25. Santos A, Colaço AR, Nielsen AB, Niu L, Strauss M, Geyer PE, Coscia F, Albrechtsen NJW, Mundt F, Jensen LJ, et al. A knowledge graph to interpret clinical proteomics data. Nat Biotechnol. 2022;40(5):692–702.
- 26. Byambasuren O, Yang Y, Sui Z, Dai D, Chang B, Li S, Zan H. Preliminary study on the construction of Chinese medical knowledge graph. J Chin Inf Process. 2019;33(10):1–9.
- 27. White J. Pubmed 2.0. Med Ref Serv Q. 2020;39(4):382–7.
- 28. Ali W, Saleem M, Yao B, Hogan A, Ngomo A-CN. A survey of RDF stores & SPARQL engines for querying knowledge graphs. VLDB J. 2022;31:1–26.
- 29. Ait-Mokhtar S, Bruijn B, Hagege C, Rupi P. Intermediary-stage ie components. Technical report, D3. 5. Technical report, EURECA Project; 2014.
- 30. Khiari A. Identification of variants of compound terms. PhD thesis, Master Thesis. Technical Report. Université Paul Sabatier, Toulouse; 2015.
- 31. Fu C, Zhong R, Jiang X, He T, Jiang X. An integrated knowledge graph for microbe-disease associations. In: Health information science: 9th international conference, HIS 2020, Amsterdam, The Netherlands, October 20–23, 2020, proceedings 9. Springer; 2020. p. 79–90.
- 32. Paul S, Mitra A, Koner C. A review on graph database and its representation. In: 2019 international conference on recent advances in energy-efficient computing and communication (ICRAECC). IEEE; 2019. p. 1–5.
- 33. Güting RH. Graphdb: modeling and querying graphs in databases. In: VLDB, vol. 94, Citeseer; 1994. p. 12–15.
- 34. Lan G, Liu T, Wang X, Pan X, Huang Z. A semantic web technology index. Sci Rep. 2022;12(1):3672.
- 35. Aleksander SA, Balhoff J, Carbon S, Cherry JM, Drabkin HJ, Ebert D, Feuermann M, Gaudet P, Harris NL, et al. The gene ontology knowledgebase in 2023. Genetics. 2023;224(1):031.
- 36. Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. Kegg: new perspectives on genomes, pathways, diseases and drugs. Nucl Acids Res. 2017;45(D1):353–61.
- 37. Tzelves L, Türk C, Skolarikos A. European association of urology urolithiasis guidelines: where are we going? Eur Urol Focus. 2021;7(1):34–8.
- 38. Pan S, Luo L, Wang Y, Chen C, Wang J, Wu X. Unifying large language models and knowledge graphs: a roadmap. IEEE Trans Knowl Data Eng. 2024. 10.1109/TKDE.2024.3352100.
- 39. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80.