Abstract
Research in rare diseases is impeded by a lack of data and standards. A method to improve data interoperability is necessary to allow data reuse, integration, and exchange in rare disease research. A computational package named NormMap was developed to identify rare disease related data in free text from various types of resources via semantic annotation with rare disease terms from the NCATS Genetic and Rare Diseases (GARD) Information Center. In this preliminary study, we extended NormMap and applied it to four different sources, namely NIH funded projects, clinical trials, PubMed articles, and Reddit subreddits, to generate rare disease profiles. These profiles offer an integrated view of rare diseases from different perspectives, including funding agencies, patient groups, and scientific research, to ultimately advance rare disease research, as demonstrated in our case study.
Keywords: rare disease, rare disease profile, data integration, NormMap
I. INTRODUCTION
Although the age of big data is in full swing, big data is not always available in rare diseases (RD), since small sample sizes are inevitable, especially when the primary end point is uncommon [1]. On the other hand, with the recent advances in genomic sequencing, molecular biology and machine learning, notable progress in RD could be on the horizon [2]. However, the accumulated data are scattered and not standardized, which hinders the data integration needed to support further research. A centralized research infrastructure that integrates RD related data would minimize barriers to making connections, whether biological, therapeutic, or societal, within and between RDs [3]. In this study, we propose to analyze and integrate data from publications, clinical trials, NIH grant funding and social media to enable an integrated view of RD for advancing research.
Publications, clinical trials, grant funding and social media have been intensively applied in biomedical research and clinical settings. The number of rare disease related studies published per year has increased over the past twenty years, from 4,516 in 2000 to 23,322 in 2022 according to a PubMed search for “rare disease”, which offers a large body of data for gaining new insight. For instance, text-mining algorithms have been developed to mine PubMed publications for hidden or undiscovered knowledge [4]. Social media is emerging as a mainstream source with high user engagement and minimally restricted communication. For instance, Twitter reaches more than 500 million posts a day and Facebook can reach over 2 billion likes a day [5]. Reddit, a popular online forum, organizes its topics as subreddits. RD related subreddits could prove to be a valuable source of the patients’ perspective for researchers.
To make connections among different types of data, semantic annotation, a process of annotating biomedical concepts presented in free text with standard vocabularies, is an effective solution. A popular annotation tool called MetaMap [6] maps unstructured biomedical text to the Unified Medical Language System (UMLS). eRAM [7], developed in response to NIH’s Undiagnosed Disease Program, is an encyclopedia of rare disease annotations that integrates various data sources such as OMIM, UMLS, and Orphanet to expand rare disease mapping. NormMap, a disease annotation tool that we developed previously, annotates free-text NIH grant data with RDs from GARD. To extend NormMap into a generalizable annotation tool for any type of free text, we introduced NormMap V2 and applied it to generate RD profiles with NIH grants, publications, clinical trials, and social media data.
II. METHODOLOGY
To generate integrative RD profiles by analyzing data from PubMed, NIH Grants & Funding, ClinicalTrials.gov and Reddit, we identified and linked RD related data using NormMap V2. The framework of this study is shown in Fig 1.
Fig 1.
Overview of integrative rare disease profile generation.
A. NormMap V2 development.
We previously developed a normalization-based disease annotation algorithm called NormMap V1, with the ultimate goal of identifying research projects funded for RDs by annotating NIH grant funding data [8]. More specifically, GARD disease names and the abstracts of NIH funded projects were both subjected to a series of normalization steps for disease annotation. The normalization steps standardized word formats and special expressions, including letter case, hyphens, parentheses, forward slashes, Roman numerals, etc. Each RD name and each abstract sentence was then tokenized into a list. For a project to match an RD, at least one token list of the RD name had to be a subset of at least one token list from the abstract. To expand the use of NormMap to any type of free text and improve its performance, we developed NormMap V2 with extensions (Fig 2).
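The normalization and subset-matching steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the NormMap V1 source code: the function names normalize and matches_v1, the small Roman numeral table, and the regex-based sentence splitter are all hypothetical choices made for this example.

```python
import re

# Hypothetical Roman numeral table. Single-letter numerals ("i", "v", "x")
# are left out on purpose because they collide with ordinary words.
ROMAN = {"ii": "2", "iii": "3", "iv": "4", "vi": "6",
         "vii": "7", "viii": "8", "ix": "9"}


def normalize(text):
    """Lowercase, split on hyphens/slashes/parentheses, map Roman numerals."""
    text = text.lower()
    text = re.sub(r"[-/()]", " ", text)       # special characters become spaces
    tokens = re.findall(r"[a-z0-9]+", text)   # keep alphanumeric tokens only
    return [ROMAN.get(tok, tok) for tok in tokens]


def matches_v1(disease_names, abstract):
    """True if the tokens of any disease name (or synonym) form a subset of
    the tokens of at least one sentence of the abstract."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract)
    sentence_tokens = [set(normalize(s)) for s in sentences]
    for name in disease_names:
        name_tokens = set(normalize(name))
        if name_tokens and any(name_tokens <= s for s in sentence_tokens):
            return True
    return False


abstract = ("We study enzyme replacement in mucopolysaccharidosis (type 2) "
            "patients. Outcomes are reported.")
hit = matches_v1(["Mucopolysaccharidosis type II", "Hunter syndrome"], abstract)
```

Because the comparison is a set inclusion, word order and adjacency are ignored, which keeps the rule simple but allows matches where the name's words are scattered across a sentence.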
Fig 2.
Workflow diagram of NormMap V2
We refined the normalization rules defined for NormMap V1, including removing unnecessary elements such as extra whitespace, HTML tags, and parentheses, and converting hyphens and apostrophes to their Linux-based forms. The WordNetLemmatizer from the NLTK package was used to convert words from plural to singular form. To improve RD matching performance, we applied the spaCy PhraseMatcher [9] to match RD names against the free text, instead of using token-based matching.
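To illustrate why phrase-level matching differs from token-based matching, the following Python sketch contrasts the two on the same text. It is a standard-library stand-in under stated assumptions, not NormMap V2 itself: the naive singular function only approximates what the NLTK WordNetLemmatizer does, the contiguous-sequence test in phrase_match only approximates the behavior of the spaCy PhraseMatcher, and all function names are hypothetical.

```python
import html
import re


def clean(text):
    """Strip HTML, map typographic hyphens/apostrophes to ASCII, lowercase."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)            # remove HTML tags
    text = re.sub(r"[\u2018\u2019]", "'", text)     # curly to straight apostrophe
    text = re.sub(r"[\u2010-\u2015]", "-", text)    # typographic dashes to hyphen
    text = text.replace("'", "").replace("-", " ")  # assumed simplification
    return text.lower()


def singular(token):
    """Crude plural-to-singular rule (assumption; a real lemmatizer is better)."""
    if len(token) > 3 and token.endswith("ies"):
        return token[:-3] + "y"
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]
    return token


def tokens(text):
    # Parentheses and other punctuation are dropped by the token pattern.
    return [singular(t) for t in re.findall(r"[a-z0-9]+", clean(text))]


def subset_match(name, text):
    """Token-based rule: every name token occurs somewhere in the text."""
    name_tokens = set(tokens(name))
    return bool(name_tokens) and name_tokens <= set(tokens(text))


def phrase_match(name, text):
    """Phrase rule: the name tokens occur contiguously and in order."""
    n, t = tokens(name), tokens(text)
    return bool(n) and any(t[i:i + len(n)] == n for i in range(len(t) - len(n) + 1))


scattered = ("Patients with acute kidney injury were compared with "
             "<b>chronic</b> lymphoblastic disease and myeloid leukemias.")
exact = "Children with Acute Lymphoblastic Leukemias (ALL) were enrolled."
name = "acute lymphoblastic leukemia"
```

Here subset_match(name, scattered) is true even though the text never mentions the disease, whereas phrase_match(name, scattered) is false; both rules accept the text in exact.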
B. Data preparation.
We obtained disease names and synonyms for 6,061 GARD RDs from the NCATS GARD Knowledge Graph (NGKG) [10] and applied those disease names and synonyms to search related information from the aforementioned data sources.
B.1. NIH Funded Projects
In May 2022, we downloaded around 2 million projects dated between 1985 and 2021 through NIH ExPORTER.
B.2. PubMed Publications
GARD disease names and synonyms were used to search for PubMed articles published in the last fifty years via the PubMed API [11]. We limited the results to three articles per disease in consideration of the PubMed API usage restrictions.
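A query of this kind could be assembled against the NCBI E-utilities ESearch endpoint roughly as follows. This is a sketch under stated assumptions rather than the query used in the study: the helper name build_esearch_url, the restriction to the Title/Abstract field, the OR-combination of synonyms, and the reference year are all hypothetical. The function only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(names, max_articles=3, years_back=50, current_year=2022):
    """Build an ESearch URL for a disease and its synonyms.

    names: disease name plus synonyms, OR-ed together (assumed strategy).
    max_articles: cap on returned PMIDs (three per disease in the study).
    """
    term = " OR ".join(f'"{n}"[Title/Abstract]' for n in names)
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_articles,                # limit the number of results
        "retmode": "json",
        "datetype": "pdat",                    # filter on publication date
        "mindate": current_year - years_back,
        "maxdate": current_year,
    }
    return f"{ESEARCH}?{urlencode(params)}"


url = build_esearch_url(["Cystinosis", "Cystine storage disease"])
```

Keeping retmax small per disease keeps the total request volume manageable when the query is repeated for several thousand GARD diseases.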
B.3. Clinical Trials
RD related clinical trials [12] were acquired from ClinicalTrials.gov via the ClinicalTrials.gov API [13].
B.4. Reddit Subreddits
The subreddits were obtained from our previous study, in which the Top2Vec model was implemented [14].
III. RESULTS
To create RD profiles, we ran NormMap V2 on all of the collected data and obtained 6,018 clinical trials, 401,361 grants, 3,377 PubMed articles, and 461 subreddits for 3,042 GARD diseases. All of the collected text from the four sources can now be indexed by GARD ID for integrative profile generation to support research.
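Indexing annotated records by GARD ID can be pictured as a simple grouping step. The sketch below assumes that each annotation is available as a (GARD ID, source, record ID) triple; that layout, the source labels, and the function name build_profiles are hypothetical and are not taken from the NormMap V2 implementation.

```python
from collections import defaultdict

SOURCES = ("grants", "trials", "articles", "subreddits")


def build_profiles(annotations):
    """Group (gard_id, source, record_id) triples into one profile per GARD ID."""
    profiles = defaultdict(lambda: {source: [] for source in SOURCES})
    for gard_id, source, record_id in annotations:
        profiles[gard_id][source].append(record_id)
    return dict(profiles)


# The first three records appear in the case study of this paper;
# the last one is a made-up placeholder.
annotations = [
    ("GARD:0000522", "trials", "NCT00566566"),
    ("GARD:0000522", "articles", "PMID:33968771"),
    ("GARD:0000522", "subreddits", "leukemia"),
    ("GARD:0000001", "grants", "placeholder-project-id"),
]
profiles = build_profiles(annotations)
```

With this layout, retrieving everything known about a disease is a single dictionary lookup on its GARD ID, and each source remains separately addressable within the profile.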
A. NormMap V2 performance
The performance of NormMap V2 was compared with that of NormMap V1 on 10,000 funded project abstracts. NormMap V2 achieved a 94% average correct mapping rate in a manual review of 200 of the annotations.
B. Integrative RD profile generation
To aggregate the collected funded projects, articles, clinical trials, and social media posts, we generated 3,042 RD profiles. There are 142 profiles with data from all four sources, 783 with data from three sources, and 975 with data from two sources. Composition of sources applied for RD profile creation is shown in Fig 3.
Fig 3.
Composition of sources applied to RD profile creation
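The breakdown by number of contributing sources can be derived from the profiles with a short counting step, sketched below. The profile layout (a dictionary mapping GARD ID to per-source record lists) and the function name are hypothetical. The final lines work through the reported figures: assuming every profile draws on at least one source, the counts for four, three, and two sources leave 3,042 - (142 + 783 + 975) = 1,142 profiles backed by a single source.

```python
from collections import Counter


def source_count_distribution(profiles):
    """Count how many profiles have data from 1, 2, 3, or 4 sources."""
    return Counter(
        sum(1 for records in profile.values() if records)
        for profile in profiles.values()
    )


# Toy input with made-up record IDs, for illustration only.
toy_profiles = {
    "A": {"grants": [1], "trials": [], "articles": [2], "subreddits": []},
    "B": {"grants": [3], "trials": [], "articles": [], "subreddits": []},
}
toy_distribution = source_count_distribution(toy_profiles)

# Worked arithmetic on the figures reported in the text.
reported = {4: 142, 3: 783, 2: 975}
total_profiles = 3042
single_source = total_profiles - sum(reported.values())
```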
IV. CASE STUDY
To illustrate the use of RD profiles for advancing research, we performed a case study of generating a profile for Acute Lymphoblastic Leukemia (ALL, GARD:0000522), a type of cancer in which the bone marrow makes too many lymphocytes (a type of white blood cell) [15]. The profile contains 3,307 NIH funded projects, 2,512 clinical trials, 1,037 PubMed articles, and 2 subreddits.
Treatments for ALL may include chemotherapy or targeted drugs that specifically kill cancer cells [16]. Berlin-Frankfurt-Munster (BFM) therapy, a chemotherapy regimen, was designed to stratify ALL patients into risk groups based on age, white blood cell count, presence of adverse cytogenetics, and residual leukemia on a bone marrow biopsy following chemotherapy [17]. The ALL profile contains 14 clinical trials, 5 funded projects, 4 PubMed publications, and the “leukemia” subreddit that are related to BFM. Among the 14 BFM related clinical trials, four are in phase III and three are in phase IV. Within the scope of BFM there are several modified protocols that contain different drug regimens for each risk group, one of them being the high-intensity BFM-2000 protocol [18]. By manually reviewing the “leukemia” subreddit, we found posts discussing the side effects of BFM-2000: “So I’m nearly at the end of the induction of my BFM-2000 treatment. So still in hospital. I’ve been doing very very well the last weeks, no severe side effects at all”, “every day my sugar levels drop dangerously low… so they wake me up every hour at night and make me drink/eat sugar but it won’t go up”, “During induction my sugar levels were out of wack from the high dose steroids. When I would get fluids to hydrate if the bag didn’t have dextrose (d5) in it my sugars would drop”. The weakened glucose regulation actively discussed in these Reddit posts appears to be caused by the high dose of steroids (e.g., prednisolone) given with the BFM-2000 protocol, which is a well-known side effect [19].
Associations between glucose deregulation and ALL are also present in the other sources of the ALL profile. The clinical trial NCT00566566 in the ALL profile concludes that the overall survival rate for ALL patients has improved recently, with the downside that more young adult survivors have become susceptible to developing diabetes and obesity. We also identified one NIH funded project in the ALL profile, titled “Insulin Resistance in Children with Acute Lymphocytic Leukemia Undergoing Induction”, which hypothesized that insulin resistance exists prior to the diagnosis of ALL and worsens during therapy. Additionally, one PubMed article in the ALL profile, PMID 33968771, supports the two previous statements by reporting that adipocytes, which are major energy storage sites in the body [20], play an active role in shifting leukemia cell metabolism from glucose to free fatty acid oxidation. Taken together, the pooled information from clinical trials, funded grants, and peer-reviewed studies will help researchers design and test hypotheses with more confidence.
V. DISCUSSION
To address the challenge of data shortage and scatter for RD, we introduced RD profiles by integrating different types of data via a disease annotation tool, NormMap V2.
To improve on the performance of NormMap V1, we developed NormMap V2 with extensions. The token-based matching in NormMap V1 was prone to producing false positives because it labeled a text as containing a match if all the tokens of the RD name were found anywhere within the text, regardless of context. Applying the spaCy PhraseMatcher in NormMap V2 effectively resolved this context related matching problem.
As demonstrated in the case study, the four different resources provide complementary information for the ALL profile and broaden the understanding of ALL from different perspectives. Because this study is a proof of concept, the data from the different resources have not yet been integrated in a structured form that would allow insights to be identified automatically from the RD profiles; this will be the next step.
VI. CONCLUSION
In this study, we introduced an integrative RD profile with data from four different resources by applying NormMap V2. We expect that RD profiles will allow researchers to accelerate their work by providing an overview of the landscape of research efforts and helping to identify research gaps in rare diseases.
ACKNOWLEDGEMENT
This research was supported in part by the Intramural (ZIA TR000410-03) and Extramural research program at the NCATS, NIH.
Contributor Information
Devon Leadman, Division of Rare Diseases Research Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, MD.
Sue Qu, Division of Rare Diseases Research Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, MD.
Yanji Xu, Division of Rare Diseases Research Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Bethesda, MD.
Qian Zhu, Division of Preclinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Rockville, MD.
REFERENCES
- [1] Mitani A, Haneuse S, Small data challenges of studying rare diseases, JAMA Network Open, 3(3) (2020). 10.1001/jamanetworkopen.2020.1965.
- [2] Halley M, Smith H, Ashley E, Goldenberg A, Tabor H, A call for an integrated approach to improve efficiency, equity and sustainability in Rare disease research in the United States, Nature Genetics, 54(3) (2022) 219–222. 10.1038/s41588-022-01027-w.
- [3] Rare diseases, common challenges, Nature Genetics, 54(3) (2022) 215–215. 10.1038/s41588-022-01037-8
- [4] Rebholz-Schuhmann D, Oellrich A, Hoehndorf R, Text-mining solutions for Biomedical Research: Enabling Integrative Biology, Nature Reviews Genetics, 13(12) (2012) 829–839. 10.1038/nrg3337
- [5] Young S, Behavioral insights on big data: Using social media for predicting biomedical outcomes, Trends in Microbiology, 22(11) (2014) 601–602. 10.1016/j.tim.2014.08.004
- [6] Aronson A, Lang F, An overview of MetaMap: historical perspective and recent advances, Journal of the American Medical Informatics Association, 17(3) (2010) 229–236. 10.1136/jamia.2009.002733
- [7] Jia J, An Z, Ming Y, Guo Y, Li W, Liang Y, Guo D, Li X, Tai J, Chen G, Jin Y, Liu Z, Ni X, Shi T, eRAM: encyclopedia of rare disease annotations for precision medicine, Nucleic Acids Research, 46(D) (2018) D937–D943. 10.1093/nar/gkx1062
- [8] Tang C, Xu Y, Zhu Q, Data Normalization Improves Semantic Annotation – a Case Study of Rare Disease Name Annotation, 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), (2021) 2609–2611. 10.1109/BIBM52615.2021.9669475
- [9] PhraseMatcher · Spacy API Documentation. PhraseMatcher. https://spacy.io/api/phrasematcher (accessed September 12, 2022)
- [10] Zhu Q, Nguyen D, Grishagin I, Southall N, Sid E, Pariser A, An integrative knowledge graph for rare diseases, derived from the genetic and rare diseases information center (gard), Journal of Biomedical Semantics, 11(1) (2020). 10.1186/s13326-020-00232-y
- [11] The E-utilities in-depth: Parameters, syntax and more. https://www.ncbi.nlm.nih.gov/books/NBK25499/ (accessed September 12, 2022)
- [12] Studies by topic. Studies By Topic - ClinicalTrials.gov. https://www.clinicaltrials.gov/ct2/search/browse?brwse=ord_alpha_all (accessed September 12, 2022)
- [13] ClinicalTrials.gov API, Clinicaltrials.gov - API home. https://www.clinicaltrials.gov/api/gui (accessed September 12, 2022)
- [14] Karas B, Qu S, Xu Y, Zhu Q, Experiments with LDA and top2vec for embedded topic discovery on social media data—a case study of Cystic Fibrosis, Frontiers in Artificial Intelligence, 5 (2022). 10.3389/frai.2022.948313
- [15] “Acute lymphoblastic leukemia - about the disease,” Genetic and Rare Diseases Information Center. [Online]. Available: https://rarediseases.info.nih.gov/diseases/522/acute-lymphoblastic-leukemia. [Accessed: 25-Oct-2022].
- [16] Mayo Foundation for Medical Education and Research, Acute lymphocytic leukemia, Mayo Clinic, (2021). https://www.mayoclinic.org/diseases-conditions/acute-lymphocytic-leukemia/symptoms-causes/syc-20369077 (accessed September 12, 2022)
- [17] Chang J, Medlin S, Kahl B, Longo W, Williams E, Lionberger J, Kim K, Kim J, Esterberg E, Juckett M, Augmented and standard Berlin–Frankfurt–Münster chemotherapy for treatment of adult acute lymphoblastic leukemia, Leukemia & Lymphoma, 49(12) (2008) 2298–2307. 10.1080/10428190802517732
- [18] Acute lymphoblastic leukaemia: BFM 20001 schema - eviq. https://www.eviq.org.au/getmedia/97eef717-e611-4e6a-a639-1e4b6834c940/BFM-2000-treatment-schema.pdf.aspx (accessed September 13, 2022)
- [19] Prednisolone side effects: Common, severe, long term, Drugs.com. https://www.drugs.com/sfx/prednisolone-side-effects.html (accessed September 13, 2022)
- [20] “Adipocytes,” UMass Chan Medical School, 03-Mar-2022. [Online]. Available: https://www.umassmed.edu/guertinlab/research/adipocytes/. [Accessed: 25-Oct-2022].