Clinical and Translational Science. 2022 Jun 6;15(8):1838–1847. doi: 10.1111/cts.13301

Progress toward a universal biomedical data translator

Karamarie Fecho 1,†,, Anne E Thessen 2,, Sergio E Baranzini 3, Chris Bizon 1, Jennifer J Hadlock 4, Sui Huang 4, Ryan T Roper 4, Noel Southall 5, Casey Ta 6, Paul B Watkins 7, Mark D Williams 5, Hao Xu 1, William Byrd 8, Vlado Dančík 9, Marc P Duby 10, Michel Dumontier 11, Gustavo Glusman 4, Nomi L Harris 12, Eugene W Hinderer 13, Greg Hyde 14, Adam Johs 15, Andrew I Su 16, Guangrong Qin 4, Qian Zhu 5; The Biomedical Data Translator Consortium, Jennifer Dougherty, Conrad Huang, Andrew Magis, Brett Smith, Remzi Celebi, Zhehuan Chen, Ricardo De Miranda Azevedo, Vincent Emonet, Jay Lee, Chunhua Weng, Arif Yilmaz, Keum Joo Kim, Eugene Santos, Lucas Tonstad, Luke Veenhuis, Chase Yakaboski, Liliana Acevedo, Steven Carrell, Eric Deutsch, Amy Glen, Andrew Hoffman, David Koslicki, Lindsey Kvarfordt, Zheng Liu, Shaopeng Liu, Chunyu Ma, Luis Mendoza, Arun Teja Muluka, Finn Womack, Erica Wood, Jared Roach, Prateek Goel, Rosina Weber, Andrew Williams, Joseph Gormley, Tom Zisk, Kristina Hanspers, Maureen Hoatlin, Alexander Pico, Anders Riutta, Jackson Callaghan, Colleen Xu, Stanley C Ahalt, Jim Balhoff, Stephen Edwards, Perry Haaland, Michael Knowles, Ashok Krishnamurthy, Meisha Mandal, David B Peden, Emily Pfaff, Shepherd Schurman, Shalki Shrivastava, Hong Yi, Jason Reilly, Richa Kanwar, Steven Cox, Gaurav Vaidya, Max Wang, Ahmed Alkanaq, Maria Costanzo, Ryan Koesterer, Jason Flannick, Noel Burtt, Alexandria Kluge, Irit Rubin, Michael “ Michi” Strasser, Lawrence Chung, Jimin Kang, Michelle Mantilla, Sandrine Muller, Bria Persaud, Qi Wei, Andrew Baumgartner, Cheng Dai, Venkata Duvvuri, Denise Mauldin, Ilya Shmulevich, Namdi Brandon, Alon Greyber, Yaphet Kebede, Daniel Korn, Abrar Mesbah, Phil Owen, Rayn Sakaguchi, Sarah Seitanakis, Alexander Tropsha, Adam Viola, Robert Hubal, Marian Mersmann, Kenny Morton, Yao Yao, Jason Lin, Ricardo Avila, Chunlei Wu, Marco Alvarado Cano, Vicki Gardner, Tursynay Issabekova, Julie McMurry, Kevin Schaper, William Baumgartner, Kevin Cohen, Edgar 
Gatica, Lawrence Hunter, Guthrie Price, Kaiwen He, Jeff Henrickson, Tarun Mamidi, Matthew Might, John Osborne, Michael Patton, Greg Rosenblatt, Thi Tran‐Nguyen, Andrew Crouse, Basazin Belhu, Tom Conlin, Kenneth Huellas‐Bruskiewicz, Nathaniel Rudavsky‐Brody, Manil Shrestha, Lisa Stillwell, Marcin von Grotthuss, Patrick Wang, Jiwen Kevin Xin, Xinghua Zhou, James Champion, Erik Scott, Priya Sharma, Meghamala Sinha, Shruti Raj, Philip Mease, R Carter Peene, Jason McClelland, Charles P Schmitt, Margaret Leigh, Dan Corkill, Eric Zhou, John Alden, Jeffrey Massung, Mac Kenzie Brandes, Nada Amin, Mei‐Jan Chen, Camerron Crowder, Mary E Crumbley, Nathaniel Fehrmann, Aleksandra M Foksinska, Lindsay Jenkins, Kaiwen He, Forest B Huls, Matthew Jarrell, Elizabeth Pollard, Sienna Rucka, Nicholas Southern, Jillian Tinglin, Jordan Whitlock, Marissa Zheng
PMCID: PMC9372428  PMID: 35611543

Abstract

Clinical, biomedical, and translational science has reached an inflection point in the breadth and diversity of available data and the potential impact of such data to improve human health and well‐being. However, the data are often siloed, disorganized, and not broadly accessible due to discipline‐specific differences in terminology and representation. To address these challenges, the Biomedical Data Translator Consortium has developed and tested a pilot knowledge graph‐based “Translator” system capable of integrating existing biomedical data sets and “translating” those data into insights intended to augment human reasoning and accelerate translational science. Having demonstrated feasibility of the Translator system, the Translator program has since moved into development, and the Translator Consortium has made significant progress in the research, design, and implementation of an operational system. Herein, we describe the current system’s architecture, performance, and quality of results. We apply Translator to several real‐world use cases developed in collaboration with subject‐matter experts. Finally, we discuss the scientific and technical features of Translator and compare those features to other state‐of‐the‐art, biomedical graph‐based question‐answering systems.

INTRODUCTION

The breadth and diversity of biomedical data available today hold great promise in the application of such data into actionable outcomes aimed at accelerating translational science and ultimately improving human health and well‐being. Indeed, advancements in computing and storage capabilities have fostered a wealth of large data sets across translational domains. Translational scientists now have unprecedented access to data and knowledge on genes, biological pathways, chemicals, metabolites, drugs, diseases, environmental exposures, clinical healthcare records, and more. However, the inherent power of the available data has not been fully harnessed due to long‐recognized challenges related to the compartmentalization of data into separate domains, the lack of widely adopted standards or the adoption of standards that are domain‐specific, and noncompliance with the principles of findability, accessibility, interoperability, and reusability (FAIR). 1

The Biomedical Data Translator program (“Translator program”) was launched in Fall 2016 by the National Center for Advancing Translational Sciences (NCATS) in an effort to overcome the many challenges that have long hindered translational science. The vision of the Translator program is to augment human reasoning and accelerate scientific discovery “through an informatics platform that enables interrogation of relationships across the full spectrum of data types.” 2 To achieve this goal, NCATS rapidly and adeptly established a diverse community of nearly 200 basic and clinical scientists, informaticians, ontologists, software developers, and practicing clinicians distributed over 11 teams and 28 institutions to form the Biomedical Data Translator Consortium (“Translator Consortium”). The Translator Consortium adheres to several core principles that have allowed the program to make considerable progress toward a shared vision: namely, team science; a bottom‐up management approach; and open‐source community‐contributed software development. (See Figure S1 for complete timeline and notable milestones.)

The Translator Consortium last reported on the program in two 2019 publications. 3 , 4 The aim of this review is to provide an update on the Translator program. We first review approaches for knowledge representation in translational science. We then describe the technical solution that the Translator program has converged on. We demonstrate real‐world use‐case applications of the prototype Translator system (“Translator”). Finally, we end with a discussion of next steps and a comparison between Translator and similar systems.

KNOWLEDGE REPRESENTATION IN TRANSLATIONAL SCIENCE

“Knowledge” versus “data”

The distinction between “knowledge” and “data” is most often captured as the data‐to‐information‐to‐knowledge‐to‐wisdom transformation or DIKW pyramid. 5 Although the origins of this hierarchical representation model are uncertain, and other knowledge representations exist, 6 the DIKW framework has been widely used in fields like information science, communications science, and library science. Within this hierarchical framework, data are viewed as abundant and characterized as discrete objective facts or observations; information is considered to be assertions derived from data and intended to provide interpretation of the data; knowledge is viewed as generally accepted, universal assertions derived from the accumulation of information; and wisdom is considered to be the most abstract layer of understanding derived from assertions and insights into acquired knowledge. 7

Approaches for knowledge representation

Application of the conceptual DIKW framework has focused primarily on knowledge discovery, or the systematic process whereby observations or data are organized and interpreted into information that is then scrutinized or tested in the context of existing knowledge, with any subsequent assertions disseminated for peer consensus and adjudication before being accepted as new knowledge. Approaches for knowledge discovery date back to ancient times and form the foundation of the scientific method. 8 Approaches for knowledge representation likewise date back to ancient times. 8 Early forms of modern peer‐reviewed publication represent one approach to knowledge representation that remains in use today.

Knowledge graphs

In recent years, “knowledge graphs” (KGs) have become a common approach for knowledge representation in a variety of fields. 9 , 10 In a KG, entities or data types are represented as nodes and connected to each other by edges with predicates that describe the relationship between entities. A “schema” is used to constrain the KG by specifying how knowledge can be represented; as such, it provides a framework for validating specific instances of knowledge representation through rules that dictate the syntax and semantics. KGs allow users to pose questions that can then be translated into query graphs and applied to identify subgraphs within the KG that match the general structure of the query graph, thereby producing answers to user queries and generating new knowledge. 11 KGs have had many successful applications, with Google’s KG 10 perhaps the most widely known.
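The ideas in the preceding paragraph (nodes, predicate-labeled edges, a constraining schema, and a query graph matched against the KG) can be made concrete with a minimal sketch. Everything here is a toy: the category and predicate names are illustrative rather than Biolink Model terms, the helper names `validate` and `match_one_hop` are hypothetical, and the three edges are taken from statements made elsewhere in this article.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    subject: str
    predicate: str
    object: str


# Node categories for a toy graph (illustrative names, not Biolink Model terms).
CATEGORY = {
    "ibuprofen": "Drug",
    "lidocaine": "Drug",
    "chronic pain": "Disease",
    "LRRK2": "Gene",
    "Parkinson's disease": "Disease",
}

# A toy schema: the only (subject category, predicate, object category) shapes allowed.
SCHEMA = {
    ("Drug", "treats", "Disease"),
    ("Gene", "associated_with", "Disease"),
}


def validate(edge: Edge) -> bool:
    """Check one edge against the schema."""
    shape = (CATEGORY[edge.subject], edge.predicate, CATEGORY[edge.object])
    return shape in SCHEMA


EDGES = [
    Edge("ibuprofen", "treats", "chronic pain"),
    Edge("lidocaine", "treats", "chronic pain"),
    Edge("LRRK2", "associated_with", "Parkinson's disease"),
]
assert all(validate(e) for e in EDGES)


def match_one_hop(subject_category: str, predicate: str, object_id: str) -> list[str]:
    """Answer the query graph (?x: subject_category) -predicate-> object_id."""
    return [
        e.subject
        for e in EDGES
        if e.predicate == predicate
        and e.object == object_id
        and CATEGORY[e.subject] == subject_category
    ]


answers = match_one_hop("Drug", "treats", "chronic pain")
print(answers)
```

The query graph has the same shape as the edges it matches, with one node left unbound; each binding of that node is one answer.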

THE TRANSLATOR SOLUTION

The Translator Consortium has adopted a federated KG‐based approach for biomedical knowledge representation and discovery (Figure 1).

FIGURE 1.

Overview of the Translator architecture. Note that while the high‐level architecture depicted in the figure is accurate, certain components may deviate slightly from the architecture in their approach to implementation. Abbreviations: SRI, Standards and Reference Implementation; TRAPI, Translator Reasoner Application Programming Interface. (Graphic prepared by Kelsey Urgo).

Translator comprises four main components: Knowledge Providers (KPs); Autonomous Relay Agents (ARAs); an Autonomous Relay System (ARS); and a Standards and Reference Implementation Component (SRI).

The objective of KPs is to contribute domain‐specific, high‐value information abstracted from one or more underlying “knowledge sources,” which may be raw data as defined by the DIKW framework or information that has been abstracted from the data. ARAs build upon the knowledge contributed by KPs by way of reasoning and inference and in response to user‐defined queries. In addition, ARAs may independently expose information abstracted from data. The ARS functions as a central relay station between ARAs and broadcasts user queries to the ARAs. The SRI services are responsible for the development, implementation, and community adoption of the standards needed to achieve the overall goals of the Translator Consortium.

Translator leverages integrated data from over 250 knowledge sources, each exposed via open application programming interfaces (APIs). The knowledge sources include, among others, highly curated biomedical databases such as Comparative Toxicogenomics Database, 12 and ontologies such as Mondo, the Monarch Disease Ontology. 13

In addition, Translator openly exposes data derived from several electronic health record (EHR) systems, clinical registries, and clinical studies, from which future medical knowledge can be generated: Columbia University Irving Medical Center; UNC Health; the nonprofit Providence Health System; the Drug Induced Liver Injury Network (DILI Network); the Personalized Environment and Genes Study within the National Institute of Environmental Health Sciences; the Institute for System Biology’s Wellness cohort; and select cancer cohorts from within The Cancer Genome Atlas. Of importance, the Translator clinical KPs do not expose raw clinical data, but rather aggregated or semi‐aggregated data and statistical associations or machine learning predictions derived from clinical data, in full compliance with all federal and institutional regulations. 14

The Translator Consortium has adopted several tools and approaches to support standardization, harmonization, and interoperability across the diverse Translator system. First, all Translator services are accessible via APIs. The APIs are standardized in their metadata, structure, and operations using the Translator Reasoner API (TRAPI) standard, 15 which defines a standard HTTP protocol for transmitting queries and receiving answers, with both structured as graphs. Second, all Translator services are registered in the SmartAPI registry, 16 thus adhering to FAIR principles. Third, the open‐source Biolink Model 17 , 18 , 19 , 20 provides an upper‐level graph‐oriented universal schema that facilitates semantic harmonization and reasoning across disparate knowledge sources.

With these standards in place, users can query across the numerous data sources that are accessible via the federated Translator system. To demonstrate, we provide a simple example. Suppose a user asks: what chemical entities treat chronic pain? The user is thus asking about approved drugs and other chemicals that may treat chronic pain. To answer this question, the user question must first be translated into a TRAPI‐compliant directed query graph, structured in JSON format, with Biolink Model node and edge types specified and a compact uniform resource identifier (CURIE) used to constrain one node (Figure 2).

FIGURE 2.

An example of a natural language question translated into a TRAPI directed query graph in JSON format. (a) the natural language question: what chemical entity(ies) treats chronic pain? (b) the natural language question represented as an object‐predicate‐subject “triple.” (c) the TRAPI query that was executed by Translator. TRAPI, Translator Reasoner Application Programming Interface. (Graphic prepared by Kelsey Urgo).

In this query, “chronic pain” is specified as a biolink:Disease type node n0 with the CURIE HP:0012532, which is defined by the Human Phenotype Ontology as “chronic pain.” A second node n1 is specified only as a biolink:ChemicalEntity type. Nodes n0 and n1 are related by an edge with the relation defined by a predicate specified as biolink:treats. The query graph is thus structured to ask what chemical entity(ies) treats chronic pain? The query graph is then sent to the ARS, which parses the query and distributes it to the ARAs. The ARAs then distribute it to those KPs that have provided a meta‐graph within the SmartAPI registry indicating that they are able to respond to queries of this type. The ARAs may apply a variety of sophisticated reasoning and inference algorithms to the answers returned by the KPs, including different approaches for ranking and scoring answers such as weighting by supporting publications or abstract co‐occurrence of subject and object nodes. Finally, the ARS compiles the ARA results for the user.
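The query graph described above can be sketched in Python as a nested dictionary that serializes to JSON. The node identifiers, node types, CURIE, and predicate come from the text; the surrounding field names (`message`, `query_graph`, `ids`, `categories`, `predicates`) follow our understanding of the TRAPI 1.x layout and should be checked against the TRAPI release in use. The edge key `e0`, the helper name `build_one_hop_query`, and the choice of the chemical entity as the edge subject are assumptions, so the exact JSON in Figure 2c may differ.

```python
import json


def build_one_hop_query(object_curie, object_category, subject_category, predicate):
    """Build a one-hop query graph: (?n1: subject_category) -predicate-> n0 (fixed CURIE)."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    # n0 is pinned to a single identifier.
                    "n0": {"ids": [object_curie], "categories": [object_category]},
                    # n1 is constrained only by type; its bindings are the answers.
                    "n1": {"categories": [subject_category]},
                },
                "edges": {
                    # Edge key "e0" is an arbitrary label chosen for this sketch.
                    "e0": {"subject": "n1", "object": "n0", "predicates": [predicate]},
                },
            }
        }
    }


query = build_one_hop_query(
    object_curie="HP:0012532",
    object_category="biolink:Disease",
    subject_category="biolink:ChemicalEntity",
    predicate="biolink:treats",
)
print(json.dumps(query, indent=2))
```

Leaving `n1` without an `ids` field is what makes this a question rather than an assertion: any chemical entity that a KP can connect to `n0` by the stated predicate is a candidate answer.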

A review of the answers to the query finds expected answers such as oxycodone, hydrocodone, codeine, lidocaine, and ibuprofen. There are also answers that are accurate but may not be responsive to the user’s query such as methadone, which is used to treat opioid dependence, 21 and caffeine, which is an adjuvant in certain pain medicine formulations. 22 In addition, the answer set includes perhaps unexpected answers such as naloxone and naltrexone, which are opioid antagonists. An examination of the evidence and provenance that Translator returns in support of these answers identifies publications in the form of PubMed identifiers (PMIDs), with links to PubMed abstracts that suggest that these compounds may be effective in the treatment of chronic pain conditions such as fibromyalgia and inflammatory bowel conditions (Figure 3). Although a pain specialist may not find these findings surprising, many users likely would be surprised to find that there are cases in which an opioid antagonist is beneficial in the treatment of pain, for which opioid agonists are often administered.

FIGURE 3.

Screenshots demonstrating an example of Translator evidence and provenance in support of naltrexone hydrochloride as an answer to the query in Figure 2.

APPLICATION USE CASES

The chronic pain use case illustrates basic Translator functionalities in the context of a simple “one‐hop” Translator query (i.e., two nodes connected by one edge) and the types of insights and discoveries that the Translator Consortium intends to achieve. Here, we provide an overview of three additional use cases (Figure 4).

FIGURE 4.

Schematic of three generalizable Translator workflows applied to support specific use‐case queries on (a) immune‐mediated inflammatory diseases, (b) Crohn's–Parkinson's disease relationship, and (c) drug‐induced liver injury. (Graphic prepared by Kelsey Urgo).

Explore: Immune‐mediated inflammatory diseases

The immune‐mediated inflammatory disease (IMID) use case was motivated by an interdisciplinary team that was interested in learning more about immunomodulatory drugs that are used to treat IMIDs, including systemic sclerosis, which is a spectrum of rare diseases involving excess collagen that can lead to fibrosis of the skin and/or internal organs. The team was interested in many classes of drugs, including Janus kinase inhibitors (JAK‐Is), which have been suggested in the literature as a potential treatment for systemic sclerosis. The team thus approached the Translator Consortium with the following question: what real‐world evidence is there for the use of JAK‐Is in patients with systemic sclerosis?

Structured EHR data do not track the condition for which a medication is prescribed to a given patient. An investigator can examine co‐occurrence rates between diagnoses and medications, but those rates can be deceptive due to the prevalence of commonly prescribed drugs such as acetaminophen among the general population. Translator clinical KPs have overcome this limitation of EHR data by allowing users to openly explore both co‐occurrence rates and relative frequencies of medications, as well as information on whether a medication is contemporaneously predictive for a given disease or phenotype, thus provisioning informative EHR data and assertions without regulatory hurdles.

In this case, the Translator Consortium approached the user’s question by executing a one‐hop query that targeted Translator clinical KPs (Figure 4a). They first queried on a set of multiple IMIDs simultaneously. Translator answer sets comprised between 360 and 905 specific answers each and included drugs commonly used to treat IMIDs such as methotrexate and dexamethasone. A subsequent query focused specifically on the IMID systemic sclerosis. For this more restrictive query, Translator answer sets comprised between 128 and 366 specific answers each, including expected results such as mycophenolate, cyclophosphamide, and rituximab. Real‐world evidence also was returned in the answer sets. For example, the observed‐expected frequency ratio for co‐occurrence of mycophenolate and systemic sclerosis was 3.91 (99% confidence interval: 3.67–4.11). When examining JAK‐Is, Translator found evidence of co‐occurrence in patients with systemic sclerosis, although the results were not among the top‐ranked answers. However, Translator reported that the JAK‐I tofacitinib was predictive of systemic sclerosis in a real‐world logistic regression model, indicating that JAK‐Is have been prescribed to certain patients with systemic sclerosis. In addition, Translator provided PubMed abstracts suggesting mechanisms by which JAK‐Is might treat systemic sclerosis, including evidence from mouse models and case studies. One example publication was titled: “Generation of a novel CD30+ B cell subset producing GM‐CSF and its possible link to the pathogenesis of systemic sclerosis.” 23
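To illustrate why an observed-expected frequency ratio is more informative than a raw co-occurrence count, the following sketch computes the ratio from patient counts. The text does not state the exact formula used by the Translator clinical KPs, so the definition below (observed co-occurrence count divided by the count expected if drug and disease were independent) is an assumption, as are the function name and all counts, which are hypothetical. A service might equally report the logarithm of this ratio.

```python
def observed_expected_ratio(n_both, n_drug, n_disease, n_total):
    """Observed co-occurrence count divided by the count expected under independence.

    Assumed definition: expected = n_drug * n_disease / n_total.
    """
    if min(n_drug, n_disease, n_total) <= 0:
        raise ValueError("n_drug, n_disease, and n_total must be positive")
    expected = n_drug * n_disease / n_total
    return n_both / expected


# Hypothetical counts in a population of 100,000 patients, 1,000 of whom have the disease.
# A very commonly prescribed drug co-occurs with the disease often, but only as often
# as chance predicts.
common_drug = observed_expected_ratio(n_both=500, n_drug=50_000, n_disease=1_000, n_total=100_000)

# A rarely prescribed drug co-occurs less often in absolute terms, but far more
# often than chance predicts.
specific_drug = observed_expected_ratio(n_both=80, n_drug=2_000, n_disease=1_000, n_total=100_000)

print(common_drug, specific_drug)
```

With these made-up counts the common drug has the larger raw co-occurrence (500 versus 80 patients) but a ratio of 1.0, whereas the specific drug has a ratio of 4.0, which is the kind of signal the ratio is meant to surface.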

The investigative team is now using Translator to further explore mechanistic evidence connecting JAK‐Is and IMID disease processes.

Explain: Crohn’s disease and Parkinson’s disease

This use case was motivated by clinical observations that patients with Crohn’s disease are at higher risk of Parkinson’s disease—two apparently unrelated diseases. Specifically, the investigative team approached the Translator Consortium with the following question: why do patients with Crohn’s disease have a higher risk of developing Parkinson’s disease?

The Translator Consortium addressed this question by constructing a two‐hop query that sought biomedical entities that might be shared by both Crohn’s disease and Parkinson’s disease (Figure 4b). The query was structured with two specified biolink:Disease nodes, each connected to an unspecified biolink:NamedThing node (i.e., a root class for all things and informational relationships).
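A two-hop query of this shape can be sketched as follows. The text gives the node types but not the CURIEs, predicates, or edge directions, so the identifiers passed in below are explicit placeholders rather than real ontology terms, the edges carry no predicate constraint, and the edge direction is an arbitrary choice for this sketch. The field names follow our understanding of the TRAPI 1.x layout, and `build_shared_neighbor_query` is a hypothetical helper.

```python
def build_shared_neighbor_query(disease_a_curie, disease_b_curie):
    """Two pinned Disease nodes, each linked to the same unpinned NamedThing node."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [disease_a_curie], "categories": ["biolink:Disease"]},
                    # The shared, unspecified node: any entity type may bind here.
                    "n1": {"categories": ["biolink:NamedThing"]},
                    "n2": {"ids": [disease_b_curie], "categories": ["biolink:Disease"]},
                },
                "edges": {
                    # No "predicates" key: any relationship is accepted (assumption).
                    "e0": {"subject": "n0", "object": "n1"},
                    "e1": {"subject": "n2", "object": "n1"},
                },
            }
        }
    }


# Placeholder identifiers; substitute the CURIEs actually used for the two diseases.
query = build_shared_neighbor_query("EXAMPLE:crohns_disease", "EXAMPLE:parkinsons_disease")
shared_node_edges = [e for e in query["message"]["query_graph"]["edges"].values() if e["object"] == "n1"]
print(len(shared_node_edges))
```

Typing the middle node as the root class is what keeps the query open-ended: an answer is any single entity that is connected to both diseases.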

Due to the open structure of the query, Translator returned a variety of biomedical entities, including genes, diseases, chemicals, and drugs. Five genes were found to be associated with both Crohn’s disease and Parkinson’s disease, supporting the initial observation and suggesting at least partial common susceptibility pathways between these diseases. The identified genes were: LRRK2 (leucine rich repeat kinase 2); PARK7 (Parkinsonism associated deglycase); NOD2 (nucleotide binding oligomerization domain containing 2); GPR65 (G protein‐coupled receptor 65); and MUC19 (mucin 19). Moreover, Translator provided quantitative publication support for each gene’s involvement in both Crohn’s disease and Parkinson’s disease. In addition to genes, Translator found that the antibiotic rifaximin was associated with both diseases. Whereas the association between rifaximin and Crohn’s disease was not surprising to the investigative team, given that antibiotics are used to control bacterial overgrowth in patients with Crohn’s disease, 24 the association between rifaximin and Parkinson’s disease was surprising. In fact, Translator provided publication support showing that rifaximin reduced motor fluctuations in a small clinical trial on Parkinson’s disease, with a publication titled: “Small intestinal bacterial overgrowth in Parkinson’s disease: tribulations of a trial.” 25

The investigative team had expected LRRK2 to be among the answers returned to the query, so the fact that this gene was indeed returned gave the team confidence in the accuracy and sensitivity of Translator answers. It also convinced them that a convergence of evidence, even if modest, such as the evidence exposed in this use case, can guide the emergence of unknown or unconventional KG paths and thereby assist with the identification of new treatment approaches to disease. The investigative team now plans to take a deeper dive into the supporting evidence and generate new queries to determine if there are common biological processes that might explain how these shared genes contribute to two diseases that were not previously thought to be related. The team also plans to search for additional data sources to incorporate into Translator, including specialized data sources on gene expression, functional genomics, and pharmacogenomics.

Repurpose: Drug‐induced liver injury

The DILI use case was motivated by shared interests between the Translator Consortium and the DILI Network. A high priority for the DILI Network, which is the longest running cohort‐based study funded by the National Institutes of Health, is to initiate a DILI clinical trial. This priority is motivated by the fact that the only consensus treatment for DILI is to discontinue the causal agent, leaving patients with few therapeutic options until the drug injury resolves, and leaving underlying diseases and conditions untreated. DILI Network investigators have been unable to identify a suitable therapeutic, namely, one that is generally safe, with sufficient biological justification to support a clinical trial.

Hence, one of the investigators of the DILI Network approached the Translator Consortium with this goal in mind. The specific question was: what drug candidate(s) might be repurposed for the treatment of DILI, and is there sufficient biological plausibility to justify the use of that candidate(s) in a clinical trial?

The Translator Consortium approached this question with a two‐fold solution (Figure 4c): (1) implement a complex asynchronous three‐hop query to identify candidate drugs, leveraging the knowledge provided by Translator clinical KPs; and then (2) implement a simple one‐hop query to find additional support for any candidate drugs thus identified, leveraging the real‐world and curated knowledge provided by all KPs.

Translator successfully executed both queries and identified two candidate drugs, both antioxidants that are available over‐the‐counter and in prescription formulation: resveratrol and quercetin. Translator provided additional evidence to justify the use of these candidates in a clinical trial, including: the identification of intermediary genes that suggest biological plausibility; evidence of effectiveness in rodent models of DILI; and clinical trial precedence in other diseases and conditions such as chronic obstructive pulmonary disease. Moreover, Translator provided real‐world evidence that these drugs are prescribed to patients.

To exemplify the knowledge and data that Translator reasons over, we highlight the answers and additional evidence that Translator provided in support of quercetin. Specifically, for the initial three‐hop query, Translator provided real‐world evidence that DILI is associated with a variety of other diseases, including autoimmune hepatitis, psoriasis, and osteoarthritis. One answer subgraph indicated that toxic liver disease (equivalent to DILI) co‐occurs with infectious bacterial disease with sepsis in patients, with an observed‐expected frequency ratio of 4.48 (99% confidence interval: 3.63–5.00). Tumor necrosis factor (TNF), a proinflammatory cytokine, was identified as the gene in the path between infectious bacterial disease with sepsis and quercetin, with Translator indicating that the evidence was derived from a resource called SemMedDB. Translator provided more than two dozen publications, including PubMed abstracts, supporting a relationship between TNF and quercetin, with most publications derived from primary rodent studies. The first publication was titled: “Quercetin inhibits LPS‐induced nitric oxide and tumor necrosis factor‐alpha production in murine macrophages”; and the abstract suggests that quercetin inhibits TNF. 26 The second one‐hop query then asked for additional evidence related to quercetin. Translator provided evidence that quercetin is effective in the treatment of DILI, drug‐induced dyskinesia, and drug‐related side effects and adverse reactions in rodent models. The first publication 27 in one answer subgraph was titled: “Involvement of P450s and nuclear receptors in the hepatoprotective effect of quercetin on liver injury by bacterial lipopolysaccharide”; and the abstract contained the sentence: “In this study, we used liposomal nanoparticles to entrap quercetin and evaluated its protective and therapeutic effects on drug‐induced liver injury in rats.” In addition, Translator provided real‐world evidence that quercetin was prescribed to patients with a variety of diseases, including allergic rhinitis, with an observed‐expected frequency ratio of 2.24 (99% confidence interval: 1.30–2.79). Moreover, Translator provided evidence that quercetin is in clinical trials as a treatment for chronic obstructive pulmonary disease, 28 thus establishing precedence for a clinical trial on DILI.
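The three-hop answer subgraph described above (toxic liver disease, a co-occurring disease, an intermediary gene, and a candidate chemical) can be sketched as a typed path search over a toy graph. The four edges reflect relationships named in this section; the edge predicates are omitted because the text does not state them, the category labels are illustrative, and the functions `neighbors` and `find_paths` are hypothetical helpers, not part of Translator.

```python
# Categories for the entities named in the quercetin example (illustrative labels).
CATEGORY = {
    "toxic liver disease": "Disease",
    "infectious bacterial disease with sepsis": "Disease",
    "allergic rhinitis": "Disease",
    "TNF": "Gene",
    "quercetin": "ChemicalEntity",
}

# Undirected toy edges; predicates are left out because the text does not give them.
EDGES = [
    ("toxic liver disease", "infectious bacterial disease with sepsis"),
    ("infectious bacterial disease with sepsis", "TNF"),
    ("TNF", "quercetin"),
    ("quercetin", "allergic rhinitis"),
]


def neighbors(node):
    """Yield every node that shares an edge with `node`."""
    for a, b in EDGES:
        if a == node:
            yield b
        elif b == node:
            yield a


def find_paths(start, category_pattern):
    """Extend paths from `start` one hop per entry in `category_pattern`.

    A path survives a hop only if the next node has the wanted category
    and has not already been visited on that path.
    """
    paths = [[start]]
    for wanted in category_pattern:
        paths = [
            path + [nxt]
            for path in paths
            for nxt in neighbors(path[-1])
            if CATEGORY[nxt] == wanted and nxt not in path
        ]
    return paths


# Three hops: Disease -> Disease -> Gene -> ChemicalEntity.
paths = find_paths("toxic liver disease", ["Disease", "Gene", "ChemicalEntity"])
for path in paths:
    print(" -> ".join(path))
```

The intermediate nodes on each surviving path are what the text calls intermediary evidence: they are returned with the answer so that a user can judge biological plausibility rather than accept the candidate on its own.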

Having met the criteria for viable drug candidates in clinical trials of DILI, members of the Translator Consortium now plan to prepare a formal report on Translator’s findings for consideration by the DILI Network Steering Committee.

DISCUSSION

The Translator program is in its third year of development, having first demonstrated feasibility. (See Figure S1 for complete timeline and notable milestones.) Several key advancements have been achieved since we first described the Translator system in 2019. 3 , 4 For example, at the time of our first report, a unified Translator “system” functionally did not exist; rather, Translator comprised many individual tools and services that were not truly integrated or harmonized. This is in contrast to the prototype Translator system that now exists, which functions as a truly unified system. This achievement is due, in part, to the consortium‐wide adoption of ontologies and standards, such as Biolink Model and TRAPI, as well as tools to support their adoption and continued use. These ontologies and standards allow for seamless integration and harmonization across completely disparate “knowledge sources,” including observational clinical datasets and curated biomedical datasets. The Translator program has also moved beyond its initial two use cases on Fanconi anemia and asthma to include the use cases described here on IMIDs, Crohn’s disease/Parkinson’s disease, and DILI, as well as others. Moreover, the Translator program now has a nontechnical component, the SRI, that aims to create and maintain the collaborative framework required to support the adoption and implementation of standards and references, including services to support technical Translator components and teams. Through these standards and services, Translator has been able to readily expand the number of knowledge sources from which it draws data and knowledge and the number of use cases that it is able to support. A final achievement worth mentioning is that the Translator Consortium has maintained a unique culture of open collegial collaboration and communication, despite the addition of new teams and the inevitable turnover of team members.

Whereas a prototype Translator system now exists, with demonstration of its success in returning valid answers to user questions, there are several areas of improvement required to truly achieve a production‐level Translator system.

First, the scoring and ranking algorithms that are invoked by the ARAs are intentionally varied to provide breadth in answer sets and associated evidence. We acknowledge a need to refine the scoring and ranking algorithms in order to prioritize those answers with strong evidence, more complete provenance, and high confidence, thereby enriching for answers that are likely to provide the greatest insights to users.

Second, the TRAPI standard and Biolink Model are critical to standardize queries and answers across the federated Translator system. However, standardization can result in a lack of granularity and an inability to pose nuanced queries. For instance, workflow operations are only minimally supported in the current TRAPI standard. We are working to provision a variety of logical operations such as a graph overlay operation. We are also extending the Biolink Model to support nuanced statements by developing a core set of qualifiers (e.g., disease severity) that can be used to capture semantic richness.

Third, the clinical insights provided by the Translator system should be interpreted with caution. For instance, in our IMID use case, we provided real‐world EHR evidence that JAK‐Is co‐occurred with systemic sclerosis and were predictive of systemic sclerosis in a logistic regression model, thus supporting the assertion that they are prescribed to patients with systemic sclerosis. However, we did not provide evidence of clinical benefit when prescribed to treat systemic sclerosis. Translator clinical KPs rely primarily on structured EHR data. Structured EHR data can be used to derive information on clinical benefit, for example, by examining the frequency of emergency department visits for condition X among patients with a diagnosis of disease Y who were prescribed medication Z compared to those who were not prescribed the same medication. However, TRAPI currently does not support such nuanced queries, although efforts are underway to adapt TRAPI to allow for more sophisticated queries. For certain use cases (e.g., DILI), Translator clinical KPs expose study data, which do support TRAPI‐compliant assertions regarding clinical outcomes, but such data are not available for all use cases. At present, the approach that we are taking is to use curated knowledge sources to explore mechanistic evidence for how JAK‐Is might reduce inflammation in systemic sclerosis.
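The outcome comparison described in this paragraph (emergency department visits for condition X among patients with disease Y, split by whether medication Z was prescribed) can be sketched over structured records as follows. This is not something TRAPI or the Translator clinical KPs are stated to support; the record layout, the function name, and the five patients are hypothetical, and the comparison is deliberately unadjusted.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    has_disease_y: bool
    prescribed_z: bool
    ed_visits_for_x: int  # count of emergency department visits for condition X


def ed_visit_rate(patients, prescribed):
    """Mean ED visits for X among disease-Y patients in one prescription group."""
    cohort = [p for p in patients if p.has_disease_y and p.prescribed_z == prescribed]
    if not cohort:
        return None  # no patients in this group, so no rate can be reported
    return sum(p.ed_visits_for_x for p in cohort) / len(cohort)


# Hypothetical records; the last patient lacks disease Y and is excluded from both groups.
patients = [
    Patient(True, True, 0),
    Patient(True, True, 1),
    Patient(True, False, 2),
    Patient(True, False, 1),
    Patient(False, True, 3),
]

treated_rate = ed_visit_rate(patients, prescribed=True)
untreated_rate = ed_visit_rate(patients, prescribed=False)
print(treated_rate, untreated_rate)
```

A raw difference between the two rates would not by itself show clinical benefit, because patients who receive a medication may differ systematically from those who do not; this is one reason the insights should be interpreted with caution.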

Finally, whereas several Translator teams have developed user interfaces (UIs) that support TRAPI queries and answers, a uniform cross‐component UI is not yet available, although NCATS recently funded a team to develop one (see timeline in Figure S1). We recognize the urgent need for such an interface, which will allow us to more effectively engage users, thus serving a broader community and promoting long‐term sustainability. We note that a mock‐up Translator UI has been developed and is now being vetted by users, with an early‐phase prototype UI expected to be deployed by the end of calendar year 2022.

We note that the Translator system is one of several available biomedical KG‐based question‐answering systems. Others include Causaly, 29 Elsevier’s Biology Knowledge Graph 30 and related Pathway Studio, 31 and Google’s Knowledge Graph. 10 We emphasize a few differences among these systems. First, the Translator system is the only open‐source, community‐contributed system; Causaly and Elsevier’s systems are commercial, and Google’s Knowledge Graph is largely proprietary. For the IMID and DILI use cases reported here, the open‐source nature of Translator allowed us to run queries that openly explored EHR evidence on co‐occurrence rates of observations, relative frequencies, and disease risk predictions, without regulatory hurdles. Second, these systems are narrower in scope than Translator. Elsevier’s systems are highly specific to basic biology and do not span the translational spectrum. Causaly’s system supports a broader set of translational questions, but only a subset of those supported by Translator. Thus, our use cases included queries that spanned multiple biomedical entities (e.g., genes, chemical entities, small molecules, drugs, phenotypes, diseases) and numerous knowledge sources, including clinical knowledge sources. Third, Translator supports a more sophisticated set of queries than the other systems. For instance, Google’s Knowledge Graph only supports simple “lookup” operations, albeit with highly sophisticated natural language parsing of user questions. Causaly’s system is currently limited to linear two‐hop queries. Neither Causaly’s nor Elsevier’s systems support batch or asynchronous queries, in contrast to the Translator system. Our DILI use case leveraged Translator’s advanced capabilities, including three‐hop, batch, and asynchronous queries. Finally, none of the other systems support clinical knowledge, such as EHR data, which provided key support for two of the three use cases reported herein.
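To make the notion of a multi-hop, batch query more concrete, the following sketch builds a linear three-hop query graph in the general shape of a TRAPI message (a `query_graph` with `nodes` and `edges`). It is illustrative only and is not the query used in the DILI use case: the helper `linear_query` is hypothetical, the `EXAMPLE:` identifiers are placeholders rather than real CURIEs, and the choice of Biolink categories along the path is an assumption. The "batch" aspect is represented by supplying more than one identifier on the starting node. Submission to a Translator service, including asynchronous submission, is not shown.

```python
import json


def linear_query(categories, start_ids):
    """Build a linear query graph with len(categories) - 1 hops.

    The first node is pinned to start_ids (a batch when more than one
    identifier is supplied); the remaining nodes are constrained by
    category only, and edge predicates are left unconstrained.
    """
    nodes, edges = {}, {}
    for i, category in enumerate(categories):
        node = {"categories": [category]}
        if i == 0:
            node["ids"] = list(start_ids)
        nodes[f"n{i}"] = node
        if i > 0:
            # Connect each node to the one before it.
            edges[f"e{i - 1}"] = {"subject": f"n{i - 1}", "object": f"n{i}"}
    return {"message": {"query_graph": {"nodes": nodes, "edges": edges}}}


# Placeholder identifiers and an assumed path; both are hypothetical.
query = linear_query(
    categories=[
        "biolink:ChemicalEntity",
        "biolink:Gene",
        "biolink:PhenotypicFeature",
        "biolink:Disease",
    ],
    start_ids=["EXAMPLE:0001", "EXAMPLE:0002"],
)

print(json.dumps(query, indent=2))
```

Four nodes connected by three edges yield a three-hop query; adding identifiers to `start_ids` widens the batch without changing the shape of the graph.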

In conclusion, we have developed a biomedical KG‐based Translator system capable of integrating a wide range of data sets and translating those data into insights intended to augment human reasoning and accelerate translational science. We are now working on refinements to the prototype Translator system.

CONFLICT OF INTEREST

S.E.B. and S.H. have received support from the NSF Convergence Accelerator Open Knowledge Networks to develop applications related to the SPOKE KG. All other authors declared no competing interests for this work.

Supporting information

Figure S1

ACKNOWLEDGEMENTS

The authors are grateful to members of the Publications Committees at the National Center for Advancing Translational Sciences (NCATS), the National Institute of Environmental Health Sciences, and the National Institute on Aging, as well as Dr. Naga P. Chalasani, for their review and approval of the manuscript for publication. They further thank Kelsey Urgo for assistance with graphics design, and Stanley C. Ahalt of the Renaissance Computing Institute for financially supporting the graphics design work. Moreover, the authors are appreciative of the unwavering leadership and support provided by the Translator Extramural Leadership Team and the Intramural Research Program at NCATS.

Fecho K, Thessen AE, Baranzini SE, et al. Progress toward a universal biomedical data translator. Clin Transl Sci. 2022;15:1838‐1847. doi: 10.1111/cts.13301

Funding information

This work was supported by the National Center for Advancing Translational Sciences, Biomedical Data Translator Program (Other Transaction Awards OT2TR003434, OT2TR003436, OT2TR003428, OT2TR003448, OT2TR003427, OT2TR003430, OT2TR003433, OT2TR003450, OT2TR003437, OT2TR003443, OT2TR003441, OT2TR003449, OT2TR003445, OT2TR003422, OT2TR003435, OT3TR002026, OT3TR002020, OT3TR002025, OT3TR002019, OT3TR002027, OT2TR002517, OT2TR002514, OT2TR002515, OT2TR002584, and OT2TR002520; Contract number 75N95021P00636). Additional funding was provided by the National Center for Advancing Translational Sciences, Intramural Research Program (ZIA TR000276‐05) and the National Institute of Diabetes and Digestive and Kidney Diseases (5U01DK065201).

REFERENCES

Articles from Clinical and Translational Science are provided here courtesy of Wiley
