Knowledge-based biomedical Data Science

Lawrence E Hunter

doi:10.3233/DS-170001

Author manuscript; available in PMC: 2018 Oct 4.

Published in final edited form as: EPJ Data Sci. 2017 Dec 8;1(1-2):19–25. doi: 10.3233/DS-170001

PMCID: PMC6171523 NIHMSID: NIHMS887796 PMID: 30294517

Abstract

Computational manipulation of knowledge is an important, and often under-appreciated, aspect of biomedical Data Science. The first Data Science initiative from the US National Institutes of Health was entitled “Big Data to Knowledge (BD2K).” The main emphasis of the more than $200M allocated to that program has been on “Big Data;” the “Knowledge” component has largely been the implicit assumption that the work will lead to new biomedical knowledge. However, there is long-standing and highly productive work in computational knowledge representation and reasoning, and computational processing of knowledge has a role in the world of Data Science.

Knowledge-based biomedical Data Science involves the design and implementation of computer systems that act as if they knew about biomedicine. There are many ways in which a computational approach might act as if it knew something: for example, it might be able to answer a natural language question about a biomedical topic, or pass an exam; it might be able to use existing biomedical knowledge to rank or evaluate hypotheses; it might explain or interpret data in light of prior knowledge, either in a Bayesian or other sort of framework. These are all examples of automated reasoning that act on computational representations of knowledge. After a brief survey of existing approaches to knowledge-based data science, this position paper argues that such research is ripe for expansion, and expanded application.

Keywords: Ontology, knowledge representation, reasoning, inference, machine learning, text mining, explanation

1. Representations of biomedical knowledge

All computational approaches to knowledge require specification of how the computer system represents knowledge internally, and how it might compute with those representations to produce outputs (often called, perhaps metaphorically, reasoning). Classic descriptions of knowledge representation and reasoning systems, e.g. [16], focus on what ontological commitments a knowledge representation makes, what inferences are possible with it, and, sometimes, which of those inferences can be made efficiently. These issues remain useful in thinking about how knowledge representation and reasoning play a role in today’s data science environment.

As [16] pointed out, knowledge representations entail ontological commitments. Adoption of existing ontologies, rather than creation of idiosyncratic or single-use ontologies, provides significant advantages for reproducibility in scientific research, for inter-operability, and in avoiding pitfalls in the modeling of knowledge. A great deal of work has been done in biomedical ontology (e.g. [2,36,39,41,45] and many others), and these increasingly mature ontological resources form an important basis for knowledge-based data science. Community-curated ontologies (such as those meeting the Open Biomedical Ontologies (OBO) Foundry criteria [42]) capture a consensus view of the entities and processes involved in biology, medicine and biomedical research, analogous to how nomenclature committees systematize naming conventions. While not meeting all of the criteria of the OBO Foundry, terminological resources such as UMLS [30], SNOMED CT [7] and the NCI thesaurus [19] have also been used to provide useful pseudo-ontological foundations for knowledge representations.

While ontologies identify the basic elements from which a knowledge representation is constructed, they are agnostic about the mechanisms by which ontological units are assembled into representations of knowledge. Building on decades of work in artificial intelligence research, the W3C produced a collection of international standards for assembling ontological entities into assertions and managing collections of assertions, together referred to as the Semantic Web. The focus of the Semantic Web standards is to make it possible to link web elements with shared meaning, an approach sometimes described as the Linked Data paradigm. The Semantic Web builds on the standard Resource Description Framework (RDF), which provides a way to link three uniform resource identifiers (URIs) to specify a pair of entities and a relationship between them (forming an RDF “triple”). Collections of triples form a graph, and a computational mechanism for managing such collections is called a triple store. The Semantic Web standards also define RDF Schema (RDFS) and the Web Ontology Language (OWL), which facilitate richer knowledge representations; SPARQL, which provides a query language for interrogating RDF graphs or triple stores; and the Simple Knowledge Organization System (SKOS), which provides a basic ontology, including simple semantic relationships. While the Semantic Web standards are intended to be general representation tools for all knowledge (e.g. RDF for facilitating exchange of research data), the combination of Semantic Web standards and biomedical ontologies is the basis of most current biomedical knowledge representation systems.
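To make the triple and triple store ideas concrete, the following is a minimal sketch in plain Python rather than an RDF library. The example.org URIs, the `participates_in` relation and the `match` helper are all hypothetical names chosen for illustration; a real system would use ontology URIs and a SPARQL endpoint.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate URI, object URI)

# Hypothetical namespace and toy assertions, for illustration only.
EX = "http://example.org/"
TRIPLES = {
    (EX + "BRCA1", EX + "participates_in", EX + "DNA_repair"),
    (EX + "TP53", EX + "participates_in", EX + "DNA_repair"),
    (EX + "TP53", EX + "participates_in", EX + "apoptosis"),
}


def match(store: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in store:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


# Roughly the question "which subjects participate in DNA repair?"
genes_in_repair = sorted(
    t[0] for t in match(TRIPLES, p=EX + "participates_in", o=EX + "DNA_repair")
)
print(genes_in_repair)
```

The wildcard pattern plays the role that a variable plays in a SPARQL basic graph pattern; everything beyond single-pattern matching (joins, filters, inference) is left out of this sketch.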

2. Knowledge-based inference

Representations of knowledge are sterile without use. Although human visualization of computationally represented knowledge (e.g. [32]) can be useful, the primary use of computationally represented knowledge is inference. There are many forms of inference, and thousands of publications describing computational methods of reasoning. Although the field is too broad to survey here, a brief introduction to the types of knowledge-based inference common in biomedical applications gives some idea of its potential.

2.1. Logical inference

Computational logical inference is a mapping from a base set of assertions to create additional assertions that are entailed by the base. While deductive reasoning is the classic form of logical inference, it is, in general, computationally intractable. Various restricted forms of deductive inference, such as those based on description logics, have better computational performance, at the cost of greatly restricting the utility of the inferences. Description logics, for example, are limited to inferring subsumption relationships based on necessary and sufficient class definitions. Contemporary applications of description logic inference in biomedical knowledge representation have been successful primarily in checking for modeling errors (e.g. [8,26]), although some other applications have been attempted (e.g. [9,22,23]).
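The idea of mapping a base set of assertions to the additional assertions it entails can be sketched with a tiny forward-chaining loop. This is not a description logic reasoner; it assumes only two hand-written rules (transitivity of `is_a`, and inheritance of class membership along `is_a`), and the class names and `sample_17` are toy values.

```python
def entail(base):
    """Close a set of (subject, predicate, object) facts under two rules:

    (A is_a B) and (B is_a C)          =>  (A is_a C)
    (x instance_of A) and (A is_a B)   =>  (x instance_of B)
    """
    facts = set(base)
    while True:
        derived = set()
        for (s, p, o) in facts:
            if p not in ("is_a", "instance_of"):
                continue
            for (s2, p2, o2) in facts:
                # Chain any fact ending at class `o` with an is_a edge leaving `o`.
                if p2 == "is_a" and s2 == o:
                    derived.add((s, p, o2))
        if derived <= facts:      # fixed point: nothing new was derived
            return facts
        facts |= derived


# Toy base assertions (illustrative, not taken from a specific ontology release).
BASE = {
    ("hepatocyte", "is_a", "epithelial cell"),
    ("epithelial cell", "is_a", "cell"),
    ("sample_17", "instance_of", "hepatocyte"),
}

ENTAILED = entail(BASE) - BASE
for fact in sorted(ENTAILED):
    print(fact)
```

The loop stops at a fixed point, so the result contains exactly the assertions derivable from the base under these two rules. Richer rule sets are where the tractability concerns raised above begin to matter.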

Deductive retrieval is a special case of deductive inference, where the inference is to compute whether a set of logical axioms and base assertions can be combined to satisfy a query; the programming language Prolog and the W3C standard for the Semantic Web Rule Language (SWRL) are examples of approaches to deductive retrieval. Triple stores extended with deductive retrieval are much more valuable than those that can answer only queries that match stored assertions exactly. Several knowledge-bases of biomedicine based on these technologies have been developed (e.g. [3,6,31,48]), and their uses extend beyond deductive retrieval alone.
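The difference between exact-match retrieval and deductive retrieval can be illustrated with a small goal-directed search. The sketch below assumes a single axiom (transitivity of `part_of`) and a handful of toy assertions; `holds` and `exact_match` are hypothetical helper names, and a Prolog or SWRL engine would handle arbitrary rules rather than this one hard-coded case.

```python
# Toy base assertions: part -> set of wholes it is directly part_of.
PART_OF = {
    "mitochondrial matrix": {"mitochondrion"},
    "mitochondrion": {"cytoplasm"},
    "cytoplasm": {"cell"},
}


def exact_match(part, whole, facts=PART_OF):
    """Answer the query only if the assertion is stored verbatim."""
    return whole in facts.get(part, ())


def holds(part, whole, facts=PART_OF, _seen=None):
    """Answer the query if it is stored or derivable via transitivity.

    Works backward from the query: part_of(part, whole) holds if some
    stored part_of(part, x) exists with x == whole or part_of(x, whole).
    The _seen set guards against cycles in the stored assertions.
    """
    seen = _seen if _seen is not None else set()
    if part in seen:
        return False
    seen.add(part)
    for parent in facts.get(part, ()):
        if parent == whole or holds(parent, whole, facts, seen):
            return True
    return False


print(exact_match("mitochondrial matrix", "cell"))  # stored verbatim?
print(holds("mitochondrial matrix", "cell"))        # derivable?
```

Here the query about the mitochondrial matrix and the cell fails under exact matching but succeeds once the transitivity axiom is combined with the stored assertions.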

2.2. Inference from ontology annotation

In addition to the creation of biomedical ontologies, a great deal of effort has gone into annotating genes and other biological entities with ontological categories. Gene Ontology annotations of genes and gene products figure prominently in major databases such as UniProt and Mouse Genome Informatics. These annotations provide a quick summary of knowledge about gene function, subcellular localization and biological processes. By far the most common application of computational representations of knowledge to problems in biomedicine is enrichment analysis (see e.g. [24,43,46]). Enrichment analysis generates hypotheses about the concerted functions of collections of genes by testing for annotations that occur more frequently in the collection than would be expected by chance. Ontology annotation directly supports other sorts of knowledge-based inference as well. For example, phenotype annotations play a major role in mapping between human disease and animal models (e.g. [28,34,35]). Formal representations of metabolic pathways (e.g. [18,27]) have been used to analyze metabolomic data and support metabolic engineering.
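One common way to test whether an annotation occurs in a gene collection more often than expected by chance is a one-sided hypergeometric test. The sketch below is a minimal version under that assumption; the cited enrichment tools differ in their statistics and in how they correct for multiple testing, and the function and parameter names here are illustrative.

```python
from math import comb  # available in Python 3.8+


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric tail probability P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes carrying the annotation
    sample:     size of the gene collection being tested
    hits:       genes in the collection carrying the annotation
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 200 annotated with a term,
# and a 100-gene collection in which 10 genes carry that term.
p = enrichment_p_value(population=20000, annotated=200, sample=100, hits=10)
print(f"p = {p:.3g}")
```

Because many annotation terms are usually tested at once, a raw p-value like this would normally be followed by a multiple-testing correction, which is omitted here.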

2.3. Inference from the biomedical literature

Despite the rapid growth of databases with ontological annotation, the main and by far the largest repository of biomedical knowledge remains the published literature. An important domain of knowledge-based data science involves natural language processing with the goal of producing computational representations of the knowledge in the literature. The most basic of these approaches involves tagging passages in the literature with ontological terms (e.g. EuroPMC’s SciLite annotations, or [20]). Computational methods to identify semantically well-defined entities in the literature support further analysis that identifies links both among different documents in the literature (e.g. [52]) and between entities in the literature and database entries about them (e.g. [37]). More ambitious literature mining goals involve producing more complex knowledge representations directly by processing natural language documents, e.g. [15,49], although significant improvements in performance are likely to be necessary before the results of such processing find widespread use in biomedical research. Text mining approaches applied to clinical records and social media, e.g. for pharmacovigilance applications, have also made significant strides recently [17]. The best performing text mining systems themselves often use representations of prior knowledge to drive understanding of text.
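The most basic approach mentioned above, tagging passages with ontological terms, can be sketched as a dictionary lookup. The three-entry lexicon and the `tag` function are illustrative assumptions; the concept recognizers evaluated in [20] handle spelling variation, synonyms and ambiguity far more carefully than exact string matching does.

```python
import re

# Tiny illustrative lexicon mapping surface strings to Gene Ontology IDs.
LEXICON = {
    "apoptotic process": "GO:0006915",
    "apoptosis": "GO:0006915",
    "mitochondrion": "GO:0005739",
}


def tag(text, lexicon=LEXICON):
    """Return (start, end, matched text, concept ID) tuples sorted by position."""
    hits = []
    for term, concept in lexicon.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), concept))
    return sorted(hits)


SENTENCE = "Cytochrome c release from the mitochondrion triggers apoptosis."
ANNOTATIONS = tag(SENTENCE)
for start, end, surface, concept in ANNOTATIONS:
    print(start, end, surface, concept)
```

Keeping the character offsets alongside each concept identifier is what allows later steps to link a mention back to its passage and onward to database entries.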

Natural language processing systems have also been used to support automated question answering. Perhaps the best known of these efforts is IBM’s Watson system [12], which has found significant biomedical application. Many other computational systems for question answering, targeted to biomedical researchers and clinicians, have been fielded, e.g. as reviewed in [1,4]. Computational approaches to building systems that can answer biomedical exam questions have also been developed, e.g. [14].

2.4. Hypothesis generation, evaluation and modification

Perhaps the oldest method of computing with knowledge is Bayesian inference [21]. By providing a quantitative framework for the idea that observations consistent with prior knowledge are more likely than ones that contradict it, Bayesian reasoning has provided a basis for knowledge-based computation long before computation was automated. Contemporary computers provide the power necessary to support more elaborate Bayesian inference, including model selection as well as estimating model parameters [13].
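The quantitative role of prior knowledge in Bayesian reasoning can be shown with a single application of Bayes' rule to one hypothesis and its negation. The numbers below are hypothetical and the `posterior` function is only a sketch; the Bayesian networks and model selection methods cited above generalize this to many interdependent variables.

```python
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """Bayes' rule for a hypothesis H and its negation.

    P(H | data) = P(data | H) P(H)
                  / (P(data | H) P(H) + P(data | not H) (1 - P(H)))
    """
    evidence = prior * p_data_given_h + (1.0 - prior) * p_data_given_not_h
    return prior * p_data_given_h / evidence


# Hypothetical numbers: prior knowledge says the hypothesis is unlikely (1%),
# and the observation is 18 times more probable if the hypothesis is true.
after_one = posterior(prior=0.01, p_data_given_h=0.9, p_data_given_not_h=0.05)

# A second, independent observation of the same kind: the previous
# posterior becomes the new prior.
after_two = posterior(prior=after_one, p_data_given_h=0.9, p_data_given_not_h=0.05)

print(round(after_one, 3), round(after_two, 3))
```

With a low prior, one supporting observation leaves the hypothesis improbable, which is the sense in which observations are weighed against prior knowledge rather than taken at face value.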

Network-based inference, such as link prediction or community finding, has been successfully applied to generate significant biomedical hypotheses. Systems that compute over representations of knowledge of biomedicine have been used to propose as yet unobserved relationships among biological entities, e.g. for drugs [33], microRNAs [51], diseases [44] and proteins [47]; some of these predictions have been empirically validated, e.g. [25].
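Link prediction can be sketched with a neighborhood similarity index: pairs of nodes that are not yet connected are ranked by how much their neighborhoods overlap. The example below uses the Jaccard index on a toy graph with hypothetical node labels; it is one simple choice among the many similarity indices and is not claimed to be the method of any cited system.

```python
from itertools import combinations

# Toy undirected graph; node labels are placeholders for biological entities.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]


def adjacency(edges):
    """Build a node -> set-of-neighbors map from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def jaccard_scores(adj):
    """Score each unconnected pair by |N(u) & N(v)| / |N(u) | N(v)|."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already linked, nothing to predict
        union = adj[u] | adj[v]
        scores[(u, v)] = len(adj[u] & adj[v]) / len(union) if union else 0.0
    return scores


ADJ = adjacency(EDGES)
RANKED = sorted(jaccard_scores(ADJ).items(), key=lambda kv: kv[1], reverse=True)
for pair, score in RANKED:
    print(pair, round(score, 3))
```

The highest-scoring unconnected pairs are the candidate relationships that would be proposed for empirical follow-up.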

Perhaps the most exciting potential for knowledge-based computational systems is in the development and refinement of mechanistic explanations of biomedical phenomena. The vast scope and rapid evolution of the biomedical literature, combined with the breakdown of disciplinary boundaries driven by genome-scale research, has made it increasingly difficult for researchers to effectively assimilate all the knowledge potentially relevant to interpreting the results of their own experiments. Although most computational approaches aim to provide material for the Results section of a paper, a few are beginning to target the Discussion section as well. While no knowledge-based computer system has repeatedly generated important biomedical hypotheses de novo, promising proof-of-concept systems include systems to generate hypotheses from the literature [40] and those aimed at hypothesis generation or refinement from data [11,38], as well as mixed-initiative human-computer hypothesis generation [29]. Although it remains aspirational, the synthesis of computational simulation with knowledge-based generation and refinement of hypotheses has received substantial interest from funding agencies [50].

3. Open challenges in knowledge-based Data Science

As is clear from the NIH BD2K experience, computation over knowledge is a less widespread research focus than analysis of big data, and to date has had less impact in biomedicine. Certain applications, such as enrichment analysis and link prediction, have found widespread use in biomedical research. Text mining systems are increasingly deployed in areas such as helping clinicians keep up with rapidly changing clinical data [10] and pharmacovigilance. However, there are significant challenges to realizing the potential for knowledge-based data science. Perhaps the foremost among these is the knowledge acquisition bottleneck: human curation, even for the relatively simple task of annotating genes with Gene Ontology terms, is difficult to scale [5]. Alternatives to manual curation, including applications of text mining and machine learning, have shown promise, but are still far short of human-like performance. Another important understudied question is how to represent what is not known: any scientist can describe gaps, ambiguities and uncertainties in existing knowledge, yet there are few computational methods capable of representing, let alone reasoning about, such ignorance.

Even more challenging than developing representations of what is already known is the application of that knowledge to the pressing problems of biomedical research. Existing inference methods are far short of the range and creativity of human experts in developing potential explanations, generating significant hypotheses, and generally interpreting results in light of previous knowledge. Many promising inference methods scale poorly, and are constrained in their ability to harness large knowledge-bases by the extremely large computational loads involved. Even deductive retrieval systems can be computationally intractable over large knowledge-bases; more complex forms of inference hit the limits of current hardware with even smaller knowledge-bases. The Semantic Web standard was developed largely with description logic inference in mind; while it provides a solid foundation for knowledge representation systems, representational transformations may improve the efficiency of other sorts of inference.

Perhaps the biggest challenges in knowledge-based data science are in developing the vision for what such a system could effectively contribute to biomedical research. Is it possible to build computational systems that bring to bear disparate yet relevant facts from across all biomedical disciplines and scales, exploiting their ability to process far more information than any individual human being? Could such a system make sound judgements ranking alternative hypotheses based on an exhaustive comprehension of the literature? Is it possible for computational systems to generate significant and novel mechanistic and pathomechanistic hypotheses about open questions in biomedicine? It is positive answers to questions like these that will drive knowledge-based data science into the mainstream of biomedical research.

References

1. Athenikos S, Han H. Biomedical question answering: A survey. Comput Methods Programs Biomed. 2010;99:1–24. doi: 10.1016/j.cmpb.2009.10.003.
2. Bandrowski A, Brinkman R, Brochhausen M, Brush M, Bug B, Chibucos M, Clancy K, Courtot M, Derom D, Dumontier M, Fan L, Fostel J, Fragoso G, Gibson F, Gonzalez-Beltran A, Haendel M, He Y, Heiskanen M, Hernandez-Boussard T, Jensen M, Lin Y, Lister A, Lord P, Malone J, Manduchi E, McGee M, Morrison N, Overton J, Parkinson H, Peters B, Rocca-Serra P, Ruttenberg A, Sansone S, Scheuermann R, Schober D, Smith B, Soldatova L, Stoeckert CJ, Taylor C, Torniai C, Turner J, Vita R, Whetzel P, Zheng J. The ontology for biomedical investigations. PLoS One. 2016;11:0154556. doi: 10.1371/journal.pone.0154556.
3. Barros M, Couto F. Knowledge representation and management: A linked data perspective. Yearb Med Inform. 2016;10:178–183. doi: 10.15265/IY-2016-022.
4. Bauer M, Berleant D. Usability survey of biomedical question answering systems. Hum Genomics. 2012;6:17. doi: 10.1186/1479-7364-6-17.
5. Baumgartner WJ, Cohen K, Fox L, Acquaah-Mensah G, Hunter L. Manual curation is not sufficient for annotation of genomic databases. Bioinformatics. 2007;23:41–48. doi: 10.1093/bioinformatics/btm229.
6. Belleau F, Nolin M, Tourigny N, Rigault P, Morissette J. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. J Biomed Inform. 2008;41:706–716. doi: 10.1016/j.jbi.2008.03.004.
7. Bhattacharyya SB. Introduction to SNOMED CT. Springer Nature; 2015. Overview of SNOMED CT; pp. 1–2.
8. Bodenreider O, Smith B, Kumar A, Burgun A. Investigating subsumption in SNOMED CT: An exploration into large description logic-based biomedical terminologies. Artificial Intelligence in Medicine. 2007;39(3):183–195. doi: 10.1016/j.artmed.2006.12.003.
9. Boeker M, França F, Bronsert P, Schulz S. TNM-O: Ontology support for staging of malignant tumours. J Biomed Semantics. 2016;7:64. doi: 10.1186/s13326-016-0106-9.
10. Bringing Precision Medicine to Community Oncologists. Cancer Discov. 2017;7:6–7. doi: 10.1158/2159-8290.CDNB2016-147.
11. Callahan A, Dumontier M, Shah N. HyQue: Evaluating hypotheses using semantic web technologies. J Biomed Semantics. 2011;2(Suppl 2):3. doi: 10.1186/2041-1480-2-S2-S3.
12. Chen Y, Elenee AJ, Weber G, Watson IBM. How cognitive computing can be applied to big data challenges in life sciences research. Clin Ther. 2016;38:688–701. doi: 10.1016/j.clinthera.2015.12.001.
13. Chipman H, George EI, McCulloch RE, Clyde M, Foster DP, Stine RA. The practical implementation of Bayesian model selection. Lecture Notes – Monograph Series. 2001;38:65–134. doi: 10.1214/lnms/1215540964.
14. Clark P, Etzioni O, Khot T, Sabharwal A, Tafjord O, Turney PD, Khashabi D. Combining retrieval, statistics, and inference to answer elementary science questions. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence; February 12–17, 2016; Phoenix, Arizona, USA. 2016. pp. 2580–2586. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11963.
15. Cohen K, Verspoor K, Johnson H, Roeder C, Ogren P, Baumgartner WJ, White E, Tipney H, Hunter L. High-precision biological event extraction: Effects of system and of data. Comput Intell. 2011;27:681–701. doi: 10.1111/j.1467-8640.2011.00405.x.
16. Davis R, Shrobe H, Szolovits P. What is a knowledge representation? AI Magazine. 1993;14(1):17–33. http://www.aaai.org/ojs/index.php/aimagazine/article/view/1029/947.
17. Demner-Fushman D, Elhadad N. Aspiring to unintended consequences of natural language processing: A review of recent developments in clinical and consumer-generated text processing. Yearb Med Inform. 2016;10:224–233. doi: 10.15265/IY-2016-017.
18. Fabregat A, Sidiropoulos K, Garapati P, Gillespie M, Hausmann K, Haw R, Jassal B, Jupe S, Korninger F, McKay S, Matthews L, May B, Milacic M, Rothfels K, Shamovsky V, Webber M, Weiser J, Williams M, Wu G, Stein L, Hermjakob H, D’Eustachio P. The Reactome pathway Knowledgebase. Nucleic Acids Res. 2016;44:481–487. doi: 10.1093/nar/gkv1351.
19. Fragoso G, de Coronado S, Haber M, Hartel F, Wright L. Overview and utilization of the NCI thesaurus. Comparative and Functional Genomics. 2004;5(8):648–654. doi: 10.1002/cfg.445.
20. Funk C, Baumgartner WJ, Garcia B, Roeder C, Bada M, Cohen K, Hunter L, Verspoor K. Large-scale biomedical concept recognition: An evaluation of current automatic annotators and their parameters. BMC Bioinformatics. 2014;15:59. doi: 10.1186/1471-2105-15-59.
21. Heckerman D. Learning in Graphical Models. Springer; 1998. A tutorial on learning with Bayesian networks; pp. 301–354.
22. Hochheiser H, Castine M, Harris D, Savova G, Jacobson R. An information model for computable cancer phenotypes. BMC Med Inform Decis Mak. 2016;16:121. doi: 10.1186/s12911-016-0358-4.
23. Holford M, Krauthammer M. Mutadelic: Mutation analysis using description logic inferencing capabilities. Bioinformatics. 2015;31:3742–3747. doi: 10.1093/bioinformatics/btv467.
24. Huang DW, Sherman B, Lempicki R. Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37:1–13. doi: 10.1093/nar/gkn923.
25. Jahchan N, Dudley J, Mazur P, Flores N, Yang D, Palmerton A, Zmoos A, Vaka D, Tran K, Zhou M, Krasinska K, Riess J, Neal J, Khatri P, Park K, Butte A, Sage J. A drug repositioning approach identifies tricyclic antidepressants as inhibitors of small cell lung cancer and other neuroendocrine tumors. Cancer Discov. 2013;3:1364–1377. doi: 10.1158/2159-8290.CD-13-0183.
26. Jansen K, Kim T, Coenen A, Saba V, Hardiker N. Harmonising nursing terminologies using a conceptual framework. Stud Health Technol Inform. 2016;225:471–475. https://www.ncbi.nlm.nih.gov/pubmed/27332245.
27. Keseler I, Mackie A, Santos-Zavaleta A, Billington R, Bonavides-Martínez C, Caspi R, Fulcher C, Gama-Castro S, Kothari A, Krummenacker M, Latendresse M, Muñiz-Rascado L, Ong Q, Paley S, Peralta-Gil M, Subhraveti P, Velázquez-Ramírez D, Weaver D, Collado-Vides J, Paulsen I, Karp P. The EcoCyc database: Reflecting new knowledge about Escherichia coli K-12. Nucleic Acids Res. 2017;45:543–550. doi: 10.1093/nar/gkw1003.
28. Kibbe W, Arze C, Felix V, Mitraka E, Bolton E, Fu G, Mungall C, Binder J, Malone J, Vasant D, Parkinson H, Schriml L. Disease ontology 2015 update: An expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res. 2015;43:1071–1078. doi: 10.1093/nar/gku1011.
29. Leach S, Tipney H, Feng W, Baumgartner W, Kasliwal P, Schuyler R, Williams T, Spritz R, Hunter L. Biomedical discovery acceleration, with applications to craniofacial development. PLoS Comput Biol. 2009;5:1000215. doi: 10.1371/journal.pcbi.1000215.
30. Lindberg C. The unified medical language system (UMLS) of the national library of medicine. J Am Med Rec Assoc. 1990;61:40–42. https://www.ncbi.nlm.nih.gov/pubmed/10104531.
31. Livingston K, Bada M, Baumgartner WJ, Hunter L. KaBOB: Ontology-based semantic integration of biomedical databases. BMC Bioinformatics. 2015;16:126. doi: 10.1186/s12859-015-0559-3.
32. Lohmann S, Negru S, Haag F, Ertl T. Lecture Notes in Computer Science. Springer Nature; 2014. VOWL 2: User-oriented visualization of ontologies; pp. 266–281.
33. Lu Y, Guo Y, Korhonen A. Link prediction in drug-target interactions network using similarity indices. BMC Bioinformatics. 2017;18:39. doi: 10.1186/s12859-017-1460-z.
34. Mungall C, McMurry J, Köhler S, Balhoff J, Borromeo C, Brush M, Carbon S, Conlin T, Dunn N, Engelstad M, Foster E, Gourdine J, Jacobsen J, Keith D, Laraway B, Lewis S, NguyenXuan J, Shefchek K, Vasilevsky N, Yuan Z, Washington N, Hochheiser H, Groza T, Smedley D, Robinson P, Haendel M. The Monarch Initiative: An integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2017;45:712–722. doi: 10.1093/nar/gkw1128.
35. Mungall C, Washington N, Nguyen-Xuan J, Condit C, Smedley D, Köhler S, Groza T, Shefchek K, Hochheiser H, Robinson P, Lewis S, Haendel M. Use of model organism and disease databases to support matchmaking for human disease gene discovery. Hum Mutat. 2015;36:979–984. doi: 10.1002/humu.22857.
36. Ong E, Xiang Z, Zhao B, Liu Y, Lin Y, Zheng J, Mungall C, Courtot M, Ruttenberg A, He Y. Ontobee: A linked ontology data server to support ontology term dereferencing, linkage, query and integration. Nucleic Acids Res. 2017;45:347–352. doi: 10.1093/nar/gkw918.
37. Pafilis E, O’Donoghue SI, Jensen LJ, Horn H, Kuhn M, Brown NP, Schneider R. Reflect: Augmented browsing for the life scientist. Nature Biotechnology. 2009;27(6):508–510. doi: 10.1038/nbt0609-508.
38. Racunas S, Shah N, Albert I, Fedoroff N. HyBrow: A prototype system for computer-aided hypothesis evaluation. Bioinformatics. 2004;20(Suppl 1):257–264. doi: 10.1093/bioinformatics/bth905.
39. Sharp M. Toward a comprehensive drug ontology: Extraction of drug-indication relations from diverse information sources. J Biomed Semantics. 2017;8:2. doi: 10.1186/s13326-016-0110-0.
40. Smalheiser N, Torvik V, Zhou W. Arrowsmith two-node search interface: A tutorial on finding meaningful links between two disparate sets of articles in MEDLINE. Comput Methods Programs Biomed. 2009;94:190–197. doi: 10.1016/j.cmpb.2008.12.006.
41. Smedley D. Faculty of 1000 evaluation for The human phenotype ontology: Semantic unification of common and rare disease, Faculty of 1000 Ltd. 2017 doi: 10.3410/f.725602763.793528156.
42. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg L, Eilbeck K, Ireland A, Mungall C, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone S, Scheuermann R, Shah N, Whetzel P, Lewis S. The OBO foundry: Coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 2007;25:1251–1255. doi: 10.1038/nbt1346.
43. Soldatos T, Perdigão N, Brown N, Sabir K, O’Donoghue S. How to learn about gene function: Text-mining or ontologies? Methods. 2015;74:3–15. doi: 10.1016/j.ymeth.2014.07.004.
44. Suthram S, Dudley JT, Chiang AP, Chen R, Hastie TJ, Butte AJ. Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLoS Comput Biol. 2010;6(2):1000662. doi: 10.1371/journal.pcbi.1000662.
45. The Gene Ontology Consortium. Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Res. 2017;45:331–338. doi: 10.1093/nar/gkw1108.
46. Tipney H, Hunter L. An introduction to effective use of enrichment analysis software. Hum Genomics. 2010;4:202–206. doi: 10.1186/1479-7364-4-3-202.
47. Tripathi S, Moutari S, Dehmer M, Emmert-Streib F. Comparison of module detection algorithms in protein networks and investigation of the biological meaning of predicted modules. BMC Bioinformatics. 2016;17:129. doi: 10.1186/s12859-016-0979-8.
48. Willighagen E, Waagmeester A, Spjuth O, Ansell P, Williams A, Tkachenko V, Hastings J, Chen B, Wild D. The ChEMBL database as linked open data. J Cheminform. 2013;5:23. doi: 10.1186/1758-2946-5-23.
49. Xia J, Fang A, Zhang X. A novel feature selection strategy for enhanced biomedical event extraction using the Turku system. Biomed Res Int. 2014;2014:205239. doi: 10.1155/2014/205239.
50. You J. Artificial intelligence. DARPA sets out to automate research. Science. 2015;347:465. doi: 10.1126/science.347.6221.465.
51. Zeng X, Zhang X, Liao Y, Pan L. Prediction and validation of association between microRNAs and diseases by multipath methods. Biochim Biophys Acta. 2016;1860:2735–2739. doi: 10.1016/j.bbagen.2016.03.016.
52. Zheng J, Howsmon D, Zhang B, Hahn J, McGuinness D, Hendler J, Ji H. Entity linking for biomedical literature. BMC Med Inform Decis Mak. 2015;15(Suppl 1):4. doi: 10.1186/1472-6947-15-S1-S4.

[R1] 1.Athenikos S, Han H. Biomedical question answering: A survey. Comput Methods Programs Biomed. 2010;99:1–24. doi: 10.1016/j.cmpb.2009.10.003. [DOI] [PubMed] [Google Scholar]

[R2] 2.Bandrowski A, Brinkman R, Brochhausen M, Brush M, Bug B, Chibucos M, Clancy K, Courtot M, Derom D, Dumontier M, Fan L, Fostel J, Fragoso G, Gibson F, Gonzalez-Beltran A, Haendel M, He Y, Heiskanen M, Hernandez-Boussard T, Jensen M, Lin Y, Lister A, Lord P, Malone J, Manduchi E, McGee M, Morrison N, Overton J, Parkinson H, Peters B, Rocca-Serra P, Ruttenberg A, Sansone S, Scheuermann R, Schober D, Smith B, Soldatova L, Stoeckert CJ, Taylor C, Torniai C, Turner J, Vita R, Whetzel P, Zheng J. The ontology for biomedical investigations. Plos One. 2016;11:0154556. doi: 10.1371/journal.pone.0154556. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Barros M, Couto F. Knowledge representation and management: A linked data perspective. Yearb Med Inform. 2016;10:178–183. doi: 10.15265/IY-2016-022. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] 4.Bauer M, Berleant D. Usability survey of biomedical question answering systems. Hum Genomics. 2012;6:17. doi: 10.1186/1479-7364-6-17. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5.Baumgartner WJ, Cohen K, Fox L, Acquaah-Mensah G, Hunter L. Manual curation is not sufficient for annotation of genomic databases. Bioinformatics. 2007;23:41–48. doi: 10.1093/bioinformatics/btm229. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] 6.Belleau F, Nolin M, Tourigny N, Rigault P, Morissette J. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. J Biomed Inform. 2008;41:706–716. doi: 10.1016/j.jbi.2008.03.004. [DOI] [PubMed] [Google Scholar]

[R7] 7.Bhattacharyya SB. Introduction to SNOMED CT. Springer Nature; 2015. Overview of SNOMED CT; pp. 1–2. [DOI] [Google Scholar]

[R8] 8.Bodenreider O, Smith B, Kumar A, Burgun A. Investigating subsumption in SNOMED CT: An exploration into large description logic-based biomedical terminologies. Artificial Intelligence in Medicine. 2007;39(3):183–195. doi: 10.1016/j.artmed.2006.12.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Boeker M, França F, Bronsert P, Schulz S. TNM-O: Ontology support for staging of malignant tumours. J Biomed Semantics. 2016;7:64. doi: 10.1186/s13326-016-0106-9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10. Bringing Precision Medicine to Community Oncologists. Cancer Discov. 2017;7:6–7. doi: 10.1158/2159-8290.CDNB2016-147.

[R11] 11. Callahan A, Dumontier M, Shah N. HyQue: Evaluating hypotheses using semantic web technologies. J Biomed Semantics. 2011;2(Suppl 2):3. doi: 10.1186/2041-1480-2-S2-S3.

[R12] 12. Chen Y, Elenee AJ, Weber G. IBM Watson: How cognitive computing can be applied to big data challenges in life sciences research. Clin Ther. 2016;38:688–701. doi: 10.1016/j.clinthera.2015.12.001.

[R13] 13. Chipman H, George EI, McCulloch RE, Clyde M, Foster DP, Stine RA. The practical implementation of Bayesian model selection. Lecture Notes – Monograph Series. 2001;38:65–134. doi: 10.1214/lnms/1215540964.

[R14] 14. Clark P, Etzioni O, Khot T, Sabharwal A, Tafjord O, Turney PD, Khashabi D. Combining retrieval, statistics, and inference to answer elementary science questions. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence; February 12–17, 2016; Phoenix, Arizona, USA. 2016. pp. 2580–2586. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11963.

[R15] 15. Cohen K, Verspoor K, Johnson H, Roeder C, Ogren P, Baumgartner WJ, White E, Tipney H, Hunter L. High-precision biological event extraction: Effects of system and of data. Comput Intell. 2011;27:681–701. doi: 10.1111/j.1467-8640.2011.00405.x.

[R16] 16. Davis R, Shrobe H, Szolovits P. What is a knowledge representation? AI Magazine. 1993;14(1):17–33. http://www.aaai.org/ojs/index.php/aimagazine/article/view/1029/947.

[R17] 17. Demner-Fushman D, Elhadad N. Aspiring to unintended consequences of natural language processing: A review of recent developments in clinical and consumer-generated text processing. Yearb Med Inform. 2016;10:224–233. doi: 10.15265/IY-2016-017.

[R18] 18. Fabregat A, Sidiropoulos K, Garapati P, Gillespie M, Hausmann K, Haw R, Jassal B, Jupe S, Korninger F, McKay S, Matthews L, May B, Milacic M, Rothfels K, Shamovsky V, Webber M, Weiser J, Williams M, Wu G, Stein L, Hermjakob H, D’Eustachio P. The Reactome pathway Knowledgebase. Nucleic Acids Res. 2016;44:481–487. doi: 10.1093/nar/gkv1351.

[R19] 19. Fragoso G, de Coronado S, Haber M, Hartel F, Wright L. Overview and utilization of the NCI thesaurus. Comparative and Functional Genomics. 2004;5(8):648–654. doi: 10.1002/cfg.445.

[R20] 20. Funk C, Baumgartner WJ, Garcia B, Roeder C, Bada M, Cohen K, Hunter L, Verspoor K. Large-scale biomedical concept recognition: An evaluation of current automatic annotators and their parameters. BMC Bioinformatics. 2014;15:59. doi: 10.1186/1471-2105-15-59.

[R21] 21. Heckerman D. Learning in Graphical Models. Springer; 1998. A tutorial on learning with Bayesian networks; pp. 301–354.

[R22] 22. Hochheiser H, Castine M, Harris D, Savova G, Jacobson R. An information model for computable cancer phenotypes. BMC Med Inform Decis Mak. 2016;16:121. doi: 10.1186/s12911-016-0358-4.

[R23] 23. Holford M, Krauthammer M. Mutadelic: Mutation analysis using description logic inferencing capabilities. Bioinformatics. 2015;31:3742–3747. doi: 10.1093/bioinformatics/btv467.

[R24] 24. Huang DW, Sherman B, Lempicki R. Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37:1–13. doi: 10.1093/nar/gkn923.

[R25] 25. Jahchan N, Dudley J, Mazur P, Flores N, Yang D, Palmerton A, Zmoos A, Vaka D, Tran K, Zhou M, Krasinska K, Riess J, Neal J, Khatri P, Park K, Butte A, Sage J. A drug repositioning approach identifies tricyclic antidepressants as inhibitors of small cell lung cancer and other neuroendocrine tumors. Cancer Discov. 2013;3:1364–1377. doi: 10.1158/2159-8290.CD-13-0183.

[R26] 26. Jansen K, Kim T, Coenen A, Saba V, Hardiker N. Harmonising nursing terminologies using a conceptual framework. Stud Health Technol Inform. 2016;225:471–475. https://www.ncbi.nlm.nih.gov/pubmed/27332245.

[R27] 27. Keseler I, Mackie A, Santos-Zavaleta A, Billington R, Bonavides-Martínez C, Caspi R, Fulcher C, Gama-Castro S, Kothari A, Krummenacker M, Latendresse M, Muñiz-Rascado L, Ong Q, Paley S, Peralta-Gil M, Subhraveti P, Velázquez-Ramírez D, Weaver D, Collado-Vides J, Paulsen I, Karp P. The EcoCyc database: Reflecting new knowledge about Escherichia coli K-12. Nucleic Acids Res. 2017;45:543–550. doi: 10.1093/nar/gkw1003.

[R28] 28. Kibbe W, Arze C, Felix V, Mitraka E, Bolton E, Fu G, Mungall C, Binder J, Malone J, Vasant D, Parkinson H, Schriml L. Disease ontology 2015 update: An expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Res. 2015;43:1071–1078. doi: 10.1093/nar/gku1011.

[R29] 29. Leach S, Tipney H, Feng W, Baumgartner W, Kasliwal P, Schuyler R, Williams T, Spritz R, Hunter L. Biomedical discovery acceleration, with applications to craniofacial development. PLoS Comput Biol. 2009;5:1000215. doi: 10.1371/journal.pcbi.1000215.

[R30] 30. Lindberg C. The unified medical language system (UMLS) of the national library of medicine. J Am Med Rec Assoc. 1990;61:40–42. https://www.ncbi.nlm.nih.gov/pubmed/10104531.

[R31] 31. Livingston K, Bada M, Baumgartner WJ, Hunter L. KaBOB: Ontology-based semantic integration of biomedical databases. BMC Bioinformatics. 2015;16:126. doi: 10.1186/s12859-015-0559-3.

[R32] 32. Lohmann S, Negru S, Haag F, Ertl T. Lecture Notes in Computer Science. Springer Nature; 2014. VOWL 2: User-oriented visualization of ontologies; pp. 266–281.

[R33] 33. Lu Y, Guo Y, Korhonen A. Link prediction in drug-target interactions network using similarity indices. BMC Bioinformatics. 2017;18:39. doi: 10.1186/s12859-017-1460-z.

[R34] 34. Mungall C, McMurry J, Köhler S, Balhoff J, Borromeo C, Brush M, Carbon S, Conlin T, Dunn N, Engelstad M, Foster E, Gourdine J, Jacobsen J, Keith D, Laraway B, Lewis S, NguyenXuan J, Shefchek K, Vasilevsky N, Yuan Z, Washington N, Hochheiser H, Groza T, Smedley D, Robinson P, Haendel M. The Monarch Initiative: An integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2017;45:712–722. doi: 10.1093/nar/gkw1128.

[R35] 35. Mungall C, Washington N, Nguyen-Xuan J, Condit C, Smedley D, Köhler S, Groza T, Shefchek K, Hochheiser H, Robinson P, Lewis S, Haendel M. Use of model organism and disease databases to support matchmaking for human disease gene discovery. Hum Mutat. 2015;36:979–984. doi: 10.1002/humu.22857.

[R36] 36. Ong E, Xiang Z, Zhao B, Liu Y, Lin Y, Zheng J, Mungall C, Courtot M, Ruttenberg A, He Y. Ontobee: A linked ontology data server to support ontology term dereferencing, linkage, query and integration. Nucleic Acids Res. 2017;45:347–352. doi: 10.1093/nar/gkw918.

[R37] 37. Pafilis E, O’Donoghue SI, Jensen LJ, Horn H, Kuhn M, Brown NP, Schneider R. Reflect: Augmented browsing for the life scientist. Nature Biotechnology. 2009;27(6):508–510. doi: 10.1038/nbt0609-508.

[R38] 38. Racunas S, Shah N, Albert I, Fedoroff N. HyBrow: A prototype system for computer-aided hypothesis evaluation. Bioinformatics. 2004;20(Suppl 1):257–264. doi: 10.1093/bioinformatics/bth905.

[R39] 39. Sharp M. Toward a comprehensive drug ontology: Extraction of drug-indication relations from diverse information sources. J Biomed Semantics. 2017;8:2. doi: 10.1186/s13326-016-0110-0.

[R40] 40. Smalheiser N, Torvik V, Zhou W. Arrowsmith two-node search interface: A tutorial on finding meaningful links between two disparate sets of articles in MEDLINE. Comput Methods Programs Biomed. 2009;94:190–197. doi: 10.1016/j.cmpb.2008.12.006.

[R41] 41. Smedley D. Faculty of 1000 evaluation for The human phenotype ontology: Semantic unification of common and rare disease, Faculty of 1000 Ltd. 2017. doi: 10.3410/f.725602763.793528156.

[R42] 42. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg L, Eilbeck K, Ireland A, Mungall C, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone S, Scheuermann R, Shah N, Whetzel P, Lewis S. The OBO foundry: Coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol. 2007;25:1251–1255. doi: 10.1038/nbt1346.

[R43] 43. Soldatos T, Perdigão N, Brown N, Sabir K, O’Donoghue S. How to learn about gene function: Text-mining or ontologies? Methods. 2015;74:3–15. doi: 10.1016/j.ymeth.2014.07.004.

[R44] 44. Suthram S, Dudley JT, Chiang AP, Chen R, Hastie TJ, Butte AJ. Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent drug targets. PLoS Comput Biol. 2010;6(2):1000662. doi: 10.1371/journal.pcbi.1000662.

[R45] 45. The Gene Ontology Consortium. Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Res. 2017;45:331–338. doi: 10.1093/nar/gkw1108.

[R46] 46. Tipney H, Hunter L. An introduction to effective use of enrichment analysis software. Hum Genomics. 2010;4:202–206. doi: 10.1186/1479-7364-4-3-202.

[R47] 47. Tripathi S, Moutari S, Dehmer M, Emmert-Streib F. Comparison of module detection algorithms in protein networks and investigation of the biological meaning of predicted modules. BMC Bioinformatics. 2016;17:129. doi: 10.1186/s12859-016-0979-8.

[R48] 48. Willighagen E, Waagmeester A, Spjuth O, Ansell P, Williams A, Tkachenko V, Hastings J, Chen B, Wild D. The ChEMBL database as linked open data. J Cheminform. 2013;5:23. doi: 10.1186/1758-2946-5-23.

[R49] 49. Xia J, Fang A, Zhang X. A novel feature selection strategy for enhanced biomedical event extraction using the Turku system. Biomed Res Int. 2014;2014:205239. doi: 10.1155/2014/205239.

[R50] 50. You J. Artificial intelligence. DARPA sets out to automate research. Science. 2015;347:465. doi: 10.1126/science.347.6221.465.

[R51] 51. Zeng X, Zhang X, Liao Y, Pan L. Prediction and validation of association between microRNAs and diseases by multipath methods. Biochim Biophys Acta. 2016;1860:2735–2739. doi: 10.1016/j.bbagen.2016.03.016.

[R52] 52. Zheng J, Howsmon D, Zhang B, Hahn J, McGuinness D, Hendler J, Ji H. Entity linking for biomedical literature. BMC Med Inform Decis Mak. 2015;15(Suppl 1):4. doi: 10.1186/1472-6947-15-S1-S4.
