Empowering industrial research with shared biomedical vocabularies

Lee Harland; Christopher Larminie; Susanna-Assunta Sansone; Sorana Popa; M Scott Marshall; Michael Braxenthaler; Michael Cantor; Wendy Filsell; Mark J Forster; Enoch Huang; Andreas Matern; Mark Musen; Jasmin Saric; Ted Slater; Jabe Wilson; Nick Lynch; John Wise; Ian Dix

doi:10.1016/j.drudis.2011.09.013

Author manuscript; available in PMC: 2020 Mar 26.

Published in final edited form as: Drug Discov Today. 2011 Sep 23;16(21-22):940–947. doi: 10.1016/j.drudis.2011.09.013

PMCID: PMC7098809 NIHMSID: NIHMS923360 PMID: 21963522

Abstract

The life science industries (including pharmaceuticals, agrochemicals and consumer goods) are exploring new business models for research and development that focus on external partnerships. In parallel, there is a desire to make better use of data obtained from sources such as human clinical samples to inform and support early research programmes. Success in both areas depends upon the successful integration of heterogeneous data from multiple providers and scientific domains, something that is already a major challenge within the industry. This issue is exacerbated by the absence of agreed standards that unambiguously identify the entities, processes and observations within experimental results. In this article we highlight the risks to future productivity that are associated with incomplete biological and chemical vocabularies and suggest a new model to address this long-standing issue.

Introduction

Commercial life science organizations are evolving; they are exploring new mechanisms to adjust to well-documented economic and productivity challenges. At the same time, thanks to the rapid technological advances within biology they are facing an explosion in the volume and complexity of available data. Efficient management, processing and application of internal and external data are vital to research and development productivity [1,2]. Yet, an integrated view across experiments, literature and databases presupposes that the entities of interest, such as molecules, compounds, cells, observations and even people can be identified and recorded unambiguously. In information systems, identity can be asserted through the use of reference vocabularies in the form of lists, taxonomies and ontologies (Box 1). While these various structures support different use cases, they all provide a mechanism to define the ‘things’ within the data unequivocally [3]. Unfortunately, coverage of biomedical and chemical concepts is patchy at best, with many scientific domains devoid of representation. Even where specific vocabularies exist, they have often been developed for specific purposes, and are unable to support different applications. For example, a thesaurus is useful for text-mining, but may be poor for classification tasks. This compounds the problem, requiring industry informaticians to source or develop multiple vocabulary variants within one scientific domain. Human disease is a good example of this, being represented by the International Classification of Diseases (http://www.who.int/whosis/icd10), Medical Subject Headings (http://www.nlm.nih.gov/mesh), National Cancer Institute Thesaurus (http://ncit.nci.nih.gov/ncitbrowser), Systematized Nomenclature of Medicine-Clinical Terms (http://www.ihtsdo.org), Human Disease Ontology (http://do-wiki.nubic.northwestern.edu) and many other proprietary resources. 
A further consequence is a combinatorial explosion of cross-referencing required to align the same entity across each source. Alternatively, technical or legal restrictions could prevent the cross-referencing of proprietary vocabularies, resulting in incomplete integration. In this article, we consider the impact that this chaotic terminological landscape has on our ability to provide effective information support to industrial research. We present arguments for increased industry participation in developing and sustaining these foundational resources and propose one potential path forward.
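The scale of this cross-referencing burden can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that every pair of vocabularies needs its own mapping set, and compares that count with the number needed if each vocabulary were instead aligned once to a single shared reference. The function names are hypothetical and the counts are not taken from any cited source.

```python
def pairwise_mapping_sets(n_vocabularies: int) -> int:
    """Mapping sets needed when every vocabulary is aligned to every other one."""
    return n_vocabularies * (n_vocabularies - 1) // 2


def hub_mapping_sets(n_vocabularies: int) -> int:
    """Mapping sets needed when each vocabulary is aligned to one shared reference."""
    return n_vocabularies


if __name__ == "__main__":
    # The five public disease vocabularies named in the text above.
    disease_vocabularies = [
        "International Classification of Diseases",
        "Medical Subject Headings",
        "National Cancer Institute Thesaurus",
        "SNOMED Clinical Terms",
        "Human Disease Ontology",
    ]
    n = len(disease_vocabularies)
    print("pairwise:", pairwise_mapping_sets(n))
    print("via shared reference:", hub_mapping_sets(n))
```

Under this simplification the pairwise count grows quadratically with the number of vocabularies, whereas the shared-reference count grows linearly, so the gap widens with every proprietary resource added.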

BOX 1.

Defining ‘vocabulary’

Agreeing on a common definition of the words term, lexicon, dictionary, vocabulary, taxonomy and ontology can in itself be difficult. Here, we use the word ‘vocabulary’ to refer to a terminological resource that provides the identification and definition of entities (also known as ‘concepts’) within a scientific domain. This includes taxonomies, ontologies and terminologies as defined by Smith et al. [35] and glossaries, dictionaries, lexicons, thesauri, taxonomies and ontologies as defined by Vanopstal et al. [36]. Each of these vocabulary forms fits different use cases and applications, but all provide an ability to unambiguously identify biomedical entities. The various important elements are described below:

Basic identity (based on requirements defined by Laibe and Le Novère [37])

Domain: Which area(s) of biomedical science are covered.
Identifier: A unique, permanent, standards-compliant and non-editable string of characters that can be used to represent the entity. Ideally the identifier should be free from any semantics itself.
Preferred term: A human readable name for an individual entity.
Definition: A human readable, succinct definition of the real-world entity that this entry in the vocabulary represents.

Organizational structure

Dictionaries: Lists of entities with basic identity criteria.
Thesauri: Entities with lists of natural language synonyms particularly suited to search and text-mining applications.
Taxonomies: Provide a parent/child hierarchy that might form an important part of the identity and definition of an entity (e.g. ‘Asthma’ is a child of ‘Respiratory Tract Disease’).
Ontologies: Contain entities and their relationships to each other. The most expressive form of knowledge representation, relating entities to each other through multiple associations (for example ‘is a’, ‘part of’, ‘found in’ and ‘interacts with’).

Linguistic features

Synonyms: Per entity, a set of other words (and ideally, language origin) that are used to describe the same entity within natural language.
Homonyms, antonyms, ‘part of speech’ fragments: Assist in dealing with ambiguity particularly in term recognition in text-mining and search applications.

Other important elements

Provenance: The origin of the concept (source, author, date added/modified, among others).
Cross-references: Assertions that describe the relationship of concepts between individual vocabularies. Crucial when integrating data indexed with different vocabularies.
Domain-specific features: Additional elements which are important within certain specific areas. An example of this is a molecular structure representation (such as an InChI [38]) for entities within vocabularies of drugs and other small molecules.
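As a rough illustration of how the elements in this box could fit together in software, the following minimal sketch models a single vocabulary entry as a Python data structure. The class names, field names, identifier scheme (`EX:...`) and cross-reference values are all hypothetical placeholders chosen for this example; they do not correspond to any existing vocabulary or standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Provenance:
    source: str         # originating vocabulary or contributor
    author: str
    date_modified: str  # date as an ISO 8601 string


@dataclass
class VocabularyEntry:
    # Basic identity
    identifier: str      # opaque and permanent; carries no semantics itself
    domain: str
    preferred_term: str
    definition: str
    # Linguistic features
    synonyms: List[str] = field(default_factory=list)
    # Organizational structure: identifiers of parent entries (taxonomy)
    parents: List[str] = field(default_factory=list)
    # Other elements: other vocabulary name -> identifier in that vocabulary
    cross_references: Dict[str, str] = field(default_factory=dict)
    provenance: List[Provenance] = field(default_factory=list)

    def all_names(self) -> List[str]:
        """Preferred term followed by synonyms, e.g. for search or text-mining."""
        return [self.preferred_term, *self.synonyms]


# Hypothetical example entry; every identifier below is a made-up placeholder.
asthma = VocabularyEntry(
    identifier="EX:0000002",
    domain="human disease",
    preferred_term="Asthma",
    definition="Placeholder definition text for illustration only.",
    synonyms=["bronchial asthma"],
    parents=["EX:0000001"],  # stands in for 'Respiratory Tract Disease'
    cross_references={"other_vocabulary": "OTHER:123"},
    provenance=[Provenance("example_source", "example_author", "2011-01-01")],
)
```

Keeping the identifier separate from the preferred term means the human-readable name can change without breaking data that has already been indexed against the entry.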

Origin and use of industry vocabularies

The vocabularies used in industry originate from three main sources. Many are based on, or incorporate elements of those developed by the academic/nonprofit sector, of which over two hundred are listed at the National Center for Biomedical Ontology (NCBO: http://bioportal.bioontology.org). Others are sourced from commercial suppliers, having either been developed specifically for the industry customer or as part of a larger product. Internal vocabulary groups within industry provide the remainder, augmenting existing resources and filling gaps as necessary. Generally, these industry-tuned vocabularies do not make their way into the public domain nor are they shared beyond one company. Although notable exceptions exist (such as the transition of the ‘Bayer Code’ for identifying crop plants and pests: http://pp1.eppo.org), these are few and far between. Furthermore, although some public resources receive ad hoc funding or intellectual contribution from industry, at present this is not coordinated nor part of any long-term strategy. Because all companies are working with the same fundamental science and as such are constructing and maintaining similar vocabularies, there is clearly replication of effort [4].

Perhaps the most fundamental role for vocabularies within industry concerns the storage and retrieval of data and information in electronic repositories. In particular, results from major experiments, animal studies or human clinical samples represent significant resource and financial investments. For both efficiency and indeed regulatory reasons, it is essential that these data can be found by those who require them. Searching repositories can be made more efficient by using thesauri that mitigate the problem of multiple names for a single biomedical entity. For example, a thesaurus can ensure that data concerning the protein carboxypeptidase B2 can be found, no matter which of its many synonyms (CPB2; carboxypeptidase U, CPU; plasma carboxypeptidase B, pCPB; thrombin-activatable fibrinolysis inhibitor, TAFI) is actually used. Alternatively, vocabularies can be used to restrict scientists to specific, controlled terms within data entry software. Obtaining compliance is straightforward when the user is presented with a defined list (typically in a ‘drop-down’ menu), but is much more difficult where they are entering text freely. However, systems that make the selection of vocabulary terms within spreadsheets and documents as intuitive as spell-checking should improve this, and are now becoming a reality ([5,6]; see also Microsoft News Center: http://www.microsoft.com/presspass/press/2009/mar09/03-11MSCreativeCommonsPR.mspx). Furthermore, over the next few years the concept of humans sharing the process of documenting methods and results with the machines performing the experiments might also become routine practice [7]. In such scenarios it will be clearly advantageous for both parties to use a common language.
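The thesaurus-based search described above can be sketched in a few lines. This is a minimal illustration that assumes a thesaurus held as a simple in-memory mapping from preferred term to synonyms; the function names are hypothetical, and only the carboxypeptidase B2 synonyms are taken from the text.

```python
from typing import Dict, List, Optional


def build_synonym_index(thesaurus: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every known name (case-insensitively) to its preferred term."""
    index: Dict[str, str] = {}
    for preferred, synonyms in thesaurus.items():
        for name in [preferred, *synonyms]:
            index[name.casefold()] = preferred
    return index


def normalize(term: str, index: Dict[str, str]) -> Optional[str]:
    """Return the preferred term for a user-supplied name, or None if unknown."""
    return index.get(term.strip().casefold())


# Synonyms as listed in the text for carboxypeptidase B2.
THESAURUS = {
    "carboxypeptidase B2": [
        "CPB2",
        "carboxypeptidase U",
        "CPU",
        "plasma carboxypeptidase B",
        "pCPB",
        "thrombin-activatable fibrinolysis inhibitor",
        "TAFI",
    ],
}

INDEX = build_synonym_index(THESAURUS)

if __name__ == "__main__":
    for query in ["TAFI", "cpb2", "Carboxypeptidase U"]:
        print(query, "->", normalize(query, INDEX))
```

A plain lookup like this treats every synonym as unambiguous. In practice a short form such as ‘CPU’ is also a common homonym outside biology, which is one reason the linguistic features listed in Box 1 matter for text-mining and search applications.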

In an information-rich industry, scientists need to go beyond simple search requests and ask more detailed questions. Table 1 outlines some common information tasks, all of which require a fundamental ability to join information from multiple sources. Crucially this includes internal and external systems, and for the latter, both free and commercial content, highlighting the need for universally accessible vocabularies. Taxonomies and ontologies further aid the interpretation of integrated data by providing the ability to filter results to scientifically useful groupings, such as ‘inflammatory diseases’, ‘antirheumatoid agents’ or ‘G-protein coupled receptors’. Vocabularies are also crucial to more advanced tasks, such as systems modeling in important areas including neural circuitry [8], carcinoma classification [9] and drug toxicity [10]. There is a similar vocabulary dependency in common data mining approaches that provide insights into off-target interactions for drugs [11], new therapeutic opportunities [12], high-throughput screening data [13] and others [14]. Finally, they also power the many successful text-mining techniques that provide surveillance across biomedical literature, patents, regulatory documents and tweets that would otherwise be impossible to monitor fully [15,16].
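Filtering integrated results to a grouping such as ‘inflammatory diseases’ amounts to walking a taxonomy from each result up to its ancestors. The sketch below assumes a toy child-to-parents map and a list of result records; the hierarchy, record fields and function names are hypothetical and deliberately simplified, not drawn from any real vocabulary.

```python
from typing import Dict, List, Set

# Toy child -> parents map, for illustration only.
PARENTS: Dict[str, List[str]] = {
    "rheumatoid arthritis": ["inflammatory diseases"],
    "inflammatory bowel disease": ["inflammatory diseases"],
    "Crohn's disease": ["inflammatory bowel disease"],
    "melanoma": ["neoplasms"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect all ancestors of a term by following parent links upwards."""
    seen: Set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def filter_by_group(
    records: List[dict], group: str, parents: Dict[str, List[str]]
) -> List[dict]:
    """Keep records whose disease is the group itself or one of its descendants."""
    return [
        r for r in records
        if r["disease"] == group or group in ancestors(r["disease"], parents)
    ]


# Hypothetical integrated results indexed with the toy vocabulary.
RECORDS = [
    {"id": 1, "disease": "rheumatoid arthritis"},
    {"id": 2, "disease": "melanoma"},
    {"id": 3, "disease": "Crohn's disease"},
]

if __name__ == "__main__":
    for record in filter_by_group(RECORDS, "inflammatory diseases", PARENTS):
        print(record)
```

The filter only works if every source has indexed its records against the same hierarchy, or against vocabularies that have been cross-referenced to it.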

TABLE 1.

Information strategies for typical R&D questions

Business area	Scientific question	Approach and sources	Vocabularies required
Understanding disease biology	Which proteins are in the EGFR pathway?	Integrate public and commercial pathway databases with text-mined and other results	Gene/protein; pathway
Understanding drug mechanisms	What mechanisms of action have been tested for the treatment of melanomas?	Query internal portfolio systems, drug databases, literature and patents for targets, pathways and bioprocesses connected to general melanoma and specific subtypes of the disease	Gene/protein; target; compound; drug; drug class; disease; mechanism of action; intracellular process; physiological process; study type
Chemistry design and synthesis	Which internal and external compounds are strong, specific antagonists of the CCR1 receptor?	Query internal and external pharmacology databases in addition to text-mined data from public literature	Target; compound; drug; pharmacological assay
Drug repurposing	Can any of our compounds be repurposed for additional indications?	Integrate competitor intelligence databases, text-mining results, omics analyses and more to identify novel compound-disease or target-disease associations	Gene/protein; target; pathway; disease; mechanism of action; biological process; physiological process; symptom
Drug repurposing, safety monitoring	Which of our drugs affect behavior in animal assays?	Take drug portfolio and query internal database and public literature for behavioral signals for candidate and launched compounds	Compound; drug; animal model; behavioral endpoint
Differentiation, safety mitigation	What adverse events for launched anti-inflammatory drugs have been observed in model organisms?	Query drug event databases, both internal and external and integrate the results	Compound; drug; drug class; disease; symptom; species; adverse event; toxicity endpoint; toxicity assay
Pharmacovigilance	Can we monitor relevant resources for safety indications for a set of drugs?	Query literature, clinical trial and drug event databases, both internal and external and integrate the results	Drug; drug class; symptom; adverse event; toxicity endpoint
Strategic collaborations	Who is an external expert on pancreatitis?	Mine/integrate literature, grant applications, patents, conference abstracts to identify key opinion leaders	People; institute; city; country; disease
Competitor intelligence	How does our internal immunomodulation portfolio compare to the competition?	Integrate drug, target and mechanism data across many different diseases where immunomodulation is important. Normalize company names based on mergers/acquisitions. Optionally cluster targets by pathway or mechanism.	Gene/protein; pathway; target; disease; company; mechanism of action; intracellular process; physiological process; pharmaceutical type; clinical trial; clinical outcome
Competitor intelligence, strategic investment	What new trends are emerging in a particular research area?	Analyze literature, patent, grant and conference information to identify biomedical concepts	General biomedical dictionaries; institution; pharmaceutical type; biotechnology concept

EGFR: epidermal growth factor receptor; CCR1: chemokine receptor type 1.

Emerging challenges

The negative impact of partial and missing vocabularies on industrial research is not a new issue [1]. However, in the current, rapidly evolving environment, new scientific, business and technical indicators suggest this problem will become even more acute. Within human health, there is an increased interest in using clinical data to drive and augment basic research, especially when combined with in vitro and animal model studies [17]. A good example of the direction in which many are headed is provided by a recent biomarker study from Genentech (http://www.gene.com/gene/index.jsp) that focused on samples taken from over 3000 patients with rheumatoid arthritis [18]. Multiple genetic, gene expression, cell population and protein marker studies were performed by several contract research organizations (CROs) and subsequently integrated for the analysis. The authors describe how their efforts were hindered by a lack of vocabulary standards, with significant laborious, manual intervention required to match up ethnicity, study regions, drugs and drug types across the results. Because the relationship between large organizations and CROs is evolving from a customer-provider to a research partner model [19], it is likely that these phenomena will arise more frequently. Indeed, the blurring of lines between internal and external teams is apparent in many business strategies, such as industry–academic partnerships, precompetitive initiatives, product in-licensing and open innovation [20]. It is also true that the depth of biomedical knowledge required to take some innovative products to market might be beyond that which any one company can build alone [21]. Thus, the future would appear to be a highly dynamic system with different, transient external partnerships evolving as projects progress. It is self-evident that these highly networked and fluid business models will be significantly hindered by inefficient data exchange and analysis. Indeed, Vargas et al. [22] have highlighted how data standardization issues have already become a problem in major industry–academia collaborations. Similarly, a recent Harvard Business Review identified this area as one requiring substantial improvement, concluding that: ‘A common standard for sharing drug asset data would unleash tremendous innovation’ (The HBR List: Breakthrough Ideas for 2010: https://archive.harvardbusiness.org/cla/web/pl/product.seam?c=2275&i=2277&cs=7b9e2623ca9d337e9e6dd0e21012b011).

From a technical perspective, the drive for integration has led many industry and academic informaticians to explore ‘Semantic Web’ technology [14]. This approach holds promise in addressing major information challenges by combining data integration with powerful querying and inferencing capabilities. The World Wide Web Consortium Health Care and Life Sciences Interest Group (http://www.w3.org/2001/sw/hcls) and others have developed several industry-relevant use cases that identify scientific relationships previously hidden in existing data [23–26]. Yet integrating and inferencing using the Semantic Web is completely dependent upon the proper identification of the biomedical entities in these data [2,27]. Thus, there is an intrinsic link between the availability of good vocabularies and the future success of this technology within industry.

A new approach

The traditional individual company approach to vocabulary provision has, at best, provided limited support for the overall information needs of industry scientists. Moreover, the ever-increasing volume and complexity of preclinical and translational data suggest that this path is unsustainable and cannot meet the levels of coverage now needed. Furthermore, individual standards add little to the ability to integrate other sources of data, being useful only after laborious (and often ambiguous) cross-referencing exercises. We must realize a new environment where project teams are unhindered in using data from whichever partners they choose to engage with and any electronic systems they wish to interrogate. Only open, shared standards that are available to all information producers and consumers are able to fulfill this need. Thus, we propose that industry develops a new strategy in this area, based on the precompetitive development of open research vocabularies. Such an approach could provide many benefits, including:

Cost savings in vocabulary development by sharing the work.
Less redundancy, greater coverage and more concepts for same effort.
Wider body of experts to draw on giving broader scientific representation.
Opportunity to cover multiple languages and integrate non-English information.
Proactively maintained, rather than ad hoc patches when gaps are found.
Increased efficiency, more time exploiting, less time ‘plumbing’.
Better analytical capabilities, better results and greater business impact.

Of course, collaboration can be non-trivial and initial projects might take considerable time and effort to deliver as the participants establish the most effective way to operate. Additionally, different organizations might have irreconcilable views on the construction and primary application of any specific vocabulary [3,28]. Consequently, not all internal vocabularies should be candidates for externalization, and examples that are highly tuned to one company’s specific need would not be relevant. However, even these efforts would benefit by being built upon a backbone of common open concepts, providing better connections to the wider information network. Given the vast landscape of biomedicine, it should be possible to identify many major areas of common need, such as open cell/tissue hierarchies and inter-relationships, catalogs of animal models (and relationships to human biology), pathophysiological processes and disease phenotypes. Although agreement on aims and scope should be possible, there might still be differences of opinion regarding vocabulary content, either in the entities themselves or their relationships to one another. For example, participants might disagree on what constitutes valid symptoms for a disease or the inclusion of ambiguous synonyms in thesauri. However, this problem is relatively simple to mitigate by correct recording of provenance within each vocabulary entry, enabling consumers to include or exclude elements from different contributors as they see fit.
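The provenance-based mitigation described above can be pictured as a simple filter applied by each consumer. The following sketch assumes that every synonym in a shared entry records which contributor asserted it; the class, function and contributor names are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class AttributedSynonym:
    text: str
    contributor: str  # who asserted this synonym (its provenance)


def select_synonyms(
    synonyms: Iterable[AttributedSynonym],
    include: Optional[Set[str]] = None,
    exclude: Optional[Set[str]] = None,
) -> List[str]:
    """Return synonym texts, keeping only chosen contributors.

    include=None means 'accept every contributor not explicitly excluded'.
    """
    excluded = exclude or set()
    selected: List[str] = []
    for synonym in synonyms:
        if synonym.contributor in excluded:
            continue
        if include is not None and synonym.contributor not in include:
            continue
        if synonym.text not in selected:  # drop duplicates, keep first-seen order
            selected.append(synonym.text)
    return selected


# Hypothetical shared entry: two contributors disagree about an ambiguous synonym.
SHARED_SYNONYMS = [
    AttributedSynonym("CPB2", "company_a"),
    AttributedSynonym("TAFI", "company_a"),
    AttributedSynonym("TAFI", "company_b"),
    AttributedSynonym("CPU", "company_b"),
]

if __name__ == "__main__":
    print(select_synonyms(SHARED_SYNONYMS))
    print(select_synonyms(SHARED_SYNONYMS, exclude={"company_b"}))
```

Because the disputed content stays in the shared vocabulary and the choice is made at the point of use, contributors do not need to reach consensus on every entry before the resource becomes useful.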

It is important to recognize that the information and informatics infrastructure varies between life science companies and the ability for each member to benefit will progress at differing rates. This might add to the complexity and cost of any solution to ensure that it is workable for all participants. Alternatively, those companies with a more advanced infrastructure could identify ways to donate these systems and experiences into the public domain. Ultimately, no one will want to be at a disadvantage and hence building awareness of the advantages of collaboration is crucial to gain enough support and resources to ensure continued momentum. Although these concerns are applicable to many precompetitive initiatives, they do not seem to have had too great a negative impact thus far. However, they could influence individual decisions whether to engage or refrain from certain vocabulary projects.

Economics

Although the strategic partnerships suggested above might support this culture shift, investment from industry will still be required because these collaborations will require time and money. Does partnership make economic sense? Here it is helpful to look at a real example, such as the Medical Dictionary for Regulatory Activities (MedDRA) vocabulary, used primarily within pharmacovigilance and developed by the International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). MedDRA access, maintenance and support are funded through tiered subscriptions, with a maximum rate of US $62,850/year for companies with an annual revenue >US $5 billion, but much lower rates for smaller businesses and free access for nonprofit organizations and regulatory authorities (MedDRA: MSSO: http://www.meddramsso.com/public_subscription_rates.asp). The subscription provides an organization with access to over 670,000 highly annotated concepts, translated into multiple languages, in addition to meetings, training and support. Contrast this with an internal approach; the equivalent amount would likely not even cover one employee year, and could never reach the levels of support offered by the group of international subject experts, vocabulary experts, IT experts, and leadership from the ICH MedDRA Management Board. Clearly this joint maintenance model is far more beneficial from a financial standpoint than one in which each company individually carries the maintenance cost of its own in-house medical vocabulary. Of course, MedDRA is only one specific example that has a relatively large number of subscribers and we cannot extrapolate these figures to every vocabulary. However, they do illustrate that industry has already developed one model of operation in this space and that it currently provides significantly extended capabilities at a reasonable cost.

External partnerships

The benefits of open, public data standards for industry align well with strategies put forward in this area for public science [21,22,29–31], highlighting potential synergies between the two domains. Such partnerships would offer a chance to kick-start cross-industry projects by leveraging existing experience and resources, and providing a ‘neutral territory’ in which to collaborate. Furthermore, there are several public organizations and systems that could provide some of the core capabilities that will be required (Table 2). Clearly, there will be challenges in adapting the mechanisms and conventions that are employed by the nonprofit sector to the requirements of industry. Box 2 highlights some of the major issues that would need to be addressed, but given that industry is a major consumer of public research, there is much to be gained by exploring the possibility of greater alignment. Part of this strategy must include dialog with funding agencies concerning the long-term support for shared development and infrastructure. This will require adaptable solutions that take advantage of multinational grants, industry funding and in-kind contributions. Crucially, these initiatives should not be seen as simply ‘industry projects’, but rather opportunities for industry and academia to solve a common problem. Although industry can (and should) engage much more in public vocabulary infrastructure, it cannot be the sole mechanism for funding this work. Long-term support of co-ordination groups, such as the Open Biological and Biomedical Ontologies (OBO) Foundry, cross-national support for projects involving the US-based NCBO and mechanisms to integrate and co-ordinate with the other organizations, including those listed in Table 2, will all be important. 
We believe that public–private projects provide a rich environment to explore this issue and indeed, have already begun generating some important learning in areas, such as toxicology (OpenTox [10]), pharmacology (OpenPhacts: http://openphacts.org) and disease knowledge and semantic publishing (Pistoia Alliance: http://www.pistoiaalliance.org/workinggroups/sesl.html and [32]).

TABLE 2.

Potential nonprofit partners for industry vocabulary management

Name	Description	Further Information
National Center for Biomedical Ontology (NCBO)	Provides tools and services for vocabulary access, management and annotation	http://www.bioontology.org
Concept Wiki and Concept Web Alliance	Provides community annotation and vocabulary management	http://www.conceptwiki.org
Miriam Registry and Identifiers.org	Provides a system for identifying and cataloguing scientific concepts	http://www.ebi.ac.uk/miriam, http://www.identifiers.org
Shared Names	Provides persistent identifiers for biomedical concepts	http://www.sharednames.org
Okkam	Provides EU funded infrastructure for systematic entity identification	http://www.okkam.org
Open Biological and Biomedical Ontologies (OBO) Foundry	Provides a home for many core vocabularies. Promotes best practice, alignment and development standards	http://www.obofoundry.org
BioSharing	Community of journals, funders and standardization efforts; standards-centered catalogs of data sharing resources	http://www.biosharing.org
Pistoia Alliance	Not-for-profit, precompetitive alliance to promote industry standards	http://www.pistoiaalliance.org
SAGE Commons	Not-for-profit, precompetitive alliance for building computational disease models that require vocabularies	http://www.sagebase.org/commons
Innovative Medicines Initiative	Public–private pharmaceutical research and development initiative sharing many information management challenges	http://www.imi.europa.eu
Elixir	European initiative to create long-term life science research and translational medicine informatics and information infrastructure	http://www.elixir-europe.org
National Center for Advancing Translational Sciences (NCATS)	New NIH-funded institute to bridge the ‘translation gap’ in the precompetitive space. Shared data and information challenges with industry	[17]
Clinical and Translational Science Awards Consortium	Network of clinical and translational research centers including information systems in the USA. Shared data and information challenges	https://www.commonfund.nih.gov/ctsa
Clinical Data Interchange Standards Consortium (CDISC)	Provides standards to support clinical research data interoperability. Potential to collaborate on relevant terminologies	http://www.cdisc.org
International Health Terminology Standards Development Organization	Not-for-profit association that develops and promotes use of the SNOMED-CT vocabulary for health information exchange	http://www.ihtsdo.org
International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH)	Brings together the regulatory authorities and pharmaceutical industry of Europe, Japan and the US to discuss scientific and technical aspects of drug registration. Developer of the MedDRA vocabulary	http://www.ich.org
World Wide Web Consortium Health Care and Life Sciences Interest Group	Develops and supports the use of Semantic Web technologies for health care and life science	http://www.w3.org/2001/sw/hcls

BOX 2.

The missing pieces

Although commercial organizations and cross-industry groups collaborate on some open, public vocabularies, the majority are led by nonprofit consortia. To develop resources and infrastructure of value to all parties, the following challenges will need to be addressed:

Where are living vocabularies to be hosted? How exactly will the vocabularies be served? How will unique identifiers be assigned? And who will perform cross-referencing?
Where do industry vocabularies that do not have an existing public domain owner reside? And who is responsible for them?
What provision for change management will be supported? Who is ‘authorized’ to make changes? How will updates be validated? Can the right technical frameworks be put in place to capture provenance and manage data integrity (prevent malicious damage, accidental loss)?
What will be the schedule for release of new versions of the vocabulary?
How might multiple companies reconcile overlapping vocabularies? Who will provide advice and how?
How will multi-lingual requirements be managed? This will be important for those companies that require data and document indexing from offices in distributed geographical locations.
Even in the most buoyant economic climate, industry could not provide an active participant on each vocabulary in the public domain. How will relationships between those which are funded and those which are not be affected?
How do collaborative models cope with changing priorities? What is of interest today might not be in vogue tomorrow, meaning a rapid withdrawal of funding or intellectual input.
What expectations can industry have on publicly funded groups to provide such services? And how do we agree fair funding commitments?

In addition to industry and nonprofit groups, the role of commercial content suppliers, aggregators and publishers must be considered. In many instances, these organizations are supportive of open data standards and actively promote their adoption and use [33,34]. It is also true that they have developed and might sell their own proprietary thesauri, taxonomies and ontologies. An industry vision around open vocabularies has implications for those providers that have already made significant investment in this area. Considering this perspective, two important points should be made. Firstly, to fully leverage the value in commercial data we must markedly improve our ability to integrate and interrogate it, a capability that is dependent upon data standards. As customers, industry needs to clearly articulate the anticipated business benefits to content providers and to partner with them to achieve this in a way that provides value to both parties. Secondly, we are not calling for universal access to commercial data, but rather that the concepts within that data are identified using standards that present few barriers to widespread adoption. Providers are still able to create proprietary assertions and databases, but in the knowledge that the cells, symptoms, processes and pathologies within them will fully connect with their customers’ other data.

Finally, the lack of an extensive community of smaller commercial vocabulary providers suggests that building industry-relevant vocabularies ‘to order’ might not be an area of high economic value on its own. Perhaps smaller companies could identify entrepreneurial approaches that combine collaborative vocabulary services with additional ‘value-added’ (i.e. revenue-generating) opportunities. This could include agreeing to host and maintain a particular vocabulary that is relevant to their core business on behalf of the community. In return, they would not only obtain wider input on a resource of utility to themselves, but might also develop a wider network of potential customers. Major content providers could also play such a role, assisting in an area in which they already have a great deal of experience.

Concluding remarks

The availability of high-quality biomedical vocabularies is an often-overlooked but crucial component of future success in life science research. Our aim is to raise awareness of this issue across the industry and to develop the necessary support, participation and funding required to address it. Firstly, we propose the initiation of pilot studies to explore and validate the hypothesis that shared vocabularies will be mutually beneficial and cost effective. It should not be difficult to identify some core areas of need, agree the scope and provide quantitative judgements on the ability to align and develop vocabularies across multiple companies. Such projects should enable us to identify key bottlenecks in addition to building the economic models to judge whether such an approach will be financially viable in the long term. Secondly, we advocate that vocabulary standards become an intrinsic element within industry software, whether document repositories, electronic laboratory notebooks or intelligence systems. Ensuring that the designers of these applications consider how they will identify concepts in a way that facilitates integration will provide significant future benefits. Finally, industry should actively engage with many of the external groups listed in Table 2, using constructs such as the Pistoia Alliance and Innovative Medicines Initiative (IMI) to supply the legal and logistical frameworks for collaboration. By providing drive, leadership and a vision in this area, industry has an opportunity to begin to tackle a long-standing issue and enable the science needed to deliver future products.

Acknowledgments

This topic was first discussed at a Pistoia Alliance-sponsored meeting on industry vocabulary strategies. We thank the following participants for valuable contributions which provided the substrate for this perspective: Michael Ashburner, Suzanna Lewis, Alan Ruttenberg, Barry Smith (OBO Foundry); Johanna McEntyre, Dominic Clark, Chris Taylor (EMBL-EBI); Philippe Rocca-Serra (University of Oxford); Gordon Baxter (Biowisdom); Douglas Bassett (Ingenuity); Ashley George, Raymond Grimaila (GSK); Tim Shay (Symyx); Bette Brunelle (Dialog); Paolo Ciccarese (Medicognos); Olivier Bodenreider (National Library of Medicine); Andrej Bugrim (GeneGo). We also thank Therese Vachon (Novartis), Hilary Vass (AstraZeneca) and Phoebe Roberts (Pfizer) for suggestions and corrections. We thank Anna Zhao-Wong (MSSO) for information regarding MedDRA.

References

1. Searls D. Data integration: challenges for drug discovery. Nat Rev Drug Discov. 2005;4:45–58. doi: 10.1038/nrd1608.
2. Slater T, et al. Beyond data integration. Drug Discov Today. 2008;13:584–589. doi: 10.1016/j.drudis.2008.01.008.
3. Cimino JJ. Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inf Med. 1998:394–403.
4. Barnes MR, et al. Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery. Nat Rev Drug Discov. 2009:1–8. doi: 10.1038/nrd2944.
5. Rocca-Serra P, et al. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level. Bioinformatics. 2010;26:2354–2356. doi: 10.1093/bioinformatics/btq415.
6. Wolstencroft K, et al. RightField: embedding ontology annotation in spreadsheets. Bioinformatics. 2011;27:2021–2022. doi: 10.1093/bioinformatics/btr312.
7. Qi D, et al. An ontology for description of drug discovery investigations. J Integr Bioinform. 2010;7:1–13. doi: 10.2390/biecoll-jib-2010-126.
8. Rubin DL, et al. Computational neuroanatomy: ontology-based representation of neural components and connectivity. BMC Bioinformatics. 2009;10(Suppl 2):S3. doi: 10.1186/1471-2105-10-S2-S3.
9. Kumar A, et al. An ontology for carcinoma classification for clinical bioinformatics. Stud Health Technol Inform. 2005;116:635–640.
10. Hardy B, et al. Collaborative development of predictive toxicology applications. J Cheminform. 2010;2:7. doi: 10.1186/1758-2946-2-7.
11. Campillos M, et al. Drug target identification using side-effect similarity. Science. 2008;321:263–266. doi: 10.1126/science.1158140.
12. Campbell SJ, et al. Visualising the drug target landscape. Drug Discov Today. 2010;15:3–15. doi: 10.1016/j.drudis.2009.09.011.
13. Schürer SC, et al. BioAssay ontology annotations facilitate cross-analysis of diverse high-throughput screening data sets. J Biomol Screen. 2011;16:415–426. doi: 10.1177/1087057111400191.
14. Chen H, Xie G. The use of web ontology languages and other semantic web tools in drug discovery. Expert Opin Drug Discov. 2010;5:413–423. doi: 10.1517/17460441003762709.
15. Agarwal P, Searls DB. Literature mining in support of drug discovery. Brief Bioinform. 2008;9:479–492. doi: 10.1093/bib/bbn035.
16. Agarwal P, Searls DB. Can literature analysis identify innovation drivers in drug discovery? Nat Rev Drug Discov. 2009;8:865–878. doi: 10.1038/nrd2973.
17. Collins FS. Reengineering translational science: the time is right. Sci Transl Med. 2011;3:1–6. doi: 10.1126/scitranslmed.3002747.
18. Sorani MD, et al. Clinical and biological data integration for biomarker discovery. Drug Discov Today. 2010;15:741–748. doi: 10.1016/j.drudis.2010.06.005.
19. Hutchins S, et al. Open partnering of integrated drug discovery: continuing evolution of the pharmaceutical model. Drug Discov Today. 2011;16:281–283. doi: 10.1016/j.drudis.2011.02.007.
20. Hunter J, Stephens S. Is open innovation the way forward for big pharma? Nat Rev Drug Discov. 2010;9:87–88.
21. Friend SH. The need for precompetitive integrative bionetwork disease model building. Clin Pharmacol Ther. 2010;87:536–539. doi: 10.1038/clpt.2010.40.
22. Vargas G, et al. Arguments against precompetitive collaboration. Clin Pharmacol Ther. 2010;87:527–529. doi: 10.1038/clpt.2010.30.
23. Sahoo SS, et al. An ontology-driven semantic mashup of gene and biological pathway information: application to the domain of nicotine dependence. J Biomed Inform. 2008;41:752–765. doi: 10.1016/j.jbi.2008.02.006.
24. Gudivada RC, et al. Identifying disease-causal genes using semantic web-based representation of integrated genomic and phenomic knowledge. J Biomed Inform. 2008;41:717–729. doi: 10.1016/j.jbi.2008.07.004.
25. Chen B, et al. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics. 2010;11:255. doi: 10.1186/1471-2105-11-255.
26. Jentzsch A, et al. Enabling tailored therapeutics with linked data. Proceedings of the WWW2009 Workshop on Linked Data on the Web. 2009:1–6.
27. Heath T, Bizer C, editors. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool; 2010.
28. Goble C, Wroe C. The montagues and the capulets. Comp Funct Genom. 2004;5:623–632. doi: 10.1002/cfg.442.
29. Field D, et al. Omics data sharing. Science. 2009;326:234–236. doi: 10.1126/science.1180598.
30. Quackenbush J. Data reporting standards: making the things we use better. Genome Med. 2009;1:111. doi: 10.1186/gm111.
31. Schofield PN, et al. Sustaining the data and bioresource commons. Science. 2010;330:592–593. doi: 10.1126/science.1191506.
32. Mons B, et al. The value of data. Nat Genet. 2011;43:281–283. doi: 10.1038/ng0411-281.
33. Editorial. Changing the face of scientific publishing. Integr Biol. 2009;1:293. doi: 10.1039/b906424a.
34. Editorial. Putting data to work. Nat Chem Biol. 2010;6:783. doi: 10.1038/nchembio.465.
35. Smith B, et al. Towards a reference terminology for ontology research and development in the biomedical domain. In: Proceedings of KR-MED 2006. 2006:57–66.
36. Vanopstal K, et al. Vocabularies and retrieval tools in biomedicine: disentangling the terminological knot. J Med Syst. 2009. doi: 10.1007/s10916-009-9389-z.
37. Laibe C, Le Novère N. MIRIAM resources: tools to generate and resolve robust cross-references in systems biology. BMC Syst Biol. 2007;1:58. doi: 10.1186/1752-0509-1-58.
38. McNaught A. The IUPAC international chemical identifier: InChI – a new standard for molecular informatics. Chem Int. 2006;28:12–14.
