PLOS Computational Biology. 2009 Jul 31;5(7):e1000443. doi: 10.1371/journal.pcbi.1000443

Semantic Similarity in Biomedical Ontologies

Catia Pesquita, Daniel Faria, André O Falcão, Phillip Lord, Francisco M Couto
Editor: Philip E Bourne
PMCID: PMC2712090  PMID: 19649320

Abstract

In recent years, ontologies have become a mainstream topic in biomedical research. When biological entities are described using a common schema, such as an ontology, they can be compared by means of their annotations. This type of comparison is called semantic similarity, since it assesses the degree of relatedness between two entities by the similarity in meaning of their annotations. The application of semantic similarity to biomedical ontologies is recent; nevertheless, several studies have been published in the last few years describing and evaluating diverse approaches. Semantic similarity has become a valuable tool for validating the results drawn from biomedical studies such as gene clustering, gene expression data analysis, prediction and validation of molecular interactions, and disease gene prioritization.

We review semantic similarity measures applied to biomedical ontologies and propose their classification according to the strategies they employ: node-based versus edge-based and pairwise versus groupwise. We also present comparative assessment studies and discuss the implications of their results. We survey the existing implementations of semantic similarity measures, and we describe examples of applications to biomedical research. This will clarify how biomedical researchers can benefit from semantic similarity measures and help them choose the approach most suitable for their studies.

Biomedical ontologies are evolving toward increased coverage, formality, and integration, and their use for annotation is increasingly a focus both of manual effort by biomedical experts and of automated annotation procedures, aimed at creating corpora of higher quality and completeness than are currently available. Given that semantic similarity measures are directly dependent on these developments, we can expect to see them gaining more relevance and even becoming as essential to biomedical research as sequence similarity is today.

Introduction

Comparison and classification have been central pillars of biology since Linnaeus proposed his taxonomy and Darwin observed the mockingbirds on the Galapagos Islands. Like most scientific knowledge, biological laws and models are derived from comparing entities (such as genes, cells, organisms, populations, species) and finding their similarities and differences. However, biology is unlike other sciences in that its knowledge can seldom be reduced to mathematical form. Thus, biologists either record their knowledge in natural language—for example, in scientific publications—or they must seek other forms of representation to organize it, such as classification schemes. When new entities arise, biologists approach them by comparing them to known entities and making inferences according to their degree of similarity.

Comparing entities is not always trivial. For instance, while the sequences or structures of two gene products can be compared directly (through alignment algorithms), the same is not true of their functional aspects. The difference is that sequences and structures have an objective representation and measurable properties, whereas functional aspects have neither. This does not mean that it is impossible to compare functional aspects, but that to be compared they must be expressed in a common and objective form.

The advent of automated sequencing has had deep repercussions on knowledge representation in biology. As experimental methods shift in scope from the gene level to the genomic level, computational analysis is proving essential in handling the increasing amount of data. Thus it has become crucial to adopt common and objective knowledge representations to help knowledge sharing and computer reasoning. This need led to the development of ontologies for goals such as annotating gene products (Gene Ontology), annotating sequences (Sequence Ontology), and annotating experimental assays (Microarray and Gene Expression Data Ontology).

The adoption of ontologies for annotation provides a means to compare entities on aspects that would otherwise not be comparable. For instance, if two gene products are annotated within the same schema, we can compare them by comparing the terms with which they are annotated. While this comparison is often done implicitly (for instance, by finding the common terms in a set of interacting gene products), it is possible to do an explicit comparison with semantic similarity measures. Within the context of this article, we define a semantic similarity measure as a function that, given two ontology terms or two sets of terms annotating two entities, returns a numerical value reflecting the closeness in meaning between them.

The Gene Ontology (GO) [1] is the main focus of investigation of semantic similarity in molecular biology, not only because it is the ontology most widely adopted by the life sciences community, but also because comparing gene products at the functional level is crucial for a variety of applications. Semantic similarity applied to the GO annotations of gene products provides a measure of their functional similarity. From this point forward, we will use the term “functional similarity” when referring to the similarity between two gene products given by the semantic similarity between the sets of GO terms with which they are annotated. As such, the semantic similarity measures and the studies reviewed in this article are presented in the context of GO, notwithstanding the fact that they are applicable to other biological ontologies.

GO provides a schema for representing gene product function in the cellular context. Figure 1 shows how GO is structured as three independent directed acyclic graphs (DAGs) that correspond to orthogonal categories of gene product function: molecular function, biological process, and cellular component. The nodes in the graph represent terms that describe components of gene product function. GO links the terms to each other by relationships, most commonly of the types ‘is a’ and ‘part of’, the former expressing a simple class–subclass relationship and the latter expressing a part–whole relationship. Gene products that are described by GO terms are said to be annotated with them, either directly or through inheritance, since annotation to a given term implies annotation to all of its ancestors (true path rule). The Gene Ontology Consortium is responsible for developing and maintaining GO terms, their relationships, and their annotations to genes and gene products of the collaborating databases. Moreover, the GO Consortium is also responsible for developing tools that support the creation, maintenance, and use of all this information.

Figure 1. Section of the GO graph showing the three aspects (molecular function, biological process, and cellular component) and some of their descendant terms.


The fact that GO is a DAG rather than a tree is illustrated by the term “transcription factor activity”, which has two parents. An example of a ‘part of’ relationship is also shown between the terms “cell part” and “cell”.

Classification of Semantic Similarity Measures

Several approaches are available to quantify semantic similarity between terms or annotated entities in an ontology represented as a DAG such as GO. This article distinguishes these approaches in the following way:

  • Scope: Which entities they intend to compare, that is, GO terms versus gene products;

  • Data source: Which components of the ontology they use, i.e., edges versus nodes;

  • Metric: How they quantify and combine the information stored on those components.

Comparing Terms

There are essentially two types of approaches for comparing terms in a graph-structured ontology such as GO: edge-based, which use the edges and their types as the data source; and node-based, in which the main data sources are the nodes and their properties. We summarize the different techniques employed in these approaches in Figure 2 and describe them in the following sections. Note that there are other approaches for comparing terms that do not use semantic similarity—for example, systems that select a group of terms that best summarize or classify a given subject based on the discrete mathematics of finite partially ordered sets [2].

Figure 2. Main approaches for comparing terms: node-based and edge-based, and the techniques used by each approach.


DCA, disjoint common ancestors; IC, information content; MICA, most informative common ancestor.

Edge-based

Edge-based approaches are based mainly on counting the number of edges in the graph path between two terms [3]. The most common technique, distance, selects either the shortest path or the average of all paths, when more than one path exists. This technique yields a measure of the distance between two terms, which can be easily converted into a similarity measure. Alternatively, the common path technique calculates the similarity directly by the length of the path from the lowest common ancestor of the two terms to the root node [4].

While these approaches are intuitive, they are based on two assumptions that are seldom true in biological ontologies: (1) nodes and edges are uniformly distributed [5], and (2) edges at the same level in the ontology correspond to the same semantic distance between terms. Several strategies have been proposed to attenuate these issues, such as weighting edges differently according to their hierarchical depth, or using node density and type of link [6]. However, terms at the same depth do not necessarily have the same specificity, and edges at the same level do not necessarily represent the same semantic distance, so the issues caused by the aforementioned assumptions are not solved by those strategies.

Node-based

Node-based approaches rely on comparing the properties of the terms involved, which can be related to the terms themselves, their ancestors, or their descendants. One concept commonly used in these approaches is information content (IC), which gives a measure of how specific and informative a term is. The IC of a term c can be quantified as the negative log likelihood,

$IC(c) = -\log p(c)$

where p(c) is the probability of occurrence of c in a specific corpus (such as the UniProt Knowledgebase), normally estimated from its frequency of annotation. Alternatively, the IC can be calculated from the number of children a term has in the GO structure [7], although this approach is less commonly used.
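As a minimal sketch of this estimate (not taken from any of the cited implementations), the following Python snippet computes the IC of a term from hypothetical annotation counts; the GO identifiers are real but the counts are invented for illustration.

```python
import math

# Hypothetical annotation counts: term -> number of gene products annotated
# with it (directly or by inheritance); the counts are invented.
annotation_counts = {
    "GO:0003674": 100000,  # molecular_function (root term of that aspect)
    "GO:0001071": 3000,
    "GO:0003700": 2500,
}
corpus_size = annotation_counts["GO:0003674"]  # every product inherits the root

def information_content(term):
    """IC(c) = -log p(c), with p(c) estimated by annotation frequency."""
    return -math.log(annotation_counts[term] / corpus_size)

print(information_content("GO:0003700"))  # more specific terms -> higher IC
```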

The concept of IC can be applied to the common ancestors two terms have, to quantify the information they share and thus measure their semantic similarity. There are two main approaches for doing this: the most informative common ancestor (MICA technique), in which only the common ancestor with the highest IC is considered [8]; and the disjoint common ancestors (DCA technique), in which all disjoint common ancestors (the common ancestors that do not subsume any other common ancestor) are considered [9].
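A small sketch of the MICA technique, assuming ancestor sets (each term's ancestors, including itself) and an IC table have been precomputed; the term labels and IC values are invented for the example.

```python
# Illustrative selection of the most informative common ancestor (MICA).
ancestors = {
    "t1": {"t1", "a", "b", "root"},
    "t2": {"t2", "a", "c", "root"},
}
ic = {"t1": 5.1, "t2": 4.8, "a": 3.2, "b": 2.0, "c": 2.5, "root": 0.0}

def mica(c1, c2):
    """Most informative common ancestor: the shared ancestor with highest IC."""
    common = ancestors[c1] & ancestors[c2]
    return max(common, key=lambda term: ic[term])

# The DCA technique would instead keep every common ancestor that is not
# itself an ancestor of another common ancestor, and average their ICs.
print(mica("t1", "t2"))  # -> "a"
```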

Approaches based on IC are less sensitive to the issues of variable semantic distance and variable node density than edge-based measures [8], because the IC gives a measure of a term's specificity that is independent of its depth in the ontology (the IC of a term is dependent on its children but not on its parents). However, the IC is biased by current trends in biomedical research, because terms related to areas of scientific interest are expected to be more frequently annotated than other terms. Nevertheless, the use of the IC still makes sense from a probabilistic point of view: it is more probable (and less meaningful) that two gene products share a commonly used term than an uncommonly used term, regardless of whether that term is common because it is generic or because it is related to a hot research topic.

Other node-based approaches include looking at the number of shared annotations, that is, the number of gene products annotated with both terms [10]; computing the number of shared ancestors across the GO structure; and using other types of information such as node depth and node link density (i.e., node degree) [11].

Comparing Gene Products

Gene products can be annotated with several GO terms within each of the three GO categories. Gene product function is often described by several molecular function terms, and gene products often participate in multiple biological processes and are located in various cellular components. Thus, to assess the functional similarity between gene products (within a particular GO category) it is necessary to compare sets of terms rather than single terms. Several strategies have been proposed for this, which we have divided into two categories: pairwise and groupwise approaches, as shown in Figure 3.

Figure 3. Main approaches for comparing gene products: pairwise and groupwise, and the techniques used by each approach.


Pairwise

Pairwise approaches measure functional similarity between two gene products by combining the semantic similarities between their terms. Each gene product is represented by its set of direct annotations, and semantic similarity is calculated between terms in one set and terms in the other (using one of the approaches described previously for comparing terms). Some approaches consider every pairwise combination of terms from the two sets (all pairs technique), while others consider only the best-matching pair for each term (best pairs technique). A global functional similarity score between the gene products is obtained by combining these pairwise semantic similarities, with the most common combination approaches being the average, the maximum, and the sum.
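The difference between these combination strategies can be sketched in a few lines of Python; `term_sim` stands in for any term-level measure (such as those described above), and the dummy similarity function and term sets are purely illustrative.

```python
import itertools
import statistics

def all_pairs(terms_a, terms_b, term_sim, combine=statistics.mean):
    """All-pairs technique: compare every term of A with every term of B."""
    sims = [term_sim(a, b) for a, b in itertools.product(terms_a, terms_b)]
    return combine(sims)  # combine = mean, max, or sum

def best_match_average(terms_a, terms_b, term_sim):
    """Best-pairs technique: keep only the best match for each term."""
    best_ab = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_ba = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (statistics.mean(best_ab) + statistics.mean(best_ba)) / 2

dummy_sim = lambda a, b: 1.0 if a == b else 0.2   # placeholder term measure
print(all_pairs({"t1", "t2"}, {"t1", "t3"}, dummy_sim, combine=max))  # maximum
print(best_match_average({"t1", "t2"}, {"t1", "t3"}, dummy_sim))
```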

Groupwise

Groupwise approaches do not rely on combining similarities between individual terms to calculate gene product similarity, but calculate it directly by one of three approaches: set, graph, or vector.

In set approaches only direct annotations are considered and functional similarity is calculated using set similarity techniques.

In graph approaches gene products are represented as the subgraphs of GO corresponding to all their annotations (direct and inherited). Functional similarity can be calculated either using graph matching techniques or, because these are computationally intensive, by considering the subgraphs as sets of terms and applying set similarity techniques.

In vector approaches a gene product is represented in vector space, with each term corresponding to a dimension, and functional similarity is calculated using vector similarity measures. Vectors can be binary, with each dimension denoting presence or absence of the term in the set of annotations of a given gene product, or scalar, with each dimension representing a given property of the term (for example, its IC).
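The three representations can be contrasted with a toy example; the term identifiers and the fixed vocabulary below are hypothetical.

```python
# Toy contrast of the three groupwise representations of a gene product
# directly annotated with t1 and t2, whose inherited ancestors are a and root.
direct_terms = {"t1", "t2"}                    # set approach: direct annotations only
subgraph_terms = direct_terms | {"a", "root"}  # graph approach: direct + inherited terms
vocabulary = ["t1", "t2", "t3", "a", "root"]   # one vector dimension per GO term
binary_vector = [1 if t in subgraph_terms else 0 for t in vocabulary]  # vector approach
print(binary_vector)  # [1, 1, 0, 1, 1]
```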

Survey of Semantic Similarity Measures

Since the first application of semantic similarity in biology, by Lord et al. [12], several semantic similarity measures have been developed for use with GO, as shown in Table 1. The following sections present a survey of semantic similarity measures proposed within the context of GO.

Table 1. Summary of term measures, their approaches, and their techniques.

Measure | Approach | Techniques
Resnik [8] | Node-based | MICA
Lin [13] | Node-based | MICA
Jiang and Conrath [14] | Node-based | MICA
GraSM [9] | Node-based | DCA
Schlicker et al. [16] | Node-based | MICA
Wu et al. [22] | Edge-based | Shared path
Wu et al. [23] | Edge-based | Shared path; distance
Bodenreider et al. [17] | Node-based | Shared annotations
Othman et al. [11] | Hybrid | IC/depth/number of children; distance
Wang et al. [25] | Hybrid | Shared ancestors
Riensche et al. [18] | Node-based | IC/MICA; shared annotations
Yu et al. [20] | Edge-based | Shared path
Cheng et al. [21] | Edge-based | Shared path
Pozo et al. [24] | Edge-based | Shared path

Measures for Terms

Node-based

The most common semantic similarity measures used with GO have been Resnik's, Lin's, and Jiang and Conrath's, which are node-based measures relying on IC [8],[13],[14]. They were originally developed for WordNet and later applied to GO [12],[15]. Resnik's measure defines the similarity between two terms simply as the IC of their most informative common ancestor (MICA):

$sim_{Resnik}(c_1, c_2) = IC(c_{MICA})$ (1)

While this measure is effective in determining the information shared by two terms, it does not consider how distant the terms are from their common ancestor. To take that distance into account, Lin's and Jiang and Conrath's measures relate the IC of the MICA to the IC of the terms being compared:

$sim_{Lin}(c_1, c_2) = \frac{2 \times IC(c_{MICA})}{IC(c_1) + IC(c_2)}$ (2)
$dist_{JC}(c_1, c_2) = IC(c_1) + IC(c_2) - 2 \times IC(c_{MICA})$ (3)

However, being relative measures, $sim_{Lin}$ and $dist_{JC}$ are displaced from the graph. This means that these measures are proportional to the IC differences between the terms and their common ancestor, independently of the absolute IC of the ancestor.

To overcome this limitation, Schlicker et al. [16] have proposed the relevance similarity measure, which is based on Lin's measure, but uses the probability of annotation of the MICA as a weighting factor to provide graph placement.

$sim_{Rel}(c_1, c_2) = \frac{2 \times IC(c_{MICA})}{IC(c_1) + IC(c_2)} \times (1 - p(c_{MICA}))$ (4)
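As a compact illustration of Equations 1–4, the sketch below computes the four measures from IC values supplied directly; the numbers are arbitrary, and finding the MICA itself is assumed to be done elsewhere (see the earlier MICA sketch).

```python
def sim_resnik(ic_mica):
    return ic_mica  # Equation 1: IC of the most informative common ancestor

def sim_lin(ic_c1, ic_c2, ic_mica):
    return 2 * ic_mica / (ic_c1 + ic_c2)  # Equation 2

def dist_jiang_conrath(ic_c1, ic_c2, ic_mica):
    return ic_c1 + ic_c2 - 2 * ic_mica  # Equation 3 (a distance, not a similarity)

def sim_rel(ic_c1, ic_c2, ic_mica, p_mica):
    # Equation 4: Lin's measure weighted by the MICA's annotation probability
    return sim_lin(ic_c1, ic_c2, ic_mica) * (1 - p_mica)

print(sim_resnik(3.2), sim_lin(5.1, 4.8, 3.2),
      dist_jiang_conrath(5.1, 4.8, 3.2), sim_rel(5.1, 4.8, 3.2, 0.04))
```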

A constraint all of these measures share is that they look only at a single common ancestor (the MICA) despite the fact that GO terms can have several DCA. To avoid this, Couto et al. [9] proposed the GraSM approach, which can be applied to any of the measures previously described, and where the IC of the MICA is replaced by the average IC of all DCA.

Bodenreider et al. [17] developed a node-based measure that also uses annotation data but does not rely on information theory. It represents each GO term as a vector of all gene products annotated with it, and measures similarity between two terms by computing the scalar product of their vectors.

Riensche et al. used coannotation data to map terms between different GO categories and calculate a weighting factor, which can then be applied to a standard node-based semantic similarity measure [18].

Edge-based

Within the edge-based approaches, Pekar and Staab proposed a measure based on the length of the longest path between two terms' lowest common ancestor and the root (maximum common ancestor depth), and on the length of the longest path between each of the terms and that common ancestor [19]. It is given by the expression

$sim(c_1, c_2) = \frac{\delta(c_{LCA}, root)}{\delta(c_1, c_{LCA}) + \delta(c_2, c_{LCA}) + \delta(c_{LCA}, root)}$ (5)

where $\delta(c_1, c_2)$ is the length, in number of edges, of the longest path between terms $c_1$ and $c_2$, and $c_{LCA}$ is the lowest common ancestor of the two terms being compared. This measure was first applied to GO by Yu et al. [20].
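A minimal sketch of Equation 5, assuming the longest-path lengths have already been computed on the GO graph; the path lengths used are illustrative.

```python
def sim_pekar_staab(depth_lca, dist_c1_lca, dist_c2_lca):
    # depth_lca = delta(c_LCA, root); dist_cX_lca = delta(c_X, c_LCA), in edges
    return depth_lca / (dist_c1_lca + dist_c2_lca + depth_lca)

print(sim_pekar_staab(depth_lca=4, dist_c1_lca=2, dist_c2_lca=3))  # -> 4/9
```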

Cheng et al. also proposed a maximum common ancestor depth measure, but weighted each edge to reflect depth [21]. Wu et al. proposed a nonweighted maximum common ancestor depth measure [22]. An adjustment of this measure was proposed by Wu et al., introducing the distance to the nearest leaf node and the distance to the lowest common ancestor to take term specificity into account [23].

A distinct approach was proposed by Pozo et al. [24], where a “Functional Tree” for molecular function terms is first derived from their co-occurrence in the same set of Interpro entries, and semantic similarity between two terms is calculated from the height of their lowest common ancestor in this “Functional Tree” rather than in the GO graph. With this approach, the authors intend to reveal natural biological links between the terms.

Hybrid

Wang et al. [25] developed a hybrid measure in which each edge is given a weight according to the type of relationship. For a given term c1 and its ancestor cA, the authors define the semantic contribution of cA to c1 as the product of all edge weights in the “best” path from cA to c1, where the “best” path is the one that maximizes the product. Semantic similarity between two terms is then calculated by summing the semantic contributions of all common ancestors to each of the terms and dividing by the total semantic contribution of each term's ancestors to that term. Othman et al. proposed a hybrid distance measure in which each edge is weighted by node depth, node link density, and the difference in IC between the nodes linked by that edge [11].
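The semantic-contribution idea can be sketched on a tiny hypothetical DAG. The edge weights (0.8 and 0.6) follow the values commonly cited for ‘is a’ and ‘part of’ edges in Wang et al.'s formulation, but the graph and the propagation code below are illustrative rather than the authors' implementation.

```python
# `parents` maps each term to {parent: edge_weight}; labels are hypothetical.
parents = {
    "c1": {"a": 0.8, "b": 0.6},
    "c2": {"a": 0.8},
    "a": {"root": 0.8},
    "b": {"root": 0.6},
    "root": {},
}

def contributions(term):
    """Semantic contribution of every ancestor (including the term) to `term`."""
    contrib = {term: 1.0}
    frontier = [term]
    while frontier:
        t = frontier.pop()
        for parent, weight in parents[t].items():
            value = contrib[t] * weight        # product of weights along this path
            if value > contrib.get(parent, 0.0):  # keep only the best (maximum) path
                contrib[parent] = value
                frontier.append(parent)
    return contrib

def sim_wang(c1, c2):
    s1, s2 = contributions(c1), contributions(c2)
    common = s1.keys() & s2.keys()
    return sum(s1[t] + s2[t] for t in common) / (sum(s1.values()) + sum(s2.values()))

print(sim_wang("c1", "c2"))
```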

Measures for Gene Products

Several measures for calculating the functional similarity between gene products have also been developed, as shown in Tables 2 and 3.

Table 2. Summary of pairwise approaches.

Measure | Approach | Techniques | Term Comparison
Lord et al. [12] | All pairs | Average | Resnik/Lin/Jiang
Sevilla et al. [26] | All pairs | Maximum | Resnik/Lin/Jiang
Riensche et al. [18] (XOA) | All pairs | Maximum | XOA
Azuaje et al. [27] | Best pairs | Average | Resnik/Lin/Jiang
Couto et al. [9] | Best pairs | Average | GraSM+(Resnik/Lin/Jiang)
Schlicker et al. [16] (funSim) | Best pairs | Average | simRel
Wang et al. [25] | Best pairs | Average | Wang
Tao et al. [61] (ITSS) | Best pairs | Average | Min. threshold Lin
Pozo et al. [24] | Best pairs | Average | Pozo
Lei et al. [29] | All pairs, Best pairs^a | Average, Max, Sum | Depth of LCA

^a Lei et al. also consider exact matches only.

Table 3. Summary of groupwise approaches.

Measure | Approach | Techniques | Weighting
Lee et al. [30] (TO) | Graph-based | Term overlap | None
Mistry et al. [31] (NTO) | Graph-based | Term overlap, normalized | None
Gentleman [33] (simLP) | Graph-based | Shared path | None
Gentleman [33] (simUI) | Graph-based | Jaccard | None
Martin et al. [32] (GOToolBox) | Graph-based | Czekanowski-Dice, Jaccard | None
Pesquita et al. [44] (simGIC) | Graph-based | Jaccard | IC
Ye et al. [35] | Graph-based | LCA, normalized | None
Cho et al. [36] | Graph-based | LCA | IC
Lin et al. [37] | Graph-based | Intersection | Annotation set probability
Yu et al. [38] | Graph-based | LCA | Annotation set probability
Sheehan et al. [39] (SSA) | Graph-based | Resnik, Lin | Annotation set probability
Huang et al. [40] | Vector-based | Kappa statistic | None
Chabalier et al. [41] | Vector-based | Cosine | IC

Pairwise

The most common methods of measuring gene product functional similarity have been pairwise approaches based on node-based term measures, namely, Resnik's, Lin's, and Jiang and Conrath's. Lord et al. were the first to apply these measures, using the average of all pairwise similarities as the combination strategy [12]; Sevilla et al. applied them using the maximum of the pairwise similarities instead [26]; while Couto et al. and Azuaje et al. opted for a composite average in which only the best-matching term pairs are considered (best-match average) [9],[27]. Schlicker et al. proposed a variation of the best-match average, combining semantic similarities for both the molecular function and biological process aspects of GO [16]; while Tao et al. used a threshold of minimum similarity for selecting best-matching term pairs, and considered only reciprocal pairs to reduce noise [28]. Riensche et al. also employed the maximum combination strategy, but introduced a variation to allow comparison of terms from different aspects of GO (see node-based term measures) [18].

Pairwise approaches have also been applied to edge-based measures: Wang et al. and Pozo et al. used a best-match average combination strategy with their measures [24],[25], and Lei et al. tested a number of different combination approaches, including the average, maximum, and sum for all pairs, best pairs, and only exact matches [29].

Of the several combination strategies employed in pairwise measures, the best-match average variants are the best overall. The maximum approach can answer the question of whether two gene products share a functional aspect, but is unsuitable for assessing their global similarity, as it is indifferent to the number of functional aspects they share and to the number of functional aspects in which they differ. For instance, a gene product A with terms t1 and t2 will be considered 100% similar to a gene product B with terms t1 and t3 under the maximum approach, which obviously does not reflect the differences between the gene products. As for the average approach, because it makes an all-against-all comparison of the terms of two gene products, it will produce counterintuitive results for gene products that have several distinct functional aspects. For instance, two gene products, A and B, that share the same two unrelated terms, t1 and t2, will be 50% similar under the average approach, because similarity will be calculated between both the matching and the opposite terms of the two gene products. The best-match average approach provides a good balance between the maximum and average approaches by considering all terms but only their most significant matches.

Groupwise

Purely set-based approaches are not common, because few measures consider only direct annotations, but many graph-based approaches use set similarity techniques to simplify the problem of graph matching. The first graph-based measure to be applied to GO was that of Lee et al. [30], in which the similarity between gene products is defined by the number of terms they share (term overlap [TO]). More recently, Mistry et al. [31] proposed a normalized version of Lee's measure (NTO), in which the number of overlapping terms is divided by the annotation set size for the gene with the lower number of annotations. GOToolBox also implements some set similarity techniques applied to GO graphs, namely Czekanowski-Dice and Jaccard [32]. Gentleman's simLP and simUI measures were also among the first graph-based measures to be applied to GO [33]: simLP extends the maximum common ancestor depth concept to gene products, so two gene products are as similar as the depth of the term that is the lowest common ancestor of all their direct annotations; whereas simUI considers gene products as the set of terms in their annotation subgraphs, and uses the Jaccard index to calculate the similarity between them:

$simUI(A, B) = \frac{|GO(A) \cap GO(B)|}{|GO(A) \cup GO(B)|}$ (6)

where GO(A) and GO(B) are the sets of terms in the annotation subgraphs of gene products A and B.

Also based on the Jaccard index, Pesquita et al. have proposed the simGIC measure, in which each GO term is weighted by its IC [34].
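A brief sketch of simUI (Equation 6) and simGIC over hypothetical annotation subgraphs, assuming IC values are available; the term names and numbers are invented.

```python
def sim_ui(terms_a, terms_b):
    """Jaccard index over the annotation subgraphs (Equation 6)."""
    return len(terms_a & terms_b) / len(terms_a | terms_b)

def sim_gic(terms_a, terms_b, ic):
    """IC-weighted Jaccard: each term counts as much as its IC."""
    return sum(ic[t] for t in terms_a & terms_b) / sum(ic[t] for t in terms_a | terms_b)

ic = {"t1": 5.0, "t3": 4.0, "a": 3.0, "root": 0.0}
gp_a = {"t1", "a", "root"}   # annotation subgraph of gene product A
gp_b = {"t3", "a", "root"}   # annotation subgraph of gene product B
print(sim_ui(gp_a, gp_b), sim_gic(gp_a, gp_b, ic))
```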

Ye et al. proposed a normalized version of simLP that takes into account the minimum and maximum depths within each GO category [35].

Cho et al. developed a simpler groupwise approach, in which the semantic similarity between two gene products is given by the information content of the most informative term they share [36]. This produces the same result as Resnik's measure with the maximum combination strategy, but is simpler to apply to gene products, as it does not require computing pairwise term similarities.

Other graph-based measures consider the probability of a gene product being annotated with a particular set of terms (annotation set probability). Lin et al. calculate the similarity between two gene product subgraphs as the frequency of occurrence of the graph resulting from the intersection of both subgraphs, that is, the frequency of gene products whose annotation subgraph contains the intersect graph [37]. Yu et al. proposed the “total ancestry similarity” measure, a probabilistic approach in which similarity is given by the probability that any two gene products have exactly the same set of lowest common ancestors as the two gene products being compared [38]. The SSA algorithm by Sheehan et al. is based on the probability of any given gene product being annotated with the nearest common ancestors of two gene products. This probability is then transformed into an IC measure that the authors use to compute Resnik's and Lin's measures to obtain a final gene product similarity value [39]. This algorithm also considers the types of relations between the terms in the subgraphs and corrects the number of annotations for the parent term in a part_of relation, if its number of annotations is smaller than its child's, to comply with the logic that if the part exists, necessarily the whole does too.

As for vector-based approaches, Huang et al. developed a gene product similarity measure, used in the tool DAVID, that is based on annotation profiles including GO terms as well as many other annotations derived from varied sources [40]. Each gene product is represented as a binary vector, with each term having the value 1 if it is annotated to the gene product and 0 otherwise. Similarity is calculated through the kappa statistic, a co-occurrence measure corrected for chance agreement.

Chabalier et al. also represent gene products as vectors of all GO terms, but weight each term according to its IC. Semantic similarity is then calculated through the cosine similarity, which is commonly used to measure document similarity in information retrieval [41]:

$sim_{cos}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$ (7)

where $\vec{a}$ and $\vec{b}$ are the IC-weighted term vectors of the two gene products.
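A minimal sketch of Equation 7 with IC-weighted vectors over a fixed term vocabulary; all identifiers and values are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two numeric vectors (Equation 7)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

vocabulary = ["t1", "t2", "t3", "a"]
ic = {"t1": 5.0, "t2": 4.5, "t3": 4.0, "a": 3.0}
annot_a, annot_b = {"t1", "a"}, {"t2", "a"}
vec_a = [ic[t] if t in annot_a else 0.0 for t in vocabulary]
vec_b = [ic[t] if t in annot_b else 0.0 for t in vocabulary]
print(cosine(vec_a, vec_b))
```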

Evaluation of Semantic Similarity Measures

Given the variety of approaches and measures for semantic similarity, a fundamental question arises: How well does each measure capture the similarity in function between two gene products?

Addressing this question is not trivial, because there is no direct way to ascertain the true functional similarity between two gene products. If there were, there would be no need to apply semantic similarity in the first place. However, there are independent properties, such as sequence similarity or coexpression data, that can be used as measures of similarity at different levels, and by correlating semantic similarity with such properties it is feasible to assess how well a given measure captures the similarity between gene products.
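In practice, such an assessment often reduces to computing a correlation coefficient between paired scores, as in the hedged sketch below (assuming SciPy is available; the score lists are invented purely to show the mechanics and carry no biological meaning).

```python
# Correlate semantic similarity with an independent property (here, sequence
# similarity) computed for the same protein pairs; the values are illustrative.
from scipy.stats import pearsonr, spearmanr

semantic_sim = [0.9, 0.7, 0.4, 0.2, 0.1]
sequence_sim = [0.8, 0.6, 0.5, 0.3, 0.1]   # e.g., normalized alignment scores

print(pearsonr(semantic_sim, sequence_sim))   # linear correlation
print(spearmanr(semantic_sim, sequence_sim))  # rank correlation
```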

The choice of how to evaluate measures is still a subject of debate, because no gold standard and few global comparative studies exist. Furthermore, most authors test only a few measures of interest for their particular applications. Table 4 summarizes the most prominent studies and the best measure identified by them.

Table 4. Summary of assessment studies performed on semantic similarity measures in GO, detailing the properties used in the evaluation and the best performing measures.

Study | Standard | Best Measure
Lord et al. [15] | Sequence similarity | Resnik (average)
Wang et al. [42] | Gene expression | None
Sevilla et al. [26] | Gene expression | Resnik (max)
Couto et al. [9] | Family similarity | Jiang and Conrath
Schlicker et al. [16] | Sequence and family similarity | Schlicker et al.
Lei et al. [29] | Subnuclear location | TO
Guo et al. [43] | Human regulatory pathways | Resnik (average)
Wang et al. [25] | Clustering | Wang et al.
Pesquita et al. [44] | Sequence similarity | simGIC
Xu et al. [45] | PPI/gene expression | Resnik (max)
Mistry et al. [31] | Sequence similarity | TO/Resnik (max)

The first assessment study was done by Lord et al., who tested Resnik's, Lin's, and Jiang and Conrath's measures against sequence similarity using the average combination approach [15]. They used linear correlation as the evaluation criterion, concluding that it was highest for the molecular function aspect, as would be expected given the strong relationship between function and sequence. Of the three measures tested, Resnik's measure had consistently the highest correlation.

Later, Wang et al. tested the same measures against gene coexpression data, also using linear correlation. Although a high correlation was found, no conclusion was drawn about the relative performance of the measures [42].

Sevilla et al. also used gene coexpression data, but tested these measures with the maximum combination approach rather than the average. In agreement with Lord et al., they found Resnik's measure to perform best, and found the biological process aspect to have the highest correlation with gene coexpression. This result was far from unexpected, since genes with similar expression patterns are likely to be involved in the same biological processes [26].

Couto et al. tested the GraSM variation of these measures using a best-match average combination strategy, correlating semantic similarity with the number of Pfam families shared (family similarity). They found the highest correlation with Jiang and Conrath's measure and the biological process aspect [9].

Schlicker et al. compared their measure with Resnik's measure by viewing the distribution of sequence similarity over semantic similarity, considering discrete levels. They found their measure to perform best, particularly in distinguishing orthologous gene products from gene products with other levels of sequence similarity, for both molecular function and biological process aspects [16].

Lei et al. tested several semantic similarity measures in an application for predicting subnuclear location, using the multi-class classification results (Matthews correlation coefficient) as an evaluation criterion [29]. Regarding term measures, these authors compared Resnik's measure with the simLP measure (adapted for term similarity) and an exact term-matching approach, finding them not to be significantly different in performance. They also tested several combination strategies for applying simLP to gene product similarity, concluding that the sum of exact-matching term pair similarities produced the best results. This outcome is not surprising, given the precise nature of their application, as the subnuclear location of a gene product will often be related to the presence of specific GO terms.

Guo et al. evaluated simUI, simLP, Resnik's, Lin's, and Jiang and Conrath's measures on their ability to characterize human regulatory pathways. They concluded that the pairwise approaches tested performed better than groupwise approaches, with Resnik's measure being the best performing overall [43].

Wang et al. tested their measure against Resnik's by clustering gene pairs according to their semantic similarity and comparing the resulting clusters with a human perspective. They showed that their measure produced results closer to human expectations [25].

Recently, Pesquita et al. have done an evaluation of semantic similarity measures by modeling their behavior as a function of sequence similarity. These authors compared Resnik's, Lin's, and Jiang and Conrath's measures, using several combination strategies, and also the groupwise measures simUI and simGIC, showing that the relationship between semantic similarity and sequence similarity is not linear for any of these measures. They found that the average and maximum combination strategies produce worse results than the best-match average and, consistent with Lord's and Sevilla's results, found that Resnik's measure performs better than the other two term-based measures. Overall the best measure for gene product similarity was shown to be simGIC, slightly surpassing Resnik's measure using the “pairwise-best pairs-average” approach in resolution [44].

Xu et al. have also conducted an evaluation study using protein–protein interaction and gene expression datasets of Saccharomyces cerevisiae as the standard [45]. They tested the maximum and average techniques with Resnik's term similarity, as well as Tao's, Schlicker's, and Wang's measures, using receiver operating characteristic (ROC) analysis. The maximum approach was found to be the best performer in all GO categories. A positive influence of the number of annotations per gene product was also found.

Mistry et al. evaluated eleven measures: Resnik, Lin, and Jiang and Conrath with both the average and maximum approaches; three vector-based measures (cosine, weighted cosine, and kappa); and TO and NTO. They investigated the correlation between measures and the correlation with sequence similarity. They found a good correlation between TO and Resnik's measure with the maximum and average approaches. These three measures also correlated well with sequence similarity, with TO presenting the highest correlation.

What we can draw from these studies is that there is no clear best measure for comparing terms or gene products. Different measures have performed differently under different circumstances, and a given measure can be well suited for a specific task but perform poorly in another. For instance, simUI was found by Guo et al. to be the weakest measure when evaluated for its ability to characterize human regulatory pathways, while Pesquita et al. found it to be fairly good when evaluated against sequence similarity. However, one result has been obtained consistently: pairwise measures using Resnik's term similarity outperform those using Lin's and Jiang and Conrath's measures in all studies except the one based on family similarity.

There is also no clear best strategy for evaluating GO-based semantic similarity measures; there are arguments for and against most of the strategies employed. For instance, sequence similarity is well known to be related to functional similarity, but it is just as well known that there are gene products with similar sequences but distinct functions and vice versa. Another example is Pfam families, which are related to global functional aspects of gene products but are likely not suitable for comparison with detailed GO annotations.

Semantic Similarity Implementations

The rise in the number of semantic similarity measures has been accompanied by the development of tools to calculate them. Currently available tools fall into three categories: Web tools, standalone tools, and R packages (see Table 5).

Table 5. Tools for GO-based semantic similarity measures.

Tool | Format | Measures Implemented | Input Size^a | Annotation Types | Extras
FuSSiMeG | Web | Several | 2 | All | None
GOToolBox | Web | Several | Unlimited | Single | Representation, Clustering, Semantic retrieval
ProteInOn | Web | Several | 10 | All/manual | Protein interaction
G-SESAME | Web | Wang et al. [25] | 2 | All manual, Single manual | Clustering, Filter by species
FunSimMat | Web | Several | Unlimited | All | Filter by protein family, Filter by species
DynGO | Standalone | AVG(Resnik) | Unlimited | All ECs | Visualization, Browsing, Semantic retrieval
UTMGO | Standalone | Othman et al. [11] | NA | IEA/non-IEA | Semantic retrieval of terms
SemSim | R | Several | NA | All/non-IEA | Support for clustering, Filter by species
GOvis | R | simLP+simUI | NA | All | Visualization
csbl.go | R | Several | NA | NA | Clustering

^a Acceptable number of terms or gene products.

Web tools, such as FuSSiMeG (http://xldb.fc.ul.pt/rebil/ssm) [46], GOToolBox (http://burgundy.cmmt.ubc.ca/GOToolBox) [32], ProteInOn (http://xldb.fc.ul.pt/biotools/proteinon) [47], FunSimMat (http://funsimmat.bioinf.mpi-inf.mpg.de) [48], and G-SESAME (http://bioinformatics.clemson.edu/G-SESAME) [25], are readily available and simple to use, and their maintenance and updating are the responsibility of the provider. However, they are limited to a pre-set number of options.

Standalone applications, such as DynGO (http://gauss.dbb.georgetown.edu/liblab/dyngo.html) [49] and UTMGO (available upon request) [11], have the advantages that they are not limited in the size of input data (unlike many Web tools) and can support heavier computations (DynGO supports semantic retrieval of both similar terms and gene products, while UTMGO supports retrieval of similar terms). However, they require a local installation, which can be less appealing for some users (DynGO in particular works as a server–client application), and must be updated by the end user.

The R packages SemSim (http://www.bioconductor.org/packages/2.2/bioc/html/SemSim.html) and GOvis (http://bioconductor.org/packages/2.3/bioc/html/GOstats.html) are part of the Bioconductor project, which comprises many R packages for bioinformatics and biostatistics applications. A more recent R-based package for semantic similarity, csbl.go (http://csbi.ltdk.helsinki.fi/csbl.go/) [50], also implements semantic similarity-based clustering, but it relies on users to load annotation data. The main advantage of these implementations is the possibility of integrating semantic similarity with other packages, such as visualization tools or statistical analyses.

Although no single tool implements all existing semantic similarity measures, FuSSiMeG, ProteInOn, FunSimMat, SemSim, and csbl.go provide a wide range of them, and enable the user to choose one or even to compare several (in the case of FunSimMat). Most of the tools mentioned above also provide other types of services and information, such as protein interactions (ProteInOn), GO graph visualization (GOvis, DynGO), GO browsing (DynGO), and clustering (GOToolBox, G-SESAME, csbl.go).

Semantic Similarity Applications

The application of semantic similarity allows gene products to be compared at different levels. This section presents several scenarios in which GO-based semantic similarity measures have been successfully applied.

For instance, GO-based semantic similarity can be used to compare gene products by their biochemical function (molecular function), the cellular and supracellular processes in which they are involved (biological process), and their cellular or extracellular location (cellular component). Comparing the molecular function aspect, we can measure the functional similarity between gene products and gain insight into function-related characteristics such as domains and active sites. The biological process aspect can be related to protein interaction, both physical and indirect (involved in the same process network), and thus can be used to predict them and to analyze coexpression data. The cellular component aspect can be linked to colocalization and in that context be used to validate physical interaction and localization-dependent functions and processes. Overall, GO-based semantic similarity measures have been applied mainly for validating and predicting functions and interactions, and for analyzing transcriptomics and proteomics data.

Automated prediction of gene product function has been a cornerstone of genome annotation, because experimental methods are too costly and time consuming to be able to cope with the size and continuous growth of genetic data [51]. Semantic similarity can be used to assess the performance of automated function prediction methods (as was done at the Automated Function Prediction 2005 Special Interest Group meeting of ISMB 2005 in Detroit, USA) and to validate their results [52]. It has also been used as a component of several function prediction systems, based on protein–protein interactions [54], on structural similarity of protein surfaces [55], and on clustering using semantic similarity [56]; and to validate automatic annotations [57]. Tao et al. have developed a function prediction system in which annotations are transferred between proteins with the only criterion being their semantic similarity [28].

Semantic similarity can also play an important role in both predicting and validating gene product interactions and interaction networks. Regarding prediction, some authors developed methods based solely on semantic similarity [23],[41], whereas others combined semantic similarity with gene expression data [43],[58],[59]. As for validation of interactions, semantic similarity has been used to select a negative dataset of noninteracting proteins to assess prediction methods [60], to improve the performance of predictions by excluding false positives [61], and to assess the quality of predicted interaction networks by comparing them to experimentally verified interactions [48]. Also in the context of interactions, semantic similarity has been used to extract functional modules from interaction networks [62], to align biological pathways [63], to generate functionally meaningful network subsets [64], and to characterize protein interaction networks to support breast cancer outcome prediction [65].

In the analysis of transcriptomics and proteomics data, the main role of semantic similarity has been to improve the clustering of coexpressed gene products by taking into account their functional similarity [56],[66]–[69]. However, it can also be used to link and compare results from different assays [30], to improve data quality [70], and to validate gene selection for biomedical purposes [71].

Other biological applications of semantic similarity include determining interfold similarity based on sequence/structure fragments [72], evaluating the biological significance of coexpressed chromosome domains [73], predicting domain distances [2], and predicting cellular localization [29]. There are also other applications such as integration of semantic search [75],[76].

Unfortunately, most application studies use only one measure and results are not comparable across studies, making it difficult to assess which measure is best for which purpose. However, based on the few comparative studies that exist, we can identify the most successful measures so far in the three main applications of GO-based semantic similarity: function prediction/validation, protein–protein interaction prediction/validation, and cellular location prediction (see Table 6).

Table 6. Best measures for the main applications of GO-based semantic similarity measures.

Application | Best Measure | Reference
Function p/v^a | BMA(Resnik)/simGIC | [44]
Protein-protein interaction p/v | Max(Resnik) | [43],[45]
Cellular location prediction | SUM(EM) | [29]

^a Identified by sequence similarity.
p/v, prediction/validation.

Discussion

Semantic similarity measures are dependent on the quality and completeness of both the ontology and the annotated corpus they use. Biomedical ontologies have several irregularities, such as variable edge length (edges at the same level convey different semantic distances), variable depth (terms at the same level have different levels of detail), and variable node density (some areas of the ontology have a greater density of terms than others). These are due to the irregular nature of biomedical knowledge and our limited understanding of it, and should be taken into account by semantic similarity measures, particularly edge-based measures, which are sensitive to these issues. There are also problems with the use of annotated corpora in node-based measures, because these are biased by biomedical research interests (since terms related to subjects of research focus are expected to be more frequently annotated than other terms) and limited by current experimental techniques (if the currently most popular technique can discover gene product functional aspects only to a certain level of detail, then more detailed terms will seldom be used for annotation). Despite these issues, the use of IC in semantic similarity measures still makes sense probabilistically, if not always semantically, because commonly used terms are more likely to be shared by gene products than uncommonly used terms.

Another issue particular to GO is that not all annotations are equal, because evidence codes show the type of evidence that supports the annotation. While most evidence codes symbolize experimental methods and the corresponding annotations are manually curated, the most prevalent evidence code (IEA) indicates the annotation was inferred electronically (by sequence similarity, from other databases, or from keyword mapping files) and not manually curated. Whether IEA annotations should be used or disregarded for semantic similarity is still a subject of open debate, because using them entails a loss of precision but not using them entails a loss of coverage (over 97% of all annotations in the GOA-UniProt database are IEA, and less than 2% of the GO terms have non-IEA annotations [77]). As the proportion of IEA annotations continues to increase, and they steadily improve in quality (up to 91%–100% precision having been reported [78]), there will be fewer reasons to ignore them, and they will eventually be widely adopted. Since IEA annotations are usually made by inference through similarity to sequences of model organisms, improvements in the experimental annotation of model organisms will result in higher-quality IEA annotations. Meanwhile, perhaps the best way to decide whether to include or discard IEA annotations for a particular study is to first analyze the corpus in question and verify whether the gene products of interest are well characterized with manually curated annotations or whether IEA annotations are essential for their characterization. Note that using different sets of annotations (such as all, or just the manually curated ones, etc.) will have an impact on the IC values calculated for the terms, preventing their comparison. It is also important to stress that only results obtained with the same versions of GO's ontology and annotation data are comparable, since changes to both the ontologies and the annotations made with them affect semantic similarity.

An important issue in the evaluation of semantic similarity measures based on the IC is that of data circularity between the data used to evaluate the measures and the GO annotations. For instance, if a given measure is evaluated by correlation with sequence similarity, then using annotations based on sequence similarity (those with evidence codes ISS and IEA) to calculate the IC leads to data circularity, as there is direct dependency between both data sources. The same is true for the use of annotations inferred by physical interaction (IPI) when a measure is evaluated by correlation with protein-protein interactions, and other similar cases. To minimize the effect of data circularity, evaluation studies should (and usually do) remove annotations based on evidence of the same nature as the data used to evaluate the measures.

Data circularity is not the only problem in evaluating GO-based semantic similarity measures. The lack of a gold standard for evaluating semantic similarity makes it hard to assess the quality of the different measures and to find out which are best for which goals. One of the reasons for the continued lack of a gold standard is that measures are often developed for specific goals and evaluated only in that context. Furthermore, there are pros and cons to all data sources used to evaluate semantic similarity. The best solution is likely that a gold standard be designed by experts to cover most applications of semantic similarity, not based on proxies for true functional similarity, such as sequence similarity or gene coexpression.

How to Choose a Semantic Similarity Measure

Researchers who wish to employ semantic similarity in their work need to spend some time defining their requirements to choose an adequate measure. Since different measures interpret ontology and annotation information in different ways, researchers need to understand the differences and decide which interpretation is best suited to their needs. Below, we outline some of the steps that should be taken before choosing a semantic similarity measure.

  1. Identify your scope: Comparing one aspect versus comparing multiple aspects of gene products;

  2. Identify your level of detail: Comparing gene products in specific functions or overall similarity; and

  3. Analyze the annotations of your dataset: Determining the number of annotations per gene product (both including and excluding IEA annotations) and the annotation depth.

When wishing to compare single aspects of gene products, researchers should opt for maximum approaches (“pairwise–all pairs–maximum”). These will give a measure of how similar two gene products are at their most similar aspect. For comparing multiple aspects, the best measures are “pairwise–best pairs–average” or groupwise approaches, since they allow for the comparison of several terms. However, depending on the level of detail desired, “pairwise–best pairs–average” or “groupwise–set” should be used for a higher degree of specificity (since only direct annotations are used) and “groupwise–vector” or “groupwise–graph” for a more generalized similarity (since all annotations are used). To further minimize the relevance of specificity, unweighted graph or vector measures can be employed, so that high-level GO terms are not considered less relevant.

However, having to analyze the dataset before deciding which measure to use can be cumbersome to researchers who just need a “quick and dirty” semantic similarity calculation. In this case, researchers should resort to one of the several semantic similarity tools available and use their good judgment in analyzing the results. Most semantic similarity measures proposed so far have shown fair if not good results, and for less detailed analyses any one of them can give a good overview of the similarities between gene products.

Conclusions

Over the last decade, ontologies have become an increasingly important component of biomedical research studies, because they provide the formalism, objectivity, and common terminology necessary for researchers to describe their results in a way that can be easily shared and reused by both humans and computers. One benefit of the use of ontologies is that concepts, and entities annotated with those concepts, can be objectively compared through the use of semantic similarity measures. Although the use of semantic similarity in the biomedical field is recent, there is already a wide variety of measures available that can be classified according to the approach they use.

There are two main semantic similarity approaches for comparing concepts: edge-based, which rely on the structure of the ontology; and node-based, which rely on the terms themselves, using information content to quantify their semantic meaning. Node-based measures are typically more reliable in the biomedical field because most edge-based measures assume that all relationships in an ontology are either equidistant or have a distance that is a function of depth, neither of which is true for existing biomedical ontologies.

Because biomedical entities are often annotated with several concepts, semantic similarity measures for comparing entities need to rely on sets of concepts rather than single concepts. There are two main approaches for this comparison: pairwise, in which entities are represented as lists of concepts that are then compared individually; and groupwise, in which the annotation graphs of each entity are compared as a whole.

Several studies have been conducted to assess the performance of different similarity measures, by correlating semantic similarity with biological properties such as sequence similarity, or other classification schemas such as Pfam. Most measures were shown to perform well, but as few comprehensive studies have been conducted, it is difficult to draw any conclusion about which measure is best for any given goal.

Until now, most research efforts in this area have developed novel measures or adapted preexisting ones to biomedical ontologies, with most novel measures exhibiting increased complexity compared to previous ones. This increased complexity is mainly a result of more recent measures combining several strategies to arrive at a final score. Although the need for improved measures is unquestionable, this trend fails to answer the most pressing community needs: (1) easy application to both small and large datasets, which would be best achieved by developing tools that are at once easy to use and powerful; and (2) elucidation of which measure is better suited to the researcher's needs, which would require comparative studies of all existing measures and approaches.

Although important efforts in these two areas have already been made, semantic similarity is still far from reaching the status of other gene product similarities, such as sequence-based ones, in which fast and reliable algorithms coupled with ever-growing databases have made them the cornerstone of present-day molecular biology. One important step forward would be the development of a gold standard for gene product function that would allow the effective comparison of semantic similarity measures. Nevertheless, semantic similarity is not restricted to gene products and it can be expected that, as more biomedical ontologies are developed and used, these measures will soon be applied to different scenarios. It is then crucial that bioinformaticians focus on strategies to make semantic similarity a practical, useful, and meaningful approach to biological questions.

Footnotes

The authors have declared that no competing interests exist.

This work was supported by the Portuguese Fundacao para a Ciencia e Tecnologia through the Multiannual Funding Programme, and the PhD grants SFRH/BD/42481/2007 and SFRH/BD/29797/2006. The funder approved the PhD plans of the two students. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References


