Abstract
Biomedical ontologies are increasingly being used to improve information retrieval methods. In this paper, we present a novel information retrieval approach that exploits knowledge specified by the Semantic Web ontology and rule languages OWL and SWRL. We evaluate our approach using an autism ontology that has 156 SWRL rules defining 145 autism phenotypes. Our approach uses a vector space model to correlate how well these phenotypes relate to the publications used to define them. We compare a vector space phenotype representation using class hierarchies with one that extends this method to incorporate additional semantics encoded in SWRL rules. Using a PubMed-extracted corpus of 75 articles, we show that the average rank of a related paper is 4.6 with the class hierarchy method, whereas it is 3.3 with the extended rule-based method. Our results indicate that incorporating rule-based definitions in information retrieval methods can improve search for relevant publications.
1. Introduction
Ontologies are now widely used to provide domain knowledge in many types of biomedical applications, ranging from database mediation1, 2 and data annotation3 to data mining4. In the area of information retrieval, researchers have developed search engines that can automatically expand the user’s terms to include synonyms, hypernyms, hyponyms, and concept hierarchies using terminology resources like UMLS, and that can organize the query results hierarchically based on ontologies such as the Gene Ontology. Ontologies have also been used to identify relevant domain relationships in articles5. Although biomedical ontologies are increasingly being expressed in the W3C standard Web Ontology Language, or OWL (http://www.w3.org/TR/owl-ref/), relatively few information retrieval methods exploit knowledge in this format.
In our work, we are examining the information retrieval needs of autism domain experts who are developing an ontology-based catalog of autism phenotypes (called Phenologue). In particular, we are interested in whether retrieval of PubMed articles relevant to particular phenotypes improves when the rule definition of the phenotype is included with the knowledge encoded in the ontology-based class hierarchy. As previously described6, the autism ontology is an extension of the OWL-based NIFSTD (Neuroscience Information Framework Standard ontology7). Figure 1 shows a part of this ontology modeling phenotypes using an OWL-based class hierarchy. In our prior work, we have shown that the Semantic Web Rule Language, or SWRL (http://www.w3.org/Submission/SWRL/), can be used to provide significantly more detailed rule-based definitions of these phenotypes. In particular, we use SWRL rules to define how phenotypes (e.g., delayed phrases) can be inferred from data collected with standard autism assessment instruments, such as the Autism Diagnostic Instrument. Figure 2 shows a SWRL rule encoding the delayed phrases phenotype using a cutoff score for the age at which phrase acquisition is developmentally expected.
Figure 1.
Autism phenotypes shown as a part of the autism ontology class hierarchy. Some of these phenotypes, such as delayed phrases, have detailed definitions, which are encoded as SWRL rules
Figure 2.
An example SWRL rule encoding the delayed phrases phenotype
When creating the Phenologue knowledge base, domain experts are interested in finding relevant publications for each phenotype, which could be used to establish literature support. To support this knowledge acquisition effort, we have developed a method that incorporates the existing ontology and rule definitions for phenotypes in the search process. Our method extends the standard vector space model, which is widely used by search engines, to exploit these ontology and rule definitions. Using our autism phenotype ontology, we undertook a validation of how well our approach can produce relevant results for semantic search of phenotype definitions in full-text scientific articles.
2. Background
Many modern literature-search tools use vector space modeling to implement search strategies8. Vector space modeling provides an efficient and scalable computational approach for converting a text-based corpus to a standard mathematical format and then searching for terms in that corpus. This model represents each document in a corpus as a vector in Euclidean space, where each dimension of a vector corresponds to an individual term in the overall corpus. If a document includes a term, the corresponding component of its vector is given a non-zero weight. One of the most common methods to compute this weight is term frequency-inverse document frequency weighting8. In this weighting scheme, weights increase proportionally to the number of times the term appears in the document but are scaled down by the frequency of the term in the corpus.
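As an illustration of the weighting scheme described above, the following minimal sketch computes term frequency-inverse document frequency vectors for a toy corpus. It assumes the common variant in which the raw term count is multiplied by log(N/df); the exact variant, the function name `tfidf_vectors`, and the example tokens are assumptions made for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document.

    Assumed variant: weight = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Toy corpus of already tokenized documents (hypothetical content).
corpus = [
    ["delayed", "phrase", "speech", "phrase"],
    ["speech", "social", "interaction"],
    ["social", "behavior"],
]
vectors = tfidf_vectors(corpus)
```

In this variant, a term that occurs in every document receives a weight of zero, which reflects the idea that such a term does not help to discriminate between documents.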
To perform querying, vectors assigned to documents can be compared to query vectors constructed from search terms. They can also be compared against other documents’ vectors to determine the similarity between documents. This approach is used in PubMed’s Related Articles functionality9, for example, and in many other document clustering methods10. While powerful and very efficient, standard vector-based search is based only on the presence and frequency of terms in documents and does not consider any additional information about the terms themselves. If a user searches for a term, synonymous or semantically related terms are not considered.
To improve search quality, lexical databases and thesauri like WordNet11 have historically been used to expand terms in queries12. Ontologies are increasingly being used in a similar way to enhance the information retrieval process by paraphrasing, relaxing, or expanding user queries through context identification and disambiguation12, 13. However, these methods typically do not take advantage of the rich information model presented in ontologies. Some recent approaches have attempted to rectify this shortcoming. Jun-feng et al.14 proposed annotating web content with ontology terms and using these annotations in addition to web content in a vector space model for information retrieval purposes. Castells et al.15 proposed manually assigning weights to annotations in vector space modeling. Khan et al.13 presented a method to extract related concepts automatically and then index a document based on a concept-based model using ontologies. However, none of these methods considers (1) incorporating the axiomatic or rule-based semantics from ontologies in the search process or (2) quantifying ontological expansions by a measure of semantic similarity.
The autism ontology we used to evaluate our information retrieval approach is written in OWL and contains hierarchies of autism phenotypes. The goal of the ontology is to create standard encodings of autism phenotypes, which can be used to support concept-based querying of the National Database of Autism Research (NDAR)16 and other resources. The ontology contains both an information model that represents research or clinical data collected through standardized instruments and a domain ontology that defines terms and relationships among nine major categories of autism phenotypes, such as language, social interaction, and behavioral abnormalities. The ontology is an extension of the NIFSTD neuroscience ontology, which does not contain significant numbers of classes related to autism phenotypes. To encode the initial domain knowledge of phenotypes, a systematic review was undertaken in PubMed to find clinical phenotypes defined by three common assessment instruments used in autism2, 6. A manual analysis by two domain experts of 26 relevant papers produced a list of 145 phenotypes, some of which had multiple definitions. The NIFSTD ontology was then expanded to incorporate classes and properties related to these phenotypes as well as the items from the assessment instruments.
As mentioned, SWRL was used to encode each of the 156 phenotype definitions in the autism OWL ontology. SWRL has emerged as the primary language for encoding rules in OWL ontologies. It provides a Horn-like rule language built on OWL and uses the same description logic foundation as OWL. SWRL rules are written directly in terms of domain concepts in an OWL ontology and can be considered as formal logical statements about ontology entities. This property allows SWRL rules to be treated as first class entities in an ontology and supports automatic methods to determine how rules interact with the ontology. Our extended rule-based method for information retrieval exploits this property to examine the phenotype rules stored in the autism ontology and to determine automatically the domain concepts used in those rules.
3. Methods
To address the shortcomings of prior ontology-based methods for information retrieval, we introduce a novel search method in this paper, which we call the extended rule-based method. This method combines the efficiency of vector modeling with the additional expressivity provided by the semantic information encoded in ontologies. It automatically extends the vector space representation to use SWRL rules and uses various semantic relationships that are encoded in the associated ontology. Instead of simply filtering results based on class hierarchies, it quantitatively incorporates a semantic similarity metric in the vector space representation before undertaking the search process. Figure 3 shows the basic steps of our method.
Figure 3.
An overview of the extended rule-based method, which shows how we model publications and rules in vector space and compute their correlation
To evaluate the efficacy of this extended rule-based method, we compare it to prior ontology-derived methods. In particular, we used the framework presented by Jun-feng et al.14 to derive a baseline method for our evaluation. This method, which we call the class hierarchy method, uses just the ontology hierarchies to expand queries for the domain concepts searched for in the corpus. Like the extended rule-based method, the class hierarchy method uses vector space modeling to represent publications. To model concepts, the class hierarchy method uses a binary vector that assigns 1 to a concept and to its super-classes, sub-classes, and sibling classes, and 0 to all other terms. The correlation between concept binary vectors and publication vectors is computed with a standard cosine similarity measure. After presenting details of the two information retrieval approaches, we discuss how we validated their precision by correlating the phenotype vectors to a set of 75 PubMed-retrieved full-text publications, which included the 26 articles that were the basis of knowledge in the autism ontology.
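The binary concept vector of the class hierarchy method can be sketched as follows. This is an illustrative reading of the description above, not the implementation used in the study: it assumes a single-parent hierarchy stored as a child-to-parent mapping, treats all ancestors as super-classes and all descendants as sub-classes, and uses hypothetical class names.

```python
def expand_concept(concept, parent_of):
    """Return the concept plus its super-, sub-, and sibling classes."""
    related = {concept}
    # Super-classes: walk up the parent chain (assumes all ancestors count).
    node = concept
    while node in parent_of:
        node = parent_of[node]
        related.add(node)
    # Sub-classes: every class whose ancestor chain passes through the concept.
    for cls in parent_of:
        node = cls
        while node in parent_of:
            node = parent_of[node]
            if node == concept:
                related.add(cls)
                break
    # Sibling classes: classes that share the concept's direct parent.
    parent = parent_of.get(concept)
    if parent is not None:
        related.update(c for c, p in parent_of.items() if p == parent)
    return related

def binary_vector(concept, parent_of, dimensions):
    """Assign 1 to every dimension in the expanded concept set, 0 otherwise."""
    related = expand_concept(concept, parent_of)
    return [1 if d in related else 0 for d in dimensions]

# Hypothetical fragment of a phenotype hierarchy (child -> parent).
parent_of = {
    "LanguagePhenotype": "Phenotype",
    "SocialPhenotype": "Phenotype",
    "DelayedPhrases": "LanguagePhenotype",
    "DelayedWords": "LanguagePhenotype",
}
dimensions = ["Phenotype", "LanguagePhenotype", "SocialPhenotype",
              "DelayedPhrases", "DelayedWords"]
vector = binary_vector("DelayedPhrases", parent_of, dimensions)
```

Because every related class receives the same weight of 1, this representation cannot distinguish a direct parent from a distant ancestor, which is the limitation that the semantic similarity weighting of the extended rule-based method is meant to address.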
3.1. Vector Space Modeling
The goal of the extended rule-based method is to take a set of SWRL rules in a domain ontology and to automatically find research publications that are related to the rules. As a first step, we generate a representation of the ontology, its associated rule base, and a set of domain publications in a corpus.
As mentioned, we used a vector space modeling approach8 to encode this representation. Because of structural differences between publications and rules, we used slightly different vector space modeling methods to encode each source. The focus of our approach is to capture the semantic information about terms in a rule to expand the representation of each rule.
Publications Model
Before converting publications to vectors of words, we applied several preprocessing steps to the publications in the corpus. First, we converted the publications from PDF format to plain text. We then eliminated stop words, which are common words that do not convey information about context, using a list of stop words from the natural language processing literature17. Next, we applied a stemming algorithm to reduce inflected and derived words to their stems. For this step, we used the Porter stemming algorithm, which is the de facto standard algorithm for English stemming18. After these preprocessing steps, we converted each publication to a vector of weighted terms using the term frequency-inverse document frequency weighting scheme.
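The preprocessing steps described above can be illustrated with the short sketch below. It is a simplification under stated assumptions: the stop-word list is a tiny hypothetical sample rather than the published list, and `crude_stem` is a naive suffix stripper used as a stand-in, not the Porter algorithm. PDF-to-text conversion is assumed to have already happened.

```python
import re

# Tiny illustrative stop-word list (assumption; not the list used in the study).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "with", "were"}

def crude_stem(word):
    """Strip a few common suffixes. A stand-in only, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The children were delayed in acquiring phrases.")
```

The resulting token lists are what a term frequency-inverse document frequency weighting step would take as input.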
Rule Representation Model
We then represented each of the 156 rules that define the 145 phenotypes in the ontology as a weighted vector of terms. To calculate this weighting, we incorporated the semantics of the OWL terms used in the rules. In standard vector space modeling, the weights of terms in a vector depend only on the relative frequency of the terms. Our method extends this approach to reflect the relationships between the OWL terms used in the rules and ontology terms that are semantically related to those terms.
To quantify the strength of the relationship between ontology terms and the terms used in a rule, we used a semantic similarity weighting scheme. Semantic similarity is a common metric in ontology-based systems and is used to capture the strength of the various relationships between terms in an ontology8. A number of approaches exist to quantify semantic similarity. In most of these approaches, the ontology is considered as a directed graph in which terms are nodes and the various relationships between terms are edges; properties of the edges and nodes are then used to quantify the semantic distance between terms in the graph19, 20.
We used a method based on the hierarchy of classes and properties in an ontology in our semantic similarity weighting scheme. In this approach, semantic similarity decreases exponentially with the distance between two ontology entities in the hierarchy. The definition of semantic similarity is:
Here, SemanticSimilarity_a(b) is the semantic similarity between classes or properties a and b, and distance(a,b) is the distance between them in the hierarchy. α is a parameter that scales the semantic similarity values and can be varied based on the semantic relationship between a and b.
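One form that is consistent with this description can be written as follows. It assumes base e and treats α as a positive decay rate; both are assumptions, and the exact expression used in the study may differ.

```latex
\mathrm{SemanticSimilarity}_a(b) = e^{-\alpha \cdot \mathrm{distance}(a,b)}, \qquad \alpha > 0
```

Under this assumed form, a term has similarity 1 to itself (distance 0), and each additional step in the hierarchy multiplies the similarity by a constant factor of e^{-α}.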
This semantic similarity metric was used to weigh the conceptual similarity of classes and properties that are covered by each SWRL rule. After we calculated the weightings of a set of classes and properties from each SWRL rule, we again used a vector space model to represent these rules and incorporated the weighting in this representation.
If a class or property was in the relevant set of multiple atoms in the rule, we summed its relevance values in the vector representation to reflect the relevance of this class or property to the whole rule base. We also considered term annotations as a part of the vector representations. Adding annotations to the vector space captured metadata information that the rule developers added to the rule base.
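The construction of a rule vector, including the summation over atoms described above, can be sketched as follows. The sketch assumes an exponential decay of similarity with hierarchy distance (base e, with α as a decay rate), and all function names, term names, and distances are hypothetical examples rather than values from the autism ontology.

```python
import math
from collections import defaultdict

def semantic_similarity(distance, alpha=1.0):
    """Assumed form: similarity decays exponentially with hierarchy distance."""
    return math.exp(-alpha * distance)

def rule_vector(atoms, distances, alpha=1.0):
    """Build a weighted term vector for one rule.

    atoms: ontology terms used in the atoms of the rule.
    distances: {atom_term: {related_term: hierarchy distance}}.
    A term related to several atoms accumulates the sum of its similarities.
    """
    vector = defaultdict(float)
    for atom in atoms:
        for term, dist in distances[atom].items():
            vector[term] += semantic_similarity(dist, alpha)
    return dict(vector)

# Hypothetical rule with two atoms that share one related class.
distances = {
    "DelayedPhrases": {"DelayedPhrases": 0, "LanguagePhenotype": 1},
    "hasAgeOfFirstPhrase": {"hasAgeOfFirstPhrase": 0, "LanguagePhenotype": 2},
}
vec = rule_vector(["DelayedPhrases", "hasAgeOfFirstPhrase"], distances)
```

In this example, `LanguagePhenotype` is related to both atoms, so its weight is the sum of the two similarity values, whereas each atom's own term receives a weight of 1.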
3.2. Correlation Computation
After we represent the sets of rules and research publications as points in a vector space, we can compute a similarity metric between rules and publications. We used a standard cosine similarity metric to calculate this similarity. The cosine similarity of two vectors is the cosine of the angle between them. For vectors with non-negative weights, values range from 0 for orthogonal vectors to 1 for vectors that point in the same direction. The mathematical formula for cosine similarity is:
$$\cos\theta = \frac{a \cdot b}{\|a\| \, \|b\|}$$
where a and b are two vectors in the Euclidean space, θ is the angle between them, a · b is their dot product, and ||a|| and ||b|| are the magnitudes of the vectors. Because the cosine similarity measure normalizes the vectors, it does not depend on their magnitudes and gives a stable measure of similarity for any vector space dimension.
We calculated the cosine similarity between each phenotype rule in the rule base and all the publications in the corpus. For each phenotype rule, all publications in the corpus were sorted according to each one’s cosine similarity score.
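The ranking step can be illustrated with the following minimal sketch, which computes cosine similarity over sparse term-weight dictionaries and sorts publications by score. The function names, publication identifiers, and weights are hypothetical, and returning 0 for an empty vector is a convention chosen here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty vectors (assumption)
    return dot / (norm_a * norm_b)

def rank_publications(rule_vec, publications):
    """Return publication ids sorted from most to least similar to the rule."""
    scores = {pid: cosine_similarity(rule_vec, vec)
              for pid, vec in publications.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rule vector and three publication vectors.
rule_vec = {"phrase": 1.0, "delay": 0.5}
publications = {
    "paper_a": {"phrase": 2.0, "delay": 1.0},
    "paper_b": {"social": 1.0},
    "paper_c": {"phrase": 1.0, "social": 1.0},
}
ranking = rank_publications(rule_vec, publications)
```

Here `paper_a` is a scaled copy of the rule vector and therefore ranks first, which shows that the measure depends on the direction of the vectors and not on their magnitude.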
4. Evaluation and Results
We evaluated the accuracy of our method by assessing the correlation of the 145 phenotype classes and their 156 associated rules with a corpus of 75 papers from the autism domain. These papers were retrieved from PubMed using the same keyword search criteria that were used to find the original articles used to develop the autism ontology. The corpus consists of the 26 articles that contained clear definitions of phenotypes and another 49 articles that were not found to have such phenotype definitions. There were 1,726 classes and properties in the phenotype domain ontology. We used the ontology, along with metadata annotations, to create a vector space of 2,765 dimensions. The addition of the corpus created a vector space of 32,086 terms after processing.
For each rule, we sorted all the publications in the corpus using the cosine similarity of each paper to the rule's phenotype class and corresponding rule(s). We used the publications that ontology developers used to define a phenotype class as the gold standard for relevance, and thus as the publications most related to the corresponding rule. In the ideal search result, the publication or publications that were used to define the phenotype will be at the top or near the top of the sorted list of 75 publications. Because most of the phenotypes have only one source publication, we chose to look at the rank of this source article rather than at the more typical precision and recall statistics. We report the average rank of the correct match over all 145 phenotype classes, as well as the percentage of phenotypes for which the top one to ten related publications included at least one correctly matched publication.
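The two evaluation measures described above can be sketched as follows. The sketch assumes that, when a phenotype has several source articles, the best-ranked one determines its rank; this choice, the function names, and the toy rankings are assumptions for illustration and are not results from the study.

```python
def source_rank(ranking, sources):
    """1-based position of the best-ranked gold-standard source article."""
    return min(ranking.index(s) + 1 for s in sources)

def evaluate(rankings, gold, k):
    """Return (average source rank, fraction of phenotypes with a hit in top k)."""
    ranks = [source_rank(rankings[p], gold[p]) for p in rankings]
    average_rank = sum(ranks) / len(ranks)
    hit_rate_at_k = sum(r <= k for r in ranks) / len(ranks)
    return average_rank, hit_rate_at_k

# Toy data: sorted publication lists and gold-standard sources per phenotype.
rankings = {
    "delayed_phrases": ["p3", "p1", "p2"],
    "delayed_words": ["p2", "p3", "p1"],
}
gold = {"delayed_phrases": ["p1"], "delayed_words": ["p2"]}
average_rank, hit_rate_at_1 = evaluate(rankings, gold, k=1)
```

Calling `evaluate` with k from 1 to 10 would produce the kind of top-k curve that is used to compare the two methods.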
In the sorted list of the related publications, the average position of the correct source article or articles for the extended rule-based method was 3.3. For comparison, we examined our method against the class hierarchy method, which, instead of using our semantic similarity measure to incorporate the semantics provided by the SWRL rules in the ontology, uses the ontology hierarchy alone to expand each phenotype concept to a set of related concepts. With the class hierarchy method, the average position of the correct source article or articles was 4.6, that is, about 1.3 positions further down the list than with our extended rule-based method.
To examine how the two methods performed against each other, we varied the number of the top related publications and looked at the percentage of correctly related papers that these methods found. Figure 4 shows this comparison. As shown, the extended rule-based method performs as well as or better than the class hierarchy method across the top one to ten returned results and reaches greater than 90% accuracy for the top 10 articles out of the 75 that are ranked.
Figure 4.
This figure shows the percentage of phenotypes where the retrieved documents cover the correct source article(s) in the extended rule-based method and the class hierarchy method
5. Discussion
We have developed a novel semantic search method to assist biomedical researchers in finding publications directly relevant to concepts in a domain. The method incorporates the semantics provided by rules in a domain ontology to find concepts indirectly related to search terms. Our approach to semantic search differs from prior work. Many information retrieval systems use thesauri to automatically expand search terms to other related terms so that searches are not restricted to simple term matching21. However, these term expansions typically capture only a very limited amount of additional semantic information. Other systems, such as MedMiner22 and Textpresso5, use natural language processing techniques to capture additional term semantics when searching. These techniques include parsing and part-of-speech tagging to identify documents that mention a query term that a user specifies. However, due to the complexity of human language, the performance of these techniques is limited, and they are generally restricted to certain domains.
Ontology-based methods are increasingly being used to enhance search, but their primary use is to structure search results. For example, they are often used in digital libraries to categorize research paper recommendations23. They can also be used to perform fine-grained stratification of results in some domains. GoPubMed, for example, filters search results through the categories of the Gene Ontology24. A few strategies attempt to incorporate semantic information into the search process itself. These strategies include approaches based on description logic and fuzzy logic25. While these approaches do increase expressivity, they generally suffer from scalability problems. Additionally, they are often limited to domains described by a predefined set of ontologies and do not support general-purpose searches using arbitrary user-supplied ontologies.
Our preliminary results indicate that vector-based semantic correlation using rules achieves good precision in finding articles related to a systematically encoded catalog of phenotypes. We are analyzing how we can improve the precision of the match. For example, we plan to evaluate the correlation of the rules to the specific sections of the articles where the phenotype underlying each rule was defined, such as methods or results, rather than to the entire text of the publication. The current study reports only on an internal validation of precision. We are currently working on an evaluation of a web-based service (to be available at phenologue.org), in which we use the Google custom search API to submit our weighted set of terms for each rule and find relevant pages and articles. We are also planning to apply our method to other ontologies and rule bases for further verification. As part of the Phenologue project, we also plan to provide an interactive web service that allows users to upload their own SWRL rule bases with associated ontologies and to search for related full-text publications on the Web.
Acknowledgments
The authors would like to thank Samson W. Tu, Lakshika Tennakoon, Richard Waldinger and Joachim Hallmayer for their help on the ontology development. We would like to acknowledge support from the National Database for Autism Research. This research was supported in part by grants R01LM009607 and R01MH87756 from the National Institutes of Health.
References
1. Jeng S, Wang K, Barbero J, Brinkley J, Tarczy-Hornoch P. A pilot bridging data integration and analytics: BioMediator and R?. AMIA Annual Symposium; Washington, DC. 2005. p. 995.
2. Young L, Tu SW, Tennakoon L, Vismer D, Astakhov V, Gupta A, Grethe JS, Martone ME, Das AK, McAuliffe MJ. Ontology-driven data integration for autism research. 22nd IEEE International Symposium on Computer Based Medical Systems; Albuquerque, NM. 2009. pp. 1–7.
3. Shah NH, Rubin DL, Supekar KS, Musen MA. Ontology-based annotation and query of tissue microarray data. AMIA Annual Symposium; Washington, DC. 2006. pp. 709–713.
4. Raj R, O’Connor MJ, Das AK. An ontology-driven method for hierarchical mining of temporal patterns: application to HIV drug resistance research. AMIA Annual Symposium; Chicago, IL. 2007. pp. 614–619.
5. Muller HM, Kenny EE, Sternberg PW. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol. 2004;2(11):1984–1998. doi: 10.1371/journal.pbio.0020309.
6. Tu S, Tennakoon L, Das AK. Using an integrated ontology and information model for querying and reasoning about phenotypes: the case of autism. AMIA Annual Symposium; Washington, DC. 2008. pp. 727–731.
7. Bug WJ, Ascoli GA, Grethe JS, Gupta A, Fennema-Notestine C, Laird AR, Larson SD, Rubin D, Shepherd GM, Turner JA, Martone ME. The NIFSTD and BIRNLex vocabularies: building comprehensive ontologies for neuroscience. Neuroinformatics. 2008;6(3):175–194. doi: 10.1007/s12021-008-9032-z.
8. Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge University Press; 2008.
9. Wilbur WJ, Coffee L. The effectiveness of document neighboring in search enhancement. Inf Process Manage. 1994;30:253–266.
10. Glenisson P, Antal P, Mathys J, Moreau Y, de Moor B. Evaluation of the vector space representation in text-based gene clustering. Pac Symp Biocomput. 2003;8:391–402. doi: 10.1142/9789812776303_0037.
11. Miller GA. WordNet: a lexical database for English. Commun ACM. 1995;38(11):39–41.
12. Bhogal J, Macfarlane A, Smith P. A review of ontology based query expansion. Inf Process Manage. 2007;43(4):866–886.
13. Khan L, McLeod D, Hovy E. Retrieval effectiveness of an ontology-based model for information selection. VLDB J. 2004;13(1):71–85.
14. Jun-feng S, Wei-ming Z, Wei-dong X, Guo-hui L, Zhen-ning X. Ontology-based information retrieval model for the semantic web. IEEE International Conference on e-Technology, e-Commerce and e-Service; Hong Kong. 2005. pp. 152–155.
15. Castells P, Fernández M, Vallet D. An adaptation of the vector-space model for ontology-based information retrieval. IEEE T Knowl Data En. 2007;19(2):261–272.
16. Hus V, Pickles A, Cook EH, Risi S, Lord C. Using the autism diagnostic interview-revised to increase phenotypic homogeneity in genetic studies of autism. Biol Psychiatry. 2007;61(4):438–448. doi: 10.1016/j.biopsych.2006.08.044.
17. Charniak E, Altun Y, de Salvo Braz R, Garrett B, Kosmala M, Moscovich T, Pang L, Pyo C, Sun Y, Wy W, Yang Z, Zeller S, Zorn L. Reading comprehension programs in a statistical-language-processing class. North American Association for Computational Linguistics Workshop on Reading Comprehension Tests As Evaluation for Computer-Based Language Understanding Systems; Morristown, NJ. 2000. pp. 1–5.
18. Porter M. An algorithm for suffix stripping. Program. 1980;14(3):130–137.
19. Pesquita C, Faria D, Falcao AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol. 2009;5(7):e1000443. doi: 10.1371/journal.pcbi.1000443.
20. Rada R, Mili H, Bicknell E, Blettner M. Development and application of a metric on semantic nets. IEEE T Syst Man Cyb. 1989;19(1):17–30.
21. Bhalotia G, Nakov PI, Schwartz AS, Hearst MA. BioText team report for the TREC 2003 genomics track. 25th Text REtrieval Conference; Gaithersburg, MD. 2003. pp. 612–621.
22. Tanabe L, Scherf U, Smith LH, Lee JK, Hunter L, Weinstein JN. MedMiner: an internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques. 1999;27:1210–1217. doi: 10.2144/99276bc03.
23. Shum SB, Motta E, Domingue J. ScholOnto: an ontology-based digital library server for research documents and discourse. Int J Digit Libr. 2000;3:237–248.
24. Delfs R, Doms A, Kozlenkov A, Schroeder M. GoPubMed: ontology-based literature search applied to GeneOntology and PubMed. German Bioinformatics Conference; Bielefeld, Germany. 2004. pp. 169–178.
25. Zhang L, Yu Y, Zhou J, Lin C, Yang Y. An enhanced model for searching in semantic portals. 14th International Conference on World Wide Web; New York, NY. 2005. pp. 453–462.




