Author manuscript; available in PMC: 2018 Aug 1.
Published in final edited form as: Comput Biol Med. 2017 May 31;87:217–229. doi: 10.1016/j.compbiomed.2017.05.026

Supporting Inter-topic Entity Search for Biomedical Linked Data Based on Heterogeneous Relationships

Nansu Zong 1,*, Sungin Lee 2, Jinhyun Ahn 2, Hong-Gee Kim 2
PMCID: PMC5572073  NIHMSID: NIHMS883485  PMID: 28601712

Abstract

Objective

The keyword-based entity search restricts search space based on the preference of search. When given keywords and preferences are not related to the same biomedical topic, existing biomedical Linked Data search engines fail to deliver satisfactory results. This research aims to tackle this issue by supporting an inter-topic search—improving search with inputs, keywords and preferences, under different topics.

Methods

This study developed an effective algorithm in which the relations between biomedical entities were used in tandem with a keyword-based entity search, Siren. The algorithm, PERank, which is an adaptation of Personalized PageRank (PPR), uses a pair of input: (1) search preferences, and (2) entities from a keyword-based entity search with a keyword query, to formalize the search results on-the-fly based on the index of the precomputed Individual Personalized PageRank Vectors (IPPVs).

Results

Our experiments were performed over ten linked life datasets for two query sets, one with keyword-preference topic correspondence (intra-topic search), and the other without (inter-topic search). The experiments showed that the proposed method achieved better search results, for example a 14% increase in precision for the inter-topic search compared with the baseline keyword-based search engine.

Conclusion

The proposed method improved the keyword-based biomedical entity search by supporting the inter-topic search without affecting the intra-topic search based on the relations between different entities.

Keywords: Keyword-based Entity Search, Inter-topic Search, Personalized PageRank, Biomedical Linked Data

1. Introduction

As biomedical data increases, Resource Description Framework (RDF) (http://www.w3.org/RDF/) has been adopted as the standard data format in Linked Data for interconnection, integration, and reuse of published biomedical data [1]. Biomedical Linked Data has now garnered more than 10 billion links (http://linkedlifedata.com/sources.html) connecting entities in diverse topics, including medicine, drug, symptom, gene, and others.

The keyword-based search, also known as text-based search, adapts the existing information retrieval models to provide scalable indexing capabilities as well as a user-friendly experience for an entity search [2, 3]. With a set of keywords as the input query, retrieved entities are sorted in descending order of relevancy, without requiring the user to understand the schema of the data [2]. The relevancy is computed with two types of models [4, 5]: (1) query-dependent models, such as Siren [6], RareRank [7], and ObjectRank [8]; and (2) query-independent models, such as ReConRank [9] and Swoogle [10]. In order to provide better results, two methods are generally used: (1) query expansion, and (2) filters in advanced search. Query expansion tends to retrieve more search results by enriching the keyword query with expansions, which are related terms (e.g., synonyms, hypernyms) based on knowledge bases [11] or the Web of documents [12], statistically related entities based on co-occurrence [13], or the preferences [14]. Filters use additional input, e.g., search preferences, to narrow down the search space to specific topics and retrieve more precise results. This strategy works for the intra-topic search, in which the topics pertaining to the query keywords correspond with the search preference (e.g., searching drugs by name); however, it provides unsatisfactory results when the topics are related but disparate.

Consider the example shown in Fig. 1: when attempting to find information on “drugs” (denoting the preference of search) for the disease “renal tubular acidosis” (the keyword query), the system will restrict the search space into the entities belonging to the designated preferred topic (i.e., drugs) and search for the entities that contain keywords in their descriptions. As a result, the unrelated drugs “aminohippurate”, “torasemide”, and “acyclovir” are returned by a keyword-based entity search engine, Linked Life Data (http://linkedlifedata.com/search/quick). In this search, the topics are used to narrow down the data spaces, and only the is_A relation is used in the search while other relations between topics are ignored. Therefore, keywords associated with a search preference that inherently requires the use of cross-topic links are not best dealt with by such keyword-based search systems. A system that can search the entities by considering (1) the keywords related to the entity, and (2) the preference related to the class (topic) of the entity, needs to be studied.

Fig. 1. An example of failure in Linked Life Data when it meets an inter-topic entity search.

To improve the existing biomedical entity search method for strengthening the ability of tackling the inter-topic search, this paper introduces PERank (https://github.com/zongnansu1982/PERankComputation) that uses the preference of search to trace the semantic meanings of the links to obtain more precise results. PERank adapts Personalized PageRank (PPR) to compute ranking scores of entities based on a pair of input: (1) a preference of search, and (2) entities returned from an entity search system with a keyword query, such as Siren (http://rdelbru.github.io/SIREn/). PERank refactors PPR with, (1) a customized transition matrix based on the classes and predicates in the schema graph, and (2) a personalized vector based on the entities in the data graph, to generate the Individual Personalized PageRank Vectors (IPPVs), which will be used to calculate a ranking score for each pair of input. PERank was tested with 10 linked life datasets and compared with the baseline search methods in light of various entity search strategies, including query expansion, class and predicate filters. The experiments demonstrate that the entities related to both the keyword and preference can be successfully retrieved by PERank, especially for inter-topic search. For example, in searching “diseases” (preference) that can be treated with “ethinamate” (keyword query), PERank returns two kinds of diseases: “renal tubular acidosis” and “acidosis-osteopetrosis syndrome” (shown in Appendix B), while other methods cannot return related ones. PERank improves the search (e.g., a 14% increase in precision compared to the baseline keyword-based search engine) when the topics of the keywords and preference do not correspond with each other.

The rest of the paper is organized as follows: in Section 2, we introduce the related works of this study. We present the basic idea of PERank in Section 3, and describe the computation of PERank with mathematical definitions in Section 4. In Section 5, we present our experiment results. Finally, we discuss the limitations and conclude our work in Sections 6 and 7.

2. Related works

Entity search, which finds the entities containing a given query in the properties, has been studied for years. Many biomedical platforms provide SPARQL endpoints to search over the data, such as EBI [15], UniProt [16], Medical Subject Headings (MeSH) (https://id.nlm.nih.gov/mesh/query), NCBI2RDF [17], Big Linked Cancer Data [18], and Biotea [19]. However, the limitations of triple stores and SPARQL queries in indexing scalability and query formulation have driven the adoption of Information Retrieval approaches for entity search over the past few years [2]. In contrast to the triple store, keyword-based approaches store all the triples about one entity as an RDF document, such as the star-shaped document [6, 20], and return it as a representation of the entity [2]. The similarity between the document and the keywords is calculated by different IR models, such as the Vector Space Model (VSM) [6] and the probabilistic model BM25F [2, 21]. Based on these models, there are a few applications to search entities in general domains. Siren [6], one of the most popular entity search frameworks, adapts TF/IDF [22] to search Linked Data based on the framework of Lucene (http://lucene.apache.org/) and offers a good compromise between query expressiveness, query processing, and index maintenance. Similarly, Falcons [23], Swoogle [10], and SWSE [9] search Linked Data based on VSM with virtual documents, but sort retrieved entities with different methods. Falcons adopts a combination of query relevancy and entity popularity, and Swoogle and SWSE adopt variants of PageRank (PR) for ranking. Several studies [2, 21] implemented the BM25F retrieval model and reported good results. In the biomedical domain, Linked Life Data (http://linkedlifedata.com/) and Bio2RDF [24] generate linked datasets based on existing biomedical repositories and provide keyword-based search over biomedical entities.
These methods focus on searching Linked Data with a keyword query and use preferences as a filter to narrow down the search space. The semantic relations between the preference and keywords, except for is_A, are ignored in these methods.

Based on the keyword-based search, this study proposes PERank, a variant of PPR, to improve search by exploring the semantic relations between the entities to support an inter-topic search with a given preference. PPR originates from the PR algorithm [25] and allows the PR score to be biased towards specific topics. By using non-uniform personalized vectors (where there are 16 topics in the Open Directory Project), Topic Sensitive PageRank (TSPR) teleports more authority to the webpages belonging to a designated topic to increase the importance of these pages when searching for information on that topic [26]. Similarly, the pages can receive customized rankings based on the “blocks” formed by the same hosts [27]. ObjectRank is the first study that tries to apply PPR to entity-relationship (ER) graph search [8]. There are two notable differences between ObjectRank and the traditional PPR. Firstly, ObjectRank takes predicates into consideration when teleporting the authorities by predefining the weights of links. Secondly, ObjectRank changes the role of the personalization vector from ranking to search. The topics are replaced with query entities, and the schema is weighted to form an ER graph in ObjectRank and its variations [28–31].

Similar to ObjectRank, the retrieved entities are used as the query vector to populate the personalized vector for an extended search. However, instead of using a monotonous probability transition matrix based on a predefined weighted graph, PERank populates the transition matrix based on the preference. For this purpose, PERank disassembles a whole graph into unit graphs that are created with classes or predicates to allow the user to obtain a customized ranking based on the preference. PERank exploits abundant links, especially in biomedical data, to retrieve related entities that cannot be obtained with the traditional text-based advanced entity search. To the best knowledge of the authors, PERank is the first method that improves text-based entity search by supporting the inter-topic search.

3. Inter-topic search based on PERank

Given a collection of linked datasets, an entity search returns the entities related to a keyword query and a search preference. The linked data consists of a set of entities E = {e1, e2, …, em} and a set of links L = {l1, l2, …, ln} that connect the entities. It should be noted that since we are searching entities, only the instances in the Assertion Box (A-Box), containing extensional knowledge and assertions about individuals (instances) [32], are considered as entities in this paper. For example, with the A-Box of the DBpedia Ontology, it can be learned that “DBpedia:Upper_trunk” is an instance of “DBpediaOntology:Nerve”, and thus “DBpedia:Upper_trunk” is considered as an entity. The entities belong to a set of classes C = {c1, c2, …, cd} and the links belong to a set of predicates (properties) P = {p1, p2, …, pf}. For example, Fig. 2(a) has six entities {e1, e2, …, e6} connected by ten links {l1, l2, …, l10}. The entities belong to four classes in Fig. 2(b), such as the entity e1 “Ethinamate” belonging to the class c1 “Drugs”. The links belong to the ten predicates used to connect the classes in Fig. 2(b), such as the link l2 belonging to the predicate p1 “possibleDiseaseTarget”.

Fig. 2. An example of an RDF graph in the biomedical domain.

Text-based entity search indexes all the triples about an entity with different fields and returns the entity if the fields contain the query keywords [20]. For example, besides the triples for the entity “renal tubular acidosis” indexed in the main search field, a field named “class” with the value “disease” is also indexed for this entity. These fields are used as filters to restrict the search space related to the preference. In this same example, a search can be restricted to “disease” by searching the field “class” with the preference “disease”. Instead of using these preferences as filters, PERank uses the links between the different entities that fit search preferences to tackle the inter-topic search scenarios, in which the topic related to the searched entities differs from the one related to the preference. For example, searching with the keywords “renal tubular acidosis” and the preference “drugs” to find the drugs to treat the disease “renal tubular acidosis” is considered as an inter-topic search.

3.1 Framework

Given a keyword query, PERank uses a set of entities retrieved from a keyword-based entity search engine to search the data with the help of the preference. Fig. 3 shows the framework of PERank, which consists of three parts: (1) query formulation, (2) search score formulation, and (3) Individual Personalized PageRank Vectors (IPPVs) computation and indexing, where the first two parts are on-line computations and the last part is an off-line computation. During the off-line computation, a set of entity scores, called an IPPV, is generated for each entity by the classes and predicates. These scores are then combined on-line in part 2 (i.e., search score formulation) with the query vector created in part 1 (i.e., query formulation) to form a ranking score.

Fig. 3. The framework of PERank.

3.2 Query formulation

In this phase, the query vector used to feed PERank is created. It consists of a pair of vectors: a preference vector and a query entity vector.

A search preference is a set of terms T = {t1, t2, …, tk} that represents the search intention of a user. T is converted into a preference vector p⃗ in PERank. In practice, a preference can affect the authorities to flow to the entities in an RDF graph by two factors: class and predicate. Therefore, p⃗ is organized based on a class vector cp or a predicate vector pp, where each entry is computed with the following equations if cpi or ppi contains tj:

$$cp_i = \frac{1}{|T|} \sum_{t_j \in T} \frac{1}{|CP|} \tag{1}$$
$$pp_i = \frac{1}{|T|} \sum_{t_j \in T} \frac{1}{|PP|} \tag{2}$$

, where CP or PP is a set of classes or predicates containing tj. For example in Fig. 2, given a preference T = {t1: “drug”}, one class “Drugs” and two predicates “possibleDrug” and “drugReference” contain the preference. Therefore, we can obtain cp=[1,0,0,0] and pp=[0,0.5,0,0,0,0,0,0.5,0,0].

A query entity vector is an entity vector qe = [qe_1, qe_2, …, qe_m], where qe_i = 1/|QE| if e_i ∈ QE and qe_i = 0 otherwise. QE is the set of entities returned from an entity search with the keywords. For example, when searching the entities related to the keyword “acidosis” in the data graph (shown in Fig. 2(a)), QE is returned as the set of entities {e2, e3}, and we obtain the query entity vector qe = [0, 0.5, 0.5, 0, 0, 0].
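The query formulation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "contains" is taken to be a case-insensitive substring match, the function names are hypothetical, and every class or predicate label other than “Drugs”, “possibleDiseaseTarget”, “possibleDrug”, and “drugReference” is a placeholder, because the labels of Fig. 2(b) are not listed in the text.

```python
from typing import List, Sequence, Set


def preference_vector(terms: Sequence[str], labels: Sequence[str]) -> List[float]:
    """Eqs. (1)/(2): each term spreads 1/|T| uniformly over the labels that contain it."""
    vec = [0.0] * len(labels)
    for term in terms:
        # Assumption: "contains" means a case-insensitive substring match.
        hits = [i for i, label in enumerate(labels) if term.lower() in label.lower()]
        for i in hits:
            vec[i] += 1.0 / (len(terms) * len(hits))
    return vec


def query_entity_vector(returned: Set[int], m: int) -> List[float]:
    """qe_i = 1/|QE| if entity i was returned by the keyword search, else 0."""
    return [1.0 / len(returned) if i in returned else 0.0 for i in range(m)]


# Placeholder labels: only "Drugs", "possibleDiseaseTarget", "possibleDrug"
# and "drugReference" come from the text; the rest are hypothetical.
classes = ["Drugs", "Diseases", "Genes", "Symptoms"]
predicates = ["possibleDiseaseTarget", "possibleDrug", "p3", "p4", "p5",
              "p6", "p7", "drugReference", "p9", "p10"]

cp = preference_vector(["drug"], classes)
pp = preference_vector(["drug"], predicates)
qe = query_entity_vector({1, 2}, 6)  # e2 and e3, using zero-based positions
```

With the preference “drug” and the keyword result {e2, e3}, the three vectors match the worked values given in the text for cp, pp, and qe.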

3.3 Search score formulation

During the on-line computation, the query entity vector and preference vector are used as a pair of input to search IPPV scores and form a ranking score for all the entities in Algorithm 1. The query entities are obtained from the search engine using the keyword query (line 1) and then converted into the query entity vector (line 2). The preference is converted into a class vector cp or a predicate vector pp (lines 3–4). Then, PERank uses cp or pp to get ranking scores from IPPVs (lines 5–13), and returns the sorted entities with the scores (lines 14–15). The IPPVs are computed offline and will be introduced in the next section. Please note that, in contrast to the top-to-bottom execution order of Algorithm 1, the next section proceeds bottom-up: it first describes how the IPPVs are computed and then how they are combined into the PERank score.

Algorithm 1.

PERank query execution algorithm.

Input: Preference T, Keywords Q
Output: Sorted Entities
1. Query entities QE ← entity Search (Q)
2. Query entity vector qe ← queryEntityVectorConvertor(QE)
3. Preference vector cp ← preferenceVectorConvertor(T)
4. Preference vector pp ← preferenceVectorConvertor(T)
5. Initial: Entity with ranking scores PERank (CP, Q) ← 0, PERank (PP, Q) ← 0
6. For each entity e ∈ QE do
7.    Initial: PPV(cp, e) ← 0, PPV(pp, e) ← 0
8.    For each class c_i ∈ CP do
9.       PPV(cp, e) += cp_i × IPPV(c_i, e)
10.   For each predicate p_i ∈ PP do
11.      PPV(pp, e) += pp_i × IPPV(p_i, e)
12.   PERank(CP, Q) += PPV(cp, e)
13.   PERank(PP, Q) += PPV(pp, e)
14. Sort PERank (CP, Q) and PERank (PP, Q)
15. Return PERank (CP, Q) and PERank (PP, Q)
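The on-line phase of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' Java implementation: the index layout (a dictionary keyed by preference position and query entity position) and the function name are assumptions, only one preference type is handled per call, and each query entity is weighted by its entry in qe, following Eq. (7) of the next section.

```python
from typing import Dict, List, Tuple

# Hypothetical index layout: (preference index, query entity index) -> IPPV
# score vector over all m entities, assumed to be precomputed off-line.
IPPVIndex = Dict[Tuple[int, int], List[float]]


def perank(pref: List[float], qe: List[float], ippv: IPPVIndex, m: int) -> List[Tuple[int, float]]:
    """Combine precomputed IPPVs into one ranking for a single preference type."""
    scores = [0.0] * m
    for e, w_e in enumerate(qe):
        if w_e == 0.0:
            continue  # entity was not returned by the keyword search
        for j, w_j in enumerate(pref):
            if w_j == 0.0:
                continue  # class or predicate not matched by the preference
            for v, s in enumerate(ippv[(j, e)]):
                scores[v] += w_e * w_j * s
    # Highest score first, as in lines 14-15 of Algorithm 1.
    return sorted(enumerate(scores), key=lambda kv: kv[1], reverse=True)


# Toy index with made-up scores for three entities and one matched class.
toy_index: IPPVIndex = {(0, 0): [0.2, 0.6, 0.2], (0, 1): [0.1, 0.1, 0.8]}
ranked = perank([1.0, 0.0], [0.5, 0.5, 0.0], toy_index, 3)
```

Because only the non-zero entries of the two vectors are visited, the on-line cost depends on the number of returned entities and matched preferences rather than on the size of the whole graph.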

4. Computing PERank ranking scores based on IPPVs

In order to compute a ranking score with a pair of input, preference vector p⃗ and query entity vector qe, PERank adapts PPR, which populates a transition matrix with the preference vector and populates a distribution vector with the query entity vector. In order to handle the complexity of RDF graphs, a PERank score is assembled with Individual Personalized PageRank Vector scores (IPPVs) online. Fig. 4 shows an example of the computation of IPPVs arranged as a matrix, where each row corresponds to a unit transition matrix populated by tracing a class in Fig. 2(b), and each column corresponds to an entity in Fig. 2(a). Therefore, for each pair of class and entity, an IPPV is computed and indexed. When a search job is initiated, the IPPVs in the gray boxes will be used to form a PERank ranking score for a query vector, which is generated indirectly from the keyword “acidosis” and directly from the preferences “drug” and “diseases”.

Fig. 4. The computation of PERank based on IPPVs with tracing classes of Fig. 2.

An IPPV score is a PPR score calculated with an m-dimensional unit query vector qe_i, in which only the ith element is one, and a unit preference vector p⃗_j, in which only the jth element is one.

Definition 1. Individual Personalized PageRank Vector (IPPV)

For each pair of unit query vector qe_i and unit preference vector p⃗_j, an IPPV(qe_i, p⃗_j) is computed as:

$$IPPV(qe_i, \vec{p}_j)^{k+1} = d \times M_{\vec{p}_j} \times IPPV(qe_i, \vec{p}_j)^{k} + (1 - d) \times qe_i \tag{3}$$

, where d is the damping factor, IPPV(., .)^{k+1} is the vector of scores at the (k + 1)th iteration, and M_{p⃗_j} is the m × m dimensional individual transition matrix populated with p⃗_j.

If none of u’s outbound nodes belong to c_j, or none of u’s outbound links belong to p_j, an entry m_{p⃗_j}(u, v) of M_{p⃗_j} is computed as:

$$m_{\vec{p}_j}(u, v) = \frac{1}{\#outLink(u)} \tag{4}$$

, otherwise:

$$m_{\vec{p}_j}(u, v) = \begin{cases} \dfrac{1}{\#outLink(u, j)} & v \text{ belongs to } c_j \\ \dfrac{1}{\#outLink(u, j)} & l_{u,v} \text{ belongs to } p_j \\ 0 & \text{else} \end{cases} \tag{5}$$
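Definition 1 can be illustrated with a small pure-Python sketch for a class preference. It rests on several assumptions that the text does not fix: the damping factor d = 0.85 and the step count are illustrative defaults, the toy graph and its class labels are made up, #outLink(u, j) is read as the number of u's out-links that lead into class c_j, and the function names are hypothetical. The predicate case would differ only in testing the link's predicate instead of the target's class.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (source entity, target entity)


def class_transition(m: int, edges: List[Edge], entity_class: List[str], c_j: str) -> Dict[Edge, float]:
    """Eqs. (4)-(5): follow only out-links into class c_j whenever any exist."""
    out: Dict[int, List[int]] = {u: [] for u in range(m)}
    for u, v in edges:
        out[u].append(v)
    probs: Dict[Edge, float] = {}
    for u, targets in out.items():
        preferred = [v for v in targets if entity_class[v] == c_j]
        chosen = preferred if preferred else targets  # Eq. (5) if possible, else Eq. (4)
        for v in chosen:
            probs[(u, v)] = probs.get((u, v), 0.0) + 1.0 / len(chosen)
    return probs


def ippv(m: int, probs: Dict[Edge, float], seed: int, d: float = 0.85, steps: int = 100) -> List[float]:
    """Iterate Eq. (3) from the unit query vector whose only non-zero entry is `seed`."""
    qe = [1.0 if i == seed else 0.0 for i in range(m)]
    score = qe[:]
    for _ in range(steps):
        nxt = [(1.0 - d) * q for q in qe]      # (1 - d) x qe_i
        for (u, v), p in probs.items():
            nxt[v] += d * p * score[u]         # d x M x IPPV^k
        score = nxt
    return score


# Toy graph (hypothetical): 0 = disease, 1 = drug, 2 = gene, 3 = drug.
toy_classes = ["disease", "drug", "gene", "drug"]
toy_edges = [(0, 1), (0, 2), (2, 3), (1, 0), (3, 0)]
drug_walk = class_transition(4, toy_edges, toy_classes, "drug")
scores = ippv(4, drug_walk, seed=0)
```

In this toy graph the walk that starts at the disease node is steered to the drug it links to directly and never visits the gene node, which is the behaviour the preference-specific transition matrix is meant to produce.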

With a set of IPPVs, a Personalized PageRank Vector (PPV) can be formed.

Definition 2. Personalized PageRank Vector (PPV)

For a unit query vector qe_i and a preference vector p⃗, where p⃗ = w_1 p⃗_1 + … + w_j p⃗_j and w_j is the weight of a unit preference vector p⃗_j, a PPV(qe_i, p⃗) is computed as:

$$PPV(qe_i, \vec{p}) = \sum_{j} w_j \times IPPV(qe_i, \vec{p}_j) \tag{6}$$

Based on the Linear Combination Theorem [33], a PERank score can be formed with PPVs:

$$PERank(qe, \vec{p}) = \sum_{i} w_i \times PPV(qe_i, \vec{p}) \tag{7}$$

, where qe = w_1 qe_1 + … + w_i qe_i, and w_i is the weight of a unit query entity vector qe_i. The weights w_i and w_j are the corresponding entries of the query entity vector qe and the preference vector p⃗, respectively.
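The linearity that Eq. (7) relies on can be checked numerically on a toy example. The sketch below is an illustration only: the three-node transition table, the damping factor d = 0.85, and the function name are assumptions. It compares a PPR vector computed directly from a mixed personalization vector with the same mixture of two unit-vector PPR results.

```python
from typing import Dict, List, Tuple


def ppr(trans: Dict[Tuple[int, int], float], pers: List[float],
        d: float = 0.85, steps: int = 200) -> List[float]:
    """Power iteration of score = d * M * score + (1 - d) * pers."""
    score = pers[:]
    for _ in range(steps):
        nxt = [(1.0 - d) * p for p in pers]
        for (u, v), w in trans.items():
            nxt[v] += d * w * score[u]
        score = nxt
    return score


# Hypothetical 3-node graph: trans[(u, v)] is the probability of moving u -> v.
trans = {(0, 1): 0.5, (0, 2): 0.5, (1, 2): 1.0, (2, 0): 1.0}

unit0 = ppr(trans, [1.0, 0.0, 0.0])
unit1 = ppr(trans, [0.0, 1.0, 0.0])
combined = ppr(trans, [0.5, 0.5, 0.0])            # computed directly
mixed = [0.5 * a + 0.5 * b for a, b in zip(unit0, unit1)]  # assembled from unit results
max_gap = max(abs(x - y) for x, y in zip(combined, mixed))
```

The two vectors agree up to floating-point rounding, which is why unit-vector results can be indexed once and reused for any query entity vector.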

Given a linked dataset with m entities, n links, d classes and f predicates, full PPR computes every possible combination of personalization vectors and needs 2^m times the cost of a standard PPV computation. With the Linear Combination Theorem, the cost of full PPR decreases to m times the cost of a standard PPV computation. PERank calculates a score for a query vector and a preference vector simultaneously, based on a linear combination of PPVs, each of which is itself a linear combination of IPPVs. An IPPV is a score vector for a unit query vector and a unit preference vector, and its cost equals that of a standard PPV computation. An individual PPV in PERank covers all the unit preference vectors (unit class preference vectors and unit predicate preference vectors) and costs d + f times a standard PPV computation. Therefore, covering all the combinations of the preference vectors and the query vectors costs m(d + f) times a standard PPV computation (or (d + f) times a standard PPR computation) in PERank.

5. Experiment

To evaluate PERank, we implemented it in Java 1.8, as shown in Fig. 5, on top of a representative keyword-based entity search engine, Siren; we consider Siren as the baseline method. In the implemented system, keywords are entered in the search box. The topic and link boxes are used to capture the preference on classes or predicates. The results are displayed both textually and graphically. As the graphical representation for a search in Fig. 5 shows, searching the keyword “DB01259” (center node) obtains five entities (on the second circle) with the preference “drug” (on the first circle).

Fig. 5. Screenshot of PERank.

Our experiment is divided into three parts: (1) evaluation of the performance of IPPV computation, (2) comparison of the search results (effectiveness) of PERank with the different search strategies of Siren, (3) comparison of the effectiveness of PERank with the classic PPR algorithms (TSPR [27] and ObjectRank [8]). All results can be found at (https://github.com/zongnansu1982/PERankComputation/tree/master/results).

5.1 Data and query sets

Ten datasets in Linked Life Data (http://linkedlifedata.com/) were used to form a biomedical graph, which included DailyMed (http://dailymed.nlm.nih.gov/dailymed/), Diseasome (http://diseasome.eu/), Disease Ontology (http://disease-ontology.org/), DrugBank (http://www.drugbank.ca/), Gene Ontology (http://www.geneontology.org/), HumanCyc (http://humancyc.org/), Gene-Disease Network (LHGDN) (http://www.dbs.ifi.lmu.de/~bundschu/LHGDN.html), REACTOME (http://www.reactome.org/ReactomeGWT/entrypoint.html), SIDER (http://sideeffects.embl.de/), and Symptom (http://symptomontologywiki.igs.umaryland.edu/wiki/index.php/Main_Page). The datasets have approximately 60 classes and 110 predicates, comprising a total of 400,000 entities and 1,357,000 links.

A pair of keyword query and preference was utilized as an input query in each search task. We randomly selected 20 names of diseases and drugs, and considered them as the keyword queries. The two topics, diseases and drugs, were used as the preference of search. All the combinations of the keyword queries and preferences pairs were generated for the intra- and inter-topic search tasks (shown in Appendix A).

5.2 Criteria for evaluation

To evaluate the quality of results from the keyword-based entity search, we used precision, which counted the number of returned entities related to both the keyword and the preference [34]. For returned entities that were related to the keyword but not to the preference, we used validness as the measure.

Given keywords q and a preference p, the precision of the top k returns (P@k) was defined as:

$$P(q, p)@k = \frac{\#\,\text{returned entities related to both } q \text{ and } p}{k} \tag{8}$$

Given keywords q and a preference p, the validness of the top k returns (V@k) was defined as:

$$V(q, p)@k = \frac{\#\,\text{returned entities related to } q \text{ but not to } p}{k} \tag{9}$$
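Both measures are simple counts over the annotated top k returns. A minimal Python sketch, assuming each returned entity carries two manual labels (related to the keyword, related to the preference) and using a hypothetical function name and made-up labels:

```python
from typing import List, Tuple


def precision_validness_at_k(labels: List[Tuple[bool, bool]], k: int) -> Tuple[float, float]:
    """labels[i] = (related_to_keyword, related_to_preference) for the i-th returned entity."""
    top = labels[:k]
    both = sum(1 for q, p in top if q and p)                # numerator of Eq. (8)
    keyword_only = sum(1 for q, p in top if q and not p)    # numerator of Eq. (9)
    return both / k, keyword_only / k


# Made-up annotations for five returned entities.
sample = [(True, True), (True, False), (False, False), (True, True), (True, False)]
p_at_5, v_at_5 = precision_validness_at_k(sample, 5)
```

Because the denominator is the fixed k rather than the number of entities actually returned, a search that returns fewer than k entities is penalized in both measures.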

It should be noted that k was set to a fixed number in P@k and V@k. A search result with high precision (indicating a high percentage of returned entities being related to both the keyword query and the preference) was considered a good result. Compared to precision, high validness (indicating a high percentage of entities relating only to the keyword) was still considered an acceptable result, since, if the entities were keyword related, the truly related results could be found by tracing the links when the user browses the information of the entity. We invited two medical Ph.D. students to evaluate the search results. They made annotations with 100% confidence, relying on their own knowledge or on third-party resources such as PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) and UpToDate (https://www.uptodate.com/login). Any uncertainty while annotating required the annotators to engage in discussion to reach consensus. The top 20 returned entities were manually checked for 40 searches of each task in the experiment.

5.3 Performance of IPPV computation

We computed the IPPVs for the 25,154 entities returned from Siren with the 20 keyword queries. The computation was conducted on two Intel Xeon E5-2630 v4 processors (20 MB cache, 2.2 GHz) with 128 GB of memory, running the Ubuntu 16.04 64-bit operating system. The IPPVs were computed separately for the 50 classes and 94 predicates that could be used as preferences in the schema, using Finger Prints [33]. The number of random-walk steps was set to 100, 200, 400, 500, 600, 800 and 1000, and the performance is shown in Fig. 6. The time of the IPPV computations for the returned entities fluctuated only slightly across classes and predicates, which indicates that the whole computation time increases linearly with the number of predicates and classes. The time for each IPPV was also related to the number of steps of the random walk performed, and increased linearly with the number of steps.
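As a rough worked count, assuming one IPPV is computed per (entity, preference) pair as in the cost analysis of Section 4, the index built in this experiment covers:

$$25{,}154 \times (50 + 94) = 25{,}154 \times 144 = 3{,}622{,}176 \ \text{IPPVs}$$

This figure is only an estimate derived from the numbers above; the text does not report the index size directly.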

Fig. 6. Performance of IPPV computation for classes and properties.

5.4 Comparing PERank with Siren entity search strategies

In order to evaluate the search results of PERank, we designed three search strategies of Siren for the comparison: (1) query expansion that combined keywords with the preference [35], (2) preference used as a class filter, and (3) preference used as a predicate filter. Corresponding to (2) and (3), we used two kinds of PERank: (1) PERank Class and (2) PERank Predicate.

P@k and V@k were measured for k = 5, 10, and 20. Appendix B shows a sample of the top 5 returns of the intra- and inter-topic searches for the keyword “ethinamate” with the preferences “drug” and “disease”, and the evaluation results are presented in Fig. 7. Please note that the value “NULL” in the appendix means that no entity was returned. As the figure shows, searching with the preferences in general improves the inter-topic search, and only the query expansion and PERank improve the intra-topic search. Under the same circumstances, the class and predicate filters cannot improve the intra-topic search, but rather hinder the inter-topic search, since they search the keyword-related entities specific to those classes or predicates related to the preference (see the NULL returns in Appendix B). This indicates that the preferences should perhaps only be used as filters for intra-topic search [36]. PERank performs better than the other preference processing strategies in both kinds of search. In the inter-topic search, PERank substantially increases the precision from 3.5% to 17.75% for PERank Class and 17.5% for PERank Predicate, and the validness from 4% to 28.85% for PERank Class and 25.51% for PERank Predicate. In the intra-topic search, PERank improves the precision of the baseline from 41.5% to 50.5% for PERank Class and 65.25% for PERank Predicate, which is comparable to the query expansion method that improves the precision of the baseline from 41.5% to 54.75%.

Fig. 7. Effectiveness of PERank compared with other preference processing strategies.

5.5 Comparing PERank with PPR algorithm

In this task, PERank Class and PERank Predicate were compared with two PPR variations, TSPR and ObjectRank. In the TSPR computation, an m-dimensional PPR vector p⃗_i for each class c_i was used to bias TSPR scores towards the entities belonging to the class. In p⃗_i, each entry p_j = 1/|c_i| if the jth entity was an instance of c_i, otherwise p_j = 0, where |c_i| was the cardinality of the instance set of c_i. Given a pair of input, keyword query and preference, the entities that were retrieved by Siren were sorted by the TSPR score based on the preference. In the ObjectRank computation, a unit score vector was computed with an m-dimensional unit PPR vector qe_i and the probability transition matrix, where 80% of the weight was assigned to the predicates that contained the preferences and the remaining 20% was equally distributed to the other predicates. Given the input query, an ObjectRank score vector was dynamically generated by the linear combination of the unit score vectors for the entities retrieved by Siren.
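The two baseline configurations can be sketched in Python as follows. This is a hedged illustration rather than the evaluated implementation: the function names and toy labels are hypothetical, the 80% share is assumed to be split equally among the matching predicates, and a uniform fallback is assumed when either group of predicates is empty, since the text does not specify these cases.

```python
from typing import Dict, List, Sequence


def tspr_vector(entity_class: Sequence[str], c_i: str) -> List[float]:
    """TSPR personalization: 1/|c_i| for each instance of class c_i, else 0."""
    size = sum(1 for c in entity_class if c == c_i)
    return [1.0 / size if c == c_i else 0.0 for c in entity_class]


def objectrank_predicate_weights(predicates: Sequence[str], terms: Sequence[str],
                                 share: float = 0.8) -> Dict[str, float]:
    """Give `share` of the weight to predicates containing a preference term."""
    matched = [p for p in predicates if any(t.lower() in p.lower() for t in terms)]
    others = [p for p in predicates if p not in matched]
    if not matched or not others:
        # Assumption: fall back to uniform weights when one group is empty.
        return {p: 1.0 / len(predicates) for p in predicates}
    weights = {p: share / len(matched) for p in matched}       # assumed equal split
    weights.update({p: (1.0 - share) / len(others) for p in others})
    return weights


# Toy inputs; "p4" is a placeholder predicate name.
tspr = tspr_vector(["drug", "disease", "drug", "gene"], "drug")
weights = objectrank_predicate_weights(
    ["possibleDrug", "drugReference", "possibleDiseaseTarget", "p4"], ["drug"])
```

The contrast with PERank is visible in the second function: the weights are fixed per preference before the walk starts, so some authority always flows along predicates unrelated to the preference.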

In Appendices C and D, we show two samples of the search results in this task. In general, PERank performs better than both TSPR and ObjectRank in the intra and inter-topic searches.

As Fig. 8 shows, compared with TSPR, PERank Class improves precision and validness dramatically: from 14.75% to 50.5% on precision and 55.25% to 76.22% on validness in the intra-topic search; from 7.75% to 17.75% on precision and 14.5% to 28.85% on validness in the inter-topic search. The ranking strategy used in TSPR is less effective than PERank, because re-ranking cannot improve the results if all the entities returned from Siren are irrelevant. Unlike TSPR, PERank uses the returned entities to further search for related entities through links, which reduces the number of erroneous entities carried over from Siren.

Fig. 8. Effectiveness of PERank compared with TSPR and ObjectRank.

Fig. 8 also shows that PERank Predicate outperforms ObjectRank: from 53.25% to 65.25% on precision and 56.46% to 78.10% on validness in the intra-topic search; from 2.25% to 17.5% on precision and 4.88% to 25.51% on validness in the inter-topic search. PERank customizes the weights of links according to the preference, which flexibly teleports authorities toward the entities related to the preference. ObjectRank predefines the weights in the schema graph, and therefore has to allocate a small portion of the authorities to preference-irrelevant entities.

6. Discussion

This study proposed a method to target inter-topic search for keyword-based entity search based on heterogeneous relationships in Linked Data. Even though there exist benchmarks and gold standards for testing entity search, such as the Billion Triple Challenge (BTC) (http://challenge.semanticweb.org/), these benchmarks cannot be used in our evaluation since they only consider the relevancy between the query and entities. The preference, i.e., the user's search intention, is not considered in these entity search challenges. Therefore, we conducted our experiments on ten biomedical linked datasets and evaluated intra- and inter-topic search with two criteria, precision and validness at top k, based on the manually annotated entities. Since the data space in the experiment is limited compared to the current size of biomedical Linked Data, the evaluation is not comprehensive in two respects: (1) queries are randomly generated instead of being taken from practical search cases, and (2) criteria such as Recall and F-measure cannot be applied to provide tangible results. Therefore, building a new benchmark and gold standard for advanced biomedical entity search, where a keyword query and a preference are used as a pair of input, is considered as our future work.

PERank costs m(d + f) times of a standard PPV computation to cover all the combinations of the preference vectors and the query vectors. There are scaling PPV computation methods, such as DrunkardMob [37], that can dramatically decrease the computation time. These methods can be applied for PERank to improve the performance. Another direction of our future work is to conduct PERank computations on-the-fly. With the entities retrieved from an entity search platform, a local network can be constructed based on these entities to train PERank. Therefore, without relying on the pre-indexed datasets, this variant can be used as a plugin for entity search in practice.

7. Conclusions

In advanced keyword-based entity search, search preferences are used as filters and the relations in Linked Data are ignored. Because the preferences only restrict the search space, the results for inter-topic search are poor. In this paper, we have introduced PERank, which enhances entity search by utilizing the relations present in Linked Data. PERank draws upon an existing keyword-based entity search engine and uses the returned entities and the preferences to improve the search. We tested PERank on a set of linked life data and compared its effectiveness with the Siren-based search strategies and the existing PPR methods. The experiments demonstrated that PERank returns better-ranked entities in advanced search.

  • We developed a keyword-based entity search algorithm, PERank, in tandem with existing entity search methods for inter-topic biomedical entity search.

  • PERank adapts Personalized PageRank (PPR) to use heterogeneous relations in linked data to search entities across topics.

  • With ten Linked Life Data sets and two query sets (intra-topic and inter-topic search), PERank outperformed the baseline search and existing PPR methods.

Acknowledgments

This work was funded in part by the Industrial Strategic Technology Development Program (10044494, WiseKB: Big data based self-evolving knowledge base and reasoning platform) supported by the Ministry of Science, ICT & Future Planning (MSIP, Korea). It was also supported in part by Grant U24AI117966 to the University of California San Diego from the National Institutes of Health. We thank Dr. Hyeoneui Kim and Victoria Ngo for their suggestions on improving the manuscript.

Appendix

A. Queries

Table A.1.

Keywords and preferences pairs.

Intra-topic search Inter-topic search

Keywords Preference Keywords Preference
Ethinamate Drug Ethinamate Disease
Phenylpropanolamine Drug Phenylpropanolamine Disease
Memantine Drug Memantine Disease
Aliskiren Drug Aliskiren Disease
Methdilazine Drug Methdilazine Disease
Trovafloxacin Drug Trovafloxacin Disease
Altretamine Drug Altretamine Disease
Natamycin Drug Natamycin Disease
Dexbrompheniramine Drug Dexbrompheniramine Disease
Diatrizoate Drug Diatrizoate Disease
Coats disease Disease Coats disease Drug
Giant axonal neuropathy Disease Giant axonal neuropathy Drug
Cerebrovascular disease Disease Cerebrovascular disease Drug
Second degree AV block Disease Second degree AV block Drug
Polydactyly Disease Polydactyly Drug
Hay-Wells syndrome Disease Hay-Wells syndrome Drug
Paroxysmal nocturnal hemoglobinuria Disease Paroxysmal nocturnal hemoglobinuria Drug
Xeroderma pigmentosum Disease Xeroderma pigmentosum Drug
Dermatofibrosarcoma protuberans Disease Dermatofibrosarcoma protuberans Drug
Non-Hodgkin lymphoma Disease Non-Hodgkin lymphoma Drug

B. A sample of search results using the Siren baseline, Siren Query Expansion, Siren Class and Predicate Filters, and PERank

Table B.1.

The top 5 returns of intra-topic search with the keyword “ethinamate” and the preference “drug”.

Baseline Siren Query Expansion

Entity Name Relevant to keywords Fit Preference Entity Name Relevant to keywords Fit Preference
1 Drugbank:drugs/DB01031 Ethinamate TRUE TRUE 1 Drugbank:drugs/DB01031 Ethinamate TRUE TRUE
2 Drugbank:drugs/DB00609 Ethionamide TRUE TRUE 2 Drugbank:drugs/DB00609 Ethionamide TRUE TRUE
3 NULL NULL NULL NULL 3 NULL NULL NULL NULL
4 NULL NULL NULL NULL 4 NULL NULL NULL NULL
5 NULL NULL NULL NULL 5 NULL NULL NULL NULL

Siren Class Filter Siren Predicate Filter

1 Drugbank:drugs/DB01031 Ethinamate TRUE TRUE 1 Drugbank:drugs/DB01031 Ethinamate TRUE TRUE
2 Drugbank:drugs/DB00609 Ethionamide TRUE TRUE 2 Drugbank:drugs/DB00609 Ethionamide TRUE TRUE
3 NULL NULL NULL NULL 3 NULL NULL NULL NULL
4 NULL NULL NULL NULL 4 NULL NULL NULL NULL
5 NULL NULL NULL NULL 5 NULL NULL NULL NULL

PERank Class PERank Predicate

1 Dailymed:drugs/75 Trecator (Tablet, Film Coated) TRUE TRUE 1 Bio2rdf:kegg/D00703 Ethinamate TRUE TRUE
2 Drugbank:drugs/DB00609 Ethionamide TRUE TRUE 2 Bio2rdf:kegg/D00591 Ethionamide TRUE TRUE
3 Drugbank:drugs/DB01031 Ethinamate TRUE TRUE 3 Drugbank: drugcategory/hypnoticsAndSedatives Hypnotics and Sedatives TRUE TRUE
4 Sider:drugs: 2761171 Ethionamide TRECATOR TRUE TRUE 4 Dailymed:drugs/75 Trecator (Tablet, Film Coated) TRUE TRUE
5 Dailymed:drugs:1810 Diamox Sequels TRUE TRUE 5 Drugbank:drugs/DB01031 Ethinamate TRUE TRUE

Table B.2.

The top 5 returns of inter-topic search with the keyword “ethinamate” and the preference “disease”.

Baseline Siren Query Expansion

Entity Name Relevant to keywords Fit Preference Entity Name Relevant to keywords Fit Preference
1 Drugbank:drugs/DB01031 Ethinamate TRUE FALSE 1 Drugbank:drugs/DB01031 Ethinamate TRUE FALSE
2 Drugbank:drugs/DB00609 Ethionamide TRUE FALSE 2 Diseasome:diseases diseases TRUE TRUE
3 NULL NULL NULL NULL 3 Lhgdn:association/55653 Liver diseases FALSE TRUE
4 NULL NULL NULL NULL 4 Lhgdn:association/5649 Autoimmune Diseases FALSE TRUE
5 NULL NULL NULL NULL 5 Lhgdn:association/5650 Calcinosis FALSE TRUE

Siren Class Filter Siren Predicate Filter

1 NULL NULL NULL NULL 1 NULL NULL NULL NULL
2 NULL NULL NULL NULL 2 NULL NULL NULL NULL
3 NULL NULL NULL NULL 3 NULL NULL NULL NULL
4 NULL NULL NULL NULL 4 NULL NULL NULL NULL
5 NULL NULL NULL NULL 5 NULL NULL NULL NULL

PERank Class PERank Predicate

1 Drugbank:drugs/DB00609 Ethionamide TRUE FALSE 1 Drugbank:drugs/DB00609 Ethionamide TRUE FALSE
2 Diseasome: diseases/1278 Renal tubular acidosis TRUE TRUE 2 Drugbank:drugs/DB01031 Ethinamate TRUE FALSE
3 Diseasome: diseases/3725 Renal tubular acidosis-osteopetrosis syndrome TRUE TRUE 3 Diseasome: diseases/3725 Renal tubular acidosis-osteopetrosis syndrome TRUE TRUE
4 Diseasome: diseases/992 Retinitis pigmentosa FALSE TRUE 4 Diseasome: diseases/1278 Renal tubular acidosis TRUE TRUE
5 Diseasome: diseases/440 Generalized epilepsy FALSE TRUE 5 Diseasome: diseases/3566 Persistent hyperinsulinemic hypoglycemia of infancy FALSE TRUE

C. A sample of search results of PERank Class and the TSPR

Table C.1.

The top 5 returns of intra-topic search with the keywords “cerebrovascular disease” and the preference “disease”.

PERank Class TSPR

Entity Name Relevant to keywords Fit Preference Entity Name Relevant to keywords Fit Preference
1 Diseasome: diseases/1206 Xeroderma pigmentosum TRUE TRUE 1 drugbank:drugs/DB00157 NADH TRUE FALSE
2 Diseasome: diseases/347 Dystonia TRUE TRUE 2 drugbank:drugs/DB00039 Palifermin TRUE FALSE
3 Lhgdn:association/2917 Creutzfeldt-Jakob Disease, Familial FALSE TRUE 3 dailymed:drugs/4 Kepivance (Injection) TRUE FALSE
4 Diseaseontology: DOID/0050254 acanthocephaliasis TRUE TRUE 4 drugbank:drugs/DB00048 Collagenase TRUE FALSE
5 Drugbank:possibleDiseaseTarget Possible Disease Target TRUE TRUE 5 drugbank:drugs/DB00619 Imatinib TRUE FALSE

Table C.2.

The top 5 returns of inter-topic search with the keywords “cerebrovascular disease” and the preference “drug”.

PERank Class TSPR

Entity Name Relevant to keywords Fit Preference Entity Name Relevant to keywords Fit Preference
1 Drugbank:drugs/DB01233 Metoclopramide TRUE TRUE 1 diseasome: diseases/3725 Renal tubular acidosis-osteopetrosis syndrome TRUE FALSE
2 Drugbank:drugs/DB00742 Mannitol TRUE TRUE 2 diseasome: diseases/74 Alzheimer disease TRUE FALSE
3 Sider:drugbank/2520 verapamil hydrochloridel FALSE TRUE 3 diseasome: diseases/1436 Analbuminemia TRUE FALSE
4 Drugbank:drugs/DB00850 Perphenazine TRUE TRUE 4 diseasome: diseases/2198 Dysalbuminemic hyperthyroxinemia TRUE FALSE
5 Sider:drugs/4168 metoclopramide FALSE TRUE 5 diseasome: diseases/2811 Insulin resistance, susceptibility to TRUE FALSE

D. A sample of search results of PERank Predicate and the ObjectRank

Table D.1.

The top 5 returns of intra-topic search with the keywords “second degree AV block” and the preference “disease”.

PERank Predicate ObjectRank

Entity Name Relevant to keywords Fit Preference Entity Name Relevant to keywords Fit Preference
1 Dailymed:drugs/1061 CARDIZEM (Tablet, Coated) TRUE FALSE 1 Sider:Sideeffects/C0085614 AV block first degree FALSE TRUE
2 Dailymed:drugs/537 Cardizem LA (Tablet) TRUE FALSE 2 Diseaseontology Disease FALSE TRUE
3 Dailymed:drugs/2385 Verapamil HCl (Injection) TRUE FALSE 3 Sider:Sideeffects/C0004245 Atrioventricular block FALSE TRUE
4 Sider:side_effects: C0085614 AV block first degree TRUE TRUE 4 Sider:Sideeffects/C0264906 Second degree AV block FALSE TRUE
5 Dailymed:drugs/2663 Metoclopramide (Solution) TRUE FALSE 5 Dbpedia:Atrioventricular_block Atrioventricular block FALSE TRUE

Table D.2.

The top 5 returns of inter-topic search with the keywords “second degree AV block” and the preference “drug”.

PERank Predicate ObjectRank

Entity Name Relevant to keywords Fit Preference Entity Name Relevant to keywords Fit Preference
1 Drugbank:drugtype/smallMolecule Small Molecule FALSE TRUE 1 Sider:Sideeffects/C0085614 AV block first degree FALSE FALSE
2 Drugbank:drugs/DB01233 Metoclopramide TRUE TRUE 2 Diseaseontology Disease FALSE FALSE
3 Drugbank:drugtype/approved Approved drug FALSE TRUE 3 Sider:Sideeffects/C0004245 Atrioventricular block FALSE FALSE
4 Sider:side_effects: C0085614 AV block first degree TRUE FALSE 4 Sider:Sideeffects/C0264906 Second degree AV block FALSE FALSE
5 Diseasome: diseases/443 Giant axonal neuropathy FALSE FALSE 5 Dbpedia:Atrioventricular_block Atrioventricular block FALSE FALSE

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  • 1. Bizer C, Heath T, Berners-Lee T. Linked data-the story so far. International journal on semantic web and information systems. 2009;5:1–22.
  • 2. Blanco R, Mika P, Vigna S. The Semantic Web–ISWC 2011. Springer; 2011. Effective and efficient entity search in RDF data; pp. 83–97.
  • 3. Bron M, Balog K, de Rijke M. Advances in Information Retrieval. Springer; 2013. Example based entity search in the web of data; pp. 392–403.
  • 4. Liu TY. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval. 2009;3:225–331.
  • 5. Jindal V, Bawa S, Batra S. A review of ranking approaches for semantic search on Web. Information Processing & Management. 2014;50:416–425.
  • 6. Delbru R, Campinas S, Tummarello G. Searching web data: An entity retrieval and high-performance indexing model. Web Semantics: Science, Services and Agents on the World Wide Web. 2012;10:33–58.
  • 7. Wei W, Barnaghi P, Bargiela A. Rational Research model for ranking semantic entities. Inf Sci. 2011;181:2823–2840.
  • 8. Balmin A, Hristidis V, Papakonstantinou Y. Objectrank: Authority-based keyword search in databases. Proceedings of the Thirtieth international conference on Very large data bases-Volume 30; VLDB Endowment; 2004. pp. 564–575.
  • 9. Hogan A, Harth A, Umbrich J, Kinsella S, Polleres A, Decker S. Searching and browsing linked data with swse: The semantic web search engine. Web semantics: science, services and agents on the world wide web. 2011;9:365–401.
  • 10. Ding L, Finin T, Joshi A, Pan R, Cost RS, Peng Y, Reddivari P, Doshi V, Sachs J. Swoogle: a search and metadata engine for the semantic web. Proceedings of the thirteenth ACM international conference on Information and knowledge management; ACM; 2004. pp. 652–659.
  • 11. Navigli R, Velardi P. An analysis of ontology-based query expansion strategies. Proceedings of the 14th European Conference on Machine Learning, Workshop on Adaptive Text Extraction and Mining; Cavtat-Dubrovnik, Croatia, Citeseer. 2003. pp. 42–49.
  • 12. Halpin H, Lavrenko V. Relevance feedback between hypertext and Semantic Web search: Frameworks and evaluation. Web Semantics: Science, Services and Agents on the World Wide Web. 2011;9:474–489.
  • 13. Natsev AP, Haubold A, Tešić J, Xie L, Yan R. Semantic concept-based query expansion and re-ranking for multimedia retrieval. Proceedings of the 15th ACM international conference on Multimedia; ACM; 2007. pp. 991–1000.
  • 14. Efthimiadis EN. A user-centred evaluation of ranking algorithms for interactive query expansion. Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval; ACM; 1993. pp. 146–159.
  • 15. Jupp S, Malone J, Bolleman J, Brandizi M, Davies M, Garcia L, Gaulton A, Gehant S, Laibe C, Redaschi N. The EBI RDF platform: linked open data for the life sciences. Bioinformatics. 2014;30:1338–1339. doi: 10.1093/bioinformatics/btt765.
  • 16. Consortium U. Activities at the universal protein resource (UniProt). Nucleic acids research. 2014;42:D191–D198. doi: 10.1093/nar/gkt1140.
  • 17. Anguita A, García-Remesal M, de la Iglesia D, Maojo V. NCBI2RDF: enabling full RDF-based access to NCBI databases. BioMed research international. 2013;2013:9. doi: 10.1155/2013/983805.
  • 18. Saleem M, Kamdar MR, Iqbal A, Sampath S, Deus HF, Ngomo ACN. Big linked cancer data: Integrating linked tcga and pubmed. Web Semantics: Science, Services and Agents on the World Wide Web. 2014;27:34–41.
  • 19. Castro LJG, McLaughlin C, Garcia A. Biotea: RDFizing PubMed Central in support for the paper as an interface to the Web of Data. Journal of biomedical semantics. 2013;4:S5. doi: 10.1186/2041-1480-4-S1-S5.
  • 20. Zong N, Lee S, Kim HG. Discovering expansion entities for keyword-based entity search in linked data. Journal of Information Science. 2015;41:209–227.
  • 21. Pérez-Agüera JR, Arroyo J, Greenberg J, Iglesias JP, Fresno V. Using BM25F for semantic search. Proceedings of the 3rd international semantic search workshop; ACM; 2010. p. 2.
  • 22. Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge university press; Cambridge: 2008.
  • 23. Cheng G, Ge W, Qu Y. Falcons: searching and browsing entities on the semantic web. Proceedings of the 17th international conference on World Wide Web; ACM; 2008. pp. 1101–1102.
  • 24. Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J. Bio2RDF: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics. 2008;41:706–716. doi: 10.1016/j.jbi.2008.03.004.
  • 25. Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab; 1999.
  • 26. Haveliwala TH. Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering. 2003;15:784–796.
  • 27. Kamvar S, Haveliwala T, Manning C, Golub G. Stanford University Technical Report. Stanford: 2003. Exploiting the block structure of the web for computing pagerank.
  • 28. Hwang H, Balmin A, Reinwald B, Nijkamp E. Binrank: Scaling dynamic authority-based search using materialized subgraphs. IEEE Transactions on Knowledge and Data Engineering. 2010;22:1176–1190.
  • 29. Chakrabarti S. Dynamic personalized pagerank in entity-relation graphs. Proceedings of the 16th international conference on World Wide Web; ACM; 2007. pp. 571–580.
  • 30. Nie Z, Zhang Y, Wen J-R, Ma W-Y. Object-level ranking: bringing order to web objects. Proceedings of the 14th international conference on World Wide Web; ACM; 2005. pp. 567–574.
  • 31. Hristidis V, Raschid L, Wu Y. WebDB. 2011. Scalable Link-based Personalization for Ranking in Entity-Relationship Graphs.
  • 32. Zong N, Nam S, Eom JH, Ahn J, Joe H, Kim HG. Aligning ontologies with subsumption and equivalence relations in Linked Data. Knowledge-Based Systems. 2015;76:30–41.
  • 33. Fogaras D, Rácz B, Csalogány K, Sarlós T. Towards scaling fully personalized pagerank: Algorithms, lower bounds, and experiments. Internet Mathematics. 2005;2:333–358.
  • 34. Blanco R, Halpin H, Herzig DM, Mika P, Pound J, Thompson HS, Duc TT. Entity search evaluation over structured web data. Proceedings of the 1st international workshop on entity-oriented search workshop (SIGIR 2011); New York: ACM; 2011.
  • 35. Xu S, Bao S, Fei B, Su Z, Yu Y. Exploring folksonomy for personalized search. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval; ACM; 2008. pp. 155–162.
  • 36. Dou Z, Song R, Wen J-R. A large-scale evaluation and analysis of personalized search strategies. Proceedings of the 16th international conference on World Wide Web; ACM; 2007. pp. 581–590.
  • 37. Kyrola A. Drunkardmob: billions of random walks on just a pc. Proceedings of the 7th ACM conference on Recommender systems; ACM; 2013. pp. 257–264.
