Searching for rare diseases in PubMed: a blind comparison of Orphanet expert query and query based on terminological knowledge

N Griffon; M Schuers; F Dhombres; T Merabti; G Kerdelhué; L Rollin; S J Darmoni

doi:10.1186/s12911-016-0333-0

. 2016 Aug 2;16:101. doi: 10.1186/s12911-016-0333-0

Searching for rare diseases in PubMed: a blind comparison of Orphanet expert query and query based on terminological knowledge

N Griffon ^1,^2,^✉, M Schuers ^1,³, F Dhombres ^2,⁴, T Merabti ¹, G Kerdelhué ¹, L Rollin ^1,⁵, S J Darmoni ^1,²

PMCID: PMC4970261 PMID: 27484923

Abstract

Background

Despite international initiatives like Orphanet, it remains difficult to find up-to-date information about rare diseases. The aim of this study is to propose an exhaustive set of queries for PubMed based on terminological knowledge and to evaluate it versus the queries based on expertise provided by the most frequently used resource in Europe: Orphanet.

Methods

Four rare disease terminologies (MeSH, OMIM, HPO and HRDO) were manually mapped to each other permitting the automatic creation of expended terminological queries for rare diseases. For 30 rare diseases, 30 citations retrieved by Orphanet expert query and/or query based on terminological knowledge were assessed for relevance by two independent reviewers unaware of the query’s origin. An adjudication procedure was used to resolve any discrepancy. Precision, relative recall and F-measure were all computed.

Results

For each Orphanet rare disease (n = 8982), there was a corresponding terminological query, in contrast with only 2284 queries provided by Orphanet. Only 553 citations were evaluated due to queries with 0 or only a few hits. There were no significant differences between the Orpha query and terminological query in terms of precision, respectively 0.61 vs 0.52 (p = 0.13). Nevertheless, terminological queries retrieved more citations more often than Orpha queries (0.57 vs. 0.33; p = 0.01). Interestingly, Orpha queries seemed to retrieve older citations than terminological queries (p < 0.0001).

Conclusion

The terminological queries proposed in this study are now currently available for all rare diseases. They may be a useful tool for both precision or recall oriented literature search.

Electronic supplementary material

The online version of this article (doi:10.1186/s12911-016-0333-0) contains supplementary material, which is available to authorized users.

Keywords: PubMed, Rare diseases, Bibliography as topic, Terminology as topic

Background

There is currently no consensual definition of what is a rare disease: in Europe, a disease is considered rare if it affects less than 1 in 2000 citizens, while in United States of America (USA), the threshold was set at 200,000 in the entire population [1] (approximately 1 in 1600 according to the USA census bureau [2]).

These gross definitions lead to a major heterogeneity between rare diseases:

Most of genetic diseases are rare diseases, but some infectious diseases, cancer and auto-immune diseases are also rare.
They may occur at any point in life
There are geographical variations. A disease may be rare in one country (like Periodic disease in France) but quite frequent in another (Periodic disease in Armenia)
Some are well known and have been described for a number of years, whereas some have been recently discovered and information is scarce.

Furthermore, these definitions have led to the knowledge of 5000 to 8000 rare diseases and to the “paradox of rarity”: each disease is rare, but patients with rare diseases are numerous. Having a clear vision of the prevalence of rare diseases is not an easy task, nevertheless, it is commonly accepted that approximately 5 to 10 % of the population suffer from rare diseases (8–9 % in the USA [1], 6–8 % in the European Union [3]). In both regions, this corresponds to approximately 30,000,000 patients suffering from a rare disease, making it a real public health concern [4].

This heterogeneity and frequency of rare diseases translates into numerous different situations in which some information is needed:

Finding a physician with adequate experience may be easy when a reference center exists, but can be a real difficulty if care pathways are not identified [5]
Providing medical care for patients with a rare disease is a difficult task for physicians. Even if the care episode does not concern the rare disease.
Writing a systematic review about a rare diseases, or doing a short review in order to write a research article, requires querying one or more bibliographic databases with as much relevant keywords as possible [6].

It seems of public health importance to provide all these participants with the appropriate tools to easily retrieve relevant information about rare diseases.

PubMed is one of the most popular search engines to access medical literature. It browses the MEDLINE bibliographic database, which gathers a large part of biomedical scientific articles, and some other minor resources [7]. MEDLINE is indexed using the MeSH® thesaurus. Although PubMed theoretically allows to access the literature about rare diseases, including the most recent scientific discoveries, the combination of the following elements may hinder users:

the relative novelty of MeSH terms for rare diseases [8]. Until 2010, the MeSH contained only a few rare diseases, also, citations pertaining to rare diseases published before 2010 are not indexed precisely for rare diseases. Since this date, 10,354 rare diseases, as defined by the Office of Rare Diseases Research (ORDR) [9], have been introduced in MeSH (source MeSH 2014),
the delay in article MeSH-indexing in PubMed [10], which can be several weeks to several months, according to the importance of the journal, and
the health professionals, or the lay-persons, lack of knowledge about MeSH [11].

It is therefore difficult for physicians, and furthermore patients, to query Pubmed in an effective way, and especially to find an article about rare diseases published before 2010 or in recent months.

Several institutions (Genetic and Rare Diseases Information Center [12] and Orphanet [13]) already gather information on their website about rare diseases including a brief summary, clinical information and many links to other resources. Sometimes a link to a PubMed expert based query is provided, limiting users task to citation relevance assessment. Nevertheless, in the case of Orphanet these queries do not always take advantage of all the MeSH/PubMed functionalities and they are far from providing a comprehensive coverage of all rare diseases. Moreover, the methodology of establishing these queries is not disclosed on the Orphanet website. The aim of this study was to propose a set of queries linked to each rare disease term in Orphanet and to evaluate these queries against those developed by Orphanet.

Methods

PubMed overview

PubMed is the most frequently used bibliographic database used by biomedical scientist throughout the world. It therefore constitute a standard in terms of information retrieval. MEDLINE is the major component of PubMed, gathering almost 90 % of the 26 millions of PubMed citations. MEDLINE curators affect to each citation a list of MeSH terms to describe it with a controlled level of granularity. The MeSH atomic part is the MeSH concept, a class of synonymous terms – i.e. all terms gathered in a MeSH concept are true synonyms. MeSH concepts closely related to each other in meaning may be gathered in a MeSH descriptor (MeSH D) or a MeSH supplementary concept (MeSH SC), one of them being the prefered concept, and the other being narrower, broader or related to the preferred one. Both MeSH D and MeSH SC aims at indexing the citation, but they exhibit some differences. First, MeSH SC are quite specific terms: they are used to index chemicals, drugs, and other concepts such as rare diseases. Second, MeSH SC, unlike MeSH D, are not classed, they are only linked to one or more MeSH D, usually broader, by a specific relationship. Lastly, there are a lot more MeSH SC (≈200,000) than MeSH D (≈27,000).

Pubmed users may specify what search field they want to use in their query using between-bracket operators. Table 1 presents some operators and their meaning.

Table 1.

Some operators used in PubMed

Operator	Meaning
[ti]	The term is considered as a free text keyword and searched for in title
[ab]	The term is considered as a free text keyword and searched for in abstract
[mh]	The term, a MeSH descriptor, and all the terms it subsumes, are searched for in MeSH indexing
[majr]	The term, a MeSH descriptor, and all the terms it subsumes, are searched for in MeSH major indexing
[nm]	The term, a MeSH supplementary concept, is searched for in MeSH indexing
[tw]	The term is considered as a free text keyword and searched for in multiple fields of PubMed citation (title, abstract, MeSH indexing, other keywords etc.)

Open in a new tab

PubMed queries

Orpha queries

Orphanet PubMed queries were manually created by Orphanet experts. These queries are available on the Orphanet web site (URL: www.orpha.net), on each disease page (for the diseases that have an Orphanet PubMed query, of course). For exemple, for the Orphanet concept “retroperitoneal fibrosis”, the PubMed query is: retroperitoneal fibrosis[majr] OR Retroperitoneal fibrosis[ti]. For the orphanet concept “Blount disease”, the query is: Blount disease[tw] OR tibia vara[tw].

Terminological queries

In addition to the MeSH thesaurus, several other terminologies and ontologies are available on rare diseases: (a) a formal ontology named HRDO (Human Rare Disease Ontology) [14] was developed based on the Orphanet classification. This ontology is available in five European languages: English, French, German, Spanish and Portuguese; (b) the Online Mendelian Inheritance in Man (OMIM) database, developed at Johns Hopkins University [15]; (c) the Human Phenotype Ontology (HPO), a formal ontology, which allows the description in an unambiguous fashion of phenotypic information in medical publications and databases [16]. The HPO is freely available at http://www.human-phenotype-ontology.org.

One of the authors (SJD) has created exact match mappings between MeSH, OMIM, HPO and HRDO based on a natural language processing/conceptual based algorithm [17, 18] suggestions. Exact match mapping means that the two concepts are real synonyms (e.g. the “Absent corpus callosum cataract immunodeficiency” MeSH concept and the “Vici syndrome” HRDO disease). Using these alignments, PubMed queries are created automatically, according to a published algorithm [19]. The algorithm output depends on the type of MeSH term mapped to: MeSH concept, MeSH SC or MeSH D (see Table 2 for examples):

If the HRDO concept is mapped to a MeSH Descriptor, the query structure is as follows:

Disease[MH] OR Disease[TW] OR Synonyms Disease MeSH Descriptor[TW] OR Synonyms Disease HRDO[TW] OR Synonyms Disease OMIM[TW] (if an exact match mapping exists between HRDO concept and OMIM concept) OR Synonyms Disease HPO[TW] (if an exact match mapping exists between HRDO concept and HPO concept)
If the HRDO concept is mapped to a MeSH Supplementary Concept, the query structure is as follows:

Disease[NM] OR Disease[TW] OR Synonyms Disease MeSH Supplemntary Concept[TW] OR Synonyms Disease HRDO[TW] OR Synonyms Disease OMIM[TW] OR Synonyms Disease HPO[TW]
If the HRDO concept is mapped to a MeSH Concept, the query structure is as follows:

Disease[TW] OR Synonyms Disease MeSH Concept[TW] OR Synonyms Disease HRDO[TW] OR Synonyms Disease OMIM[TW] OR Synonyms Disease HPO[TW]
And if the HRDO concept is not mapped to the MeSH thesaurus, the query structure is as follows:

Disease[TW] OR Synonyms Disease HRDO[TW] OR Synonyms Disease OMIM[TW] OR Synonyms Disease HPO[TW]

Table 2.

Exemples of queries according to the type of the MeSH term mapped to the HRDO concept

	Types of mapped MeSH terms
	MeSH descriptor	MeSH supplemetary concept	MeSH concept	Not a MeSH term
HRDO concept example	“retinal dystrophy”	“Omenn syndrome”	“Charcot-Marie-Tooth disease, type ib”	“Isolated oxycephaly”
MeSH part of the query	“retinal dystrophies”[MH] OR “retinal dystrophies”[TW] OR “dystrophies, retinal”[TW] OR “dystrophy, retinal”[TW] OR “retinal dystrophy”[TW] OR	“reticuloendotheliosis, familial, with eosinophilia”[NM] OR “reticuloendotheliosis, familial, with eosinophilia”[TW] OR “severe combined immunodeficiency with hypereosinophilia”[TW] OR	“Charcot-Marie-Tooth disease, type ib”[TW] OR “1B, HMSN”[TW] OR “1Bs, HMSN”[TW] OR	-
HRDO part of the query	“Retinal dystrophy”[TW] OR	“Omenn syndrome”[TW] OR “Combined immunodeficiency with hypereosinophilia”[TW] OR	“Charcot-Marie-Tooth disease type 1B”[TW] OR “CMT1B”[TW] OR	“Isolated oxycephaly”[TW] OR “Turricephaly”[TW] OR “Nonsyndromic oxycephaly”[TW] OR
HPO part of the query	“Retinal dystrophy”[TW]	-	-	“Turricephaly”[TW]
OMIM part of the query	-	“Omenn syndrome”[TW]	“Charcot-marie-tooth disease, demyelinating, type 1b”[TW]	-

Open in a new tab

Each column contains one example of PubMed query corresponding to the HRDO concept in the “HRDO concept example” row. Each row gathers all the synonyms for the considered diseases in one terminology. The final queries are composed by every synonyms of every terminologies, linked by “OR”. The final PubMed query for the Isolated oxycephalydisease is: “Turricephaly”[TW] OR “Nonsyndromic oxycephaly”[TW] OR “Isolated oxycephaly”[TW]. The last “OR” “turricephaly” is redundant. In this case, the final query is deducible from only one terminology (HRDO)

Relevance evaluation

Thirty rare diseases were randomly selected from the subset with both an Orphanet query and a terminological query. The selected rare diseases are listed in Table 4 (at the end of the document). The diseases with a prevalence higher than 1/2000 were considered as not rare. One author (GK) gathered the first ten citations retrieved (PubMed “recently added” ranking), for each rare disease, using the following queries:

\begin{array}{c} Q_{1} = Q_{Orpha} & AND & Q_{Term} \end{array}

\begin{array}{c} Q_{2} = Q_{Orpha} & NOT & Q_{Term} \end{array}

\begin{array}{c} Q_{3} = Q_{Term} & NOT & Q_{Orpha} \end{array}

Table 4.

Number of citations retrieved and precision for each query, by diseases

Disease	MeSH level alignment	n(retr)			Precision
Disease	MeSH level alignment	Q₁	Q₂	Q₃	p₁	p₂	p₃
3M syndrome	SC	38	5	31	1	0.4	0.4
Autosomal recessive hypohidrotic ectodermal dysplasia	D	25	21	12	0.8	0.4	0.8
Generalized epilepsy - paroxysmal dyskinesia	SC	0	18	6	–	0.5	0.17
Silent sinus syndrome	–	107	2	3	1	1	0.67
Toluene embryopathy	SC	8	39	0	1	0.2	–
Familial drusen	SC	67	54	9	1	0.1	0.78
Autoimmune lymphoproliferative syndrome	D	176	5	1800	1	1	0
Diphtheria	D	4419	0	13,149	1	–	0.6
Hypomandibular faciocranial dysostosis	SC	6	2	0	0.83	1	–
Retroperitoneal fibrosis	D	1986	0	711	1	–	0.6
Epstein syndrome	SC	41	0	0	1	–	–
Oculopharyngeal muscular dystrophy	D	291	31	129	0.9	1	0.6
Ring chromosome 19	SC	6	4	2	1	0.75	1
Nephropathy - deafness - hyperparathyroidism	SC	0	1	0	–	1	–
Greenberg dysplasia	SC	7	4	6	1	1	0.83
Menkes disease	D	968	0	47,026	1	–	0
Mikati-Najjar-Sahli syndrome	–	0	1	0	–	1	–
Genochondromatosis	SC	5	0	0	1	–	–
Noonan syndrome	D	1483	0	258	0.8	–	0.5
Carney complex	D	248	21	299	0.9	1	0.7
Blount disease	SC	296	0	6	0.5	–	1
Oculocerebrofacial syndrome, Kaufman type	SC	5	0	0	1	–	–
Wilson disease	D	5266	0	1231	1	–	0.8
Adult Still’s disease	D	1129	0	202	1	–	0.8
Esophageal atresia	D	2999	75	479	0.9	1	0.6
Congenital nephrotic syndrome, Finnish type	C	8	239	43	0.88	0.8	0.1
Thiamine-responsive megaloblastic anemia syndrome	SC	69	34	39	1	0.6	0.1
Hereditary myoclonus - progressive distal muscular atrophy	SC	0	2	0	–	0.5	–
Dentatorubral-pallidoluysian atrophy	C	361	171	215	0.9	0	0.6
Neuronal ceroid lipofuscinosis	D	1371	0	1379	1	–	0.5
Macro average precision					0.81	0.44	0.41
Micro average precision					0.94	0.61	0.52

Open in a new tab

n(retr) number of citations retrieved, D MeSH Descriptor, SC MeSH Supplementary Concept, C MeSH Concept

With Q_Orpha the Orpha query and Q_Term the terminological query. Therefore, Q₁ retrieved citations common to both Orpha and terminological query, Q₂ retrieved citations specific to the Orpha query and Q₃ retrieved citations specific to the terminological query. He (GK) then hid the retrieving query: the evaluators were blinded vs. the type of query. The anonymised citations were split between four physicians (FD, LR, MS and NG) in such way that: (i) each citation was evaluated twice and, (ii) each evaluator shared each third of their evaluations with one different evaluator.

Evaluators had to answer the following question for each citation: “Does the article directly concern the disease?” In case of any disagreement, a third evaluator evaluated the citation and the discrepancy was resolved by consensus.

More information regarding relevance evaluation is available in Additional file 1.

Statistical analysis

Agreement between evaluators was measured by kappa. HRDO rare diseases may be split into two: terms with an Orpha query and terms without Orpha query. These two sub-populations were compared according to available determinants to ensure generalizability.

For each rare disease, it is possible to estimate the precision (p_i) of each query (Q₁, Q₂, Q₃; see Eq. 4).

p_{i} = n ({r e l}_{Qi}) / n ({eval}_{Qi})

With n(rel_Qi) and n(eval_Qi) the number of relevant citation and the number of evaluated citation for the query i, respectively. Orpha queries and terminolgical queries were compared according to micro average precision, number and publication date of retrieved citations, and use of MeSH terms. Non-parametric tests were used: Fisher’s test for qualitative variables (micro average precision and MeSH use) and Wilcoxon test and Kruskal-Wallis test for quantitative variables (number of citation and date). The Dunn test allows pairwise comparison after Kruskal-Wallis.

Results

HRDO, in its 09/11/2013 version, inventory 9060 diseases and groups of diseases. Seventy-eight were not considered as rare diseases because the prevalence, as specified by Orphanet, was above the European threshold, also, the study considered only the 8982 rare diseases. Table 3 lists the number of alignments created or validated by SJD.

Table 3.

Number of exact match mappings created between the different terminologies considered

	MeSH			HPO	OMIM
	Descriptor	Supplementary Concept	Concept	HPO	OMIM
HRDO	1247	2620	3837	484	2707
OMIM	550	4019	4681	296
HPO	886	157	1131

Open in a new tab

For example, SJD has created 1247 synonymy mappings between an HRDO concept and a MeSH descriptor

Only 2284 HRDO rare diseases have a manually validated Orphanet query (25.4 %). A terminological query is generated for each disease in Orphanet (was it rare or not). Orpha queries and terminological queries respectively retrieved 0 citations in 5 (<1 %) and 4370 (48.7 %) cases (see Fig. 1). Considering both “no query” and “0 citations” situations, there is a useful terminological or Orpha queries for 51.3 or 25.4 % of HRDO rare diseases, respectively.

Fig. 1 — Distribution of queries according to the number of citations retrieved for Orphanet and terminological queries

The 30 selected rare diseases and the number of citations retrieved by each query are listed in Table 4. Terminological queries retrieved more citations more often than Orpha queries (17 terminological queries retrieved more results than Orpha queries while only 10 orpha queries retrieved more results than terminological queries; p = 0.01; Wilcoxon test). As some queries retrieved less than 10 citations (see Table 4), only 553 PubMed citations were assessed for relevance (instead of 30 × 3 × 10 = 900). Kappa indexes before the adjudication procedure range from moderate (0.41) to almost perfect (0.86) agreement. Overall kappa was 0.68 (substantial agreement).

The precision of each query, computed after adjudication process, are listed in Table 4. The intersection query (Q₁) is significantly more precise than Q₂ (p = 0.01; Fisher’s test) and Q₃ (p < 0.001; Fisher’s test). However, there was no significant difference between Q₂ and Q₃ precision (0.61 vs. 0.52, respectively; p = 0.13; Fisher test). For the 30 selected diseases, there was significantly more terminological query that fully used the MeSH thesaurus (28 vs. 8; p < 0.001; Fisher’s test; data not shown).

When considering relevant citations alone, it is noteworthy that citations retrieved by Q₂ (i.e. only by orpha queries) are significantly older than those retrieved by Q₁ and Q₃ (p < 0.0001 in both cases, Dunn’s test with Bonferroni correction). Median publication dates are 2013, 2005 and 2014 for Q₁, Q₂ and Q₃, respectively. The results are very similar when considering all the citations evaluated, whether they were relevant or not.

Discussion

There is no differences between Orpha and terminological queries in terms of precision. However, Orpha queries retrieved significantly fewer results. Moreover, citations retrieved only by Orpha queries are significantly older than citations retrieved by terminological queries, and, Orphanet provides queries for only 25.4 % of rare diseases while terminological queries retrieved at least one PubMed citation for 51.3 % rare diseases. This suggests a differentiated approach according to the user objectives:

a precision-interested user should use the intersection query, which will retrieve the most relevant citations,
a recall-interested user might be interested in the union query.

Nevertheless, for almost 75 % of HRDO rare disease, there is no other solution but the terminological query.

Physicians are probably more interested in precision than in recall. A researcher, in contrast, may be more interested in recall for their literature review. However, in many cases, only the terminological query is available leaving the user no choice. A potentially interesting use of the set of terminological queries is its use to find medical experts about rare diseases [5], where noise is a less important problem. The set of terminological queries is available from the Health Terminology Ontology Portal [20] (URL: http://www.hetop.eu).

Two mechanisms may explain the more up to date set of results retrieved by terminological query: (i) the major part of the difference is a consequence of the evaluation method. As terminological queries retrieve more results than orphanet queries, we can hypothesized that there is both more recent and more older citations. However, PubMed ranks recent results first and we only evaluated the first ten results – i.e. the more recent. (ii) Some keywords added by the terminology expansion are quite recent, and not yet taken into account by orphanet expert in their queries.

While these hypotheses limit the value of the up-to-date effect of terminological query, it raises a maintenance issue. Creating and maintaining a query is very time-consuming and it is probably one of the main limitation of Orphanet query. For terminological queries, the maintenance may only be necessary when terminologies evolve. For example, Vasilevsky et al. [21] recently enhanced HPO with terms that patients, doctors, and machines can all understand. This evolution will require a limited validation maintenance for terminological queries. However, the convergence of these terminologies (with the Orphanet Rare Diseases Ontology [22]) may ultimately importantly reduce the maintenance tasks.

Strengths and limitations

Only two sets of queries were compared in this study: one from the Orphanet [13] and one based on terminologies [20]. The queries from the Genetic and Rare Diseases Information Center [12] were not tested for this study because of their limited design (they often only rely on OMIM record references [23], which are not updated on a regular basis). In fact, these queries cannot retrieve any citation that has not been considered by the OMIM authors. The added-value against the OMIM record references is therefore very limited. Also, only two set of queries may be considered as gold standard for a terminology queriescomparison: Orpha query and free text queries, which most users are likely to submit. They both present pros and cons.

Using free text query sounds like a very pragmatic approach, close to the reality. Results would be easy to interpret. Nevertheless, it is difficult to establish due to the impossibility of formalizing such a gold standard. The label choice has a major influence over the query result: if a label from the MeSH is used, PubMed will automatically recognize the MeSH term and perform a semantic expansion, otherwise, the query may be tokenized and each term searched for separately, which would introduce a lot of noise.

Using Orpha query may seem questionnable: only 25.4 % of the rare disesases are provided with a query and query production processus is unclear. However, these queries are somewhat validated by Orphanet expert and they are available online.

For these reasons, the use of Orpha query as a gold standard seemed to be preferable. The question the evaluators have to answer for each citation is quite generic and it might not be adapted to the real users context. One difficulty is to reach an acceptable inter evaluator agreement, the only way to assess the quality of the relevance assessment. A more specific question was tested: “Is the citation useful for medical care?” but agreement was very low.

The main limitation of this study is probably the quality assurance of terminology mapping: relying on one expertise is not sufficient for sensitive data, and while the help of an automatic algorithm may limit the false positive rate at the same time it also increases the false negative rate. Also, proper quality assurance might probably have slightly enhance terminological query performance. Nevertheless, the results presented in our study, with no difference in precision, demonstrate that a sufficient high mapping quality was achieved.

This study demonstrates some strengths. First, the evaluation of citations by two independant physicians unaware of the query and the adjudication procedure render the judgement as reliable and unbiased as possible. Second, the results are theoretically generalizable because of the random selection of the diseases, which led to a similar distribution of disease prevalence in the studied corpus compared to the entire HRDO.

The main strength of the terminological approach presented here is the availability of a query for each rare disease in each terminology. The cost of this approach – maintenance of mapping – seems very limited. Queries take advantage of the rich synonymy of classifications (HPO, HRDO, OMIM, MeSH), and, when there is an alignment to MeSH, of MeSH indexing. The semantic expansion used here could be enhanced using UMLS, nevertheless, this resource has already been shown to be too noisy [24].

Query structure - MeSH

Orpha queries and terminological queries are structurally different. Terminological queries are based on the automatic exploitation of terminological knowledge, therefore the queries are structurally simple, i.e. all the keywords are linked by a “OR” in the query. Orpha query, as manually designed, may be more complex, implying all the boolean operators (AND, OR and NOT). Even if an exact match MeSH term does not exist it is possible to use a combination of MeSH terms relevant to the disease. Overall, as previously mentioned above, the creation and maintenance of Orpha queries is a much more time consuming task.

MeSH use is also problematic because of the novelty of rare disease MeSH terms [8]. Therefore, decades of citations about rare diseases are only indexed using free text and MeSH term recall is necessarily low. Nevertheless, the indexing of citations with MeSH will gradually increase, enhancing the recall of queries based on MeSH terms, the mapping between Orphanet diseases and MeSH terms is therefore important to maintain.

Conclusions

There is a terminological query for each rare disease. This query precision was not statistically different from the precision found for Orpha queries. The terminological queries proposed in this study are a useful tool for both precision or recall oriented literature search in combination with the Orpha query, if available.

Abbreviations

MeSH, medical subject headings; MeSH D, MeSH descriptor; MeSH SC, MeSH supplementary concept; OMIM, Online Mendelian Inheritance in Man; ORDR, Office of Rare Diseases Research; HPO, human phenotype ontology; HRDO, human rare disease ontology; USA, United States of America

Acknowledgement

The authors are grateful to the DCR – Rouen University Hospital Research Center that provided funds to edit the final manuscript. The authors thank Richard Medeiros, Medical Editor – Medical Editing International for his valuable editing of the manuscript.

Funding

This work was performed on authors institutionnal own fundings.

Availability of data and materials

All the data on which the conclusions of the manuscript rely are available as supplementary material.

Authors’ contributions

SJD formulated and developed the idea of this work. TM developed the algorithm for automatic mapping and for query generation. SJD validated the automatic mappings. FD provided HRDO. NG and GK designed the evaluation. GK prepared data. NG, MS, LR and FD performed the evaluation. NG performed statistical analyses and wrote the draft of the manuscript. All authors reviewed the manuscript, made corrections and agreed on the submitted version.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Additional file

Additional file 1:^{(62.1KB, xlsx)}

Contains all the evaluated citation, their metadata and their final evaluation. As authors are French, this file is in French, however, the entire work was performed in English. Column 1 contains the unique disease ID, column 2 the disease name, column 3 the query, column 4 the MeSH level, column 5 the citation PubMed ID, column 6 the final answer for relevance (“Does the article directly concern the disease?”), column 7 the year of publication, column 8 the journal title, column 9 a link toward the citation, column 10 and 11 the two evaluators, column 12 the adjudicator, if any. (XLSX 62 kb)

Contributor Information

N. Griffon, Phone: 00 33 2 32 88 81 44, Email: nicolas.griffon@chu-rouen.fr

M. Schuers, Email: matthieu.schuers@gmail.com

F. Dhombres, Email: ferdinand.dhombres@inserm.fr

T. Merabti, Email: tayeb.merabti@gmail.com

G. Kerdelhué, Email: gaetan.kerdelhue@chu-rouen.fr

L. Rollin, Email: laetitia.rollin@chu-rouen.fr

S. J. Darmoni, Email: stefan.darmoni@chu-rouen.fr

References

1.Institute of Medicine (US) Committee on Accelerating Rare Diseases Research and Orphan Product Development . Profile of Rare Diseases. In: Field MJ, Boat TF, editors. Rare Dis Orphan Prod Accel Res Dev. Washington (DC): National Academies Press (US); 2010. [PubMed] [Google Scholar]
2.U.S. and World Population Clock. http://www.census.gov/popclock/. Accessed 13 Jul 2016.
3.Official Journal of the European Union C151, 3.7.2009, p 7
4.Forman J, Taruscio D, Llera VA, Barrera LA, Coté TR, Edfjäll C, et al. The need for worldwide policy and action plans for rare diseases. Acta Paediatr. 2012;101:805–807. doi: 10.1111/j.1651-2227.2012.02705.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
5.Pflugrad A, Jurkat-Rott K, Lehmann-Horn F, Bernauer J. Towards the Automated Generation of Expert Profiles for Rare Diseases through Bibliometric Analysis. Stud Health Technol Inform. 2014;198:47–54. [PubMed] [Google Scholar]
6.Higgins JPT, Green S, editors. Cochrane handbook for systematic reviews of interventions Version 5.1.0 [updated March 2011]. The Cochrane Collaboration. 2011. [Google Scholar]
7.MEDLINE, PubMed, and PMC (PubMed Central): How are they different?. http://www.nlm.nih.gov/pubs/factsheets/dif_med_pub.html. Accessed 13 Jul 2016.
8.Schulman JL. What’s New for 2011 MeSH®. NLM Tech Bull. 2010;377 [Google Scholar]
9.Office of rare diseases research. http://rarediseases.info.nih.gov. Accessed 13 Jul 2016.
10.Huang M, Névéol A, Lu Z. Recommending MeSH terms for annotating biomedical articles. J Am Med Inform Assoc. 2011;18(5):660–667. doi: 10.1136/amiajnl-2010-000055. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Hoogendam A, Stalenhoef AF, Robbé PF, Overbeke AJ. Analysis of queries sent to PubMed at the point of care: observation of search behaviour in a medical teaching hospital. BMC Med Inform Decis Mak. 2008;8:42. doi: 10.1186/1472-6947-8-42. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Genetic and Rare Diseases Information Center. http://rarediseases.info.nih.gov/gard. Accessed 13 Jul 2016.
13.Orphanet. http://www.orpha.net/consor/cgi-bin/index.php. Accessed 13 Jul 2016.
14.Dhombres F, Vandenbussche P-Y, Rath A, Olry A, Hanauer M, Urbero B, et al. OntoOrpha: an ontology to support edition and audit of rare diseases knowledge in Orphanet. Proceedings of the 2nd International Conference on Biomedical Ontology (ICBO-2011). Buffalo, NY, USA: Olivier Bodenreider, Maryann E. Martone, Alan Ruttenberg (eds.); 2011, p 241–3
15.Amberger J, Bocchini CA, Scott AF, Hamosh A. McKusick’s online mendelian inheritance in man (OMIM) Nucleic Acids Res. 2009;37(Database issue):D793–D796. doi: 10.1093/nar/gkn665. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Robinson PN, Mundlos S. The Human Phenotype Ontology. Clin Genet. 2010;77:525–534. doi: 10.1111/j.1399-0004.2010.01436.x. [DOI] [PubMed] [Google Scholar]
17.Merabti T, Soualmia LF, Grosjean J, Palombi O, Müller JM, Darmoni SJ. Translating the Foundational Model of Anatomy into French using knowledge-based and lexical methods. BMC Med Inform Decis Mak. 2011;11:65. doi: 10.1186/1472-6947-11-65. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Merabti T, Joubert M, Lecroq T, Rath A, Darmoni S. Mapping biomedical terminologies using natural language processing tools and UMLS: mapping the Orphanet thesaurus to the MeSH. Biomedical Engineering and Research. 2010;31(4):221–225. [Google Scholar]
19.Thirion B, Robu I, Darmoni SJ. Optimization of the PubMed Automatic Term Mapping. Stud Health Technol Inform. 2009;150:238–242. [PubMed] [Google Scholar]
20.Grosjean J, Merabti T, Griffon N, Dahamna B, Darmoni SJ. Teaching medicine with a terminology/ontology portal. Stud Health Technol Inform. 2012;180:949–953. [PubMed] [Google Scholar]
21.Vasilevsky N, Engelstad M, Foster E, McMurry J, Mungall C, Robinson P, et al. Finally, a medical terminology that patients, doctors, and machines can all understand. http://human-phenotype-ontology.github.io/2016/03/24/layperson.html. Accessed 13 Jul 2016.
22.Vasan D, Chanas L, Malone J, Hanauer M, Olry A, Jupp S, et al. Proc. PhenoDay and Bio-Ontologies at ISMB. 2014. ORDO: An Ontology Connecting Rare isease, Epidemiology and Genetic Data. [Google Scholar]
23.Online Mendelian Inheritance in Man®. http://www.omim.org/. Accessed 13 Jul 2016.
24.Griffon N, Chebil W, Rollin L, Kerdelhue G, Thirion B, Gehanno JF, et al. Performance evaluation of Unified Medical Language System®’s synonyms expansion to query PubMed. BMC Med Inform Decis Mak. 2012;12:12. doi: 10.1186/1472-6947-12-12. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

All the data on which the conclusions of the manuscript rely are available as supplementary material.

[CR1] 1.Institute of Medicine (US) Committee on Accelerating Rare Diseases Research and Orphan Product Development . Profile of Rare Diseases. In: Field MJ, Boat TF, editors. Rare Dis Orphan Prod Accel Res Dev. Washington (DC): National Academies Press (US); 2010. [PubMed] [Google Scholar]

[CR2] 2.U.S. and World Population Clock. http://www.census.gov/popclock/. Accessed 13 Jul 2016.

[CR3] 3.Official Journal of the European Union C151, 3.7.2009, p 7

[CR4] 4.Forman J, Taruscio D, Llera VA, Barrera LA, Coté TR, Edfjäll C, et al. The need for worldwide policy and action plans for rare diseases. Acta Paediatr. 2012;101:805–807. doi: 10.1111/j.1651-2227.2012.02705.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR5] 5.Pflugrad A, Jurkat-Rott K, Lehmann-Horn F, Bernauer J. Towards the Automated Generation of Expert Profiles for Rare Diseases through Bibliometric Analysis. Stud Health Technol Inform. 2014;198:47–54. [PubMed] [Google Scholar]

[CR6] 6.Higgins JPT, Green S, editors. Cochrane handbook for systematic reviews of interventions Version 5.1.0 [updated March 2011]. The Cochrane Collaboration. 2011. [Google Scholar]

[CR7] 7.MEDLINE, PubMed, and PMC (PubMed Central): How are they different?. http://www.nlm.nih.gov/pubs/factsheets/dif_med_pub.html. Accessed 13 Jul 2016.

[CR8] 8.Schulman JL. What’s New for 2011 MeSH®. NLM Tech Bull. 2010;377 [Google Scholar]

[CR9] 9.Office of rare diseases research. http://rarediseases.info.nih.gov. Accessed 13 Jul 2016.

[CR10] 10.Huang M, Névéol A, Lu Z. Recommending MeSH terms for annotating biomedical articles. J Am Med Inform Assoc. 2011;18(5):660–667. doi: 10.1136/amiajnl-2010-000055. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR11] 11.Hoogendam A, Stalenhoef AF, Robbé PF, Overbeke AJ. Analysis of queries sent to PubMed at the point of care: observation of search behaviour in a medical teaching hospital. BMC Med Inform Decis Mak. 2008;8:42. doi: 10.1186/1472-6947-8-42. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR12] 12.Genetic and Rare Diseases Information Center. http://rarediseases.info.nih.gov/gard. Accessed 13 Jul 2016.

[CR13] 13.Orphanet. http://www.orpha.net/consor/cgi-bin/index.php. Accessed 13 Jul 2016.

[CR14] 14.Dhombres F, Vandenbussche P-Y, Rath A, Olry A, Hanauer M, Urbero B, et al. OntoOrpha: an ontology to support edition and audit of rare diseases knowledge in Orphanet. Proceedings of the 2nd International Conference on Biomedical Ontology (ICBO-2011). Buffalo, NY, USA: Olivier Bodenreider, Maryann E. Martone, Alan Ruttenberg (eds.); 2011, p 241–3

[CR15] 15.Amberger J, Bocchini CA, Scott AF, Hamosh A. McKusick’s online mendelian inheritance in man (OMIM) Nucleic Acids Res. 2009;37(Database issue):D793–D796. doi: 10.1093/nar/gkn665. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR16] 16.Robinson PN, Mundlos S. The Human Phenotype Ontology. Clin Genet. 2010;77:525–534. doi: 10.1111/j.1399-0004.2010.01436.x. [DOI] [PubMed] [Google Scholar]

[CR17] 17.Merabti T, Soualmia LF, Grosjean J, Palombi O, Müller JM, Darmoni SJ. Translating the Foundational Model of Anatomy into French using knowledge-based and lexical methods. BMC Med Inform Decis Mak. 2011;11:65. doi: 10.1186/1472-6947-11-65. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR18] 18.Merabti T, Joubert M, Lecroq T, Rath A, Darmoni S. Mapping biomedical terminologies using natural language processing tools and UMLS: mapping the Orphanet thesaurus to the MeSH. Biomedical Engineering and Research. 2010;31(4):221–225. [Google Scholar]

[CR19] 19.Thirion B, Robu I, Darmoni SJ. Optimization of the PubMed Automatic Term Mapping. Stud Health Technol Inform. 2009;150:238–242. [PubMed] [Google Scholar]

[CR20] 20.Grosjean J, Merabti T, Griffon N, Dahamna B, Darmoni SJ. Teaching medicine with a terminology/ontology portal. Stud Health Technol Inform. 2012;180:949–953. [PubMed] [Google Scholar]

[CR21] 21.Vasilevsky N, Engelstad M, Foster E, McMurry J, Mungall C, Robinson P, et al. Finally, a medical terminology that patients, doctors, and machines can all understand. http://human-phenotype-ontology.github.io/2016/03/24/layperson.html. Accessed 13 Jul 2016.

[CR22] 22.Vasan D, Chanas L, Malone J, Hanauer M, Olry A, Jupp S, et al. Proc. PhenoDay and Bio-Ontologies at ISMB. 2014. ORDO: An Ontology Connecting Rare isease, Epidemiology and Genetic Data. [Google Scholar]

[CR23] 23.Online Mendelian Inheritance in Man®. http://www.omim.org/. Accessed 13 Jul 2016.

[CR24] 24.Griffon N, Chebil W, Rollin L, Kerdelhue G, Thirion B, Gehanno JF, et al. Performance evaluation of Unified Medical Language System®’s synonyms expansion to query PubMed. BMC Med Inform Decis Mak. 2012;12:12. doi: 10.1186/1472-6947-12-12. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Searching for rare diseases in PubMed: a blind comparison of Orphanet expert query and query based on terminological knowledge

N Griffon

M Schuers

F Dhombres

T Merabti

G Kerdelhué

L Rollin

S J Darmoni

Abstract

Background

Methods

Results

Conclusion

Electronic supplementary material

Background

Methods

PubMed overview

Table 1.

PubMed queries

Orpha queries

Terminological queries

Table 2.

Relevance evaluation

Table 4.

Statistical analysis

Results

Table 3.

Fig. 1.

Discussion

Strengths and limitations

Query structure - MeSH

Conclusions

Abbreviations

Acknowledgement

Funding

Availability of data and materials

Authors’ contributions

Competing interests

Consent for publication

Ethics approval and consent to participate

Additional file

Contributor Information

References

Associated Data

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases