AMIA Annual Symposium Proceedings. 2024 Jan 11;2023:369–378.

Effects of Porting Essie Tokenization and Normalization to Solr

Soumya Gayen 1, Deepak Gupta 1, Russell F Loane 1, Nicholas C Ide 1, Dina Demner-Fushman 1
PMCID: PMC10785910  PMID: 38222430

Abstract

Search for information is now an integral part of healthcare. Searches are enabled by search engines whose objective is to efficiently retrieve the information relevant to the user's query. When it comes to retrieving biomedical text and literature, the Essie search engine, developed at the National Library of Medicine (NLM), performs exceptionally well. However, Essie is a software system developed specifically for NLM, and its development and support have ceased. Solr, on the other hand, is a popular open-source enterprise search engine used by many of the world's largest internet sites, offering continuous development and improvement along with state-of-the-art features. In this paper, we present our approach to porting the key features of Essie and developing custom components to be used in Solr. We demonstrate the effectiveness of the added components on three benchmark biomedical datasets. The custom components may aid the community in improving search methods for biomedical text retrieval.

Introduction

A search engine is a type of information retrieval system developed to assist users in finding information relevant to their queries. It can quickly find precise information in a cost-effective manner. In the context of biomedical information search, search engines play a crucial role in helping users find relevant and reliable information on biomedical topics such as diseases, treatments, drugs, and medical procedures. Search engines also helped during the COVID-19 pandemic, when researchers used them to retrieve relevant literature and speed up COVID-19 research1.

Essie is a search engine developed specifically for biomedical information retrieval for ClinicalTrials.gov at the National Library of Medicine2. Later, Essie started supporting other NLM services, including Open-i®3. For searches over biomedical and clinical text, Essie has two major advantages over generic search engines such as Apache Solr, a community-supported open-source search platform built on Apache Lucene™ and managed by the Apache Software Foundation4. First, Essie is a phrase-based search engine that indexes documents using UMLS concepts and also maps queries to those concepts. Second, it has an effective tokenization strategy designed specifically for biomedical terms. Its unique probabilistic scoring mechanism gives Essie an additional performance boost. Essie, however, is software developed for NLM, and it is no longer developed or supported. Apache Solr, on the other hand, is a popular and widely used search engine in the open-source community. It uses Lucene as its core and provides full-text search capabilities, including phrases, wildcards, joins, grouping, nested search, and other features. The major advantage of Solr is that it is a community project with strong support, and it is continuously developed and improved. Solr can be configured and deployed as enterprise software in a highly scalable and distributed environment, which enables high availability and fault tolerance. Solr also allows customization of the search engine, both at the configuration level and by creating custom components that can be used as plugins.

Tokenization plays a critical role in the retrieval performance and accuracy of search engines5. Formally, tokenization refers to breaking raw text into small chunks or simple forms. Tokenization can be as simple as splitting text into words on white space or extracting character sequences of various lengths as tokens (n-grams). Alternatively, tokenization may be as complex as is needed for correct identification of spelling variants. Tokenization, stemming, and stop-word removal, which are known to improve open-domain retrieval results, do not improve retrieval performance on biomedical text5. The specialized language, the use of biomedical symbols, inconsistent spelling, abbreviations, synonymy, and inconsistent lexical variants amplify the challenges of retrieving biomedical data5,6. Since Essie offers a tokenization strategy specially designed for biomedical text, in this work we focused on porting the Essie tokenization and normalization strategy to Solr and using the tokenizer as a custom Solr component.

To analyze the effect of the Essie tokenizer on retrieval of biomedical documents using Solr, we performed extensive experiments on the test collections created for the evaluation of biomedical literature retrieval within the Text REtrieval Conference (TREC): the TREC 2004 and 2006 genomics tracks7,8 and TREC-COVID1. We evaluated and compared the retrieval performance of Essie, Solr with the standard configuration, and Solr with the Essie tokenizer. Our experiments show that Solr enhanced with the custom Essie tokenization and normalization performs better across all collections and evaluation metrics.

Background and Significance

The National Library of Medicine has been using Essie for image and document retrieval in Open-i for the past 15 years. The Open-i system is a multimodal biomedical search engine used to search and retrieve openly available articles and images. Search is performed using text and image queries. Currently, Essie serves 4.2 million images and 1.4 million articles from 5 different sources, including PMC (PubMed Central), the chest X-ray collection from the Indiana University hospital network, the Orthopedic Surgical Anatomy Teaching Collection at the USC Digital Library, images from the History of Medicine Division of NLM, and MedPix, a collection of radiology teaching cases. The Open-i site serves about 57,000 users a month. Essie indexes the article text and image features to retrieve the relevant text and images. As support for Essie is no longer available, we need to find an alternative search engine, and Solr appears to be the best open-source alternative, as it is supported and developed by the community. The upcoming Open-i update will have roughly 5 million articles and 20 million images. With a much bigger collection, it needs to maintain or improve upon the current high-quality performance, in terms of both the quality of the results and real-time response to users' searches. While Solr and Lucene have shown state-of-the-art performance in biomedical retrieval, they were often outperformed by Essie9. An ablation study of Essie retrieval results for TREC 2003 showed that the most significant improvement in the results was due to Essie tokenization and normalization10. We therefore decided to invest the effort in porting the Essie tokenizer to Solr, hoping it would improve the retrieval results in comparison to using the standard Solr tokenizer. Our experimental results confirm that the specialized tokenizer improves biomedical literature retrieval results, which motivated us to package the specialized Solr plugin and make it publicly available at https://github.com/soumyagayen/solressietokenizer. Additionally, our ongoing research on consumer health question answering currently uses Solr as a starting point for a traditional information retrieval-based question-answering pipeline. We will use the ported Essie tokenizer in this project as well.

Related Work

The wide variety of biomedical text in the literature demands an effective tokenization and normalization strategy to retrieve relevant biomedical information for user queries11,12. In the Genomics Track13 of TREC 2003, Tomlinson14 noted that combining alphabetic and numeric characters into a single token performs better than separating them. To accommodate hyphenation in gene names, both Pirkola and Leppanen15 and Crangle et al.16 split alphabetic and numeric strings in their submissions to TREC 2003 and 2004, respectively. In their TREC 2006 system description, Urbain et al.17 reported normalizing gene and protein terms with suitable variants for the document and passage retrieval tasks. To analyze the effect of various tokenization and normalization strategies, Jiang and Zhai5 carried out a detailed evaluation on TREC biomedical text collections for ad hoc document retrieval tasks. Specifically, they defined and investigated the role of three break-point normalization methods and a Greek-alphabet normalization method for the document retrieval task. Trieschnigg et al.6 analyzed the effect of various tokenization schemes on document retrieval in the biomedical and newswire domains and confirmed that biomedical retrieval is more sensitive to subtle shifts in tokenization methods. Claveau18 introduced unsupervised and semi-supervised methods for decomposing a term into its morphological components (morphs) and assigning semantic labels to each of those morphs. These experiments on biomedical document retrieval show significant performance improvement over the counterpart baseline systems on multiple retrieval metrics.

Recent work in the field of information retrieval has focused on neural word embedding approaches, which have proven effective in improving ad hoc retrieval performance thanks to their ability to capture the semantic meanings of vocabulary terms. In the open domain, Roy et al.19 studied the effect of term normalization on ad hoc retrieval performance with word2vec20 and fastText21 embeddings. Inspired by the success of subword-based22 tokenization in neural machine translation, Zhang and Tan23 recently compared traditional token-level representations, SentencePiece subwords24, and character-level representations in a traditional IR system (BM25) for cross-lingual information retrieval on a large-scale dataset25 derived from Wikipedia. In their study, SentencePiece tokenization was found to be less effective than the traditional token-level representation using the BM25 scoring approach. Ogundepo et al.26 explored the role of three tokenization mechanisms, whitespace, a language-specific Lucene analyzer, and the multilingual BERT (mBERT) tokenizer, on Mr. TyDi27, a multilingual retrieval benchmark covering 11 typologically diverse languages. Their results demonstrate that mBERT tokenization outperformed whitespace tokenization and even showed better results than the custom Lucene analyzers for some languages. This study also suggests combining mBERT with Lucene analyzers to provide additional performance gains. Solr has been used as an off-the-shelf search engine for indexing and retrieving biomedical documents28,29,30. Essie tokenization and normalization proved to be more effective2,31 than the other search engines' strategies for tokenizing biomedical text. To the best of our knowledge, no other efforts have been made to overcome the Solr tokenization limitations with respect to processing of biomedical documents and queries. This work bridges the gap between the Essie and Solr search engines for processing biomedical information.

Methods

In this section, we first describe the specifics of the latest Essie tokenization and normalization. We then describe how this tokenizer was introduced as a module in the Solr architecture. Finally, we describe the evaluation of the Solr-Essie tokenizer on three biomedical information retrieval test collections.

A. Essie tokenizer

Given an input text, the Essie tokenizer performs the following tokenization and normalization steps; an illustrative code sketch of these rules follows the list.

  1. Case conversion:

    Flattening case to lowercase and flattening white space to a single blank space.

  2. Converting Unicode characters to ASCII:

    Essie uses a lookup table to convert Unicode characters to ASCII characters. The steps taken during conversion are expanding ligatures, replacing variants of special characters (quotes, apostrophes, hyphens, etc.) with their ASCII forms, stripping accent marks, and dropping various special characters such as trademark symbols. In some cases, a single Unicode character is mapped to multiple ASCII characters.

  3. Tokenization:

    Essie treats each token as a run of letters, a run of numbers, or a run of the same punctuation mark. For example, "..." is 1 token, but "(QED.)" is 4 tokens: "(", "QED", ".", ")".

  4. Rule-based transformations:

    • Number variants: Dropping trailing zeros after the decimal point, trailing decimal points, and embedded commas.

    • Possessives: Dropping all trailing "'s" and all embedded "'". For example, "O'Malley's" becomes "omalley".

    • Hyphens: Removing hyphens where possible, while making sure negative numbers are not converted to positive.

    • Punctuation: Runs of identical punctuation are shortened to a maximum length of 4.

  5. Normalization using the UMLS SPECIALIST Lexicon32:

    The tokens are mapped to a preferred form of inflectional variants limited to plurals; other variants (e.g., derivational) are not used, as this more aggressive normalization did not improve the results in preliminary experiments. Spelling variants and compound words are used.
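
The following is a minimal, illustrative sketch of steps 1-4 in Java. It is not the Essie code itself: the class name EssieLikeNormalizer, the abbreviated Unicode lookup table, and the simplified rules are assumptions made for illustration (for instance, the number-variant and hyphen rules are omitted); the actual ported code is available in the GitHub repository referenced in the Background section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of Essie-style tokenization and normalization (steps 1-4 above).
public final class EssieLikeNormalizer {

    // Step 2: tiny excerpt of a Unicode-to-ASCII lookup table; a single character
    // may map to several ASCII characters (e.g., the ligature "æ" becomes "ae"),
    // and some characters (e.g., the trademark sign) are dropped.
    private static final Map<Character, String> UNICODE_TO_ASCII = Map.of(
            'æ', "ae", 'é', "e", '\u2019', "'", '\u2013', "-", '\u2122', "");

    // A token is a run of letters, a run of digits, or a run of the same punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("[a-z]+|[0-9]+|([^a-z0-9\\s])\\1*");

    public static List<String> tokenize(String text) {
        // Step 1: flatten case and white space.
        String s = text.toLowerCase().replaceAll("\\s+", " ").trim();

        // Step 2: map non-ASCII characters through the lookup table.
        StringBuilder ascii = new StringBuilder();
        for (char c : s.toCharArray()) {
            ascii.append(c < 128 ? String.valueOf(c) : UNICODE_TO_ASCII.getOrDefault(c, ""));
        }
        s = ascii.toString();

        // Step 4 (possessives): drop trailing "'s" and embedded apostrophes.
        s = s.replaceAll("'s\\b", "").replace("'", "");

        // Step 3: split into tokens; step 4 (punctuation): cap identical runs at 4 characters.
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            String tok = m.group();
            if (!Character.isLetterOrDigit(tok.charAt(0)) && tok.length() > 4) {
                tok = tok.substring(0, 4);
            }
            tokens.add(tok);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [omalley, (, qed, ., ), ....]
        System.out.println(tokenize("O’Malley’s (QED.) ......"));
    }
}
```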

Figure 1 shows examples of the character-based normalizations performed by Essie. Figure 2 shows the term-based normalizations, which also include the character-based normalizations.

Figure 1. Examples of Essie Character Normalization.

Figure 2. Examples of Essie Term Normalization.

B. Solr Tokenization

Solr provides various built-in tokenizers and filters that can be configured. Solr also provides an option for creating custom filters and configuring them in an analyzer. A Solr analyzer examines the text of fields and generates a token stream. An analyzer can also be decomposed into a tokenizer followed by a chain of filters connected into a pipeline in the order they should run.

Figure 3 shows an example configuration of a Solr analyzer that consists of a chain of filters in the order they are applied, starting with the StandardTokenizer, followed by the Standard, LowerCase, and StopWord filters.

Figure 3. An example of a configuration of the text field analyzer in Solr.

We extracted the tokenization and normalization logic from the Essie code and compiled it as an independent JAR file that can be placed in the Solr setup. To be used, it needs to be declared as a custom library in solrconfig.xml as follows: <lib path="../custom-lib/customfilter.jar"/>. In the Solr schema.xml, where the configuration is defined, we define our custom field (essie_custom_field) and custom analyzer. The first step in creating an analyzer is to define a Solr tokenizer. Solr provides various options, e.g., a StandardTokenizer, which splits the text into tokens based on white space, among many others. For our purpose, we use the Solr KeywordTokenizer, as it treats the entire text field as a single token and passes it on to our custom filter. This is the best Solr tokenizer for our purposes because we do not want Solr to tokenize the text; tokenization is performed by our custom filter (the Essie tokenizer and normalizer). Next in the chain, we fit our custom filter, which tokenizes and normalizes the text and passes it to the subsequent filters in the chain. Figure 4 shows, side by side, a block diagram of the Solr analyzer configuration with the custom Essie tokenizer and the corresponding XML configuration.
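
To illustrate how such a filter plugs into the Lucene/Solr analysis chain, the sketch below pairs a hypothetical filter factory with a filter that defers the actual splitting to the EssieLikeNormalizer sketched above. The class names are illustrative stand-ins for the published plugin, the exact package of TokenFilterFactory depends on the Lucene/Solr version, and the factory would be referenced by its fully qualified class name in the custom analyzer definition in schema.xml.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenFilterFactory; // in org.apache.lucene.analysis.util for older Lucene/Solr
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical factory, referenced by class name in the <filter .../> element of the
// custom field type in schema.xml, directly after the KeywordTokenizer.
public class EssieLikeTokenFilterFactory extends TokenFilterFactory {
    public EssieLikeTokenFilterFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public TokenStream create(TokenStream input) {
        return new EssieLikeTokenFilter(input);
    }
}

// Filter that re-tokenizes the single incoming token (the whole field value emitted by
// the KeywordTokenizer) using the Essie-style rules; EssieLikeNormalizer is the
// illustrative helper sketched earlier. In a real plugin each class would live in its
// own source file.
final class EssieLikeTokenFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final Deque<String> pending = new ArrayDeque<>();

    EssieLikeTokenFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (pending.isEmpty()) {
            if (!input.incrementToken()) {
                return false; // upstream exhausted, nothing left to emit
            }
            pending.addAll(EssieLikeNormalizer.tokenize(termAtt.toString()));
        }
        termAtt.setEmpty().append(pending.poll()); // emit the next normalized token
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending.clear();
    }
}
```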

Figure 4. Solr configuration with the Essie tokenizer. The left side shows a block diagram of the Solr custom text field; the right side shows the corresponding configuration in schema.xml.

Evaluation

To evaluate the contribution of the Essie tokenizer ported to Solr, we compare the retrieval performance of Essie, Solr with the default tokenizer, and Solr with Essie tokenizer on three different test collections: TREC 2004 genomics track, TREC 2006 genomics track and TREC-COVID, which uses the COVID-19 Open Research Dataset (CORD-19)33.

A. Collections

We used the following benchmark datasets to evaluate the performance of the Essie tokenizer ported to Solr.

  1. TREC 2004 Genomics Track7: It consists of 4,591,008 records that are a subset of the MEDLINE database (1993-2003). The ad-hoc retrieval task had 50 topics, which were derived from interviews eliciting information needs of real biologists.

  2. TREC 2006 Genomics Track8: It consists of 162,259 documents that were curated from full-text HTML documents of 49 journals. For the retrieval tasks, there are 28 topics that are posed as questions.

  3. TREC-COVID: The TREC-COVID challenge was conducted over 5 rounds that used the CORD-19 collection. For our evaluation purposes, we used the cumulative data from all 5 rounds. The data were compiled from around 140k publications about COVID-19 and related historical coronaviruses such as SARS and MERS. For our evaluation, we used the dataset published by BEIR (Benchmarking IR)34. The BEIR CORD-19 collection has 171,332 documents and 50 topics.

For each TREC collection, the users' information needs are called topics. Each topic traditionally consists of three parts: a short, 2-3 word search that a user would submit to a search engine; a longer description, sometimes called a question; and a long narrative part that explains the topic to a human, e.g., a reference librarian who might help with the search. In our evaluations we used the shortest field, e.g., for TREC-COVID we used the "query" field. Figure 5 shows an example TREC-COVID topic.

Figure 5. An example TREC-COVID topic.

B. Evaluation Metrics

We used the official evaluation metrics as required by the particular challenge. For TREC 2004 and TREC 2006 we used Mean Average Precision (MAP), and for TREC-COVID we used Precision at 5 (P@5) and Normalized Discounted Cumulative Gain at 10 (NDCG@10).

Precision is defined as the ratio of relevant retrieved documents to the total number of retrieved documents. When we consider a particular number of retrieved documents, or a maximum cut-off rank, we describe it as precision at that rank. For example, P@5 is the fraction of relevant documents among the top five retrieved documents.

Mean Average Precision (MAP) is the arithmetic mean of the average precision values for an information retrieval system over all N topics in the collection. Average Precision is the average of the precision values obtained after each relevant document is retrieved35.

NDCG, normalized discounted cumulative gain, is an evaluation metric that considers both the relevance and the rank of retrieved documents. Gain is the relevance score of each document. Cumulative gain is the total of all the gains for the given K retrieved documents. Discounted cumulative gain (DCG) assigns higher weight to the gain of the top-ranked documents and lower weight to highly relevant documents that are ranked lower. Normalized discounted cumulative gain (NDCG) normalizes the score by dividing it by the ideal DCG score, which is calculated from the ground truth36.
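
For concreteness, the sketch below gives minimal, illustrative implementations of P@K, average precision (MAP is its arithmetic mean over the topics), and NDCG@K. It assumes binary judgments for P@K and average precision and graded judgments for NDCG, and it is not the trec_eval code used in our evaluations.

```java
import java.util.List;

// Illustrative implementations of P@K, average precision, and NDCG@K.
// "relevance" holds the judged relevance of each retrieved document, in ranked order
// (0 = not relevant; larger values = more relevant for the graded NDCG case).
public final class RankingMetrics {

    // P@K: fraction of the top K retrieved documents that are relevant.
    static double precisionAtK(List<Integer> relevance, int k) {
        int hits = 0;
        for (int i = 0; i < Math.min(k, relevance.size()); i++) {
            if (relevance.get(i) > 0) hits++;
        }
        return (double) hits / k;
    }

    // Average precision: sum of the precision values observed at the rank of each
    // relevant retrieved document, divided by the total number of relevant documents.
    static double averagePrecision(List<Integer> relevance, int totalRelevant) {
        double sum = 0.0;
        int hits = 0;
        for (int i = 0; i < relevance.size(); i++) {
            if (relevance.get(i) > 0) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return totalRelevant == 0 ? 0.0 : sum / totalRelevant;
    }

    // NDCG@K: DCG@K of the ranking divided by the DCG@K of an ideal (descending) ranking.
    static double ndcgAtK(List<Integer> relevance, List<Integer> idealRelevance, int k) {
        double ideal = dcgAtK(idealRelevance, k);
        return ideal == 0.0 ? 0.0 : dcgAtK(relevance, k) / ideal;
    }

    private static double dcgAtK(List<Integer> relevance, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, relevance.size()); i++) {
            // log2(rank + 1) discount, where rank is 1-based (i.e., rank = i + 1).
            dcg += relevance.get(i) / (Math.log(i + 2) / Math.log(2));
        }
        return dcg;
    }

    public static void main(String[] args) {
        List<Integer> run = List.of(2, 0, 1, 0, 2);    // graded judgments of the top 5 results
        List<Integer> ideal = List.of(2, 2, 1, 0, 0);  // the same judgments, ideally ordered
        System.out.printf("P@5    = %.3f%n", precisionAtK(run, 5));
        System.out.printf("AP     = %.3f%n", averagePrecision(run, 3));
        System.out.printf("NDCG@5 = %.3f%n", ndcgAtK(run, ideal, 5));
    }
}
```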

We used the trec_eval package to compute the measures for the TREC 2004 and TREC-COVID evaluations37. We used the evaluation package developed for the TREC 2006 evaluation38.

Validation and Results

All collections provide an official file that contains relevance judgments for all judged documents for the given task. We indexed the document collections using Essie, Solr with default settings, and Solr with the Essie tokenizer. We retrieved documents using the short searches provided in the TREC topics. To compare the three search engines, our focus was on creating identical conditions for all three and generating results for an unbiased comparison, in which only the tokenizers differed. We did not apply any advanced techniques for improving the query, nor did we use query reduction or other techniques known to improve search results. MAP, precision at 5 (P@5), and precision at 100 (P@100) scores for the TREC 2004 and TREC 2006 collections, comparing Essie, Solr Standard, and Solr with the Essie tokenizer, are presented in Table 1. Table 1 also presents a comparison of Essie, Solr Standard, and Solr with the Essie tokenizer for the TREC-COVID collection, measured by NDCG@10, P@5, and P@100.

Table 1. Evaluation results for TREC 2004, TREC 2006, and TREC-COVID.

                          TREC 2004              TREC 2006              TREC-COVID
                          MAP    P@5    P@100    MAP    P@5    P@100    NDCG@10  P@5    P@100
Essie                     0.149  0.392  0.180    0.165  0.142  0.083    0.175    0.195  0.207
Solr (Standard)           0.219  0.470  0.204    0.388  0.464  0.205    0.465    0.528  0.390
Solr (Essie Tokenizer)    0.278  0.529  0.251    0.435  0.528  0.232    0.554    0.612  0.416

Discussion

Essie, a search engine developed at the National Library of Medicine for internal projects, has reached its end of life; all development and support work on Essie has ceased. In most projects that used Essie, it has been replaced with Solr or Elasticsearch. Search engines like Solr and Elasticsearch with default settings produce average results when used to retrieve biomedical literature. Essie has outperformed these results in TREC evaluations and successfully served multiple NLM projects. In an effort to preserve our information retrieval performance on biomedical data, we ported the tokenization and normalization techniques used by Essie to Solr. Our results show that Solr configured with the Essie tokenizer performs better than Solr with the standard tokenizer.

Precision at five retrieved documents is an important metric for our question answering research, as this is the number of potential answers the users see first in a traditional information retrieval-based system. Precision at 100 is important for Open-i results, which present 100 images on each page of the retrieval results. As Table 1 shows, users will see more relevant images on the first page of Open-i results due to Essie tokenization in Solr, and question answering systems will also work with a set of candidate documents that are more relevant to the question.

Using the Essie tokenizer in Solr opens opportunities for a search engine that combines the good performance of a domain-specific tokenizer on biomedical text with the state-of-the-art features of Solr. Solr has extensive community support. It is known to be reliable, scalable, fault tolerant, and ready to operate in a distributed environment. Solr provides automated failover and recovery, options to run in Docker containers, and support for systems that automate deployment, scaling, and management of containerized applications, e.g., Kubernetes. These features make it a powerful application and explain its wide adoption in the field of biomedical information retrieval. Easy customization and setup of Solr along with the Essie tokenizer makes it useful for the research community by providing a strong domain-specific baseline.

Our work has some limitations. Due to our primary interest in literature retrieval, we have not tested whether the Solr-Essie tokenizer improves the processing and retrieval of clinical text. We tested the tokenizer only with the BM25 retrieval model and have not tested other scoring functions provided by Solr. Finally, while tokenization and normalization have been shown to be the strongest contributors to Essie retrieval, additional smaller benefits come from handling polysemy using the UMLS synonymy and from the Essie document scoring function.

In the future, we intend to extend our work by porting Essie synonymy to Solr. Synonymy is known to be important for biomedical search engines, as it bridges the gap between medical literature and non-medical end users. For example, a user searching for "heart attack" will not find the majority of related documents without query expansion using synonymy. Synonymy also compensates for some of the derivational cases that are not covered by the conservative normalization. Our initial work on implementing UMLS synonymy in Solr did not show significant improvement in the evaluation results on the above collections. This might be explained by the standard medical language that was universally used both in the collection queries and in the documents; however, a more thorough investigation with collections that are known to benefit from UMLS synonymy is needed. The Essie scoring function is another potential avenue to explore and add to the set of scoring functions already available in Solr.

While Essie synonymy and scoring may improve over those already available in Solr, researchers can already benefit from using Essie tokenization as a Solr filter, which is available along with the code on GitHub at https://github.com/soumyagayen/solressietokenizer. Users can download the JAR and use the Essie normalization and tokenization in their custom Solr analyzer. Guidelines on how to use it in a Solr implementation are provided there.

Conclusion

We present the Essie tokenizer for Solr, which can be configured and used as a filter in a Solr analyzer by the biomedical community. Our experiments on the TREC 2004, TREC 2006, and TREC-COVID datasets clearly show an improvement in performance for biomedical literature retrieval when Solr is configured with the Essie tokenizer. The results also show that Solr with the Essie tokenizer works better than standalone Essie. We believe the customization that Solr provides at index and query time, paired with the Essie tokenizer filter, will significantly improve the performance of information retrieval on biomedical data.

Acknowledgements

This work was supported by the intramural research program at the U.S. National Library of Medicine, National Institutes of Health, and utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov).


References

  1. Voorhees E, Alam T, Bedrick S, Demner-Fushman D, Hersh WR, Lo K, Roberts K, Soboroff I, Wang LL. TREC-COVID: constructing a pandemic information retrieval test collection. ACM SIGIR Forum. 2020;54(1):1–12.
  2. Ide NC, Loane RF, Demner-Fushman D. Essie: a concept-based search engine for structured biomedical text. Journal of the American Medical Informatics Association. 2007;14(3):253–263. doi: 10.1197/jamia.M2233.
  3. Demner-Fushman D, Antani S, Simpson M, Thoma GR. Design and development of a multimodal biomedical information retrieval system. Journal of Computing Science and Engineering. 2012;6(2):168–177.
  4. Apache Software Foundation. Welcome to Apache Solr. https://solr.apache.org/ [accessed 20 March 2023]
  5. Jiang J, Zhai C. An empirical study of tokenization strategies for biomedical information retrieval. Information Retrieval. 2007;10:341–363. doi: 10.1007/s10791-007-9027-7.
  6. Trieschnigg D, Kraaij W, de Jong F. The influence of basic tokenization on biomedical document retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; July 23-27, 2007; Amsterdam, The Netherlands. New York, NY: Association for Computing Machinery; 2007. pp. 803–804.
  7. Cohen AM, Hersh WR. The TREC 2004 genomics track categorization task: classifying full text biomedical documents. J Biomed Discov Collab. 2006 Mar 14;1:4. doi: 10.1186/1747-5333-1-4.
  8. Hersh WR, Cohen AM, Roberts PM, Rekapalli HK. TREC 2006 genomics track overview. In: The Fifteenth Text REtrieval Conference (TREC 2006). Gaithersburg, Maryland: NIST; 2006.
  9. Hersh W. Information retrieval: a biomedical and health perspective. New York: Springer; 2020.
  10. Kayaalp M, Aronson AR, Humphrey SM, Ide NC, Tanabe LK, Smith LH, Demner-Fushman D, Loane RF, Mork JG, Bodenreider O. Methods for accurate retrieval of MEDLINE citations in functional genomics. In: The Twelfth Text REtrieval Conference (TREC 2003). Gaithersburg, Maryland: NIST; 2003. pp. 441–450.
  11. Ananiadou S, Kell DB, Tsujii J. Text mining and its potential applications in systems biology. Trends in Biotechnology. 2006;24(12):571–579. doi: 10.1016/j.tibtech.2006.10.002.
  12. Krauthammer M, Nenadic G. Term identification in the biomedical literature. Journal of Biomedical Informatics. 2004;37(6):512–526. doi: 10.1016/j.jbi.2004.08.004.
  13. Hersh WR, Bhupatiraju RT. TREC genomics track overview. In: The Twelfth Text REtrieval Conference (TREC 2003). Gaithersburg, Maryland: NIST; 2003.
  14. Tomlinson S. Robust web and genomics retrieval with Hummingbird SearchServer at TREC 2003. In: The Twelfth Text REtrieval Conference (TREC 2003). Gaithersburg, Maryland: NIST; 2003.
  15. Pirkola A, Leppanen E. TREC 2003 genomics track experiments at UTA. In: The Twelfth Text REtrieval Conference (TREC 2003). Gaithersburg, Maryland: NIST; 2003.
  16. Crangle C, Zbyslaw A, Cherry JM, Hong EL. Concept extraction and synonym management for biomedical information retrieval. In: The Thirteenth Text REtrieval Conference (TREC 2004). Gaithersburg, Maryland: NIST; 2004.
  17. Urbain J, Goharian N, Frieder O. IIT TREC 2007 Genomics Track: using concept-based semantics in context for genomics literature passage retrieval. In: The Fifteenth Text REtrieval Conference (TREC 2006). Gaithersburg, Maryland: NIST; 2006.
  18. Claveau V. Unsupervised and semi-supervised morphological analysis for information retrieval in the biomedical domain. In: COLING 2012; Mumbai, India. The COLING 2012 Organizing Committee; 2012. pp. 629–646.
  19. Roy D, Ganguly D, Bhatia S, Bedathur S, Mitra M. Using word embeddings for information retrieval: how collection and term normalization choices affect performance. In: CIKM '18: The 27th ACM International Conference on Information and Knowledge Management; October 22-26, 2018; Torino, Italy. New York: Association for Computing Machinery; 2018. pp. 1835–1838.
  20. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: The 26th International Conference on Neural Information Processing Systems; December 5-10, 2013; Lake Tahoe, Nevada. Red Hook, NY: Curran Associates Inc.; 2013. pp. 3111–3119.
  21. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics. 2017;5:135–146.
  22. Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. In: The 54th Annual Meeting of the Association for Computational Linguistics; August 7-12, 2016; Berlin, Germany. Stroudsburg, PA: Association for Computational Linguistics; 2016. pp. 1715–1725.
  23. Zhang H, Tan L. Textual representations for crosslingual information retrieval. In: The 4th Workshop on e-Commerce and NLP (ECNLP 2021); August 6, 2021; Online. Stroudsburg, PA: Association for Computational Linguistics; 2021. pp. 116–122.
  24. Kudo T, Richardson J. SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In: The 2018 Conference on Empirical Methods in Natural Language Processing; October 31 - November 4, 2018; Brussels, Belgium. Stroudsburg, PA: Association for Computational Linguistics; 2018. pp. 66–71.
  25. Sasaki S, Sun S, Schamoni S, Duh K, Inui K. Cross-lingual learning-to-rank with shared representations. In: The 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; June 2018; New Orleans, Louisiana. Stroudsburg, PA: Association for Computational Linguistics; 2018. pp. 458–463.
  26. Ogundepo O, Zhang X, Lin J. Better than whitespace: information retrieval for languages without custom tokenizers. arXiv preprint arXiv:2210.05481. 2022 Oct 11.
  27. Zhang X, Ma X, Shi P, Lin J. Mr. TyDi: a multi-lingual benchmark for dense retrieval. In: The 1st Workshop on Multilingual Representation Learning; November 2021; Punta Cana, Dominican Republic. Stroudsburg, PA: Association for Computational Linguistics; 2021. pp. 127–137.
  28. Lee S, Kim D, Lee K, Choi J, Kim S, Jeon M, Lim S, Choi D, Kim S, Tan AC, Kang J. BEST: next-generation biomedical entity search tool for knowledge discovery from biomedical literature. PLoS One. 2016;11(10):e0164680. doi: 10.1371/journal.pone.0164680.
  29. Gurulingappa H, Toldo L, Schepers C, Bauer A, Megaro G. Semi-supervised information retrieval system for clinical decision support. In: The Twenty-Fifth Text REtrieval Conference (TREC 2016). Gaithersburg, Maryland: NIST; 2016.
  30. Allot A, Chen Q, Kim S, Vera Alvarez R, Comeau DC, Wilbur WJ, Lu Z. LitSense: making sense of biomedical literature at sentence level. Nucleic Acids Res. 2019;47(W1):W594–W599. doi: 10.1093/nar/gkz289.
  31. Hauser SE, Demner-Fushman D, Ford GM, Jacobs JL, Thoma G. Preliminary comparison of three search engines for point of care access to MEDLINE citations. AMIA Annu Symp Proc. 2006;2006:945.
  32. NLM. The SPECIALIST Lexicon. https://lhncbc.nlm.nih.gov/LSG/Projects/lexicon/current/web/index.html [accessed 20 March 2023]
  33. Wang LL, Lo K, Chandrasekhar Y, Reas R, Yang J, Burdick D, Eide D, Funk K, Katsis Y, Kinney R, Li Y. CORD-19: The COVID-19 Open Research Dataset. In: ACL 2020 Workshop on Natural Language Processing for COVID-19 (NLP-COVID); July 2020; Online. Stroudsburg, PA: Association for Computational Linguistics; 2020.
  34. Thakur N, Reimers N, Rücklé A, Srivastava A, Gurevych I. BEIR: a heterogeneous benchmark for zero-shot evaluation of information retrieval models. In: Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021), Datasets and Benchmarks Track; December 6-14, 2021; Online.
  35. NIST. TREC evaluation measures. https://trec.nist.gov/pubs/trec14/appendices/CE.MEASURES05.pdf [accessed 20 March 2023]
  36. Järvelin K, Kekäläinen J. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS). 2002;20(4):422–446.
  37. NIST. Text REtrieval Conference (TREC), trec_eval. https://trec.nist.gov/trec_eval/ [accessed 20 March 2023]
  38. Cohen A, Rekapalli HK. OHSU. TREC 2006 Genomics Track data & tools. https://dmice.ohsu.edu/trec-gen/2006data.html#tools [accessed 20 March 2023]
