Table 1.
Step | General approaches | Features extracted |
---|---|---|
Document preprocessing and indexing | HTML to plain text by eliminating the tags; HTML to XML; filtering out certain sections, such as references and acknowledgements; conversion of HTML to records of a relational database structure [IIT]; stemming and stop-word filtering | Stemming; stop-word filtering |
Query expansion | Identification of keywords using automated, manual and interactive methods; synonym lookup using online biomedical dictionaries such as UMLS, Entrez Gene, MeSH, HUGO, MetaMap, etc.; assigning weights to keywords in the query; normalizing keywords into their root forms | |
Document retrieval | Use IR algorithms such as tf-idf, BM25, I(n)B2, dtu.dtn, Jelinek-Mercer smoothing, KL-divergence, SVM classifiers and an ensemble of standard algorithms; retrieve different units of text, such as a document, a paragraph, a subset of paragraphs or a sentence, using these algorithms | Retrieval algorithm; unit of text retrieval |
Passage retrieval | Use one of the following for passage extraction: | Passage definition; passage rescoring |
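
To make the query expansion and document retrieval rows concrete, the following is a minimal sketch of synonym-based expansion followed by Okapi BM25 scoring over passages. It is an illustration under stated assumptions, not the method of any system summarized in the table: the stop-word list, the `SYNONYMS` dictionary (standing in for a lookup in a resource such as UMLS or MeSH), the function names, and the parameter values `k1=1.2` and `b=0.75` are all hypothetical choices, and stemming is omitted.

```python
import math
from collections import Counter

# Hypothetical stop-word list and synonym table, for illustration only.
# A real system would use a full stop list and a biomedical dictionary.
STOP_WORDS = {"the", "of", "in", "a", "an", "and", "is", "to", "by"}
SYNONYMS = {"tumor": ["neoplasm"]}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,;:()[]") for t in text.lower().split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def expand_query(query):
    """Return the query terms followed by any synonyms found for them."""
    terms = tokenize(query)
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return expanded


def bm25_scores(query_terms, passages, k1=1.2, b=0.75):
    """Return one Okapi BM25 score per passage for the given query terms."""
    docs = [tokenize(p) for p in passages]
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


passages = [
    "BRCA1 mutations in breast cancer",
    "Role of p53 in apoptosis",
    "The neoplasm growth rate",
]
scores = bm25_scores(expand_query("tumor"), passages)
best = max(range(len(passages)), key=scores.__getitem__)
print(passages[best])
```

In this sketch the unit of text retrieval is whatever strings are passed as `passages`, so the same function can score documents, paragraphs or sentences. The third passage is matched only through the expanded term, which is the effect the synonym lookup step is meant to achieve.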