2022 Jun 11;5(2):ooac043. doi: 10.1093/jamiaopen/ooac043

Table 1.

Relevant NLP key concepts

| NLP concept | Definition | Methodology | Biomedical or biochemical applications | MIDD-specific open-source resources |
| --- | --- | --- | --- | --- |
| Word embedding | A class of techniques in which individual words are represented as real-valued vectors, often of tens or hundreds of dimensions, in a predefined vector space. | It uses language models and feature extraction methods to map words to vectors that capture their context and meaning. Generic pre-trained models such as GloVe [19], word2vec [20], and fastText [21] have become prevalent. | In biomedical NLP, word embeddings serve as feature input to downstream ML or DL models. Textual resources such as EHRs, clinical notes, biomedical publications, Wikipedia, and news are used to train these embeddings. | BioWordVec and BioSentVec [22] |
| Named Entity Recognition (NER) | A sequence-labeling task that locates and categorizes the important nouns and proper nouns that carry key information in a sentence. | It uses one or a combination of two underlying methods: (1) a rule-based method, which uses handcrafted grammatical and syntactic rules and dictionaries to extract the named entities; (2) a machine learning (ML) or deep learning (DL) based method, which uses a feature-based representation of the observed data [23]. | It is used in the clinical domain to extract names of drugs, proteins, diseases, and genes from radiology reports, discharge summaries, problem lists, nursing documentation, medical education documents, and scientific literature. | MedLEE [24], MetaMap [25], KnowledgeMap [26], cTAKES [27], HiTEX [28], MedTagger [29], and ChemSpot [30] |
| Assertion status detection | Detection of the status of a medical assertion as "present," "absent," "conditional," or "associated with someone else." Given an entity in a medical text, it classifies the entity's asserted class from the context as being present, absent, or possible in the patient [31]. | In recent years, assertion detection models have been developed using convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and attention techniques [32]. | In bio-clinical NLP, it is primarily used for disease modeling. The meaning of clinical entities is heavily affected by assertion modifiers such as negation, uncertainty, hypothetical, and experiencer. | MITRE system [33] |
| Entity resolution | The practice of linking data records that represent the same entity in the absence of a join key. | The process comprises the following steps: (1) Blocking: categorizing entities into blocks based on their descriptions. (2) Block processing: removing redundancies within blocks. (3) Matching: matching within a block based on entity descriptions. (4) Clustering: grouping identified matches together. | In biomedical applications, it is used for record linkage, taking domain-specific knowledge into account to avoid domain-general assumptions that do not hold in this domain (eg, overlap in names of chemical compounds) [34]. | DeepER [35] and Bell et al.’s rule-based sieve architecture [34] |
| Relation extraction | The task of extracting structured information and semantic relations from natural language text between two or more entities of a certain type, such as person, organization, or location. | It uses co-occurrence, pattern matching, machine learning, deep learning, knowledge-driven methods [36], or transfer learning. | In the drug discovery and development domain, it is relevant to the extraction of drug–disease, gene–disease, drug–target, and drug–drug relationships. | BioRel [37] and DocRBERT [38] |
| Topic modeling | An unsupervised approach for finding and classifying the topics embedded within a document or a piece of text. | It is based on the idea that a document is a mixture of topics, each of which is a probability distribution over words. Methods used to implement it include term frequency-inverse document frequency, non-negative matrix factorization, Latent Dirichlet Allocation, Latent Semantic Analysis [39], attention [39], and generative adversarial networks [40]. | In the biomedical domain, topic modeling has been applied to use cases beyond documents and words, eg, to classify genomic sequences, to classify drugs according to safety and therapeutic use, and to find links between genes and diseases [41]. | Gensim, Stanford Topic Modeling Toolbox, and MALLET [42] |
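
The four entity-resolution steps listed in Table 1 (blocking, block processing, matching, clustering) can be illustrated with a minimal Python sketch. It is not the method of DeepER or of Bell et al.; the record names, the three-character blocking key, the string-similarity measure from the standard-library `difflib` module, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def resolve(records, threshold=0.8):
    """Group record ids that appear to describe the same entity.

    `records` maps an integer id to a free-text name (hypothetical schema).
    """
    # (1) Blocking: bucket records by a cheap key, here the first three
    #     characters of the normalized name (an illustrative choice).
    # (2) Block processing: identical normalized names inside a block are
    #     redundant, so they share a single entry with a list of ids.
    blocks = defaultdict(dict)
    for rec_id, name in records.items():
        norm = normalize(name)
        blocks[norm[:3]].setdefault(norm, []).append(rec_id)

    # Union-find structure used for the clustering step.
    parent = {rec_id: rec_id for rec_id in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for block in blocks.values():
        # Exact duplicates found during block processing are linked directly.
        for ids in block.values():
            for other in ids[1:]:
                union(ids[0], other)
        # (3) Matching: compare only pairs that share a block.
        for (name_a, ids_a), (name_b, ids_b) in combinations(block.items(), 2):
            if SequenceMatcher(None, name_a, name_b).ratio() >= threshold:
                union(ids_a[0], ids_b[0])

    # (4) Clustering: collect the ids that ended up under the same root.
    clusters = defaultdict(list)
    for rec_id in records:
        clusters[find(rec_id)].append(rec_id)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical toy records; names 1 and 5 share a prefix but are different drugs.
records = {
    1: "Acetaminophen",
    2: "acetaminophen ",
    3: "Ibuprofen",
    4: "Ibuprofin",
    5: "Acetazolamide",
}
print(resolve(records))
```

In this toy input the two spellings of ibuprofen are merged, while acetaminophen and acetazolamide stay separate even though they land in the same block. A purely string-based threshold like this one is fragile for chemical names, which is why the table stresses domain-specific knowledge.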
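
The rule-based NER method described in Table 1 (handcrafted rules plus dictionaries) can be sketched in a few lines of Python. This is an illustrative toy and not how MetaMap, cTAKES, or the other listed tools work; the `LEXICON` entries, the labels, and the function names are hypothetical.

```python
import re

# Hypothetical toy dictionary; real systems draw on curated vocabularies.
LEXICON = {
    "DRUG": ["warfarin", "aspirin"],
    "DISEASE": ["atrial fibrillation", "stroke"],
}


def build_pattern(terms):
    """Compile one case-insensitive, whole-word pattern for a list of terms."""
    # Longest terms first so that multi-word entries are preferred.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


PATTERNS = {label: build_pattern(terms) for label, terms in LEXICON.items()}


def tag_entities(text):
    """Return (start, end, surface form, label) tuples in order of appearance."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), match.group(0), label))
    return sorted(spans)


sentence = "Warfarin was started for atrial fibrillation."
for start, end, surface, label in tag_entities(sentence):
    print(start, end, surface, label)
```

A dictionary lookup like this gives predictable, easily audited matches but misses spelling variants and unseen terms, which is one reason the table lists ML and DL methods as an alternative or complement.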